Emotional-Aware-Reasoning
Audio-augmented emotion-aware language models with cross-speaker generalization
// DESCRIPTION
Real vs. Synthetic Speech for Emotion-Aware LLM Reasoning
This research investigates the impact of real versus synthetic speech inputs on emotion-aware reasoning in large language models. As multimodal LLMs increasingly process speech inputs, understanding how the provenance and quality of audio signals affect emotional understanding and downstream reasoning is critical for applications in mental health, customer service, and human-AI interaction.
We construct a controlled evaluation framework with paired real and TTS-generated speech samples across 6 emotion categories, testing how LLMs with speech encoders perform on emotion recognition, empathetic response generation, and emotion-conditioned reasoning tasks. Our findings reveal a significant "synthetic gap" where models trained primarily on real speech show 15-25% performance degradation on synthetic inputs, and vice versa.
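The "synthetic gap" above is a relative degradation: how much accuracy a model loses when its real-speech input is swapped for the paired TTS rendering. A minimal sketch of that computation, using purely illustrative per-emotion accuracy numbers (not the study's actual results):

```python
import numpy as np

def synthetic_gap(acc_real, acc_synth):
    """Relative performance degradation when moving from real to synthetic input."""
    return (acc_real - acc_synth) / acc_real

# Illustrative per-emotion accuracies for a model trained mostly on real speech.
# These values are hypothetical, chosen only to demonstrate the metric.
emotions  = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]
acc_real  = np.array([0.82, 0.85, 0.80, 0.78, 0.76, 0.90])
acc_synth = np.array([0.66, 0.70, 0.64, 0.60, 0.62, 0.76])

gaps = synthetic_gap(acc_real, acc_synth)
for emotion, gap in zip(emotions, gaps):
    print(f"{emotion:>8}: {gap:.1%} degradation on synthetic input")
```

Because the gap is computed per emotion category on paired samples, it isolates the effect of audio provenance from differences in utterance content.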
The study further explores whether emotional prosodic features (pitch contour, speaking rate, voice quality) are faithfully preserved in synthetic speech and how their absence or distortion affects the reasoning chain. We find that current TTS systems adequately convey basic emotions but fail on subtle affective states like sarcasm, resignation, and mixed emotions.
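Pitch contour, one of the prosodic features examined, can be approximated without any specialized toolkit via frame-wise autocorrelation. The sketch below is a simplified illustration (real prosody pipelines typically use more robust estimators such as YIN or pYIN); it recovers the rising intonation of a synthetic chirp:

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumed)

def pitch_autocorr(frame, sr=SR, fmin=60, fmax=400):
    """Estimate F0 of one frame by picking the autocorrelation peak
    inside the plausible speech-pitch lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_contour(signal, sr=SR, frame_len=1024, hop=512):
    """Crude pitch contour: one F0 estimate per analysis frame."""
    return np.array([pitch_autocorr(signal[i:i + frame_len], sr)
                     for i in range(0, len(signal) - frame_len, hop)])

# Synthetic test tone sweeping 220 Hz -> 330 Hz, a rough "question" intonation
t = np.linspace(0, 1.0, SR, endpoint=False)
f = np.linspace(220, 330, SR)
tone = np.sin(2 * np.pi * np.cumsum(f) / SR)
contour = pitch_contour(tone)  # rises from ~220 Hz to ~330 Hz
```

Comparing such contours between a real utterance and its TTS rendering is one way to quantify whether the prosody that carries sarcasm or resignation survives synthesis.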
We propose a domain adaptation technique using contrastive learning between real and synthetic speech embeddings that reduces the synthetic gap by 60%, enabling more robust emotion-aware LLM deployment regardless of input source.
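One standard way to realize such contrastive alignment is an InfoNCE-style objective that treats the real and synthetic embeddings of the same utterance as a positive pair and all other utterances in the batch as negatives. The numpy sketch below illustrates the loss only (assumed symbols: embedding dimension, temperature, and batch size are arbitrary; the paper's exact formulation may differ):

```python
import numpy as np

def info_nce(real_emb, synth_emb, temperature=0.1):
    """InfoNCE loss pulling each real/synthetic pair together while
    pushing apart embeddings of different utterances in the batch."""
    # L2-normalize so dot products are cosine similarities
    r = real_emb / np.linalg.norm(real_emb, axis=1, keepdims=True)
    s = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)
    logits = r @ s.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: real_i pairs with synth_i
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
real          = rng.normal(size=(8, 64))
synth_aligned = real + 0.05 * rng.normal(size=(8, 64))  # well-aligned pairs
synth_random  = rng.normal(size=(8, 64))                 # unrelated embeddings

# Aligned embeddings yield a much lower loss than unrelated ones,
# which is the signal the adaptation gradient descends on.
print(info_nce(real, synth_aligned), info_nce(real, synth_random))
```

Minimizing this loss over paired data drives the speech encoder toward a provenance-invariant embedding space, which is the mechanism by which the synthetic gap shrinks.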
// HIGHLIGHTS
- 15-25% performance gap identified between real and synthetic speech for emotion-aware LLM reasoning
- Controlled evaluation across 6 emotion categories with paired real/TTS samples
- Contrastive domain adaptation reduces the synthetic gap by 60%
- Analysis of prosodic feature preservation in modern TTS for subtle affective states
- Implications for mental health, customer service, and human-AI interaction applications