Improving Sarcasm Detection from Speech and Text through Attention-based Fusion Exploiting the Interplay of Emotions and Sentiments
This paper addresses the challenge of automatic sarcasm detection by leveraging the characteristic mismatch between emotional tone and literal meaning that often signals sarcastic intent. Our approach integrates multiple data streams—speech audio, transcribed text, emotion recognition, and sentiment analysis—to improve detection accuracy.
The key innovation of our work is the attention-based fusion mechanism that explicitly models the interactions between these modalities. Rather than treating each data stream independently, our system learns to identify patterns of incongruity, such as positive words delivered with negative emotional prosody, which are hallmarks of sarcasm.
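As a rough illustration of this kind of mechanism (not the paper's actual architecture, whose details are not given here), the sketch below implements generic scaled dot-product cross-attention in which text-token features act as queries over audio-frame features, and the attended audio summary is concatenated onto each text token. All names, dimensions, and the concatenation-based fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, audio_feats):
    """Hypothetical fusion step: text tokens (queries) attend over
    audio frames (keys/values), then each token is concatenated with
    its attended audio summary."""
    d_k = text_feats.shape[-1]
    # Attention scores: (num_text_tokens, num_audio_frames)
    scores = text_feats @ audio_feats.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Audio context aligned to each text token: (num_text_tokens, d)
    attended_audio = weights @ audio_feats
    fused = np.concatenate([text_feats, attended_audio], axis=-1)
    return fused, weights

# Toy example with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))   # 5 text tokens, 16-dim embeddings
audio = rng.normal(size=(8, 16))  # 8 audio frames, 16-dim embeddings
fused, w = cross_modal_attention(text, audio)
```

In a full model, the attention weights would let the network align, say, a positive word with the prosodic frames it was spoken over, making tonal incongruity directly visible to the classifier.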
Our experiments demonstrate that this multimodal approach substantially outperforms unimodal baselines in precision, recall, and F1 score. The research has important applications for making conversational AI systems more socially aware and for improving accessibility technology for individuals with sensory processing challenges or neurodevelopmental conditions that affect sarcasm perception.