Improving Sarcasm Detection from Speech and Text through Attention-based Fusion Exploiting the Interplay of Emotions and Sentiments

May 13, 2024 · Xiyuan Gao, Shekhar Nayak, Matt Coler
Abstract
Sarcasm detection presents unique challenges in speech technology, particularly for individuals with disorders that affect pitch perception or those lacking contextual auditory cues. While previous research has established the significance of integrating textual, audio and visual data in sarcasm detection, these studies overlook the interactions between modalities. We propose an approach that synergizes audio, textual, sentiment and emotion data to enhance sarcasm detection. This involves augmenting sarcastic audio with corresponding text using Automatic Speech Recognition (ASR), supplemented with information based on emotion recognition and sentiment analysis. Our methodology leverages the strengths of each modality: emotion recognition algorithms analyze the audio data for affective cues, while sentiment analysis processes the text generated from ASR. The integration of these modalities aims to capture the nuanced nature of sarcasm, where emotional tone may contradict the literal meaning of the words spoken.
Publication
Proceedings of Meetings on Acoustics, 54(1), 060002

This paper addresses the challenge of automatic sarcasm detection by leveraging the characteristic mismatch between emotional tone and literal meaning that often signals sarcastic intent. Our approach integrates multiple data streams—speech audio, transcribed text, emotion recognition, and sentiment analysis—to improve detection accuracy.
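The valence mismatch described above can be illustrated with a minimal, hypothetical sketch. The function and label names below are assumptions for illustration only and do not come from the paper; in the actual system, the sentiment label would come from sentiment analysis over the ASR transcript and the emotion label from an emotion recognizer over the audio.

```python
# Hypothetical labels stand in for real sentiment/emotion model outputs.
def incongruity_cue(sentiment_label: str, emotion_label: str) -> bool:
    """Flag the cross-modal valence mismatch that often signals sarcasm:
    positive text sentiment delivered with negative vocal affect."""
    positive_sentiment = {"positive"}
    negative_affect = {"anger", "disgust", "contempt", "sadness"}
    return (sentiment_label in positive_sentiment
            and emotion_label in negative_affect)

print(incongruity_cue("positive", "contempt"))  # True: "great job" said with contempt
print(incongruity_cue("positive", "joy"))       # False: plausibly genuine praise
```

A learned model replaces this hard rule in practice, but the sketch captures the cue the fusion mechanism is designed to exploit.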

The key innovation of our work is the attention-based fusion mechanism that explicitly models the interactions between these modalities. Rather than treating each data stream independently, our system learns to identify patterns of incongruity, such as positive words delivered with negative emotional prosody, which are hallmarks of sarcasm.
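Cross-modal attention of this kind can be sketched in a few lines. This is not the paper's implementation; it is a minimal NumPy illustration of scaled dot-product attention in which text-token embeddings (queries) attend over audio-frame embeddings (keys/values), so each token picks up prosodic context. All shapes and dimensions are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_emb, audio_emb):
    """Text tokens (queries) attend over audio frames (keys/values),
    yielding prosody-informed token representations."""
    d = text_emb.shape[-1]
    scores = text_emb @ audio_emb.T / np.sqrt(d)   # (n_tokens, n_frames)
    weights = softmax(scores, axis=-1)             # rows sum to 1
    return weights @ audio_emb                     # (n_tokens, d)

# Toy shapes: 5 ASR tokens, 20 audio frames, 64-dim embeddings.
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))
audio = rng.standard_normal((20, 64))
fused = cross_modal_attention(text, audio)
print(fused.shape)  # (5, 64)
```

In a full model these fused representations would be combined with the sentiment and emotion streams and learned query/key/value projections before classification.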

Our experiments demonstrate that this multimodal approach significantly outperforms unimodal baselines, achieving substantial improvements in precision, recall, and F1 scores. The research has important applications for making conversational AI systems more socially aware and for improving accessibility technology for individuals with sensory processing challenges or neurodevelopmental conditions that affect sarcasm perception.