Deep CNN-based Inductive Transfer Learning for Sarcasm Detection in Speech

Sep 18, 2022 · Gao, X., Nayak, S., Matt Coler
Abstract
Sarcasm is a frequently used linguistic device which is expressed in a multitude of ways, both with acoustic cues (including pitch, intonation, intensity, etc.) and visual cues (including facial expression, eye gaze, etc.). While cues used in the expression of sarcasm are well-described in the literature, there is a striking paucity of attempts to perform automatic sarcasm detection in speech. To explore this gap, we elaborate a methodology of implementing Inductive Transfer Learning (ITL) based on pre-trained Deep Convolutional Neural Networks (DCNNs) to detect sarcasm in speech. To that end, the multimodal dataset MUStARD is used as a target dataset in this study. The two selected pre-trained DCNN models used are Xception and VGGish, which we trained on visual and audio datasets. Results show that VGGish, which is applied as a feature extractor in the experiment, performs better than Xception, which has its convolutional layers and pooling layers retrained. Both models achieve a higher F-score compared to the baseline Support Vector Machines (SVM) model by 7% and 5% in unimodal sarcasm detection in speech.
Publication
Proceedings of the 23rd Annual Conference of the International Speech Communication Association (Interspeech 2022)

This paper addresses the challenge of automated sarcasm detection in speech, an area that has received relatively little attention in speech technology research despite the importance of sarcasm in human communication.

We implement inductive transfer learning based on pre-trained deep convolutional neural networks (DCNNs) to identify sarcastic speech patterns. Using the multimodal MUStARD dataset, we compare two pre-trained DCNN models: Xception and VGGish. Our findings indicate that VGGish, when used as a feature extractor, outperforms Xception with retrained convolutional and pooling layers. Both approaches achieved notable improvements over the baseline Support Vector Machine (SVM) model, with F-score increases of 7% and 5%, respectively.
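To make the feature-extractor setting concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the "pre-trained" network is a hypothetical stand-in (a fixed random projection), the data are synthetic rather than MUStARD, and the names `extract_features`, `train_head`, `predict` and `f_score` are illustrative. It shows only the pattern itself: the extractor's weights stay frozen, a small classifier is trained on its outputs, and the result is scored with the F-score.

```python
import math
import random


def extract_features(signal, frozen_weights):
    """Hypothetical stand-in for a pre-trained network used as a feature extractor.

    The weights are only read here, never updated.
    """
    return [math.tanh(sum(w * x for w, x in zip(row, signal)))
            for row in frozen_weights]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(features, labels, epochs=50, lr=0.1):
    """Train a logistic-regression head with SGD on top of the frozen features."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w, b, f):
    """Return 1 (positive class) when the logit is positive, else 0."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


def f_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic toy data (an assumption for illustration, not MUStARD).
rng = random.Random(0)
frozen = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]
snapshot = [row[:] for row in frozen]  # used to check the extractor stays frozen

signals = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
labels = [1 if sum(s[:4]) > sum(s[4:]) else 0 for s in signals]
features = [extract_features(s, frozen) for s in signals]

# Train only the head on the first 150 items; evaluate on the remaining 50.
w, b = train_head(features[:150], labels[:150])
preds = [predict(w, b, f) for f in features[150:]]
score = f_score(labels[150:], preds)
print(f"held-out F-score: {score:.2f}")
```

In the retraining setting described for Xception, the network's own convolutional and pooling weights would also be updated on the target data instead of being left fixed as `frozen` is here.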

This work represents an important step toward more nuanced speech recognition systems that can detect subtle linguistic features like sarcasm.