SarcasticSpeech: Speech Synthesis for Sarcasm in Low-Resource Scenarios

Aug 1, 2023

Zhu Li

Xiyuan Gao

Shekhar Nayak

Matt Coler

Abstract

Sarcastic speech synthesis, the ability to generate speech that conveys sarcasm, can have several significant implications in various contexts, such as entertainment and better human-computer interaction. This study presents a first attempt to apply transfer learning techniques from a diverse speech style dataset to the challenging domain of sarcastic speech synthesis. The limited availability of specific sarcastic speech data poses significant challenges in capturing the expressive nature of sarcasm. By leveraging transfer learning, a pre-trained model is fine-tuned using a dataset encompassing various speech styles, including sarcastic speech. The synthesized sound contains some robotic elements, indicating moderate performance improvements in sarcastic speech synthesis through transfer learning. Future work will explore the application of multi-modal approaches to improve sarcastic speech synthesis and further enhance the expressiveness and naturalness of generated sarcastic speech.

Type

Conference Paper

Publication

In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW2023)

This paper tackles the challenge of synthesizing sarcastic speech, a vital yet underexplored component of expressive speech synthesis. Sarcasm, characterized by a mismatch between literal meaning and intended message, relies heavily on prosodic cues that are difficult to model, especially when training data are scarce.

Our research represents one of the first attempts to apply transfer learning to sarcastic speech synthesis. By leveraging a pre-trained model and fine-tuning it with a dataset that includes various speech styles alongside sarcastic samples, we demonstrate that it’s possible to generate speech with some sarcastic qualities despite data constraints.

The study identifies several key challenges in this domain:

The scarcity of dedicated sarcastic speech datasets
The complex, context-dependent nature of sarcastic prosody
The need to balance expressiveness with naturalness in synthesized speech

While our current results show moderate success, with some robotic-sounding artifacts in the synthesized output, this work establishes an important baseline and direction for future research. We propose that multimodal approaches incorporating textual, acoustic, and potentially visual cues might lead to more convincing sarcastic speech synthesis in future iterations.

This research has applications in enhancing human-computer interaction, creating more engaging virtual assistants, and developing tools for entertainment and educational contexts where expressive speech is valuable.

Last updated on Aug 1, 2023