Speech Synthesis Basics: A Hands-on Introduction

May 16, 2025 · 4 min read

Speech synthesis has evolved dramatically since its early mechanical implementations. This tutorial explores concatenative synthesis, a foundational technique in which pre-recorded speech segments are stitched together to form new utterances. While modern systems use neural networks, understanding concatenation provides insight into the core challenges of speech technology.

Prerequisites

  • Basic computer skills
  • Google account (for Colab)
  • No programming experience required!

Getting Started

  1. Open Google Colab
  2. Create a new notebook
  3. Record these sounds using any audio recording app:
    • ‘k’ sound (as in “cat”)
    • ‘a’ sound (as in “cat”)
    • ‘t’ sound (as in “cat”)
    • ‘ka’ syllable
    • ‘at’ syllable

Keep recordings short (1-2 seconds) and save as .mp3, .wav, or .m4a.

Project Walkthrough

Step 1: Setup

We’ll start by installing the necessary libraries and importing them:

# Install required libraries
!pip install pydub
from google.colab import files
import IPython.display as ipd
from pydub import AudioSegment
import numpy as np
import matplotlib.pyplot as plt
import io

print("Let's create synthetic speech by combining sounds!")

Step 2: Upload Individual Sounds

Next, we’ll upload our sound files:

# Upload sound files
print("Upload your 'k' sound file")
k_file = files.upload()
k_filename = list(k_file.keys())[0]

print("Upload your 'a' sound file")
a_file = files.upload()
a_filename = list(a_file.keys())[0]

print("Upload your 't' sound file")
t_file = files.upload()
t_filename = list(t_file.keys())[0]

# Load into AudioSegment objects
k_sound = AudioSegment.from_file(k_filename)
a_sound = AudioSegment.from_file(a_filename)
t_sound = AudioSegment.from_file(t_filename)

Step 3: Visualize Sounds

We can visualize each sound’s waveform to better understand their acoustic properties:

# Function to plot a waveform with a time axis in seconds
def plot_waveform(audio_segment, title):
    samples = np.array(audio_segment.get_array_of_samples())
    # samples are interleaved for stereo, so account for channel count
    samples_per_second = audio_segment.frame_rate * audio_segment.channels
    times = np.arange(len(samples)) / samples_per_second
    plt.figure(figsize=(10, 2))
    plt.plot(times, samples)
    plt.title(title)
    plt.ylabel("Amplitude")
    plt.xlabel("Time (s)")
    plt.show()

# Show waveforms
plot_waveform(k_sound, "Waveform of 'k'")
plot_waveform(a_sound, "Waveform of 'a'")
plot_waveform(t_sound, "Waveform of 't'")

This will create visualizations like:

XXX

Step 4: Basic Concatenation

Now, let’s combine our sounds:

# Simple concatenation
cat = k_sound + a_sound + t_sound

# Play the result (the exported WAV header carries the sample rate,
# so there's no need to pass one explicitly)
print("Basic 'cat' concatenation:")
ipd.display(ipd.Audio(data=cat.export(format="wav").read()))

Step 5: Adding Pauses

Let’s experiment with adding pauses:

# Create pauses of different lengths
short_pause = AudioSegment.silent(duration=100)  # 100ms
medium_pause = AudioSegment.silent(duration=300)  # 300ms
long_pause = AudioSegment.silent(duration=800)  # 800ms

# Add pauses between sounds
cat_with_short_pauses = k_sound + short_pause + a_sound + short_pause + t_sound
cat_with_long_pauses = k_sound + long_pause + a_sound + long_pause + t_sound

# Play results
print("'cat' with short pauses:")
ipd.display(ipd.Audio(data=cat_with_short_pauses.export(format="wav").read()))

print("'cat' with long pauses:")
ipd.display(ipd.Audio(data=cat_with_long_pauses.export(format="wav").read()))

Step 6: Adjusting Speed

We can also alter the playback speed. Note that this simple trick shifts the pitch along with the speed:

# Function to modify speech rate by reinterpreting the same samples
# at a different frame rate (this shifts pitch along with speed)
def change_speed(audio, speed_factor):
    new_rate = int(audio.frame_rate * speed_factor)
    return audio._spawn(audio.raw_data, overrides={"frame_rate": new_rate})

# Create sounds with varied speeds
fast_cat = change_speed(cat, 1.5)  # Faster
slow_cat = change_speed(cat, 0.7)  # Slower

# Play results
print("Fast 'cat':")
ipd.display(ipd.Audio(data=fast_cat.export(format="wav").read()))

print("Slow 'cat':")
ipd.display(ipd.Audio(data=slow_cat.export(format="wav").read()))

Step 7: Improving with Syllables

For more natural-sounding results, we’ll try using syllables instead of phonemes:

# Upload syllable recordings
print("Upload recording of 'ka' syllable")
ka_file = files.upload()
ka_filename = list(ka_file.keys())[0]

print("Upload recording of 'at' syllable")
at_file = files.upload()
at_filename = list(at_file.keys())[0]

# Load syllables
ka_sound = AudioSegment.from_file(ka_filename)
at_sound = AudioSegment.from_file(at_filename)

# Create with crossfade
cat_syllable = ka_sound.append(at_sound, crossfade=80)

print("'cat' from syllables with crossfade:")
ipd.display(ipd.Audio(data=cat_syllable.export(format="wav").read()))

Why Pure Concatenation Sounds Unnatural

If you’ve followed along with the tutorial, you’ve likely noticed that our concatenated speech sounds robotic. Here’s why:

  • Co-articulation - In natural speech, sounds influence each other. The ‘k’ in “cat” is different from the ‘k’ in “kit” due to the following vowel.
  • Missing transitions - Natural speech has smooth transitions between sounds that get lost when concatenating isolated phonemes.
  • Prosody issues - Natural speech has rhythm, stress, and intonation patterns that simple concatenation ignores.
  • Context sensitivity - The same phoneme varies based on surrounding sounds and position in words.

Modern Solutions

While our simple concatenation experiment produces robotic speech, modern systems use more sophisticated approaches:

  • Diphone Synthesis - Using sound-to-sound transitions instead of isolated phonemes
  • Unit Selection - Choosing optimal sound segments from large databases
  • Statistical Parametric Synthesis - Modeling speech parameters
  • Neural TTS - Using deep learning to generate natural speech
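Unit selection, in particular, can be understood as a shortest-path problem: each target phone has many candidate recordings, and the system picks the sequence minimizing a target cost (how well a unit fits the phone) plus a join cost (how smoothly adjacent units connect). A toy sketch of that search; all names and costs here are made up for illustration, not taken from any real system:

```python
def select_units(candidates, target_cost, join_cost):
    """Viterbi-style search: candidates is a list of unit lists, one per position."""
    best = {u: target_cost(0, u) for u in candidates[0]}
    back = [{} for _ in candidates]
    for i in range(1, len(candidates)):
        new_best = {}
        for u in candidates[i]:
            # cheapest predecessor for unit u at position i
            prev, cost = min(
                ((p, best[p] + join_cost(p, u)) for p in candidates[i - 1]),
                key=lambda pair: pair[1],
            )
            new_best[u] = cost + target_cost(i, u)
            back[i][u] = prev
        best = new_best
    # backtrack the cheapest path from the best final unit
    u = min(best, key=best.get)
    path = [u]
    for i in range(len(candidates) - 1, 0, -1):
        u = back[i][u]
        path.append(u)
    return path[::-1]
```

Real systems search databases with thousands of candidates per phone, with costs derived from acoustic features rather than hand-picked numbers.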

Try it yourself

You can access the complete notebook for this tutorial here: XXX