Benchmarking Text-to-Speech Models

INTRODUCTION

The Voder being demonstrated at the New York World's Fair

In the early 2000s, Statistical Parametric Synthesis (SPS) emerged as a new approach to speech synthesis. It aimed to address the limitations of earlier techniques, such as unit selection synthesis, by reducing the need for large, high-quality databases of recorded speech. Instead of stitching together stored recordings, SPS generates speech from statistical models of the key acoustic parameters that give human speech its natural character, such as pitch, duration, and tonal qualities. These models, built on frameworks like Hidden Markov Models (HMMs), are trained on comparatively small datasets to estimate the likelihood of specific sounds and of the transitions between them. Because the system predicts acoustic parameters rather than retrieving pre-recorded phrases or words, it is far less dependent on the large, curated databases that were costly to produce, making synthesis more efficient, flexible, and cost-effective.

One specific implementation of SPS introduced in the early 2000s was Hidden Markov Model (HMM)-based speech synthesis. An HMM-based system maps text to acoustic parameters such as frequency, amplitude, and duration using statistical estimates, and then generates speech from those parameters. The approach gained popularity because it required far less storage than unit selection systems and offered greater adaptability to different speaking styles or voices by adjusting model parameters within the same underlying architecture. An HMM models speech as a sequence of states, where each state corresponds to a specific linguistic unit (e.g., a phoneme or diphone). The model estimates the transition probabilities between these states, as well as the probability distributions of acoustic features (e.g., pitch, duration, and energy) associated with each state. This allows the system to generate smooth transitions between sounds while remaining efficient in terms of storage and computation.
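
To make this concrete, the short Python sketch below walks a toy HMM of this kind: it samples a sequence of phoneme-like states from a transition matrix and draws pitch, duration, and energy values from each state's output distribution. The states, probabilities, and Gaussian parameters are illustrative placeholders, not values from any real synthesis system.

```python
import numpy as np

# Toy HMM over three phoneme-like states; all numbers are illustrative.
states = ["sil", "a", "t"]

# a: state transition probabilities (each row sums to 1)
A = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
])

# b: output distributions, here one Gaussian per state over
# (pitch in Hz, duration in ms, energy), mirroring the acoustic
# parameters described above.
means = np.array([
    [0.0,   80.0, 0.01],   # silence: no pitch, low energy
    [180.0, 120.0, 0.80],  # vowel-like state
    [150.0,  60.0, 0.40],  # consonant-like state
])
stds = np.array([
    [1.0,  20.0, 0.01],
    [15.0, 30.0, 0.10],
    [10.0, 15.0, 0.05],
])

rng = np.random.default_rng(0)

def sample(num_steps: int, start_state: int = 0):
    """Walk the Markov chain and emit one acoustic frame per state visit."""
    s = start_state
    frames = []
    for _ in range(num_steps):
        frames.append(rng.normal(means[s], stds[s]))  # draw from state's output distribution
        s = rng.choice(len(states), p=A[s])           # transition to the next state
    return np.array(frames)

print(sample(5))  # rows: (pitch, duration, energy) for each visited state
```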

Tacotron 2 Audio Samples (taken from Tacotron2 GitHub)

Generative AI has rapidly emerged as a transformative force across numerous industries, reshaping how humans interact with technology. The journey of speech synthesis spans many decades, with the earliest breakthrough coming in 1939, when Bell Telephone Laboratories demonstrated the Voder (Voice Operating Demonstrator), the first electronic speech synthesizer. The Voder synthesized speech by mimicking the characteristics of the human vocal tract, using a wrist bar that created a buzz sound (a driving signal produced by a relaxation oscillator), ten keys that altered the gains of band-pass filters, and a foot pedal that controlled pitch. To generate speech, operators would 'play' the Voder like an instrument, synchronizing the bar, keys, and pedal to sound out words. However, only trained technicians could produce sounds carrying the emotion and tone crucial to human speech, and the Voder was limited in the range of sounds it could generate.

Building on these early developments, concatenative speech synthesis (CSS) was introduced in the 1980s, offering significant advancements over the Voder. Whereas the Voder had to be manually controlled by an operator, CSS synthesized speech automatically by selecting and combining pre-recorded segments from a corpus of recorded speech, producing more natural-sounding speech with greater efficiency and accuracy. The process extracts small units, such as phonemes, syllables, or whole words, from the corpus and stitches them together to form fluid, coherent speech. Among the various techniques used to select and combine these segments, diphone synthesis and unit selection synthesis are two of the most prominent methods that have shaped the development of this technology.

Diphone synthesis, one of the earliest techniques, focuses on using diphones—the transitions between two adjacent phonemes. In this method, speech is divided into pairs of phonemes that capture the key transitions necessary for natural-sounding speech. By joining these diphones together, the system can create speech that maintains clear and intelligible sound transitions. The primary advantage of diphone synthesis is its ability to ensure smooth phonetic transitions between adjacent sounds, which is crucial for intelligibility. However, the method’s reliance on these small speech units can result in less natural-sounding speech due to the limited number of recorded segments and the mechanical nature of the concatenation process.
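
As a rough illustration of the concatenation step, the sketch below joins pre-recorded diphone waveforms with a short linear crossfade at each boundary. It assumes the diphone recordings are already available as NumPy arrays (stand-in noise buffers are used here), and the crossfade is only one simple way of smoothing a join; real systems also match pitch and energy at unit boundaries.

```python
import numpy as np

def concatenate_diphones(diphones, sr=16000, fade_ms=10):
    """Join pre-recorded diphone waveforms with a short linear crossfade.

    diphones: list of 1-D float arrays, one per diphone, in utterance order.
    Assumes each unit is longer than the crossfade window.
    """
    fade = int(sr * fade_ms / 1000)
    out = diphones[0].astype(float)
    for seg in diphones[1:]:
        seg = seg.astype(float)
        ramp = np.linspace(0.0, 1.0, fade)
        # Overlap the tail of the output with the head of the next unit.
        out[-fade:] = out[-fade:] * (1 - ramp) + seg[:fade] * ramp
        out = np.concatenate([out, seg[fade:]])
    return out

# Usage with stand-in noise buffers as placeholders for real diphone recordings:
rng = np.random.default_rng(0)
fake_units = [rng.standard_normal(4000) * 0.1 for _ in range(3)]
speech = concatenate_diphones(fake_units)
```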

A notable example of this diphone-based synthesis is MITalk, which was developed in the late 1970s as the first significant diphone-based text-to-speech synthesizer. Unlike earlier systems, MITalk was designed to automatically process and synthesize speech from text, making it one of the earliest attempts to create a more natural-sounding voice. The system operated through five core modules, each playing a specific role in the overall synthesis process. First, the FORMAT module preprocessed the input text, segmenting it into manageable parts and marking sentence boundaries, stress indicators, and punctuation to ensure accurate interpretation in later stages. The DECOMP module then broke words down into smaller units called morphemes using a recursive matching algorithm against a 12,000-entry dictionary. Next, the PARSER module identified key grammatical structures, such as noun and verb phrases, and calculated rhythmic features like timing, duration, and pauses to enhance the naturalness of the generated speech. Words that DECOMP could not analyze were passed to the SOUND2 module, where approximately 400 letter-to-sound rules determined the pronunciations of unknown words. Finally, the PHONET module fine-tuned each sound's phonetic details, adjusting loudness, pitch, and duration to create realistic intonation from the output of the previous modules. Through this process, MITalk produced a synthesized voice that represented major progress in text-to-speech technology and marked an important milestone in speech synthesis development.
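
The pipeline below is a purely hypothetical, heavily simplified sketch of the module flow described above: normalize the text, look words up in a lexicon, and fall back to letter-to-sound rules for unknown words. The function names, the toy one-letter-per-phone rule, and the tiny lexicon are illustrative stand-ins, not MITalk's actual implementation; PARSER and PHONET (timing and pitch refinement) are omitted.

```python
# Hypothetical stand-ins mirroring the module flow described above.

def format_text(text):
    """FORMAT stand-in: split the input into sentence-like chunks."""
    cleaned = text.replace("?", ".").replace("!", ".")
    return [s.strip() for s in cleaned.split(".") if s.strip()]

def decompose(word, lexicon):
    """DECOMP stand-in: look the word up in a small pronunciation lexicon."""
    return lexicon.get(word.lower().strip(",;:"))  # None if unknown

def letter_to_sound(word):
    """SOUND2 stand-in: toy one-letter-per-phone fallback for unknown words."""
    return [f"/{c}/" for c in word.lower() if c.isalpha()]

def to_phones(text, lexicon):
    """Chain the stand-in modules into a phone sequence with sentence pauses."""
    phones = []
    for sentence in format_text(text):
        for word in sentence.split():
            phones.extend(decompose(word, lexicon) or letter_to_sound(word))
        phones.append("<pause>")
    return phones

print(to_phones("He reads books.", {"he": ["HH", "IY1"], "books": ["B", "UH1", "K", "S"]}))
```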

  • “He has read the whole thing.”

  • “He reads books.”

Tacotron or Human?

MITalk’s ATN diagram for noun groups

However, despite its contributions, diphone synthesis had clear limitations in the quality and naturalness of its output. Its reliance on a fixed inventory of pre-recorded diphone segments often produced robotic, mechanical-sounding voices. These limitations motivated researchers to explore new techniques such as unit selection synthesis. Like diphone synthesis, unit selection builds on the idea of concatenating recorded speech, but it works with much larger units, often entire words or phrases, drawn from a vast database of recordings that may contain hours of material, which enables more natural-sounding and fluid speech.

To select the most appropriate speech units, the system relies on two key cost functions: the target cost and the concatenation cost. The target cost measures how closely a candidate unit from the database matches the desired speech sound or sequence, comparing attributes such as pitch, duration, and timbre; a lower target cost indicates a better match for the intended context. The concatenation cost evaluates how well consecutive units fit together, checking that adjacent units blend without audible gaps, mismatches, or sudden changes in pitch, timing, or other speech qualities; a low concatenation cost indicates that the units can be joined smoothly. By balancing these two cost functions, the system selects units that both match the target characteristics and combine into coherent, natural-sounding speech.
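
The selection step is commonly framed as a search for the unit sequence that minimizes the weighted sum of target and concatenation costs, typically solved with dynamic programming. The sketch below illustrates that idea; the cost functions, weights, and toy candidate units in the usage example are made up for demonstration rather than drawn from any particular system.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost, w_t=1.0, w_c=1.0):
    """Dynamic-programming (Viterbi-style) search over candidate units.

    candidates[i] lists the database units that could realize target i;
    target_cost(unit, i) and concat_cost(prev_unit, unit) are supplied by
    the caller. Returns the unit sequence minimizing the weighted sum of
    target and concatenation costs.
    """
    n = len(candidates)
    # best[i][j]: lowest total cost ending with candidate j at position i
    best = [[w_t * target_cost(u, 0) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k]
                     + w_c * concat_cost(candidates[i - 1][k], u)
                     + w_t * target_cost(u, i)
                     for k in range(len(candidates[i - 1]))]
            k_best = int(np.argmin(costs))
            row.append(costs[k_best])
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back the cheapest path.
    j = int(np.argmin(best[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Toy usage: units are (pitch_hz, duration_ms) tuples; targets are desired pitches.
targets = [120.0, 150.0, 110.0]
cands = [[(118.0, 80), (130.0, 70)], [(149.0, 90), (160.0, 60)], [(112.0, 75)]]
tc = lambda u, i: abs(u[0] - targets[i])       # target cost: pitch mismatch
cc = lambda a, b: abs(a[0] - b[0]) * 0.1       # concatenation cost: pitch jump at the join
print(select_units(cands, tc, cc))
```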

While unit selection synthesis produced more natural-sounding speech than earlier methods like diphone synthesis, it presented several challenges. A major disadvantage was the need for a large database of recorded speech. Because the technique relied on selecting full words or phrases from a large collection of recordings, the database often grew quite large, sometimes containing hours of material, placing significant demands on both storage and processing power. In addition, the quality of the synthesized speech depended heavily on the quality of the recorded units: inconsistencies, noise, or non-ideal recording conditions could degrade the final output. Because the system could only produce speech as good as the material available, high-quality synthesis was difficult to achieve when the data was limited or poorly recorded. These challenges in resource requirements and data quality led researchers to seek new approaches to speech synthesis.

  • “She earned a doctorate in sociology at Columbia University.”

METHODOLOGY

  1. Dataset Description

    The dataset used in this study is a publicly available collection of 13,100 short audio clips featuring a single speaker reading passages from seven non-fiction books. Each clip includes a corresponding transcription and varies in length from 1 to 10 seconds, yielding approximately 24 hours of recorded audio. The texts, originally published between 1884 and 1964, are in the public domain, and the audio recordings, created in 2016–2017 as part of the LibriVox project, are also publicly available. To systematically evaluate TTS model performance, a subset of this dataset was curated. The subset includes sentences of different lengths and styles designed to probe different aspects of speech synthesis: short, concise phrases that assess basic articulation and clarity, and longer, more complex sentences that test syntactic processing and contextual understanding. It also spans a variety of categories, such as emotional language that calls for expressive delivery, technical content that requires precision and clarity, and conversational tones that emphasize naturalness. By treating all sentences as part of a single unified dataset, a balanced evaluation is ensured, and performance metrics are averaged across all categories.
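
    As an illustration of how such a corpus can be prepared for evaluation, the sketch below loads clip IDs and transcripts and draws a small subset balanced between short and long sentences. The metadata filename, the pipe-separated id|transcript layout, and the length thresholds are assumptions made for the example, not details specified by the dataset description above.

```python
import random

def load_metadata(path="metadata.csv"):
    """Read clip IDs and transcripts.

    Assumes one line per clip in the form: id|raw transcript|normalized transcript
    (a common layout for single-speaker corpora); adjust to the actual file format.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) >= 2:
                rows.append({"id": parts[0], "text": parts[-1]})
    return rows

def curate_subset(rows, n_short=20, n_long=20, seed=0):
    """Draw a balanced evaluation subset: short phrases plus longer sentences."""
    rng = random.Random(seed)
    short = [r for r in rows if len(r["text"].split()) <= 8]    # threshold is arbitrary
    long_ = [r for r in rows if len(r["text"].split()) >= 20]   # threshold is arbitrary
    return (rng.sample(short, min(n_short, len(short)))
            + rng.sample(long_, min(n_long, len(long_))))
```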

  2. Models

    Following this data acquisition, a thorough examination of Text-to-Speech (TTS) model variables was conducted to assess each model's characteristics and performance capabilities. The independent variable in this study is the TTS model architecture, that is, the type of model used to convert text into speech. These models synthesize speech using different techniques, including spectral modeling, neural networks, and transformer-based approaches, and each employs its own set of algorithms with varying strengths and weaknesses. The table below presents an overview of each TTS model, including its name and architectural features:

    Summary of TTS models selected for data collection and performance evaluation

  3. Performance Metrics and Evaluation Criteria

    1. Accuracy: Accuracy in TTS systems measures how faithfully the synthesized speech represents the intended textual content. It reflects the model's ability to generate correct words, phrases, and pronunciations, ensuring the output matches the input text without errors or omissions. In this study, accuracy is evaluated using two primary metrics: Word Error Rate (WER) and Character Error Rate (CER). WER assesses the proportion of incorrect, missing, or extra words in the synthesized speech transcription compared to the reference text, making it a key indicator of semantic fidelity. CER measures errors at the character level, offering finer granularity when analyzing complex sentence structures. By applying these metrics across a curated dataset of varying linguistic styles, this study examines the TTS models' ability to maintain accuracy and coherence across both simple phrases and complex sentences, ensuring reliable and precise speech synthesis in diverse contexts.

    2. Audio Quality: Audio quality in TTS systems assesses how natural, smooth, and pleasant the synthesized speech sounds to human listeners. It evaluates the tonal balance, richness, and absence of unwanted artifacts, which are critical for creating speech that feels authentic and engaging. In this study, Spectral Flatness is used as a key metric to measure the tonal quality of the audio. Spectral flatness quantifies how noise-like or tone-like the sound is by analyzing the distribution of energy across the frequency spectrum; lower values indicate more natural, harmonic audio, while higher values suggest noisier or less natural synthesis. This study focuses on the TTS models' ability to deliver high-quality audio across emotional, technical, and conversational contexts.

    3. Clarity: Clarity in TTS systems evaluates how easily the synthesized speech can be understood in contexts with complex and technical content. Clear speech is characterized by accurate pronunciation, minimal distortion, and a balanced signal-to-noise ratio. In this study, clarity is assessed using metrics such as Harmonics-to-Noise Ratio (HNR) and Signal-to-Noise Ratio (SNR). HNR measures the proportion of harmonic sound components relative to noise, indicating the smoothness and quality of the voice. SNR quantifies the balance between the desired speech signal and background noise, ensuring intelligibility. These metrics are particularly relevant for testing TTS models in challenging scenarios, such as technical narrations or question-based sentences, where precision and articulation are critical. High clarity ensures that the synthesized speech remains intelligible and reliable across diverse use cases.

    4. Expressiveness: Expressiveness in TTS systems measures the ability to convey emotions, intonation, and natural prosody, making the synthesized speech feel engaging and human-like. It reflects how well the model can adjust pitch, rhythm, and emphasis to suit various linguistic contexts, such as emotional or conversational tones. In this study, Pitch is used as a key metric to evaluate expressiveness, as it captures variations in intonation and stress that contribute to the naturalness of speech. By analyzing pitch dynamics across emotional sentences, exclamatory phrases, and conversational passages, the study assesses how effectively each TTS model can deliver expressive and contextually appropriate speech.

    5. Efficiency: Efficiency in Text-to-Speech (TTS) systems refers to the speed and computational resources required to synthesize speech. It is a critical factor for applications that demand real-time or large-scale processing, such as virtual assistants and audiobook generation. In this study, Time to Run is the primary metric for evaluating efficiency, measuring the time taken by each model to process and generate audio for varying sentence lengths. Root Mean Square (RMS) is also analyzed to assess the energy distribution of the audio signals, which can impact computational load during synthesis.
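
The sketch below shows one way several of the metrics above might be computed in Python. It assumes the jiwer and librosa libraries are available and that the synthesized audio has already been transcribed back to text by an ASR system, since WER and CER compare that transcription with the reference text. HNR and SNR are omitted here because they are usually computed with dedicated tooling such as Praat.

```python
import time
import jiwer      # pip install jiwer
import librosa    # pip install librosa
import numpy as np

def accuracy_metrics(reference_text, transcribed_text):
    """WER and CER between the input text and an ASR transcription of the
    synthesized audio (the ASR step is assumed to happen elsewhere)."""
    return {
        "wer": jiwer.wer(reference_text, transcribed_text),
        "cer": jiwer.cer(reference_text, transcribed_text),
    }

def audio_metrics(wav_path):
    """Spectral flatness, pitch statistics, and RMS for one synthesized clip."""
    y, sr = librosa.load(wav_path, sr=None)
    flatness = float(np.mean(librosa.feature.spectral_flatness(y=y)))
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]                      # keep voiced frames only
    rms = float(np.mean(librosa.feature.rms(y=y)))
    return {
        "spectral_flatness": flatness,
        "pitch_mean_hz": float(np.mean(f0)) if f0.size else 0.0,
        "pitch_std_hz": float(np.std(f0)) if f0.size else 0.0,
        "rms": rms,
    }

def timed_synthesis(synthesize_fn, text):
    """Wrap any model's text-to-audio call to record time-to-run.
    `synthesize_fn` is a placeholder for whichever TTS API is being benchmarked."""
    start = time.perf_counter()
    audio = synthesize_fn(text)
    return audio, time.perf_counter() - start
```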

Probabilistic parameters of a hidden Markov model (example)

X — states

y — possible observations

a — state transition probabilities

b — output probabilities

However, although HMMs offered a more flexible and efficient way to model speech, the output still often lacked the natural flow of human speech. This was due to the model's reliance on simplified parameters, which could not capture the subtle variations and complexities of real human speech; HMM synthesis often produced muffled-sounding audio compared to human voices. In the early 2010s, advances in neural network-based TTS allowed synthesis models to approach the quality of professionally recorded audio. Models like WaveNet are trained on large amounts of audio and generate output sample by sample, learning how the waveform of human speech changes over time. Trained on audio alone, however, WaveNet produces output with realistic qualities of a human voice that does not form the words of a human language. The more recent Tacotron 2 addresses this by pairing a text-to-spectrogram network with a modified version of WaveNet: it converts textual input into mel spectrograms and then uses WaveNet as a vocoder to turn those spectrograms into speech, achieving a high degree of similarity to natural human speech.
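
To make the intermediate representation concrete, the snippet below computes a log mel spectrogram from a waveform using librosa. The frame size, hop length, and 80 mel bands are common settings for neural TTS rather than Tacotron 2's exact configuration, and a synthetic tone stands in for real speech.

```python
import librosa
import numpy as np

# Compute the kind of mel spectrogram an acoustic model predicts and a
# neural vocoder consumes. Parameter values are common defaults, not the
# exact Tacotron 2 configuration.
sr = 22050
y = librosa.tone(220, sr=sr, duration=2.0)   # synthetic tone as stand-in audio
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)           # log compression, as models typically use
print(log_mel.shape)                         # (n_mels, n_frames)
```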

From these beginnings, the technology has evolved dramatically. Today's text-to-speech systems are built on AI that can generate content such as text, images, and lifelike speech, greatly expanding the capabilities of digital tools. Modern TTS in particular has seen remarkable progress, revolutionizing areas such as virtual assistants, accessibility tools, and multimedia production. By synthesizing human-like speech, TTS systems have become integral to providing seamless, natural communication between machines and humans. As demand for more intuitive and interactive digital solutions grows, TTS systems have evolved to produce increasingly accurate, expressive, and natural-sounding voices, driven by innovations in model architectures and deep learning techniques that allow for greater flexibility, clarity, and emotional expressiveness. Despite this rapid evolution, there is still a clear need for objective benchmarking methods that assess TTS models on human-centered qualities such as intelligibility, naturalness, clarity, and expressiveness, which are central to user satisfaction and engagement. Objective metrics, including Word Error Rate (WER), Character Error Rate (CER), and prosodic factors like pitch and cadence, provide a reliable means of assessing and comparing model performance on these qualities. This study therefore benchmarks multiple TTS model architectures using a comprehensive set of objective metrics relevant to human interaction and communication, including WER, CER, signal-to-noise ratio (SNR), pitch variation, and articulation index, among others.