ToucanTTS - A Toolkit for State-of-the-Art Speech Synthesis

a massively multilingual model covering over 7,000 languages

What is ToucanTTS?

ToucanTTS is a toolkit developed by the Institute for Natural Language Processing (IMS) at the University of Stuttgart, Germany, for teaching, training, and using state-of-the-art speech synthesis models. It is built entirely in Python and PyTorch, aiming to be simple, beginner-friendly, yet powerful.

ToucanTTS Features

Multilingual and Multi-Speaker Support

Supports multilingual speech synthesis covering over 7,000 languages through a massively multilingual pretrained model. Enables multi-speaker speech synthesis and cloning of prosody (rhythm, stress, intonation) across speakers.

Human-in-the-Loop Editing

Allows human-in-the-loop editing of synthesized speech, e.g., for poetry reading and literary studies.

Interactive Demos

Provides interactive demos for massively multilingual speech synthesis, style cloning across speakers, voice design, and human-edited poetry reading.

Architecture and Components

Primarily based on the FastSpeech 2 architecture, with modifications such as a normalizing flow-based PostNet inspired by PortaSpeech. Includes a self-contained aligner trained with Connectionist Temporal Classification (CTC) and spectrogram reconstruction for various applications. Offers pretrained models for the multilingual model, aligner, embedding function, vocoder, and embedding GAN.
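As a toy illustration of the CTC idea the aligner relies on (this sketch is not ToucanTTS code): CTC lets the model emit one label per spectrogram frame, including a special blank symbol, and a collapsing rule recovers the phoneme sequence by merging repeats and dropping blanks.

```python
# Toy illustration of CTC's collapsing rule: merge consecutive repeated
# labels, then drop the blank symbol. This is not ToucanTTS code; it only
# sketches the idea behind CTC-based alignment of frames to phonemes.

BLANK = "-"

def ctc_collapse(frame_labels):
    """Map a frame-level label sequence to the phoneme sequence it encodes."""
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous:  # merge consecutive repeats
            collapsed.append(label)
        previous = label
    return [label for label in collapsed if label != BLANK]  # drop blanks

# Many different frame sequences collapse to the same phoneme sequence,
# which is what lets CTC marginalize over alignments during training.
print(ctc_collapse(["h", "h", "-", "e", "l", "l", "-", "l", "o"]))
# → ['h', 'e', 'l', 'l', 'o']
```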

Ease of Use

Built entirely in Python and PyTorch, aiming to be simple and beginner-friendly while still powerful.

Articulatory Representations

The IMS Toucan system incorporates articulatory representations of phonemes as input, allowing multilingual data to benefit low-resource languages.
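To see why this helps low-resource languages, consider a toy encoding in the same spirit; the feature set and values below are illustrative assumptions, not the toolkit's actual phoneme inventory.

```python
# Illustrative sketch (not the toolkit's actual feature inventory):
# phonemes are encoded as vectors of articulatory features rather than
# language-specific one-hot IDs, so a phoneme learned from a high-resource
# language shares its representation with the same phoneme everywhere.

ARTICULATORY_FEATURES = {
    # phoneme: (voiced, nasal, plosive, fricative)
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 0, 0),
    "s": (0, 0, 0, 1),
    "z": (1, 0, 0, 1),
}

def encode(phonemes):
    """Turn a phoneme sequence into a list of articulatory feature vectors."""
    return [ARTICULATORY_FEATURES[p] for p in phonemes]

# "b" gets the same vector regardless of which language it appears in,
# so low-resource languages reuse what the model learned elsewhere.
print(encode(["b", "s"]))  # → [(1, 0, 1, 0), (0, 0, 0, 1)]
```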

How to use ToucanTTS?

Let's get started with ToucanTTS in just a few simple steps.

1

Installation

Clone the IMS-Toucan repository from GitHub and install its Python dependencies:

git clone https://github.com/DigitalPhonetics/IMS-Toucan
cd IMS-Toucan
pip install -r requirements.txt
2

Preparing Data

Write a function that maps audio file paths to their transcripts, create a custom training pipeline script, and make sure the text frontend supports your language.
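Such a path-to-transcript function can be as simple as parsing a metadata file. The file format, column separator, and function name below are illustrative assumptions; adapt them to your corpus and to the pipeline scripts in the repository.

```python
import os

def build_path_to_transcript_dict(metadata_path, audio_dir):
    """Map absolute audio file paths to their transcripts.

    Assumes a pipe-separated metadata file with lines like
    ``clip_0001.wav|Hello world.`` -- adapt the parsing to your corpus.
    """
    path_to_transcript = {}
    with open(metadata_path, encoding="utf-8") as metadata:
        for line in metadata:
            filename, transcript = line.strip().split("|", maxsplit=1)
            path_to_transcript[os.path.join(audio_dir, filename)] = transcript
    return path_to_transcript
```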

3

Download Pretrained Models

Download pretrained models like the multilingual model, aligner, and vocoder using the provided script.

4

Apply Patches/Fixes

Apply any necessary patches or fixes to the codebase.

5

Model Training

Run your custom training pipeline script to fine-tune the models on your data.

6

Inference

Use the provided interactive demos or scripts to generate speech from text with the trained models.

Frequently Asked Questions

Have a question? Check out some of the common queries below.

What is the primary architecture used in ToucanTTS?

ToucanTTS is primarily based on the FastSpeech 2 architecture with modifications like a normalizing flow-based PostNet inspired by PortaSpeech.

How does ToucanTTS support low-resource languages?

ToucanTTS incorporates articulatory representations of phonemes as input, allowing multilingual data to benefit low-resource languages.

Can ToucanTTS be used for multi-speaker speech synthesis?

Yes, ToucanTTS enables multi-speaker speech synthesis and cloning of prosody (rhythm, stress, intonation) across speakers.

What kind of demos are available in ToucanTTS?

ToucanTTS provides interactive demos for massively multilingual speech synthesis, style cloning across speakers, voice design, and human-edited poetry reading.

How many languages are covered by the massively multilingual pretrained model in ToucanTTS?

The massively multilingual pretrained model in ToucanTTS covers over 7,000 languages.

Is ToucanTTS easy to use?

Yes, ToucanTTS is built entirely in Python and PyTorch, aiming to be simple and beginner-friendly while still powerful.