Jakub Piotr Cłapa
September 13, 2023
TL;DR: Collabora is building the best natural-sounding Open Source speech synthesis solution which is ready for commercial use – based only on properly licensed speech datasets and unrestricted Open Source code.
In the modern digital era, the influence of speech technology is rapidly expanding. Text-to-speech (TTS) models are playing a transformative role, from enriching audiobooks to enhancing podcasts and even improving interactions with chatbots. We're introducing a new player in this field – WhisperSpeech, an Open Source text-to-speech model developed by Collabora.
Text-to-speech technology is no longer limited to screen readers or creating impromptu audiobooks from blog posts for personal use. Its applications now extend across a range of areas beyond traditional uses. While the most apparent application is to produce captivating narrated content or improve podcasts, WhisperSpeech has potential uses in:
Audio Editing: TTS models offer creators the ability to seamlessly modify audio tracks in podcasts and videos. This YouTube video demonstrates the replacement of explicit content with synthesized speech, providing a creative solution for self-censorship and content adaptation. Another application is editing interviews and improving vocal performances without re-recording.
Interactive Voice Response (IVR) Systems: WhisperSpeech's natural-sounding speech is ideal for IVR systems, making automated interactions more personalized and engaging for customers.
Public Announcements: In public spaces or commercial environments, the model's realistic speech could be harnessed for clear and effective announcements.
WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.
Collabora holds ambitious plans for the future of WhisperSpeech. Larger models with even higher speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, our company builds this foundational model using properly licensed datasets, facilitating commercial use without legal concerns.
An end-to-end generation example, inspired by one famous president’s speech:
WhisperSpeech's foundation traces back to the SPEAR TTS paper from Google Research. Its impressive quality and straightforward design inspired us, but Google released neither the code nor the weights, so we set out to develop a new Open Source TTS model.
At the time, Collabora was working on WhisperLive, a live transcription tool based on OpenAI's Whisper model. The exceptional accuracy and multilingual capabilities of the Whisper models were striking, yet Google's SPEAR TTS solution used custom models for speech transcription and semantic token extraction. This led to the question: could Whisper take over these tasks and improve on Google's approach by bringing in a high-quality supervised model?
WhisperSpeech's innovative architecture takes its inspiration from the Whisper speech recognition model and reverses its operation to move from transcription to text-to-speech synthesis. This unique approach opens doors to a host of possibilities in generating natural speech. The model utilizes existing open-source technologies such as the Encodec audio codec from Meta and the Vocos vocoder from charactr, ensuring efficiency by building upon established solutions.
The insights gained from the SPEAR TTS architecture were pivotal in this advancement. Dividing speech synthesis into two separate phases, reading and speaking, makes the process more manageable and more accurate. Both matter because a single sentence can be spoken in many subtly different ways.
Whisper, the foundation of WhisperSpeech, consists of two main parts: the encoder and the decoder. The encoder produces a continuous stream of embeddings, each containing contextual information and prosody that enrich the audio data. The decoder then transforms these embeddings into words, utilizing cross-attention mechanisms to identify sound fragments composing each word.
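To make the cross-attention step more concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration only, not Whisper's implementation: the real model uses learned query, key and value projections and many heads, while this toy version uses the raw vectors directly. The names `cross_attention`, `frames` and `query` are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one decoder query.

    query:  one decoder-side vector (the word currently being produced)
    keys:   one vector per encoder frame
    values: one vector per encoder frame (same count as keys)
    Returns (attention weights over frames, weighted sum of values).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy data (assumption): three encoder frames and a query that points
# in the same direction as frame 1.
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.0, 4.0]

# No learned projections here, so the frames serve as both keys and values.
weights, context = cross_attention(query, frames, frames)
best_frame = max(range(len(weights)), key=weights.__getitem__)
```

The weights sum to one, and the frame that best matches the query receives most of the attention. This is the sense in which the decoder "identifies" the sound fragments that belong to a word.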
WhisperSpeech operates in reverse: it takes text as input, generates embeddings, and then transforms them into sound. Whisper's encoder output is quantized to form semantic tokens (500 bps), which carry phonetic and prosodic attributes. Meta's Encodec model is then used to compress the audio data into acoustic tokens at 1.5 kbps. Both token streams are handled by standard seq2seq transformers: Collabora trained one model for the Text-to-Semantic (T2S) step and another for the Semantic-to-Acoustic (S2A) step. Combined with the open-source Vocos vocoder, these models produce high-quality speech from text input.
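The quantization and bitrate figures above can be illustrated with a small sketch. It shows nearest-neighbour vector quantization (one common way to turn continuous embeddings into discrete tokens) and the arithmetic that links a token rate to a bitrate. The helper names, the toy codebook and the 1024-entry semantic codebook are assumptions for illustration, not WhisperSpeech's actual configuration. The Encodec numbers (75 frames per second, two 1024-entry codebooks at 1.5 kbps) come from the published 24 kHz Encodec model.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def quantize(embeddings, codebook):
    """Map each continuous embedding to one discrete token id."""
    return [nearest_code(vec, codebook) for vec in embeddings]

def bitrate_bps(tokens_per_second, codebook_size, codebooks=1):
    """Bits per second needed to transmit the token stream."""
    return tokens_per_second * codebooks * math.log2(codebook_size)

# Toy 2-D codebook with four entries (assumption: real codebooks are
# learned during training and are far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]
tokens = quantize(embeddings, codebook)

# Encodec 24 kHz at 1.5 kbps: 75 frames/s, 2 codebooks of 1024 entries each.
acoustic_bps = bitrate_bps(75, 1024, codebooks=2)

# Hypothetical semantic stream: 50 tokens/s from one 1024-entry codebook.
semantic_bps = bitrate_bps(50, 1024)
```

Under these assumptions the acoustic stream comes to 1500 bits per second and the semantic stream to 500 bits per second. A lower-rate semantic stream keeps what was said and roughly how, while the acoustic tokens add the detail needed to reconstruct the waveform.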
Collabora's WhisperSpeech project holds promise in reshaping communication, content creation, and interaction through text-to-speech technology. WhisperSpeech's unique approach, built upon the successes of Whisper and SPEAR TTS, has the potential to establish new standards in open-source natural speech synthesis.
As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperSpeech's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperSpeech is poised to make an impact in the speech technology landscape. For a more in-depth discussion about WhisperSpeech, check out the latest episode of Democratizing AI: "Open Source Text-To-Speech Projects: WhisperSpeech - In Depth Discussion".
Comments (3)
Stuart Naylor:
Sep 14, 2023 at 12:40 PM
Hi Jakub, I have a tangential question, as with TTS there are a lot of models to choose from but I am still struggling to find a single casual online BSS (Blind Source Separation) alg or model for binaural audio that favours low computational cost for embedded.
It's been a long-term hindrance to open-source voice assistants, as key elements at the start of the audio chain are missing.
I do follow the excellent work you guys do at Collabora and in this area there seems to be no libs, alsa or other BSS and for me binaural would be choice.
So I just thought I would highlight in Linux it seems to be missing whilst we have quite a choice with the likes of TTS even if another is great news.
Stuart Naylor:
Sep 21, 2023 at 03:59 PM
PS it doesn't have to be BSS as there are filters also, and same with binaural, but after some research the binaural alg likely could have better SNR with lower computational cost, and looking at what Google did it seems it does fit low-cost devices.
The start of the 'Smart Assistant' audio chain is missing in Linux as open source and it's extremely heavy on science and math, but surely with contacts of bigger entities someone can provide something to replace what are now very outdated methods of Speex and Rnnoise.
Jakub Piotr Cłapa:
Sep 21, 2023 at 09:20 PM
These are good points, thank you. We decided to start with text-to-speech because we noticed a flurry of research activity on proprietary models which resulted in much better quality and we did not want Open Source solutions to be left behind. But we did discuss voice enhancement and directional hearing through multiple microphones (most of us have a few microphones around these days and yet speech on video calls is often undecipherable) as another potential project that could be interesting, especially for embedded AI applications.
There are some recent papers on using neural networks to do source separation: https://paperswithcode.com/task/audio-source-separation/latest but it would require some work to filter out those that talk about music track separation and such.