Marcus Edel
January 25, 2024
In this blog post, Vineet Suryan, Jakub Piotr Cłapa, and Marcus Edel share their research and findings toward implementing real-time communication with an AI chatbot.
You know that anticipation that sets in when you’re expecting a message from a potential interest? Keeping your phone constantly in your peripheral vision, lunging at every buzz? Now chatbots can give you that same excitement!
The great advantage of bots is supposed to be that they can reply instantly, without spending any time typing or even thinking. Look closer, though, and you’ll notice that a machine that supposedly responds in milliseconds still leaves a clear delay between the end of human speech and the bot’s spoken answer. Even when the information is accurate, that delay makes the interaction feel unnatural and can frustrate users.
That is why at Collabora, we looked at every piece of the process and implemented an ultra-low latency pipeline using WhisperLive and WhisperSpeech.
There is both a long and a short answer to how we achieved ultra-low latency communication with an AI chatbot.
Simply put, by using WhisperLive and WhisperSpeech:
WhisperLive is a nearly-live implementation of OpenAI's Whisper. The project is a real-time transcription application that uses the OpenAI Whisper model to convert speech input into text output. It can transcribe both live audio from a microphone and pre-recorded audio files. Unlike traditional speech recognition systems that rely on continuous audio streaming, we use voice activity detection (VAD) to detect the presence of speech and only send audio data to Whisper when speech is detected. This reduces the amount of data sent to the Whisper model and improves the accuracy of the transcription output. Check out our transcription post and the WhisperLive repository for more details.
WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.
Collabora holds ambitious plans for the future of WhisperLive and WhisperSpeech. Larger models with even higher transcription and speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, we are building this foundational model using properly licensed datasets, which facilitates commercial use without legal concerns. Explore our text-to-speech post and the WhisperSpeech repository for some background.
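The voice activity detection step described above for WhisperLive can be illustrated with a toy example. This is a minimal sketch, not WhisperLive's code: it assumes a plain energy threshold in place of a real VAD model, and `speech_segments`, `transcribe`, and the `threshold` value are hypothetical names and numbers chosen for illustration.

```python
import math
from typing import Iterable, Iterator, List

Frame = List[float]  # one short chunk of audio samples in the range [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(frames: Iterable[Frame],
                    threshold: float = 0.02) -> Iterator[List[Frame]]:
    """Group consecutive frames judged to be speech; silent frames are dropped.

    Assumption: a frame counts as speech when its RMS energy reaches `threshold`.
    A production system would use a trained VAD model instead.
    """
    segment: List[Frame] = []
    for frame in frames:
        if rms(frame) >= threshold:
            segment.append(frame)
        elif segment:
            yield segment  # silence ends the current speech segment
            segment = []
    if segment:
        yield segment


def transcribe(segment: List[Frame]) -> str:
    """Hypothetical stand-in for a call into a speech-to-text model."""
    return f"<{len(segment)} speech frames>"


if __name__ == "__main__":
    silence = [0.0] * 160
    voiced = [0.1 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(160)]
    stream = [silence, voiced, voiced, silence, silence, voiced]
    # Only the voiced segments ever reach the (stand-in) transcription model.
    for seg in speech_segments(stream):
        print(transcribe(seg))
```

The point of the gate is that silent frames never reach the expensive model, so the transcriber only does work when there is something to transcribe.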
To break it down, there are two main reasons why chatbot conversations are slow: an algorithmic one and a hardware one.
Algorithmically, the usual pipeline is implemented in a strictly sequential way: record the audio, wait until the audio is transcribed, send the transcription to the large language model, generate the audio using the text-to-speech model, and send everything to the client. In contrast, we implemented a highly parallel pipeline that doesn't wait for one process to finish before it triggers the next process. A high-level overview of our pipeline can be described as:
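To make the contrast concrete, here is a toy sketch in Python. It is an illustration under assumptions, not the WhisperFusion implementation: the three model functions are hypothetical stand-ins, and threads connected by queues are just one possible way to let each stage start on a chunk as soon as the previous stage emits it.

```python
import queue
import threading
from typing import Callable, List


def stage(work: Callable[[str], str],
          inbox: queue.Queue,
          outbox: queue.Queue) -> None:
    """Apply `work` to each item as soon as it arrives; None marks end of stream."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # pass the end-of-stream marker downstream
            return
        outbox.put(work(item))


# Hypothetical stand-ins for the three models in the pipeline.
def transcribe(audio_chunk: str) -> str:
    return f"text({audio_chunk})"


def generate_reply(text: str) -> str:
    return f"reply({text})"


def synthesize(reply: str) -> str:
    return f"audio({reply})"


def run_pipeline(chunks: List[str]) -> List[str]:
    """Push chunks through three concurrently running stages and collect the output."""
    queues = [queue.Queue() for _ in range(4)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate((transcribe, generate_reply, synthesize))
    ]
    for worker in workers:
        worker.start()

    # Feed the first stage; later stages begin work before all input has arrived.
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(None)

    results = []
    while (item := queues[-1].get()) is not None:
        results.append(item)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["chunk-1", "chunk-2", "chunk-3"]))
```

With one worker per stage, output order matches input order, while the transcription of a later chunk can overlap with reply generation and synthesis for an earlier one.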
The hardware-related explanation is straightforward. The core issue lies in the size of the transcription, large language, and text-to-speech models, which are typically enormous. Even the smaller models have over 50 million parameters, all of which need to be stored in RAM. The problem with RAM is its relative slowness. To counteract this, CPUs and GPUs are equipped with fast cache memory situated close to the processor for quicker access. The specifics vary depending on the processor's type and model, but the crux of the matter is that most models run slowly either because they exceed the cache capacity or because they fail to fully utilize the available hardware.
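A quick back-of-the-envelope calculation shows why even a "small" model does not fit in cache. The sketch below only counts the weights, and the 32 MiB cache size is an illustrative assumption rather than a figure from any specific processor; `model_size_mib` is a hypothetical helper.

```python
def model_size_mib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed just to hold the weights, in MiB."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)


PARAMS = 50_000_000        # the "over 50 million parameters" figure from the text
ASSUMED_CACHE_MIB = 32.0   # illustrative last-level cache size (an assumption)

if __name__ == "__main__":
    for label, width in (("float32", 4), ("float16", 2), ("int8", 1)):
        size = model_size_mib(PARAMS, width)
        ratio = size / ASSUMED_CACHE_MIB
        print(f"{label}: {size:.1f} MiB of weights, {ratio:.1f}x the assumed cache")
```

Under these assumptions, the weights exceed the cache at every precision shown, so they have to be streamed from slower memory during inference. Narrower data types shrink the gap, which is one reason quantization helps.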
A straightforward way to speed up inference is to just buy better hardware or to take better advantage of the hardware you have. We opted for the second and incorporated multiple optimization strategies including torch.compile, TensorRT, batching, quantization, KV caching, attention, decoding, etc.
In this blog post we will not go into the details but will dive into some of these optimization techniques in a related follow-up post.
Want to see this in action? Just head over to the WhisperFusion repository and run the script, or watch the sample video:
The future of customer interaction lies in the harmonious fusion of sophisticated AI and powerful communication technologies. The era of waiting for delayed responses or dealing with inefficient chatbots is fading thanks to the innovative strides we have shown with WhisperFusion.
You can achieve real-time, efficient, intelligent communication by combining the rapid processing capabilities of WhisperLive and WhisperSpeech with low-latency communication implementations. This adaptability ensures that your model stays a step ahead as your business expands while still meeting customers' needs, a hallmark of top-notch service.
As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperFusion's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperFusion is poised to make an impact in the communication technology landscape.
If you have questions or ideas, join us on our Gitter #lounge channel or leave a comment down below.