
Breaking language barriers: Fine-tuning Whisper for Hindi


Vineet Suryan
February 19, 2025

Automatic speech recognition has advanced significantly in recent years, but regional languages like Hindi come with unique challenges. With its rich use of diacritics and complex characters, Hindi requires specialized models for accurate transcription. We’re excited to present Whisper for Hindi, a fine-tuned version of OpenAI’s Whisper, designed specifically for Hindi Automatic Speech Recognition (ASR). With 2,500 hours of Hindi speech data and innovative techniques like Indic Normalization, this model sets a new benchmark for Hindi ASR.

Why normalization matters in Hindi ASR

In Hindi, the meaning of a sentence can drastically change if diacritics or characters are removed, making normalization a critical part of the pipeline. Consider this example:

  • Original Hindi sentence:

हमने​ उस​ उम्मीदवार​ को​ चुना​।

  • Whisper's Default Normalization:

Whisper applies aggressive text normalization, often stripping diacritics and compressing words. Here's how the same sentence looks after Whisper normalization:

हमन उस उमम दव र क चन

The removal of diacritics and loss of word boundaries result in text that is difficult to interpret and often meaningless.

  • Indic Normalization to the Rescue:

Instead of Whisper's default normalization, we employed Indic Normalization from the IndicNLP Library, which retains diacritics and complex characters, producing more linguistically accurate transcriptions:

हमने उस उम्मीदवार को चुना।

While Whisper's default normalization might reduce the Word Error Rate (WER) on numeric benchmarks, it sacrifices semantic accuracy. For Hindi, maintaining diacritics and preserving complex characters is vital for transcription quality, even if it slightly increases the WER. This trade-off ensures that the transcriptions are meaningful and contextually accurate.
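
To illustrate the difference, here is a minimal sketch comparing the two normalizers in Python. It assumes the indic-nlp-library package for Indic Normalization and the BasicTextNormalizer that Whisper applies to non-English text (shipped with the transformers package); the printed output should roughly match the examples above.

    # Comparing Whisper's default normalizer with Indic Normalization.
    # Assumes: pip install indic-nlp-library transformers
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from transformers.models.whisper.english_normalizer import BasicTextNormalizer

    sentence = "हमने उस उम्मीदवार को चुना।"

    # Whisper's basic normalizer replaces combining marks (matras) and
    # punctuation with spaces, which is what strips the diacritics above.
    whisper_norm = BasicTextNormalizer()
    print(whisper_norm(sentence))

    # IndicNLP's normalizer keeps matras and conjunct characters intact.
    indic_norm = IndicNormalizerFactory().get_normalizer("hi")
    print(indic_norm.normalize(sentence))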

The dataset journey

  • Shrutilipi by AI4Bharat: a rich ASR corpus mined from All India Radio news bulletins, spanning 12 Indian languages. Total hours: 6,400+; Hindi subset: ~1,600 hours.
  • IIT Madras SpringLab: licensed under CC-BY-4.0, mock conversations and monologues on diverse topics form the backbone of this dataset. Total hours: ~900.
  • Mozilla Foundation’s Common Voice 11.0: released under CC0-1.0, adding robustness through community-contributed voice data.
  • Google Fleurs: used for testing, providing a comprehensive benchmark with its extensive multilingual dataset, licensed under CC-BY-4.0.

Together, these datasets provide diverse accents, speaking styles, and content, making Whisper-Hindi robust for real-world ASR applications.
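
As a rough sketch, the public portions of these datasets can be pulled straight from the Hugging Face Hub. The dataset IDs below are the public Hub identifiers, but the exact configurations and splits used in our pipeline may differ, and Common Voice requires accepting its terms on the Hub first.

    # Loading the public Hindi ASR datasets from the Hugging Face Hub.
    # Exact configs/splits used in our pipeline may differ.
    from datasets import Audio, load_dataset

    common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
    fleurs_test = load_dataset("google/fleurs", "hi_in", split="test")

    # Whisper expects 16 kHz audio, so resample on the fly.
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))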

WebDataset: The game changer

Training on large datasets with Hugging Face’s default dataloaders can be slow and inefficient, especially when working with thousands of hours of speech data. By converting our preprocessed datasets into WebDataset shards, we achieved up to a 5x speed-up in training.

WebDataset enables sequential sampling and better utilization of hardware, minimizing data-loading bottlenecks. With its ability to support distributed training and efficient data streaming, WebDataset proved instrumental in fine-tuning Whisper-Hindi on large-scale datasets. This speed-up not only saved time but also allowed us to experiment and iterate faster.
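
A minimal sketch of the shard-writing and streaming pattern with the webdataset package is shown below; the shard names, sample keys, and placeholder data are illustrative rather than the exact layout of our released shards.

    # Writing preprocessed samples to WebDataset shards, then streaming them back.
    # Shard names, keys, and the placeholder sample are illustrative.
    import io
    import numpy as np
    import webdataset as wds

    # Placeholder: (log-Mel features, transcript) pairs from preprocessing.
    samples = [(np.zeros((80, 3000), dtype=np.float32), "हमने उस उम्मीदवार को चुना।")]

    with wds.ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
        for i, (features, text) in enumerate(samples):
            buf = io.BytesIO()
            np.save(buf, features)
            sink.write({"__key__": f"sample{i:08d}", "npy": buf.getvalue(), "txt": text})

    # Sequential reads keep the GPU fed without random-access overhead.
    dataset = (
        wds.WebDataset("shards/train-{000000..000000}.tar")
        .decode()
        .to_tuple("npy", "txt")
    )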

Training details

Our training setup was optimized for efficiency and performance. Leveraging a single RTX 4090, we fine-tuned the models over 3 epochs using a linear learning rate scheduler and the 8-bit AdamW optimizer with a weight decay of 0.05. Despite using lower maximum learning rates than those in the Whisper paper, our approach effectively balances computational efficiency with robust performance.

  • Hardware: Single NVIDIA RTX 4090 GPU
  • Training duration: 3 epochs for all model sizes
  • Effective batch size: 128
  • Optimizer: AdamW 8-bit (from bitsandbytes) with:
    • β₁ = 0.9
    • β₂ = 0.999
    • ε = 1e-8
    • Weight Decay = 0.05
  • Learning rate scheduler: Linear scheduler
  • Learning rate comparison:
    Model Size    Our Max LR    Paper Max LR
    Tiny          3.75e-4       1.5e-3
    Base          3.5e-4        1e-3
    Small         1.75e-4       5e-4


We opted for lower maximum learning rates than those in the Whisper paper to ensure gentle adaptation of the pre-trained weights during fine-tuning—this helps prevent the model from overshooting the optimum or getting stuck in suboptimal local minima.
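
As a sketch, these settings map onto Hugging Face’s Seq2SeqTrainingArguments roughly as follows. The batch-size split (16 × 8 gradient-accumulation steps), the fp16 flag, and the output directory are assumptions; everything else follows the values listed above, with the learning rate shown for the Small model.

    # Mapping the hyperparameters above onto Seq2SeqTrainingArguments.
    # Batch split, fp16, and output_dir are assumptions; the rest follows the list/table.
    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="whisper-small-hi",        # illustrative output directory
        per_device_train_batch_size=16,       # 16 x 8 gradient-accumulation steps
        gradient_accumulation_steps=8,        #   = effective batch size of 128
        learning_rate=1.75e-4,                # max LR for the Small model (see table)
        lr_scheduler_type="linear",
        num_train_epochs=3,
        optim="adamw_bnb_8bit",               # AdamW 8-bit from bitsandbytes
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-8,
        weight_decay=0.05,
        fp16=True,
    )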

All our training plots are publicly available on Weights & Biases. You can follow our progress and inspect the training curves through this Weights & Biases Dashboard.

Performance at a glance

Baseline Word Error Rate (WER %) on Google Fleurs:

    Model Size    Whisper Norm    Indic Norm
    Tiny          172.60          196.57
    Base          149.17          160.58
    Small         67.37           89.73


Fine-Tuned Whisper WER (%):

    Model Size    Whisper Norm    Indic Norm
    Tiny          14.21           22.15
    Base          11.78           19.44
    Small         10.11           17.35

 

Fig.1 - WER comparison by model size & normalization type.


While the Whisper Normalization variant achieves a lower WER, it does so at the expense of semantic integrity. In contrast, Indic Normalization, although it results in a slightly higher WER, preserves critical diacritics and complex characters, ensuring that transcriptions remain both linguistically and contextually accurate. This deliberate trade-off underlines the practical benefits of our approach for real-world Hindi ASR applications.
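
For reference, here is a minimal sketch of how WER can be scored under both normalization schemes, using the jiwer-backed "wer" metric from the evaluate library; the one-sentence predictions and references are placeholders.

    # Scoring the same predictions under both normalization schemes.
    # The predictions/references below are placeholders.
    import evaluate
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from transformers.models.whisper.english_normalizer import BasicTextNormalizer

    wer_metric = evaluate.load("wer")
    whisper_norm = BasicTextNormalizer()
    indic_norm = IndicNormalizerFactory().get_normalizer("hi").normalize

    predictions = ["हमने उस उम्मीदवार को चुना।"]
    references = ["हमने उस उम्मीदवार को चुना।"]

    def score(norm):
        return 100 * wer_metric.compute(
            predictions=[norm(p) for p in predictions],
            references=[norm(r) for r in references],
        )

    print(f"Whisper norm WER: {score(whisper_norm):.2f}%")
    print(f"Indic norm WER:   {score(indic_norm):.2f}%")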

Where to find the model & dataset

The fine-tuned models are available on Hugging Face, and the GitHub repository includes scripts for preprocessing, WebDataset conversion, and training. Feel free to explore and contribute.
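
As a quick-start sketch, a fine-tuned checkpoint can be loaded through the transformers ASR pipeline; the model ID and audio path below are placeholders, so substitute the checkpoint name from our Hugging Face page and your own recording.

    # Transcribing Hindi audio with a fine-tuned checkpoint.
    # Model ID and audio path are placeholders.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="collabora/whisper-small-hindi",  # placeholder model ID
        generate_kwargs={"language": "hi", "task": "transcribe"},
    )
    print(asr("sample_hindi.wav")["text"])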

Future work

Looking ahead, we plan to focus on several key improvements. One of the primary areas of development will be hyperparameter tuning to optimize the model’s performance further. Additionally, we aim to train medium and large-sized models, which should improve transcription accuracy, particularly for more complex speech patterns. Expanding the training dataset is also a priority, as adding more diverse Hindi dialects, accents, and real-world speech data will help improve the model's robustness and generalization ability. These efforts will contribute to refining Whisper for Hindi and making it more effective in diverse ASR applications.

Acknowledgments

This project was made possible by the following resources: Shrutilipi by AI4Bharat, IIT Madras SpringLab, Mozilla Foundation’s Common Voice, Google Fleurs, the IndicNLP Library, and the WebDataset project.

Outlook

Whisper-Hindi represents a significant step forward in creating high-quality ASR models for Hindi. By leveraging Indic Normalization and optimizing the training pipeline with WebDataset, we’ve achieved a balance between accuracy and semantic integrity.

The use of WebDataset not only accelerated training but also enabled efficient utilization of resources for large-scale ASR tasks. Its sequential sampling approach ensures that hardware is utilized optimally, allowing us to handle massive datasets without compromising on training speed.

Future work may explore incorporating more diverse dialects and further refining normalization techniques to push the boundaries of Hindi ASR even further.

Try it for yourself from Hugging Face or GitHub and help us make ASR accessible to everyone!

 
