
Breaking language barriers: Fine-tuning Whisper for Hindi


Vineet Suryan
February 19, 2025

Automatic speech recognition has advanced significantly in recent years, but regional languages like Hindi come with unique challenges. With its rich use of diacritics and complex characters, Hindi requires specialized models for accurate transcription. We’re excited to present Whisper for Hindi, a fine-tuned version of OpenAI’s Whisper, designed specifically for Hindi Automatic Speech Recognition (ASR). With 2,500 hours of Hindi speech data and innovative techniques like Indic Normalization, this model sets a new benchmark for Hindi ASR.

Why normalization matters in Hindi ASR

In Hindi, the meaning of a sentence can drastically change if diacritics or characters are removed, making normalization a critical part of the pipeline. Consider this example:

  • Original Hindi sentence:

हमने​ उस​ उम्मीदवार​ को​ चुना​।

  • Whisper's Default Normalization:

Whisper applies aggressive text normalization, often stripping diacritics and compressing words. Here's how the same sentence looks after Whisper normalization:

हमन उस उमम दव र क चन

The removal of diacritics and loss of word boundaries result in text that is difficult to interpret and often meaningless.

  • Indic Normalization to the Rescue:

Instead of Whisper's default normalization, we employed Indic Normalization from the IndicNLP Library, which retains diacritics and complex characters, producing more linguistically accurate transcriptions:

हमने उस उम्मीदवार को चुना।

While Whisper's default normalization might reduce the Word Error Rate (WER) on numeric benchmarks, it sacrifices semantic accuracy. For Hindi, maintaining diacritics and preserving complex characters is vital for transcription quality, even if it slightly increases the WER. This trade-off ensures that the transcriptions are meaningful and contextually accurate.
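
To illustrate the difference, here is a minimal sketch comparing the two normalizers in Python. It assumes the indic-nlp-library package for Indic Normalization and the BasicTextNormalizer that Whisper applies to non-English text (shipped with the transformers package); the printed output should roughly match the examples above.

    # Comparing Whisper's default normalizer with Indic Normalization.
    # Assumes: pip install indic-nlp-library transformers
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from transformers.models.whisper.english_normalizer import BasicTextNormalizer

    sentence = "हमने उस उम्मीदवार को चुना।"

    # Whisper's basic normalizer replaces combining marks (matras) and
    # punctuation with spaces, which is what strips the diacritics above.
    whisper_norm = BasicTextNormalizer()
    print(whisper_norm(sentence))

    # IndicNLP's normalizer keeps matras and conjunct characters intact.
    indic_norm = IndicNormalizerFactory().get_normalizer("hi")
    print(indic_norm.normalize(sentence))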

The dataset journey

  • Shrutilipi by AI4Bharat: a rich ASR corpus mined from All India Radio news bulletins, spanning 12 Indian languages. Total hours: 6,400+; Hindi subset: ~1,600 hours.
  • IIT Madras SpringLab: licensed under CC-BY-4.0, mock conversations and monologues on diverse topics form the backbone of this dataset. Total hours: ~900.
  • Mozilla Foundation’s Common Voice 11.0: released under CC0-1.0, adding robustness through community-contributed voice data.
  • Google Fleurs: used for testing, providing a comprehensive benchmark with its extensive multilingual dataset, licensed under CC-BY-4.0.

Together, these datasets provide diverse accents, speaking styles, and content, making Whisper-Hindi robust for real-world ASR applications.
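
As a rough sketch, the public portions of these datasets can be pulled straight from the Hugging Face Hub. The dataset IDs below are the public Hub identifiers, but the exact configurations and splits used in our pipeline may differ, and Common Voice requires accepting its terms on the Hub first.

    # Loading the public Hindi ASR datasets from the Hugging Face Hub.
    # Exact configs/splits used in our pipeline may differ.
    from datasets import Audio, load_dataset

    common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
    fleurs_test = load_dataset("google/fleurs", "hi_in", split="test")

    # Whisper expects 16 kHz audio, so resample on the fly.
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))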

WebDataset: The game changer

Training on large datasets with Hugging Face’s default dataloaders can be slow and inefficient, especially when working with thousands of hours of speech data. By converting our preprocessed datasets into WebDataset shards, we achieved up to a 5x speed-up in training.

WebDataset enables sequential sampling and better utilization of hardware, minimizing data-loading bottlenecks. With its ability to support distributed training and efficient data streaming, WebDataset proved instrumental in fine-tuning Whisper-Hindi on large-scale datasets. This speed-up not only saved time but also allowed us to experiment and iterate faster.
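
A minimal sketch of the shard-writing and streaming pattern with the webdataset package is shown below; the shard names, sample keys, and placeholder data are illustrative rather than the exact layout of our released shards.

    # Writing preprocessed samples to WebDataset shards, then streaming them back.
    # Shard names, keys, and the placeholder sample are illustrative.
    import io
    import numpy as np
    import webdataset as wds

    # Placeholder: (log-Mel features, transcript) pairs from preprocessing.
    samples = [(np.zeros((80, 3000), dtype=np.float32), "हमने उस उम्मीदवार को चुना।")]

    with wds.ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
        for i, (features, text) in enumerate(samples):
            buf = io.BytesIO()
            np.save(buf, features)
            sink.write({"__key__": f"sample{i:08d}", "npy": buf.getvalue(), "txt": text})

    # Sequential reads keep the GPU fed without random-access overhead.
    dataset = (
        wds.WebDataset("shards/train-{000000..000000}.tar")
        .decode()
        .to_tuple("npy", "txt")
    )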

Training details

Our training setup was optimized for efficiency and performance. Leveraging a single RTX 4090, we fine-tuned the models over 3 epochs using a linear learning rate scheduler and the 8-bit AdamW optimizer with a weight decay of 0.05. Despite using lower maximum learning rates than those in the Whisper paper, our approach effectively balances computational efficiency with robust performance.

  • Hardware: Single NVIDIA RTX 4090 GPU
  • Training duration: 3 epochs for all model sizes
  • Effective batch size: 128
  • Optimizer: AdamW 8-bit (from bitsandbytes) with:
    • β₁ = 0.9
    • β₂ = 0.999
    • ε = 1e-8
    • Weight Decay = 0.05
  • Learning rate scheduler: Linear scheduler
  • Learning rate comparison:
    Model Size    Our Max LR    Paper Max LR
    Tiny          3.75e-4       1.5e-3
    Base          3.5e-4        1e-3
    Small         1.75e-4       5e-4


We opted for lower maximum learning rates than those in the Whisper paper to ensure gentle adaptation of the pre-trained weights during fine-tuning—this helps prevent the model from overshooting the optimum or getting stuck in suboptimal local minima.
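
As a sketch, these settings map onto Hugging Face’s Seq2SeqTrainingArguments roughly as follows. The batch-size split (16 × 8 gradient-accumulation steps), the fp16 flag, and the output directory are assumptions; everything else follows the values listed above, with the learning rate shown for the Small model.

    # Mapping the hyperparameters above onto Seq2SeqTrainingArguments.
    # Batch split, fp16, and output_dir are assumptions; the rest follows the list/table.
    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="whisper-small-hi",        # illustrative output directory
        per_device_train_batch_size=16,       # 16 x 8 gradient-accumulation steps
        gradient_accumulation_steps=8,        #   = effective batch size of 128
        learning_rate=1.75e-4,                # max LR for the Small model (see table)
        lr_scheduler_type="linear",
        num_train_epochs=3,
        optim="adamw_bnb_8bit",               # AdamW 8-bit from bitsandbytes
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-8,
        weight_decay=0.05,
        fp16=True,
    )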

All our training plots are publicly available on Weights & Biases. You can follow our progress and inspect the training curves through this Weights & Biases Dashboard.

Performance at a glance

Baseline Word Error Rate (WER %) on Google Fleurs:

    Model Size    Whisper Norm    Indic Norm
    Tiny          172.60          196.57
    Base          149.17          160.58
    Small         67.37           89.73


Fine-Tuned Whisper WER (%):

    Model Size    Whisper Norm    Indic Norm
    Tiny          14.21           22.15
    Base          11.78           19.44
    Small         10.11           17.35

 

Fig.1 - WER comparison by model size & normalization type.


While the Whisper Normalization variant achieves a lower WER, it does so at the expense of semantic integrity. In contrast, Indic Normalization, although it results in a slightly higher WER, preserves critical diacritics and complex characters, ensuring that transcriptions remain both linguistically and contextually accurate. This deliberate trade-off underlines the practical benefits of our approach for real-world Hindi ASR applications.
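
For reference, here is a minimal sketch of how WER can be scored under both normalization schemes, using the jiwer-backed "wer" metric from the evaluate library; the one-sentence predictions and references are placeholders.

    # Scoring the same predictions under both normalization schemes.
    # The predictions/references below are placeholders.
    import evaluate
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from transformers.models.whisper.english_normalizer import BasicTextNormalizer

    wer_metric = evaluate.load("wer")
    whisper_norm = BasicTextNormalizer()
    indic_norm = IndicNormalizerFactory().get_normalizer("hi").normalize

    predictions = ["हमने उस उम्मीदवार को चुना।"]
    references = ["हमने उस उम्मीदवार को चुना।"]

    def score(norm):
        return 100 * wer_metric.compute(
            predictions=[norm(p) for p in predictions],
            references=[norm(r) for r in references],
        )

    print(f"Whisper norm WER: {score(whisper_norm):.2f}%")
    print(f"Indic norm WER:   {score(indic_norm):.2f}%")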

Where to find the model & dataset

The fine-tuned models are available on Hugging Face, and the GitHub repository includes scripts for preprocessing, WebDataset conversion, and training. Feel free to explore and contribute.
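
As a quick-start sketch, a fine-tuned checkpoint can be loaded through the transformers ASR pipeline; the model ID and audio path below are placeholders, so substitute the checkpoint name from our Hugging Face page and your own recording.

    # Transcribing Hindi audio with a fine-tuned checkpoint.
    # Model ID and audio path are placeholders.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="collabora/whisper-small-hindi",  # placeholder model ID
        generate_kwargs={"language": "hi", "task": "transcribe"},
    )
    print(asr("sample_hindi.wav")["text"])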

Future work

Looking ahead, we plan to focus on several key improvements. One of the primary areas of development will be hyperparameter tuning to optimize the model’s performance further. Additionally, we aim to train medium and large-sized models, which should improve transcription accuracy, particularly for more complex speech patterns. Expanding the training dataset is also a priority, as adding more diverse Hindi dialects, accents, and real-world speech data will help improve the model's robustness and generalization ability. These efforts will contribute to refining Whisper for Hindi and making it more effective in diverse ASR applications.

Acknowledgments

This project was made possible by the following resources: Shrutilipi by AI4Bharat, IIT Madras SpringLab, Mozilla Foundation’s Common Voice, Google Fleurs, the IndicNLP Library, and the WebDataset project.

Outlook

Whisper-Hindi represents a significant step forward in creating high-quality ASR models for Hindi. By leveraging Indic Normalization and optimizing the training pipeline with WebDataset, we’ve achieved a balance between accuracy and semantic integrity.

The use of WebDataset not only accelerated training but also enabled efficient utilization of resources for large-scale ASR tasks. Its sequential sampling approach ensures that hardware is utilized optimally, allowing us to handle massive datasets without compromising on training speed.

Future work may explore incorporating more diverse dialects and further refining normalization techniques to push the boundaries of Hindi ASR even further.

Try it for yourself from Hugging Face or GitHub and help us make ASR accessible to everyone!

 
