Vineet Suryan
February 19, 2025
Automatic speech recognition has advanced significantly in recent years, but languages like Hindi come with unique challenges. Hindi's rich use of diacritics and complex characters means that accurate transcription requires specialized models. We’re excited to present Whisper for Hindi, a fine-tuned version of OpenAI’s Whisper designed specifically for Hindi Automatic Speech Recognition (ASR). With 2,500 hours of Hindi speech data and techniques like Indic Normalization, this model sets a new benchmark for Hindi ASR.
In Hindi, the meaning of a sentence can drastically change if diacritics or characters are removed, making normalization a critical part of the pipeline. Consider this example:
हमने उस उम्मीदवार को चुना।
Whisper applies aggressive text normalization, often stripping diacritics and compressing words. Here's how the same sentence looks after Whisper normalization:
हमन उस उमम दव र क चन
The removal of diacritics and loss of word boundaries result in text that is difficult to interpret and often meaningless.
Instead of Whisper's default normalization, we employed Indic Normalization from the IndicNLP Library, which retains diacritics and complex characters, producing more linguistically accurate transcriptions:
हमने उस उम्मीदवार को चुना।
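As a rough illustration, the snippet below runs the example sentence through Whisper's default `BasicTextNormalizer` (from the openai-whisper package) and through the IndicNLP library's Hindi normalizer. This is a minimal sketch; the exact normalizer settings used in our pipeline may differ.

```python
# Minimal sketch comparing the two normalization strategies on the example
# sentence above. Normalizer settings are defaults, not necessarily the exact
# configuration used in our training pipeline.
from whisper.normalizers import BasicTextNormalizer                     # openai-whisper
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory   # indic-nlp-library

sentence = "हमने उस उम्मीदवार को चुना।"

# Whisper's language-agnostic normalizer replaces combining marks (such as
# Devanagari matras) with spaces, which damages the text.
whisper_norm = BasicTextNormalizer()
print(whisper_norm(sentence))            # e.g. "हमन उस उमम दव र क चन"

# IndicNLP's Hindi normalizer canonicalizes the text while keeping diacritics
# and conjunct characters intact.
indic_norm = IndicNormalizerFactory().get_normalizer("hi")
print(indic_norm.normalize(sentence))    # "हमने उस उम्मीदवार को चुना।"
```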
While Whisper's default normalization might reduce the Word Error Rate (WER) on numeric benchmarks, it sacrifices semantic accuracy. For Hindi, maintaining diacritics and preserving complex characters is vital for transcription quality, even if it slightly increases the WER. This trade-off ensures that the transcriptions are meaningful and contextually accurate.
Together, the training datasets, 2,500 hours of Hindi speech in total, provide diverse accents, speaking styles, and content, making Whisper-Hindi robust for real-world ASR applications.
Training on large datasets using Hugging Face’s default dataloaders can be slow and inefficient, especially when working with thousands of hours of speech data. By converting our preprocessed datasets into WebDataset shards, we achieved up to a 5x speed-up in training.
WebDataset enables sequential sampling and better utilization of hardware, minimizing data-loading bottlenecks. With its ability to support distributed training and efficient data streaming, WebDataset proved instrumental in fine-tuning Whisper-Hindi on large-scale datasets. This speed-up not only saved time but also allowed us to experiment and iterate faster.
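To make this concrete, here is a minimal sketch of how preprocessed (audio, transcript) pairs could be packed into WebDataset shards and then streamed sequentially at training time. The shard paths, field keys, and shard size below are illustrative assumptions, not the exact values we used.

```python
# Sketch: pack preprocessed samples into WebDataset shards, then stream them
# sequentially during training. Paths, field keys, and maxcount are examples.
import webdataset as wds

def write_shards(samples, pattern="shards/hindi-%06d.tar", maxcount=1000):
    """`samples` yields (key, flac_bytes, transcript_str) tuples."""
    with wds.ShardWriter(pattern, maxcount=maxcount) as sink:
        for key, flac_bytes, transcript in samples:
            sink.write({
                "__key__": key,       # unique sample id
                "flac": flac_bytes,   # raw audio bytes
                "txt": transcript,    # target transcription
            })

# At training time, shards are read sequentially, avoiding the random seeks
# that make per-example loading slow on large datasets.
train_data = (
    wds.WebDataset("shards/hindi-{000000..000099}.tar")
    .decode(wds.torch_audio)          # decode "flac" entries to tensors
    .to_tuple("flac", "txt")
)
```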
Our training setup was optimized for efficiency and performance. Leveraging a single RTX 4090, we fine-tuned the models over 3 epochs using a linear learning rate scheduler and the default AdamW 8-bit optimizer with a weight decay of 0.05. Despite using lower maximum learning rates than those in the Whisper paper, our approach effectively balances computational efficiency with robust performance.
| Model Size | Our Max LR | Paper Max LR |
|------------|------------|--------------|
| Tiny       | 3.75e-4    | 1.5e-3       |
| Base       | 3.5e-4     | 1e-3         |
| Small      | 1.75e-4    | 5e-4         |
We opted for lower maximum learning rates than those in the Whisper paper to ensure gentle adaptation of the pre-trained weights during fine-tuning—this helps prevent the model from overshooting the optimum or getting stuck in suboptimal local minima.
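The sketch below shows roughly how this configuration maps onto Hugging Face's `Seq2SeqTrainingArguments` for the tiny model. The epoch count, scheduler, optimizer, weight decay, and maximum learning rate come from the values stated above; the batch size, warmup, precision, and output path are placeholder assumptions.

```python
# Sketch of the fine-tuning configuration for the tiny model. Values marked as
# assumptions are illustrative placeholders, not our exact settings.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-hindi",   # assumption
    num_train_epochs=3,
    learning_rate=3.75e-4,               # max LR for the tiny model (see table)
    lr_scheduler_type="linear",
    optim="adamw_bnb_8bit",              # 8-bit AdamW via bitsandbytes
    weight_decay=0.05,
    per_device_train_batch_size=32,      # assumption
    warmup_steps=500,                     # assumption
    fp16=True,                           # assumption (single RTX 4090)
    predict_with_generate=True,
    report_to=["wandb"],                 # training curves logged to W&B
)
```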
All our training plots are publicly available on Weights & Biases. You can follow our progress and inspect the training curves through this Weights & Biases Dashboard.
Baseline Word Error Rate (WER %) on Google FLEURS:

| Model Size | Whisper Norm | Indic Norm |
|------------|--------------|------------|
| Tiny       | 172.60       | 196.57     |
| Base       | 149.17       | 160.58     |
| Small      | 67.37        | 89.73      |
Fine-Tuned Whisper WER (%):
| Model Size | Whisper Norm | Indic Norm |
|------------|--------------|------------|
| Tiny       | 14.21        | 22.15      |
| Base       | 11.78        | 19.44      |
| Small      | 10.11        | 17.35      |
Fig. 1 - WER comparison by model size & normalization type.
While the Whisper Normalization variant achieves a lower WER, it does so at the expense of semantic integrity. In contrast, Indic Normalization, although resulting in a slightly higher WER, preserves critical diacritics and complex characters, ensuring that transcriptions remain both linguistically and contextually accurate. This deliberate trade-off underlines the practical benefits of our approach for real-world Hindi ASR applications.
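For reference, the evaluation can be reproduced roughly as sketched below, computing WER under both normalization schemes with the `evaluate` library. The `predictions` and `references` lists here are hypothetical stand-ins for decoded hypotheses and ground-truth transcripts.

```python
# Sketch: compute WER under both normalization schemes. `predictions` and
# `references` are hypothetical placeholders for model outputs and references.
import evaluate
from whisper.normalizers import BasicTextNormalizer
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

wer_metric = evaluate.load("wer")
whisper_norm = BasicTextNormalizer()
indic_norm = IndicNormalizerFactory().get_normalizer("hi")

predictions = ["हमने उस उम्मीदवार को चुना।"]   # hypothetical model outputs
references  = ["हमने उस उम्मीदवार को चुना।"]   # hypothetical ground truth

def wer(preds, refs, normalize):
    return 100 * wer_metric.compute(
        predictions=[normalize(p) for p in preds],
        references=[normalize(r) for r in refs],
    )

# Lower numbers under Whisper normalization come at the cost of stripped
# diacritics; the Indic-normalized score reflects the text we actually keep.
print("Whisper norm WER:", wer(predictions, references, whisper_norm))
print("Indic norm WER:  ", wer(predictions, references, indic_norm.normalize))
```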
The GitHub repository includes scripts for preprocessing, WebDataset conversion, and training. Feel free to explore and contribute.
Looking ahead, we plan to focus on several key improvements. One of the primary areas of development will be hyperparameter tuning to optimize the model’s performance further. Additionally, we aim to train medium and large-sized models, which should improve transcription accuracy, particularly for more complex speech patterns. Expanding the training dataset is also a priority, as adding more diverse Hindi dialects, accents, and real-world speech data will help improve the model's robustness and generalization ability. These efforts will contribute to refining Whisper for Hindi and making it more effective in diverse ASR applications.
This project was made possible by a number of open source resources and datasets.
Whisper-Hindi represents a significant step forward in creating high-quality ASR models for Hindi. By leveraging Indic Normalization and optimizing the training pipeline with WebDataset, we’ve achieved a balance between accuracy and semantic integrity.
The use of WebDataset not only accelerated training but also enabled efficient utilization of resources for large-scale ASR tasks. Its sequential sampling approach ensures that hardware is utilized optimally, allowing us to handle massive datasets without compromising on training speed.
Future work may explore incorporating more diverse dialects and further refining normalization techniques to push the boundaries of Hindi ASR even further.
Try it for yourself from Hugging Face or GitHub and help us make ASR accessible to everyone!
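If you just want a quick test, the snippet below shows how a fine-tuned checkpoint could be loaded with the `transformers` pipeline. The model id and audio path are placeholders; substitute the actual Whisper-Hindi checkpoint from the Hugging Face Hub.

```python
# Quick inference sketch. The model id and audio file below are placeholders,
# not the actual published checkpoint name.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-org/whisper-small-hindi",  # placeholder model id
)
print(asr("sample_hindi_audio.wav")["text"])
```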