Building a Hindi and Hinglish Voice Bot: The Technical Reality
Building a Hindi and Hinglish Voice Bot: The Technical Reality
Building a voice bot that handles Hindi and Hinglish well is harder than building an English voice bot, and harder than building a monolingual Hindi bot. The difficulty is not any single component — it is that every layer of the stack (STT, LLM, TTS) makes different assumptions about language boundaries, and Hinglish violates all of them. This post covers the specific technical problems and what actually works in production as of 2026.
Why Is Hinglish Harder Than Hindi or English Separately?
Hinglish is not a dialect — it is code-switching, the real-time alternation between Hindi and English within a single utterance. A typical Hinglish sentence might be: "Mujhe is product ke baare mein thoda more information chahiye." The STT model must handle Devanagari phonology in one token and English phonology in the next, often without acoustic cues that a language switch is occurring.
Standard Whisper (OpenAI's open-source STT model) handles this poorly. In a 2024 benchmark by AI4Bharat on Indian call center audio, Whisper large-v3 achieved a word error rate of 22.4% on Hinglish compared to 8.1% on clean American English. The errors cluster around code-switch boundaries — the model tends to "commit" to one language and misrecognize words from the other.
Fine-tuned models trained specifically on Hinglish code-switching data perform substantially better. AI4Bharat's IndicWhisper and Sarvam AI's Saarika model achieve WERs of 7–10% on Hinglish when trained on Indian call center corpora. The practical lesson is that using a generic multilingual STT model for Hinglish will produce unacceptable error rates regardless of model size.
How Do You Prompt an LLM to Respond Correctly in Hinglish?
The STT problem is well-defined. The LLM prompting problem is subtler. When an LLM receives a Hinglish transcript, it must decide what language to respond in. Without explicit instruction, most LLMs default to English if the transcript contains significant English content, or to formal Hindi if the transcript is mostly Devanagari. Neither behavior is what a Hinglish caller expects.
The correct approach is explicit language instruction in the system prompt, combined with few-shot examples of correct Hinglish responses. A system prompt that says "Respond in Hinglish — natural mixing of Hindi and English as spoken in urban India, using Devanagari script for Hindi words and Latin script for English words" produces significantly better code-switching behavior than a prompt that says "Respond in Hindi."
Script mixing in output is also important: Hinglish speakers expect responses in Roman script or mixed script, not formal Devanagari. An LLM that responds entirely in Devanagari to a Hinglish caller will be perceived as stiff and unnatural, even if the content is correct. GPT-4o and Claude 3.5 Sonnet both handle Hinglish output well with appropriate prompting; smaller models (sub-7B parameters) struggle with consistent code-switching.
What TTS Models Work for Hindi and Hinglish?
Text-to-speech for Hindi and Hinglish is the most immature layer of the stack. As of mid-2026, the realistic options are: Sarvam AI's Bulbul (Indian languages, good prosody, Hindi-first), Google Cloud TTS (Indian English and Hindi available, naturalness is acceptable but robotic compared to ElevenLabs English), ElevenLabs (no Hindi support), and Azure Neural TTS (Hindi available, quality is mixed on Hinglish).
The core problem with Hinglish TTS is pronunciation of English words in a Hindi-accent context. When a TTS model encounters "please share your aadhaar number," the word "aadhaar" should be pronounced with Hindi phonology, not anglicized. Models trained on American English data will mispronounce common Indian proper nouns, product names, and transliterated terms.
Sarvam's Bulbul handles this better than generic multilingual TTS because it is trained on Indian audio and understands the pronunciation conventions of Indian English. The tradeoff is that Bulbul's prosody in English-heavy sentences is less natural than in Hindi-dominant sentences.
A practical mitigation is SSML phoneme overrides for frequently mispronounced terms — maintaining a list of Indian proper nouns, brand names, and technical terms with explicit pronunciation guides in the TTS call.
How Do You Handle Regional Accent Variation in Hindi?
Hindi itself has significant regional variation. A caller from Bihar speaks Bhojpuri-influenced Hindi; a caller from Rajasthan speaks Rajasthani-influenced Hindi; a caller from Delhi speaks the urban Hinglish baseline. STT models that are fine-tuned only on Delhi/NCR speech will have elevated error rates on regional accents.
In production deployments targeting pan-India audiences, the realistic approach is to accept that WER will vary by region and design the conversation flow to gracefully handle recognition failures. This means: asking callers to repeat key information (order IDs, phone numbers, amounts) rather than relying on a single recognition event; using confirmation steps ("You said your order number is X — is that correct?") for high-stakes data; and having a fallback path to a human agent when the ASR confidence score falls below a threshold.
AI4Bharat's 2025 accent robustness benchmark found that models trained on diverse Indian regional data maintained WER under 12% across 15 accent categories. Models trained only on urban Indian speech showed WER spikes of 25–35% on rural and strongly regional-accented speech.
Conclusion
A production Hindi and Hinglish voice bot requires purpose-built choices at every layer: fine-tuned STT trained on Indian code-switching data, explicit Hinglish prompting for LLM response generation, Indian-trained TTS with pronunciation overrides, and conversation design that accounts for recognition uncertainty. Using generic multilingual models at each layer and expecting them to "figure it out" produces a system that works in demos and fails under real call center conditions.