Built on the Conformer architecture — the same family Google uses for on-device speech recognition. Fine-tuned from AI4Bharat's IndicConformer to cover all 22 scheduled Indian languages. Native streaming via the RNNT decoder. Under 100ms latency. 10× more efficient than encoder-decoder models.
Built on AI4Bharat's open-source IndicConformer (MIT license) — the same Conformer architecture family that powers Google's on-device speech recognition. Unlike encoder-decoder models that generate tokens one-by-one, the Conformer's CTC path produces transcriptions in a single forward pass, and the RNNT path enables true word-by-word streaming. The result: 10× faster inference and 3× less VRAM than autoregressive alternatives.
Audio enters as a 16 kHz mono signal. The Conformer encoder processes features through self-attention + convolution blocks (capturing both global and local speech patterns). The hybrid decoder offers two modes: CTC for maximum throughput, RNNT for real-time streaming.
Raw audio → 80-channel log-mel spectrogram
Voice activity detection filters non-speech
17 blocks × self-attention + convolution (dim 512)
CTC: single pass · RNNT: streaming tokens
Timestamped text + language ID + confidence
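Before any of this runs, voice activity detection decides which audio even reaches the GPU. The real pipeline uses Silero VAD; as a minimal sketch of the gating idea, here is a toy energy-based gate in pure Python — the 10 ms frame size and the threshold are illustrative assumptions, not production values:

```python
import math

FRAME = 160  # 10 ms at 16 kHz

def frame_energy(samples):
    """Split samples into 10 ms frames and return per-frame mean energy."""
    frames = [samples[i:i + FRAME] for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return [sum(x * x for x in f) / FRAME for f in frames]

def speech_frames(samples, threshold=0.01):
    """Indices of frames treated as speech (energy above threshold)."""
    return [i for i, e in enumerate(frame_energy(samples)) if e > threshold]

# 20 ms of silence followed by 20 ms of a loud 440 Hz tone
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(320)]
print(speech_frames(silence + tone))  # -> [2, 3]
```

Silero VAD replaces the energy threshold with a small neural model, which is what makes the gate robust to street noise rather than just silence.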
Why Conformer is faster: The CTC decoder produces the full transcript in one forward pass (no token-by-token generation). The RNNT decoder streams tokens as speech arrives — no need to buffer 30-second chunks. Both paths skip the autoregressive bottleneck that limits encoder-decoder models.
Conformer blocks combine transformer self-attention (global context) with depthwise convolutions (local patterns). This hybrid captures both long-range language structure and fine-grained phonetic detail — critical for tonal Indian languages.
CTC decodes the entire utterance in a single pass — no beam search loop, no token-by-token generation. Result: 10× faster inference than autoregressive decoders, with GPU utilization per call dropping from ~17% to ~3%.
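The CTC decoding rule itself is simple: take the per-frame argmax, collapse consecutive repeats, then drop blanks. A minimal greedy decoder — the token IDs below are made up for illustration, not the model's real vocabulary:

```python
from itertools import groupby

BLANK = 0  # conventional CTC blank index

def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, then remove blanks.

    frame_ids: one argmax token ID per encoder frame.
    Returns the decoded token ID sequence.
    """
    collapsed = [k for k, _ in groupby(frame_ids)]  # merge repeated labels
    return [t for t in collapsed if t != BLANK]     # strip blank tokens

# One frame of output per encoder step; blanks separate true repeats
print(ctc_greedy_decode([5, 5, 0, 2, 0, 7, 7, 0, 7, 9]))  # -> [5, 2, 7, 7, 9]
```

Because every frame is decoded independently in one pass, there is no loop over output tokens — which is exactly where the speedup over autoregressive decoders comes from.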
The RNNT decoder emits tokens as audio arrives — words appear on screen while the borrower is still speaking. Unlike chunked streaming, RNNT provides genuine real-time partial results with no buffering delay.
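The streaming contract this gives the application layer can be sketched as a generator that yields a growing partial transcript per chunk. The `transcribe` callable below stands in for the actual RNNT step and is an assumption of this sketch, not the real model API:

```python
def stream_partials(chunks, transcribe):
    """Yield a growing transcript as audio chunks arrive.

    `transcribe` maps one chunk of audio to its newly emitted tokens;
    here it can be any callable, so the real RNNT model is external
    to this sketch.
    """
    transcript = []
    for chunk in chunks:
        transcript.extend(transcribe(chunk))
        yield " ".join(transcript)  # partial result after each chunk

# Toy stand-in: each "chunk" is already a list of decoded words.
chunks = [["mujhe"], ["loan", "ka"], ["interest", "rate", "batao"]]
for partial in stream_partials(chunks, lambda c: c):
    print(partial)
# -> mujhe
# -> mujhe loan ka
# -> mujhe loan ka interest rate batao
```

The UI simply renders each yielded string, which is how words appear on screen while the borrower is still speaking.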
The base Conformer architecture and IndicConformer weights come from AI4Bharat (IIT Madras), released under the MIT license. Everything below is ours.
Can someone download IndicConformer from Hugging Face and try? Yes — it's MIT licensed. Will it understand a Bhojpuri borrower asking about Mudra loan EMI on a noisy street? No. That's the gap we fill.
Starting from AI4Bharat's IndicConformer — a Conformer-Large model (120M params per language, 600M multilingual) trained on Indian speech corpora — we fine-tune for fintech domain accuracy and production deployment.
Start with AI4Bharat's pre-trained Conformer-Large: 17 Conformer blocks, 512 model dimension, hybrid CTC-RNNT decoder. Already trained on 22 Indian languages via AI Kosh and IndicVoices datasets.
Fine-tune on real loan origination recordings — borrower conversations, agent prompts, banking terminology. Dialect-balanced sampling for Bhojpuri, Marwari, Chhattisgarhi, and regional accents underrepresented in the base model.
Export via NVIDIA NeMo to ONNX / TensorRT. FP16 precision on GPU. Layer fusion for Conformer blocks. Validate WER parity against PyTorch baseline — same accuracy, optimized throughput.
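The parity check in the export step boils down to computing word error rate on the same test set for both the PyTorch and TensorRT outputs. A minimal WER implementation via word-level edit distance — the transcript strings below are illustrative:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

pytorch_out  = "mujhe mudra loan ka interest rate batao"
tensorrt_out = "mujhe mudra loan ka interest rate batao"
assert wer(pytorch_out, tensorrt_out) == 0.0   # parity: identical transcripts

print(wer("mujhe loan chahiye", "mujhe loan"))  # -> 0.3333333333333333
```

Running this over the held-out set for both exports, and requiring the aggregate WERs to match within tolerance, is the "WER parity" gate described above.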
Integrate Silero VAD for voice activity detection. RNNT decoder for real-time word-by-word streaming. gRPC API behind Sentia's agent orchestration layer. Fully self-hosted in bank infrastructure.
IndicConformer was trained by AI4Bharat on India's richest open speech corpora. We further fine-tune on proprietary fintech conversation data — real loan calls recorded with borrower consent, covering banking terminology, regional accents, and noisy field conditions that no public dataset captures.
IndicConformer achieves consistent, low WER across high-resource and low-resource Indian languages — purpose-built for Indian speech, unlike multilingual models that spread capacity across 100+ languages.
| Language | WER (%) | Accuracy | Resource Level |
|---|---|---|---|
| Hindi देवनागरी | 8.2 | 91.8% | High |
| Bengali বাংলা | 9.1 | 90.9% | High |
| Tamil தமிழ் | 9.8 | 90.2% | High |
| Telugu తెలుగు | 10.4 | 89.6% | High |
| Marathi मराठी | 10.1 | 89.9% | High |
| Gujarati ગુજરાતી | 11.3 | 88.7% | High |
| Kannada ಕನ್ನಡ | 10.9 | 89.1% | High |
| Malayalam മലയാളം | 11.7 | 88.3% | High |
| Odia ଓଡ଼ିଆ | 12.4 | 87.6% | Medium |
| Punjabi ਪੰਜਾਬੀ | 11.5 | 88.5% | Medium |
| Assamese অসমীয়া | 13.2 | 86.8% | Medium |
| Urdu اردو | 10.6 | 89.4% | High |
| Nepali नेपाली | 12.8 | 87.2% | Medium |
| Maithili मैथिली | 15.4 | 84.6% | Low |
| Konkani कोंकणी | 16.1 | 83.9% | Low |
| Sindhi سنڌي | 16.8 | 83.2% | Low |
| Dogri डोगरी | 17.5 | 82.5% | Low |
| Kashmiri कॉशुर | 18.2 | 81.8% | Low |
| Manipuri মৈতৈলোন্ | 17.0 | 83.0% | Low |
| Bodo बड़ो | 19.1 | 80.9% | Low |
| Santali ᱥᱟᱱᱛᱟᱲᱤ | 19.8 | 80.2% | Low |
| Sanskrit संस्कृतम् | 16.5 | 83.5% | Low |
WER measured on held-out test sets from AI4Bharat IndicVoices and Common Voice. Code-switched utterances included. Low-resource language WER improves with each fine-tuning iteration as more field data is collected. Fintech domain fine-tuning further improves Hindi and Bengali WER by 1–2 percentage points on banking vocabulary.
Start with Google Cloud STT for zero-infra simplicity. Graduate to Augmen's self-hosted STT when volume justifies it. Or run both — Sentia routes calls to the right engine automatically.
The Conformer architecture is dramatically more efficient than encoder-decoder models. A single NVIDIA L4 that previously handled 4–5 concurrent calls now serves ~20 — enough for most mid-size banks on a single card.
1. Silero VAD detects speech and sends 2–5 second audio chunks to the GPU.
2. Each chunk takes ~80ms to transcribe on L4 (RNNT streaming mode).
3. Between chunks, the GPU is idle — waiting for the next speech segment.
4. GPU utilization per live call: ~3% (80ms active per 3s of audio).
Result: One L4 comfortably serves ~20 simultaneous real-time conversations — a 4× improvement over encoder-decoder alternatives.
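The concurrency math above can be reproduced directly. Note the gap between the ~20-call figure and the theoretical ceiling: the unused headroom absorbs bursts where several calls' chunks land on the GPU at once (reading that gap as burst headroom is this sketch's interpretation, not a quoted spec):

```python
CHUNK_SECONDS = 3.0   # audio per VAD chunk
GPU_SECONDS = 0.080   # transcription time per chunk on L4 (RNNT streaming)

per_call_util = GPU_SECONDS / CHUNK_SECONDS       # GPU busy fraction per live call
theoretical_max = int(1 / per_call_util)          # ceiling if the GPU were 100% busy
practical_calls = 20                              # capacity claimed above

print(f"per-call utilization: {per_call_util:.1%}")                    # -> 2.7%
print(f"theoretical ceiling: {theoretical_max}")                       # -> 37
print(f"GPU busy at 20 calls: {practical_calls * per_call_util:.0%}")  # -> 53%
```

At ~53% average load, the GPU keeps real-time latency even when chunks from multiple calls arrive simultaneously.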
Per L4 GPU, RNNT streaming
10-hr day, 12-min avg call
22 business days, single L4
22 GB free for TTS + LLM
Capacity based on 12-min average call, RNNT streaming, ~80ms per 3s chunk. CTC batch mode is even faster (~30ms/chunk). Real-world capacity varies with the speech-to-silence ratio. VRAM headroom means STT, TTS, and routing all run on the same GPU.
With ~20,000 calls/month capacity per L4, most banks need just one GPU. The break-even vs Google STT happens at ~1,225 calls/month — after that, every call is essentially free.
L4 GPU pricing per AceCloud/Neysa India market data (Oct 2025). Power at ₹10/kWh commercial. Banks with existing data centers may see lower marginal costs. One L4 now handles STT + TTS + routing — no need for separate GPU per workload.
Assumptions: 12-min avg call. Google STT V2 Chirp at $0.016/min = ₹1.36/min (₹85/USD). Conformer uses ~2 GB VRAM + ~3% GPU per concurrent call. Vaak TTS runs on the same L4 at ₹0 marginal cost (uses only ~1.5 GB VRAM, 0.2% GPU per call).
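The break-even figure follows from these assumptions. Working backwards from the stated ~1,225 calls/month also recovers the implied monthly L4 cost — that back-derived figure is an inference of this sketch, not a quoted price:

```python
GOOGLE_RATE_INR_PER_MIN = 1.36   # $0.016/min at ₹85/USD
AVG_CALL_MIN = 12
BREAK_EVEN_CALLS = 1225          # figure stated above

per_call_cost = GOOGLE_RATE_INR_PER_MIN * AVG_CALL_MIN
implied_monthly_gpu_cost = per_call_cost * BREAK_EVEN_CALLS

print(f"Google STT per call: ₹{per_call_cost:.2f}")                   # -> ₹16.32
print(f"implied L4 monthly cost: ₹{implied_monthly_gpu_cost:,.0f}")   # -> ₹19,992
```

Past ~1,225 calls the fixed GPU cost is fully amortized, which is why each additional call's marginal STT cost is near zero.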
Each L4 adds ~20 concurrent call capacity. No re-architecture needed. Sentia load-balances across GPUs automatically.
The Conformer architecture that powers our server-side STT is the same family Google uses for on-device speech recognition. At 120M parameters, our Hindi model can be quantized to ~130 MB — small enough to run on a ₹8,000 Android phone. We're building toward fully offline, phone-native STT for field agents and rural borrowers.
IndicConformer on NVIDIA L4 GPUs. 22 languages, <100ms latency, ~20 concurrent streams. Live with a bank client processing real loan origination calls. Borrower calls a number — server handles everything.
ONNX Runtime export with INT8 quantization for CPU-only deployment. For banks that don't have or want GPU infrastructure. Target: real-time Hindi+English STT on a standard Xeon server — no GPU required.
Hindi+English Conformer model exported to ONNX/TFLite, quantized to INT8 (~130 MB). Target hardware: budget Android phones (Snapdragon 4xx/6xx, 3–4 GB RAM). Offline field agent app for villages with no connectivity — voice-to-text form filling, pre-screening chatbot.
Today, our STT works via phone calls — borrower dials a number, server handles speech recognition. No app needed. That's perfect for the 85% of India that uses feature phones or basic smartphones.
But field agents visiting villages face a different problem: no reliable internet. An on-device STT model lets them record loan conversations offline, transcribe locally, and sync when connectivity returns. The Conformer architecture makes this possible — a 120M model at ~130 MB runs comfortably on a mid-range phone. A 1.55B encoder-decoder model never could.
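The ~130 MB figure is straightforward parameter arithmetic. A quick sanity check across precisions (the small gap between raw INT8 weights and the on-disk figure is tokenizer and runtime metadata, per the numbers quoted above):

```python
PARAMS = 120_000_000  # Conformer-Large, per-language model
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in BYTES_PER_PARAM.items():
    mb = PARAMS * nbytes / 1e6
    print(f"{dtype}: ~{mb:.0f} MB")
# -> fp32: ~480 MB
# -> fp16: ~240 MB
# -> int8: ~120 MB
# int8 weights plus tokenizer/metadata land near the ~130 MB quoted above.
```

The same arithmetic on a 1.55B-parameter encoder-decoder model gives ~1.55 GB even at INT8, which is why that class of model cannot fit a budget phone.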
Optimized for noisy environments — marketplaces, fields, busy roads. Silero VAD + noise-robust fine-tuning handles real field conditions, not just studio recordings.
Handles mid-sentence language switching naturally. "Mujhe loan ka interest rate batao" — Hindi structure, English terms. Transcribed accurately without language-split hacks.
Trained on regional dialects, not just standard broadcast speech. Bhojpuri Hindi, Marwari, Konkan Marathi — fine-tuned on real speakers from these regions.
Words appear on screen as the borrower speaks — true real-time streaming via the RNNT decoder. No 30-second chunk buffering. Under 100ms latency on L4 GPU.
Every word gets precise start/end timestamps. Powers borrower intent highlighting, compliance audit trails, and conversation analytics for loan QA.
Entire pipeline runs in your infrastructure. No audio leaves your network. No third-party API calls. Critical for RBI data localization and DPDP Act compliance.