Built on the Conformer architecture — the same family Google uses for on-device speech recognition. Fine-tuned from AI4Bharat's IndicConformer to cover all 22 scheduled Indian languages. Native streaming via the RNNT decoder. Under 100ms latency. 10× more efficient than encoder-decoder models.
Built on AI4Bharat's open-source IndicConformer (MIT license) — the same Conformer architecture family that powers Google's on-device speech recognition. Unlike encoder-decoder models that generate tokens one-by-one, the Conformer's CTC path produces transcriptions in a single forward pass, and the RNNT path enables true word-by-word streaming. The result: 10× faster inference and 3× less VRAM than autoregressive alternatives.
Audio enters as a 16 kHz mono signal. The Conformer encoder processes features through self-attention + convolution blocks (capturing both global and local speech patterns). The hybrid decoder offers two modes: CTC for maximum throughput, RNNT for real-time streaming.
Raw audio → 80-channel log-mel spectrogram
Voice activity detection filters non-speech
17 blocks × self-attention + convolution (dim 512)
CTC: single pass · RNNT: streaming tokens
Timestamped text + language ID + confidence
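Before any of this runs, voice activity detection decides which audio even reaches the GPU. The real pipeline uses Silero VAD; as a minimal sketch of the gating idea, here is a toy energy-based gate in pure Python — the 10 ms frame size and the threshold are illustrative assumptions, not production values:

```python
import math

FRAME = 160  # 10 ms at 16 kHz

def frame_energy(samples):
    """Split samples into 10 ms frames and return per-frame mean energy."""
    frames = [samples[i:i + FRAME] for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return [sum(x * x for x in f) / FRAME for f in frames]

def speech_frames(samples, threshold=0.01):
    """Indices of frames treated as speech (energy above threshold)."""
    return [i for i, e in enumerate(frame_energy(samples)) if e > threshold]

# 20 ms of silence followed by 20 ms of a loud 440 Hz tone
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(320)]
print(speech_frames(silence + tone))  # -> [2, 3]
```

Silero VAD replaces the energy threshold with a small neural model, which is what makes the gate robust to street noise rather than just silence.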
Why Conformer is faster: The CTC decoder produces the full transcript in one forward pass (no token-by-token generation). The RNNT decoder streams tokens as speech arrives — no need to buffer 30-second chunks. Both paths skip the autoregressive bottleneck that limits encoder-decoder models.
Conformer blocks combine transformer self-attention (global context) with depthwise convolutions (local patterns). This hybrid captures both long-range language structure and fine-grained phonetic detail — critical for tonal Indian languages.
CTC decodes the entire utterance in a single pass — no beam search loop, no token-by-token generation. Result: 10× faster inference than autoregressive decoders, with GPU utilization per call dropping from ~17% to ~3%.
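The CTC decoding rule itself is simple: take the per-frame argmax, collapse consecutive repeats, then drop blanks. A minimal greedy decoder — the token IDs below are made up for illustration, not the model's real vocabulary:

```python
from itertools import groupby

BLANK = 0  # conventional CTC blank index

def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, then remove blanks.

    frame_ids: one argmax token ID per encoder frame.
    Returns the decoded token ID sequence.
    """
    collapsed = [k for k, _ in groupby(frame_ids)]  # merge repeated labels
    return [t for t in collapsed if t != BLANK]     # strip blank tokens

# One frame of output per encoder step; blanks separate true repeats
print(ctc_greedy_decode([5, 5, 0, 2, 0, 7, 7, 0, 7, 9]))  # -> [5, 2, 7, 7, 9]
```

Because every frame is decoded independently in one pass, there is no loop over output tokens — which is exactly where the speedup over autoregressive decoders comes from.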
The RNNT decoder emits tokens as audio arrives — words appear on screen while the borrower is still speaking. Unlike chunked streaming, RNNT provides genuine real-time partial results with no buffering delay.
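The streaming contract this gives the application layer can be sketched as a generator that yields a growing partial transcript per chunk. The `transcribe` callable below stands in for the actual RNNT step and is an assumption of this sketch, not the real model API:

```python
def stream_partials(chunks, transcribe):
    """Yield a growing transcript as audio chunks arrive.

    `transcribe` maps one chunk of audio to its newly emitted tokens;
    here it can be any callable, so the real RNNT model is external
    to this sketch.
    """
    transcript = []
    for chunk in chunks:
        transcript.extend(transcribe(chunk))
        yield " ".join(transcript)  # partial result after each chunk

# Toy stand-in: each "chunk" is already a list of decoded words.
chunks = [["mujhe"], ["loan", "ka"], ["interest", "rate", "batao"]]
for partial in stream_partials(chunks, lambda c: c):
    print(partial)
# -> mujhe
# -> mujhe loan ka
# -> mujhe loan ka interest rate batao
```

The UI simply renders each yielded string, which is how words appear on screen while the borrower is still speaking.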
The base Conformer architecture and IndicConformer weights come from AI4Bharat (IIT Madras), released under the MIT license. Everything below is ours.
Can someone download IndicConformer from Hugging Face and try? Yes — it's MIT licensed. Will it understand a Bhojpuri borrower asking about Mudra loan EMI on a noisy street? No. That's the gap we fill.
Starting from AI4Bharat's IndicConformer — a Conformer-Large model (120M params per language, 600M multilingual) trained on Indian speech corpora — we fine-tune for fintech domain accuracy and production deployment.
Start with AI4Bharat's pre-trained Conformer-Large: 17 Conformer blocks, 512 model dimension, hybrid CTC-RNNT decoder. Already trained on 22 Indian languages via AI Kosh and IndicVoices datasets.
Fine-tune on real loan origination recordings — borrower conversations, agent prompts, banking terminology. Dialect-balanced sampling for Bhojpuri, Marwari, Chhattisgarhi, and regional accents underrepresented in the base model.
Export via NVIDIA NeMo to ONNX / TensorRT. FP16 precision on GPU. Layer fusion for Conformer blocks. Validate WER parity against PyTorch baseline — same accuracy, optimized throughput.
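The parity check in the export step boils down to computing word error rate on the same test set for both the PyTorch and TensorRT outputs. A minimal WER implementation via word-level edit distance — the transcript strings below are illustrative:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

pytorch_out  = "mujhe mudra loan ka interest rate batao"
tensorrt_out = "mujhe mudra loan ka interest rate batao"
assert wer(pytorch_out, tensorrt_out) == 0.0   # parity: identical transcripts

print(wer("mujhe loan chahiye", "mujhe loan"))  # -> 0.3333333333333333
```

Running this over the held-out set for both exports, and requiring the aggregate WERs to match within tolerance, is the "WER parity" gate described above.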
Integrate Silero VAD for voice activity detection. RNNT decoder for real-time word-by-word streaming. gRPC API behind Sentia's agent orchestration layer. Fully self-hosted in bank infrastructure.
IndicConformer was trained by AI4Bharat on India's richest open speech corpora. We further fine-tune on proprietary fintech conversation data — real loan calls recorded with borrower consent, covering banking terminology, regional accents, and noisy field conditions that no public dataset captures.
IndicConformer achieves consistent, low WER across high-resource and low-resource Indian languages — purpose-built for Indian speech, unlike multilingual models that spread capacity across 100+ languages.
| Language | WER (%) | Accuracy | Resource Level |
|---|---|---|---|
| Hindi देवनागरी | 8.2 | 91.8% | High |
| Bengali বাংলা | 9.1 | 90.9% | High |
| Tamil தமிழ் | 9.8 | 90.2% | High |
| Telugu తెలుగు | 10.4 | 89.6% | High |
| Marathi मराठी | 10.1 | 89.9% | High |
| Gujarati ગુજરાતી | 11.3 | 88.7% | High |
| Kannada ಕನ್ನಡ | 10.9 | 89.1% | High |
| Malayalam മലയാളം | 11.7 | 88.3% | High |
| Odia ଓଡ଼ିଆ | 12.4 | 87.6% | Medium |
| Punjabi ਪੰਜਾਬੀ | 11.5 | 88.5% | Medium |
| Assamese অসমীয়া | 13.2 | 86.8% | Medium |
| Urdu اردو | 10.6 | 89.4% | High |
| Nepali नेपाली | 12.8 | 87.2% | Medium |
| Maithili मैथिली | 15.4 | 84.6% | Low |
| Konkani कोंकणी | 16.1 | 83.9% | Low |
| Sindhi سنڌي | 16.8 | 83.2% | Low |
| Dogri डोगरी | 17.5 | 82.5% | Low |
| Kashmiri कॉशुर | 18.2 | 81.8% | Low |
| Manipuri মৈতৈলোন্ | 17.0 | 83.0% | Low |
| Bodo बड़ो | 19.1 | 80.9% | Low |
| Santali ᱥᱟᱱᱛᱟᱲᱤ | 19.8 | 80.2% | Low |
| Sanskrit संस्कृतम् | 16.5 | 83.5% | Low |
WER measured on held-out test sets from AI4Bharat IndicVoices and Common Voice. Code-switched utterances included. Low-resource language WER improves with each fine-tuning iteration as more field data is collected. Fintech domain fine-tuning further improves Hindi and Bengali WER by 1–2 percentage points on banking vocabulary.
Start with Google Cloud STT for zero-infra simplicity. Graduate to Augmen's self-hosted STT when volume justifies it. Or run both — Sentia routes calls to the right engine automatically.
The Conformer architecture is dramatically more efficient than encoder-decoder models. A single NVIDIA L4 that previously handled 4–5 concurrent calls now serves ~20 — enough for most mid-size banks on a single card.
1. Silero VAD detects speech and sends 2–5 second audio chunks to the GPU.
2. Each chunk takes ~80ms to transcribe on L4 (RNNT streaming mode).
3. Between chunks, the GPU is idle — waiting for the next speech segment.
4. GPU utilization per live call: ~3% (80ms active per 3s of audio).
Result: One L4 comfortably serves ~20 simultaneous real-time conversations — a 4× improvement over encoder-decoder alternatives.
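The concurrency math above can be reproduced directly. Note the gap between the ~20-call figure and the theoretical ceiling: the unused headroom absorbs bursts where several calls' chunks land on the GPU at once (reading that gap as burst headroom is this sketch's interpretation, not a quoted spec):

```python
CHUNK_SECONDS = 3.0   # audio per VAD chunk
GPU_SECONDS = 0.080   # transcription time per chunk on L4 (RNNT streaming)

per_call_util = GPU_SECONDS / CHUNK_SECONDS       # GPU busy fraction per live call
theoretical_max = int(1 / per_call_util)          # ceiling if the GPU were 100% busy
practical_calls = 20                              # capacity claimed above

print(f"per-call utilization: {per_call_util:.1%}")                    # -> 2.7%
print(f"theoretical ceiling: {theoretical_max}")                       # -> 37
print(f"GPU busy at 20 calls: {practical_calls * per_call_util:.0%}")  # -> 53%
```

At ~53% average load, the GPU keeps real-time latency even when chunks from multiple calls arrive simultaneously.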
Per L4 GPU, RNNT streaming
10-hr day, 12-min avg call
22 business days, single L4
22 GB free for TTS + LLM
Capacity based on 12-min average call, RNNT streaming, ~80ms per 3s chunk. CTC batch mode is even faster (~30ms/chunk). Real-world capacity varies with the speech-to-silence ratio. VRAM headroom means STT, TTS, and routing all run on the same GPU.
With ~20,000 calls/month capacity per L4, most banks need just one GPU. The break-even vs Google STT happens at ~1,225 calls/month — after that, every call is essentially free.
L4 GPU pricing per AceCloud/Neysa India market data (Oct 2025). Power at ₹10/kWh commercial. Banks with existing data centers may see lower marginal costs. One L4 now handles STT + TTS + routing — no need for separate GPU per workload.
Assumptions: 12-min avg call. Google STT V2 Chirp at $0.016/min = ₹1.36/min (₹85/USD). Conformer uses ~2 GB VRAM + ~3% GPU per concurrent call. Vaak TTS runs on the same L4 at ₹0 marginal cost (uses only ~1.5 GB VRAM, 0.2% GPU per call).
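The break-even figure follows from these assumptions. Working backwards from the stated ~1,225 calls/month also recovers the implied monthly L4 cost — that back-derived figure is an inference of this sketch, not a quoted price:

```python
GOOGLE_RATE_INR_PER_MIN = 1.36   # $0.016/min at ₹85/USD
AVG_CALL_MIN = 12
BREAK_EVEN_CALLS = 1225          # figure stated above

per_call_cost = GOOGLE_RATE_INR_PER_MIN * AVG_CALL_MIN
implied_monthly_gpu_cost = per_call_cost * BREAK_EVEN_CALLS

print(f"Google STT per call: ₹{per_call_cost:.2f}")                   # -> ₹16.32
print(f"implied L4 monthly cost: ₹{implied_monthly_gpu_cost:,.0f}")   # -> ₹19,992
```

Past ~1,225 calls the fixed GPU cost is fully amortized, which is why each additional call's marginal STT cost is near zero.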
Each L4 adds ~20 concurrent call capacity. No re-architecture needed. Sentia load-balances across GPUs automatically.
The Conformer architecture that powers our server-side STT is the same family Google uses for on-device speech recognition. At 120M parameters, our Hindi model can be quantized to ~130 MB — small enough to run on a ₹8,000 Android phone. We're building toward fully offline, phone-native STT for field agents and rural borrowers.
IndicConformer on NVIDIA L4 GPUs. 22 languages, <100ms latency, ~20 concurrent streams. Live with a bank client processing real loan origination calls. Borrower calls a number — server handles everything.
ONNX Runtime export with INT8 quantization for CPU-only deployment. For banks that don't have or want GPU infrastructure. Target: real-time Hindi+English STT on a standard Xeon server — no GPU required.
Hindi+English Conformer model exported to ONNX/TFLite, quantized to INT8 (~130 MB). Target hardware: budget Android phones (Snapdragon 4xx/6xx, 3–4 GB RAM). Offline field agent app for villages with no connectivity — voice-to-text form filling, pre-screening chatbot.
Today, our STT works via phone calls — borrower dials a number, server handles speech recognition. No app needed. That's perfect for the 85% of India that uses feature phones or basic smartphones.
But field agents visiting villages face a different problem: no reliable internet. An on-device STT model lets them record loan conversations offline, transcribe locally, and sync when connectivity returns. The Conformer architecture makes this possible — a 120M model at ~130 MB runs comfortably on a mid-range phone. A 1.55B encoder-decoder model never could.
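The ~130 MB figure is straightforward parameter arithmetic. A quick sanity check across precisions (the small gap between raw INT8 weights and the on-disk figure is tokenizer and runtime metadata, per the numbers quoted above):

```python
PARAMS = 120_000_000  # Conformer-Large, per-language model
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in BYTES_PER_PARAM.items():
    mb = PARAMS * nbytes / 1e6
    print(f"{dtype}: ~{mb:.0f} MB")
# -> fp32: ~480 MB
# -> fp16: ~240 MB
# -> int8: ~120 MB
# int8 weights plus tokenizer/metadata land near the ~130 MB quoted above.
```

The same arithmetic on a 1.55B-parameter encoder-decoder model gives ~1.55 GB even at INT8, which is why that class of model cannot fit a budget phone.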
Optimized for noisy environments — marketplaces, fields, busy roads. Silero VAD + noise-robust fine-tuning handles real field conditions, not just studio recordings.
Handles mid-sentence language switching naturally. "Mujhe loan ka interest rate batao" — Hindi structure, English terms. Transcribed accurately without language-split hacks.
Trained on regional dialects, not just standard broadcast speech. Bhojpuri Hindi, Marwari, Konkan Marathi — fine-tuned on real speakers from these regions.
Words appear on screen as the borrower speaks — true real-time streaming via the RNNT decoder. No 30-second chunk buffering. Under 100ms latency on L4 GPU.
Every word gets precise start/end timestamps. Powers borrower intent highlighting, compliance audit trails, and conversation analytics for loan QA.
Entire pipeline runs in your infrastructure. No audio leaves your network. No third-party API calls. Critical for RBI data localization and DPDP Act compliance.