Augmen AI Labs / Speech-to-Text

Every Indian language.
Every dialect. Every accent.

Built on the Conformer architecture — the same family Google uses for on-device speech recognition. Fine-tuned from AI4Bharat's IndicConformer for all 22 scheduled Indian languages. Native streaming via the RNNT decoder. Under 100ms latency. 10× faster inference than encoder-decoder models.

Self-Hosted · 22 Languages · Native Streaming · Conformer + RNNT
22

Scheduled Languages

120M

Parameters (per language)

<100ms

Streaming Latency (L4)

RNNT

Native Streaming Decoder

Hindi · Bengali · Telugu · Marathi · Tamil · Urdu · Gujarati · Kannada · Malayalam · Odia · Punjabi · Assamese · Maithili · Santali · Kashmiri · Nepali · Sindhi · Konkani · Dogri · Manipuri · Bodo · Sanskrit
Architecture

Augmen STT — Conformer + Hybrid CTC-RNNT

Built on AI4Bharat's open-source IndicConformer (MIT license) — the same Conformer architecture family that powers Google's on-device speech recognition. Unlike encoder-decoder models that generate tokens one-by-one, the Conformer's CTC path produces transcriptions in a single forward pass, and the RNNT path enables true word-by-word streaming. The result: 10× faster inference and 3× less VRAM than autoregressive alternatives.

Inference Pipeline

Audio enters as 16 kHz mono signal. The Conformer encoder processes features through self-attention + convolution blocks (capturing both global and local speech patterns). The hybrid decoder offers two modes: CTC for maximum throughput, RNNT for real-time streaming.

A
Audio Input

Raw audio → 80-channel log-mel spectrogram

V
Silero VAD

Voice activity detection filters non-speech

C
Conformer Encoder

17 blocks × self-attention + convolution (dim 512)

D
CTC / RNNT Decoder

CTC: single pass · RNNT: streaming tokens

T
Transcript

Timestamped text + language ID + confidence
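The pipeline's first step, raw audio to an 80-channel log-mel spectrogram, is standard signal processing. A minimal NumPy sketch with a 512-point FFT and 10 ms hop (illustrative parameters, not necessarily the production configuration):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=80):
    """80-channel log-mel features from 16 kHz mono audio (10 ms hop)."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum per frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # Log-compress the mel-band energies
    return np.log(power @ fbank.T + 1e-10)

feats = log_mel_spectrogram(np.random.randn(16000))  # 1 second of audio
```

One second of 16 kHz audio yields 97 frames of 80 features each; these frames are what the Conformer encoder consumes.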

Specification: Augmen STT (IndicConformer) vs. previous gen (Whisper-class)

Architecture: Conformer + hybrid CTC-RNNT (vs. encoder-decoder transformer)
Parameters: 120M per language / 600M multilingual (vs. 1.55B shared across all languages)
Decoder: non-autoregressive CTC or streaming RNNT (vs. autoregressive, token-by-token)
Streaming latency (L4): <100ms per chunk (vs. ~500ms per chunk)
VRAM usage: ~2 GB at 120M / ~4 GB at 600M (vs. ~5 GB)
Concurrent calls (1 L4): ~20 streams in RNNT mode (vs. ~4–5 streams)
Mobile-deployable: yes, 120M quantizes to ~130 MB ONNX INT8 (vs. no, 1.55B is too large)

Why Conformer is faster: The CTC decoder produces the full transcript in one forward pass (no token-by-token generation). The RNNT decoder streams tokens as speech arrives — no need to buffer 30-second chunks. Both paths skip the autoregressive bottleneck that limits encoder-decoder models.
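The "single forward pass" claim comes down to CTC's decoding rule: take the per-frame argmax over the vocabulary, collapse consecutive repeats, drop blanks. No generation loop. A minimal sketch (token IDs are illustrative):

```python
def ctc_collapse(frame_ids, blank=0):
    """Greedy CTC decode: collapse repeats, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        # Emit only when the token changes and is not the blank symbol;
        # a blank between two identical tokens separates real repetitions.
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

# Per-frame argmax IDs for a short utterance (illustrative)
assert ctc_collapse([0, 5, 5, 0, 7, 7, 7, 0, 0, 9]) == [5, 7, 9]
```

The entire transcript falls out of one batched argmax over all frames, which is why GPU time per call is so low compared to token-by-token generation.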

Why Conformer over Encoder-Decoder

Self-Attention + Convolution

Conformer blocks combine transformer self-attention (global context) with depthwise convolutions (local patterns). This hybrid captures both long-range language structure and fine-grained phonetic detail — critical for tonal Indian languages.

Non-Autoregressive Decoding

CTC decodes the entire utterance in a single pass — no beam search loop, no token-by-token generation. Result: 10× faster inference than autoregressive decoders, with GPU utilization per call dropping from ~17% to ~3%.

True Streaming (RNNT)

The RNNT decoder emits tokens as audio arrives — words appear on screen while the borrower is still speaking. Unlike chunked streaming, RNNT provides genuine real-time partial results with no buffering delay.
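The emit-until-blank control flow behind RNNT streaming can be sketched with a toy, hypothetical joint function (the real model uses learned prediction and joint networks; the lookup table and tokens here are made up purely to show the loop):

```python
BLANK = "<b>"

def rnnt_greedy(frames, joint):
    """Greedy RNNT decode: per encoder frame, emit tokens until blank."""
    hyp, last = [], BLANK
    for f in frames:
        while True:
            tok = joint(f, last)   # joint(frame, prev_token) -> next token
            if tok == BLANK:
                break              # blank: advance to the next audio frame
            hyp.append(tok)        # non-blank: emit immediately (streamed)
            last = tok
    return hyp

# Hypothetical joint: frame 0 yields "na", frame 1 yields "ma" then "ste"
table = {(0, BLANK): "na", (0, "na"): BLANK,
         (1, "na"): "ma", (1, "ma"): "ste", (1, "ste"): BLANK}
joint = lambda f, last: table[(f, last)]
assert rnnt_greedy([0, 1], joint) == ["na", "ma", "ste"]
```

Because tokens are emitted inside the per-frame loop, partial transcripts surface as each audio chunk arrives rather than after a 30-second buffer.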

What's proprietary in Augmen STT

The base Conformer architecture and IndicConformer weights come from AI4Bharat (IIT Madras), released under the MIT license. Everything below is ours.

Domain-adapted model weights for fintech
IndicConformer fine-tuned on real loan origination conversations — banking terminology, account numbers, EMI amounts, borrower accents from Jharkhand, UP, Bihar. These weights don't exist in any public model.
Code-switching accuracy for banking
Hindi-English, Tamil-English, Bengali-English mid-sentence switching — trained on real loan conversations where borrowers say "mera loan amount kitna hai" and "EMI schedule bhejo".
Noise-robust inference pipeline
Silero VAD + custom noise filtering for field conditions — marketplaces, roads, construction sites. Rural borrowers don't call from quiet offices.
Sentia orchestration + production stack
NeMo runtime, gRPC streaming API, hybrid CTC/RNNT routing, and Sentia's context-aware agent layer — battle-tested in production with a live bank deployment.

Can someone download IndicConformer from HuggingFace and try? Yes — it's MIT licensed. Will it understand a Bhojpuri borrower asking about Mudra loan EMI on a noisy street? No. That's the gap we fill.

Training Pipeline

How we build Augmen STT

Starting from AI4Bharat's IndicConformer — a Conformer-Large model (120M params per language, 600M multilingual) trained on Indian speech corpora — we fine-tune for fintech domain accuracy and production deployment.

01

Base: IndicConformer

Start with AI4Bharat's pre-trained Conformer-Large: 17 Conformer blocks, 512 model dimension, hybrid CTC-RNNT decoder. Already trained on 22 Indian languages via AI Kosh and IndicVoices datasets.

02

Fine-Tune for Fintech

Fine-tune on real loan origination recordings — borrower conversations, agent prompts, banking terminology. Dialect-balanced sampling for Bhojpuri, Marwari, Chhattisgarhi, and regional accents underrepresented in the base model.

03

Optimize for Inference

Export via NVIDIA NeMo to ONNX / TensorRT. FP16 precision on GPU. Layer fusion for Conformer blocks. Validate WER parity against PyTorch baseline — same accuracy, optimized throughput.

04

Deploy with Streaming

Integrate Silero VAD for voice activity detection. RNNT decoder for real-time word-by-word streaming. gRPC API behind Sentia's agent orchestration layer. Fully self-hosted in bank infrastructure.

Training Data — AI Kosh & AI4Bharat IndicVoices

IndicConformer was trained by AI4Bharat on India's richest open speech corpora. We further fine-tune on proprietary fintech conversation data — real loan calls recorded with borrower consent, covering banking terminology, regional accents, and noisy field conditions that no public dataset captures.

1,700+
Hours (Base Training)
22
Languages
10K+
Speakers
+Custom
Fintech Conversations
AI Kosh (IndiaAI) · AI4Bharat IndicVoices · IISc Syspin · IndicSpeech (IIIT-H) · Augmen Field Recordings
Accuracy

Word Error Rate across 22 languages

IndicConformer achieves consistent, low WER across high-resource and low-resource Indian languages — purpose-built for Indian speech, unlike multilingual models that spread capacity across 100+ languages.

Language · WER (%) · Accuracy · Resource level
Hindi देवनागरी · 8.2 · 91.8% · High
Bengali বাংলা · 9.1 · 90.9% · High
Tamil தமிழ் · 9.8 · 90.2% · High
Telugu తెలుగు · 10.4 · 89.6% · High
Marathi मराठी · 10.1 · 89.9% · High
Gujarati ગુજરાતી · 11.3 · 88.7% · High
Kannada ಕನ್ನಡ · 10.9 · 89.1% · High
Malayalam മലയാളം · 11.7 · 88.3% · High
Odia ଓଡ଼ିଆ · 12.4 · 87.6% · Medium
Punjabi ਪੰਜਾਬੀ · 11.5 · 88.5% · Medium
Assamese অসমীয়া · 13.2 · 86.8% · Medium
Urdu اردو · 10.6 · 89.4% · High
Nepali नेपाली · 12.8 · 87.2% · Medium
Maithili मैथिली · 15.4 · 84.6% · Low
Konkani कोंकणी · 16.1 · 83.9% · Low
Sindhi سنڌي · 16.8 · 83.2% · Low
Dogri डोगरी · 17.5 · 82.5% · Low
Kashmiri कॉशुर · 18.2 · 81.8% · Low
Manipuri মৈতৈলোন্ · 17.0 · 83.0% · Low
Bodo बड़ो · 19.1 · 80.9% · Low
Santali ᱥᱟᱱᱛᱟᱲᱤ · 19.8 · 80.2% · Low
Sanskrit संस्कृतम् · 16.5 · 83.5% · Low
Excellent (<10% WER)
Good (10–15% WER)
Functional (15–20% WER)

WER measured on held-out test sets from AI4Bharat IndicVoices and Common Voice, with code-switched utterances included. Low-resource language WER improves with each fine-tuning iteration as more field data is collected; fintech domain fine-tuning further reduces Hindi and Bengali WER by 1–2 points on banking vocabulary.
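For reference, WER is the word-level edit distance between reference and hypothesis divided by the reference length; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substituted word in a 5-word reference → 20% WER
assert wer("mera loan amount kitna hai", "mera loan amount kitni hai") == 0.2
```

An 8.2% Hindi WER therefore means roughly one word error per twelve reference words.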

Choose Your STT Engine

We integrate both. You choose what fits.

Start with Google Cloud STT for zero-infra simplicity. Graduate to Augmen's self-hosted STT when volume justifies it. Or run both — Sentia routes calls to the right engine automatically.

Google Cloud STT V2 (Chirp)

Pricing model: Pay per minute
Cost: ₹1.36/min ($0.016)
Infrastructure needed: None
Indian language support: ~15 languages
Data residency: Google Cloud servers
Best for: Pilots, low-volume NBFCs, quick deployment

Augmen STT — Self-Hosted

Pricing model: Fixed per GPU/month
On-premise TCO: ~₹20,000/mo per L4
Capacity per L4: ~20 concurrent / ~20K calls/mo
Indian language support: 22 languages (fine-tuned)
Data residency: Your infrastructure
Best for: 1,200+ calls/month, data-sensitive banks, regional languages
Capacity

What one L4 GPU can handle

The Conformer architecture is dramatically more efficient than encoder-decoder models. A single NVIDIA L4 that previously handled 4–5 concurrent calls now serves ~20 — enough for most mid-size banks on a single card.

How concurrency works with Conformer

1. Silero VAD detects speech and sends 2–5 second audio chunks to the GPU.

2. Each chunk takes ~80ms to transcribe on L4 (RNNT streaming mode).

3. Between chunks, the GPU is idle — waiting for the next speech segment.

4. GPU utilization per live call: ~3% (80ms active per 3s of audio).

Result: One L4 comfortably serves ~20 simultaneous real-time conversations — a 4× improvement over encoder-decoder alternatives.
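The arithmetic behind these capacity numbers, as a quick sanity check (the ~20-stream figure is deliberately conservative relative to the theoretical utilization ceiling):

```python
# GPU utilization per live call: ~80 ms of compute per ~3 s of audio
chunk_compute_s = 0.080
chunk_audio_s = 3.0
util_per_call = chunk_compute_s / chunk_audio_s
assert round(util_per_call, 3) == 0.027       # ~3% GPU per call

# ~20 concurrent streams is conservative (1 / 0.027 ≈ 37 theoretical max,
# leaving headroom for bursts and non-STT work on the same card)
concurrent = 20

# Daily and monthly throughput at a 12-minute average call
day_minutes = 10 * 60                          # 10-hour business day
call_minutes = 12
calls_per_day = concurrent * day_minutes // call_minutes
calls_per_month = calls_per_day * 22           # 22 business days

assert calls_per_day == 1000
assert calls_per_month == 22000                # ≈ the quoted ~20,000/month
```

The figures above round this down to ~20,000/month to leave margin for uneven call arrival.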

~20
Concurrent Calls

Per L4 GPU, RNNT streaming

~1,000
Calls / Business Day

10-hr day, 12-min avg call

~20,000
Calls / Month

22 business days, single L4

~2 GB
VRAM per Model

22 GB free for TTS + LLM

Capacity based on 12-min average call, RNNT streaming, ~80ms per 3s chunk. CTC batch mode is even faster (~30ms/chunk). Real-world varies by speech-to-silence ratio. VRAM headroom means STT, TTS, and routing all run on the same GPU.

Cost Economics

One GPU. Most banks covered.

With ~20,000 calls/month of capacity per L4, most banks need just one GPU. Break-even against Google STT comes at ~1,225 calls/month; beyond that, each additional call carries near-zero marginal cost.
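The break-even point follows directly from the per-minute pricing and the fixed GPU TCO (rates taken from the assumptions stated below):

```python
google_per_min = 1.36      # ₹/min: Google STT V2 Chirp, $0.016 at ₹85/USD
avg_call_min = 12          # average loan-origination call length
l4_tco_month = 20000       # ₹/month for one self-hosted L4 (see TCO breakdown)

cost_per_call = google_per_min * avg_call_min
break_even_calls = l4_tco_month / cost_per_call

assert round(cost_per_call, 2) == 16.32
assert round(break_even_calls) == 1225        # ~1,225 calls/month

# Google-side cost at the quoted volume tiers
for calls, expected in [(1000, 16320), (5000, 81600),
                        (10000, 163200), (50000, 816000)]:
    assert round(calls * cost_per_call) == expected
```

Past ~1,225 calls in a month, the self-hosted GPU is already paid for, so every further call is compute you own.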

1,000 calls/mo
Google: ₹16,320
Augmen: ₹20,000
→ Google still cheaper
5,000 calls/mo
Google: ₹81,600
Augmen: ₹20,000
Save ₹61,600/mo
10,000 calls/mo
Google: ₹1,63,200
Augmen: ₹20,000 (1 GPU)
Save ₹1.43L/mo
50,000 calls/mo
Google: ₹8,16,000
Augmen: ₹60,000 (3 GPUs)
Save ₹7.56L/mo

On-Premise TCO Breakdown (per L4 GPU)

NVIDIA L4 GPU (24 GB, Ada Lovelace): ~₹2,50,000
Inference server (PCIe 4.0, 32 GB RAM): ~₹3,00,000
Total hardware CapEx: ₹5,50,000
3-year amortization: ₹15,280/mo
Power (server ~200W × 24/7 × ₹10/kWh): ₹1,460/mo
Maintenance & cooling allocation: ₹3,260/mo
Monthly TCO: ~₹20,000/mo

L4 GPU pricing per AceCloud/Neysa India market data (Oct 2025). Power at ₹10/kWh commercial. Banks with existing data centers may see lower marginal costs. One L4 now handles STT + TTS + routing — no need for separate GPU per workload.

Assumptions: 12-min avg call. Google STT V2 Chirp at $0.016/min = ₹1.36/min (₹85/USD). Conformer uses ~2 GB VRAM + ~3% GPU per concurrent call. Vaak TTS runs on the same L4 at ₹0 marginal cost (uses only ~1.5 GB VRAM, 0.2% GPU per call).

Scale linearly — add GPUs as you grow

Each L4 adds ~20 concurrent call capacity. No re-architecture needed. Sentia load-balances across GPUs automatically.

1

Small NBFC / Regional Bank

1 L4 GPU
~20 concurrent calls
Up to ~20,000 calls/month
₹20K/mo on-premise
2–3

Large Bank

2–3 L4 GPUs
40–60 concurrent calls
Up to ~60,000 calls/month
₹40–60K/mo on-premise
5+

Enterprise / BPO

5+ L4 GPUs
100+ concurrent calls
100K+ calls/month
₹1L+/mo on-premise
Roadmap

From server to phone

The Conformer architecture that powers our server-side STT is the same family Google uses for on-device speech recognition. At 120M parameters, our Hindi model can be quantized to ~130 MB — small enough to run on a ₹8,000 Android phone. We're building toward fully offline, phone-native STT for field agents and rural borrowers.

Live in Production

Server-Side STT

IndicConformer on NVIDIA L4 GPUs. 22 languages, <100ms latency, ~20 concurrent streams. Live with a bank client processing real loan origination calls. Borrower calls a number — server handles everything.

Building Now

CPU-Optimized Server

ONNX Runtime export with INT8 quantization for CPU-only deployment. For banks that don't have or want GPU infrastructure. Target: real-time Hindi+English STT on a standard Xeon server — no GPU required.

Planned

On-Device Mobile STT

Hindi+English Conformer model exported to ONNX/TFLite, quantized to INT8 (~130 MB). Target hardware: budget Android phones (Snapdragon 4xx/6xx, 3–4 GB RAM). Offline field agent app for villages with no connectivity — voice-to-text form filling, pre-screening chatbot.

Why mobile STT matters for Indian fintech

Today, our STT works via phone calls — borrower dials a number, server handles speech recognition. No app needed. That's perfect for the 85% of India that uses feature phones or basic smartphones.

But field agents visiting villages face a different problem: no reliable internet. An on-device STT model lets them record loan conversations offline, transcribe locally, and sync when connectivity returns. The Conformer architecture makes this possible — a 120M model at ~130 MB runs comfortably on a mid-range phone. A 1.55B encoder-decoder model never could.
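The size claim is back-of-envelope arithmetic worth making explicit (illustrative only; the exact on-disk size depends on which layers are kept at higher precision):

```python
params = 120_000_000            # Conformer-Large, single-language model
int8_mb = params * 1 / 1e6      # 1 byte per weight after INT8 quantization
assert int8_mb == 120.0         # ~120 MB of weights; ~130 MB shipped once
                                # embeddings/metadata stay at higher precision

big = 1_550_000_000             # 1.55B encoder-decoder, for comparison
fp16_gb = big * 2 / 1e9         # even halved to FP16...
assert fp16_gb == 3.1           # ...~3 GB of weights alone: not phone-sized
```

That order-of-magnitude gap, 130 MB versus multiple gigabytes, is what separates "runs on a budget Android phone" from "server-only".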

Built for India

Why our STT is different

Rural-First

Optimized for noisy environments — marketplaces, fields, busy roads. Silero VAD + noise-robust fine-tuning handles real field conditions, not just studio recordings.

Code-Switching

Handles mid-sentence language switching naturally. "Mujhe loan ka interest rate batao" — Hindi structure, English terms. Transcribed accurately without language-split hacks.

Dialect Awareness

Trained on regional dialects, not just standard broadcast speech. Bhojpuri Hindi, Marwari, Konkan Marathi — fine-tuned on real speakers from these regions.

Native Streaming (RNNT)

Words appear on screen as the borrower speaks — true real-time streaming via the RNNT decoder. No 30-second chunk buffering. Under 100ms latency on L4 GPU.

Word-Level Timestamps

Every word gets precise start/end timestamps. Powers borrower intent highlighting, compliance audit trails, and conversation analytics for loan QA.

Full Data Sovereignty

Entire pipeline runs in your infrastructure. No audio leaves your network. No third-party API calls. Critical for RBI data localization and DPDP Act compliance.

Test our STT accuracy yourself

Try live transcription in any Indian language during a demo session.

Book a Demo