Augmen AI Labs / Vaak

Voice that feels
like your own loan officer

Dual-engine voice synthesis — a speed-optimized model for real-time conversations and a quality-optimized model for human-level expressiveness. Fine-tuned on 1,700+ hours of Indian speech data. Code-switching built in. 100% self-hosted.

Self-Hosted · 22 Languages · Code-Switching · Voice Cloning
22 Indian Languages · <80ms Speed Engine Latency · 4.8 Quality Engine MOS · ₹0 Marginal GPU Cost (with STT)

Dual Engine Architecture

Two models. One intelligent routing layer.

Vaak doesn't force a single model for every use case. We deploy two models in parallel — one optimized for speed, one for quality — routing each synthesis request to the engine that best fits the latency and expressiveness requirements of the moment.

Vaak Speed (Speed-Optimized)

End-to-end single-stage TTS built on an open-source VITS2 architecture, fine-tuned for Indian languages. Text goes in, waveform comes out — no intermediate mel spectrograms. Adversarial training with normalizing flows produces natural speech at extremely low latency.

- Text Encoder — Transformer + speaker conditioning
- Normalizing Flow — Transformer blocks for latent alignment
- Stochastic Duration Predictor — Adversarially trained
- HiFi-GAN Decoder — Direct waveform generation

Parameters: ~29M (single) / ~37M (multi-speaker)
Inference Latency: <80ms per utterance
Real-Time Factor: 0.06× (16× faster than real-time)
VRAM: ~1.5 GB
Sample Rate: 22.05 kHz
Best For: Real-time conversation, high throughput
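The real-time factor in the table is just synthesis time divided by audio duration. A quick sanity check in Python — the per-utterance latency comes from the table above, while the average utterance length is an assumed figure:

```python
# Sanity-check the Vaak Speed spec numbers.
synthesis_time_s = 0.08   # <80ms per utterance (from the spec table)
audio_length_s = 1.33     # assumed average utterance duration

rtf = synthesis_time_s / audio_length_s   # real-time factor
speedup = 1 / rtf                         # how much faster than real-time

print(f"RTF ≈ {rtf:.2f}×, ~{speedup:.0f}× faster than real-time")
```

A lower RTF means less GPU time per second of audio, which is what lets Vaak Speed share a GPU with other workloads.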

Vaak Quality (Quality-Optimized)

Built on the open-source StyleTTS2 architecture — style diffusion + adversarial training with large speech language models, fine-tuned for Indian voices. Achieves human-level naturalness (MOS 4.8) with fine-grained control over timbre, prosody, and emotional delivery.

- Text Encoder — PL-BERT phoneme-level representations
- Style Diffusion — Samples prosody/timbre from text context
- Duration + Pitch Predictor — Differentiable alignment
- iSTFT Decoder — SLM-adversarial trained waveform output

Parameters: ~30M core + SLM backbone
Inference Latency: ~200ms (5 diffusion steps)
Real-Time Factor: 0.15× (6× faster than real-time)
VRAM: ~2 GB
Sample Rate: 24 kHz
Best For: Voice cloning, emotional delivery, brand voice

What's proprietary in Vaak

Both engine architectures originate from open-source TTS research. Everything below is ours.

Fine-tuned voice models for 22 Indian languages
Multi-speaker models trained on 1,700+ hours of curated Indian speech — including dialects and accents that no public TTS model supports. These weights are Augmen's proprietary asset.

Code-switching pronunciation engine
English banking terms ("loan amount", "EMI", "interest rate") pronounced with a natural Indian accent — not jarring American/British TTS. Trained on real loan conversations.

Dual-engine smart routing via Sentia
Sentia decides per utterance: speed engine for quick conversational turns, quality engine for greetings, emotional delivery, or brand-voice moments. No other TTS product does this.

Voice cloning pipeline for Indian voices
Clone a specific voice persona from a 10-second reference clip. Fine-tuned to reproduce Indian speech patterns — prosody, rhythm, and tonal inflections that generic cloning fails on.

Can someone download the base open-source TTS models and try? Yes. Will they produce natural Hindi-English code-switched speech for a Marwari-accented borrower? No. That's Vaak.

The Vaak Pipeline

How Augmen trains and deploys Vaak

We fine-tune both TTS models on Indian language speech data, then deploy them behind a single API with intelligent routing via Sentia.

01

Curate Training Data

Source speech datasets from AI Kosh (Government of India) and AI4Bharat's IndicTTS corpus. Clean, segment, and normalize audio across 22 languages with balanced male/female speakers.

02

Fine-Tune Both Models

Train Vaak Speed for low-latency conversational paths and Vaak Quality for high-fidelity paths. Each model is fine-tuned per language with multi-speaker conditioning for voice cloning capability.

03

Deploy with Smart Routing

Sentia's context engine decides which TTS model to invoke — Vaak Speed for rapid conversational turns, Vaak Quality for emotional moments, greetings, or when brand voice fidelity matters most.
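The per-utterance routing decision can be sketched in a few lines. This is an illustrative simplification under assumed signal names (`is_greeting`, `emotional`, `latency_budget_ms` are hypothetical); Sentia's actual context engine is richer:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    is_greeting: bool = False       # brand-voice moment, e.g. call opening
    emotional: bool = False         # e.g. hardship discussion in collections
    latency_budget_ms: int = 150    # how long the caller can wait

def route(u: Utterance) -> str:
    """Pick a TTS engine per utterance (hypothetical routing signals)."""
    # Quality engine for greetings and emotional delivery, but only when
    # the latency budget can absorb its ~200ms synthesis time.
    if (u.is_greeting or u.emotional) and u.latency_budget_ms >= 200:
        return "vaak-quality"
    # Default: speed engine (<80ms) keeps conversational turns snappy.
    return "vaak-speed"

print(route(Utterance("Namaste! Main aapki loan assistant hoon.",
                      is_greeting=True, latency_budget_ms=400)))
print(route(Utterance("Haan, theek hai.")))
```

Note the latency guard: even a "quality moment" falls back to the speed engine when the conversation cannot tolerate the extra synthesis time.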

Training Data — AI Kosh & AI4Bharat IndicTTS

We leverage India's richest open speech corpora available through the Government of India's AI Kosh platform and AI4Bharat's IndicTTS project from IIT Madras. The combined corpus provides the scale and diversity needed to train production-grade TTS models that handle regional accents, intonation patterns, and code-mixed speech naturally.

1,700+ Hours of Speech · 22 Languages · 10K+ Speakers · 24 kHz Audio Quality

Sources: AI Kosh (IndiaAI) · AI4Bharat IndicTTS · IndicVoices · IISc SYSPIN · Custom Field Recordings
Code-Switching

English words. Indian accent.
Seamless in every sentence.

Real borrowers don't speak pure Hindi or pure Tamil. They mix in English words for banking terms, amounts, and document names. Vaak handles this natively — no awkward accent switches.

How code-switching sounds with Vaak

English terms are pronounced with the speaker's natural Indian accent, not jarring American/British TTS. The prosody stays continuous across language boundaries.

Hindi + English
"Aapka loan amount ₹2,50,000 approved ho gaya hai. Disbursement aapke bank account mein 3 working days mein ho jayega."
Tamil + English
"Ungaludaya Mudra loan application process aagiyirukkiradhu. KYC verification complete aana piraghu final approval kidaikkum."
Bengali + English
"Apnar credit score bhalo achhe. Interest rate hobe 8.5% per annum, ar EMI hobe protimashe ₹5,200."
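A front-end for code-mixed speech first has to know where the language switches happen. A simplified illustration (not Vaak's actual text front-end) is detecting Latin-script tokens embedded in Indic-script text, here assuming Devanagari input:

```python
import re

def latin_tokens(text: str) -> list[str]:
    """Return embedded Latin-script (English) tokens in code-mixed text."""
    return re.findall(r"[A-Za-z][A-Za-z.%]*", text)

sentence = "आपका loan amount ₹2,50,000 approved हो गया है।"
print(latin_tokens(sentence))   # ['loan', 'amount', 'approved']
```

Each detected span can then be routed to English grapheme-to-phoneme rules while keeping the surrounding prosody continuous — the behavior described above.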
Choose Your TTS Engine

Google Cloud TTS or Vaak self-hosted.
We support both.

Start with Google Cloud TTS for quick deployment. Switch to self-hosted Vaak when scale demands it — or run both with Sentia routing automatically.

Google Cloud TTS (WaveNet)

Pricing model: Per character
WaveNet / Neural2: ₹1,360/1M chars ($16)
Per 12-min call (~2,500 chars): ~₹3.40
Indian language voices: ~12 languages
Infrastructure needed: None

Vaak — Self-Hosted

Pricing model: Fixed per GPU/month
GPU cost (shared with STT): ~₹0 marginal
Per 12-min call: ~₹0 (fixed GPU cost)
Indian language voices: 22 languages (fine-tuned)
Data stays in your infra

Why Vaak's marginal cost is near zero

1. In a 12-min loan call, the agent speaks ~20 utterances, each ~125 characters.

2. Vaak Speed synthesizes each utterance in <80ms. Total GPU time per call: ~1.6 seconds.

3. GPU utilization per concurrent call: 0.2% (1.6s out of 720s).

4. The Vaak Speed model uses only ~1.5 GB VRAM. The L4 GPU running your STT (using ~5 GB) has 17+ GB free.

Result: Vaak runs on the same L4 GPU as your STT with no additional hardware. If you're already running Augmen STT, Vaak TTS adds zero marginal GPU cost.
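The arithmetic above, spelled out with the numbers from this section:

```python
# Per-call GPU math for Vaak Speed on a shared NVIDIA L4.
utterances_per_call = 20
synthesis_time_s = 0.08            # <80ms per utterance
call_duration_s = 12 * 60          # 12-minute loan call

gpu_time_per_call = utterances_per_call * synthesis_time_s   # ~1.6 s
utilization = gpu_time_per_call / call_duration_s            # ~0.2%

vram_l4_gb, vram_stt_gb, vram_vaak_gb = 24, 5, 1.5
headroom_gb = vram_l4_gb - vram_stt_gb - vram_vaak_gb        # VRAM still free

print(f"{gpu_time_per_call:.1f}s GPU/call, "
      f"{utilization:.2%} utilization, "
      f"{headroom_gb} GB VRAM headroom")
```

Both resources — GPU seconds and VRAM — are a rounding error next to the STT workload, which is why the marginal cost lands at ~₹0.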

Monthly savings vs Google Cloud TTS

Already running Augmen STT? Vaak TTS is free.

The same on-premise L4 that handles your speech-to-text also runs voice synthesis. No additional GPU. No per-character API fees. No data leaving your network.

10K calls/mo
Google TTS: ₹34,000
Vaak: ₹0 extra
100K calls/mo
Google TTS: ₹3,40,000
Vaak: ₹0 extra
1M calls/mo
Google TTS: ₹34,00,000
Vaak: ₹0 extra

Assumptions: ~2,500 chars TTS output per 12-min call. Google WaveNet at $16/1M chars = ₹1,360/1M chars (₹85/USD). Self-hosted Vaak runs on the same L4 GPU as Augmen STT — no additional GPU required (Vaak Speed model uses ~1.5 GB of 24 GB available).
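The savings table follows directly from the stated assumptions:

```python
# Reproduce the savings table (assumptions stated in the text above).
CHARS_PER_CALL = 2500                # ~2,500 TTS chars per 12-min call
GOOGLE_RATE_INR_PER_MCHAR = 1360     # ₹1,360 per 1M chars ($16 at ₹85/USD)

def google_tts_cost(calls_per_month: int) -> float:
    """Monthly Google Cloud TTS spend in ₹ (self-hosted Vaak adds ₹0)."""
    return calls_per_month * CHARS_PER_CALL * GOOGLE_RATE_INR_PER_MCHAR / 1_000_000

for calls in (10_000, 100_000, 1_000_000):
    print(f"{calls:>9,} calls/mo: Google TTS ₹{google_tts_cost(calls):,.0f} vs Vaak ₹0 extra")
```

At 10K calls/month that is ₹34,000, scaling linearly to ₹34,00,000 at 1M calls — matching the table above.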

Use Cases

Right model for the right moment

Real-Time Loan Conversations

→ Vaak Speed (low latency)

Sub-80ms latency keeps voice conversations flowing naturally. The borrower perceives no processing gap between their question and the agent's spoken response.

Brand Voice & Greetings

→ Vaak Quality (voice cloning)

Clone a specific voice persona for your bank's AI agent. Style diffusion reproduces timbre and prosody from just a 10-second reference clip.

Collections & Sensitive Calls

→ Vaak Quality (emotional control)

Tone matters in debt recovery. The quality engine's embedding scale controls emotional delivery — empathetic for hardship discussions, firm for payment commitments.

IVR & High-Throughput Queues

→ Vaak Speed (throughput)

The speed engine runs at 16× real-time on a single L4 GPU, and since the agent speaks for only a fraction of each call, one GPU serves hundreds of concurrent calls. Perfect for IVR flows and batch notifications.

Hear the difference

Schedule a live demo to hear Vaak in action across multiple languages, voices, and emotional tones.

Request Demo