Dual-engine voice synthesis — a speed-optimized model for real-time conversations, a quality-optimized model for human-level expressiveness. Fine-tuned on 1,700+ hours of Indian speech data. Code-switching built in. 100% self-hosted.
Vaak doesn't force a single model for every use case. We deploy two models in parallel — one optimized for speed, one for quality — routing each synthesis request to the engine that best fits the latency and expressiveness requirements of the moment.
End-to-end single-stage TTS built on an open-source VITS2 architecture, fine-tuned for Indian languages. Text goes in, waveform comes out — no intermediate mel spectrograms. Adversarial training with normalizing flows produces natural speech at extremely low latency.
Built on the open-source StyleTTS2 architecture — style diffusion + adversarial training with large speech language models, fine-tuned for Indian voices. Achieves human-level naturalness (MOS 4.8) with fine-grained control over timbre, prosody, and emotional delivery.
Both engine architectures originate from open-source TTS research. Everything below is ours.
Can someone download the base open-source TTS models and try? Yes. Will they produce natural Hindi-English code-switched speech for a Marwari-accented borrower? No. That's Vaak.
We fine-tune both TTS models on Indian language speech data, then deploy them behind a single API with intelligent routing via Sentia.
Source speech datasets from AI Kosh (Government of India) and AI4Bharat's IndicTTS corpus. Clean, segment, and normalize audio across 22 languages with balanced male/female speakers.
Train Vaak Speed for low-latency conversational paths and Vaak Quality for high-fidelity paths. Each model is fine-tuned per language with multi-speaker conditioning for voice cloning capability.
Sentia's context engine decides which TTS model to invoke — Vaak Speed for rapid conversational turns, Vaak Quality for emotional moments, greetings, or when brand voice fidelity matters most.
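The routing decision above can be sketched as a simple policy. This is an illustrative sketch only — the function and engine names are hypothetical, not the actual Sentia API — but it captures the rule: quality for high-fidelity moments, speed for tight conversational turns.

```python
# Hypothetical sketch of Sentia-style engine routing.
# Names ("vaak-speed", "vaak-quality", pick_engine) are illustrative,
# not the real Vaak/Sentia interface.

def pick_engine(turn_type: str, latency_budget_ms: int) -> str:
    """Route a synthesis request to the engine that fits the moment."""
    # High-fidelity moments: greetings, emotional turns, brand-critical speech.
    if turn_type in {"greeting", "empathy", "brand"}:
        return "vaak-quality"
    # Rapid conversational turns need the low-latency path.
    if latency_budget_ms < 150:
        return "vaak-speed"
    # Otherwise, default to the more expressive engine.
    return "vaak-quality"

print(pick_engine("greeting", 500))  # → vaak-quality
print(pick_engine("answer", 100))    # → vaak-speed
```

Because both engines sit behind a single API, the caller never changes; only the routing policy does.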
We leverage India's richest open speech corpora available through the Government of India's AI Kosh platform and AI4Bharat's IndicTTS project from IIT Madras. The combined corpus provides the scale and diversity needed to train production-grade TTS models that handle regional accents, intonation patterns, and code-mixed speech naturally.
Real borrowers don't speak pure Hindi or pure Tamil. They mix in English words for banking terms, amounts, and document names. Vaak handles this natively — no awkward accent switches.
English terms are pronounced with the speaker's natural Indian accent, not a jarring switch to an American or British TTS voice. The prosody stays continuous across language boundaries.
Start with Google Cloud TTS for quick deployment. Switch to self-hosted Vaak when scale demands it — or run both with Sentia routing automatically.
1. In a 12-min loan call, the agent speaks ~20 utterances, each ~125 characters.
2. Vaak Speed synthesizes each utterance in <80ms. Total GPU time per call: ~1.6 seconds.
3. GPU utilization per concurrent call: 0.2% (1.6s out of 720s).
4. The Vaak Speed model uses only ~1.5 GB VRAM. The L4 GPU running your STT (using ~5 GB) has 17+ GB free.
Result: Vaak runs on the same L4 GPU as your STT with no additional hardware. If you're already running Augmen STT, Vaak TTS adds zero marginal GPU cost.
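The four steps above reduce to a few lines of arithmetic, reproduced here so the budget can be checked directly (all inputs are the figures stated on this page):

```python
# Worked check of the per-call GPU budget above.
utterances_per_call = 20
synth_ms_per_utterance = 80          # Vaak Speed upper bound per utterance
call_seconds = 12 * 60               # 12-minute loan call = 720 s

gpu_seconds_per_call = utterances_per_call * synth_ms_per_utterance / 1000
utilization = gpu_seconds_per_call / call_seconds

vram_total_gb = 24.0                 # NVIDIA L4
vram_stt_gb = 5.0                    # STT model already resident
vram_tts_gb = 1.5                    # Vaak Speed model
vram_free_gb = vram_total_gb - vram_stt_gb - vram_tts_gb

print(f"{gpu_seconds_per_call:.1f}s GPU/call, "
      f"{utilization:.1%} utilization, "
      f"{vram_free_gb:.1f} GB VRAM headroom")
# → 1.6s GPU/call, 0.2% utilization, 17.5 GB VRAM headroom
```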
The same on-premise L4 that handles your speech-to-text also runs voice synthesis. No additional GPU. No per-character API fees. No data leaving your network.
Assumptions: ~2,500 chars TTS output per 12-min call. Google WaveNet at $16/1M chars = ₹1,360/1M chars (₹85/USD). Self-hosted Vaak runs on the same L4 GPU as Augmen STT — no additional GPU required (Vaak Speed model uses ~1.5 GB of 24 GB available).
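Under those assumptions, the per-call cloud cost works out as follows (a back-of-envelope check, nothing more):

```python
# Per-call cost comparison using the stated assumptions.
chars_per_call = 2500                     # ~2,500 chars TTS output per 12-min call
usd_per_million_chars = 16                # Google WaveNet list price
inr_per_usd = 85
wavenet_inr_per_million = usd_per_million_chars * inr_per_usd  # ₹1,360

wavenet_cost_per_call = chars_per_call / 1_000_000 * wavenet_inr_per_million
self_hosted_marginal = 0.0                # shares the L4 GPU already running STT

print(f"WaveNet: ₹{wavenet_cost_per_call:.2f}/call "
      f"vs self-hosted: ₹{self_hosted_marginal:.2f}/call")
# → WaveNet: ₹3.40/call vs self-hosted: ₹0.00/call
```

At scale the difference compounds: a million calls at ~₹3.40 each is ₹34 lakh in API fees that the self-hosted path avoids.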
Sub-80ms latency keeps voice conversations flowing naturally. The borrower perceives no processing gap between their question and the agent's spoken response.
Clone a specific voice persona for your bank's AI agent. Style diffusion reproduces timbre and prosody from just a 10-second reference clip.
Tone matters in debt recovery. The quality engine's embedding scale controls emotional delivery — empathetic for hardship discussions, firm for payment commitments.
The speed engine runs at 16× real-time on a single L4 GPU. One GPU handles thousands of concurrent synthesis requests — perfect for IVR flows and batch notifications.