Domain-specific SLMs fine-tuned with QLoRA for Indian financial services, grounded through Agentic RAG with corrective retrieval, and continuously improved via DPO preference optimization from real borrower conversations. Self-hosted. No external API calls.
Instead of one large model doing everything, Augmen uses specialized small models orchestrated by Sentia. Each agent handles what it does best — retrieval, reasoning, generation, validation — and the context engine routes only what's needed to each agent.
We continuously adopt the best techniques from the research frontier. Each client deployment uses the most effective combination of fine-tuning, retrieval, and alignment methods available at the time of delivery.
Fine-tune models with 75% less memory by quantizing base weights to 4-bit NormalFloat while training low-rank adapters in FP16. A 7B model fits on a single 24 GB GPU. We use QLoRA via LLaMA-Factory with BitsAndBytes for all SLM fine-tuning — adapting models like Llama 3, Qwen 2.5, and Gemma to Indian financial services in hours, not weeks.
Beyond naive RAG: our retrieval pipeline uses Sentia as an agentic orchestrator that dynamically decides when to retrieve, validates the relevance of retrieved documents, re-queries when context is insufficient, and combines hybrid search (BM25 + vector embeddings) with cross-encoder reranking. Corrective RAG keeps every answer grounded in bank policy documents, sharply reducing hallucinations.
Instead of complex RLHF with separate reward models, we use DPO to directly optimize the model on preference pairs from real loan conversations. Simpler, more stable, fewer hyperparameters — and achieves comparable alignment quality. Preference data is collected from bank QA teams reviewing actual borrower interactions.
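The per-pair objective is compact enough to sketch in a few lines. Below is an illustrative pure-Python version of the DPO loss; in practice the training runs through TRL's DPOTrainer, and the log-probabilities here are made-up numbers, not real model outputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference (pre-DPO) model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): small when the policy prefers the chosen answer
    # more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response relative to the reference,
# so the loss falls below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

No reward model appears anywhere in the computation, which is exactly why the training loop stays a plain supervised-style pass over preference pairs.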
For complex regulatory questions where entities have relationships (RBI circulars referencing each other, loan products with interconnected eligibility criteria), we build entity-relationship graphs over policy documents. GraphRAG captures structural knowledge that flat vector search misses, delivering consistently accurate answers on multi-hop policy queries.
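To illustrate why structure matters, here is a toy version of the idea: a multi-hop traversal over a tiny entity graph. The node names are invented, and production uses Neo4j rather than a Python dict; the point is that a reference two hops away is still reachable, whereas flat chunk retrieval would likely miss it.

```python
from collections import deque

# Toy entity-relationship graph over policy documents (illustrative
# names, not real circulars): an edge means "references" or "applies to".
graph = {
    "Mudra loan": ["CGTMSE guarantee", "Circular A"],
    "Circular A": ["Circular B"],
    "CGTMSE guarantee": ["Circular B"],
    "Circular B": [],
}

def multi_hop_context(start, max_hops=2):
    """Collect every entity reachable within max_hops via breadth-first
    search; this is the structural context that flat vector search over
    isolated chunks cannot see."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

context = multi_hop_context("Mudra loan")
# Includes "Circular B" even though "Mudra loan" never references it directly.
```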
Off-the-shelf embedding models don't understand "Mudra loan" or "CGTMSE guarantee" in Hindi context. We fine-tune domain-specific embedding models on financial terminology so that semantic search actually works for Indian banking queries. This is the hidden bottleneck most RAG systems miss.
Post-deployment, we use Reinforcement Learning from AI Feedback (RLAIF) for automated quality checks combined with periodic human review from bank compliance teams. Constitutional AI principles flag harmful or non-compliant responses. Production conversations are continuously scored to generate new DPO training pairs.
Our RAG system isn't a simple "retrieve + generate" pipeline. Sentia acts as an agent — planning retrieval strategy, validating results, and iterating until the answer is grounded in evidence.
Sentia classifies query complexity. Simple factual questions go to direct retrieval. Multi-hop policy questions trigger agentic planning with decomposed sub-queries. Ambiguous questions trigger clarification prompts back to the user.
BM25 keyword search + dense vector embeddings run in parallel. Results merged and deduplicated. For entity-relationship queries, GraphRAG adds structural context. HyDE (Hypothetical Document Embeddings) generates hypothetical answers to guide retrieval for sparse or vague queries.
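One common way to merge parallel ranked lists is reciprocal rank fusion; the sketch below is illustrative (the merging method and document IDs here are assumptions, not a statement of our exact implementation).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g. BM25 and dense retrieval)
    into one. A document scores 1/(k + rank) per list it appears in, so
    items ranked highly by either retriever float to the top, and
    duplicates across lists are merged automatically."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_kyc", "doc_emi", "doc_rates"]     # lexical results
dense = ["doc_kyc", "doc_mudra", "doc_emi"]    # embedding results
fused = reciprocal_rank_fusion([bm25, dense])
# doc_kyc leads (top of both lists); doc_emi beats doc_mudra because
# it appears in both rankings rather than just one.
```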
Retrieved documents scored for relevance by a lightweight cross-encoder reranker. Below-threshold results trigger re-query with refined terms. If context is still insufficient, the system fetches from alternative knowledge sources or escalates to a human agent. No hallucinated answers.
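The retrieve-score-refine loop can be sketched as follows. The helper functions are stubs standing in for the real search stack and cross-encoder; the control flow is the part that matters.

```python
def corrective_retrieve(query, retrieve, score, refine,
                        threshold=0.5, max_retries=2):
    """Retrieve, score each chunk, and re-query with refined terms until
    enough relevant context survives the threshold. `retrieve`, `score`,
    and `refine` are injected so this sketch stays independent of any
    particular search stack or reranker."""
    for _ in range(max_retries + 1):
        chunks = retrieve(query)
        relevant = [c for c in chunks if score(query, c) >= threshold]
        if relevant:
            return relevant
        query = refine(query)   # e.g. expand terms, fix spelling, drop noise
    return None                 # nothing grounded: escalate to a human agent

# Stubbed dependencies for illustration only:
docs = {"mudra eligibility": ["Mudra loans up to ₹10 lakh for non-farm enterprises"]}
hits = corrective_retrieve(
    "mudra eligiblity",                         # misspelled borrower query
    retrieve=lambda q: docs.get(q, []),
    score=lambda q, c: 1.0,
    refine=lambda q: q.replace("eligiblity", "eligibility"),
)
# First pass finds nothing; the refined query succeeds on the second pass.
```

Returning `None` rather than a best-effort guess is the design choice that backs the no-hallucination guarantee: an ungrounded query exits the pipeline instead of reaching the generator.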
The fine-tuned SLM generates a response grounded strictly in retrieved evidence. Every claim is traceable to a source document. Sentia validates the response against the retrieved context before delivery — if grounding fails, the cycle repeats.
Every custom LLM deployment follows a rigorous lifecycle — from training data collection to post-deployment monitoring.
Training data comes from three streams: (a) Bank-provided documents — product manuals, policy circulars, FAQ sheets, compliance guidelines, KYC requirements. These form the RAG knowledge base. (b) Real conversation transcripts — anonymized and de-identified call recordings from existing loan origination workflows, transcribed by Augmen STT. These teach the model how borrowers actually speak. (c) Synthetic instruction pairs — we use a teacher LLM (e.g., GPT-4 or Claude) to generate high-quality instruction-response pairs grounded in bank documents, then have domain experts validate them.
Deduplication → MinHash + semantic dedup to remove near-duplicate entries. Quality filtering → Perplexity scoring removes incoherent samples. PII redaction → Named entity recognition strips Aadhaar numbers, phone numbers, account numbers before training. Format standardization → All data converted to ChatML / Alpaca instruction format with system prompts. Hindi-English handling → Code-switched data preserved as-is, not split by language.
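The regex backstop for PII redaction might look like the sketch below. The patterns are simplified illustrations, not our production rules; in practice NER does the heavy lifting and regexes act as a second line of defense.

```python
import re

# Illustrative patterns only: real Aadhaar validation uses the Verhoeff
# checksum, and account-number formats vary by bank.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),  # 12-digit Aadhaar
    (re.compile(r"\b[6-9]\d{9}\b"), "[PHONE]"),               # Indian mobile number
    (re.compile(r"\b\d{11,16}\b"), "[ACCOUNT]"),              # bank account number
]

def redact_pii(text):
    """Replace identifiers with placeholder tokens before transcripts
    enter the training corpus. Order matters: the longer Aadhaar pattern
    runs before the 10-digit phone pattern."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact_pii("Call me on 9876543210, Aadhaar 1234 5678 9012")
# → "Call me on [PHONE], Aadhaar [AADHAAR]"
```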
Typical dataset size: 5K–50K instruction pairs per client, depending on product complexity. Quality > quantity — a focused 10K dataset of bank-specific conversations outperforms 100K generic financial instructions.
LLaMA-Factory — Unified interface for fine-tuning 100+ model families. Supports LoRA, QLoRA, full fine-tuning, and DPO/PPO alignment. We use it as our primary training orchestrator. Hugging Face TRL — For DPO preference training and reward model training. BitsAndBytes — 4-bit NF4 quantization for QLoRA. PEFT — Hugging Face's parameter-efficient fine-tuning library for LoRA adapter management. vLLM — High-throughput inference engine with PagedAttention for production serving.
LangChain / LlamaIndex — Agentic RAG orchestration with tool routing and memory. ChromaDB / Weaviate — Vector databases for semantic search (chosen per client's infrastructure). Elasticsearch — BM25 lexical search for hybrid retrieval. Cross-encoder rerankers — BGE-reranker or BAAI models for relevance scoring. Neo4j — Graph database for GraphRAG entity-relationship storage.
We're model-agnostic — the right base model depends on the task and client constraints. Current favorites:
Why SLMs, not GPT-4? A fine-tuned 7B model beats GPT-4 on domain-specific tasks (verified on our benchmarks) while being 100% self-hosted, 10× cheaper to run, and compliant with RBI data localization requirements. We use larger models only as teachers for synthetic data generation.
Base model weights are frozen and quantized to 4-bit NormalFloat (NF4) using BitsAndBytes. Low-rank adapters (rank 16–64, depending on task complexity) are injected into all attention layers and trained in FP16. Double quantization reduces memory further. Paged optimizers handle memory spikes. A 7B model trains on a single L4 GPU (24 GB) in 4–8 hours for a 10K instruction dataset.
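A weights-only back-of-envelope calculation shows why this fits on one card. The layer count and hidden size below are typical 7B-class values (assumptions, not a specific model's config), and the estimate deliberately ignores activations, gradients, and optimizer state, which consume much of the remaining headroom.

```python
def qlora_weight_memory_gb(n_params_b=7.0, lora_rank=16,
                           n_adapted_layers=32, hidden=4096):
    """Weights-only memory estimate: FP16 fine-tuning vs QLoRA.

    Returns (fp16_gb, qlora_gb). Assumes a rank-r adapter pair
    (A: hidden x r, B: r x hidden) on 4 attention projections per
    layer, trained in FP16, on top of an NF4-quantized base.
    """
    GB = 1024 ** 3
    base_fp16 = n_params_b * 1e9 * 2 / GB        # 2 bytes per weight
    base_nf4 = n_params_b * 1e9 * 0.5 / GB       # 4-bit NormalFloat
    adapter_params = n_adapted_layers * 4 * 2 * hidden * lora_rank
    adapters_fp16 = adapter_params * 2 / GB      # ~17M params, tens of MB
    return base_fp16, base_nf4 + adapters_fp16

fp16_gb, qlora_gb = qlora_weight_memory_gb()
# Base weights drop from ~13 GB to ~3.3 GB; the adapters add only ~0.03 GB,
# leaving most of a 24 GB L4 free for activations and the paged optimizer.
```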
After SFT, we run the fine-tuned model on held-out prompts to generate multiple candidate responses. Bank domain experts (or a rule-based reward model for verifiable tasks) score responses as "preferred" vs "rejected." These preference pairs train the model via Direct Preference Optimization — no separate reward model needed, no PPO instability. The model learns to produce responses that match what bank compliance teams actually want.
LoRA rank: r=16 (default), up to r=64 for complex tasks. Alpha: 2× rank. Optimizer: AdamW with cosine annealing. Learning rate: 2e-4 (SFT), 5e-7 (DPO). Batch size: 4 with gradient accumulation ×4. Epochs: 1–3 for SFT (overfitting risk beyond 3), 1 for DPO. Dropout: 0.05 on LoRA layers.
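Those defaults map onto a training config along these lines. The field names follow LLaMA-Factory's YAML conventions as we understand them and may differ across versions; treat this as an illustrative sketch, not a drop-in file, and the model path as a placeholder.

```yaml
# Illustrative LLaMA-Factory SFT run mirroring the defaults above.
model_name_or_path: Qwen/Qwen2.5-7B-Instruct   # placeholder base model
stage: sft
finetuning_type: lora
quantization_bit: 4              # QLoRA: NF4 base weights via BitsAndBytes
lora_rank: 16
lora_alpha: 32                   # 2x rank
lora_dropout: 0.05
learning_rate: 2.0e-4            # 5.0e-7 for the later DPO stage
lr_scheduler_type: cosine
num_train_epochs: 3.0            # 1 epoch for DPO
per_device_train_batch_size: 4
gradient_accumulation_steps: 4   # effective batch size 16
```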
No full fine-tuning (wasteful for domain adaptation). No RLHF with PPO (DPO is simpler and equally effective for our use case). No training from scratch (transfer learning from strong base models is always better). No INT8 quantization during training (NF4 is strictly better for normally distributed weights).
Generic benchmarks (MMLU, HellaSwag) don't tell you if a model can explain Mudra loan eligibility in Hindi. We build custom evaluation suites per client: 200–500 questions covering product knowledge, regulatory compliance, edge cases, and code-switched queries. Ground truth answers validated by bank subject matter experts. Measured on: accuracy (exact match on factual questions), faithfulness (does the answer cite the right source?), and relevance (does it actually answer what was asked?).
We use the RAGAS framework to evaluate our RAG pipeline on four dimensions: faithfulness (is the answer supported by retrieved context?), answer relevancy (does the answer address the question?), context precision (are the retrieved chunks actually useful?), and context recall (did we retrieve all relevant information?). Additionally, nDCG, MRR, and Recall@K metrics from information retrieval evaluate the search component independently.
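RAGAS covers the generation-side scores; the retrieval-side metrics are standard IR formulas, sketched here in plain Python with toy relevance judgments (the example numbers are invented for illustration).

```python
import math

def mrr(ranked_relevance):
    """Mean Reciprocal Rank over queries. Each entry is a list of 0/1
    relevance flags in ranked order; a query with no relevant hit scores 0."""
    total = 0.0
    for flags in ranked_relevance:
        rank = next((i + 1 for i, r in enumerate(flags) if r), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_relevance)

def ndcg_at_k(flags, k):
    """Binary nDCG@k for one ranked list: discounted gain over the gain
    of the ideal (relevant-first) ordering."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(flags[:k]))
    ideal = sorted(flags, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

# Toy judgments: first query's relevant doc at rank 2, second's at rank 1.
score_mrr = mrr([[0, 1, 0], [1, 0, 0]])   # (1/2 + 1/1) / 2 = 0.75
score_ndcg = ndcg_at_k([0, 1, 1], k=3)    # penalized for the miss at rank 1
```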
For subjective quality (tone, empathy, clarity), we use a panel of LLM judges with criteria specific to fintech conversations: "Does the response explain the EMI calculation clearly?", "Is the tone appropriate for a rural borrower?", "Does it comply with Fair Practice Code?". Multiple judges reduce individual model bias. Results correlated with human annotations to validate alignment.
Every fine-tuning iteration is tested against the previous version's evaluation suite. No deployment without demonstrated improvement (or parity) on all metrics. A/B testing on live traffic with 5% canary rollout before full deployment.
Deployment is not the end — it's the start of the feedback flywheel. Every production conversation generates data that makes the next model better.
Every production response is scored by an AI evaluator on: factual accuracy (does the response match the bank's documented policies?), compliance (does it follow Fair Practice Code and RBI guidelines?), tone appropriateness (is it respectful, clear, and suitable for the borrower demographic?), and hallucination detection (does the response cite something not in the retrieved context?). Low-scoring responses are flagged for human review.
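A heavily simplified stand-in for the hallucination check is a token-overlap groundedness score. The production evaluator is LLM-based and far more nuanced, but the flagging logic has the same shape: score the response against the retrieved context, and route low scores to humans.

```python
def grounding_score(response, context_chunks, threshold=0.5):
    """Crude groundedness heuristic: the fraction of content words in the
    response that also appear in the retrieved context. Returns
    (score, flag_for_human_review). Illustrative only; it misses
    paraphrases that an LLM-based checker would catch."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "for", "in"}
    resp_tokens = {t for t in response.lower().split() if t not in stopwords}
    ctx_tokens = set(" ".join(context_chunks).lower().split())
    if not resp_tokens:
        return 1.0, False
    overlap = len(resp_tokens & ctx_tokens) / len(resp_tokens)
    return overlap, overlap < threshold

context = ["mudra loans cover non-farm enterprises for amounts up to ₹10 lakh"]
score, flagged = grounding_score("mudra loans cover amounts up to ₹10 lakh", context)
# Fully supported by the context, so nothing is flagged; a response citing
# an interest rate absent from the context would come back flagged=True.
```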
Bank QA teams periodically review flagged conversations and a random sample of production interactions. They mark responses as "good" or "needs improvement" with specific corrections. This generates high-quality preference pairs directly from production traffic — the most valuable training data possible, because it reflects real borrower queries, not synthetic ones.
Accumulated preference pairs (typically 500–1,000 per cycle) are used to run a DPO fine-tuning pass. The model learns from its own production mistakes. Each cycle takes 2–4 hours on a single GPU. We run re-training cycles monthly or on-demand when bank policies change (new product launch, regulatory update, rate change).
When the bank adds new products, updates policies, or receives new RBI circulars, the RAG knowledge base is updated in real time — no model re-training needed. New documents are chunked, embedded, and indexed. The fine-tuned SLM's behavior on factual questions updates instantly through the retrieval layer.
We monitor embedding distribution shifts, response length changes, and hallucination rates over time. If production queries drift from the training distribution (e.g., new product line generates queries the model hasn't seen), alerts trigger targeted data collection and re-training. Sentia logs every decision for audit trails required by bank compliance.
The flywheel effect: Production conversations → RLAIF scoring → Human validation → DPO preference pairs → Model re-training → Better production conversations. Each cycle compounds. A model deployed for 6 months is significantly better than the initial release.
A fine-tuned 7B model runs on a single L4 GPU at sub-second latency. No multi-GPU clusters. No expensive cloud API bills. Total inference cost: ~₹0.002 per query.
QLoRA fine-tuning takes 4–8 hours on one GPU. New regulatory requirement on Friday? Updated model deployed by Monday. Try that with GPT-4.
A fine-tuned 7B model beats GPT-4 on our domain benchmarks — because it's trained on what matters: Indian financial services, not Shakespeare and Wikipedia.
No borrower data leaves your infrastructure. No API calls to OpenAI or Google. Complete RBI data localization compliance and DPDP Act adherence.
Smaller models are easier to audit. Every RAG response cites its source document. Every decision is logged. Critical for RBI compliance and internal audit requirements.
Different tasks get different models. FAQ resolution uses a fast 2B model. Risk scoring uses a larger 8B model. Sentia routes each task to the optimal agent automatically.