Domain-specific SLMs fine-tuned with QLoRA for Indian financial services, grounded through Agentic RAG with corrective retrieval, and continuously improved via DPO preference optimization from real borrower conversations. Self-hosted. No external API calls.
Instead of one large model doing everything, Augmen uses specialized small models orchestrated by Sentia. Each agent handles what it does best — retrieval, reasoning, generation, validation — and the context engine routes only what's needed to each agent.
We continuously adopt the best techniques from the research frontier. Each client deployment uses the most effective combination of fine-tuning, retrieval, and alignment methods available at the time of delivery.
Fine-tune models with 75% less memory by quantizing base weights to 4-bit NormalFloat while training low-rank adapters in FP16. A 7B model fits on a single 24 GB GPU. We use QLoRA via LLaMA-Factory with BitsAndBytes for all SLM fine-tuning — adapting models like Llama 3, Qwen 2.5, and Gemma to Indian financial services in hours, not weeks.
Beyond naive RAG: our retrieval pipeline uses Sentia as an agentic orchestrator that dynamically decides when to retrieve, validates the relevance of retrieved documents, re-queries when context is insufficient, and combines hybrid search (BM25 + vector embeddings) with cross-encoder reranking. Corrective RAG keeps every answer grounded in bank policy documents, sharply reducing hallucinations.
Instead of complex RLHF with separate reward models, we use DPO to directly optimize the model on preference pairs from real loan conversations. Simpler, more stable, fewer hyperparameters — and achieves comparable alignment quality. Preference data is collected from bank QA teams reviewing actual borrower interactions.
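The per-pair objective is compact enough to sketch in a few lines. Below is an illustrative pure-Python version of the DPO loss; in practice the training runs through TRL's DPOTrainer, and the log-probabilities here are made-up numbers, not real model outputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference (pre-DPO) model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): small when the policy prefers the chosen answer
    # more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response relative to the reference,
# so the loss falls below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

No reward model appears anywhere in the computation, which is exactly why the training loop stays a plain supervised-style pass over preference pairs.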
For complex regulatory questions where entities have relationships (RBI circulars referencing each other, loan products with interconnected eligibility criteria), we build entity-relationship graphs over policy documents. GraphRAG captures structural knowledge that flat vector search misses, delivering consistently accurate answers on multi-hop policy queries.
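To illustrate why structure matters, here is a toy version of the idea: a multi-hop traversal over a tiny entity graph. The node names are invented, and production uses Neo4j rather than a Python dict; the point is that a reference two hops away is still reachable, whereas flat chunk retrieval would likely miss it.

```python
from collections import deque

# Toy entity-relationship graph over policy documents (illustrative
# names, not real circulars): an edge means "references" or "applies to".
graph = {
    "Mudra loan": ["CGTMSE guarantee", "Circular A"],
    "Circular A": ["Circular B"],
    "CGTMSE guarantee": ["Circular B"],
    "Circular B": [],
}

def multi_hop_context(start, max_hops=2):
    """Collect every entity reachable within max_hops via breadth-first
    search; this is the structural context that flat vector search over
    isolated chunks cannot see."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

context = multi_hop_context("Mudra loan")
# Includes "Circular B" even though "Mudra loan" never references it directly.
```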
Off-the-shelf embedding models don't understand "Mudra loan" or "CGTMSE guarantee" in Hindi context. We fine-tune domain-specific embedding models on financial terminology so that semantic search actually works for Indian banking queries. This is the hidden bottleneck most RAG systems miss.
Post-deployment, we use Reinforcement Learning from AI Feedback (RLAIF) for automated quality checks combined with periodic human review from bank compliance teams. Constitutional AI principles flag harmful or non-compliant responses. Production conversations are continuously scored to generate new DPO training pairs.
Our RAG system isn't a simple "retrieve + generate" pipeline. Sentia acts as an agent — planning retrieval strategy, validating results, and iterating until the answer is grounded in evidence.
Sentia classifies query complexity. Simple factual questions go to direct retrieval. Multi-hop policy questions trigger agentic planning with decomposed sub-queries. Ambiguous questions trigger clarification prompts back to the user.
BM25 keyword search + dense vector embeddings run in parallel. Results merged and deduplicated. For entity-relationship queries, GraphRAG adds structural context. HyDE (Hypothetical Document Embeddings) generates hypothetical answers to guide retrieval for sparse or vague queries.
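One common way to merge parallel ranked lists is reciprocal rank fusion; the sketch below is illustrative (the merging method and document IDs here are assumptions, not a statement of our exact implementation).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g. BM25 and dense retrieval)
    into one. A document scores 1/(k + rank) per list it appears in, so
    items ranked highly by either retriever float to the top, and
    duplicates across lists are merged automatically."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_kyc", "doc_emi", "doc_rates"]     # lexical results
dense = ["doc_kyc", "doc_mudra", "doc_emi"]    # embedding results
fused = reciprocal_rank_fusion([bm25, dense])
# doc_kyc leads (top of both lists); doc_emi beats doc_mudra because
# it appears in both rankings rather than just one.
```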
Retrieved documents scored for relevance by a lightweight cross-encoder reranker. Below-threshold results trigger re-query with refined terms. If context is still insufficient, the system fetches from alternative knowledge sources or escalates to a human agent. No hallucinated answers.
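The retrieve-score-refine loop can be sketched as follows. The helper functions are stubs standing in for the real search stack and cross-encoder; the control flow is the part that matters.

```python
def corrective_retrieve(query, retrieve, score, refine,
                        threshold=0.5, max_retries=2):
    """Retrieve, score each chunk, and re-query with refined terms until
    enough relevant context survives the threshold. `retrieve`, `score`,
    and `refine` are injected so this sketch stays independent of any
    particular search stack or reranker."""
    for _ in range(max_retries + 1):
        chunks = retrieve(query)
        relevant = [c for c in chunks if score(query, c) >= threshold]
        if relevant:
            return relevant
        query = refine(query)   # e.g. expand terms, fix spelling, drop noise
    return None                 # nothing grounded: escalate to a human agent

# Stubbed dependencies for illustration only:
docs = {"mudra eligibility": ["Mudra loans up to ₹10 lakh for non-farm enterprises"]}
hits = corrective_retrieve(
    "mudra eligiblity",                         # misspelled borrower query
    retrieve=lambda q: docs.get(q, []),
    score=lambda q, c: 1.0,
    refine=lambda q: q.replace("eligiblity", "eligibility"),
)
# First pass finds nothing; the refined query succeeds on the second pass.
```

Returning `None` rather than a best-effort guess is the design choice that backs the no-hallucination guarantee: an ungrounded query exits the pipeline instead of reaching the generator.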
The fine-tuned SLM generates a response grounded strictly in retrieved evidence. Every claim is traceable to a source document. Sentia validates the response against the retrieved context before delivery — if grounding fails, the cycle repeats.
Every custom LLM deployment follows a rigorous lifecycle — from training data collection to post-deployment monitoring.
Training data comes from three streams: (a) Bank-provided documents — product manuals, policy circulars, FAQ sheets, compliance guidelines, KYC requirements. These form the RAG knowledge base. (b) Real conversation transcripts — anonymized and de-identified call recordings from existing loan origination workflows, transcribed by Augmen STT. These teach the model how borrowers actually speak. (c) Synthetic instruction pairs — we use a teacher LLM (e.g., GPT-4 or Claude) to generate high-quality instruction-response pairs grounded in bank documents, then have domain experts validate them.
Deduplication → MinHash + semantic dedup to remove near-duplicate entries. Quality filtering → Perplexity scoring removes incoherent samples. PII redaction → Named entity recognition strips Aadhaar numbers, phone numbers, account numbers before training. Format standardization → All data converted to ChatML / Alpaca instruction format with system prompts. Hindi-English handling → Code-switched data preserved as-is, not split by language.
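The regex backstop for PII redaction might look like the sketch below. The patterns are simplified illustrations, not our production rules; in practice NER does the heavy lifting and regexes act as a second line of defense.

```python
import re

# Illustrative patterns only: real Aadhaar validation uses the Verhoeff
# checksum, and account-number formats vary by bank.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),  # 12-digit Aadhaar
    (re.compile(r"\b[6-9]\d{9}\b"), "[PHONE]"),               # Indian mobile number
    (re.compile(r"\b\d{11,16}\b"), "[ACCOUNT]"),              # bank account number
]

def redact_pii(text):
    """Replace identifiers with placeholder tokens before transcripts
    enter the training corpus. Order matters: the longer Aadhaar pattern
    runs before the 10-digit phone pattern."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact_pii("Call me on 9876543210, Aadhaar 1234 5678 9012")
# → "Call me on [PHONE], Aadhaar [AADHAAR]"
```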
Typical dataset size: 5K–50K instruction pairs per client, depending on product complexity. Quality > quantity — a focused 10K dataset of bank-specific conversations outperforms 100K generic financial instructions.
LLaMA-Factory — Unified interface for fine-tuning 100+ model families. Supports LoRA, QLoRA, full fine-tuning, and DPO/PPO alignment. We use it as our primary training orchestrator. Hugging Face TRL — For DPO preference training and reward model training. BitsAndBytes — 4-bit NF4 quantization for QLoRA. PEFT — Hugging Face's parameter-efficient fine-tuning library for LoRA adapter management. vLLM — High-throughput inference engine with PagedAttention for production serving.
LangChain / LlamaIndex — Agentic RAG orchestration with tool routing and memory. ChromaDB / Weaviate — Vector databases for semantic search (chosen per client's infrastructure). Elasticsearch — BM25 lexical search for hybrid retrieval. Cross-encoder rerankers — BGE-reranker or BAAI models for relevance scoring. Neo4j — Graph database for GraphRAG entity-relationship storage.
We're model-agnostic — the right base model depends on the task and client constraints. Current favorites:
Why SLMs, not GPT-4? A fine-tuned 7B model beats GPT-4 on domain-specific tasks (verified on our benchmarks) while being 100% self-hosted, 10× cheaper to run, and compliant with RBI data localization requirements. We use larger models only as teachers for synthetic data generation.
Base model weights are frozen and quantized to 4-bit NormalFloat (NF4) using BitsAndBytes. Low-rank adapters (rank 16–64, depending on task complexity) are injected into all attention layers and trained in FP16. Double quantization reduces memory further. Paged optimizers handle memory spikes. A 7B model trains on a single L4 GPU (24 GB) in 4–8 hours for a 10K instruction dataset.
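A weights-only back-of-envelope calculation shows why this fits on one card. The layer count and hidden size below are typical 7B-class values (assumptions, not a specific model's config), and the estimate deliberately ignores activations, gradients, and optimizer state, which consume much of the remaining headroom.

```python
def qlora_weight_memory_gb(n_params_b=7.0, lora_rank=16,
                           n_adapted_layers=32, hidden=4096):
    """Weights-only memory estimate: FP16 fine-tuning vs QLoRA.

    Returns (fp16_gb, qlora_gb). Assumes a rank-r adapter pair
    (A: hidden x r, B: r x hidden) on 4 attention projections per
    layer, trained in FP16, on top of an NF4-quantized base.
    """
    GB = 1024 ** 3
    base_fp16 = n_params_b * 1e9 * 2 / GB        # 2 bytes per weight
    base_nf4 = n_params_b * 1e9 * 0.5 / GB       # 4-bit NormalFloat
    adapter_params = n_adapted_layers * 4 * 2 * hidden * lora_rank
    adapters_fp16 = adapter_params * 2 / GB      # ~17M params, tens of MB
    return base_fp16, base_nf4 + adapters_fp16

fp16_gb, qlora_gb = qlora_weight_memory_gb()
# Base weights drop from ~13 GB to ~3.3 GB; the adapters add only ~0.03 GB,
# leaving most of a 24 GB L4 free for activations and the paged optimizer.
```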
After SFT, we run the fine-tuned model on held-out prompts to generate multiple candidate responses. Bank domain experts (or a rule-based reward model for verifiable tasks) score responses as "preferred" vs "rejected." These preference pairs train the model via Direct Preference Optimization — no separate reward model needed, no PPO instability. The model learns to produce responses that match what bank compliance teams actually want.
LoRA rank: r=16 (default), up to r=64 for complex tasks. Alpha: 2× rank. Optimizer: AdamW with cosine annealing. Learning rate: 2e-4 (SFT), 5e-7 (DPO). Batch size: 4 with gradient accumulation ×4. Epochs: 1–3 for SFT (overfitting risk beyond 3), 1 for DPO. Dropout: 0.05 on LoRA layers.
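Those defaults map onto a training config along these lines. The field names follow LLaMA-Factory's YAML conventions as we understand them and may differ across versions; treat this as an illustrative sketch, not a drop-in file, and the model path as a placeholder.

```yaml
# Illustrative LLaMA-Factory SFT run mirroring the defaults above.
model_name_or_path: Qwen/Qwen2.5-7B-Instruct   # placeholder base model
stage: sft
finetuning_type: lora
quantization_bit: 4              # QLoRA: NF4 base weights via BitsAndBytes
lora_rank: 16
lora_alpha: 32                   # 2x rank
lora_dropout: 0.05
learning_rate: 2.0e-4            # 5.0e-7 for the later DPO stage
lr_scheduler_type: cosine
num_train_epochs: 3.0            # 1 epoch for DPO
per_device_train_batch_size: 4
gradient_accumulation_steps: 4   # effective batch size 16
```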
No full fine-tuning (wasteful for domain adaptation). No RLHF with PPO (DPO is simpler and equally effective for our use case). No training from scratch (transfer learning from strong base models is always better). No INT8 quantization during training (NF4 is strictly better for normally distributed weights).
Generic benchmarks (MMLU, HellaSwag) don't tell you if a model can explain Mudra loan eligibility in Hindi. We build custom evaluation suites per client: 200–500 questions covering product knowledge, regulatory compliance, edge cases, and code-switched queries. Ground truth answers validated by bank subject matter experts. Measured on: accuracy (exact match on factual questions), faithfulness (does the answer cite the right source?), and relevance (does it actually answer what was asked?).
We use the RAGAS framework to evaluate our RAG pipeline on four dimensions: faithfulness (is the answer supported by retrieved context?), answer relevancy (does the answer address the question?), context precision (are the retrieved chunks actually useful?), and context recall (did we retrieve all relevant information?). Additionally, nDCG, MRR, and Recall@K metrics from information retrieval evaluate the search component independently.
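RAGAS covers the generation-side scores; the retrieval-side metrics are standard IR formulas, sketched here in plain Python with toy relevance judgments (the example numbers are invented for illustration).

```python
import math

def mrr(ranked_relevance):
    """Mean Reciprocal Rank over queries. Each entry is a list of 0/1
    relevance flags in ranked order; a query with no relevant hit scores 0."""
    total = 0.0
    for flags in ranked_relevance:
        rank = next((i + 1 for i, r in enumerate(flags) if r), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_relevance)

def ndcg_at_k(flags, k):
    """Binary nDCG@k for one ranked list: discounted gain over the gain
    of the ideal (relevant-first) ordering."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(flags[:k]))
    ideal = sorted(flags, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

# Toy judgments: first query's relevant doc at rank 2, second's at rank 1.
score_mrr = mrr([[0, 1, 0], [1, 0, 0]])   # (1/2 + 1/1) / 2 = 0.75
score_ndcg = ndcg_at_k([0, 1, 1], k=3)    # penalized for the miss at rank 1
```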
For subjective quality (tone, empathy, clarity), we use a panel of LLM judges with criteria specific to fintech conversations: "Does the response explain the EMI calculation clearly?", "Is the tone appropriate for a rural borrower?", "Does it comply with Fair Practice Code?". Multiple judges reduce individual model bias. Results correlated with human annotations to validate alignment.
Every fine-tuning iteration is tested against the previous version's evaluation suite. No deployment without demonstrated improvement (or parity) on all metrics. A/B testing on live traffic with 5% canary rollout before full deployment.
Deployment is not the end — it's the start of the feedback flywheel. Every production conversation generates data that makes the next model better.
Every production response is scored by an AI evaluator on: factual accuracy (does the response match the bank's documented policies?), compliance (does it follow Fair Practice Code and RBI guidelines?), tone appropriateness (is it respectful, clear, and suitable for the borrower demographic?), and hallucination detection (does the response cite something not in the retrieved context?). Low-scoring responses are flagged for human review.
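A heavily simplified stand-in for the hallucination check is a token-overlap groundedness score. The production evaluator is LLM-based and far more nuanced, but the flagging logic has the same shape: score the response against the retrieved context, and route low scores to humans.

```python
def grounding_score(response, context_chunks, threshold=0.5):
    """Crude groundedness heuristic: the fraction of content words in the
    response that also appear in the retrieved context. Returns
    (score, flag_for_human_review). Illustrative only; it misses
    paraphrases that an LLM-based checker would catch."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "for", "in"}
    resp_tokens = {t for t in response.lower().split() if t not in stopwords}
    ctx_tokens = set(" ".join(context_chunks).lower().split())
    if not resp_tokens:
        return 1.0, False
    overlap = len(resp_tokens & ctx_tokens) / len(resp_tokens)
    return overlap, overlap < threshold

context = ["mudra loans cover non-farm enterprises for amounts up to ₹10 lakh"]
score, flagged = grounding_score("mudra loans cover amounts up to ₹10 lakh", context)
# Fully supported by the context, so nothing is flagged; a response citing
# an interest rate absent from the context would come back flagged=True.
```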
Bank QA teams periodically review flagged conversations and a random sample of production interactions. They mark responses as "good" or "needs improvement" with specific corrections. This generates high-quality preference pairs directly from production traffic — the most valuable training data possible, because it reflects real borrower queries, not synthetic ones.
Accumulated preference pairs (typically 500–1,000 per cycle) are used to run a DPO fine-tuning pass. The model learns from its own production mistakes. Each cycle takes 2–4 hours on a single GPU. We run re-training cycles monthly or on-demand when bank policies change (new product launch, regulatory update, rate change).
When the bank adds new products, updates policies, or receives new RBI circulars, the RAG knowledge base is updated in real time — no model re-training needed. New documents are chunked, embedded, and indexed. The fine-tuned SLM's behavior on factual questions updates instantly through the retrieval layer.
We monitor embedding distribution shifts, response length changes, and hallucination rates over time. If production queries drift from the training distribution (e.g., new product line generates queries the model hasn't seen), alerts trigger targeted data collection and re-training. Sentia logs every decision for audit trails required by bank compliance.
The flywheel effect: Production conversations → RLAIF scoring → Human validation → DPO preference pairs → Model re-training → Better production conversations. Each cycle compounds. A model deployed for 6 months is significantly better than the initial release.
A fine-tuned 7B model runs on a single L4 GPU at sub-second latency. No multi-GPU clusters. No expensive cloud API bills. Total inference cost: ~₹0.002 per query.
QLoRA fine-tuning takes 4–8 hours on one GPU. New regulatory requirement on Friday? Updated model deployed by Monday. Try that with GPT-4.
A fine-tuned 7B model beats GPT-4 on our domain benchmarks — because it's trained on what matters: Indian financial services, not Shakespeare and Wikipedia.
No borrower data leaves your infrastructure. No API calls to OpenAI or Google. Complete RBI data localization compliance and DPDP Act adherence.
Smaller models are easier to audit. Every RAG response cites its source document. Every decision is logged. Critical for RBI compliance and internal audit requirements.
Different tasks get different models. FAQ resolution uses a fast 2B model. Risk scoring uses a larger 8B model. Sentia routes each task to the optimal agent automatically.