Architecture · Amit Pandey · January 15, 2025 · 8 min read

Why Selective Context Beats Full-History RAG in Production Conversations

Most conversational AI systems treat conversation history as a growing append-only log: every new query is sent to the model with the entire history attached. We built Sentia because this approach fundamentally breaks at scale.

The Append-Only Problem

When you build a conversational AI system — particularly one handling complex, multi-topic interactions like loan origination — the standard approach is straightforward: append every user message and system response to a growing context window, and pass it all to the language model on every turn.

This works reasonably well for short conversations. A 5-turn chat about a single topic fits comfortably within context limits and generates focused responses. But financial conversations are not 5-turn chats about a single topic.

A typical loan origination conversation involves the borrower discussing their income, switching to ask about interest rates, returning to employment details, asking about documentation requirements, and then circling back to loan amounts. In a 30-minute conversation, a borrower might pivot between topics fifteen or more times.

The core insight: In a 50-turn conversation, over 70% of the context passed to the model is irrelevant to the current query. This irrelevant context doesn't just waste tokens — it actively degrades response quality.

What Happens in Practice

We measured three failure modes in production conversations using the full-history approach:

Token cost explosion. A 30-minute conversation generates roughly 15,000–20,000 tokens of history. Passing all of it on every turn means your per-query cost grows linearly with conversation length, and your cumulative cost per conversation grows quadratically. For a high-volume deployment processing thousands of conversations daily, this becomes economically unsustainable.
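A rough back-of-the-envelope calculation makes the difference concrete. The numbers below are illustrative assumptions, not production figures: 400 tokens per turn, and a flat 3,000-token window for the selective approach.

```python
def full_history_tokens(turns: int, tokens_per_turn: int = 400) -> int:
    """Tokens sent under full history: turn n resends all n turns so far."""
    return sum(n * tokens_per_turn for n in range(1, turns + 1))

def selective_tokens(turns: int, window: int = 3000) -> int:
    """Tokens sent under a roughly flat selective-context window per turn."""
    return turns * window

print(full_history_tokens(50))  # 510000 tokens over a 50-turn conversation
print(selective_tokens(50))     # 150000 tokens: roughly a 70% reduction
```

With these assumed numbers, selective context sends about 70% fewer tokens over a 50-turn conversation, and the gap widens as the conversation grows.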

Latency degradation. Larger context windows mean longer inference times. Response latency that starts at 300ms in turn 1 can reach 2–3 seconds by turn 40. In a voice-based conversation, this delay is immediately noticeable and breaks the natural flow.

Context confusion. This is the most insidious problem. When a borrower asks about interest rates after a long discussion about their employment, the model often "bleeds" context — referencing employment details in its interest rate response, or worse, generating contradictory information because it's trying to reconcile unrelated context segments.

The Sentia Approach

Sentia addresses this by treating conversation history not as a flat log, but as a structured graph of topic segments. When a user speaks, Sentia first classifies the utterance against active topic threads. If the utterance continues an existing thread, only that thread's context is passed to the model. If it represents a pivot to a new or previously active topic, Sentia switches context accordingly.
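A minimal sketch of this routing idea, assuming a pluggable classifier that maps each utterance to a topic label. The names `SelectiveContextRouter`, `TopicThread`, and `classify` are illustrative, not Sentia's actual API, and the keyword classifier stands in for real-time topic detection:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TopicThread:
    """One topic segment in the conversation graph."""
    name: str
    turns: list[str] = field(default_factory=list)

class SelectiveContextRouter:
    """Route each utterance to its topic thread; only that thread reaches the model."""

    def __init__(self, classify: Callable[[str], str]):
        self.threads: dict[str, TopicThread] = {}
        self.classify = classify  # maps an utterance to a topic label

    def context_for(self, utterance: str) -> list[str]:
        topic = self.classify(utterance)
        # Continue an existing thread, or open a new one on a pivot.
        thread = self.threads.setdefault(topic, TopicThread(topic))
        thread.turns.append(utterance)
        return list(thread.turns)  # focused context, regardless of total length

# Toy keyword classifier standing in for real-time topic detection.
def classify(utterance: str) -> str:
    return "rates" if "rate" in utterance.lower() else "income"

router = SelectiveContextRouter(classify)
router.context_for("My monthly income is 50,000 rupees")
print(router.context_for("What are the current interest rates?"))
# Only the rates thread is returned; the income turn stays out of context.
```

The key design choice is that context selection happens before the model call: the classifier decides which thread an utterance belongs to, so the model never sees unrelated segments it would otherwise try to reconcile.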

The result is that the model always receives a focused, relevant context window — typically 2,000–4,000 tokens regardless of total conversation length. Token costs stay flat. Latency stays consistent. And context confusion drops to near zero.

The Numbers

In our production deployment with the Government of Jharkhand, Sentia reduced average token usage per query by 70%, maintained sub-second response times across conversations exceeding 40 turns, and eliminated context confusion errors that previously affected roughly 12% of responses in long conversations.

Key takeaway: The solution to long-context conversations isn't bigger context windows — it's smarter context selection. Sentia proves that selective context, routed by real-time topic detection, outperforms brute-force history on every metric that matters in production.

What This Means for Production AI

If you're building conversational AI for any domain where conversations are long and multi-topic — financial services, healthcare intake, legal consultation — the full-history approach will hit a wall. The question isn't whether, but when.

Selective context isn't just an optimization. It's an architectural paradigm that changes what's possible in production conversational AI. We built Sentia because we needed it for our own agents. We're sharing these insights because we believe the industry needs it too.

Amit Pandey is the CTO of Augmen (Paaw Innovations), with 25 years of experience in AI and software engineering. Sentia is the core engine powering all Augmen AI agents.