The Append-Only Problem
When you build a conversational AI system — particularly one handling complex, multi-topic interactions like loan origination — the standard approach is straightforward: append every user message and system response to a growing context window, and pass it all to the language model on every turn.
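In code, the append-only pattern looks something like the following minimal sketch. The `call_llm` function and the message dictionaries are placeholders standing in for whatever model API you use, not any particular vendor's interface:

```python
# Hypothetical sketch of the append-only pattern. `call_llm` is a stub
# standing in for a real model API call.
def call_llm(messages):
    return f"(response generated from {len(messages)} messages)"

history = []

def handle_turn(user_message):
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the ENTIRE history is sent on every turn
    history.append({"role": "assistant", "content": reply})
    return reply

handle_turn("What interest rates do you offer?")
handle_turn("And what documents do I need?")
assert len(history) == 4  # history grows without bound as turns accumulate
```

The key property is in the `call_llm(history)` line: nothing is ever pruned, so every turn pays for everything that came before it.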
This works reasonably well for short conversations. A 5-turn chat about a single topic fits comfortably within context limits and generates focused responses. But financial conversations are not 5-turn chats about a single topic.
A typical loan origination conversation involves the borrower discussing their income, switching to ask about interest rates, returning to employment details, asking about documentation requirements, and then circling back to loan amounts. In a 30-minute conversation, a borrower might pivot between topics fifteen or more times.
What Happens in Practice
We measured three failure modes in production conversations using the full-history approach:
Token cost explosion. A 30-minute conversation generates roughly 15,000–20,000 tokens of history. Passing this on every turn means your per-query cost grows linearly with conversation length, and your total cost per conversation grows quadratically, since each turn re-sends everything that came before it. For a high-volume deployment processing thousands of conversations daily, this becomes economically unsustainable.
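A toy calculation makes the growth concrete. The per-turn token figure below is an illustrative assumption, not a measured value; the shape of the curve is the point:

```python
# Toy cost model: each turn re-sends all previously accumulated tokens.
# TOKENS_PER_TURN is an assumed figure for illustration only.
TOKENS_PER_TURN = 400

def cumulative_tokens(turns):
    # Turn k sends the tokens accumulated over turns 1..k.
    return sum(TOKENS_PER_TURN * k for k in range(1, turns + 1))

print(cumulative_tokens(10))  # 22000 tokens billed across 10 turns
print(cumulative_tokens(40))  # 328000 tokens: ~15x the cost for 4x the turns
```

Quadrupling the conversation length multiplies the billed tokens by roughly fifteen, which is exactly the quadratic blow-up described above.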
Latency degradation. Larger context windows mean longer inference times. Response latency that starts at 300ms in turn 1 can reach 2–3 seconds by turn 40. In a voice-based conversation, this delay is immediately noticeable and breaks the natural flow.
Context confusion. This is the most insidious problem. When a borrower asks about interest rates after a long discussion about their employment, the model often "bleeds" context — referencing employment details in its interest rate response, or worse, generating contradictory information because it's trying to reconcile unrelated context segments.
The Sentia Approach
Sentia addresses this by treating conversation history not as a flat log, but as a structured graph of topic segments. When a user speaks, Sentia first classifies the utterance against active topic threads. If the utterance continues an existing thread, only that thread's context is passed to the model. If it represents a pivot to a new or previously active topic, Sentia switches context accordingly.
The result is that the model always receives a focused, relevant context window — typically 2,000–4,000 tokens regardless of total conversation length. Token costs stay flat. Latency stays consistent. And context confusion drops to near zero.
The Numbers
In our production deployment with the Government of Jharkhand, Sentia reduced average token usage per query by 70%, maintained sub-second response times across conversations exceeding 40 turns, and eliminated context confusion errors that previously affected roughly 12% of responses in long conversations.
What This Means for Production AI
If you're building conversational AI for any domain where conversations are long and multi-topic — financial services, healthcare intake, legal consultation — the full-history approach will hit a wall. The question isn't whether, but when.
Selective context isn't just an optimization. It's an architectural paradigm that changes what's possible in production conversational AI. We built Sentia because we needed it for our own agents. We're sharing these insights because we believe the industry needs it too.
Amit Pandey is the CTO of Augmen (Paaw Innovations), with 25 years of experience in AI and software engineering. Sentia is the core engine powering all Augmen AI agents.