Document intelligence built for Indian financial services. Today: PaddleOCR-powered layout analysis + a fine-tuned SLM for understanding. Tomorrow: our own sub-1B Vision Language Model trained on Indian financial documents — Udyam certificates, machinery invoices, bank statements, and handwritten applications in 10+ Indian scripts.
Today, DocSense uses a two-stage architecture: PaddleOCR (PP-StructureV3 + PP-OCRv5) handles the visual understanding — detecting layout regions, reading text in multiple scripts, parsing tables and forms. A fine-tuned SLM then reasons over the extracted structured data to answer questions, validate fields, and flag errors.
Why two stages? PaddleOCR's PP-StructureV3 achieves state-of-the-art document parsing on the OmniDocBench benchmark with models under 100M parameters — far lighter than billion-parameter VLMs. By separating visual extraction from reasoning, we can upgrade each stage independently. The OCR layer handles what it does best (reading pixels), and the SLM layer handles what it does best (understanding meaning).
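A sketch of the data contract between the two stages, with the OCR stage stubbed so the handoff is visible. The region schema, file name, and function names here are illustrative, not the production interface:

```python
import json

# Stage 1 (stubbed): in DocSense this would be PaddleOCR / PP-StructureV3.
# The stub returns the kind of structured layout output the SLM consumes.
def run_ocr_stage(image_path: str) -> dict:
    return {
        "source": image_path,
        "regions": [
            {"type": "title", "bbox": [40, 30, 560, 70],
             "text": "Udyam Registration Certificate"},
            {"type": "field", "bbox": [40, 120, 560, 150],
             "text": "UDYAM-MH-12-0001234"},
        ],
    }

# Stage 2 input: the SLM reasons over extracted structured text, not pixels.
def build_slm_prompt(ocr_output: dict, question: str) -> str:
    context = json.dumps(ocr_output["regions"], ensure_ascii=False, indent=2)
    return f"Document regions:\n{context}\n\nQuestion: {question}"

prompt = build_slm_prompt(run_ocr_stage("udyam.png"),
                          "What is the Udyam number?")
```

Because the interface between the stages is plain structured data, either side can be swapped out (a better OCR model, a larger SLM) without touching the other.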
Indian MSME loan origination involves a unique set of documents — many of which are handwritten, bilingual, or in non-standard formats that generic OCR systems struggle with.
MSME registration. Extract Udyam number, enterprise type, NIC code, investment & turnover classification. Validate against Udyam portal.
Quotations and invoices from equipment suppliers. Extract item descriptions, quantities, costs, GST amounts, supplier GSTIN. Often handwritten or semi-printed.
Multi-format PDF/scanned statements. Parse transaction tables, compute monthly averages, identify EMI patterns, salary credits, and cash flow trends for underwriting.
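The underwriting aggregates above can be sketched over a simplified transaction schema. The field layout and the EMI heuristic (same description and amount recurring across months) are illustrative, not the production logic:

```python
from collections import defaultdict
from datetime import date

# Assumed schema: (date, description, amount), negative amounts are debits.
txns = [
    (date(2025, 1, 5), "SALARY CREDIT", 50000.0),
    (date(2025, 1, 7), "EMI HDFC LOAN", -8500.0),
    (date(2025, 2, 5), "SALARY CREDIT", 50000.0),
    (date(2025, 2, 7), "EMI HDFC LOAN", -8500.0),
    (date(2025, 3, 7), "EMI HDFC LOAN", -8500.0),
]

def monthly_net(txns):
    """Net inflow per (year, month) — one of the monthly averages."""
    totals = defaultdict(float)
    for d, _, amt in txns:
        totals[(d.year, d.month)] += amt
    return dict(totals)

def likely_emis(txns, min_months=3):
    """Debits with identical description and amount in >= min_months months."""
    seen = defaultdict(set)
    for d, desc, amt in txns:
        if amt < 0:
            seen[(desc, amt)].add((d.year, d.month))
    return [k for k, months in seen.items() if len(months) >= min_months]
```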
Identity document extraction with face photo, masked Aadhaar number, PAN validation. Cross-match names and DOB across documents.
GSTR-1, GSTR-3B extraction. Parse turnover figures, tax liability, ITC claims. Verify GSTIN status and return filing consistency.
Income Tax Returns, Form 16, salary slips. Extract gross income, deductions, tax paid. Handle both printed ITR-V and scanned copies.
Sale deeds, encumbrance certificates, title documents. Often in regional languages (Hindi, Marathi) with legal terminology. Extract owner name, area, registration number.
Loan application forms, declarations, guarantor letters. Borrowers in rural India often submit handwritten documents. DocSense handles Devanagari handwriting.
Purchase invoices, kachha bills, delivery challans from small businesses. Often informal, handwritten, without standard formatting.
Electricity, water, gas bills for address proof. Extract consumer name, address, and billing period. Handle state-specific utility formats.
Balance sheets, P&L statements from CAs. Parse complex table structures, extract key ratios, validate arithmetic consistency.
Credit Guarantee Fund Trust certificates. Extract guarantee number, coverage amount, validity period for Mudra and CGTMSE-backed loans.
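Several of the identifiers above have publicly documented shapes that can be checked before any registry lookup. A format-only sketch — the regexes encode the published patterns, but real validation also needs check-digit verification and live status checks against the Udyam portal and GSTN:

```python
import re

# Published identifier shapes: Udyam (UDYAM-<state>-<2 digits>-<7 digits>),
# PAN (5 letters, 4 digits, 1 letter), GSTIN (state code + PAN + entity code
# + literal 'Z' + check character). Format checks only, not full validation.
PATTERNS = {
    "udyam": re.compile(r"UDYAM-[A-Z]{2}-\d{2}-\d{7}"),
    "pan":   re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "gstin": re.compile(r"\d{2}[A-Z]{5}\d{4}[A-Z][0-9A-Z]Z[0-9A-Z]"),
}

def check_format(kind: str, value: str) -> bool:
    """True if the cleaned value matches the documented shape exactly."""
    return bool(PATTERNS[kind].fullmatch(value.strip().upper()))
```

A failed format check this early lets the conversation ask the borrower to re-submit before anything reaches the underwriter.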
The current two-stage pipeline (OCR → LLM) works well but has a fundamental limitation: visual context is lost in the handoff. A Vision Language Model sees the document as an image and understands it as text simultaneously — capturing layout, spatial relationships, stamps, signatures, and handwriting in a single forward pass. PaddleOCR-VL (0.9B) demonstrated this is achievable at sub-1B scale. We're building our own.
PaddleOCR 3.0 (PP-StructureV3 + PP-OCRv5) for visual extraction, fine-tuned SLM for reasoning. Handles printed documents in 10+ Indian scripts. Table extraction for bank statements. In-conversation document processing.
Collecting and annotating the training dataset: 50K+ Indian financial documents across all 12 document types above. Bounding box annotations, field labels, Hindi-English bilingual text, handwriting samples. This dataset is the foundation — no VLM can be trained without domain-specific data.
Our own Vision Language Model, architecturally inspired by PaddleOCR-VL: a NaViT-style dynamic resolution vision encoder paired with an Indian-language fine-tuned LLM (replacing ERNIE with a Devanagari-capable model). Single forward pass: image in → structured understanding out. No separate OCR step needed.
PaddleOCR-VL proved that a 0.9B parameter model (NaViT vision encoder + ERNIE-4.5-0.3B language model) can rival billion-parameter VLMs on document parsing benchmarks. We're adapting this architecture for Indian financial documents with an Indian-language LLM backbone.
Native Variable Resolution Vision Transformer. Unlike fixed-resolution ViT, NaViT handles documents at their natural resolution — a full A4 bank statement and a small PAN card are processed at their actual pixel dimensions. Critical for Indian documents that vary wildly in size, quality, and DPI. PaddleOCR-VL demonstrated this enables accurate parsing without expensive image resizing.
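Back-of-the-envelope arithmetic for why native resolution matters, assuming a 14-pixel patch purely for illustration (PaddleOCR-VL's actual patching configuration may differ): token count tracks the document's real size instead of a fixed resize target.

```python
PATCH = 14  # assumed patch edge in pixels, for arithmetic only

def token_count(width: int, height: int, patch: int = PATCH) -> int:
    """Vision tokens if the image is patchified at its native size."""
    return (width // patch) * (height // patch)

# An A4 scan at 150 DPI keeps a large grid, so fine print stays legible;
# a small PAN card crop costs far fewer tokens and is never upscaled.
a4_tokens = token_count(1240, 1754)
pan_tokens = token_count(430, 270)
```

A fixed-resolution ViT would squeeze both images to the same grid, blurring the A4 statement's small text and wasting compute on the card.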
PaddleOCR-VL uses ERNIE-4.5-0.3B (Chinese-English). We'll replace this with a Devanagari-native language model fine-tuned on Indian financial terminology. The LM doesn't need to be large — PaddleOCR-VL proved 0.3B is sufficient when paired with a strong vision encoder. Our target: 0.3–0.5B parameters for the LM component, trained on Hindi, Marathi, Bengali, Tamil, and English financial text.
PaddleOCR-VL-0.9B outperforms GPT-4V and Gemini on document parsing benchmarks with 100× fewer parameters. Small models mean: a single L4 GPU shared with STT and TTS, inference under 2 seconds per page, on-prem deployment in any bank's existing infrastructure, and no cloud dependency. The constraint isn't ambition; it's that our customers need models that run on their hardware.
Building a domain-specific VLM isn't a side project. It requires a dedicated dataset, sustained compute, and a specialized team. Here's what we're committing.
Annotated Indian financial documents across 12 types. Bounding boxes, field labels, script identification, handwriting samples. This is the highest-cost component — quality data is everything.
Data annotation (~₹15–20L), GPU compute for training (~₹10–15L on A100 clusters), and 2–3 ML engineers for 12 months. This is a serious R&D commitment for a startup our size.
Dataset collection: 3 months (ongoing). Architecture experiments: 3 months. Training & evaluation: 3 months. Production hardening: 3 months. First production model targeted H2 2026.
We plan to open-source the base model weights for the Indian document VLM — giving back to the community that gave us PaddleOCR. Domain-specific fine-tuning (fintech fields, validation logic) remains proprietary.
Why we believe this is achievable: PaddleOCR-VL proved a 0.9B model can match or beat multi-billion parameter VLMs on document parsing. We're not building a general-purpose vision model — we're building a specialist that only needs to understand Indian financial documents. A narrower domain means less data, less compute, and a smaller model. The hardest part isn't the architecture — it's the dataset. That's where we're investing most heavily.
Borrower takes a photo during the loan conversation. DocSense processes it in real-time — extracting fields, validating data, and asking clarifying questions if information is missing or unclear. No separate upload flow needed.
Automatically flags inconsistencies across documents: name mismatch between Aadhaar and PAN, address discrepancy between utility bill and loan application, Udyam number format validation, GSTIN status check. Catches errors before they reach the underwriter.
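One such rule, fuzzy name matching across KYC documents, can be sketched with the standard library. The normalization and threshold here are illustrative, not the production tuning:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Uppercase and collapse whitespace before comparison."""
    return " ".join(name.upper().split())

def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match tolerant of casing, spacing, and minor OCR noise."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

In practice a real matcher also has to handle initials, honorifics, and transliteration variants (e.g. the same Hindi name romanized two ways), which simple edit-distance ratios only partially cover.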
PP-OCRv5 supports Hindi (Devanagari), Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Marathi, Odia, Punjabi (Gurmukhi), and English — in a single model. Code-switched documents (Hindi headers with English body text) handled natively.
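Tagging which scripts appear in extracted text falls out of standard Unicode block ranges, which is useful for routing code-switched documents. A minimal sketch; the ranges are the Unicode Standard's block boundaries:

```python
# Unicode block ranges for the scripts listed above (plus basic Latin).
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Odia":       (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
    "Latin":      (0x0041, 0x007A),
}

def scripts_in(text: str) -> set:
    """Set of script names whose letters occur in the text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= ord(ch) <= hi:
                found.add(name)
    return found
```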
Bank statements, financial statements, GST returns — all rely on table parsing. PP-StructureV3's table recognition handles complex layouts including merged cells, nested headers, and tables that continue across multiple scanned pages.
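Merged cells are the usual failure point downstream: field extraction wants a rectangular table, not span annotations. A sketch of expanding span-annotated cells into a dense grid — the cell schema here is assumed for illustration, not PP-StructureV3's actual output format:

```python
def expand(cells, n_rows, n_cols):
    """Expand (row, col, rowspan, colspan, text) cells into a dense grid.

    Merged regions repeat the cell's text so every grid position is filled,
    which makes column-wise field extraction straightforward.
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r, c, rs, cs, text in cells:
        for i in range(r, r + rs):
            for j in range(c, c + cs):
                grid[i][j] = text
    return grid

# A typical bank-statement header: "Date" spans two rows, "Amount" spans
# the Debit and Credit columns.
header = [(0, 0, 2, 1, "Date"), (0, 1, 1, 2, "Amount"),
          (1, 1, 1, 1, "Debit"), (1, 2, 1, 1, "Credit")]
grid = expand(header, 2, 3)
```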
Flags documents with potential tampering: inconsistent fonts, pixel-level editing artifacts, misaligned text. Also detects expired documents, invalid format numbers (wrong Aadhaar check digit, invalid PAN format), and incomplete submissions.
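The Aadhaar check digit mentioned above uses the Verhoeff scheme. A self-contained sketch using the standard published tables (the dihedral-group D5 multiplication, permutation, and inverse tables); this is the public algorithm, not DocSense's code:

```python
# Standard Verhoeff tables: D = group multiplication, P = position
# permutation, INV = multiplicative inverse.
D = [[0,1,2,3,4,5,6,7,8,9],[1,2,3,4,0,6,7,8,9,5],[2,3,4,0,1,7,8,9,5,6],
     [3,4,0,1,2,8,9,5,6,7],[4,0,1,2,3,9,5,6,7,8],[5,9,8,7,6,0,4,3,2,1],
     [6,5,9,8,7,1,0,4,3,2],[7,6,5,9,8,2,1,0,4,3],[8,7,6,5,9,3,2,1,0,4],
     [9,8,7,6,5,4,3,2,1,0]]
P = [[0,1,2,3,4,5,6,7,8,9],[1,5,7,6,2,8,3,0,9,4],[5,8,0,3,7,9,6,1,4,2],
     [8,9,1,6,0,4,3,5,2,7],[9,4,5,3,1,2,6,8,7,0],[4,2,8,6,5,7,3,9,0,1],
     [2,7,9,3,8,0,6,4,1,5],[7,0,4,6,9,1,3,2,5,8]]
INV = [0,4,3,2,1,5,6,7,8,9]

def verhoeff_valid(number: str) -> bool:
    """True iff the trailing digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0

def verhoeff_digit(payload: str) -> str:
    """Check digit to append so payload + digit validates."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])
```

Unlike a simple mod-10 sum, Verhoeff catches all single-digit errors and adjacent transpositions, so a failed check on a 12-digit number is a strong signal of a typo, an OCR misread, or a fabricated number.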
Extending PaddleOCR's handwriting recognition for Devanagari script — critical for rural loan applications where borrowers fill forms by hand. Training on field-collected samples from Jharkhand, UP, and Bihar. This is one of the hardest unsolved problems in Indian document AI.