Document intelligence built for Indian financial services. Today: PaddleOCR-powered layout analysis + a fine-tuned SLM for understanding. Tomorrow: our own sub-1B Vision Language Model trained on Indian financial documents — Udyam certificates, machinery invoices, bank statements, and handwritten applications in 10+ Indian scripts.
Today, DocSense uses a two-stage architecture: PaddleOCR (PP-StructureV3 + PP-OCRv5) handles the visual understanding — detecting layout regions, reading text in multiple scripts, parsing tables and forms. A fine-tuned SLM then reasons over the extracted structured data to answer questions, validate fields, and flag errors.
Why two stages? PaddleOCR's PP-StructureV3 achieves state-of-the-art document parsing on the OmniDocBench benchmark with models under 100M parameters — far lighter than billion-parameter VLMs. By separating visual extraction from reasoning, we can upgrade each stage independently. The OCR layer handles what it does best (reading pixels), and the SLM layer handles what it does best (understanding meaning).
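A sketch of the data contract between the two stages, with the OCR stage stubbed so the handoff is visible. The region schema, file name, and function names here are illustrative, not the production interface:

```python
import json

# Stage 1 (stubbed): in DocSense this would be PaddleOCR / PP-StructureV3.
# The stub returns the kind of structured layout output the SLM consumes.
def run_ocr_stage(image_path: str) -> dict:
    return {
        "source": image_path,
        "regions": [
            {"type": "title", "bbox": [40, 30, 560, 70],
             "text": "Udyam Registration Certificate"},
            {"type": "field", "bbox": [40, 120, 560, 150],
             "text": "UDYAM-MH-12-0001234"},
        ],
    }

# Stage 2 input: the SLM reasons over extracted structured text, not pixels.
def build_slm_prompt(ocr_output: dict, question: str) -> str:
    context = json.dumps(ocr_output["regions"], ensure_ascii=False, indent=2)
    return f"Document regions:\n{context}\n\nQuestion: {question}"

prompt = build_slm_prompt(run_ocr_stage("udyam.png"),
                          "What is the Udyam number?")
```

Because the interface between the stages is plain structured data, either side can be swapped out (a better OCR model, a larger SLM) without touching the other.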
Indian MSME loan origination involves a unique set of documents — many of which are handwritten, bilingual, or in non-standard formats that generic OCR systems struggle with.
MSME registration. Extract Udyam number, enterprise type, NIC code, investment & turnover classification. Validate against Udyam portal.
Quotations and invoices from equipment suppliers. Extract item descriptions, quantities, costs, GST amounts, supplier GSTIN. Often handwritten or semi-printed.
Multi-format PDF/scanned statements. Parse transaction tables, compute monthly averages, identify EMI patterns, salary credits, and cash flow trends for underwriting.
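The underwriting aggregates above can be sketched over a simplified transaction schema. The field layout and the EMI heuristic (same description and amount recurring across months) are illustrative, not the production logic:

```python
from collections import defaultdict
from datetime import date

# Assumed schema: (date, description, amount), negative amounts are debits.
txns = [
    (date(2025, 1, 5), "SALARY CREDIT", 50000.0),
    (date(2025, 1, 7), "EMI HDFC LOAN", -8500.0),
    (date(2025, 2, 5), "SALARY CREDIT", 50000.0),
    (date(2025, 2, 7), "EMI HDFC LOAN", -8500.0),
    (date(2025, 3, 7), "EMI HDFC LOAN", -8500.0),
]

def monthly_net(txns):
    """Net inflow per (year, month) — one of the monthly averages."""
    totals = defaultdict(float)
    for d, _, amt in txns:
        totals[(d.year, d.month)] += amt
    return dict(totals)

def likely_emis(txns, min_months=3):
    """Debits with identical description and amount in >= min_months months."""
    seen = defaultdict(set)
    for d, desc, amt in txns:
        if amt < 0:
            seen[(desc, amt)].add((d.year, d.month))
    return [k for k, months in seen.items() if len(months) >= min_months]
```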
Identity document extraction with face photo, masked Aadhaar number, PAN validation. Cross-match names and DOB across documents.
GSTR-1, GSTR-3B extraction. Parse turnover figures, tax liability, ITC claims. Verify GSTIN status and return filing consistency.
Income Tax Returns, Form 16, salary slips. Extract gross income, deductions, tax paid. Handle both printed ITR-V and scanned copies.
Sale deeds, encumbrance certificates, title documents. Often in regional languages (Hindi, Marathi) with legal terminology. Extract owner name, area, registration number.
Loan application forms, declarations, guarantor letters. Borrowers in rural India often submit handwritten documents. DocSense handles Devanagari handwriting.
Purchase invoices, kachha bills, delivery challans from small businesses. Often informal, handwritten, without standard formatting.
Electricity, water, gas bills for address proof. Extract consumer name, address, and billing period. Handle state-specific utility formats.
Balance sheets, P&L statements from CAs. Parse complex table structures, extract key ratios, validate arithmetic consistency.
Credit Guarantee Fund Trust certificates. Extract guarantee number, coverage amount, validity period for Mudra and CGTMSE-backed loans.
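Several of the identifiers above have publicly documented shapes that can be checked before any registry lookup. A format-only sketch — the regexes encode the published patterns, but real validation also needs check-digit verification and live status checks against the Udyam portal and GSTN:

```python
import re

# Published identifier shapes: Udyam (UDYAM-<state>-<2 digits>-<7 digits>),
# PAN (5 letters, 4 digits, 1 letter), GSTIN (state code + PAN + entity code
# + literal 'Z' + check character). Format checks only, not full validation.
PATTERNS = {
    "udyam": re.compile(r"UDYAM-[A-Z]{2}-\d{2}-\d{7}"),
    "pan":   re.compile(r"[A-Z]{5}\d{4}[A-Z]"),
    "gstin": re.compile(r"\d{2}[A-Z]{5}\d{4}[A-Z][0-9A-Z]Z[0-9A-Z]"),
}

def check_format(kind: str, value: str) -> bool:
    """True if the cleaned value matches the documented shape exactly."""
    return bool(PATTERNS[kind].fullmatch(value.strip().upper()))
```

A failed format check this early lets the conversation ask the borrower to re-submit before anything reaches the underwriter.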
The current two-stage pipeline (OCR → LLM) works well but has a fundamental limitation: visual context is lost in the handoff. A Vision Language Model sees the document as an image and understands it as text simultaneously — capturing layout, spatial relationships, stamps, signatures, and handwriting in a single forward pass. PaddleOCR-VL (0.9B) demonstrated this is achievable at sub-1B scale. We're building our own.
PaddleOCR 3.0 (PP-StructureV3 + PP-OCRv5) for visual extraction, fine-tuned SLM for reasoning. Handles printed documents in 10+ Indian scripts. Table extraction for bank statements. In-conversation document processing.
Collecting and annotating the training dataset: 50K+ Indian financial documents across all 12 document types above. Bounding box annotations, field labels, Hindi-English bilingual text, handwriting samples. This dataset is the foundation — no VLM can be trained without domain-specific data.
Our own Vision Language Model, architecturally inspired by PaddleOCR-VL: a NaViT-style dynamic resolution vision encoder paired with an Indian-language fine-tuned LLM (replacing ERNIE with a Devanagari-capable model). Single forward pass: image in → structured understanding out. No separate OCR step needed.
PaddleOCR-VL proved that a 0.9B parameter model (NaViT vision encoder + ERNIE-4.5-0.3B language model) can rival billion-parameter VLMs on document parsing benchmarks. We're adapting this architecture for Indian financial documents with an Indian-language LLM backbone.
Native Variable Resolution Vision Transformer. Unlike fixed-resolution ViT, NaViT handles documents at their natural resolution — a full A4 bank statement and a small PAN card are processed at their actual pixel dimensions. Critical for Indian documents that vary wildly in size, quality, and DPI. PaddleOCR-VL demonstrated this enables accurate parsing without expensive image resizing.
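Back-of-the-envelope arithmetic for why native resolution matters, assuming a 14-pixel patch purely for illustration (PaddleOCR-VL's actual patching configuration may differ): token count tracks the document's real size instead of a fixed resize target.

```python
PATCH = 14  # assumed patch edge in pixels, for arithmetic only

def token_count(width: int, height: int, patch: int = PATCH) -> int:
    """Vision tokens if the image is patchified at its native size."""
    return (width // patch) * (height // patch)

# An A4 scan at 150 DPI keeps a large grid, so fine print stays legible;
# a small PAN card crop costs far fewer tokens and is never upscaled.
a4_tokens = token_count(1240, 1754)
pan_tokens = token_count(430, 270)
```

A fixed-resolution ViT would squeeze both images to the same grid, blurring the A4 statement's small text and wasting compute on the card.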
PaddleOCR-VL uses ERNIE-4.5-0.3B (Chinese-English). We'll replace this with a Devanagari-native language model fine-tuned on Indian financial terminology. The LM doesn't need to be large — PaddleOCR-VL proved 0.3B is sufficient when paired with a strong vision encoder. Our target: 0.3–0.5B parameters for the LM component, trained on Hindi, Marathi, Bengali, Tamil, and English financial text.
PaddleOCR-VL-0.9B outperforms GPT-4V and Gemini on document parsing benchmarks with 100× fewer parameters. Small models mean: a single L4 GPU shared with STT and TTS, inference under 2 seconds per page, on-prem deployment in any bank's existing infrastructure, and no cloud dependency. The constraint isn't ambition; it's that our customers need models that run on their hardware.
Building a domain-specific VLM isn't a side project. It requires a dedicated dataset, sustained compute, and a specialized team. Here's what we're committing.
Annotated Indian financial documents across 12 types. Bounding boxes, field labels, script identification, handwriting samples. This is the highest-cost component — quality data is everything.
Data annotation (~₹15–20L), GPU compute for training (~₹10–15L on A100 clusters), and 2–3 ML engineers for 12 months. This is a serious R&D commitment for a startup our size.
Dataset collection: 3 months (ongoing). Architecture experiments: 3 months. Training & evaluation: 3 months. Production hardening: 3 months. First production model targeted H2 2026.
We plan to open-source the base model weights for the Indian document VLM — giving back to the community that gave us PaddleOCR. Domain-specific fine-tuning (fintech fields, validation logic) remains proprietary.
Why we believe this is achievable: PaddleOCR-VL proved a 0.9B model can match or beat multi-billion parameter VLMs on document parsing. We're not building a general-purpose vision model — we're building a specialist that only needs to understand Indian financial documents. A narrower domain means less data, less compute, and a smaller model. The hardest part isn't the architecture — it's the dataset. That's where we're investing most heavily.
Borrower takes a photo during the loan conversation. DocSense processes it in real-time — extracting fields, validating data, and asking clarifying questions if information is missing or unclear. No separate upload flow needed.
Automatically flags inconsistencies across documents: name mismatch between Aadhaar and PAN, address discrepancy between utility bill and loan application, Udyam number format validation, GSTIN status check. Catches errors before they reach the underwriter.
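One such rule, fuzzy name matching across KYC documents, can be sketched with the standard library. The normalization and threshold here are illustrative, not the production tuning:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Uppercase and collapse whitespace before comparison."""
    return " ".join(name.upper().split())

def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match tolerant of casing, spacing, and minor OCR noise."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

In practice a real matcher also has to handle initials, honorifics, and transliteration variants (e.g. the same Hindi name romanized two ways), which simple edit-distance ratios only partially cover.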
PP-OCRv5 supports Hindi (Devanagari), Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Marathi, Odia, Punjabi (Gurmukhi), and English — in a single model. Code-switched documents (Hindi headers with English body text) handled natively.
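Tagging which scripts appear in extracted text falls out of standard Unicode block ranges, which is useful for routing code-switched documents. A minimal sketch; the ranges are the Unicode Standard's block boundaries:

```python
# Unicode block ranges for the scripts listed above (plus basic Latin).
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Odia":       (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
    "Latin":      (0x0041, 0x007A),
}

def scripts_in(text: str) -> set:
    """Set of script names whose letters occur in the text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= ord(ch) <= hi:
                found.add(name)
    return found
```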
Bank statements, financial statements, GST returns — all rely on table parsing. PP-StructureV3's table recognition handles complex layouts including merged cells, nested headers, and tables that continue across multiple scanned pages.
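Merged cells are the usual failure point downstream: field extraction wants a rectangular table, not span annotations. A sketch of expanding span-annotated cells into a dense grid — the cell schema here is assumed for illustration, not PP-StructureV3's actual output format:

```python
def expand(cells, n_rows, n_cols):
    """Expand (row, col, rowspan, colspan, text) cells into a dense grid.

    Merged regions repeat the cell's text so every grid position is filled,
    which makes column-wise field extraction straightforward.
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r, c, rs, cs, text in cells:
        for i in range(r, r + rs):
            for j in range(c, c + cs):
                grid[i][j] = text
    return grid

# A typical bank-statement header: "Date" spans two rows, "Amount" spans
# the Debit and Credit columns.
header = [(0, 0, 2, 1, "Date"), (0, 1, 1, 2, "Amount"),
          (1, 1, 1, 1, "Debit"), (1, 2, 1, 1, "Credit")]
grid = expand(header, 2, 3)
```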
Flags documents with potential tampering: inconsistent fonts, pixel-level editing artifacts, misaligned text. Also detects expired documents, invalid format numbers (wrong Aadhaar check digit, invalid PAN format), and incomplete submissions.
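The Aadhaar check digit mentioned above uses the Verhoeff scheme. A self-contained sketch using the standard published tables (the dihedral-group D5 multiplication, permutation, and inverse tables); this is the public algorithm, not DocSense's code:

```python
# Standard Verhoeff tables: D = group multiplication, P = position
# permutation, INV = multiplicative inverse.
D = [[0,1,2,3,4,5,6,7,8,9],[1,2,3,4,0,6,7,8,9,5],[2,3,4,0,1,7,8,9,5,6],
     [3,4,0,1,2,8,9,5,6,7],[4,0,1,2,3,9,5,6,7,8],[5,9,8,7,6,0,4,3,2,1],
     [6,5,9,8,7,1,0,4,3,2],[7,6,5,9,8,2,1,0,4,3],[8,7,6,5,9,3,2,1,0,4],
     [9,8,7,6,5,4,3,2,1,0]]
P = [[0,1,2,3,4,5,6,7,8,9],[1,5,7,6,2,8,3,0,9,4],[5,8,0,3,7,9,6,1,4,2],
     [8,9,1,6,0,4,3,5,2,7],[9,4,5,3,1,2,6,8,7,0],[4,2,8,6,5,7,3,9,0,1],
     [2,7,9,3,8,0,6,4,1,5],[7,0,4,6,9,1,3,2,5,8]]
INV = [0,4,3,2,1,5,6,7,8,9]

def verhoeff_valid(number: str) -> bool:
    """True iff the trailing digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0

def verhoeff_digit(payload: str) -> str:
    """Check digit to append so payload + digit validates."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])
```

Unlike a simple mod-10 sum, Verhoeff catches all single-digit errors and adjacent transpositions, so a failed check on a 12-digit number is a strong signal of a typo, an OCR misread, or a fabricated number.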
Extending PaddleOCR's handwriting recognition for Devanagari script — critical for rural loan applications where borrowers fill forms by hand. Training on field-collected samples from Jharkhand, UP, and Bihar. This is one of the hardest unsolved problems in Indian document AI.