Docusure.AI: Intelligent Document Workflow

From No Labeled Dataset to Robust Document AI

Our Approach to Practical, Incremental Document AI

By Frank Sommers

October 19, 2025

Infrastructure Backbone

  • Document storage is abstracted through repository integrations so different backends can be connected. The current deployment uses Google Drive, pulling document binaries via the Drive API while keeping only references (no document data) in Docugym.

  • Authentication is handled through Google Auth, letting users sign in with their Google identities and aligning Drive permissions with corpus access control. This can be extended to other authentication providers and to other storage backends such as S3, OneDrive, and Dropbox.

  • Postgres stores corpus metadata and all page embeddings, and the platform leans on pgvector for similarity search, centroid statistics, HNSW indexes, and the MaxSim ranking that powers the late-interaction, multivector-based, two-stage classification (sketched below).
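
As a concrete illustration of the pgvector role, the sketch below shows a minimal stage-1 centroid lookup in Python. It is a sketch only: the class_centroids table, its class_id and centroid columns, the 128-dimension centroid, and the connection string are assumed names for illustration, not Docugym's actual schema.

    import numpy as np
    import psycopg
    from pgvector.psycopg import register_vector

    conn = psycopg.connect("dbname=docugym")  # placeholder connection string
    register_vector(conn)  # lets psycopg send numpy arrays as pgvector values

    # Unit-normalized page centroid (stand-in values; dimension assumed).
    page_centroid = np.random.rand(128).astype(np.float32)
    page_centroid /= np.linalg.norm(page_centroid)

    # <=> is pgvector's cosine-distance operator; an HNSW index built with
    # vector_cosine_ops serves the ORDER BY without a sequential scan.
    rows = conn.execute(
        """
        SELECT class_id, 1 - (centroid <=> %s) AS cosine_similarity
        FROM class_centroids
        ORDER BY centroid <=> %s
        LIMIT 5
        """,
        (page_centroid, page_centroid),
    ).fetchall()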

Model Configuration & Deployment

  • Flexible Model Architecture: The platform supports both embedding models and vision-language models (VLMs) through a unified configuration system. Each corpus can independently select its embedding model for document similarity and VLM for entity extraction and other tasks, allowing organizations to optimize for their specific use cases and balance cost versus performance.

  • Task-Based VLM Configuration: VLM functionality is organized around configurable tasks with specific purposes—entity extraction, prompt discovery, and labeling guidance. Tasks are defined globally, then configured per corpus with custom prompts, enabling reusable extraction patterns while maintaining corpus-specific customization. Document types can further specialize tasks for layout-aware extraction.

  • Hierarchical Prompt System: The platform implements a three-tier prompt hierarchy: base prompts provide default extraction instructions, corpus-level custom prompts override them for specific domains, and document type prompts enable layout-specific extraction strategies. This cascade ensures prompts become progressively more targeted while maintaining fallback defaults (a resolution sketch follows this list).

  • Serverless Model Deployment: Models are deployed on serverless infrastructure (currently RunPod) with configurable endpoints, supporting both synchronous processing results and asynchronous execution with webhook callbacks for queued operations. Further, models can be deployed locally on the serverless endpoint (using vLLM or Hugging Face Transformers) or via API-based model provider endpoints. The abstraction layer allows switching between models and model providers while maintaining consistent API interfaces.

  • Performance Optimization: Page-level embedding caches, centroid-based pre-filtering, and optimized MaxSim operations ensure scalable performance even with large document corpora.
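
To make the prompt cascade concrete, here is a minimal resolution sketch, assuming prompts are exposed as simple per-tier mappings (the actual data model is not specified in this document). The most specific tier that defines a prompt for a task wins, with the base prompts as the fallback.

    from typing import Mapping

    def resolve_prompt(task: str,
                       base: Mapping[str, str],
                       corpus: Mapping[str, str],
                       doc_type: Mapping[str, str]) -> str:
        """Return the most specific prompt for a task: document-type
        prompt, else corpus-level override, else base default."""
        for tier in (doc_type, corpus, base):
            if task in tier:
                return tier[task]
        raise KeyError(f"no prompt defined for task {task!r}")

    # Usage: the document-type tier wins when it specializes the task.
    base = {"entity_extraction": "Extract all named fields from the page."}
    corpus = {"entity_extraction": "Extract the invoice vendor, total, and date."}
    doc_type = {"entity_extraction": "Two-column invoice: read totals from the right column."}
    print(resolve_prompt("entity_extraction", base, corpus, doc_type))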

Embedding Types

The system stores and uses four types of embeddings for each document page in the corpus; a sketch showing how these representations can be derived follows the list.

  • ColQwen2.5 Multivector Embeddings: L2 normalized multivector representations that capture fine-grained semantic information about document layout and content. ColBERT-style multivectors are used for MaxSim() comparison.

  • Binary Quantized Embeddings: A binary quantized version of the ColQwen2.5 embeddings for efficient similarity computation with MaxSim() using Hamming distance.

  • Centroid Vector: A unit-normalized centroid computed from the ColQwen embeddings, providing a compressed representation of the document's semantic content.

  • Class Centroid Vectors: Unit-normalized centroid vectors maintained for each document class, representing the prototypical embedding for that class.
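
The sketch below shows one plausible way to derive these representations from a page's ColQwen2.5 multivector. The dimensions are placeholders, and sign-based one-bit quantization is an assumption; the document specifies that binary quantization and Hamming distance are used, but not the exact scheme.

    import numpy as np

    def page_representations(multivec: np.ndarray):
        """Derive stored representations from a page multivector of
        shape (num_tokens, dim)."""
        # L2-normalize each token vector (ColBERT-style multivectors).
        tokens = multivec / np.linalg.norm(multivec, axis=1, keepdims=True)
        # Unit-normalized page centroid: mean of token vectors, renormalized.
        centroid = tokens.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        # Binary quantization: one bit per dimension, packed so Hamming
        # distance reduces to XOR plus popcount.
        binary = np.packbits(tokens > 0, axis=1)
        return tokens, centroid, binary

    def class_centroid(page_centroids: np.ndarray) -> np.ndarray:
        """A class centroid as the unit-normalized mean of its pages'
        centroid vectors (one plausible construction)."""
        c = page_centroids.mean(axis=0)
        return c / np.linalg.norm(c)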

Classification Pipeline

  • In a production environment, a single document can contain several pages, and each page may belong to a different document type or class. For instance, in the initial use cases for our system, a customer may send any PDF document containing any combination of pages, in different orientations and with different layouts and page sizes. Therefore, Docugym performs per-page classification and returns classification results for each page.

  • To allow the system to start out with a minimal dataset, Docugym supports a multi-stage classification mechanism. The initial stage is based on k-NN, a fast and memory-efficient method that allows us to start with just 2-3 labeled document pages per document type and then gradually add more labeled pages over time. To account for yet-unknown document types, Docugym implements a statistical threshold-based rejection system that rejects documents too dissimilar to any known document class.

  • Page classification starts with stored centroids; a pgvector cosine search quickly finds the closest layout prototypes, collapsing the search space to a handful of promising candidates (a sketch of both stages follows this list).

  • For each candidate layout, the system streams multi-vector embeddings from Postgres and executes a MaxSim comparison in-database, aligning the query page's tokens against historical exemplars without shipping vectors to the app server.

  • The top-ranked result returns with document type, variant, and entity template metadata so downstream extraction prompts map directly onto the matched layout.

  • Classification artifacts (centroids, multivectors, and quantized embeddings) are persisted for every page, powering cache refreshes, evaluation dashboards, and future inferences without reprocessing the original files.
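
Putting the two stages together, the following sketch mirrors the pipeline in miniature, with in-memory numpy standing in for pgvector; the function names, data shapes, and candidate count k are illustrative assumptions.

    import numpy as np

    def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
        """Late-interaction MaxSim: for each query token, take its best
        match among document tokens, then sum. Vectors are pre-normalized,
        so dot products are cosine similarities."""
        return float((query @ doc.T).max(axis=1).sum())

    def classify_page(page_tokens, page_centroid, class_centroids, exemplars, k=5):
        # Stage 1: cosine prefilter against class centroids (executed
        # inside pgvector in production; in-memory here for illustration).
        sims = {c: float(page_centroid @ v) for c, v in class_centroids.items()}
        candidates = sorted(sims, key=sims.get, reverse=True)[:k]
        # Stage 2: MaxSim re-ranking against stored exemplar multivectors.
        scores = {c: max(maxsim(page_tokens, ex) for ex in exemplars[c])
                  for c in candidates}
        best = max(scores, key=scores.get)
        return best, scores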

Out-of-Distribution Detection

  • Each prediction starts with a centroid screen: the page's centroid embedding is measured against all layout centroids, and if the highest cosine score misses the corpus-level threshold, the page is immediately flagged as unknown before deeper comparisons run.

  • Surviving candidates go through MaxSim re-ranking, but each must clear an adaptive, layout-specific floor computed from the labeled corpus using statistics such as the 5th percentile and the mean minus two sigma (motivated by Chebyshev's inequality); falling short triggers an unknown verdict even if the candidate would otherwise win the ranking (a sketch follows this list).

  • Operators can switch modes (automatic, guided, manual) to control how those thresholds interact with user-defined policies, while the system still returns the nearest known layout and similarity metadata so adjudicators understand which class was closest.

  • Every rejection flows into the out-of-distribution review workflow, where analysts can inspect the highlighted nearest matches, adjust thresholds, label new variants, or requeue the page, giving teams a full loop for continuously refining coverage.
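
The sketch below shows one plausible derivation of the adaptive floor from the labeled corpus's in-class MaxSim scores. The document names the 5th percentile and the mean minus two sigma as example statistics; combining them by taking the stricter (higher) bound is an assumption.

    import numpy as np

    def layout_floor(in_class_scores: np.ndarray) -> float:
        """Per-layout rejection floor from labeled in-class similarity scores."""
        p5 = np.percentile(in_class_scores, 5)
        mu, sigma = in_class_scores.mean(), in_class_scores.std()
        # Chebyshev's inequality motivates the sigma bound: for any
        # distribution, at most 25% of mass falls below mean - 2*sigma.
        return max(p5, mu - 2.0 * sigma)

    def verdict(score: float, floor: float, nearest_class: str) -> str:
        # Below the floor, the page is flagged unknown, but the nearest
        # known class is still reported for adjudicators.
        return nearest_class if score >= floor else "unknown"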

Named Entity Workflows

  • Entity Catalog & Templates: Each corpus maintains a catalog of named entities with unique labels, colors, and extraction prompts. When a page is classified into a specific layout, that page automatically inherits that layout's entity template—defining which fields are expected and each field's extraction prompt. This tight coupling ensures extraction attempts are targeted and contextually aware.

  • Interactive Labeling Workflow: Human annotators can work with the in-browser OCR to select text regions and their spatial coordinates and assign them to entities, or use extractive VLM prompting to obtain the text value and spatial coordinates for each extracted entity.

  • Vision-Language Model Integration: The platform leverages VLMs for intelligent entity extraction at scale. For each entity type, the system maintains extraction prompts that evolve through use—starting with simple questions and refining based on extraction success rates. When processing new documents, the classification result determines which entity prompts to deploy, ensuring the VLM receives layout-specific instructions for optimal accuracy.

  • Prompt Discovery & Refinement: The system analyzes successful extractions to automatically discover effective prompts during labeling, using similarity metrics (Levenshtein distance) to measure extraction quality against labeled examples (a scoring sketch follows this list). Reviewers can test prompts in real time, immediately seeing extraction results and adjusting prompts based on performance. This creates a continuous improvement cycle where human expertise guides automated extraction quality.

  • Inference Pipeline Integration: At inference time, the MCP server and API orchestrate the full extraction pipeline: classify the page layout, retrieve the appropriate entity template, fetch optimized prompts for each entity, execute VLM extraction, and map results back to the structured schema. Unknown document types are flagged for human review, expanding the system's capabilities over time.
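
As a sketch of the Levenshtein-based quality measure used during prompt discovery: the distance itself is standard dynamic programming, while normalizing by the longer string to obtain a 0-1 similarity is an assumption, since the document names only the metric.

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def extraction_quality(predicted: str, labeled: str) -> float:
        """Normalized similarity in [0, 1]; 1.0 is an exact match."""
        if not predicted and not labeled:
            return 1.0
        return 1.0 - levenshtein(predicted, labeled) / max(len(predicted), len(labeled))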

© 2025 Docusure, Inc. All rights reserved.