By Frank Sommers
•
October 21, 2025
The emergence of pretrained, open-source vision-language models (VLMs) is revolutionizing document intelligence.
Document intelligence (DI) tasks include document classification, named-entity recognition, document summarization, visual document question answering, visual document retrieval, document comparison, document generation, and more. These tasks are foundational for most business processes, and represent repetitive, error-prone work.
In contrast to the text-only tasks that large language models (LLMs) excel at, DI tasks operate on visually rich documents with complex, heterogeneous layouts that mix textual and visual cues, often contain domain-specific data, and appear in many languages. While VLMs were designed for general vision tasks, they achieve state-of-the-art results on DI tasks, in part because of their architecture and in part because their training data includes large amounts of document images.
Most tutorials and evaluations of VLMs use publicly available, academic document datasets and benchmarks.
🤔 But how well do open-source VLMs work on your own documents?
Apart from a quick Colab vibe check, that has not been an easy question to answer.
To begin with, you need a high-quality document dataset that is representative of the documents you want to process. Without such a dataset, you simply cannot evaluate a model's performance on your own documents. You also can't tell which models work best, which models need tweaking or tuning (and how) for your data, and which models are the most cost-effective for your documents.
Most organizations do not readily have such document datasets available, for several reasons:
Many real-world business documents contain highly specialized data that only domain experts can label properly. Domain experts in organizations are scarce and cannot routinely dedicate significant time to document labeling. In addition, the more complex your data, the more domain experts may disagree on the correct labels. A high-quality dataset must account for such inter-labeler disagreements.
Real-world business document datasets are highly dynamic as new documents are continuously added. You may start with a dozen patient intake documents, for example, but new document formats and document types are added all the time in the course of business.
Open-source document intelligence evolves rapidly, with new and improved models becoming available regularly. Improved models not only provide better performance on key DI tasks, but can also reduce operational costs due to smaller, more efficient model sizes. However, adapting a dataset to a new model is not a trivial task, as a new model may require new prompts, different context, or fine-tuning.
Real-world business documents contain vast amounts of sensitive, private data. Access to this data is highly regulated within an organization.
Few organizations have the in-house expertise to maintain and evolve a document AI dataset, and to monitor and evaluate the performance of various models on that dataset.
Docugym came out of a struggle with these issues in a real-world business setting. We wanted to use the latest open-source multimodal models for DI, but soon ran into all of the problems above, and more.
Traditional parametric models for classification, named-entity recognition, and other document AI tasks require large amounts of labeled training data. That is not a problem for public datasets, but it is a major challenge for private ones: all but the simplest domain-specific documents require domain experts to help label the data, and domain experts are a scarce resource within an organization. Rather than assembling a large document dataset from the start, we must be able to (a) perform useful document AI tasks with minimal labeled data, and (b) facilitate the incremental development of the labeled corpus.
Because only a few labeled examples are available, and the documents may be complex and domain-specific, the labeling of the document corpus must be of high quality. We support redundant labeling, where a document may be independently labeled by multiple domain experts. Since domain experts can disagree, we help surface and resolve inter-labeler disagreements.
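As one concrete illustration of what disagreement resolution involves (our example here, not Docugym's internals), the sketch below flags documents where two experts assigned different classification labels and reports their chance-corrected agreement as Cohen's kappa. The document types and labels are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same documents."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical labels from two domain experts for five documents.
expert_1 = ["invoice", "invoice", "receipt", "invoice", "contract"]
expert_2 = ["invoice", "receipt", "receipt", "invoice", "contract"]

print("kappa:", round(cohen_kappa(expert_1, expert_2), 3))

# Documents whose labels diverge would be queued for expert adjudication.
needs_review = [i for i, (a, b) in enumerate(zip(expert_1, expert_2)) if a != b]
print("needs review:", needs_review)
```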
The system provides useful DI functionality with just a few labeled examples. As the labeled corpus grows, the system improves its performance in a continuous, online fashion, without the need for explicit retraining.
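One plausible way to get that behavior, sketched below purely as an illustration of the idea rather than Docugym's actual mechanism, is retrieval-augmented few-shot prompting: every expert-labeled document goes into an exemplar store, and at inference time the most similar labeled documents are retrieved and placed in the VLM prompt as in-context examples. Newly labeled documents therefore improve results immediately, with no retraining step.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding; a real system would use a document or image encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

class ExemplarStore:
    """Grows as experts label documents; no explicit retraining involved."""

    def __init__(self):
        self.examples = []  # (embedding, text, label) tuples

    def add(self, text: str, label: str) -> None:
        self.examples.append((embed(text), text, label))

    def nearest(self, text: str, k: int = 3):
        query = embed(text)
        def cosine(vec):
            return float(query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        ranked = sorted(self.examples, key=lambda ex: cosine(ex[0]), reverse=True)
        return [(t, label) for _, t, label in ranked[:k]]

store = ExemplarStore()
store.add("Patient intake form: name, DOB, allergies ...", "intake_form")
store.add("Invoice #4411, net 30, total due ...", "invoice")

# The retrieved (text, label) pairs would be formatted into the VLM prompt
# as in-context examples for the new, unlabeled document.
print(store.nearest("Invoice #9001, total due ...", k=1))
```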
An organization's document data is constantly changing. The system continuously evaluates its performance on new operational document data and makes on-the-fly adjustments to adapt to shifting data distributions.
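As a simplified illustration of what such monitoring could look like (again, not a description of Docugym's internals), the sketch below tracks a rolling window of spot-checked predictions and raises a flag when recent accuracy falls well below the accuracy measured on the labeled evaluation set; such a flag could then trigger a prompt or exemplar refresh.

```python
from collections import deque

class DriftMonitor:
    """Flags when recent accuracy falls below a baseline by a set margin."""

    def __init__(self, baseline_accuracy: float, window: int = 200, margin: float = 0.10):
        self.baseline = baseline_accuracy
        self.margin = margin
        self.window = deque(maxlen=window)

    def record(self, correct: bool) -> bool:
        """Record one spot-checked prediction; return True if drift is suspected."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        recent = sum(self.window) / len(self.window)
        return recent < self.baseline - self.margin

# In operation, 'correct' would come from expert spot checks or weak signals
# such as downstream validation failures; the baseline here is assumed.
monitor = DriftMonitor(baseline_accuracy=0.92)
drifting = monitor.record(correct=False)
```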
Our environment assumes that an organization has no specialized ML-Ops experts available to maintain, or even monitor, the models or the dataset. An API-based endpoint is immediately available. For agentic use, the system also offers an MCP server.
Different models perform very differently on the same DI task, given the same dataset. Docugym lets you evaluate the performance of different models on the same dataset, and the same model on different datasets.
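The comparison itself can be pictured as a simple model-by-dataset grid. The sketch below is purely illustrative, with placeholder models, datasets, and an exact-match metric standing in for real DI task metrics.

```python
def evaluate(predict_fn, dataset):
    """Fraction of documents whose predicted label matches the gold label."""
    correct = sum(predict_fn(doc["image"]) == doc["label"] for doc in dataset)
    return correct / len(dataset)

# Placeholders: in practice these would wrap local or API-served VLMs.
models = {
    "vlm_small_local": lambda image: "invoice",
    "vlm_large_api": lambda image: "invoice",
}
datasets = {
    "intake_forms": [{"image": "...", "label": "intake_form"}],
    "invoices": [{"image": "...", "label": "invoice"}],
}

grid = {
    (model_name, ds_name): evaluate(fn, ds)
    for model_name, fn in models.items()
    for ds_name, ds in datasets.items()
}
for key, score in grid.items():
    print(key, round(score, 3))
```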
In addition to specific models, Docugym also lets users configure the deployment modality of models. Deployment modality affects cost and latency: API-based models generally charge per processed token, while local model costs are fixed per time unit, such as GPU hours. Users can evaluate local and API-based models on the same dataset from both a cost and a performance perspective.
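To make the cost trade-off concrete, here is a back-of-the-envelope sketch with made-up prices and throughput figures: an API model billed per token versus a dedicated local GPU billed per hour, compared at a fixed monthly document volume.

```python
# All numbers below are illustrative assumptions, not actual prices.
docs_per_month = 50_000
tokens_per_doc = 3_000           # prompt + image + output tokens, assumed
api_price_per_1k_tokens = 0.002  # USD, assumed

gpu_price_per_hour = 1.50        # USD, assumed
docs_per_gpu_hour = 400          # assumed local throughput

api_cost = docs_per_month * tokens_per_doc / 1_000 * api_price_per_1k_tokens

# A dedicated local GPU is billed for the whole month whether or not it is busy.
hours_needed = docs_per_month / docs_per_gpu_hour
local_cost = max(hours_needed, 720) * gpu_price_per_hour

print(f"API:   ${api_cost:,.0f} per month")
print(f"Local: ${local_cost:,.0f} per month")
```

Which side wins depends on volume: at low volumes the per-token API is cheaper, while sustained high volumes favor the fixed-cost local deployment.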
Docugym is designed to produce output that aligns with existing business systems and workflows. Via its API and MCP server, Docugym drops into existing document-intensive workflows.
Business documents contain large amounts of sensitive, private data. Example datasets must live in secure, private document storage that only the domain experts can access. The need for a minimal dataset also helps with data security: traditionally, private document data would first be anonymized before being used for training and evaluation, which adds a step and increases the complexity and cost of the system. Minimizing the amount of labeled data, and limiting labeling to a small number of domain experts, avoids that step: domain experts are part of the organization and already have access to the documents.
© 2025 Docusure, Inc. All rights reserved.