
Using Vision Language Models to Process Millions of Documents

Vision language models (VLMs) are powerful machine-learning models that can process both visual and textual information. With the recent release of Qwen 3 VL, I want to take a deep dive into how you can utilize these powerful models to process documents.


Why you need to use VLMs

To highlight why some tasks require VLMs, I want to start off with an example task, where we need to interpret text and the visual information of text.

Imagine you look at the image below. The checkboxes represent whether a document should be included in a report or not, and now you need to determine which documents to include.

This figure highlights a suitable problem for VLMs. You have an image containing text about documents, along with checkboxes, and you need to determine which documents have been checked off. This is difficult to solve with LLMs, because you first need to apply OCR to the image, and the text then loses the visual position that is required to properly solve the task. With VLMs, you can both read the text in the document and utilize its visual position (whether the text is above a checked-off checkbox or not), and thus solve the task. Image by the author.

For a human, this is a simple task; obviously, documents 1 and 3 should be included, while document 2 should be excluded. However, if you tried to solve this problem through a pure LLM, you would encounter issues.

To run a pure LLM, you would first need to OCR the image. The OCR output would look something like the text below if you use Google’s Tesseract, for example, which extracts the text line by line.

Document 1  Document 2  Document 3  X   X 
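For reference, the output above can be produced with a couple of lines of code. The sketch below uses the pytesseract and Pillow packages, and the image path is a placeholder:

# Minimal OCR sketch: extracts the text line by line, discarding visual layout.
# Requires Tesseract to be installed, plus the pytesseract and Pillow packages.
from PIL import Image
import pytesseract

image = Image.open("checkbox_document.png")  # placeholder path
text = pytesseract.image_to_string(image)
print(text)  # plain text only; the checkbox positions are lost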

As you have probably noticed, the LLM will have issues deciding which documents to include, because it’s impossible to know which documents the Xs belong to. This is just one of many scenarios where VLMs are far more effective than pure LLMs.

The main point here is that knowing which documents have a checkboxed X requires both visual and textual information. You need to know the text and the visual position of the text in the image. I summarize this in the quote below:

VLMs are required when the meaning of text depends on its visual position

Application areas

There are a plethora of areas you can apply VLMs to. In this section, I’ll cover some different areas where VLMs have proven useful, and where I have also successfully applied VLMs.

Agentic use cases

Agents are all the rage nowadays, and VLMs also play a role here. I’ll highlight two main areas where VLMs can be used in an agentic context, though there are naturally many other such areas.

Computer use

Computer use is an interesting use case for VLMs. With computer use, I refer to a VLM looking at a frame from your computer and deciding which action to take next. One example of this is OpenAI’s Operator. This could, for example, mean looking at a frame of the article you’re reading right now and scrolling down to read more of it.

VLMs are useful for computer use, because LLMs are not enough to decide which actions to take. When operating on a computer, you often have to interpret the visual position of buttons and information, which, as I described in the beginning, is one of the prime areas of use for VLMs.
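As a rough illustration, a computer-use agent is essentially a loop that captures a screenshot, asks the VLM for the next action, and executes it. The sketch below is not tied to any specific framework; capture_screenshot, ask_vlm, and execute are hypothetical helpers standing in for your screen-capture, VLM, and input libraries:

# Hypothetical computer-use loop. capture_screenshot, ask_vlm, and execute are
# placeholder helpers, not real library calls.
def computer_use_loop(goal: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        screenshot = capture_screenshot()  # e.g. a PNG of the current screen
        action = ask_vlm(
            image=screenshot,
            prompt=f"Goal: {goal}. Decide the next action (click, type, scroll, or done).",
        )
        if action.get("type") == "done":
            break
        execute(action)  # e.g. click at coordinates, type text, or scroll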

Debugging

Debugging code is also a super useful agentic application area for VLMs. Imagine that you are developing a web application, and discover a bug.

One option is to start logging to the console, copy the logs, describe to Cursor what you did, and prompt Cursor to fix it. This is naturally time-consuming, as it requires a lot of manual steps from the user.

Another option is thus to utilize VLMs to better solve the problem. Ideally, you describe how to reproduce the issue, and a VLM goes into your application, recreates the flow, inspects the issue, and thus debugs what is going wrong. There are applications being built for areas like this, though most have not come far in development from what I’ve seen.

Question answering

Utilizing VLMs for visual question answering is one of the classic approaches to using VLMs. Question answering is the use case I described earlier in this article about figuring out which checkbox belongs to which documents. You feed the VLM with a user question, and an image (or several images), for the VLM to process. The VLM will then provide an answer in text format. You can see how this process works in the figure below.

This figure highlights a question answering task where I’ve utilized a VLM to solve the problem. You feed in the image containing the problem, and the question containing the task to solve. The VLM then processes this information and outputs the expected information. Image by the author.
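To make this concrete, below is a minimal sketch of a question answering call using an OpenAI-compatible chat API, which many VLM providers and self-hosted Qwen deployments expose. The model name and image path are placeholders, not recommendations:

# Visual question answering sketch over an OpenAI-compatible chat API.
# The model name and image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # set base_url and api_key to point at your VLM provider

with open("checkbox_document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which documents are checked off and should be included in the report?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)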

You should, however, weigh the trade-offs of using VLMs vs LLMs. Naturally, when a task requires textual and visual information, you need to utilize VLMs to get a proper result. However, VLMs are also usually much more expensive to run, as they need to process more tokens. This is because images contain a lot of information, which thus leads to many input tokens to process.

Furthermore, if the VLM is to process text, you also need high-resolution images, allowing the VLM to interpret the pixels making up letters. With lower resolutions, the VLM struggles to read the text in the images, and you’ll receive low-quality results.

Classification

This figure covers how you can apply VLMs to classification tasks. You feed the VLM with an image of a document, and a question to classify the document into one of a pre-defined set of categories. These categories should be included in the question, but are not included in the figure because of space limitations. The VLM then outputs the predicted classification label. Image by the author.

Another interesting application area for VLMs is classification. With classification, I refer to the situation where you have a predetermined set of categories and need to determine which category an image belongs to.

You can utilize VLMs for classification, with the same approach as using LLMs. You create a structured prompt containing all relevant information, including the possible output categories. Furthermore, you preferably cover the different edge cases, for example, in scenarios where two categories are both very probable, and the VLM has to decide between the two categories.

You can, for example, have a prompt such as:

def get_prompt():
    return """
        ## General instructions
        You need to determine which category a given document belongs to. 
        The available categories are "legal", "technical", "financial".

        ## Edge case handling
        - In the scenario where you have a legal document covering financial information, the document belongs to the financial category
        - ...
        ## Return format
        Respond only with the corresponding category, and no other text 
    """

Information extraction

You can also effectively utilize VLMs for information extraction, and there are a lot of information extraction tasks requiring visual information. You create a similar prompt to the classification prompt I created above, and typically prompt the VLM to respond in a structured format, such as a JSON object.

When performing information extraction, you need to consider how many data points you want to extract. For example, if you need to extract 20 different data points from a document, you probably don’t want to extract all of them at once. This is because the model will likely struggle to accurately extract that much information in one go.

Instead, you should consider splitting up the task, for example, extracting 10 data points in each of two separate requests, which simplifies the task for the model. On the other hand, some data points are related to each other and should therefore be extracted in the same request. Furthermore, sending several requests increases the inference cost.

This figure highlights how you can utilize VLMs to perform information extraction. You again feed the VLM the image of the document, and also prompt the VLM to extract specific data points. In this figure, I prompt the VLM to extract the date of the document, the location mentioned in the document, and the document type. The VLM then analyzes the prompt and the document image, and outputs a JSON object containing the requested information. Image by the author.
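A minimal extraction sketch under the same assumptions (a hypothetical ask_vlm wrapper, and field names mirroring the figure above) could look like this, prompting for a JSON object and parsing the reply:

# Information extraction sketch: prompt for a JSON object and parse the reply.
# ask_vlm is a hypothetical wrapper around your VLM client.
import json

EXTRACTION_PROMPT = """
## General instructions
Extract the following data points from the document image:
- date: the date of the document (YYYY-MM-DD)
- location: the location mentioned in the document
- document_type: the type of document

## Return format
Respond only with a JSON object containing the keys "date", "location", and "document_type".
"""

def extract_fields(image_path: str) -> dict:
    raw = ask_vlm(prompt=EXTRACTION_PROMPT, image_path=image_path)
    return json.loads(raw)  # consider a retry if the model returns invalid JSON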

When VLMs are problematic

VLMs are amazing models that can perform tasks that were unimaginable to solve with AI just a few years ago. However, they also have their limitations, which I’ll cover in this section.

Cost of running VLMs

The first limitation is the cost of running VLMs, which I’ve also briefly discussed earlier in this article. VLMs process images, which consist of a lot of pixels. These pixels represent a lot of information, which is encoded into tokens that the VLM can process. The issue is that since images contain so much information, you need to create a lot of tokens per image, which again increases the cost to run VLMs.

Furthermore, you often need high-resolution images, since the VLM is required to read text in images, leading to even more tokens to process. VLMs are thus expensive to run, both when accessed over an API and in compute costs if you decide to self-host the VLM.
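As a back-of-the-envelope illustration of why resolution drives cost, the sketch below estimates image tokens from the pixel count. The pixels-per-token figure varies from model to model, so it is an assumed parameter here rather than a published number:

# Rough image-token estimate. Actual tokenization differs per model, so
# pixels_per_token is an assumed parameter, not a published figure.
def estimate_image_tokens(width: int, height: int, pixels_per_token: int = 784) -> int:
    return (width * height) // pixels_per_token

print(estimate_image_tokens(1000, 1400))  # ~1,785 tokens for a modest page scan
print(estimate_image_tokens(2000, 2800))  # ~7,142 tokens at double the resolution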

Cannot process long documents

The number of tokens needed to represent images also limits the number of pages a VLM can process at once. VLMs are limited by their context windows, just like traditional LLMs. This is a problem if you want to process long documents containing hundreds of pages. Naturally, you could split the document into chunks, but you might encounter problems where the VLM doesn’t have access to all the contents of the document in one go.

For example, if you have a 100-page document, you could first process pages 1-50, and then process pages 51-100. However, if some information on page 53 might need the context from page 1 (for example, the title or date of the document), this will lead to issues.
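One way to mitigate this is to chunk the pages and carry document-level context (such as the title and date found in the first chunk) forward into each later request. The sketch below assumes a hypothetical ask_vlm helper that accepts a prompt plus a list of page images:

# Chunked processing sketch: pass document-level context from earlier chunks
# into later requests. ask_vlm is a hypothetical helper.
def process_long_document(pages: list, chunk_size: int = 50) -> list[str]:
    results, context = [], "none yet"
    for start in range(0, len(pages), chunk_size):
        chunk = pages[start:start + chunk_size]
        prompt = (
            f"Document-level context from earlier pages: {context}\n"
            "First restate the updated document-level context (title, date), "
            "then answer the task for these pages."
        )
        reply = ask_vlm(prompt=prompt, images=chunk)
        context = reply.splitlines()[0]  # crude: keep the first line as carried context
        results.append(reply)
    return results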

To learn how to deal with this problem, I read Qwen 3’s cookbook, where they have a page on how to utilize Qwen 3 for ultralong documents. I’ll be sure to test this out and discuss how well it works in a future article.

Conclusion

In this article, I’ve discussed vision language models and how you can apply them to different problem areas. I first described how to integrate VLMs in agentic systems, for example, as a computer use agent, or to debug web applications. Continuing, I covered areas such as question answering, classification, and information extraction. Lastly, I also covered some limitations of VLMs, discussing the computational cost of running VLMs and how they struggle with long documents.
