Vision language models (VLMs) are powerful models that can take both images and text as input and respond with text. This allows us to perform visual information extraction on documents and images. In this article, I’ll discuss the newly released Qwen 3 VL and the powerful capabilities VLMs possess.
Qwen 3 VL was released a few weeks ago, initially with the 235B-A22B model, which is quite large. The Qwen team then released the 30B-A3B version, and most recently the dense 4B and 8B versions. My goal for this article is to give you a high-level overview of what vision language models are capable of. I’ll use Qwen 3 VL as a specific example, though there are many other high-quality VLMs available. I’m not affiliated with Qwen in any way.

Why do we need vision language models?
Vision language models are necessary because the alternative is to rely on OCR and feed the OCR-ed text into an LLM. This approach has several issues:
- OCR isn’t perfect, and the LLM will have to deal with imperfect text extraction
- You lose the information contained in the visual position of the text
Traditional OCR engines like Tesseract have long been central to document processing. OCR lets us take an image, extract the text from it, and further process the contents of the document. However, traditional OCR is far from perfect, and it may struggle with small text, skewed images, vertical text, and so on. If you have poor OCR output, you’ll struggle with all downstream tasks, whether you’re using regex or an LLM. Feeding images directly to a VLM, instead of OCR-ed text to an LLM, thus makes far better use of the information available.
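For context, the traditional pipeline looks roughly like this (a minimal sketch using pytesseract; the file name is a placeholder):
from PIL import Image
import pytesseract

# Traditional OCR: extract raw text from the image, losing all layout information
text = pytesseract.image_to_string(Image.open("scanned-document.jpg"))

# The flat string is then handed to regex or an LLM, which never sees the original layout
print(text)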
The visual position of text is sometimes critical to understanding its meaning. Imagine the example in the image below, where checkboxes indicate which text is relevant: some checkboxes are ticked and some are not, and only the text beside a ticked checkbox matters. Extracting this information with OCR + LLMs is challenging, because the LLM can’t tell which text a ticked checkbox belongs to. Solving the same task with a vision language model, however, is trivial.

I fed the image above to Qwen 3 VL, and it replied with the response shown below:
Based on the image provided, the documents that are checked off are:
- **Document 1** (marked with an "X")
- **Document 3** (marked with an "X")
**Document 2** is not checked (it is blank).
As you can see, Qwen 3 VL easily solved the problem correctly.
Another reason we need VLMs is video understanding. Truly understanding video clips would be immensely challenging with OCR, since much of the information in a video is conveyed visually rather than as on-screen text, so OCR alone is not effective. The new generation of VLMs, however, lets you input hundreds of images, for example the frames of a video, allowing you to perform video understanding tasks.
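As a rough illustration (the frame paths and sampling rate here are hypothetical, and frame extraction with a tool like ffmpeg is assumed to have happened beforehand), a multi-frame request can be expressed as a chat message containing several image entries, mirroring the message format used later in this article:
# Hypothetical sketch: represent a short clip as sampled frames and pass them all
# as image entries in a single user message.
frame_paths = [f"clip_frames/frame_{i:04d}.jpg" for i in range(0, 300, 10)]  # every 10th frame

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": path} for path in frame_paths],
            {"type": "text", "text": "Describe what happens in this clip."},
        ],
    },
]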
Vision language model tasks
There are many tasks you can apply vision language models to. In this article, I’ll focus on two of the most relevant ones:
- OCR
- Information extraction
The data
I’ll use the image below as an example image for my testing.

I’ll use this image because it’s a real document and therefore a very relevant target for Qwen 3 VL. Furthermore, I’ve cropped the image to its current shape so that I can feed it at high resolution into Qwen 3 VL on my local computer. Maintaining a high resolution is critical if you want to perform OCR on the image. I extracted the JPG from a PDF at 600 DPI. Normally, 300 DPI is enough for OCR, but I kept a higher DPI just to be safe, which is feasible for an image this small.
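For reference, this is roughly how such a conversion can be done (a minimal sketch using the pdf2image library, which requires poppler; the file names are placeholders, not the exact ones I used):
from pdf2image import convert_from_path

# Render each PDF page to a PIL image at 600 DPI
pages = convert_from_path("site-plan.pdf", dpi=600)

# Save the first page as a JPG; cropping to the relevant region can then be done with PIL
pages[0].save("example-doc-site-plan.jpg", "JPEG")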
Prepare Qwen 3 VL
I need the following packages to run Qwen 3 VL:
torch
accelerate
pillow
torchvision
git+https://github.com/huggingface/transformers
You need to install Transformers from source (GitHub), as Qwen 3 VL is not yet available in the latest Transformers version.
The following code handles the imports, loads the model and processor, and defines an inference function:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import os
import time
# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")

def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
    """Resize image if needed to a maximum size of max_size. Keep the aspect ratio."""
    img = Image.open(image_path)
    width, height = img.size
    if width <= max_size and height <= max_size:
        return image_path
    ratio = min(max_size / width, max_size / height)
    new_width = int(width * ratio)
    new_height = int(height * ratio)
    img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
    base_name = os.path.splitext(image_path)[0]
    ext = os.path.splitext(image_path)[1]
    resized_path = f"{base_name}_resized{ext}"
    img_resized.save(resized_path)
    return resized_path

def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
    ]
    user_content = []
    if image_paths:
        if max_image_size is not None:
            processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
        else:
            processed_paths = image_paths
        user_content.extend([
            {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
            for image_path in processed_paths
        ])
    user_content.append({"type": "text", "text": user_prompt})
    messages.append({
        "role": "user",
        "content": user_content,
    })
    return messages

def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    start_time = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
    return output_text[0]
OCR
OCR is a task that most VLMs are trained for. You can, for example, read the technical reports of the Qwen VL models, which mention that OCR data is part of the training set. To train a VLM to perform OCR, the model is given a series of images along with the text contained in those images, and it learns to extract the text from the images.
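Purely as an illustration (this is my own hypothetical sketch of the idea, not Qwen’s actual training format), such an OCR training sample can be thought of as an image paired with its transcription, framed as a chat exchange:
# Hypothetical example of what an OCR training pair could look like:
# an image as input and the ground-truth transcription as the target output.
ocr_training_sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "receipts/receipt_0001.jpg"},
            {"type": "text", "text": "Read all the text in the image."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "GROCERY STORE\nTotal: 42.50\nThank you for shopping!"},
        ]},
    ]
}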
I’ll apply OCR to the image with the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen 3 VL cookbook.
user_prompt = "Read all the text in the image."
Now I’ll run the model. The test image we’re running on is named example-doc-site-plan-cropped.jpg:
system_prompt = """
You are a helpful assistant that can answer questions and help with tasks.
"""
user_prompt = "Read all the text in the image."
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
Plan- og
bygningsetaten
Dato: 23.01.2014
Bruker: HKN
Målestokk 1:500
Ekvidistanse 1m
Høydegrunnlag: Oslo lokal
Koordinatsystem: EUREF89 - UTM sone 32
© Plan- og bygningsetaten,
Oslo kommune
Originalformat A3
Adresse:
Camilla Colletts vei 15
Gnr/Bnr:
.
Kartet er sammenstilt for:
.
PlotID: / Best.nr.:
27661 /
Deres ref: Camilla Colletts vei 15
Kommentar:
Gjeldende kommunedelplaner:
KDP-BB, KDP-13, KDP-5
Kartutsnittet gjelder vertikalinvå 2.
I tillegg finnes det regulering i
følgende vertikalinvå:
(Hvis blank: Ingen øvrige.)
Det er ikke registrert
naturn mangfold innenfor
Se tegnforklaring på eget ark.
Beskrivelse:
NR:
Dato:
Revidert dato:
From my testing, this output is completely correct: it covers all the text in the image and extracts every character accurately.
Information extraction
You can also perform information extraction using vision language models. This can, for example, be used to extract important metadata from images. You typically also want to extract this metadata into a JSON format, so it’s easily parsable and can be used for downstream tasks. In this example, I’ll extract:
- Date – 23.01.2014 in this example
- Address – Camilla Colletts vei 15 in this example
- Gnr (street number) – which in the test image is a blank field
- Målestokk (scale) – 1:500
I’m running the following code:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
"date": "The date of the document. In format YYYY-MM-DD.",
"address": "The address mentioned in the document.",
"gnr": "The street number (Gnr) mentioned in the document.",
"scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
{
"date": "2014-01-23",
"address": "Camilla Colletts vei 15",
"gnr": "15",
"scale": "1:500"
}
The JSON object is in a valid format, and Qwen has successfully extracted the date, address, and scale fields. However, Qwen also returned a value for gnr. When I first saw this result, I assumed it was a hallucination, since the Gnr field in the test image is blank. However, Qwen made the natural assumption that the Gnr is available in the address, which is correct in this instance.
To verify that it can answer None when it can’t find something, I asked Qwen to extract the Bnr (building number), which is not present in this example. Running the code below:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
"date": "The date of the document. In format YYYY-MM-DD.",
"address": "The address mentioned in the document.",
"Bnr": "The building number (Bnr) mentioned in the document.",
"scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
I get:
{
"date": "2014-01-23",
"address": "Camilla Colletts vei 15",
"Bnr": None,
"scale": "1:500"
}
So, as you can see, Qwen does inform us when information is not present in the document.
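Since the point of extracting this metadata is to use it downstream, the model’s reply still needs to be parsed into a Python object. Here is a minimal sketch of how that could look (this helper is my own, not part of Qwen; the None handling reflects the literal None in the output above):
import json

def parse_extraction_output(output: str) -> dict:
    """Parse the model's JSON reply, tolerating a literal None for missing fields."""
    cleaned = output.strip()
    # Strip optional markdown code fences around the JSON, if the model adds them
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    # The prompt asks for None on missing fields, which is not valid JSON, so map it to null
    cleaned = cleaned.replace(": None", ": null")
    return json.loads(cleaned)

metadata = parse_extraction_output(output)
print(metadata.get("Bnr"))  # None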
Vision language models’ downsides
I would also like to note that vision language models have some issues as well. The image I tested OCR and information extraction on is relatively simple. To truly test the capabilities of Qwen 3 VL, I would have to expose it to more challenging tasks, for example extracting text from a longer document or extracting more metadata fields.
The main current downsides with VLMs, from what I have seen, are:
- Sometimes missing text with OCR
- Inference is slow
VLMs missing text when performing OCR is something I’ve observed a few times. When it happens, the VLM typically just misses a section of the document and completely ignores the text. This is naturally very problematic, as it could miss text that is critical for downstream tasks like performing keyword searches. The reason this happens is a complicated topic that is out of scope for this article, but it’s a problem you should be aware of if you’re performing OCR with VLMs.
Furthermore, VLMs require a lot of processing power. I’m running locally on my PC, and even with a very small model, I started experiencing memory issues when I simply wanted to process an image with dimensions of 2048×2048, which is problematic if I want to perform text extraction from larger documents. You can thus imagine how resource-intensive it becomes if you want to:
- Process more images at once (for example, a 10-page document)
- Process documents at higher resolutions
- Use a larger VLM with more parameters
Conclusion
In this article, I’ve discussed VLMs. I started by discussing why we need them, highlighting how some tasks require both the text and its visual position. I then highlighted some tasks you can perform with VLMs and showed how Qwen 3 VL handled them. I think the vision modality will become more and more important in the coming years. Until a year ago, almost all focus was on pure text models, but to get even more powerful models, we need to utilize the vision modality, which is where I believe VLMs will be incredibly important.