Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

As a practical example, I’ll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model’s 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.

If you’re new to multimodal models in Sentence Transformers, I recommend reading Multimodal Embedding & Reranker Models with Sentence Transformers first. For training text-only embedding, reranker, or sparse embedding models, see the Prior Blogposts section at the end.

Why Finetune?
Training Components
Model
Dataset
Loss Function
Training Arguments
Evaluator
Trainer
Results
Training Multimodal Reranker Models
Additional Resources

Why Finetune?

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are trained on diverse data to perform well across a wide range of languages and tasks: image-text matching, visual question answering, document understanding, and more. But this generality means the model is rarely the best choice for any specific task.

Consider Visual Document Retrieval: given a text query like “What was the company’s Q3 revenue?”, the model must find the most relevant document screenshot from a corpus of thousands. This requires understanding document layouts, charts, tables, and text, which is a very different skill from e.g. matching pictures of shoes with product descriptions.

By finetuning on domain-specific data, the model can learn these specialized patterns. In my experiment, finetuning improved NDCG@10 from 0.888 to 0.947, ahead of every recent multimodal model I tested, including ones up to 4x larger.
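For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant documents near the top of the first ten results. Below is a minimal sketch with binary relevance. The function names and page IDs are illustrative assumptions, not the evaluator that Sentence Transformers uses internally.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each hit is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical page IDs: the only relevant page is "page_7".
top_rank = ndcg_at_k(["page_7", "page_2", "page_9"], {"page_7"})
third_rank = ndcg_at_k(["page_2", "page_9", "page_7"], {"page_7"})
missing = ndcg_at_k(["page_2", "page_9"], {"page_7"})
```

With a single relevant page per query, the score depends only on the rank of that page, so a reported NDCG@10 is the average of these per-query values over the evaluation queries.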

Training Components

Training multimodal Sentence Transformer models involves the same components as training text-only models: a model, a dataset, a loss function, training arguments, an evaluator, and a trainer.

For the model, the most common approach is to finetune an existing multimodal embedding model, or to start from a Vision-Language Model (VLM) checkpoint. The `Transformer` module automatically detects supported modalities from the model’s processor.
To finetune an existing multimodal embedding model (e.g., one that already has a `modules.json` file), you can pass `processor_kwargs` and `model_kwargs` to control preprocessing and model loading respectively. The `processor_kwargs` are passed directly to `AutoProcessor.from_pretrained(...)`, while the `model_kwargs` are passed to the appropriate `AutoModel.from_pretrained(...)` call.
To start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet, Sentence Transformers will attempt to recognize the architecture and infer the supported modalities from the processor. If automatic detection doesn’t work perfectly for a particular model, the configuration in the saved `sentence_bert_config.json` can be edited to adjust modality settings, forward methods, and output handling.
In both cases, the `Transformer` module inspects the processor to determine which modalities are available, and a `Pooling` module is added automatically if needed. You can verify the supported modalities using `print(model.modalities)`.
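As a rough sketch of the first option, loading the base model with both keyword dictionaries might look like the following. The specific keys and values are assumptions chosen for illustration: `attn_implementation` is a standard `from_pretrained` argument, while `max_pixels` is an image-size cap that I assume the Qwen-VL style processor accepts. Check the model card before relying on either.

```python
# Hypothetical settings for illustration; adjust or drop them for your hardware.
MODEL_KWARGS = {
    "attn_implementation": "sdpa",  # forwarded to the model's from_pretrained(...) call
}
PROCESSOR_KWARGS = {
    "max_pixels": 600 * 600,  # assumed processor option that caps image resolution
}


def load_base_model(model_name="Qwen/Qwen3-VL-Embedding-2B"):
    """Load the embedding model and print which modalities it accepts."""
    # Imported lazily so the settings above can be inspected without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        model_name,
        model_kwargs=MODEL_KWARGS,
        processor_kwargs=PROCESSOR_KWARGS,
    )
    print(model.modalities)
    return model
```

Capping the image resolution is one way to trade a little visual detail for lower memory use during training, which matters because document screenshots can expand into many image tokens.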

Alternative: Building multimodal models with Router

Instead of using a single VLM backbone, you can compose separate encoders for different modalities using the `Router` module. This lets you combine any existing encoders and route inputs to the appropriate one based on detected modality:

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Create separate encoders for different modalities
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# SigLIP outputs pooled embeddings directly, so no separate Pooling module is needed
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs based on modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)

model = SentenceTransformer(modules=[router])

Since Router-based multimodal models use separate encoders per modality, their embedding spaces are initially unaligned. Training is required to align the spaces for meaningful cross-modal similarity. The `Dense` projection layer helps map embeddings from different encoders into a shared space.

This approach is useful when you want to use lightweight, specialized encoders rather than a large VLM. You can also combine Router-based multimodality with task-specific routing (e.g., different encoders for queries vs. documents) using `route_mappings`. See the `Router` documentation for advanced routing scenarios.

Dataset

Visual Document Retrieval Dataset

For this example, I use the tomaarsen/llamaindex-vdr-en-train-preprocessed dataset, a preprocessed English subset of llamaindex/vdr-multilingual-train. The source dataset was released alongside the Visual Document Retrieval Goes Multilingual blogpost by LlamaIndex, and consists of ~500k multilingual query-image samples collected from public internet PDFs, with queries synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). My preprocessed version filters to the 53,512 English samples and resolves 4 of the 16 ID-based hard negatives per sample into actual document screenshot images, so it can be used directly for training without further preprocessing:

from datasets import load_dataset

train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
train_dataset = train_dataset.select_columns(["query", "image", "negative_0"])
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")

The `train` config contains the first 10,000 samples, and the `eval` config contains the next 300 samples (a `full` config with all 53,512 samples is also available). For training, I select `query`, `image`, and `negative_0` to form (anchor, positive, hard negative) triplets. Including additional hard negatives would likely improve the training signal, but each extra negative also increases memory usage and training time, so I stick with one. For evaluation, I keep all four hard negatives per query to build a more challenging retrieval corpus (more on that in the Evaluator section).

Dataset Format

Just like text-only training, the dataset format must match your chosen loss function. The rules are the same:

If your loss function requires a Label, your dataset must have a column named “label” or “score”.
All columns other than “label” or “score” are considered Inputs. The number of these columns must match the number of valid inputs for your chosen loss function. Beyond the label column, the column names don’t matter, only the order does.
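These two rules can be sketched as a small check on a dataset’s column names. The helper names below are hypothetical and this is not how the trainer validates datasets internally; it only restates the rules in code.

```python
LABEL_COLUMNS = {"label", "score"}


def split_columns(column_names):
    """Separate label columns from input columns, preserving the input order."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    labels = [name for name in column_names if name in LABEL_COLUMNS]
    return inputs, labels


def check_layout(column_names, expected_inputs, needs_label):
    """True if the columns match what a loss with this signature expects."""
    inputs, labels = split_columns(column_names)
    if len(inputs) != expected_inputs:
        return False
    return bool(labels) == needs_label


# A triplet setup: three inputs, no label column.
triplet_ok = check_layout(["query", "image", "negative_0"], expected_inputs=3, needs_label=False)
# Input column names are free-form, so renamed columns pass as well.
renamed_ok = check_layout(["anchor", "positive", "negative"], expected_inputs=3, needs_label=False)
# An unused extra column counts as an input and breaks the layout.
extra_column = check_layout(["query", "image", "negative_0", "id"], expected_inputs=3, needs_label=False)
# A loss that needs a score: two inputs plus a "score" column.
scored_ok = check_layout(["text", "image", "score"], expected_inputs=2, needs_label=True)
```

This is why the training dataset above is reduced to exactly `query`, `image`, and `negative_0` before it is handed to the trainer: any leftover column would be treated as another input.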

For multimodal datasets, the inputs can contain:

Text: strings.
Image: PIL images, file paths, URLs, or numpy/torch arrays.
Audio: file paths, numpy/torch arrays, dicts with “array” and “sampling_rate” keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
Video: file paths, numpy/torch arrays, dicts with “array” and “video_metadata” keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "image": ...}`. The keys must be “text”, “image”, “audio”, or “video”.
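To make the multimodal dict rule concrete, here is a small sketch that flags keys outside the four allowed modality names. The helper and the file path are hypothetical; the library performs its own checks during preprocessing.

```python
ALLOWED_MODALITIES = {"text", "image", "audio", "video"}


def invalid_modality_keys(sample):
    """Return the keys of a multimodal dict that are not recognised modality names."""
    return sorted(set(sample) - ALLOWED_MODALITIES)


# Hypothetical samples: a caption paired with a page screenshot on disk.
good_sample = {"text": "Quarterly revenue chart", "image": "pages/report_p3.png"}
bad_sample = {"text": "Quarterly revenue chart", "img": "pages/report_p3.png"}

good_issues = invalid_modality_keys(good_sample)
bad_issues = invalid_modality_keys(bad_sample)
```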

The data collator automatically calls `model.preprocess()`, which detects the modality of each input and applies the appropriate preprocessing. No manual tokenization or image processing is needed.

Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them at https://huggingface.co/datasets?other=sentence-transformers.

Loss Function

CachedMultipleNegativesRankingLoss

For this training, I use `CachedMultipleNegativesRankingLoss`, a common choice for retrieval tasks. It accepts (query, positive) pairs with any number of additional hard negative columns, from 0 up to n, as long as each sample has the same number of negatives.

During training, the loss pushes each query’s similarity to its positive up and its similarity to every negative down. The negatives come from two sources:

Hard negatives: the negative column(s) explicitly supplied in the dataset (just `negative_0` in our triplet setup).
In-batch negatives: the positives and hard negatives from every other sample in the same batch, reused as additional negatives for this query at no extra cost.
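The objective described above can be sketched in plain Python on toy 2D embeddings. This is an illustrative version of an in-batch-negatives ranking loss, not the library’s implementation: the cosine similarity and the scale of 20 are assumptions, and the gradient caching that the cached variant uses to process large batches in smaller chunks is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where query i must pick positives[i] among all candidates."""
    # Every positive and every hard negative in the batch is a candidate for every query.
    candidates = positives + hard_negatives
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the query's own positive (index i) as the target.
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy batch of two (query, positive, hard negative) triplets.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.4, 0.6]]

aligned_loss = in_batch_ranking_loss(queries, positives, hard_negatives)
# Swapping the positives makes each query's target the wrong document.
swapped_loss = in_batch_ranking_loss(queries, positives[::-1], hard_negatives)
```

With a batch of B triplets, each query is scored against 2B candidates: its own positive, its own hard negative, and the 2B - 2 positives and hard negatives borrowed from the other samples. Larger batches therefore provide more negatives per query.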
