Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
As a practical example, I’ll walk through finetuning
Qwen/Qwen3-VL-Embedding-2B
for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting
tomaarsen/Qwen3-VL-Embedding-2B-vdr
demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model’s 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.
If you’re new to multimodal models in Sentence Transformers, I recommend reading Multimodal Embedding & Reranker Models with Sentence Transformers first. For training text-only embedding, reranker, or sparse embedding models, see the Prior Blogposts section at the end.
Table of Contents
- Why Finetune?
- Training Components
  - Model
  - Dataset
  - Loss Function
  - Training Arguments
  - Evaluator
  - Trainer
- Results
- Training Multimodal Reranker Models
- Additional Resources
Why Finetune?
General-purpose multimodal embedding models like
Qwen/Qwen3-VL-Embedding-2B
are trained on diverse data to perform well across a wide range of languages and tasks: image-text matching, visual question answering, document understanding, and more. But this generality means the model is rarely the best choice for any specific task.
Consider Visual Document Retrieval: given a text query like “What was the company’s Q3 revenue?”, the model must find the most relevant document screenshot from a corpus of thousands. This requires understanding document layouts, charts, tables, and text, which is a very different skill from e.g. matching pictures of shoes with product descriptions.
By finetuning on domain-specific data, the model can learn these specialized patterns. In my experiment, finetuning improved NDCG@10 from 0.888 to 0.947, ahead of every recent multimodal model I tested, including ones up to 4x larger.
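For reference, NDCG@10 rewards rankings that place the relevant documents near the top of the first ten results. The sketch below is a minimal pure-Python version, assuming binary relevance and a single relevant page per query; the function name `ndcg_at_k` and the toy rankings are illustrative and not part of Sentence Transformers.

```python
import math


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance scores (1 = relevant, 0 = not relevant)
    in the order in which the model ranked the documents.
    """
    def dcg(relevances):
        # The document at 0-based position i is discounted by log2(i + 2)
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# One relevant page per query; the 1 marks where the model ranked it
top_hit = [1, 0, 0, 0, 0]
third_place = [0, 0, 1, 0, 0]
print(ndcg_at_k(top_hit))      # 1.0
print(ndcg_at_k(third_place))  # 0.5
```

Averaging this value over all evaluation queries gives a single score between 0 and 1, which is how figures like 0.888 and 0.947 should be read.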
Training Components
Training multimodal Sentence Transformer models involves the same components as training text-only models: a model, a dataset, a loss function, training arguments, an evaluator, and a trainer.
Model
- The most common approach is to finetune an existing multimodal embedding model, or to start from a Vision-Language Model (VLM) checkpoint. The
Transformer
module automatically detects supported modalities from the model’s processor.
- To finetune an existing multimodal embedding model (e.g., one that already has a
modules.json
file), you can pass
processor_kwargs
and
model_kwargs
to control preprocessing and model loading respectively. The
processor_kwargs
are passed directly to
AutoProcessor.from_pretrained(...)
, while the
model_kwargs
are passed to the appropriate
AutoModel.from_pretrained(...)
call.
- To start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet, Sentence Transformers will attempt to recognize the architecture and infer the supported modalities from the processor. If automatic detection doesn’t work perfectly for a particular model, the configuration in the saved
sentence_bert_config.json
can be edited to adjust modality settings, forward methods, and output handling.
- In both cases, the
Transformer
module inspects the processor to determine which modalities are available, and a
Pooling
module is added automatically if needed. You can verify the supported modalities using
print(model.modalities)
.
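Put together, loading an existing multimodal embedding model with custom loading and preprocessing options might look like the sketch below. The specific values are assumptions for illustration: `attn_implementation` is a standard `from_pretrained` option, and `min_pixels`/`max_pixels` are image-resolution bounds that I assume the Qwen processor accepts. The import sits inside the function so that the snippet can be read and checked without downloading the checkpoint.

```python
# Assumed, illustrative settings; tune them for your own hardware and model
MODEL_NAME = "Qwen/Qwen3-VL-Embedding-2B"
MODEL_KWARGS = {"attn_implementation": "sdpa"}  # forwarded to AutoModel.from_pretrained(...)
PROCESSOR_KWARGS = {
    # Assumed Qwen-style resolution bounds, forwarded to AutoProcessor.from_pretrained(...)
    "min_pixels": 28 * 28,
    "max_pixels": 600 * 600,
}


def load_model():
    """Load the embedding model and print the modalities it supports."""
    # Imported lazily: constructing the model downloads the checkpoint
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        MODEL_NAME,
        model_kwargs=MODEL_KWARGS,
        processor_kwargs=PROCESSOR_KWARGS,
    )
    print(model.modalities)  # the modalities detected from the processor
    return model
```

Calling `load_model()` fetches the full 2B-parameter checkpoint, so run it on a machine with enough memory. A lower `max_pixels` shrinks the number of image tokens per page, trading some visual detail for memory.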
Alternative: Building multimodal models with Router
Instead of using a single VLM backbone, you can compose separate encoders for different modalities using the
Router
module. This lets you combine any existing encoders and route inputs to the appropriate one based on detected modality:
from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Create separate encoders for different modalities
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# SigLIP outputs pooled embeddings directly, so no separate Pooling module is needed
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs based on modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)
model = SentenceTransformer(modules=[router])
Since Router-based multimodal models use separate encoders per modality, their embedding spaces are initially unaligned. Training is required to align the spaces for meaningful cross-modal similarity. The
Dense projection layer helps map embeddings from different encoders into a shared space.
This approach is useful when you want to use lightweight, specialized encoders rather than a large VLM. You can also combine Router-based multimodality with task-specific routing (e.g., different encoders for queries vs. documents) using
route_mappings
. See the
Router
documentation for advanced routing scenarios.
Dataset
Visual Document Retrieval Dataset
For this example, I use the
tomaarsen/llamaindex-vdr-en-train-preprocessed
dataset, a preprocessed English subset of
llamaindex/vdr-multilingual-train
. The source dataset was released alongside the Visual Document Retrieval Goes Multilingual blogpost by LlamaIndex, and consists of ~500k multilingual query-image samples collected from public internet PDFs, with queries synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). My preprocessed version filters to the 53,512 English samples and resolves 4 of the 16 ID-based hard negatives per sample into actual document screenshot images, so it can be used directly for training without further preprocessing:
from datasets import load_dataset

# "train" config: keep only the (anchor, positive, hard negative) columns
train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
train_dataset = train_dataset.select_columns(["query", "image", "negative_0"])

# "eval" config: all hard negative columns are kept for evaluation
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")
The
train
config contains the first 10,000 samples, and the
eval
config contains the next 300 samples (a
full
config with all 53,512 samples is also available). For training, I select
query
,
image
, and
negative_0
to form (anchor, positive, hard negative) triplets. Including additional hard negatives would likely improve the training signal, but each extra negative also increases memory usage and training time, so I stick with one. For evaluation, I keep all four hard negatives per query to build a more challenging retrieval corpus (more on that in the Evaluator section).
Dataset Format
Just like text-only training, the dataset format must match your chosen loss function. The rules are the same:
- If your loss function requires a Label, your dataset must have a column named “label” or “score”.
- All columns other than “label” or “score” are considered Inputs. The number of these columns must match the number of valid inputs for your chosen loss function. Apart from the label column, column names don’t matter; only their order does.
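To make the two rules concrete, the helper below splits a dataset's column names into the label column and the ordered input columns. It is a plain-Python illustration of the convention, not the library's own code, and `split_columns` is a hypothetical name.

```python
LABEL_COLUMNS = ("label", "score")


def split_columns(column_names):
    """Return (label_column_or_None, ordered_input_columns)."""
    label = next((name for name in column_names if name in LABEL_COLUMNS), None)
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    return label, inputs


# The triplet dataset used here has no label: all three columns are inputs,
# read in order as (anchor, positive, hard negative)
print(split_columns(["query", "image", "negative_0"]))
# (None, ['query', 'image', 'negative_0'])

# A scored-pair dataset: two inputs plus a label column
print(split_columns(["sentence1", "sentence2", "score"]))
# ('score', ['sentence1', 'sentence2'])
```

Because only the order matters, renaming `query` to `anchor` would change nothing, while swapping the positions of `image` and `negative_0` would silently train the model on the wrong positives.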
For multimodal datasets, the inputs can contain:
- Text: strings.
- Image: PIL images, file paths, URLs, or numpy/torch arrays.
- Audio: file paths, numpy/torch arrays, dicts with “array” and “sampling_rate” keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
- Video: file paths, numpy/torch arrays, dicts with “array” and “video_metadata” keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
- Multimodal dicts: a dict mapping modality names to values, e.g.
{"text": ..., "image": ...}. The keys must be “text”, “image”, “audio”, or “video”.
The data collator automatically calls
model.preprocess()
, which detects the modality of each input and applies the appropriate preprocessing. No manual tokenization or image processing is needed.
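As a quick illustration of the multimodal dict format, the check below verifies that a dict only uses the four allowed modality keys. `validate_multimodal_input` is a hypothetical helper for illustration, not a Sentence Transformers function, and the file name is a placeholder.

```python
ALLOWED_MODALITIES = {"text", "image", "audio", "video"}


def validate_multimodal_input(value):
    """Raise ValueError if a multimodal dict uses a key outside the allowed modalities."""
    unknown = set(value) - ALLOWED_MODALITIES
    if unknown:
        raise ValueError(f"Unsupported modality keys: {sorted(unknown)}")
    return value


# A document page paired with a short description: valid
validate_multimodal_input({"text": "Quarterly revenue overview", "image": "page_12.png"})

# A typo in a key is caught before training starts
try:
    validate_multimodal_input({"txt": "Quarterly revenue overview"})
except ValueError as error:
    print(error)  # Unsupported modality keys: ['txt']
```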
Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them at https://huggingface.co/datasets?other=sentence-transformers.
Loss Function
CachedMultipleNegativesRankingLoss
For this training, I use
CachedMultipleNegativesRankingLoss
, a common choice for retrieval tasks. It accepts (query, positive) pairs plus any number of additional hard negative columns (including none), as long as every sample has the same number of negatives.
During training, the loss pushes each query’s similarity to its positive up and its similarity to every negative down. The negatives come from two sources:
- Hard negatives: the negative column(s) explicitly supplied in the dataset (just
negative_0
in our triplet setup).
- In-batch negatives: the positives and hard negatives from every other sample in the same batch, reused as additional negatives for this query at no extra cost.
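The objective described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: it uses cosine similarity and a scale of 20.0 (both assumed here), works on toy 2-dimensional embeddings, and leaves out the gradient caching that gives the Cached variant its name. The names `cosine` and `ranking_loss` are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy over the batch, where query i must pick positive i
    out of every positive and every hard negative in the batch."""
    candidates = positives + hard_negatives  # each query is scored against all of them
    total = 0.0
    for i, query in enumerate(queries):
        scores = [scale * cosine(query, candidate) for candidate in candidates]
        # log-sum-exp, shifted by the max score for numerical stability
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]  # candidate i is the positive for query i
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each positive is close to its own query
swapped = [[0.1, 0.9], [0.9, 0.1]]      # each positive is close to the other query
negatives = [[-1.0, 0.2], [0.2, -1.0]]  # one hard negative per sample

print(ranking_loss(queries, aligned, negatives) < ranking_loss(queries, swapped, negatives))  # True
```

With a batch of B samples and one hard negative each, every query is compared against 2B candidates, which is why larger batches give a stronger training signal at no extra labeling cost.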
Originally published at huggingface.co. Curated by AI Maestro.




