Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers


By AI Maestro May 10, 2026 6 min read

As a practical example, I’ll walk through finetuning `Qwen/Qwen3-VL-Embedding-2B` for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting `tomaarsen/Qwen3-VL-Embedding-2B-vdr` demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model’s 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.

If you’re new to multimodal models in Sentence Transformers, I recommend reading Multimodal Embedding & Reranker Models with Sentence Transformers first. For training text-only embedding, reranker, or sparse embedding models, see the Prior Blogposts section at the end.

Table of Contents

  • Why Finetune?
  • Training Components
  • Model
  • Dataset
  • Loss Function
  • Training Arguments
  • Evaluator
  • Trainer
  • Results
  • Training Multimodal Reranker Models
  • Additional Resources

Why Finetune?

General-purpose multimodal embedding models like `Qwen/Qwen3-VL-Embedding-2B` are trained on diverse data to perform well across a wide range of languages and tasks: image-text matching, visual question answering, document understanding, and more. But this generality means the model is rarely the best choice for any specific task.

Consider Visual Document Retrieval: given a text query like “What was the company’s Q3 revenue?”, the model must find the most relevant document screenshot from a corpus of thousands. This requires understanding document layouts, charts, tables, and text, which is a very different skill from e.g. matching pictures of shoes with product descriptions.

By finetuning on domain-specific data, the model can learn these specialized patterns. In my experiment, finetuning improved NDCG@10 from 0.888 to 0.947, ahead of every recent multimodal model I tested, including ones up to 4x larger.

Training Components

Training multimodal Sentence Transformer models involves the same components as training text-only models: model, dataset, loss function, training arguments, evaluator, and trainer.

Model

  • The most common approach is to finetune an existing multimodal embedding model, or to start from a Vision-Language Model (VLM) checkpoint. The `Transformer` module automatically detects supported modalities from the model’s processor.

  • To finetune an existing multimodal embedding model (e.g., one that already has a `modules.json` file), you can pass `processor_kwargs` and `model_kwargs` to control preprocessing and model loading respectively. The `processor_kwargs` are passed directly to `AutoProcessor.from_pretrained(...)`, while the `model_kwargs` are passed to the appropriate `AutoModel.from_pretrained(...)` call.

  • To start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet, Sentence Transformers will attempt to recognize the architecture and infer the supported modalities from the processor. If automatic detection doesn’t work perfectly for a particular model, the configuration in the saved `sentence_bert_config.json` can be edited to adjust modality settings, forward methods, and output handling.

  • In both cases, the `Transformer` module inspects the processor to determine which modalities are available, and a `Pooling` module is added automatically if needed. You can verify the supported modalities using `print(model.modalities)`.
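Loading an existing multimodal embedding model with both keyword arguments could look roughly like the sketch below. The specific values (`torch_dtype`, `min_pixels`, `max_pixels`) are illustrative assumptions rather than recommended settings, and `load_embedding_model` is a hypothetical helper name. The import is kept inside the helper so that nothing is downloaded until it is called.

```python
# Illustrative settings only: these values are assumptions, not tuned recommendations.
MODEL_NAME = "Qwen/Qwen3-VL-Embedding-2B"

# Forwarded to the AutoModel.from_pretrained(...) call: controls how weights are loaded.
model_kwargs = {"torch_dtype": "bfloat16"}

# Forwarded to AutoProcessor.from_pretrained(...): controls preprocessing.
# Assumption: this processor accepts pixel bounds that cap the image resolution.
processor_kwargs = {"min_pixels": 28 * 28, "max_pixels": 600 * 600}


def load_embedding_model(name=MODEL_NAME):
    """Hypothetical helper: load the model and print its detected modalities."""
    # Imported lazily so the settings above can be inspected without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        name,
        model_kwargs=model_kwargs,
        processor_kwargs=processor_kwargs,
    )
    print(model.modalities)  # check which modalities were detected
    return model
```

Capping the image resolution is one way to trade some detail for lower memory use, which matters once every training sample carries several document screenshots.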

Alternative: Building multimodal models with Router

Instead of using a single VLM backbone, you can compose separate encoders for different modalities using the `Router` module. This lets you combine any existing encoders and route inputs to the appropriate one based on detected modality:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Create separate encoders for different modalities
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# SigLIP outputs pooled embeddings directly, so no separate Pooling module is needed
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs based on modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)

model = SentenceTransformer(modules=[router])
```

Since Router-based multimodal models use separate encoders per modality, their embedding spaces are initially unaligned. Training is required to align the spaces for meaningful cross-modal similarity. The `Dense` projection layer helps map embeddings from different encoders into a shared space.

This approach is useful when you want to use lightweight, specialized encoders rather than a large VLM. You can also combine Router-based multimodality with task-specific routing (e.g., different encoders for queries vs. documents) using `route_mappings`. See the `Router` documentation for advanced routing scenarios.

Dataset

Visual Document Retrieval Dataset

For this example, I use the `tomaarsen/llamaindex-vdr-en-train-preprocessed` dataset, a preprocessed English subset of `llamaindex/vdr-multilingual-train`. The source dataset was released alongside the Visual Document Retrieval Goes Multilingual blogpost by LlamaIndex, and consists of ~500k multilingual query-image samples collected from public internet PDFs, with queries synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). My preprocessed version filters to the 53,512 English samples and resolves 4 of the 16 ID-based hard negatives per sample into actual document screenshot images, so it can be used directly for training without further preprocessing:

```python
from datasets import load_dataset

train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
train_dataset = train_dataset.select_columns(["query", "image", "negative_0"])
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")
```

The `train` config contains the first 10,000 samples, and the `eval` config contains the next 300 samples (a `full` config with all 53,512 samples is also available). For training, I select `query`, `image`, and `negative_0` to form (anchor, positive, hard negative) triplets. Including additional hard negatives would likely improve the training signal, but each extra negative also increases memory usage and training time, so I stick with one. For evaluation, I keep all four hard negatives per query to build a more challenging retrieval corpus (more on that in the Evaluator section).

Dataset Format

Just like text-only training, the dataset format must match your chosen loss function. The rules are the same:

  • If your loss function requires a Label, your dataset must have a column named “label” or “score”.
  • All columns other than “label” or “score” are considered Inputs. The number of these columns must match the number of valid inputs for your chosen loss function. Beyond the label column, the column names don’t matter, only the order does.
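These two rules can be made concrete with a small check. The helpers below (`split_columns`, `matches_loss`) are hypothetical and written only for illustration; they are not functions from Sentence Transformers. The input count of 3 reflects the (anchor, positive, hard negative) triplets used in this post, and the `source` column is an invented example of a leftover metadata column.

```python
LABEL_COLUMNS = {"label", "score"}


def split_columns(column_names):
    """Split dataset columns into ordered input columns and label columns."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    labels = [name for name in column_names if name in LABEL_COLUMNS]
    return inputs, labels


def matches_loss(column_names, num_inputs, needs_label):
    """Check the two rules: a label column if required, and the right input count."""
    inputs, labels = split_columns(column_names)
    has_required_label = bool(labels) or not needs_label
    return has_required_label and len(inputs) == num_inputs


# Triplet data for a loss that takes three inputs and no label.
print(matches_loss(["query", "image", "negative_0"], num_inputs=3, needs_label=False))  # True

# A leftover metadata column counts as a fourth input, so it must be dropped first.
print(matches_loss(["query", "image", "negative_0", "source"], num_inputs=3, needs_label=False))  # False
```

This is why the training set above is reduced with `select_columns` before training: any column left in the dataset is treated as an input.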

For multimodal datasets, the inputs can contain:

  • Text: strings.
  • Image: PIL images, file paths, URLs, or numpy/torch arrays.
  • Audio: file paths, numpy/torch arrays, dicts with “array” and “sampling_rate” keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
  • Video: file paths, numpy/torch arrays, dicts with “array” and “video_metadata” keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
  • Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "image": ...}`. The keys must be “text”, “image”, “audio”, or “video”.
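As a small illustration of the key restriction on multimodal dicts, the sketch below flags keys that are not valid modality names. `invalid_modality_keys` is a hypothetical helper, not part of Sentence Transformers, and the file path is a made-up placeholder.

```python
ALLOWED_MODALITIES = {"text", "image", "audio", "video"}


def invalid_modality_keys(sample):
    """Return the keys of a multimodal dict that are not valid modality names."""
    return sorted(set(sample) - ALLOWED_MODALITIES)


# A text + image input, with the image given as a (placeholder) file path.
good = {"text": "Quarterly revenue chart", "image": "pages/report_p3.png"}
# "img" is not one of the four accepted keys.
bad = {"text": "Quarterly revenue chart", "img": "pages/report_p3.png"}

print(invalid_modality_keys(good))  # []
print(invalid_modality_keys(bad))   # ['img']
```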

The data collator automatically calls `model.preprocess()`, which detects the modality of each input and applies the appropriate preprocessing. No manual tokenization or image processing is needed.

Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with `sentence-transformers`, allowing you to easily find them at https://huggingface.co/datasets?other=sentence-transformers.

Loss Function

CachedMultipleNegativesRankingLoss

For this training, I use `CachedMultipleNegativesRankingLoss`, a common choice for retrieval tasks. It accepts (query, positive) pairs with any number of additional hard negative columns, from 0 up to n, as long as each sample has the same number of negatives.

During training, the loss pushes each query’s similarity to its positive up and its similarity to every negative down. The negatives come from two sources:

  • Hard negatives: the negative column(s) explicitly supplied in the dataset (just `negative_0` in our triplet setup).

  • In-batch negatives: the positives and hard negatives from every other sample in the same batch, reused as additional negatives for this query at no extra cost.
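To make the two sources of negatives concrete, here is a minimal pure-Python sketch of the underlying ranking objective: each query is scored against every positive and hard negative in the batch, and cross-entropy rewards ranking its own positive first. It assumes cosine similarity with a scale of 20, uses toy 2D vectors in place of real embeddings, and leaves out the gradient caching that the Cached variant adds to reduce memory use. The function names are illustrative, not library API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where query i must pick positives[i] among all candidates."""
    # Candidate pool shared by every query: all positives, then all hard negatives.
    candidates = positives + hard_negatives
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, candidate) for candidate in candidates]
        log_normalizer = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_normalizer - logits[i])  # -log softmax at the true positive
    return sum(losses) / len(losses)


def negatives_per_query(batch_size, num_hard_negatives):
    """Other samples' positives plus every hard negative in the batch."""
    return (batch_size - 1) + batch_size * num_hard_negatives


queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.9, 0.1], [0.1, 0.9]]  # close to the query, but not the answer

print(ranking_loss(queries, positives, hard_negatives))
print(negatives_per_query(batch_size=32, num_hard_negatives=1))  # 63
```

With a batch of 32 triplets, each query is contrasted against 31 other positives and 32 hard negatives, 63 negatives in total, even though only one negative per sample was supplied in the dataset.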

Originally published at huggingface.co. Curated by AI Maestro.
