Multilingual embedder for Weaviate

Dear friends,
I need to embed millions of Italian-language strings using the well-tested intfloat/multilingual-e5-large model.

If anyone is interested, I have uploaded a small repo that shows how this is done and also tests the setup and performance of your new multilingual Weaviate service.

I am left with one question, though. As you can see from its Hugging Face card, the strings to be vectorized should all be prefixed with the "passage: " string.

This prefix, of course, is only needed to generate the embedding and should not be saved to the DB.

How would you suggest handling this?
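
(For reference, the workaround available today is to bring your own vectors: prepend the prefix only at embedding time and store the clean text. A minimal sketch, assuming the v4 weaviate Python client plus sentence-transformers; the ItalianDocs collection and content property are placeholder names.)

import weaviate
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
client = weaviate.connect_to_local()
docs = client.collections.get("ItalianDocs")  # placeholder collection name

raw_text = "La Torre di Pisa si trova in Toscana."  # "The Tower of Pisa is in Tuscany."

# Prefix only for the embedding call; the stored object keeps the clean text.
vector = model.encode("passage: " + raw_text, normalize_embeddings=True)

docs.data.insert(
    properties={"content": raw_text},  # no "passage: " prefix saved to the DB
    vector=vector.tolist(),
)
client.close()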

Hello, my friend!!!

Long time no see :slight_smile:

Indeed, I don’t see anything that could prefix this in the inference container code.

Maybe it could be added as an ENV VAR, something like VECTORIZER_TEXT_PREPEND, somewhere around here.
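
Something like this, as a purely hypothetical sketch of the container-side change (VECTORIZER_TEXT_PREPEND is the proposed name, not an existing option, and preprocess is a made-up entry point):

import os

# Hypothetical: read an optional prefix from the environment once at startup
TEXT_PREPEND = os.environ.get("VECTORIZER_TEXT_PREPEND", "")

def preprocess(text: str) -> str:
    # Prepend the configured string (e.g. "passage: ") before tokenization;
    # the object text stored in Weaviate is left untouched.
    return TEXT_PREPEND + text if TEXT_PREPEND else text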

I will raise this internally with our team.

Thanks for exploring and sharing!

@rjalex with our latest v1.31 (which should be released this week) I have added one small change to the transformers module.

Now, when we run a query we send taskType: query, and when we send a passage request we add taskType: passage. We can use this information in the transformers inference container to prefix the text with either "passage: " or "query: " when the model in use is intfloat/multilingual-e5-large.

I can add support for this model and modify the transformers inference container so that it supports your case.
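
Something along these lines, as a hypothetical sketch of that prefixing logic (maybe_prefix and E5_MODELS are made-up names, not the actual container code):

# Map the incoming taskType to the prefix the E5 family expects.
E5_MODELS = {"intfloat/multilingual-e5-large"}

def maybe_prefix(text: str, task_type: str, model_name: str) -> str:
    # task_type is "query" for search requests and "passage" for object vectorization
    if model_name in E5_MODELS and task_type in ("query", "passage"):
        return f"{task_type}: {text}"
    return text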

Hi Duda, yes, I was absent for a while due to other job priorities. As always, thanks a lot.

Hi Marcin, yes, that would be cool. https://huggingface.co/intfloat/multilingual-e5-large and BAAI/bge-m3 (on Hugging Face) are, as you probably know, the best multilingual embedders out there, and as you can see from their Hugging Face pages, both Milvus and Vespa already support them, so I guess that optimal support in my beloved Weaviate is also strategic (and very useful for me :wink: ).

Here is the FAQ from the e5-large HF page:

FAQ

1. Do I need to add the prefix "query: " and "passage: " to input texts?

Yes, this is how the model is trained, otherwise you will see a performance degradation.

Here are some rules of thumb:

  • Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.
  • Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
  • Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.

Thanks for your attention and take care.

FYI: our transformers inference container already supports bge-m3; we expose ONNX images for it.

Very cool to know, but I already have 40 million vectors built with e5-large :slight_smile: Useful for another project, though. Thanks!

@rjalex sorry that it took me so long; here's a PR that adds support for the intfloat/multilingual-e5-large model and prepends the query/passage prefixes to vectorized inputs.

@rjalex I have released v1.12.0 of the transformers-inference container project, and you can use a pre-built intfloat-multilingual-e5-large docker image together with our text2vec-transformers module to produce embeddings with this model.
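
A minimal sketch of how that looks from the v4 Python client, assuming a collection configured with the text2vec-transformers vectorizer (ItalianDocs and content are placeholder names):

import weaviate

client = weaviate.connect_to_local()
docs = client.collections.get("ItalianDocs")  # placeholder collection name

# Insert the clean text only; the module sends it to the inference container,
# which now prepends "passage: " itself before embedding.
docs.data.insert(properties={"content": "La Torre di Pisa si trova in Toscana."})

# Searches are sent with taskType "query", so the container prepends "query: ".
result = docs.query.near_text(query="Dove si trova la Torre di Pisa?", limit=3)
for obj in result.objects:
    print(obj.properties["content"])

client.close()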

Thank you very much. This fixes one issue with this embedder :slight_smile: The other VERY RELEVANT aspect is that this model needs its vector output to be normalized before being stored or used for similarity. Here is a short doc explaining why; I hope it's useful.

Normalizing the embedding vectors produced by the intfloat/multilingual-e5-large model (or any dense retrieval model in the E5 family) is important for semantic similarity comparisons and retrieval tasks. Below is a detailed explanation of the reasons and best practices for normalization.


:red_question_mark: Why Normalize Embeddings

1. Cosine Similarity vs. Dot Product

Most retrieval and semantic similarity systems are based on cosine similarity, which compares the angle between two vectors, not their magnitude. However, many dense retrieval implementations (e.g., FAISS, Elasticsearch, Weaviate, etc.) are optimized to use the dot product for efficiency.

  • Cosine similarity:

    $$
    \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}
    $$

  • Dot product after L2 normalization (i.e., unit norm):

    $$
    \vec{a}_{\text{norm}} \cdot \vec{b}_{\text{norm}} = \cos(\theta)
    $$

By normalizing each embedding vector to unit length (L2 norm = 1), the dot product becomes equivalent to cosine similarity, which is more meaningful in semantic spaces.
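
A quick numeric check of that equivalence (a throwaway NumPy snippet):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity computed on the raw vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the L2-normalized vectors gives the same number (~0.9839)
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(a_n @ b_n, cos)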

2. Performance in Retrieval

Normalized vectors are crucial in:

  • Similarity search (e.g., top-k retrieval).
  • Clustering or classification in embedding space.
  • Avoiding bias from vector magnitude (especially in transformer outputs where magnitude can correlate with input length or token entropy).
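
To make the magnitude point concrete, a toy example with made-up numbers:

import numpy as np

q = np.array([1.0, 0.0])   # query
a = np.array([0.9, 0.1])   # well aligned with q, small magnitude
b = np.array([2.0, 2.0])   # less aligned, but much larger magnitude

print(q @ a, q @ b)        # 0.9 vs 2.0 -> raw dot product ranks b first

norm = lambda v: v / np.linalg.norm(v)
print(norm(q) @ norm(a), norm(q) @ norm(b))  # ~0.994 vs ~0.707 -> a ranks first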

3. Model Training Convention

The E5 paper (“Text Embeddings by Weakly-Supervised Contrastive Pre-training”) and model documentation explicitly state that normalization should be applied at inference to follow the behavior used during training.


:white_check_mark: How to Normalize Embeddings

Assuming you’re using sentence-transformers or the Hugging Face transformers library:

Using torch:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text following E5's format
text = "passage: The Eiffel Tower is in Paris."  # or "query: ..." if it's a query
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    output = model(**inputs)

# E5 models use average pooling over the token embeddings (per the model card),
# not the CLS token: mask out padding, sum, and divide by the number of real tokens
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize the embedding to unit length
embeddings_normalized = F.normalize(embeddings, p=2, dim=1)

Using sentence-transformers:

If you’re using the SentenceTransformer interface (which wraps normalization internally if configured):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# encode() applies the model's configured pooling; normalize_embeddings=True
# ensures the output is L2-normalized to unit length
embedding = model.encode("query: what is the capital of France?", normalize_embeddings=True)

:pushpin: Summary

Aspect                        Reason/Implication
Normalize to unit length      Ensures cosine similarity ≈ dot product
Required for E5 models        Matches training conditions (contrastive learning setup)
Best practice for retrieval   Improves performance and semantic alignment
Method                        torch.nn.functional.normalize(..., p=2, dim=1) or model.encode(..., normalize_embeddings=True)

I thought this model already returned normalized embeddings. I will fix that in the next version, @rjalex. Thanks!
