Thank you very much. This fixes one issue with this embedder.
The other **very relevant** aspect is that this model needs its output vectors to be L2-normalized before they are stored or used for similarity. Below is a short write-up explaining why; I hope it's useful.
Normalizing the embedding vectors produced by the intfloat/multilingual-e5-large model (or any dense retrieval model in the E5 family) is important for semantic similarity comparisons and retrieval tasks. Below is a detailed explanation of the reasons and best practices for normalization.
Why Normalize Embeddings
1. Cosine Similarity vs. Dot Product
Most retrieval and semantic similarity systems are based on cosine similarity, which compares the angle between two vectors rather than their magnitude. However, many dense retrieval backends (e.g., FAISS, Elasticsearch, Weaviate) are optimized to use the dot product for efficiency.
- Cosine similarity:
$$
\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}
$$
- Dot product after L2 normalization (i.e., unit norm):
$$
\vec{a}_{\text{norm}} \cdot \vec{b}_{\text{norm}} = \cos(\theta)
$$
By normalizing each embedding vector to unit length (L2 norm = 1), the dot product becomes equivalent to cosine similarity, which is more meaningful in semantic spaces.
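As a quick sanity check of this equivalence (a minimal sketch using plain torch, independent of any particular embedder; the vectors are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Two arbitrary, non-unit vectors standing in for embeddings
a = torch.tensor([3.0, 1.0, 2.0])
b = torch.tensor([1.0, 0.5, 4.0])

# Cosine similarity on the raw vectors
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

# Dot product after L2 normalization
a_n = F.normalize(a, p=2, dim=0)
b_n = F.normalize(b, p=2, dim=0)
dot = torch.dot(a_n, b_n).item()

print(cos, dot)  # identical up to floating-point error
```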
2. Performance in Retrieval
Normalized vectors are crucial in the following scenarios (see the FAISS sketch after this list):
- Similarity search (e.g., top-k retrieval).
- Clustering or classification in embedding space.
- Avoiding bias from vector magnitude (especially in transformer outputs where magnitude can correlate with input length or token entropy).
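For example, with FAISS (one of the dot-product-optimized backends mentioned above), the usual pattern is to L2-normalize every vector and then use an inner-product index, so the returned scores are cosine similarities. A minimal sketch, using random placeholder arrays rather than real model output:

```python
import faiss
import numpy as np

dim = 1024  # output dimension of multilingual-e5-large
corpus = np.random.rand(1000, dim).astype("float32")  # placeholder corpus embeddings

# Normalize in place so that inner product == cosine similarity
faiss.normalize_L2(corpus)

index = faiss.IndexFlatIP(dim)  # exact inner-product (dot product) index
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")  # placeholder query embedding
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # top-5 cosine similarities and their indices
```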
3. Model Training Convention
The E5 paper (“Text Embeddings by Weakly-Supervised Contrastive Pre-training”) and model documentation explicitly state that normalization should be applied at inference to follow the behavior used during training.
How to Normalize Embeddings
Assuming you're using sentence-transformers or the HuggingFace transformers pipeline:
Using torch:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text following E5's prefix format
text = "passage: The Eiffel Tower is in Paris."  # or "query: ..." if it's a query
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    output = model(**inputs)

# E5 uses mean pooling over token embeddings (masking out padding), not the CLS token
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize the embedding to unit length (L2 norm = 1)
embeddings_normalized = F.normalize(embeddings, p=2, dim=1)
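Once both query and passage embeddings are normalized this way, cosine similarity is just a matrix product. A short usage sketch (`query_emb` and `passage_embs` are illustrative names for tensors produced the same way, one row per text):

```python
# query_emb: shape (1, 1024); passage_embs: shape (N, 1024); both L2-normalized
scores = query_emb @ passage_embs.T   # (1, N) cosine similarities
best = scores.argmax(dim=1)           # index of the most similar passage
```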
Using sentence-transformers:
If you're using the SentenceTransformer interface (which applies pooling, and optionally normalization, internally):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# encode() applies the pooling defined in the model's sentence-transformers config;
# normalize_embeddings=True guarantees the output is L2-normalized
embedding = model.encode("query: what is the capital of France?", normalize_embeddings=True)
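As a quick usage check (the passage text here is just an illustrative example), the dot product of two such normalized embeddings is their cosine similarity:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

query_emb = model.encode("query: what is the capital of France?", normalize_embeddings=True)
passage_emb = model.encode("passage: Paris is the capital of France.", normalize_embeddings=True)

score = float(query_emb @ passage_emb)  # dot product == cosine similarity for unit vectors
print(score)
```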
Summary
| Aspect | Reason/Implication |
| --- | --- |
| Normalize to unit length | Ensures cosine similarity ≈ dot product |
| Required for E5 models | Matches training conditions (contrastive learning setup) |
| Best practice for retrieval | Improves performance and semantic alignment |
| Method | `torch.nn.functional.normalize(..., p=2, dim=1)` or `model.encode(..., normalize_embeddings=True)` |