Dear friends,
I need to embed millions of Italian-language strings using the well-tested intfloat/multilingual-e5-large model.
If anyone is interested, I have uploaded a small repo that shows how this is done and also tests the setup and performance of your new multilingual Weaviate service.
I am left with one question, though. As you can see from its Hugging Face model card, the strings to be vectorized should all be prefixed with the "passage: " string.
This prefix is, of course, only needed to generate the embedding and should not be saved to the DB.
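For anyone wanting to handle this client-side today, here is a minimal sketch of applying the prefix only at embedding time while storing the clean string, assuming the v4 Python client and a pre-created collection with vectorizer "none" (the "Document" collection and "text" property names are hypothetical):

```python
from sentence_transformers import SentenceTransformer
import weaviate

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Raw Italian strings: these are what we want stored in Weaviate.
texts = ["Roma è la capitale d'Italia.", "La pizza napoletana è famosa nel mondo."]

# Prefix only for embedding, per the model card; the prefix never reaches the DB.
vectors = model.encode(["passage: " + t for t in texts], normalize_embeddings=True)

client = weaviate.connect_to_local()
try:
    docs = client.collections.get("Document")  # hypothetical collection, vectorizer "none"
    for text, vec in zip(texts, vectors):
        docs.data.insert(properties={"text": text}, vector=vec.tolist())
finally:
    client.close()
```

Storing the clean string means search results come back ready to display, with no prefix to strip.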
@rjalex with our latest v1.31 (which should be released this week) I have added one small change to the transformers module.
Now when we run a query we send taskType: query, and when we send a passage request we add taskType: passage. We can use this information in our transformers inference container to prefix the text with either "passage: " or "query: " when the model in use is intfloat/multilingual-e5-large.
I can add support for this model and modify the transformers inference container so that it supports your case.
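Roughly, the prefixing step described above could look like the following hypothetical sketch; this is not the actual inference-container code, and the function and variable names are invented:

```python
# Models that were trained with "query: " / "passage: " input prefixes.
E5_STYLE_MODELS = {"intfloat/multilingual-e5-large"}

def prepare_text(text: str, task_type: str, model_name: str) -> str:
    """Prefix the input based on the taskType sent with the request."""
    if model_name in E5_STYLE_MODELS and task_type in ("query", "passage"):
        return f"{task_type}: {text}"
    return text

# prepare_text("capitale d'Italia", "query", "intfloat/multilingual-e5-large")
# -> "query: capitale d'Italia"
```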
Hi Marcin, yes, that would be cool. https://huggingface.co/intfloat/multilingual-e5-large and https://huggingface.co/BAAI/bge-m3 are, as you probably know, the best multilingual embedders out there, and as you can see from their HF pages both Milvus and Vespa support them, so I'd say optimal support in my beloved Weaviate is strategic as well (and very useful for me).
Here is the FAQ from the e5-large HF page:
FAQ
1. Do I need to add the prefix "query: " and "passage: " to input texts?
Yes, this is how the model is trained, otherwise you will see a performance degradation.
Here are some rules of thumb:
Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.
Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.