Multi-Lingual Cosine Similarity Search

Description

I have added some Hebrew texts to my embeddings. I use nearText queries with the text2vec-openai vectorizer.

I am curious whether these queries will match similar contexts in the Hebrew texts.

Server Setup Information

  • Weaviate Server Version:
  • Deployment Method: WCS
  • Multi Node? Number of Running Nodes:
  • Client Language and Version: 1.23.10

Any additional Information

Hi!

It will depend on the model you use for vectorization.

If it supports your language, it should work fine as long as Weaviate is aware :slight_smile:

Let me know if this helps!

gpt-4 is the model. It supports Hebrew.

How is Weaviate made aware?

When you create a collection, you can specify a vectorizer model.

As long as this vectorizer model supports multiple languages, and your collection was configured to use it, it should work.

For example:

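import weaviate.classes as wvc

# Assumes `client` is an already-connected Weaviate v4 Python client.
client.collections.create(
    "Article",
    # Vectorize objects and queries with Cohere's multilingual embedding model.
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v2.0",
        # Include the collection name in the text that gets vectorized.
        vectorize_collection_name=True
    ),
)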

Cohere has a nice multilingual model.

Let me know if this helps :slight_smile:

Yes, thanks.

Actually, I want to create a new cluster and use text-embedding-3-large as my embedding model. Do you have a way of finding out if it is multi-lingual? I have asked on the OpenAI developer forum, but so far nobody seems to know.

It looks like regardless of whether the model is multi-lingual or not, if I add the English translation to the embedding, the cosine similarity search will find it. That’s good to know.
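
One way to check the multi-lingual question empirically, rather than relying on documentation, is to embed a Hebrew sentence, its English translation, and an unrelated sentence, then compare cosine similarities. The sketch below is only an illustration: `cross_lingual_margin` is a hypothetical helper, and `embed` is an assumed callable that you would wire up to whatever embedding model you want to test (it is not part of Weaviate or the OpenAI client).

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cross_lingual_margin(
    embed: Callable[[str], Vector],  # assumption: returns the embedding of a string
    text: str,
    translation: str,
    unrelated: str,
) -> float:
    """similarity(text, translation) minus similarity(text, unrelated).

    A consistently positive margin over many sentence triples suggests the
    model places translations close together; a single triple proves little.
    """
    v_text = embed(text)
    sim_translation = cosine_similarity(v_text, embed(translation))
    sim_unrelated = cosine_similarity(v_text, embed(unrelated))
    return sim_translation - sim_unrelated
```

For example, you could call `cross_lingual_margin(embed, "שלום עולם", "Hello world", "The invoice is overdue")` with `embed` backed by text-embedding-3-large, and repeat over a few dozen triples from your own data before drawing a conclusion.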