Query regarding similarity search

I have a column in my schema which stores email message content in different languages. If I run a similarity search on this class, will it return similar objects regardless of language, by understanding the context of the email message?
My expected output would be an array of messages in different languages with the same context.

Will it work?

Can someone please assist with my query?
Thanks in advance.

Hi @Sriparna (I’ve moved this to support from general).

Yes, this is possible. For this to work, your embedding model would have to be able to understand these different languages.
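To illustrate why the model matters: a multilingual embedding model maps sentences with the same meaning to nearby vectors regardless of language, and vector search then ranks by distance. A minimal sketch with made-up vectors and cosine similarity (the numbers are invented for illustration; a real model would produce much higher-dimensional embeddings):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two vectors -- the kind of
    distance measure a vector search ranks results by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" for illustration only.
# A multilingual model would place the English and German
# sentences close together because they mean the same thing.
en = [0.9, 0.1, 0.2, 0.3]          # "Your order has shipped"
de = [0.88, 0.12, 0.21, 0.3]       # "Ihre Bestellung wurde versandt"
unrelated = [0.1, 0.9, 0.7, 0.0]   # "Reset your password"

print(cosine_similarity(en, de))         # close to 1.0
print(cosine_similarity(en, unrelated))  # much lower
```

If the model is monolingual, the English and German sentences land far apart instead, and cross-lingual matches won't surface.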

One model that you can do this with is Cohere’s multilingual model.

From experience, I believe OpenAI’s ada-002 is also somewhat multilingual, but as far as I know it’s not explicitly trained for that purpose.

So I recommend starting with the Cohere module and multilingual model in this case.

Thanks @jphwang. Just reconfirming: I am using the default text2vec-transformers module for vectorization of my column. So it will not be able to understand the context if the email content is not in English, right?

Hi @Sriparna - if you are using text2vec-transformers, you could try one of the multilingual models available.

You can, for example, try paraphrase-multilingual-MiniLM-L12-v2.

Thanks for clarifying @jphwang. Really appreciate your quick help with my queries. Currently my docker-compose file looks like this:

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    restart: on-failure
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: 'text2vec-transformers,text2vec-openai,generative-openai'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
```

Here is my sample schema creation code:

```python
class_obj = {
    "class": "Product",
    "description": "Product Schema",
    "properties": [
        {
            "dataType": ["text"],
            "description": "prodName",
            "name": "prodName",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
        },
        {
            "dataType": ["text"],
            "description": "prodDesc",
            "name": "prodDesc",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
        },
    ],
    "vectorizer": "text2vec-transformers",
}
```
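For completeness, here is a sketch of sending such a schema with the Weaviate Python client (v3 API, assuming an instance at localhost:8080 - the network call is commented out so the snippet runs standalone). Note that `skip` and `vectorizePropertyName` are booleans, not strings; a single property is shown for brevity:

```python
# Sketch: create a class with the v3 Weaviate Python client.
# The actual call requires a running Weaviate instance.
minimal_class = {
    "class": "Product",
    "description": "Product Schema",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "prodName",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,                  # boolean, not "false"
                    "vectorizePropertyName": False,
                }
            },
        }
    ],
}

# import weaviate
# client = weaviate.Client("http://localhost:8080")
# client.schema.create_class(minimal_class)
```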

Now if I want to use the paraphrase-multilingual-MiniLM-L12-v2 model, I just need to change my docker-compose.yml file as below, and no change is required in my schema creation code, right? Or do I have to specify this model name somewhere in my schema creation code as well?

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    restart: on-failure
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: 'text2vec-transformers,text2vec-openai,generative-openai'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
```

Yup. To set the image (model) to be used, you just need to set it in the docker-compose file, as you have done above.

And you will have to re-vectorize your data, of course.
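Once the data is re-vectorized, the original goal - retrieving emails with the same meaning across languages - would be a plain nearText query. A sketch assuming the v3 Python client, a local instance, and a hypothetical `Email` class with a `content` property (the network calls are commented out):

```python
import json

# The nearText argument: with a multilingual model, the query text
# can be in any supported language, and matches come back regardless
# of the language the stored emails were written in.
near_text = {"concepts": ["order confirmation for my purchase"]}

# The actual query (requires a running Weaviate instance and an
# "Email" class, both hypothetical here):
# import weaviate
# client = weaviate.Client("http://localhost:8080")
# result = (
#     client.query
#     .get("Email", ["content"])
#     .with_near_text(near_text)
#     .with_limit(5)
#     .do()
# )
# print(json.dumps(result, indent=2))

print(json.dumps(near_text))
```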

Cheers,
JP

Got it. Thanks a lot!