Query regarding similarity search

I have a column in my schema which stores email message content in different languages. If I run a similarity search on this class, will it return similar objects regardless of language, by understanding the context of the email message?
My expected output would be an array of messages in different languages with the same context.

Will it work?

Can someone please assist with my query?
Thanks in advance.

Hi @Sriparna (I’ve moved this to support from general).

Yes, this is possible. For this to work, your embedding model would have to be able to understand these different languages.
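To illustrate why the model matters: a multilingual embedding model maps sentences with the same meaning to nearby vectors regardless of language, and vector search then ranks by distance. A minimal sketch with made-up vectors and cosine similarity (the numbers are invented for illustration; a real model would produce much higher-dimensional embeddings):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two vectors -- the kind of
    distance measure a vector search ranks results by."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings" for illustration only.
# A multilingual model would place the English and German
# sentences close together because they mean the same thing.
en = [0.9, 0.1, 0.2, 0.3]          # "Your order has shipped"
de = [0.88, 0.12, 0.21, 0.3]       # "Ihre Bestellung wurde versandt"
unrelated = [0.1, 0.9, 0.7, 0.0]   # "Reset your password"

print(cosine_similarity(en, de))         # close to 1.0
print(cosine_similarity(en, unrelated))  # much lower
```

If the model is monolingual, the English and German sentences land far apart instead, and cross-lingual matches won't surface.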

One model that you can do this with is Cohere’s multilingual model.

From experience, I believe OpenAI’s ada-002 is also somewhat multilingual, but as far as I know it’s not explicitly trained for that purpose.

So I recommend starting with the Cohere module and multilingual model in this case.

Thanks @jphwang. Just reconfirming: I am using the default text2vec-transformers module for vectorization of my column. So it will not be able to understand the context if the email content is not in English, right?

Hi @Sriparna - if you are using text2vec-transformers, you could try one of the multilingual models available.

You can, for example, try paraphrase-multilingual-MiniLM-L12-v2.

Thanks for clarifying @jphwang. Really appreciate your quick help with my queries. Currently my docker-compose file looks like this:

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    restart: on-failure
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: 'text2vec-transformers,text2vec-openai,generative-openai'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
```

Here is my sample schema creation code:

```python
class_obj = {
    "class": "Product",
    "description": "Product Schema",
    "properties": [
        {
            "dataType": ["text"],
            "description": "prodName",
            "name": "prodName",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
        },
        {
            "dataType": ["text"],
            "description": "prodDesc",
            "name": "prodDesc",
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
        },
    ],
    "vectorizer": "text2vec-transformers",
}
```
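For completeness, here is a sketch of sending such a schema with the Weaviate Python client (v3 API, assuming an instance at localhost:8080 - the network call is commented out so the snippet runs standalone). Note that `skip` and `vectorizePropertyName` are booleans, not strings; a single property is shown for brevity:

```python
# Sketch: create a class with the v3 Weaviate Python client.
# The actual call requires a running Weaviate instance.
minimal_class = {
    "class": "Product",
    "description": "Product Schema",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "prodName",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,                  # boolean, not "false"
                    "vectorizePropertyName": False,
                }
            },
        }
    ],
}

# import weaviate
# client = weaviate.Client("http://localhost:8080")
# client.schema.create_class(minimal_class)
```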

Now if I want to use the paraphrase-multilingual-MiniLM-L12-v2 model, I just need to change my docker-compose.yml file as below, and no change is required in my schema creation code, right? Or do I have to specify this model name somewhere in my schema creation code as well?

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.19.6
    restart: on-failure
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: 'text2vec-transformers,text2vec-openai,generative-openai'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2
    environment:
      ENABLE_CUDA: 0 # set to 1 to enable
      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
```

Yup. To set the image (model) to be used, you just need to set it in the docker-compose file, as you have done above.

And you will have to re-vectorize your data, of course.
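Once the data is re-vectorized, the original goal - retrieving emails with the same meaning across languages - would be a plain nearText query. A sketch assuming the v3 Python client, a local instance, and a hypothetical `Email` class with a `content` property (the network calls are commented out):

```python
import json

# The nearText argument: with a multilingual model, the query text
# can be in any supported language, and matches come back regardless
# of the language the stored emails were written in.
near_text = {"concepts": ["order confirmation for my purchase"]}

# The actual query (requires a running Weaviate instance and an
# "Email" class, both hypothetical here):
# import weaviate
# client = weaviate.Client("http://localhost:8080")
# result = (
#     client.query
#     .get("Email", ["content"])
#     .with_near_text(near_text)
#     .with_limit(5)
#     .do()
# )
# print(json.dumps(result, indent=2))

print(json.dumps(near_text))
```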

Cheers,
JP

Got it. Thanks a lot!