Does Weaviate have good support for non-English (multilingual) search?

We have implemented a semantic search engine for our user profiles, and it works really well at the moment. However, we have a growing user base from Spain and Italy, so we want to support multilingual search, since their profile info is written in their native languages (e.g., Spanish, Italian).

We are currently using text2vec-transformers with the Weaviate pre-built image:
semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1, which does not look like it is meant for multilingual support?

So basically, it's a three-part question:

  1. Changing the ML model requires re-vectorization of existing data, right?
  2. What Weaviate pre-built ML images would you recommend for multi-lingual support?
  3. Lastly, after changing to a multilingual model, does this mean a hybrid-search query in any language will give me good results?

For example, say we have a user profile whose basic info states “I am good at playing drums or any percussion instrument”. Would a search query like:

  • Buscando músicos que saben tocar la batería (in Spanish)
  • Cerco musicisti che sappiano suonare la batteria (in Italian)

which both mean in English “Looking for musicians who know how to play drums”, give me the intended results? My worry is that I saw this blog about the current limitations of Weaviate with non-English languages. Thanks!

Hi @junbetterway !!

I believe that as long as the model supports multiple languages, you should have no issues with Weaviate doing vector search (for keyword/hybrid, see below).

Regarding your questions:

1 - Yes, if you change the vectorizer model, you will need to ingest your data again, which forces Weaviate to vectorize it again. There is a data migration guide that can help you here.

2 - I am not sure about the pre-built images. I usually go with the Cohere model.

3 - Using keyword search alone for multilingual searches can bring you odd results, as the words are not the same. But the vector search should work, as “drum” will be close to “batería”. So you will have to tweak the alpha parameter of the hybrid search to get the best results. Of course, some experimenting will be needed.
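
To make the effect of alpha a bit more concrete, here is a rough, self-contained sketch in plain Python. It is not Weaviate's actual fusion code, just an illustration of the idea that alpha = 1 means pure vector search and alpha = 0 means pure keyword search, with values in between blending the two. The function names, profile IDs, and all scores are made up for this example, and the min-max normalization is only an assumption that resembles a relative-score style of fusion.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]. If all scores are equal, every item gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend(keyword_scores: dict[str, float],
          vector_scores: dict[str, float],
          alpha: float) -> list[tuple[str, float]]:
    """Weighted blend of normalized scores (illustrative only).

    alpha = 1.0 -> only the vector score counts
    alpha = 0.0 -> only the keyword score counts
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for the Spanish query about "batería" against English
# profiles. The keyword side only matches a profile that literally contains
# the word "batería" (here an invented cookware seller), while the
# multilingual vector side ranks the drummer profile highest.
keyword_scores = {"cookware_seller": 2.1}
vector_scores = {"drummer": 0.82, "guitarist": 0.55, "cookware_seller": 0.20}

for alpha in (0.25, 0.75):
    top_id, _ = blend(keyword_scores, vector_scores, alpha)[0]
    print(f"alpha={alpha}: top result is {top_id}")
```

With these invented numbers, a low alpha lets the literal keyword match win, while a higher alpha lets the semantically close drummer profile come out on top, which is why leaning towards the vector side is a reasonable starting point for cross-language queries.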

I believe the limitation is only for non-alphabetic languages (I see it’s worded as non-English :thinking:). I will check with the author.

Thanks!


I see, thanks as always for this, @DudaNogueira.

By the way, for item #2, I saw this Hugging Face model:

which is part of the Weaviate pre-built ML images, but I will need to test it out soon.
