We are using the Weaviate module text2vec-transformers with the pre-built image:
semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
The only problem is that our user profiles can get lengthy (e.g., work experiences, roles), and according to its documentation (sentence-transformers/multi-qa-MiniLM-L6-cos-v1 · Hugging Face), the model carries the following note:

> Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.

Are there any recommendations for handling longer inputs (>500 tokens)? Ideally something free or open source, usable for semantic search, and multi-language capable?
Thanks!
Hi @junbetterway !
That’s a really good question!
It would indeed be very helpful to have a list and a nice comparison of open source / free ML models.
Regarding your question on lengthy texts: Good news! We've got you covered!
Even if your text exceeds that model's limit, the t2v-transformers container service will chunk it up and create an averaged embedding from those chunks.
Here is where this magic happens: https://github.com/weaviate/t2v-transformers-models/blob/b089315ecc589ceb6e33deacba3fa5c2dc1c2627/vectorizer.py#L119-L133
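In simplified form, the idea is something like this (a rough sketch of that chunk-and-average logic using sentence-transformers directly, not the exact implementation):

```python
# Rough sketch of the chunk-and-average behavior from vectorizer.py above;
# not the exact implementation.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
MAX_WORD_PIECES = 500  # stay under the model's 512-word-piece limit

def vectorize_long_text(text: str) -> np.ndarray:
    # Tokenize once, then split the word pieces into windows the model can handle.
    tokens = model.tokenizer.tokenize(text)
    chunks = [
        model.tokenizer.convert_tokens_to_string(tokens[i:i + MAX_WORD_PIECES])
        for i in range(0, len(tokens), MAX_WORD_PIECES)
    ]
    # Embed each chunk, then average the chunk vectors into a single embedding.
    vectors = model.encode(chunks)
    return np.mean(vectors, axis=0)
```

So a long profile still ends up as a single vector, just averaged over its chunks.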
And thanks for asking this question! I also learned about this hidden feature today.
I will make sure we document that.
Let me know if this helps.
Thanks @DudaNogueira for looking into this.
So given this, we do not need to worry about the model's token limit, and no changes are needed on our side? Is there a config needed to enable this?
Here is my local Docker setup (production looks similar):
```yaml
weaviate-vectorizer:
  image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
  environment:
    ENABLE_CUDA: 0 # set to 1 to enable if you have a GPU available for optimum performance

weaviate-core:
  image: semitechnologies/weaviate:1.21.2
  depends_on:
    weaviate-vectorizer:
      condition: service_started
  ports:
    - "8890:8080"
  volumes:
    - ./volumes/weaviate:/var/lib/weaviate
  environment:
    AUTHENTICATION_APIKEY_ENABLED: 'true'
    AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'XXXXXXXXXX'
    AUTHENTICATION_APIKEY_USERS: 'local-dev'
    PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
    AUTOSCHEMA_ENABLED: 'false'
    DEFAULT_VECTORIZER_MODULE: text2vec-transformers
    ENABLE_MODULES: text2vec-transformers
    TRANSFORMERS_INFERENCE_API: http://weaviate-vectorizer:8080
    CLUSTER_HOSTNAME: 'node1'
```
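For reference, here is roughly how we query it from Python (a sketch using the v3 weaviate-client; `UserProfile` and the query text are illustrative, and the key is the placeholder from the compose file above):

```python
# Minimal sketch: connect to the local Weaviate from the compose file above
# and run a semantic search. "UserProfile" is an illustrative class name.
import weaviate

client = weaviate.Client(
    url="http://localhost:8890",  # host port mapped to weaviate-core's 8080
    auth_client_secret=weaviate.AuthApiKey(api_key="XXXXXXXXXX"),
)

result = (
    client.query.get("UserProfile", ["name"])
    .with_near_text({"concepts": ["backend engineer with fintech experience"]})
    .with_limit(3)
    .do()
)
print(result)
```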
That’s right.
This all happens in the transformers-inference Docker container before the vector is returned to Weaviate.
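If you want to verify it yourself, you can call the inference container directly; its `/vectors` endpoint is what Weaviate uses internally. A quick sketch (this assumes you publish the vectorizer container's port 8080 on the host, e.g. as 8081, which the compose file above does not do by default):

```python
# Send a text well past the 512-word-piece limit straight to the
# transformers-inference container and confirm you still get one vector back.
import requests

long_text = "some long profile text " * 500  # far beyond 512 word pieces

resp = requests.post(
    "http://localhost:8081/vectors",  # assumes the vectorizer port is published
    json={"text": long_text},
)
vector = resp.json()["vector"]
print(len(vector))  # a single 384-dimensional vector for this MiniLM model
```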
Great, thanks @DudaNogueira - it would be great to have documentation about this; if we had seen it sooner, we would not have needed to select specific properties to vectorize. Still, I can perform a schema change and re-vectorize - no biggie, since for now we only have ~3k user profiles.
Anyway, I am not sure why I can't find the solution button to click, since you answered this spot on! Thanks again.
Awesome!
I have moved this to the support category. That’s the only one that has this feature.
Thanks for being such an active user and asking really nice questions!
We really appreciate it!!