Recommendations for free ML models for Weaviate text2vec-transformers for semantic search purposes?

We are using the Weaviate module text2vec-transformers with the pre-built image:

semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1

The only problem is that our user profiles can get lengthy (e.g., work experiences, roles), and per its documentation (sentence-transformers/multi-qa-MiniLM-L6-cos-v1 · Hugging Face), the above model comes with the following note:

Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.

Are there any recommended models that handle higher token counts (>500), are free or open source, are suitable for semantic search purposes, and are multi-language capable?

Thanks!

Hi @junbetterway !

That’s a really good question!

It would indeed be very helpful to have a list and some nice comparison about open source / free ML models.

Regarding your question on lengthy texts: good news! We've got you covered!

Even if your text exceeds that model's limit, the t2v-transformers container service will split it into chunks, vectorize each one, and average the resulting embeddings into a single vector.

Here is where this magic happens: https://github.com/weaviate/t2v-transformers-models/blob/b089315ecc589ceb6e33deacba3fa5c2dc1c2627/vectorizer.py#L119-L133
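
For reference, here is a rough sketch of that chunk-and-average idea using the sentence-transformers library directly — a simplified illustration, not the container's exact code (the 500-token window is an assumption, chosen to stay under the 512 word-piece limit):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
MAX_TOKENS = 500  # assumed window size, kept under the model's 512 word-piece limit

def vectorize_long_text(text: str) -> np.ndarray:
    # Split the word-piece sequence into windows the model can handle.
    tokens = model.tokenizer.tokenize(text)
    chunks = [
        model.tokenizer.convert_tokens_to_string(tokens[i:i + MAX_TOKENS])
        for i in range(0, len(tokens), MAX_TOKENS)
    ] or [text]
    # Embed each chunk, then average the chunk embeddings into one vector.
    return np.mean(model.encode(chunks), axis=0)

vector = vectorize_long_text("a very long user profile ... " * 300)
print(vector.shape)  # (384,) for this MiniLM model
```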

And thanks for asking this question! I also learned about this hidden feature today :exploding_head:

I will make sure we document that :wink:

Let me know if this helps :slight_smile:

Thanks @DudaNogueira for looking into this.

So given this, we do not need to worry about the model's token limit, and no changes are needed on our side? Is there any config needed to enable this?

Here is my local Docker setup (production looks similar):

```yaml
  weaviate-vectorizer:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0 # set to 1 if you have a GPU available, for optimum performance

  weaviate-core:
    image: semitechnologies/weaviate:1.21.2
    depends_on:
      weaviate-vectorizer:
        condition: service_started
    ports:
      - "8890:8080"
    volumes:
      - ./volumes/weaviate:/var/lib/weaviate
    environment:
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'XXXXXXXXXX'
      AUTHENTICATION_APIKEY_USERS: 'local-dev'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      AUTOSCHEMA_ENABLED: 'false'
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://weaviate-vectorizer:8080
      CLUSTER_HOSTNAME: 'node1'
```

That’s right.

This will all happen in the transformers-inference Docker container before the vector is returned to Weaviate.
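
If you want to see it for yourself, you can temporarily publish the vectorizer's port (e.g. add `ports: ["9090:8080"]` to weaviate-vectorizer above — the host port is an assumption) and call the inference container's /vectors endpoint directly with a long text:

```python
import requests

long_text = "work experiences, roles, ... " * 300  # well past 512 word pieces

resp = requests.post("http://localhost:9090/vectors", json={"text": long_text})
resp.raise_for_status()
# The container returns one averaged embedding, no matter the input length.
print(len(resp.json()["vector"]))  # e.g. 384 for this MiniLM model
```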

Great, thanks @DudaNogueira - it would be great to have documentation about this; had we seen it sooner, we would not have needed to select which properties to vectorize :slight_smile: - though I can perform the schema change and then re-vectorize - no biggie, for now we only have ~3k user profiles.
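
In case it helps anyone else, the re-vectorization is essentially a re-import. Here is a sketch with the Python client v3 (matching Weaviate 1.21.x); the `UserProfile` class, its fields, and `load_profiles()` are hypothetical placeholders:

```python
import weaviate

client = weaviate.Client(
    url="http://localhost:8890",
    auth_client_secret=weaviate.AuthApiKey(api_key="XXXXXXXXXX"),
)

profiles = load_profiles()  # hypothetical: read the ~3k profiles from your primary store

client.batch.configure(batch_size=100)
with client.batch as batch:
    for p in profiles:
        # Weaviate sends the text properties to t2v-transformers, which
        # chunks and averages them as described above.
        batch.add_data_object(
            data_object={"name": p["name"], "experience": p["experience"]},
            class_name="UserProfile",
        )
```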

Anyway, I am not sure why I can't find the solution button to click, since you have answered this spot on! Thanks again.

Awesome!

I have moved this to the Support category - that's the only one that has the solution feature.

Thanks for being such an active user and asking really nice questions!

We really appreciate it :hugs: !!
