We are using the Weaviate module text2vec-transformers with the pre-built image:
semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
The only problem is that our user profiles can get lengthy (e.g., work experiences, roles), and according to its documentation (sentence-transformers/multi-qa-MiniLM-L6-cos-v1 · Hugging Face), the model carries the following note:

> Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.

Are there any recommendations for handling longer inputs (>500 tokens)? Ideally something free or open source, usable for semantic search, and multi-language capable?
Thanks!
Hi @junbetterway !
That’s a really good question!
It would indeed be very helpful to have a list and a nice comparison of open source / free ML models.
Regarding your question on lengthy texts: Good news! We've got you covered!
Even if your text exceeds that model's limit, the t2v-transformers container service will chunk it up and create an averaged embedding from those chunks.
Here is where this magic happens: https://github.com/weaviate/t2v-transformers-models/blob/b089315ecc589ceb6e33deacba3fa5c2dc1c2627/vectorizer.py#L119-L133
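In simplified form, the idea is something like this (a rough sketch of that chunk-and-average logic using sentence-transformers directly, not the exact implementation):

```python
# Rough sketch of the chunk-and-average behavior from vectorizer.py above;
# not the exact implementation.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
MAX_WORD_PIECES = 500  # stay under the model's 512-word-piece limit

def vectorize_long_text(text: str) -> np.ndarray:
    # Tokenize once, then split the word pieces into windows the model can handle.
    tokens = model.tokenizer.tokenize(text)
    chunks = [
        model.tokenizer.convert_tokens_to_string(tokens[i:i + MAX_WORD_PIECES])
        for i in range(0, len(tokens), MAX_WORD_PIECES)
    ]
    # Embed each chunk, then average the chunk vectors into a single embedding.
    vectors = model.encode(chunks)
    return np.mean(vectors, axis=0)
```

So a long profile still ends up as a single vector, just averaged over its chunks.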
And thanks for asking this question! I also learned about this hidden feature today.
I will make sure we document that.
Let me know if this helps.
Thanks @DudaNogueira for looking into this.
So given this, we do not need to worry about the model's token limit, and no changes are needed on our side? Is there a config needed to enable this?
Here is my local Docker setup (production looks similar):
```yaml
weaviate-vectorizer:
  image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
  environment:
    ENABLE_CUDA: 0 # set to 1 to enable if you have a GPU available for optimum performance

weaviate-core:
  image: semitechnologies/weaviate:1.21.2
  depends_on:
    weaviate-vectorizer:
      condition: service_started
  ports:
    - "8890:8080"
  volumes:
    - ./volumes/weaviate:/var/lib/weaviate
  environment:
    AUTHENTICATION_APIKEY_ENABLED: 'true'
    AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'XXXXXXXXXX'
    AUTHENTICATION_APIKEY_USERS: 'local-dev'
    PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
    AUTOSCHEMA_ENABLED: 'false'
    DEFAULT_VECTORIZER_MODULE: text2vec-transformers
    ENABLE_MODULES: text2vec-transformers
    TRANSFORMERS_INFERENCE_API: http://weaviate-vectorizer:8080
    CLUSTER_HOSTNAME: 'node1'
```
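For reference, here is roughly how we query it from Python (a sketch using the v3 weaviate-client; `UserProfile` and the query text are illustrative, and the key is the placeholder from the compose file above):

```python
# Minimal sketch: connect to the local Weaviate from the compose file above
# and run a semantic search. "UserProfile" is an illustrative class name.
import weaviate

client = weaviate.Client(
    url="http://localhost:8890",  # host port mapped to weaviate-core's 8080
    auth_client_secret=weaviate.AuthApiKey(api_key="XXXXXXXXXX"),
)

result = (
    client.query.get("UserProfile", ["name"])
    .with_near_text({"concepts": ["backend engineer with fintech experience"]})
    .with_limit(3)
    .do()
)
print(result)
```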
That’s right.
This all happens in the transformers-inference Docker container before the vector is returned to Weaviate.
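If you want to verify it yourself, you can call the inference container directly; its `/vectors` endpoint is what Weaviate uses internally. A quick sketch (this assumes you publish the vectorizer container's port 8080 on the host, e.g. as 8081, which the compose file above does not do by default):

```python
# Send a text well past the 512-word-piece limit straight to the
# transformers-inference container and confirm you still get one vector back.
import requests

long_text = "some long profile text " * 500  # far beyond 512 word pieces

resp = requests.post(
    "http://localhost:8081/vectors",  # assumes the vectorizer port is published
    json={"text": long_text},
)
vector = resp.json()["vector"]
print(len(vector))  # a single 384-dimensional vector for this MiniLM model
```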
Great, thanks @DudaNogueira - it would be great to have documentation about this; if we had seen it sooner, we would not have needed to select specific properties to vectorize. Still, I can perform a schema change and re-vectorize - no biggie, since for now we only have ~3k user profiles.
Anyway, I am not sure why I can't find the solution button to click, since you answered this spot on! Thanks again.
Awesome!
I have moved this to the support category. That’s the only one that has this feature.
Thanks for being such an active user and asking really nice questions!
We really appreciate it!!