Custom t2v-transformers image produces different vectors

I created a custom vectorizer following the docs. My distilluse_v1.Dockerfile has the following:

FROM semitechnologies/transformers-inference:custom
RUN MODEL_NAME=sentence-transformers/distiluse-base-multilingual-cased ./download.py
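I built the image with something along these lines (exact invocation from memory, so the flags may have differed slightly):

docker build -f distilluse_v1.Dockerfile -t distiluse_v1-inference .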

With the image built and tagged as distiluse_v1-inference, my docker-compose.yml has the following in the t2v section:

t2v-transformers:
  image: distiluse_v1-inference
  environment:
    ENABLE_CUDA: 0
  ports:
    - 9090:8080
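For completeness, the weaviate service in the same compose file points at this container; trimmed to what I believe are the standard env vars for the text2vec-transformers module, it looks roughly like this:

weaviate:
  environment:
    ENABLE_MODULES: text2vec-transformers
    DEFAULT_VECTORIZER_MODULE: text2vec-transformers
    TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'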

Everything is “apparently” working fine, except that the vectors created have 768 dimensions, while the model should produce 512-dimension vectors.

So I am not sure how or why this is happening.

My guess is that my custom image is not really working and pulling the model I want, so the model used defaults to one of Weaviate's pre-built ones, most of which produce 768-dimension vectors. Is this plausible, or is there another reason? How can I check which model my image is using?
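For what it's worth, the inference container can also be queried directly on the mapped port, assuming it exposes the usual /meta endpoint of the transformers-inference image; that at least shows which model configuration was baked into the image:

curl http://localhost:9090/meta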

I have tested your model and there is indeed a discrepancy in the vector dimensions.

We have a custom implementation of our vectorizer logic, and this is the second time it has produced vectors with the wrong dimensions. Our transformers module also implements a second vectorizer that uses the SentenceTransformers library, and I have verified that when I switch to that library the vector dimensions are correct (512).
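For reference, a minimal check along these lines with the SentenceTransformers library shows the expected dimensionality (just a sketch, assuming the library is installed locally; the Dense projection from 768 to 512 bundled with this model is only applied by the SentenceTransformers pipeline):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased")
vec = model.encode("ein kurzer Test")
print(vec.shape)  # (512,)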

I will make a fix for this and release a new version of our transformers module by EoW.

Thank you for reporting this!
I will let you know once the new version is out.

Thank you. I recreated the Docker image a few times thinking I had done something wrong, so it is good to know it is a bug.

When do you think this release will be ready? Could you reply in this thread when it is released, please?

I will use one of the default models in the meantime. The reason I need a custom vectorizer is that I want a German-language-specific model.

Lastly, my data is sparse, so before embedding/inserting I am doing three things:

  • Deleting fields that have empty values.

  • For numeric fields, appending the name of the field to the string-converted value so the embedding has context (sketched below).
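Concretely, the preprocessing is something along these lines (a simplified sketch; the real field names differ):

def preprocess(record: dict) -> dict:
    """Drop empty fields and give numeric values some textual context."""
    cleaned = {}
    for name, value in record.items():
        if value in (None, "", [], {}):
            continue  # skip fields with empty values entirely
        if isinstance(value, (int, float)):
            # prefix the bare number with its field name, e.g. 30 -> "fireproof score = 30"
            cleaned[name] = f"{name} = {value}"
        else:
            cleaned[name] = value
    return cleaned

# e.g. {"fireproof score": 30, "notes": ""} -> {"fireproof score": "fireproof score = 30"}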

So I am not sure whether this is a good idea at all. Can I have a {… {“fireproof score”: “fireproof score = 30”} …} and produce an embedding from such fields? Will it produce good results if a user then searches for “x product with fireproof protection 30”?

Hello, I have merged this PR and will make a release of this module tomorrow, so stay tuned. With this fix you will be able to instruct the module to use the SentenceTransformerVectorizer, which will produce embeddings with the correct dimensions.

As for your question, if I understand correctly you are manually adding the property name to the property value, right? If so, you can skip that and add the "vectorizePropertyName": "true" setting to that property, so that Weaviate handles this for you.
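For example, the property definition in your class schema could look roughly like this (the property name is just a placeholder; in the schema JSON the setting sits under the property's moduleConfig for the text2vec-transformers module):

{
  "name": "fireproofScore",
  "dataType": ["text"],
  "moduleConfig": {
    "text2vec-transformers": {
      "vectorizePropertyName": true,
      "skip": false
    }
  }
}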

As for whether it will produce good search results, that is something you would need to test; my gut feeling is yes.

@erickfoxbase I have released a new version of the transformers-inference docker images.

If you add USE_SENTENCE_TRANSFORMERS_VECTORIZER=true to your custom Dockerfile, you will notice that the built image is now able to create vectors with the proper dimensions:

FROM semitechnologies/transformers-inference:custom
RUN USE_SENTENCE_TRANSFORMERS_VECTORIZER=true MODEL_NAME=sentence-transformers/distiluse-base-multilingual-cased ./download.py
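After rebuilding, you can double-check the dimensionality by calling the container directly (assuming you keep the 9090:8080 port mapping from your compose file and the image exposes the standard /vectors endpoint); the returned vector should now have 512 elements:

curl -s http://localhost:9090/vectors -H 'Content-Type: application/json' -d '{"text": "ein kurzer Test"}'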

Thank you @antas-marcin, it works great 🙂

I have a new problem now. I have described it in a new thread here

but basically I cannot access a gated HF repo to create a custom vectorizer. Maybe the module does not support injecting secrets into the Dockerfile yet? Or is it a bug? Or am I doing something incorrectly?
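To illustrate the kind of secret injection I mean, something along these lines is what I would expect to work (a purely hypothetical sketch; the model, file, and tag names are made up, and I do not know whether download.py honors the HF_TOKEN environment variable that huggingface_hub normally reads):

FROM semitechnologies/transformers-inference:custom
ARG HF_TOKEN
RUN HF_TOKEN=${HF_TOKEN} MODEL_NAME=some-org/gated-german-model ./download.py

built with docker build --build-arg HF_TOKEN=<token> -f gated_model.Dockerfile -t gated-model-inference .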

Thank you, it works great.