Weaviate Text Embedding Variations

Hi team,

We recently began exploring Weaviate, and I need clarification on the embeddings generated by the text2vec-transformers module. In my setup, I’ve enabled the module by setting “enable” to true and “tag” to sentence-transformers-all-MiniLM-L12-v2. However, after indexing completed, I noticed that the embeddings generated by the transformers module differ from those produced by the same HF model in a standalone setup.

Here’s my input sentence:

input = “Combining text and image embeddings allows powerful systems for tasks like image search, classification, and description.”

Embeddings generated in the standalone setup by the HF model,  
> [0.020367290824651718, -0.06027209013700485, 0.08994261175394058, 0.021357210353016853, 0.01984591782093048, 0.05606047809123993, -0.02014531008899212, 0.006495875306427479, 0.031136102974414825, 0.024565359577536583, 0.08532216399908066, ... ]


Embeddings generated by Weaviate,
> [0.07658111,-0.24428493,0.36238655,0.08147758,0.10421621,0.28499788,-0.057393536,-0.027201267,0.18786895,0.11688445,0.34956443,-0.113155164,0.34250045,0.14217529,-0.31033084,-0.19569416,0.14555733,0.3528335,-0.07058265,0.032889266,0.115612...]

Weaviate version : 1.23.9

The vector generated by Weaviate differs from the vectors generated in the standalone setup with the same HF model. Could you please advise on what I might be missing here?

Hi @Sowmiya_jaganathan ! Welcome to our community! :hugs:

Before vectorizing your object, Weaviate takes some of your collection's configuration into consideration.

For example, it can vectorize the collection name, as well as property names and their values.

Based on all of these, Weaviate builds the text that will be vectorized.

For example, consider this configuration:

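import weaviate
from weaviate import classes as wvc

# Assumes a local Weaviate instance on the default ports
client = weaviate.connect_to_local()

# Start from a clean collection
client.collections.delete("Test")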
col = client.collections.create(
    name="Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
        vectorize_collection_name=True
    ),
    properties=[
        wvc.config.Property(
            name="text",
            vectorize_property_name=True,
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=False
        )
    ]
)

Once you ingest a new object:

col.data.insert({"text": "Combining text and image embeddings allows powerful systems for tasks like image search, classification, and description."})

This is the resulting text that will be vectorized:

test text combining text and image embeddings allows powerful systems for tasks like image search, classification, and description.

Note: the text is all lowercase, and the collection name and property name come first.
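
The rule described above can be approximated in a few lines of Python. This is only a sketch of the behavior described in this thread, not Weaviate's actual implementation: the helper name `build_vectorized_text` and its parameters are hypothetical, and it assumes the collection name comes first, followed by each property name and value, all lowercased and joined by spaces.

```python
def build_vectorized_text(collection_name, properties,
                          vectorize_collection_name=True,
                          vectorize_property_name=True):
    """Approximate the text sent to the vectorizer (hypothetical helper)."""
    parts = []
    # The collection name goes first, if it is configured to be vectorized.
    if vectorize_collection_name:
        parts.append(collection_name)
    # Each property contributes its name (optionally) and then its value.
    for name, value in properties.items():
        if vectorize_property_name:
            parts.append(name)
        parts.append(value)
    # Everything is joined with spaces and lowercased.
    return " ".join(parts).lower()


text = build_vectorized_text(
    "Test",
    {"text": "Combining text and image embeddings allows powerful systems "
             "for tasks like image search, classification, and description."},
)
print(text)
```

With both flags set to False, the same helper returns only the lowercased property value, which is probably closer to what a standalone model call would receive.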

Now, if you want to get the exact same vector directly from the model, you can run a Docker Compose setup like this one:

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.23.9
    ports:
    - 8080:8080
    - 50051:50051
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'node1'
      LOG_LEVEL: 'trace'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L12-v2
    environment:
      ENABLE_CUDA: '0'
    ports:
    - 9000:8080 
...

Note: port 9000 on the host is mapped to the model container.

Now you can request a vector directly from the model, like this:

import requests
json_data = {
    'text': 'test text combining text and image embeddings allows powerful systems for tasks like image search, classification, and description.',
}

response = requests.post('http://localhost:9000/vectors', json=json_data)

Finally, validate that the vectors are the same:

response.json().get("vector") == col.query.fetch_objects(include_vector=True).objects[0].vector.get("default")
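
If the exact `==` check ever returns False because of small floating-point or serialization differences, a tolerant comparison may be more informative. The following is a minimal sketch using only the standard library; the helper names `vectors_match` and `cosine_similarity`, the tolerance values, and the sample vectors are all illustrative assumptions, not part of Weaviate or its client.

```python
import math


def vectors_match(a, b, rel_tol=1e-6, abs_tol=1e-6):
    """Return True if both vectors have the same length and near-equal values."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up sample values, only to show the calls.
model_vector = [0.25, -0.5, 0.75]
stored_vector = [0.25, -0.5, 0.75 + 1e-9]

print(vectors_match(model_vector, stored_vector))
print(cosine_similarity(model_vector, stored_vector))
```

In practice, `model_vector` would be `response.json().get("vector")` and `stored_vector` the vector fetched from the collection. A cosine similarity close to 1 with a failing element-wise match would suggest the two vectors differ only in scale, for example when one side is normalized.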

Ps: in order to get the text that was vectorized, I entered the container, and added a print(item) here:

We had some discussions last week about having modules log the payload to be vectorized at DEBUG level. That “feature” would help in this kind of situation.

Let me know if this helps :slight_smile: