Recently, we’ve begun exploring Weaviate, and I need clarification on the embeddings generated by the text2vec-transformers module. In my setup, I enabled the text2vec-transformers module by setting `enable` to `true` and `tag` to `sentence-transformers-all-MiniLM-L12-v2`. However, after indexing completed, I noticed that the embeddings generated by the transformers module differ from those produced by the same HF model in a standalone setup.
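For context, the collection uses `text2vec-transformers` as its vectorizer. A minimal sketch of the schema with the Python client (v3 syntax; the class name `Test` and its single `text` property are assumptions for illustration, not taken from my actual schema):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical class definition; names are assumptions for illustration.
class_obj = {
    "class": "Test",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "text", "dataType": ["text"]},
    ],
}
client.schema.create_class(class_obj)
```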
Here’s my input sentence:
input = "Combining text and image embeddings allows powerful systems for tasks like image search, classification, and description."
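The standalone embeddings were produced along these lines (a sketch assuming the `sentence-transformers` library; the actual inference code may differ):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

text = ("Combining text and image embeddings allows powerful systems "
        "for tasks like image search, classification, and description.")

# encode() returns a 384-dimensional vector for this model
embedding = model.encode(text)
print(embedding[:10])
```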
Embeddings generated by the HF model in the standalone setup:
> [0.020367290824651718, -0.06027209013700485, 0.08994261175394058, 0.021357210353016853, 0.01984591782093048, 0.05606047809123993, -0.02014531008899212, 0.006495875306427479, 0.031136102974414825, 0.024565359577536583, 0.08532216399908066, ... ]
Embeddings generated by Weaviate:
> [0.07658111, -0.24428493, 0.36238655, 0.08147758, 0.10421621, 0.28499788, -0.057393536, -0.027201267, 0.18786895, 0.11688445, 0.34956443, -0.113155164, 0.34250045, 0.14217529, -0.31033084, -0.19569416, 0.14555733, 0.3528335, -0.07058265, 0.032889266, 0.115612...]
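The Weaviate vector above was read back through the `_additional { vector }` field. A sketch with the v3 Python client (again, the class name `Test` is an assumption):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Fetch the stored vector of the first object for comparison
result = (
    client.query
    .get("Test", ["text"])
    .with_additional("vector")
    .do()
)
weaviate_vector = result["data"]["Get"]["Test"][0]["_additional"]["vector"]
print(weaviate_vector[:10])
```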
Weaviate version: 1.23.9
The vector generated by Weaviate differs from the vector generated in the standalone setup with the same HF model. Could you please advise on what I might be missing here?
Now you can request a vector directly from the model, like this:
```python
import requests

json_data = {
    'text': 'test text combining text and image embeddings allows powerful systems for tasks like image search, classification, and description.',
}

response = requests.post('http://localhost:9000/vectors', json=json_data)
```
And finally, validate that the vectors are the same:
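A minimal comparison sketch (assuming the inference container returns the embedding under a `vector` key in its JSON response, and that `numpy` and `sentence-transformers` are installed):

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

text = ('test text combining text and image embeddings allows powerful '
        'systems for tasks like image search, classification, and description.')

# Vector from the inference container (assumed to be returned under "vector")
response = requests.post('http://localhost:9000/vectors', json={'text': text})
container_vec = np.array(response.json()['vector'])

# Vector from the standalone model, on the exact same text
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
standalone_vec = model.encode(text)

# Cosine similarity should be ~1.0 when both pipelines see the same payload
cos = np.dot(container_vec, standalone_vec) / (
    np.linalg.norm(container_vec) * np.linalg.norm(standalone_vec)
)
print(cos)
```

Note that the payload here is the text Weaviate actually sent to the model (captured with the `print(item)` below), not the raw input sentence.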
PS: to get the text that was actually vectorized, I entered the container and added a print(item) here.
We had some discussions last week about having modules log the payload to be vectorized at DEBUG level. That feature will help in this kind of situation.