Distance and certainty discrepancy

Description

I have been trying to create a collection with named vectors, using the configuration below:

import weaviate.classes as wvc
from weaviate.classes.config import Configure

multi_vector_test = client.collections.create(
    name="Multi_vector_test",
    properties=[
        wvc.config.Property(
            name="data",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
        wvc.config.Property(
            name="vec1",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
        wvc.config.Property(
            name="vec2",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
    vector_config=[
        Configure.Vectors.text2vec_azure_openai(
            name="default",
            source_properties=["data"],
            resource_name=AZURE_OPENAI_EMBEDDING_RESOURCE_NAME,
            deployment_id=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID,
            base_url=AZURE_OPENAI_ENDPOINT,
            vectorize_collection_name=False,
        ),
        Configure.Vectors.text2vec_azure_openai(
            name="vec1",
            source_properties=["vec1"],
            resource_name=AZURE_OPENAI_EMBEDDING_RESOURCE_NAME,
            deployment_id=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID,
            base_url=AZURE_OPENAI_ENDPOINT,
            vectorize_collection_name=False,
        )
    ],
)

I then added one document to it:

added_doc = multi_vector_test.data.insert({"data": "test 1"})
multi_vector_test.query.fetch_object_by_id(added_doc).properties

# OUTPUT
{'vec1': None, 'vec2': None, 'data': 'test 1'}

When I query with the vec1 target vector, I expect an empty result, because the document I inserted has None for the vec1 property. Instead, the document is returned with distance=0.6095870733261108 and certainty=0.6952064633369446:

docs = multi_vector_test.query.near_text(query="test", target_vector="vec1", return_metadata=["certainty", "distance"])
print(docs.objects[0])

# OUTPUT
Object(uuid=_WeaviateUUIDInt('831905db-9a32-475d-8e6c-47868c918543'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=0.6095870733261108, certainty=0.6952064633369446, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'vec1': None, 'vec2': None, 'data': 'test 1'}, references=None, vector={}, collection='Multi_vector_test')

Why does this happen? Am I doing something wrong?

Server Setup Information

  • Weaviate Server Version: semitechnologies/weaviate:1.27.27
  • Deployment Method: Docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: python 4.16.10

hi @Bigdwarf43 !!

The vec1 named vector will generate the following payload, even with vectorize_collection_name=False:

{
  "input": [
    "Multi _ vector _ test"
  ],
  "model": "text-embedding-3-small",
  "dimensions": 1536
}

I am not sure if this is intentional. Also, we could parse this collection name better.

I will raise this internally.

Thanks for reporting!
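
As a side note, the distance and certainty in your output are consistent with each other. Assuming the collection uses the cosine metric (the default), Weaviate derives certainty from distance as certainty = 1 - distance / 2, so they are two views of the same score. A quick check in plain Python (the helper name is just for illustration):

```python
import math


def certainty_from_cosine_distance(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a certainty in [0, 1]."""
    return 1.0 - distance / 2.0


# Values copied from the query output above
reported_distance = 0.6095870733261108
reported_certainty = 0.6952064633369446

computed = certainty_from_cosine_distance(reported_distance)
print(computed)
print(math.isclose(computed, reported_certainty, rel_tol=1e-6))  # True
```

So the surprising part is that the object matches at all, not the relationship between the two numbers.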

Does this mean that I’ll always get slightly off certainty and distances?

With the same configuration as above, if I run

obj.vector["default"] == obj.vector["vec1"]

it returns True. Why is it creating the same vector when I have specified source_properties in the vector config?

Is there a way to generate vectors from the Python Weaviate client so I can feed them directly into the document? The built-in Weaviate vectorization is causing a lot of discrepancies and false positives in our system.

Does this mean that I’ll always get slightly off certainty and distances?

I believe the only impact is that useless vectors get generated and pollute the HNSW and inverted indexes.

obj.vector["default"] == obj.vector["vec1"]

That’s strange. On my end this returned False.
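
Since == on two float lists only tells you whether they are exactly identical, it can help to look at the cosine similarity when comparing the two named vectors. A minimal sketch in plain Python: cosine_similarity is a helper defined here, not part of the client, and the commented lines assume the object is fetched with include_vector=True.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# With a live collection (assumed, not run here):
# obj = multi_vector_test.query.fetch_object_by_id(added_doc, include_vector=True)
# print(cosine_similarity(obj.vector["default"], obj.vector["vec1"]))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```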

I found out that our team was already discussing how to improve our vectorization UX regarding this issue!

If you want to provide the vectors yourself, here is how:

client.collections.delete("Multi_vector_test")
multi_vector_test = client.collections.create(
    name="Multi_vector_test",
    vector_config=[
        wvc.config.Configure.Vectors.self_provided(
            name="default",
        ),
        wvc.config.Configure.Vectors.self_provided(
            name="vec1",
        )
    ],
)
obj = multi_vector_test.data.insert(
    {"data": "test"},
    vector={"default": [1, 2, 3], "vec1": []}
)
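
If you go the self-provided route, one way to avoid empty or meaningless vectors is to build a named vector only when its source property actually has text. Here is a rough sketch: build_named_vectors and the embed callable are hypothetical (plug in whatever embedding call you use), and only the commented insert line touches Weaviate. That line assumes the collection accepts an object with only some of its named vectors set.

```python
from typing import Callable, Dict, List, Mapping, Optional

Embedder = Callable[[str], List[float]]


def build_named_vectors(
    properties: Mapping[str, Optional[str]],
    sources: Mapping[str, str],
    embed: Embedder,
) -> Dict[str, List[float]]:
    """Embed only the named vectors whose source property is non-empty.

    sources maps vector name -> property name, e.g. {"default": "data"}.
    """
    vectors: Dict[str, List[float]] = {}
    for vector_name, prop_name in sources.items():
        text = properties.get(prop_name)
        if text is None or not text.strip():
            continue  # no text, so no vector is produced for this name
        vectors[vector_name] = embed(text)
    return vectors


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder for the demo; replace with a real embedding call."""
    return [float(len(text)), 1.0]


props = {"data": "test 1", "vec1": None}
vectors = build_named_vectors(props, {"default": "data", "vec1": "vec1"}, fake_embed)
print(vectors)  # {'default': [6.0, 1.0]}

# Assumed usage against the collection created above (not run here):
# multi_vector_test.data.insert(properties=props, vector=vectors)
```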

Let me know if this helps!

Thanks!

Thanks duda!

Sorry, I added the wrong snippet. The vectors are similar when I use text2vec-google without the title_property. Is this the intended behaviour?

multi_vector_test = client.collections.create(
    name="Multi_vector_test_v2",
    properties=[
        wvc.config.Property(
            name="data",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
            skip_vectorization=False
        ),
        wvc.config.Property(
            name="vec1",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
            skip_vectorization=False
        )
    ],
    vector_config=[
        Configure.Vectors.text2vec_google(
            name="vec1",
            source_properties=["vec1"],
            model="text-embedding-005",
            project_id=VERTEX_PROJECT_ID,
            vectorize_collection_name=False,
        ),
        Configure.Vectors.text2vec_google(
            name="default",
            source_properties=["data"],
            model="text-embedding-005",
            project_id=VERTEX_PROJECT_ID,
            vectorize_collection_name=False,
        ),
    ]
)