How does Weaviate calculate the score in similarity_search_with_score?

Hi everyone,

Lately, I have been implementing a RAG system for my chatbot. When I retrieve documents from Weaviate using similarity_search_with_score, the result docs are [(doc1, score1), …].
As I understand it, this function uses search_method="hybrid" (source) with the parameter alpha=1, so only vector search is used. The default distance metric is cosine distance.
Here is the result from the similarity_search_with_score function (I'm using the text-embedding-3-small model):
[(Document(page_content='collaboration are key considerations throughout the project. Further clarification is needed for Phase 2 deliverables, but the document provides a structured framework for successful project execution. 13 /n–/n', metadata={'info': 'from meeting input'}), 0.014285714365541935)]
And this is the result when I calculate cosine similarity with sklearn's cosine_similarity, using embedding vectors from text-embedding-3-small.


Isn't similarity_search_with_score calculated based on this definition: 1 - cosine_sim(a, b), as described in this source? Shouldn't the cosine distance be 1 - 0.29631816 = 0.70368184?
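For context, the sklearn comparison I ran looks roughly like this (a minimal sketch with stand-in vectors, not the real text-embedding-3-small embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# stand-in vectors; in my actual test these come from text-embedding-3-small
a = np.array([[0.1, 0.3, 0.5]])
b = np.array([[0.2, 0.1, 0.4]])

sim = cosine_similarity(a, b)[0][0]
dist = 1 - sim  # cosine distance = 1 - cosine similarity
print(sim, dist)
```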

Thank you for your time!

Hi @cong.dao! Welcome to our community :hugs:

When performing a hybrid search, the relative score fusion functions kick in:

So the score is not the distance you are looking for.
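As a simplified sketch of what relative score fusion does: each result set's raw scores are min-max normalized into [0, 1] before being fused, so the top hit always ends up at 1.0. (This ignores the actual weighted fusion of the keyword and vector result sets, and Weaviate normalizes over its full candidate set, so the exact numbers in this thread won't reproduce from two values alone.)

```python
def relative_normalize(scores):
    # min-max normalization: best raw score maps to 1.0, worst to 0.0
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

print(relative_normalize([0.82, 0.81, 0.75]))
```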

You will need to do a nearText instead of a hybrid.

Here is an example to illustrate this:

from weaviate.classes import config

# let's first create our collection and import data

client.collections.delete("MyCollection")
collection = client.collections.create(
    "MyCollection",
    vectorizer_config=config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        config.Property(name="text", data_type=config.DataType.TEXT),
        config.Property(name="source", data_type=config.DataType.TEXT)
    ]
)

collection.data.insert({"text": "something about cats", "source": "document1"})
collection.data.insert({"text": "something about tiger", "source": "document1"})
collection.data.insert({"text": "something about lion", "source": "document1"})

collection.data.insert({"text": "something about dogs", "source": "document2"})
collection.data.insert({"text": "something about wolf", "source": "document2"})
collection.data.insert({"text": "something about coyotes", "source": "document2"})

now we perform a nearText:

from weaviate import classes as wvc
result = collection.query.near_text(
    limit=2,
    query="pet animals",
    return_metadata=wvc.query.MetadataQuery(distance=True, score=True)
)
for obj in result.objects:
    print(obj.properties)
    print(obj.metadata.distance)

With this output:

{'text': 'something about dogs', 'source': 'document2'}
0.17943477630615234
{'text': 'something about cats', 'source': 'document1'}
0.1885947585105896

Now, a hybrid search:

from weaviate import classes as wvc
result = collection.query.hybrid(
    alpha=1,
    query="pet animals",
    return_metadata=wvc.query.MetadataQuery(distance=True, score=True, explain_score=True)
)
for obj in result.objects:
    print(obj.properties)
    print(obj.metadata.score)
    print(obj.metadata.explain_score)

And this is the output (note the data under explain_score):

{'text': 'something about dogs', 'source': 'document2'}
1.0
Hybrid (Result Set vector,hybridVector) Document e89cc799-b180-4ae1-a496-aa806a458915: original score 0.8205652, normalized score: 1

{'text': 'something about cats', 'source': 'document1'}
0.8020085096359253
Hybrid (Result Set vector,hybridVector) Document 34fd8612-06ab-4b01-b0cb-aa9b81e1d6dc: original score 0.81140524, normalized score: 0.8020085

Note that the first result is normalized to 1.

Let me know if this helps.

Thanks!

1 Like

Hi @DudaNogueira, thank you for your response.

As you described, I tried the example above with near_text.
To compare with the cosine distance I calculate with sklearn, I commented out this line:
config.Property(name="source", data_type=config.DataType.TEXT)
The distance returned by near_text is 0.18037784099578857. By default, near_text uses "text-embedding-ada-002", so I am also using it in my test, as below.


May I ask why the result from near_text (0.18037784099578857) is not equal to 1 - cosine_sim = 1 - 0.84232189 = 0.15767810999999998?

By the way, my hybrid search returns quite different results from yours:

{'text': 'something about dogs', 'source': 'document2'}
0.016393441706895828
{'text': 'something about cats', 'source': 'document1'}
0.016129031777381897
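As a quick sanity check (I am not sure about this, but these values look like reciprocal rank fusion scores, 1/(60 + rank), which I understand some Weaviate versions use as the default hybrid fusion method, rankedFusion):

```python
# compare the observed hybrid scores against reciprocal rank fusion 1 / (60 + rank)
observed = [0.016393441706895828, 0.016129031777381897]
for rank, score in enumerate(observed, start=1):
    rrf = 1 / (60 + rank)
    # the differences are at float32 rounding level
    print(rank, rrf, abs(rrf - score))
```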

Is the difference due to using different versions of Weaviate?
I just want to understand the math Weaviate uses to calculate the distance in near_text and the score in hybrid search, because the rankings from these functions are very different from the ranking by cosine similarity.

Thank you for your precious time.

Hi!

In the example I posted, consider that the collection name and the source property will influence the vector, if you compare against vectorizing the text directly.

Check here for more on how Weaviate concatenates your properties to vectorize your object.
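To illustrate, here is a rough sketch of that concatenation (the function name is mine, and this assumes the documented behavior: lowercased collection name first when enabled, then each vectorized property's value, optionally prefixed by its name, in alphabetical order):

```python
def vectorization_input(collection_name, properties,
                        vectorize_collection_name=True,
                        vectorize_property_names=False):
    # build the single string that gets sent to the vectorizer
    parts = []
    if vectorize_collection_name:
        parts.append(collection_name.lower())
    for name in sorted(properties):
        prefix = f"{name.lower()} " if vectorize_property_names else ""
        parts.append(prefix + str(properties[name]))
    return " ".join(parts)

# with both "text" and "source" present, the vectorized string is not
# just the text property, which is why the distances don't line up
print(vectorization_input("MyCollection",
                          {"text": "something about dogs",
                           "source": "document2"}))
```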

This is more in line with what you want to compare (note that I configured vectorize_collection_name=False):

from weaviate.classes import config

# let's first create our collection and import data

client.collections.delete("MyCollection")
collection = client.collections.create(
    "MyCollection",
    vectorizer_config=config.Configure.Vectorizer.text2vec_openai(vectorize_collection_name=False),
    properties=[
        config.Property(name="text", data_type=config.DataType.TEXT, vectorize_property_name=False)
    ]
)

collection.data.insert({"text": "something about cats"})
collection.data.insert({"text": "something about tiger"})
collection.data.insert({"text": "something about lion"})

collection.data.insert({"text": "something about dogs"})
collection.data.insert({"text": "something about wolf"})
collection.data.insert({"text": "something about coyotes"})

from weaviate import classes as wvc
result = collection.query.near_text(
    limit=2,
    query="pet animals",
    return_metadata=wvc.query.MetadataQuery(distance=True, score=True)
)
for obj in result.objects:
    print(obj.properties)
    print(obj.metadata.distance)

Now this is the output:

{'text': 'something about dogs'}
0.1413179636001587
{'text': 'something about cats'}
0.15767818689346313

Let's now compare the two distances, between "something about dogs" and "pet animals":

import os
import requests

headers = {
    'Authorization': 'Bearer ' + os.getenv('OPENAI_API_KEY', ''),
    'Content-Type': 'application/json',
}

json_data = {
    'input': 'text here',
    'model': 'text-embedding-ada-002',
    'encoding_format': 'float',
}

# embedding1
json_data["input"] = "pet animals"
response1 = requests.post('https://api.openai.com/v1/embeddings', headers=headers, json=json_data)
embedding1 = response1.json().get("data")[0].get("embedding")
#print(embedding1)

# embedding2
json_data["input"] = "something about dogs"
response2 = requests.post('https://api.openai.com/v1/embeddings', headers=headers, json=json_data)
embedding2 = response2.json().get("data")[0].get("embedding")
#print(embedding2)

from sklearn.metrics.pairwise import cosine_distances

cosine_distances([embedding1],[embedding2])

This outputs:

array([[0.14131812]])
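For what it's worth, this manually computed distance and the near_text distance above agree to within float32-level rounding:

```python
manual = 0.14131812                       # sklearn cosine_distances on raw embeddings
weaviate_near_text = 0.1413179636001587   # distance returned by near_text
print(abs(manual - weaviate_near_text))   # tiny, consistent with float32 precision
```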

Now, on top of all that, consider that OpenAI embeddings may vary even for the same input :hushed:

Let me know if this helps!

Thanks!

1 Like

Thank you so much! I really appreciate your help!

1 Like