Hybrid similarity scoring is so weird - it doesn't make any sense

Tested on the latest Weaviate server version.

Could you help with explaining how the scoring works?

We are getting hybrid search score results from Weaviate and extract them as follows:
import re

if explain_score is not None:
    vector_score_pattern = r"Result Set vector.*?original score (\d+\.\d+)"
    keyword_score_pattern = r"Result Set keyword.*?original score (\d+\.\d+)"

    vector_score_match = re.search(vector_score_pattern, explain_score)
    keyword_score_match = re.search(keyword_score_pattern, explain_score)

    try:
        vector_score = float(vector_score_match.group(1))
    except AttributeError:
        # no vector part in explain_score: fall back to keyword only
        vector_score = 0.0
        alpha = 0.0
    try:
        keyword_score = float(keyword_score_match.group(1))
    except AttributeError:
        # no keyword part in explain_score: fall back to vector only
        keyword_score = 0.0
        alpha = 1.0

We had to build in these fallbacks because we get scores from Weaviate very unreliably: for about 80% of results we receive NaN as the keyword score value, for whatever reason…
So we decided to rely mostly on the vector score, since we receive it more or less reliably…
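
As a side note, the parsing can be made more defensive by matching NaN explicitly, so a missing keyword score becomes a deliberate branch rather than a regex miss. This is only a sketch: the extract_score helper is hypothetical, and the explain_score layout is copied from the outputs in this thread, not from a documented API.

import math
import re

# Hypothetical helper; the explain_score layout is copied from the outputs
# in this thread and is not a stable, documented format.
SCORE_PATTERN = r"Result Set {kind}.*?original score (NaN|\d+\.\d+)"

def extract_score(explain_score: str, kind: str) -> float | None:
    """Return the original 'vector' or 'keyword' score, or None if absent or NaN."""
    match = re.search(SCORE_PATTERN.format(kind=kind), explain_score)
    if match is None:
        return None  # this result set did not contribute at all
    value = float(match.group(1))  # float("NaN") parses without error
    return None if math.isnan(value) else value

sample = (
    "Hybrid (Result Set keyword,bm25) Document e160ac1e-...: "
    "original score NaN, normalized score: NaN - \n"
    "Hybrid (Result Set vector,hybridVector) Document e160ac1e-...: "
    "original score 0.31066078, normalized score: 0.23898673"
)
print(extract_score(sample, "vector"))   # 0.31066078
print(extract_score(sample, "keyword"))  # None (score was NaN)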

We ask in German “What does a cat look like?” and get the following results:
[(0.31066078, '\nHybrid (Result Set keyword,bm25) Document e160ac1e-22c4-4eeb-8cff-27f064011aab: original score NaN, normalized score: NaN - \nHybrid (Result Set vector,hybridVector) Document e160ac1e-22c4-4eeb-8cff-27f064011aab: original score 0.31066078, normalized score: 0.23898673', Chunk(id='e160ac1e-22c4-4eeb-8cff-27f064011aab', text='Dogs at the vet: At the first appointment your puppy will be examined thoroughly, and the vet will discuss vaccination with you. Detailed information about any earlier treatments your breeder or the shelter may have arranged is helpful. You will talk about common problems such as worms and fleas, including their treatment and prevention (you should already have received initial information from the breeder or the shelter/rescue center), as well as about microchips, neutering, and any questions you have about puppy health care. Feeding, exercise, and grooming may also come up. In addition, here is some information about the amount of precipitation:', metadata=ChunkMetadata(etc)
This is the best matching chunk for some reason (even though it is about dogs). At the same time we have a chunk about a cat: “The image shows a white cat lying calmly on the floor. Its heterochromatic eyes are striking: one is blue and the other is yellow. The surroundings are kept simple, with a light curtain in the background suggesting a minimalist, clean living space. The cat is looking slightly to the side and appears attentive and interested.” This chunk receives a vector score of only 0.18 for some reason…

Are we doing something wrong? Should we avoid the hybrid score when not working in English? Or should we drop hybrid scoring altogether to get reliable results?

hi @A_S !!

The best way to analyze this is with working code, so we can make sure we are on the same page.

I was not able to reproduce the NaN issue. Can you provide some code for that?

Here is some code:

import weaviate
import weaviate.classes as wvc

# assuming a local instance; adjust the connection to your setup
client = weaviate.connect_to_local()

client.collections.delete("Test")
collection = client.collections.create(
    "Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)
    ]
)
collection.data.insert_many(
    objects=[
        {"text": "What does a cat look like?"},
        {"text": "What is the sound of a dog?"},
        {"text": "What time is it?"},
    ]
)

Now if I perform a hybrid search:

query = collection.query.hybrid(
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    query="dog",
    limit=2
)
for i in query.objects:
    print("#"*10)
    print(i.properties)
    print(i.metadata.score)
    print(i.metadata.explain_score)

I will get:

##########
{'text': 'What is the sound of a dog?'}
1.0

Hybrid (Result Set keyword,bm25) Document e3e1cab9-e8d4-4a31-a157-df0f3fbb5d76: original score 0.4815891, normalized score: 0.3 -
Hybrid (Result Set vector,hybridVector) Document e3e1cab9-e8d4-4a31-a157-df0f3fbb5d76: original score 0.41238725, normalized score: 0.7
##########
{'text': 'What does a cat look like?'}
0.37308037281036377

Hybrid (Result Set vector,hybridVector) Document dd8bec05-b9d3-457a-b67a-56f92b3c16f3: original score 0.27121615, normalized score: 0.37308037

If I do the same query, now using bm25:

query = collection.query.bm25(
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    query="dog"
)
for i in query.objects:
    print("#"*10)
    print(i.properties)
    print(i.metadata.score)
    print(i.metadata.explain_score)

I get:

##########
{'text': 'What is the sound of a dog?'}
0.48158910870552063
, BM25F_dog_frequency:1, BM25F_dog_propLength:7

Now, going back to our hybrid explain score: the keyword part's original score is exactly this bm25 score, 0.4815891, which the fusion then normalized to 0.3.
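
To make that concrete, the two numbers from the outputs above line up exactly (explain_score just prints fewer digits):

# Values copied from the outputs above.
bm25_score = 0.48158910870552063     # standalone bm25 query
hybrid_keyword_original = 0.4815891  # keyword part of the hybrid explain_score

assert round(bm25_score, 7) == hybrid_keyword_original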

If I perform a near_text, I will get the distance:

query = collection.query.near_text(
    return_metadata=wvc.query.MetadataQuery(distance=True),
    query="dog",
    limit=1
)
for i in query.objects:
    print("#"*10)
    print(i.properties)
    print(i.metadata.distance)

output:

##########
{'text': 'What is the sound of a dog?'}
0.5876127481460571

Now, going back to our hybrid explain score: the vector part's original score is derived from this near_text distance as 1 - distance, so 1 - 0.5876127481460571 ≈ 0.41238725, which the fusion then normalized to 0.7.
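
Again, the numbers from the outputs above confirm the relationship:

# Values copied from the outputs above.
near_text_distance = 0.5876127481460571  # standalone near_text query
hybrid_vector_original = 0.41238725      # vector part of the hybrid explain_score

assert round(1 - near_text_distance, 8) == hybrid_vector_original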

So those two metrics, the bm25 score and the vector similarity derived from the distance, are fused into a single score for the hybrid search.
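
For intuition, here is a minimal sketch of a relative-score-style fusion, assuming min-max normalization within each result set and a vector weighting parameter alpha. This is an illustration only, not Weaviate's exact server-side implementation (the fusion algorithm is configurable, with rankedFusion and relativeScoreFusion available):

def fuse_relative_scores(vector_scores, keyword_scores, alpha=0.7):
    """Illustrative sketch: min-max normalize each result set, then weight
    vector scores by alpha and keyword scores by (1 - alpha)."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc: 1.0 for doc in scores}
        return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

    v_norm = normalize(vector_scores) if vector_scores else {}
    k_norm = normalize(keyword_scores) if keyword_scores else {}

    fused = {}
    for doc in set(v_norm) | set(k_norm):
        # A document missing from one result set contributes 0 from that set.
        fused[doc] = alpha * v_norm.get(doc, 0.0) + (1 - alpha) * k_norm.get(doc, 0.0)
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

# Toy numbers, not the ones from this thread:
print(fuse_relative_scores(
    vector_scores={"doc-a": 0.41, "doc-b": 0.27},
    keyword_scores={"doc-a": 0.48},
))
# {'doc-a': 1.0, 'doc-b': 0.0}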

You can learn more about those fusion algorithms in the Weaviate documentation on hybrid search.

Let me know if that helps.