Hybrid Search: max-vector-distance not filtering results as expected

Description

I am using hybrid search with max-vector-distance to limit the vector similarity contributions in my search results. However, when I inspect the results using explain_score, I notice that the vector similarity scores still exceed the max-vector-distance threshold.

This behavior is unexpected, as I assumed setting max-vector-distance would filter out any results beyond the specified threshold. The scores seem inconsistent with my expectations for hybrid search.

Am I misunderstanding how max-vector-distance is applied in hybrid search?

Server Setup Information

  • Weaviate Server Version: semitechnologies/weaviate:1.26.1
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: 4.8.0
  • Multitenancy?: No

Any additional Information

Here is the example search result.

Max Vector Distance: 0.3

Hybrid Search results: Object(uuid=_WeaviateUUIDInt('02acc4a4-d186-51bd-8b17-a25470a6a053'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=0.6219874620437622, explain_score='\nHybrid (Result Set keyword,bm25) Document 02acc4a4-d186-51bd-8b17-a25470a6a053: original score 7.7068768, normalized score: 0.343784 - \nHybrid (Result Set vector,hybridVector) Document 02acc4a4-d186-51bd-8b17-a25470a6a053: **original score 0.5154053**, normalized score: 0.27820346', is_consistent=None, rerank_score=-10.181857109069824)

Any insights would be greatly appreciated!

Hi @pc10 !!

Can you check whether this also happens on the latest version?

This is what I got running on 1.28.4:

import os

import weaviate
from weaviate import classes as wvc


headers = {
    "X-Openai-Api-Key": os.environ.get("OPENAI_APIKEY"),
}

client = weaviate.connect_to_local(
    headers=headers
)
print(f"Client: {weaviate.__version__}, Server: {client.get_meta().get('version')}")
# Client: 4.10.4, Server: 1.28.4

client.collections.delete("Test")
client.collections.create(
    name="Test",
    vectorizer_config=[
        wvc.config.Configure.NamedVectors.text2vec_openai(
            name="default"
        ),
    ],
)
collection = client.collections.get("Test")
collection.data.insert({"text": "Something about Brazil", })
collection.data.insert({"text": "Something about Pelé, best soccer player", })
collection.data.insert({"text": "Something about indian food", })

Next, I ran a search and printed the metadata:

for o in collection.query.hybrid(
    query="futebol", 
    #max_vector_distance=0.4,
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True, distance=True)
    ).objects:
    print("#"*10)
    print(o.properties)
    print(o.metadata.distance, o.metadata.score, o.metadata.explain_score)

and got this as output:

##########
{'text': 'Something about Pelé, best soccer player'}
None 0.699999988079071 
Hybrid (Result Set vector,hybridVector) Document 5f8f8671-27cb-407b-81cf-ae4560fc0186: original score 0.35348773, normalized score: 0.7
##########
{'text': 'Something about Brazil'}
None 0.46075865626335144 
Hybrid (Result Set vector,hybridVector) Document d194fb42-1bb5-414c-8e72-02fccb48039d: original score 0.23937017, normalized score: 0.46075866
##########
{'text': 'Something about indian food'}
None 0.0 
Hybrid (Result Set vector,hybridVector) Document 0e1fba05-f908-4a3c-9ce2-711ecd4ed062: original score 0.019589365, normalized score: 0
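The normalized scores in this output can be reproduced by hand. A minimal sketch, assuming relative score fusion (min-max normalization of the result set) and a vector weight of 0.7; the weight is inferred from the top normalized score above rather than confirmed from the server configuration, and the function name is hypothetical:

```python
def normalize_and_weight(raw_scores, weight):
    """Min-max normalize raw scores to [0, 1], then scale by the branch weight."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # Degenerate case: all scores are equal. Returning the full weight is an
        # arbitrary choice for this sketch, not confirmed Weaviate behavior.
        return [weight for _ in raw_scores]
    return [weight * (s - lo) / (hi - lo) for s in raw_scores]


# "original score" values from the vector result set in the output above
vector_scores = [0.35348773, 0.23937017, 0.019589365]
normalized = normalize_and_weight(vector_scores, weight=0.7)

for raw, norm in zip(vector_scores, normalized):
    print(f"original {raw:.8f} -> normalized {norm:.8f}")
```

Under these assumptions the middle object comes out at roughly 0.4608, in line with the reported 0.46075866, and the lowest-scoring object is pinned to 0 because it is the minimum of the set.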

Now, using max_vector_distance in a way to better understand it :wink:

These are the original vector scores from the run above that the threshold will be compared against:

  1. score 0.35348773
  2. score 0.23937017
  3. score 0.019589365

for o in collection.query.hybrid(
    query="futebol", 
    max_vector_distance=1-0.020,
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True, distance=True)
    ).objects:
    print("#"*10)
    print(o.properties)
    print(o.metadata.distance, o.metadata.score, o.metadata.explain_score)

and this was the output:

##########
{'text': 'Something about Pelé, best soccer player'}
None 0.699999988079071 
Hybrid (Result Set vector,hybridVector) Document aa55119a-6d83-4161-85b6-217861031f0f: original score 0.35347474, normalized score: 0.7
##########
{'text': 'Something about Brazil'}
None 0.0 
Hybrid (Result Set vector,hybridVector) Document 2091b86c-e835-4da7-922f-24bcdd63c70c: original score 0.23937523, normalized score: 0

I believe this filters against the vector score reported by the hybrid search rather than against the raw distance.

The closer that value is to 1, the closer the object is to the query. That is the opposite of a distance, where a bigger value means the object is further away.

Let me know if this helps!

Thanks!

Thank you for the quick response! I haven’t seen any filtering of results in Weaviate 1.26.1, regardless of the max-vector-distance value I set. I will try this again on the upgraded version. Since we are using weaviate-vectorstore in production, I’ll need to check for potential regressions before upgrading.

I’d also like to clarify the behavior of max-vector-distance. My understanding is that documents with a vector distance greater than max-vector-distance should be excluded. However, is the intuition here that the score returned from hybrid search represents similarity rather than actual distance?

If that’s the case, then the following values are similarity scores, and setting max_vector_distance = 0.98 filters out the 0.0195 document because its distance is higher:

  • 0.3534
  • 0.2393
  • 0.0195
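That reading can be checked with a few lines of arithmetic. A sketch, assuming the vector branch's "original score" equals 1 - distance (consistent with the outputs in this thread, but an assumption rather than documented behavior); the helper name is hypothetical:

```python
def score_to_distance(score):
    """Convert a vector-branch score to a distance, assuming score = 1 - distance."""
    return 1.0 - score


max_vector_distance = 0.98
scores = [0.3534, 0.2393, 0.0195]

for score in scores:
    distance = score_to_distance(score)
    # Whether the boundary is inclusive is not established in this thread;
    # "<=" is an assumption of this sketch.
    verdict = "kept" if distance <= max_vector_distance else "filtered out"
    print(f"score {score:.4f} -> distance {distance:.4f}: {verdict}")

kept = [s for s in scores if score_to_distance(s) <= max_vector_distance]
```

Only the 0.0195 score maps to a distance above 0.98, so it is the only one dropped. By the same assumed relation, the 0.5154053 score from my original result would correspond to a distance of about 0.48, which is above the 0.3 threshold I had set.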

Is there a way to explicitly display the distance values in the search results to compare them directly with max-vector-distance for a more apples-to-apples comparison?

Thank you !

Hi!

There isn’t, AFAIK.

Also, the vector distance may vary across different queries and objects, so you couldn’t define a threshold solely on vector distance.

So when you do a hybrid search, the distance calculated for the vector part of the search will be normalized in order to be fused. And that normalized vector distance is the one you can filter out with max-vector-distance.

Let me know if that helps!

Thanks!

Hi,

IIRC there was a bug in the first release of the max vector distance feature that could cause some objects to be included even if their vector distance was larger than the threshold.

yes, this is correct.

"So when you do a hybrid search, the distance calculated for the vector part of the search will be normalized in order to be fused. And that normalized vector distance is the one you can filter out with max-vector-distance."

The filtering actually happens before the normalization and fusion.
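This ordering (filter on distance first, then normalize and fuse) matches the second output earlier in the thread, where "Something about Brazil" drops to a normalized score of 0 once the lowest-scoring object has been removed. A simplified sketch of that ordering, assuming score = 1 - distance, min-max normalization, and a vector weight of 0.7; it is an illustration only, not Weaviate's implementation, and all names are hypothetical:

```python
def vector_branch(results, max_vector_distance, weight=0.7):
    """Filter by distance first, then min-max normalize what is left."""
    # Step 1: drop objects whose (assumed) distance exceeds the threshold.
    kept = [(text, score) for text, score in results
            if 1.0 - score <= max_vector_distance]
    if not kept:
        return []

    # Step 2: normalize only the surviving scores and apply the weight.
    lo = min(score for _, score in kept)
    hi = max(score for _, score in kept)
    span = hi - lo
    normalized = []
    for text, score in kept:
        # With a single survivor (span == 0) the sketch returns the full weight.
        value = (weight * (score - lo) / span) if span else weight
        normalized.append((text, value))
    return normalized


# Scores taken from the first (unfiltered) output in this thread
results = [
    ("Something about Pelé, best soccer player", 0.35348773),
    ("Something about Brazil", 0.23937017),
    ("Something about indian food", 0.019589365),
]

fused = vector_branch(results, max_vector_distance=0.98)
for text, value in fused:
    print(f"{value:.4f}  {text}")
```

Because the third object is removed before normalization, the minimum of the remaining set becomes the "Brazil" object, which is why its normalized score falls from about 0.46 to 0.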

Thank you. Do we know which stable version has this bug fixed?

IIRC, it was fixed in one of the early 1.26.X releases. If I were you, I would update to the latest 1.26 point release.