Hybrid Search near_text distance filtering

Description

The distance argument of HybridVector.near_text does not filter the results, unlike the distance argument of the near_text search. Is this the intended behavior or a bug? This issue seems to cover implementing the same behavior as the near_text search: Improve Hybrid Search · Issue #4325 · weaviate/weaviate · GitHub

For me, objects with a distance greater than this parameter should be filtered out of the results (independently of the fusion score). In my case, being able to filter the results by the near_text distance is useful, because the hybrid search returns neither the vector distance nor the BM25 score.

These are the results I got with the code in the "Any additional Information" section:

== Near text ==
no distance limit
QueryReturn(objects=[Object(uuid=_WeaviateUUIDInt('35ddc998-e530-44a2-8b6a-e65dc8cb9afb'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=0.11371487379074097, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'title': 'article'}, references=None, vector={}, collection='Article')])
== Near text ==
distance limit 0.1
QueryReturn(objects=[])
== Hybrid search ==
distance limit 0.1
QueryReturn(objects=[Object(uuid=_WeaviateUUIDInt('35ddc998-e530-44a2-8b6a-e65dc8cb9afb'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'title': 'article'}, references=None, vector={}, collection='Article')])

I would like a way to get the same result from the hybrid search as from the near_text search with the distance limit applied. For example, if the scores were returned after a hybrid search, we could apply post-filtering ourselves.
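A minimal sketch of one possible client-side workaround, assuming it is acceptable to run a second, distance-limited near_text query and keep only the hybrid hits that also appear in it. The helper name keep_within_distance and the stand-in data are hypothetical; the only thing assumed about the result objects is that they expose a uuid attribute, as in the output above.

```python
from types import SimpleNamespace


def keep_within_distance(hybrid_objects, near_text_objects):
    """Keep hybrid hits whose UUID also appears in the distance-limited near_text hits."""
    allowed = {obj.uuid for obj in near_text_objects}
    # Preserve the hybrid (fusion) ordering; only drop objects outside the limit.
    return [obj for obj in hybrid_objects if obj.uuid in allowed]


# Stand-in objects (hypothetical data) that mimic the `.uuid` attribute of client results.
hybrid_hits = [SimpleNamespace(uuid="a"), SimpleNamespace(uuid="b"), SimpleNamespace(uuid="c")]
near_text_hits = [SimpleNamespace(uuid="c"), SimpleNamespace(uuid="a")]

filtered = keep_within_distance(hybrid_hits, near_text_hits)
print([obj.uuid for obj in filtered])
```

With the real client, the two arguments would be the .objects lists of the hybrid response and of a near_text response queried with distance=0.1. The near_text query would need a limit large enough to cover every object inside the distance limit, otherwise valid hybrid hits could be dropped.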

Server Setup Information

  • Weaviate Server Version: 1.26.4
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: No
  • Client Language and Version: python weaviate-client-4.8.1
  • Multitenancy?: No

Any additional Information

# setup.py
import os

import weaviate
from dotenv import load_dotenv
from weaviate.classes.config import Configure, DataType, Property

load_dotenv()

client = weaviate.connect_to_local(
    headers={"X-Azure-Api-Key": os.getenv("AZURE_API_KEY")}
)

client.collections.delete("Article")
client.collections.create(
    "Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_azure_openai(
        base_url=os.environ.get("AZURE_BASE"),
        resource_name=os.environ.get("AZURE_RESOURCE_NAME"),
        deployment_id=os.environ.get("AZURE_DEPLOYMENT_ID"),
        vectorize_collection_name=False,
    ),
)

article = client.collections.get("Article")
article.data.insert(
    properties={
        "title": "article",
    },
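)

client.close()  # release the connection opened by setup.py

# query script (run after setup.py; file name not given in the original post)
import os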

import weaviate
from dotenv import load_dotenv
from weaviate.classes.query import HybridVector, MetadataQuery

load_dotenv()
client = weaviate.connect_to_local(
    headers={"X-Azure-Api-Key": os.getenv("AZURE_API_KEY")}
)

article = client.collections.get("Article")

response = article.query.near_text(
    query="Article", return_metadata=MetadataQuery(distance=True)
)
print(f"== Near text ==\nno distance limit\n{response}")
response = article.query.near_text(
    query="Article", distance=0.1, return_metadata=MetadataQuery(distance=True)
)
print(f"== Near text ==\ndistance limit 0.1\n{response}")

response = article.query.hybrid(
    query="Article", vector=HybridVector.near_text(query="article", distance=0.1)
)
print(f"== Hybrid search ==\ndistance limit 0.1\n{response}")

client.close()

Hi,

This is not yet part of the docs, but there is a new parameter that filters all results by vector distance, so any result whose distance is greater than the max vector distance is filtered out:

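    import weaviate.classes as wvc

    # `collection` is a collection handle, e.g. client.collections.get("Article");
    # `query` is the search string and `max_vector_distance` the cutoff (e.g. 0.1).
    objs_hy_cutoff = collection.query.hybrid(
        query,
        max_vector_distance=max_vector_distance,
        return_metadata=wvc.query.MetadataQuery.full(),
    ).objects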

I am not totally sure why the distance is not working for the near_text (NT) search inside hybrid, but I think it should not be used at all.

Hello Dirk,

Thank you for your help. Your solution has resolved the issue.