Is pre-filtering not supported for hybrid search?

Description

We’re currently using Pinecone in our company and would like to extend to an engine that lets us perform hybrid search, as doing this in Pinecone is non-trivial. We’ve looked at OpenSearch’s Neural Search plugin, but the problem is that this plugin also doesn’t allow for pre-filtering using Boolean queries as there seems to be an incompatibility with Lucene.

We keep a lot of different vectors in a single index. One good example of why we need pre-filtering is that our clients use different languages. If a user’s query is in, for example, English, then we only want to search within the subset of English vectors (i.e., metadata.lang == "en").

I thought that Weaviate supported this but it seems like it doesn’t?

Here’s my setup:

filters = (
    Filter.by_property("type").equal("dummy-type") &
    Filter.by_property("lang").equal("en")
)

dense_search_results = weaviate_index.query.near_vector(
    near_vector=query_embedding_vector,
    limit=20,
    return_metadata=MetadataQuery(distance=True),
    filters=filters,
)

hybrid_search_results = weaviate_index.query.hybrid(
    query=query_text,
    vector=query_embedding_vector,
    alpha=0.5,
    limit=20,
    return_metadata=MetadataQuery(score=True),
    filters=filters,
    fusion_type=HybridFusion.RELATIVE_SCORE,
)

As you can see, I’m using a type called "dummy-type" for testing purposes. The dense_search_results is correctly [], but hybrid_search_results returns a bunch of objects that seem to completely disregard the filtering logic.

Adding post-filtering logic isn’t really an option right now, since making that work reliably also doesn’t seem that easy to do.
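To illustrate why post-filtering is hard to make reliable, here is a minimal brute-force sketch in plain Python. It does not use Weaviate at all; the toy documents, vectors, and the helper names (pre_filter_search, post_filter_search) are made up for illustration.

```python
import math

# Toy corpus: (id, language, 2-d vector). All values are made up.
DOCS = [
    ("a", "fr", (1.0, 0.0)),
    ("b", "fr", (0.9, 0.1)),
    ("c", "en", (0.0, 1.0)),
    ("d", "en", (0.1, 0.9)),
]


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def top_k(query, docs, k):
    """Brute-force nearest neighbours by cosine distance."""
    return sorted(docs, key=lambda d: cosine_distance(query, d[2]))[:k]


def pre_filter_search(query, lang, k):
    # Filter first, then rank only the documents that pass the filter.
    allowed = [d for d in DOCS if d[1] == lang]
    return [d[0] for d in top_k(query, allowed, k)]


def post_filter_search(query, lang, k):
    # Rank everything, keep the top k, then drop non-matching documents.
    return [d[0] for d in top_k(query, DOCS, k) if d[1] == lang]


query = (1.0, 0.0)  # closest to the two French vectors
print(pre_filter_search(query, "en", k=2))   # the two English documents
print(post_filter_search(query, "en", k=2))  # empty: both top hits were French
```

With post-filtering, the number of results depends on how many of the unfiltered top k happen to match, so you would have to over-fetch by some guessed factor to get a stable result count.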

Any opinions are appreciated. Thanks!

Server Setup Information

  • Weaviate Server Version: 1.30.0
  • Deployment Method: Docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python 3.12
  • Multitenancy?: No.

Hi!

This may be due to how you created your collection and defined the tokenization of your properties.

Here is an example we can work on:

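import weaviate
import weaviate.classes as wvc

# Assumes a local Weaviate instance; adjust the connection to your deployment.
# The text2vec_openai vectorizer also needs an OpenAI API key configured.
client = weaviate.connect_to_local()

# Start from a clean collection.
client.collections.delete("Test")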
collection = client.collections.create(
    "Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        wvc.config.Property(name="type", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.FIELD),
        wvc.config.Property(name="lang", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.FIELD),
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
    ]
)

collection.data.insert_many([
    {"type": "dummy-type", "lang": "en", "content": "This is a test"},
    {"type": "dummy-type", "lang": "en", "content": "Houston, this is a test"},
    {"type": "dummy-type", "lang": "fr", "content": "Ceci est un test"},
    {"type": "other-type", "lang": "en", "content": "I say Ping, you say...?"},
    {"type": "other-type", "lang": "de", "content": "Dies ist ein Test"},
])

from weaviate.classes.query import Filter, MetadataQuery
from weaviate.classes.query import HybridFusion

filters = (
    Filter.by_property("type").equal("dummy-type") &
    Filter.by_property("lang").equal("en")
)

print("VECTOR")
search = collection.query.near_text(
    query="space",
    limit=20,
    return_metadata=MetadataQuery(distance=True),
    filters=filters,
)

for i in search.objects:
    print(i.properties, i.metadata.distance)


print("HYBRID")
search = collection.query.hybrid(
    query="Space",
    alpha=0.5,
    limit=20,
    return_metadata=MetadataQuery(score=True),
    filters=filters,
    fusion_type=HybridFusion.RELATIVE_SCORE,
)

for i in search.objects:
    print(i.properties, i.metadata.score)


This was the output:

VECTOR
{'content': 'Houston, this is a test', 'type': 'dummy-type', 'lang': 'en'} 0.7594600915908813
{'content': 'This is a test', 'type': 'dummy-type', 'lang': 'en'} 0.7627241015434265
HYBRID
{'content': 'Houston, this is a test', 'lang': 'en', 'type': 'dummy-type'} 0.5
{'content': 'This is a test', 'type': 'dummy-type', 'lang': 'en'} 0.0

Let me know if this helps!

Thanks!

Thanks, that worked! I guess Tokenization.WORD and Tokenization.FIELD are akin to OpenSearch’s text and keyword types.
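For anyone else comparing the two, here is a rough sketch of the difference in plain Python. It is a simplified model and not Weaviate's actual implementation: the tokenizers only handle ASCII, and the matching rule in equal_filter_matches (every token of the filter value must appear in the stored value) is an assumption made for illustration.

```python
import re


def tokenize_word(value: str) -> list[str]:
    # Simplified "word" tokenization: lowercase, then split on anything
    # that is not an ASCII letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def tokenize_field(value: str) -> list[str]:
    # Simplified "field" tokenization: the whole trimmed value is one token.
    return [value.strip()]


def equal_filter_matches(stored: str, wanted: str, tokenize) -> bool:
    # Hypothetical matching rule: every token of the filter value must be
    # present among the tokens of the stored value.
    stored_tokens = set(tokenize(stored))
    return all(t in stored_tokens for t in tokenize(wanted))


print(tokenize_word("dummy-type"))   # two tokens: "dummy" and "type"
print(tokenize_field("dummy-type"))  # one token: the full string

# Under word tokenization, a filter on "type" also hits "other-type".
print(equal_filter_matches("other-type", "type", tokenize_word))
# Under field tokenization, only the exact value matches.
print(equal_filter_matches("other-type", "type", tokenize_field))
```

In this model, a word-tokenized property behaves like a full-text field and a field-tokenized one like an exact-match keyword, which is why FIELD is the safer choice for properties such as type and lang that are only used for filtering.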

Is there anywhere that I can read more on how you guys are handling pre-filtering? I know that doing filtering with vectors is non-trivial, and am aware that this is actually one of Pinecone’s selling points. I’m just curious if Weaviate has its own method, or if it’s just brute force-based.

Thanks again.