Is pre-filtering not supported for hybrid search?

Sean · April 16, 2025, 11:47am

Description

We’re currently using Pinecone in our company and would like to extend to an engine that lets us perform hybrid search, as doing this in Pinecone is non-trivial. We’ve looked at OpenSearch’s Neural Search plugin, but the problem is that this plugin also doesn’t allow for pre-filtering using Boolean queries as there seems to be an incompatibility with Lucene.

Our vector database is set up as a single index that contains many different vectors. One good example of why we need pre-filtering is that our clients use different languages. If a user’s query is in English, for example, we only want to search within the subset of English vectors (i.e., metadata.lang == "en").

I thought that Weaviate supported this but it seems like it doesn’t?

Here’s my setup:

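from weaviate.classes.query import Filter, HybridFusion, MetadataQuery

# Both conditions must hold for an object to be eligible.
filters = (
    Filter.by_property("type").equal("dummy-type") &
    Filter.by_property("lang").equal("en")
)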

dense_search_results = weaviate_index.query.near_vector(
    near_vector=query_embedding_vector,
    limit=20,
    return_metadata=MetadataQuery(distance=True),
    filters=filters,
)

hybrid_search_results = weaviate_index.query.hybrid(
    query=query_text,
    vector=query_embedding_vector,
    alpha=0.5,
    limit=20,
    return_metadata=MetadataQuery(score=True),
    filters=filters,
    fusion_type=HybridFusion.RELATIVE_SCORE,
)

As you can see, I’m using a type called "dummy-type" for testing purposes. The dense_search_results is correctly [], but hybrid_search_results returns a bunch of different vectors that seem to completely disregard the filtering logic.

Adding post-filtering logic isn’t really an option right now, since making that work reliably also doesn’t seem that easy to do.

Any opinions are appreciated. Thanks!

Server Setup Information

Weaviate Server Version: 1.30.0
Deployment Method: Docker
Multi Node? Number of Running Nodes: 1
Client Language and Version: Python 3.12
Multitenancy?: No.

DudaNogueira · April 16, 2025, 3:17pm

Hi!

This may be due to how you created your collection and defined the tokenization of your properties.

Here is an example we can work on:

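import os
import weaviate
import weaviate.classes as wvc

# Assumption: a local Weaviate instance and an OpenAI key in the environment.
client = weaviate.connect_to_local(headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]})

# Start from a clean collection.
client.collections.delete("Test")
collection = client.collections.create(
    "Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        # FIELD tokenization indexes the whole value as a single token.
        wvc.config.Property(name="type", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.FIELD),
        wvc.config.Property(name="lang", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.FIELD),
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
    ]
)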

collection.data.insert_many([
    {"type": "dummy-type", "lang": "en", "content": "This is a test"},
    {"type": "dummy-type", "lang": "en", "content": "Houston, this is a test"},
    {"type": "dummy-type", "lang": "fr", "content": "Ceci est un test"},
    {"type": "other-type", "lang": "en", "content": "I say Ping, you say...?"},
    {"type": "other-type", "lang": "de", "content": "Dies ist ein Test"},
])

from weaviate.classes.query import Filter, MetadataQuery
from weaviate.classes.query import HybridFusion

filters = (
    Filter.by_property("type").equal("dummy-type") &
    Filter.by_property("lang").equal("en")
)

print("VECTOR")
search = collection.query.near_text(
    query="space",
    limit=20,
    return_metadata=MetadataQuery(distance=True),
    filters=filters,
)

for i in search.objects:
    print(i.properties, i.metadata.distance)


print("HYBRID")
search = collection.query.hybrid(
    query="Space",
    alpha=0.5,
    limit=20,
    return_metadata=MetadataQuery(score=True),
    filters=filters,
    fusion_type=HybridFusion.RELATIVE_SCORE,
)

for i in search.objects:
    print(i.properties, i.metadata.score)

This was the output:

VECTOR
{'content': 'Houston, this is a test', 'type': 'dummy-type', 'lang': 'en'} 0.7594600915908813
{'content': 'This is a test', 'type': 'dummy-type', 'lang': 'en'} 0.7627241015434265
HYBRID
{'content': 'Houston, this is a test', 'lang': 'en', 'type': 'dummy-type'} 0.5
{'content': 'This is a test', 'type': 'dummy-type', 'lang': 'en'} 0.0
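
The hybrid scores of 0.5 and 0.0 above are consistent with how relative score fusion is generally described: each result set is min-max normalized to the range 0 to 1, then combined with alpha as the vector weight and 1 - alpha as the keyword weight. Below is a minimal sketch of that idea in plain Python, not Weaviate's actual implementation. The helper names, the use of 1 - distance as the vector score, and the empty keyword result set (assuming "Space" matches no term in the two filtered objects) are all assumptions made for illustration.

```python
def min_max(scores):
    """Scale scores to [0, 1]; an empty input gives an empty result."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a single result (or all-equal scores) is treated as 1.0.
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def relative_score_fusion(vector_scores, keyword_scores, alpha):
    """Weighted sum of normalized scores; alpha weights the vector side."""
    vec = min_max(vector_scores)
    kw = min_max(keyword_scores)
    fused = {
        key: alpha * vec.get(key, 0.0) + (1 - alpha) * kw.get(key, 0.0)
        for key in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Distances taken from the VECTOR output above, turned into similarities.
vector_scores = {
    "Houston, this is a test": 1 - 0.7594600915908813,
    "This is a test": 1 - 0.7627241015434265,
}
keyword_scores = {}  # assumption: no keyword hits for "Space"

for content, score in relative_score_fusion(vector_scores, keyword_scores, alpha=0.5):
    print(content, score)
# Houston, this is a test 0.5
# This is a test 0.0
```

With only two vector results and no keyword hits, the best result normalizes to 1.0 and the other to 0.0, so the fused scores say little about how close the two objects actually are.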

Let me know if this helps!

Thanks!

Sean · April 18, 2025, 9:43am

Thanks, that worked! I guess Tokenization.WORD and Tokenization.FIELD are akin to OpenSearch’s text and keyword types.
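
To make the comparison concrete, here is a rough sketch of how the two tokenization modes treat a value like "dummy-type". This is a simplified, ASCII-only emulation written for illustration, not Weaviate's tokenizer: the function names are hypothetical, and the rules assumed are that WORD lowercases the value and splits on non-alphanumeric characters, while FIELD keeps the whole value after trimming whitespace.

```python
import re


def tokenize_word(value):
    """Emulates WORD: lowercase, split on anything that is not a letter or digit."""
    return [token.lower() for token in re.split(r"[^0-9A-Za-z]+", value) if token]


def tokenize_field(value):
    """Emulates FIELD: the trimmed value is indexed as one token."""
    return [value.strip()]


print(tokenize_word("dummy-type"))   # ['dummy', 'type']
print(tokenize_field("dummy-type"))  # ['dummy-type']
```

Under these assumptions, a WORD-tokenized property no longer stores "dummy-type" as a single term, which is why identifier-like values such as type or lang are usually better served by FIELD.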

Is there anywhere that I can read more on how you guys are handling pre-filtering? I know that doing filtering with vectors is non-trivial, and am aware that this is actually one of Pinecone’s selling points. I’m just curious if Weaviate has its own method, or if it’s just brute force-based.

Thanks again.

Dirk · April 22, 2025, 6:47am

Hey Sean,

Here is a blog post about this topic: "How we speed up filtered vector search with ACORN" (Weaviate blog).
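
For readers new to the distinction raised in this thread, the difference between pre-filtering and post-filtering can be sketched with a brute-force search in plain Python. This is only a conceptual illustration under made-up data. It is not how Weaviate or ACORN work internally, and every name in it is hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


def pre_filter_search(objects, query, predicate, k):
    """Apply the filter first, then rank only the allowed objects."""
    allowed = [obj for obj in objects if predicate(obj)]
    return sorted(allowed, key=lambda obj: cosine_distance(obj["vector"], query))[:k]


def post_filter_search(objects, query, predicate, k):
    """Rank everything, keep the top k, then drop objects that fail the filter."""
    top_k = sorted(objects, key=lambda obj: cosine_distance(obj["vector"], query))[:k]
    return [obj for obj in top_k if predicate(obj)]


# Made-up toy data: two English objects and two non-English ones.
objects = [
    {"id": 1, "lang": "en", "vector": [1.0, 0.0]},
    {"id": 2, "lang": "en", "vector": [0.9, 0.1]},
    {"id": 3, "lang": "fr", "vector": [0.0, 1.0]},
    {"id": 4, "lang": "de", "vector": [0.1, 0.9]},
]
query = [0.0, 1.0]


def is_english(obj):
    return obj["lang"] == "en"


print([obj["id"] for obj in pre_filter_search(objects, query, is_english, k=2)])   # [2, 1]
print([obj["id"] for obj in post_filter_search(objects, query, is_english, k=2)])  # []
```

In this toy case the post-filtered search returns nothing because the two nearest neighbours are both non-English, which is the reliability problem with post-filtering mentioned earlier in the thread.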
