How to manage the merging of a hybrid query on one property and a BM25 query on another

rjalex · May 14, 2024, 3:33pm

I have a collection whose declaration is the following (simplified):

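import weaviate.classes.config as wvcc

# wv_artcollname and vect_config_list are defined elsewhere (omitted here)
client.collections.create(
    name=wv_artcollname,
    description="A collection of articles data",
    vectorizer_config=vect_config_list,
    properties=[
        # Article body; this is the property that gets embedded
        wvcc.Property(name="prose", data_type=wvcc.DataType.TEXT),
        # NER output (people, places, organizations); keyword search only
        wvcc.Property(
            name="entities",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),
    ],
)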

I am running a hybrid search on 'prose', which gives me one ranking with normalized scores.

But I also want to perform a BM25 keyword search on the ‘entities’ property (which is a string with the names of people, places and organizations mentioned in the article and identified by a NER preprocessing). This query will also return another ranking and scores.

What is the suggested approach to make the best of these two result sets?

Let me give you an example:

query_string: “Global warming tropical Brazil Lula” is sent to Weaviate on the ‘articles’ collection as a hybrid query with alpha 0.5, targeting a specific named vector produced by embedding the ‘prose’ property.

Now the very same query string could also be used against the “entities” property, right? And this will produce another ranking.

How would I handle these two rankings? Suggestions?

I also tried the following approach:

from weaviate.classes.query import MetadataQuery

response = wv_artcoll.query.hybrid(
    query=query_string,
    # Keyword (BM25) side searches both properties; "entities" is boosted 2x
    query_properties=["entities^2", "prose"],
    vector=query_vector,
    target_vector=graphql_model_name,
    limit=request.result_limit,
    alpha=request.alpha,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

Does it make sense?

Thank you

DudaNogueira · May 15, 2024, 12:31am

Hi @rjalex!

That’s a really good question. hahaha

My first assumption is that query_properties will first run the BM25 query, taking the property weights into account, and then fuse that with the vector findings. This is what you get with your query example.

What you are describing sounds like a “late BM25 reinforcement”? So, after a hybrid search, you reinforce the score with a new BM25 query on targeted properties?
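To make the “late BM25 reinforcement” idea concrete, here is a minimal client-side sketch in plain Python. It is not a Weaviate feature: the function names, the `boost` parameter, and the choice of min-max normalization are all assumptions made for illustration. It takes the hybrid scores (already normalized) and the raw BM25 scores from the ‘entities’ query, both keyed by object id, and adds a scaled BM25 bonus to each hybrid hit.

```python
def min_max_normalize(scores):
    """Scale raw scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def reinforce_with_bm25(hybrid_scores, bm25_scores, boost=0.3):
    """Re-rank hybrid hits by adding a scaled, normalized BM25 bonus.

    hybrid_scores: {doc_id: score} from the hybrid query on 'prose'
    bm25_scores:   {doc_id: raw score} from the BM25 query on 'entities'
    boost:         hypothetical weight of the BM25 bonus (to be tuned)

    Only documents already present in the hybrid results are kept.
    """
    bm25_norm = min_max_normalize(bm25_scores)
    combined = {
        doc_id: score + boost * bm25_norm.get(doc_id, 0.0)
        for doc_id, score in hybrid_scores.items()
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Made-up ids and scores, purely for illustration
hybrid = {"a": 0.9, "b": 0.8, "c": 0.1}
entities = {"b": 7.0, "c": 1.0, "d": 4.0}
reranked = reinforce_with_bm25(hybrid, entities)
```

With these made-up numbers, "b" moves ahead of "a" because it is the strongest entity match, while "d" is dropped because it never appeared in the hybrid results. Whether dropping such documents is desirable depends on the use case.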

This is interesting.

Let me know if I got this right!

Thanks!

rjalex · May 15, 2024, 7:28am

Yes, that is exactly what I’m trying to better understand.
I have one set of results from a hybrid query on the ‘prose’ property, and another set of results from a BM25 query on the ‘entities’ property, which is only populated with names of places, people, and organizations.

I am trying to understand what strategies are the best to fuse these two approaches.
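One rank-based option for fusing two separate result lists is reciprocal rank fusion (RRF), which ignores the raw scores and so avoids comparing hybrid scores with BM25 scores directly. The sketch below is plain Python run on the client side, not a Weaviate API; the function name, the per-list `weights`, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of ids into one ranking.

    rankings: list of lists of doc ids, each ordered best first
    weights:  optional per-list weights (hypothetical knob; defaults to 1.0)
    k:        smoothing constant; larger values flatten rank differences

    Each document scores sum(weight / (k + rank)) over the lists it
    appears in, with rank starting at 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Made-up ids, purely for illustration
hybrid_ranking = ["a", "b", "c"]   # from the hybrid query on 'prose'
entity_ranking = ["b", "d", "a"]   # from the BM25 query on 'entities'
fused = reciprocal_rank_fusion([hybrid_ranking, entity_ranking])
order = [doc_id for doc_id, _ in fused]
```

Here "b" ends up first because it ranks near the top of both lists, and documents found by only one query ("c", "d") are kept but fall lower. Passing something like weights=[1.0, 2.0] would favor the ‘entities’ ranking, similar in spirit to the "entities^2" boost.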

Thanks !!!
