How can we make hybrid search results more predictable?

junbetterway · October 31, 2023, 11:54am

I have a user requirement where we have a Profile model which has the following information (all are being saved in Weaviate):

Full Name
Marketing pitch (free-text) → vectorized
Experience story (free-text) → vectorized
List of tech skills (array) → vectorized
List of languages (array)
Other properties, so on

The pitch, story and tech skills are the only ones we are vectorizing as our vectorizer is limited to 250 tokens.

So, a question: are we right that when we perform a hybrid search, languages and other properties still contribute to the BM25 scoring even though they are not vectorized, since they are saved in Weaviate? The results are somehow not matching our expectations. We are using the default alpha=0.5 (a 50-50 split between keyword and vector).

Also, is there a way to give a higher score to a matching full name? Our client wants searching by full name to take priority. If not, is there a recommended way to achieve this?

sebawita · November 1, 2023, 11:55pm

Hi @junbetterway,

There are many topics to unpack from your post.

Full Name - importance

Let’s start with giving the full name property more importance.

Both bm25 and hybrid search offer a mechanism to boost matches on specific properties. We do that by adding ^2, ^3, etc. to entries in the list of hybrid properties, where ^2 doubles the importance and ^3 triples it.

Here is an example query in Python:

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="Jon Doe doing something",
        properties=["full_name^2", "marketing_pitch"],
        alpha=0.5
    )
    .do()
)

You can see examples in other languages in the docs.
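
To build some intuition for what a boost does to the keyword side, here is a deliberately simplified sketch. It is not BM25 and not how Weaviate computes scores: it just counts query-term hits per property and multiplies each count by that property's weight. The function name `toy_keyword_score` and the sample profiles are made up for illustration.

```python
def toy_keyword_score(query, obj, weights):
    """Toy scoring: per-property count of query-term hits, scaled by a weight.

    This only illustrates the effect of a property weight; real BM25 also
    accounts for term rarity and field length.
    """
    terms = query.lower().split()
    score = 0.0
    for prop, weight in weights.items():
        tokens = str(obj.get(prop, "")).lower().split()
        hits = sum(tokens.count(term) for term in terms)
        score += weight * hits
    return score


# Hypothetical sample data: one profile matches on the name, the other
# only mentions the same name inside its pitch.
profiles = [
    {"full_name": "Jon Doe", "marketing_pitch": "Backend engineer"},
    {"full_name": "Ann Lee", "marketing_pitch": "Worked with Jon Doe on search"},
]

unboosted = {"full_name": 1.0, "marketing_pitch": 1.0}
boosted = {"full_name": 2.0, "marketing_pitch": 1.0}  # mirrors "full_name^2"

for label, weights in (("unboosted", unboosted), ("boosted", boosted)):
    scores = [toy_keyword_score("Jon Doe", p, weights) for p in profiles]
    print(label, scores)
```

Without the boost, the two toy profiles tie. With the weight of 2 on full_name, the profile whose name matches pulls ahead, which is the behaviour you are after.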

Alternatively, if you just want to make sure that the full_name is always matched, you could add a filter to the query. Note that only objects matching the filter will be returned, so use it only when the full_name must match.
For example, you could use the Equal or Like operators (see the docs for more info).
Like this:

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="doing something",
        properties=["marketing_pitch", "some_other_prop"],
        alpha=0.5
    )
    .with_where({
        "path": ["full_name"],
        "operator": "Like",
        "valueText": "*Jon Doe*"
    })
    .do()
)

Running Hybrid search on not vectorized properties

Yes, you can use the keyword part of the hybrid search on any text property in the database. The properties don’t need to be vectorized.

Also, when you define the data schema, you can specify the type of tokenization per property.
Unfortunately, I am not an expert on the topic, but you can learn more here.
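
As a rough sketch of what that could look like, here is a class definition written as a Python dict. The class name Profile and the property names are guesses based on your description, not your actual schema. The `skip` flag under `moduleConfig` keeps a property out of the vector while leaving it searchable by keyword, and `tokenization` controls how the text is split for keyword matching.

```python
# Assumed vectorizer module name, taken from this thread.
VECTORIZER = "text2vec-transformers"


def prop(name, data_type, vectorize, tokenization="word"):
    """Build one property definition (hypothetical helper for this example)."""
    return {
        "name": name,
        "dataType": [data_type],
        "tokenization": tokenization,
        # skip=True excludes the property from the object's vector only.
        "moduleConfig": {VECTORIZER: {"skip": not vectorize}},
    }


profile_class = {
    "class": "Profile",
    "vectorizer": VECTORIZER,
    "properties": [
        prop("full_name", "text", vectorize=False),
        prop("marketing_pitch", "text", vectorize=True),
        prop("experience_story", "text", vectorize=True),
        prop("tech_skills", "text[]", vectorize=True),
        prop("languages", "text[]", vectorize=False),
    ],
}

# With the v3 Python client this dict would be passed to:
# client.schema.create_class(profile_class)
```

If you later want exact matching on a property, "field" tokenization treats the whole value as one token, while "word" splits on non-alphanumeric characters and lowercases.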

sebawita · November 2, 2023, 12:11am

What vectorizer do you use?
Do you generate vector embeddings outside of Weaviate, or do you use one of the Weaviate modules?

It would be great if you could share your Schema configuration.

What do you mean by “through languages”?
Are you running queries in a different language from the language of the text? For example, your content is in German and the query is in English?

If yes, then the keyword part of the query will most likely not work for words that are in a different language. However, depending on the ML model you use (vectorizer), you should still be able to match results in other languages.

Note: I've used Cohere's models for multi-lingual search, and I was quite happy with the results. Here is a small recipe that shows end-to-end how to use it.

junbetterway · November 2, 2023, 1:59am

Hi @sebawita - thank you for taking the time to shed some light on this.
Agreed, there were a lot of questions, but you managed to give a path forward for each one, so thank you so much!

Please see my replies below (with some follow-up questions).

Full Name - importance

Thank you so much for this - I wonder how I missed this weight-boost capability in BM25 when I first implemented the hybrid search.

Just a quick question though: if I do this, I now need to declare all my targeted properties for keyword search, right?

properties=["full_name^2", "marketing_pitch", "xxx", "yyy", "zzz", so on]

Is there no way to just say "boost the full_name in the BM25 search but still include the rest"? If not, I am still OK with this and will start checking it out in a couple of days.

Also, what if I do something like the query below, where I declare only the non-vectorized props and boost only the full name:

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="Jon Doe doing something",
        properties=["full_name^2", "languages", "...other non-vectorized props"],
        alpha=0.5
    )
    .do()
)

Will the vector search part of the hybrid query still work on my vectorized props (e.g., marketing_pitch, experience story) even if they are not declared under properties?

Running Hybrid search on not vectorized properties

Thanks for confirming this.

Regarding the tokenization method, we are using the default “word”, so it should work with search strings as shown here: Tokenization and Search Filtering. I think that because I am using hybrid with alpha=0.5, the vector scoring somehow affects the result. I will try it out once I have played around with the boosted props.

What vectorizer do you use?

We are using the Weaviate module: text2vec-transformers where we use the pre-built image:

semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1

To your knowledge, is there a recommended model with a higher token limit (>250) among the Weaviate pre-built images?

What do you mean by “through languages”?

Sorry for the confusion. This is just a field in our Profile model: a collection of the languages that our users can speak. It has nothing to do with multi-language searching, as we only use English in our platform for now.

I gave it as an example of a field that is not vectorized. The languages[] field is of type text[] rather than text. Will it still be part of the hybrid search?

Thank you!

sebawita · November 2, 2023, 11:31pm

List of properties for keyword search

That is correct. The moment you add properties inside with_hybrid(), you need to list all the properties you want to use for the keyword search part of the query.
If you only list full_name (like below), then only full_name will be used.

properties=["full_name^2"]
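
Since every keyword property has to be listed once you pass properties, one way to avoid maintaining that list by hand is a small helper that builds it from your property names plus a dict of boosts. This is only a sketch: `build_hybrid_properties` is a hypothetical helper, not part of the Weaviate client, and the property names come from this thread.

```python
def build_hybrid_properties(keyword_props, boosts=None):
    """Return a properties list for with_hybrid(), adding ^N to boosted entries.

    keyword_props: every property the keyword search should look at.
    boosts: optional mapping of property name -> boost factor (e.g. 2).
    """
    boosts = boosts or {}
    unknown = set(boosts) - set(keyword_props)
    if unknown:
        # A boost on a property that is not listed would silently do nothing.
        raise ValueError(f"Boosted properties not in keyword_props: {sorted(unknown)}")

    result = []
    for name in keyword_props:
        factor = boosts.get(name)
        result.append(f"{name}^{factor}" if factor else name)
    return result


hybrid_props = build_hybrid_properties(
    ["full_name", "marketing_pitch", "languages"],
    boosts={"full_name": 2},
)
print(hybrid_props)
```

The resulting list can then be passed as properties=hybrid_props in with_hybrid().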

Properties vs vector search

That is correct. The list of properties doesn't affect the vector search part of the query; it only affects the keyword search.
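
This also explains why the vector half can still reorder results at alpha=0.5, whatever you put in properties. Below is a minimal sketch of blending the two score sets, where alpha=1 is pure vector and alpha=0 is pure keyword. It uses min-max normalization followed by a weighted sum, which is one possible fusion approach. Weaviate offers more than one fusion algorithm, so treat this as an illustration and not as its implementation. All names and scores here are invented.

```python
def min_max_normalize(scores):
    """Scale a dict of id -> score into the range [0, 1]."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {key: 1.0 for key in scores}
    return {key: (value - lowest) / (highest - lowest) for key, value in scores.items()}


def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized scores: alpha weights vector, (1 - alpha) weights keyword."""
    keyword = min_max_normalize(keyword_scores)
    vector = min_max_normalize(vector_scores)
    ids = set(keyword) | set(vector)
    return {
        obj_id: alpha * vector.get(obj_id, 0.0) + (1 - alpha) * keyword.get(obj_id, 0.0)
        for obj_id in ids
    }


# Invented scores: "a" is the best keyword match, "b" the best vector match.
keyword_scores = {"a": 4.0, "b": 2.0, "c": 0.0}
vector_scores = {"a": 0.2, "b": 0.9, "c": 0.6}

for alpha in (0.0, 0.5, 1.0):
    fused = fuse(keyword_scores, vector_scores, alpha)
    ranking = sorted(fused, key=fused.get, reverse=True)
    print(alpha, ranking)
```

In this toy data the top keyword match wins only at alpha=0. At alpha=0.5 the object with the strongest vector score overtakes it, so if name matches must come first, lowering alpha is another lever alongside the boost.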

Properties (keyword) vs vector search

Thanks for confirming the properties vs vector search behavior. I will give this a try and let you know, hopefully next week.
