How can we make hybrid search results more predictable?

I have a user requirement where we have a Profile model which has the following information (all are being saved in Weaviate):

  • Full Name
  • Marketing pitch (free-text) → vectorized
  • Experience story (free-text) → vectorized
  • List of tech skills (array) → vectorized
  • List of languages (array)
  • Other properties, so on

The pitch, story and tech skills are the only ones we are vectorizing as our vectorizer is limited to 250 tokens.

So, are we right that if we perform a hybrid search, languages and other properties that are not vectorized but are saved in Weaviate should still contribute to the BM25 scoring? The results are somehow not matching our expectations. We are using the default alpha=0.5 (so 50-50 keyword vs vector).

Also, is there a way to give a higher score to a matching full name? Our client wants search by full name to take priority. If not, is there a recommended way to achieve this?

Hi @junbetterway,

There are many topics to unpack from your post :wink:

Full Name - importance

Let’s start with giving the full name property more importance.

Both bm25 and hybrid search offer a mechanism to boost matches on specific properties. You do that by appending ^2, ^3, etc. to a property name in the list of hybrid properties: ^2 doubles that property's importance, and ^3 triples it.

Here is an example query in Python:

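import weaviate

# Assumes the v3 Python client and a local Weaviate instance; adjust the URL to your setup.
client = weaviate.Client("http://localhost:8080")

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="Jon Doe doing something",
        # "^2" doubles the weight of full_name in the keyword (BM25) part
        properties=["full_name^2", "marketing_pitch"],
        alpha=0.5
    )
    .do()
)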

You can see examples in other languages in the docs.
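
To make the ^N notation more concrete, here is a minimal sketch in plain Python of how a boosted property list could be read and applied as multipliers to per-property keyword scores. It is an illustration only and not Weaviate's internal BM25 implementation; the helper names `parse_boost` and `weighted_keyword_score` and the sample scores are hypothetical.

```python
def parse_boost(spec: str) -> tuple[str, float]:
    """Split 'full_name^2' into ('full_name', 2.0); no suffix means weight 1.0."""
    name, sep, weight = spec.partition("^")
    return name, float(weight) if sep else 1.0


def weighted_keyword_score(per_property_scores: dict[str, float],
                           properties: list[str]) -> float:
    """Sum the per-property scores, each multiplied by its boost weight."""
    total = 0.0
    for spec in properties:
        name, weight = parse_boost(spec)
        # Properties without a match contribute nothing.
        total += weight * per_property_scores.get(name, 0.0)
    return total


# Hypothetical keyword scores for one object, one score per property.
scores = {"full_name": 1.5, "marketing_pitch": 1.0}
plain = weighted_keyword_score(scores, ["full_name", "marketing_pitch"])
boosted = weighted_keyword_score(scores, ["full_name^2", "marketing_pitch"])
```

With the boost, a match on full_name counts twice as much towards the keyword score as it does without it, so objects matching on the name should move up in the keyword ranking.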

Alternatively, if you just want to make sure that full_name is always matched, you could add a filter to the query. Note that only objects matching the filter will be returned, so use it only when full_name must match.
For example, you could use the Equal or Like operators (see the docs for more info), like this:

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="doing something",
        properties=["marketing_pitch", "some_other_prop"],
        alpha=0.5
    )
    .with_where({
        "path": ["full_name"],
        "operator": "Like",
        "valueText": "*Jon Doe*"
    })
    .do()
)
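
If you need to switch between an exact match and a wildcard match, a small helper can build the filter for you. This is only a sketch: `full_name_filter` is a hypothetical helper, and it simply produces the same kind of dictionary that is passed to with_where() in the query above.

```python
def full_name_filter(name: str, exact: bool = False) -> dict:
    """Build a where-filter dictionary for the full_name property."""
    if exact:
        # Equal: the value has to match the given name.
        return {"path": ["full_name"], "operator": "Equal", "valueText": name}
    # Like: "*" on both sides allows any characters before and after the name.
    return {"path": ["full_name"], "operator": "Like", "valueText": f"*{name}*"}


wildcard_filter = full_name_filter("Jon Doe")
exact_filter = full_name_filter("Jon Doe", exact=True)
```

You would then pass the result to the query, e.g. .with_where(full_name_filter("Jon Doe")).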

Running Hybrid search on not vectorized properties

Yes, you can use the keyword part of the hybrid search on any text property in the database. The properties don’t need to be vectorized.

Also, as you define the data schema, you can specify the type of tokenization per property.
Unfortunately, I am not an expert on the topic, but you can learn more here.
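
To give a rough idea of why tokenization matters for the keyword part, here is a small plain-Python approximation of the four text tokenization options (word, lowercase, whitespace, field). Treat it as a sketch of the general behavior and not as Weaviate's actual tokenizer; the `tokenize` function is hypothetical, and edge cases may differ from the real implementation.

```python
import re


def tokenize(text: str, method: str = "word") -> list[str]:
    """Approximate how a text value is split into keyword-search tokens."""
    if method == "word":
        # Lowercase, then split on anything that is not a letter or digit.
        return re.findall(r"[^\W_]+", text.lower())
    if method == "lowercase":
        # Lowercase, then split on whitespace only (punctuation is kept).
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only, case preserved.
        return text.split()
    if method == "field":
        # The whole (trimmed) value is a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")


word_tokens = tokenize("Hello, (beautiful) world")
field_tokens = tokenize(" Jon Doe ", "field")
```

Under this approximation, a full_name value of "Jon Doe" with word tokenization matches a query for "jon" or for "doe" on its own, while with field tokenization only the whole value matches.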

What vectorizer do you use?
Do you generate vector embeddings outside of Weaviate, or do you use one of the Weaviate modules?

It would be great if you could share your Schema configuration.

What do you mean by “through languages”?
Are you running queries in a different language from the language of the text, i.e. your content is in German and the query is in English?

If yes, then the keyword part of the query will most likely not work for words that are in a different language. However, depending on the ML model you use (vectorizer), you should be able to match results in other languages.

Note: I’ve used Cohere’s models for multilingual search, and I was quite happy with the results. Here is a small recipe that shows end-to-end how to use it.

Hi @sebawita - thank you for taking the time and shedding light on this :metal:
Agreed, my post had a lot of questions, but you managed to give a path forward on each - so thank you so much!

Please see my replies below (with some follow-up questions :bowing_man:).

Full Name - importance

Thank you so much for this - I wonder how I missed this weight-boost capability in BM25 when I first implemented the hybrid search.

Just a quick question though: if I do this, I now need to declare all my targeted properties for keyword search, right?

properties=["full_name^2", "marketing_pitch", "xxx", "yyy", "zzz", so on]

Is there no way to just say: boost full_name in the BM25 search but still include the rest? If not, I am still OK with this and will start checking it out in a couple of days.

Also, what if I do something like the below, where I declare only the non-vectorized props and boost only the full name:

response = (
    client.query
    .get("MyCollection", ["full_name", "marketing_pitch", "some_other_prop"])
    .with_hybrid(
        query="Jon Doe doing something",
        properties=["full_name^2", "languages", "...other non-vectorized props"],
        alpha=0.5
    )
    .do()
)

Will the vector search part of the hybrid query still work on my vectorized props (e.g., marketing_pitch, experience story) even if they are not declared under properties?

Running Hybrid search on not vectorized properties

Thanks for confirming this.

Regarding the tokenization method, we are using the default “word”, so it should work with search strings as shown here: Tokenization and Search Filtering. I think that because I am using hybrid with alpha=0.5, the vector scoring is somehow affecting the result, but I will test this once I have played around with the boosted props.

What vectorizer do you use?

We are using the Weaviate module: text2vec-transformers where we use the pre-built image:

semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1

To your knowledge, is there a recommended model among the Weaviate pre-built images that supports a higher token limit (>250)?

What do you mean by “through languages”?

Sorry for the confusion. This is just a field in our Profile model: a collection of the languages that our users can speak, so it has nothing to do with multi-language searching. We are only using English in our platform for now.

I just gave it as an example: the field languages[] is not vectorized, but will it still be searched as part of the hybrid search? It is of type text[] and not text.


Thank you!

List of properties for keyword search

That is correct. The moment you add properties inside with_hybrid(), you need to list all the properties you want to use for the keyword search part of the query.
If you only list full_name (like below), then only full_name will be used.

properties=["full_name^2"]
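
Since every keyword property has to be listed explicitly, one way to avoid repeating yourself is to generate the list from your property names and a separate map of boosts. A minimal sketch follows; `hybrid_properties` is a hypothetical helper, and the property names are the ones used as examples in this thread.

```python
def hybrid_properties(keyword_props: list[str], boosts: dict[str, int]) -> list[str]:
    """Return the property list for hybrid search, adding '^N' where a boost is set."""
    return [f"{prop}^{boosts[prop]}" if prop in boosts else prop
            for prop in keyword_props]


# All properties that should take part in the keyword search.
all_keyword_props = ["full_name", "marketing_pitch", "languages"]

# Boost only full_name; the other properties keep the default weight.
props = hybrid_properties(all_keyword_props, {"full_name": 2})
```

The resulting list can be passed as properties=props in with_hybrid(), so boosting full_name does not require maintaining a second hand-written list.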

Properties vs vector search

That is correct, the list of properties doesn’t affect the vector search part of the query (it only affects the keyword search).
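
As a rough mental model of how the two parts combine, here is a simplified sketch of an alpha-weighted blend, where alpha=0 is pure keyword search and alpha=1 is pure vector search. It assumes both scores are already normalized to the same range; Weaviate's actual fusion algorithm is more involved, and `hybrid_score` is a hypothetical function, not part of the client.

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float = 0.5) -> float:
    """Blend a keyword score and a vector score; alpha is the weight of the vector part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


# The properties list (and any ^N boost) only changes keyword_score;
# vector_score comes from the object's vector and stays the same.
without_boost = hybrid_score(keyword_score=0.4, vector_score=0.7)
with_boost = hybrid_score(keyword_score=0.8, vector_score=0.7)
```

This also illustrates why results at alpha=0.5 can look surprising: an object with a strong keyword match can still be outranked by one with a weaker keyword match but a much higher vector score.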

Other questions

I need to check with the team, as I don’t use text2vec-transformers much, and need to check about text[] properties with hybrid.

Properties (keyword) vs vector search

Thanks for confirming the properties vs vector search behavior. I will give this a try and let you know, hopefully next week :crossed_fingers:

Other questions

Thank you!

I need to check with the team, as I don’t use text2vec-transformers much, and need to check about text[] properties with hybrid.

Hi @junbetterway - I believe the T5 models (like google/flan-t5-base; see the list here) can accept 512 tokens, or even longer, according to this discussion.

I don’t think we have a specific one that we recommend.

I hope that helps!

JP

Yes, text arrays should be supported by hybrid and bm25

Great, thanks @sebawita @jphwang - thanks to both of you, especially @sebawita, for answering this question.

We can consider this solved/closed :slight_smile:

I will just post a new question (and try to be more specific) if I hit some bumps along the road. Cheers!
