Starting to have fun :)

rjalex · February 17, 2024, 11:48am

We built a benchmark of 83 tests against a collection of 14K objects. The search strings are paraphrases of the original texts; each one is used as-is for the keyword (inverted index) search and also vectorized for the almighty hybrid search, which we run with every value of alpha from 0 to 1 in steps of 0.1. The hybrid search is limited to 20 results. Sharing the result just to show the approach:

Because the search string is a paraphrase that does not reuse the original's terms verbatim, the best results show up at alphas around 0.7 to 0.8.
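A minimal sketch of such an alpha sweep, not the actual benchmark code. The `search` callable is a hypothetical stand-in for whatever runs the hybrid query and returns ranked object IDs; `alpha_grid` and `sweep_alphas` are illustrative names.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (query, alpha, limit) -> ranked list of object IDs.
SearchFn = Callable[[str, float, int], List[str]]


def alpha_grid(steps: int = 10) -> List[float]:
    """Return 0.0, 0.1, ..., 1.0; dividing integers avoids float drift."""
    return [i / steps for i in range(steps + 1)]


def sweep_alphas(
    tests: List[Tuple[str, str]], search: SearchFn, limit: int = 20
) -> Dict[float, float]:
    """For each alpha, return the fraction of tests whose expected ID
    appears anywhere in the top `limit` results."""
    if not tests:
        raise ValueError("the benchmark needs at least one test")
    recall = {}
    for alpha in alpha_grid():
        hits = sum(
            1
            for query, expected_id in tests
            if expected_id in search(query, alpha, limit)
        )
        recall[alpha] = hits / len(tests)
    return recall
```

Each test is a (search string, expected object ID) pair. With the Weaviate Python client, `search` would presumably wrap a hybrid query that takes the query text, `alpha` and `limit`, and return the IDs of the objects it yields; that wiring is an assumption and is left out here.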

DudaNogueira · February 19, 2024, 7:57pm

Hahaha. This is awesome!

I will share this internally.

Thanks for sharing.

bobvanluijt · February 20, 2024, 10:10am

Kewl, thanks for sharing

rjalex · February 20, 2024, 11:03am

As soon as I can, I will try to write a little article on the benchmarking methodology, hoping it will be useful to someone. I think that for hybrid search, looking for one or more optimized alphas is something useful. Keep up the great work!

bobvanluijt · February 20, 2024, 11:23am

That’s awesome! Especially if the blog includes examples where you can see a clear difference. People love that. Thanks

rjalex · February 20, 2024, 11:49am

Do you have any suggestions for a Weaviate collection/source material with a descriptive property/field that can be publicly used as a benchmark target collection? My own is in Italian and has confidential material.

By descriptive I mean a field that describes something in prose, so that I could also describe it with a paraphrase that uses little or none of the original text. Searching with the same words would then exercise the inverted index, while searching with the paraphrase would only work with a vector search.

Something like Kaggle’s Wine Reviews - description field, but easily available to everybody?

bobvanluijt · February 20, 2024, 12:09pm

Ooo that’s a good question. I think you could use a generative model to create some demo content tho

rjalex · February 20, 2024, 12:32pm

What I’m going to do is fetch a subset of Kaggle’s wine descriptions (approx. 20K English texts), then use a generative model to generate paraphrases of them. This will build my “to be searched with vectors” benchmark.
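One possible sanity check on the generated paraphrases (a sketch, not something from the repo): measure how many of a paraphrase's words still occur in the original description. The function names and the simple tokenizer are assumptions for illustration.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into distinct alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(original: str, paraphrase: str) -> float:
    """Fraction of the paraphrase's distinct tokens that also occur
    in the original (0.0 = no shared words, 1.0 = all shared)."""
    paraphrase_tokens = tokens(paraphrase)
    if not paraphrase_tokens:
        return 0.0
    shared = paraphrase_tokens & tokens(original)
    return len(shared) / len(paraphrase_tokens)
```

A low overlap suggests the paraphrase can only be matched by the vector search. Common words such as "and" or "the" inflate the score, so filtering stop words first may give a fairer number.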

bobvanluijt · February 20, 2024, 12:54pm

You can actually kill two birds with one stone.

What I’m going to do is to fetch a subset of Kaggle’s wine descriptions (approx 20K english texts), then use a generative model to generate the paraphrases of it.

That’s an awesome example where a generative feedback loop can be used: Generative Feedback Loops with LLMs for Vector Databases | Weaviate - Vector Database

rjalex · February 20, 2024, 5:28pm

I have jotted down a simple Python repo where I start from the public Kaggle wine list and trim it down to 20K objects such as:

{
    "title": "Cafaggio 2010 Basilica del Pruneto Merlot (Toscana)",
    "keywords": [
        "aromas",
        "cedar",
        "licorice",
        "plum",
        "blackberry",
        "palate",
        "peppercorn",
        "clove",
        "tobacco",
        "Firm",
        "tannins",
        "framework",
        "opens",
        "recall",
        "follow",
        "cracked",
        "provide",
        "Drink"
    ]
},

Each object holds the name of a wine (title) and the keywords describing it.
With this base the alphas graph is obviously very different:

which, not surprisingly, demonstrates that the inverted index (sparse) search is superior.
If anyone is interested in taking a peek at the repo, it would contain all the necessary files (although I will remove the very large original Kaggle list, which you would need only to run the 01 program to generate another selection from it) and a pyproject.toml file to install all prerequisites with poetry.
Just add your Weaviate instance and have fun.
PS The .env.copy file will need to be copied/renamed to .env, and you would put your own OpenAI API key in it.
PPS The next experiment will be randomly selecting only 3 keywords and searching with those; this would be closer to a real keyword search.
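The keyword sampling for that experiment could look roughly like this. It is a sketch that assumes entries shaped like the object above, with "title" and "keywords" fields; `sample_keywords` is a hypothetical helper, not a file from the repo.

```python
import random
from typing import Dict, Optional


def sample_keywords(
    entry: Dict, k: int = 3, rng: Optional[random.Random] = None
) -> Dict:
    """Return a copy of `entry` that keeps only `k` randomly chosen keywords.

    Entries with fewer than `k` keywords keep all of them.
    """
    rng = rng or random.Random()
    keywords = entry["keywords"]
    chosen = rng.sample(keywords, min(k, len(keywords)))
    return {"title": entry["title"], "keywords": chosen}
```

Passing a seeded generator, for example `random.Random(42)`, makes the selection reproducible between benchmark runs.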

rjalex · February 20, 2024, 6:59pm

And now the last index search experiment. The original benchmark had an average of over 16 keywords per wine, and with all of those we get the super-duper recall of the previous graph.

I have now randomly selected only 3 keywords from each. Here is an example:

{
    "title": "Moccagatta 2012 Basarin  (Barbaresco)",
    "keywords": [
        "resin",
        "vanilla",
        "plum"
    ]
},

and here are the results:

As you can see, even with the hybrid search completely skewed towards using only the inverted index (alpha=0.0), the recall is only around 75%. We get the expected object first in the retrieved results just under 40% of the time, and within the first 3 around 50% of the time.

If you start cranking up the contribution of the vector search, performance drops dramatically, with a total meltdown at an alpha of 1.
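For reference, the three figures quoted for a single alpha (recall, first place, within the first 3) can be derived from the rank of the expected object in each result list. This is a sketch with illustrative names, not the repo's code.

```python
from typing import Dict, List, Optional


def rank_of(expected_id: str, results: List[str]) -> Optional[int]:
    """1-based position of the expected object, or None if it is missing."""
    try:
        return results.index(expected_id) + 1
    except ValueError:
        return None


def summarize(ranks: List[Optional[int]]) -> Dict[str, float]:
    """Turn one rank per test into recall, first-place and top-3 rates."""
    if not ranks:
        raise ValueError("need at least one test result")
    total = len(ranks)
    found = [r for r in ranks if r is not None]
    return {
        "recall": len(found) / total,                # retrieved at all
        "first": sum(r == 1 for r in found) / total,  # retrieved at rank 1
        "top3": sum(r <= 3 for r in found) / total,   # retrieved at rank 1 to 3
    }
```

For four tests with ranks 1, 2, missing and 7, `summarize([1, 2, None, 7])` gives a recall of 0.75, a first-place rate of 0.25 and a top-3 rate of 0.5.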

Now off to feed the family and tomorrow I might continue with the paraphrase benchmarking.

You all take care. Viva Weaviate !!!

Dirk · February 21, 2024, 4:05am

Cool, thanks for sharing!

An interesting way to extend this would be to test the two different hybrid fusion algorithms (Search operators | Weaviate - Vector Database) and check if you see a difference.
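To illustrate the general idea behind the two approaches, here is a simplified sketch of rank-based fusion and of min-max score fusion. It is not Weaviate's exact implementation: the constant 60 and the function names are assumptions, and both score dictionaries are assumed to be "higher is better".

```python
from typing import Dict, List


def ranked_fusion(
    keyword_ids: List[str], vector_ids: List[str], alpha: float, k: int = 60
) -> Dict[str, float]:
    """Score each object by its position in each result list.

    alpha weights the vector list and (1 - alpha) the keyword list.
    """
    fused: Dict[str, float] = {}
    for weight, ids in ((1 - alpha, keyword_ids), (alpha, vector_ids)):
        for rank, obj_id in enumerate(ids):
            fused[obj_id] = fused.get(obj_id, 0.0) + weight / (k + rank)
    return fused


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores so the lowest becomes 0 and the highest 1."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {obj_id: 1.0 for obj_id in scores}
    return {obj_id: (s - low) / (high - low) for obj_id, s in scores.items()}


def relative_score_fusion(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    alpha: float,
) -> Dict[str, float]:
    """Combine normalized raw scores instead of rank positions."""
    fused: Dict[str, float] = {}
    weighted = ((1 - alpha, min_max(keyword_scores)), (alpha, min_max(vector_scores)))
    for weight, scores in weighted:
        for obj_id, score in scores.items():
            fused[obj_id] = fused.get(obj_id, 0.0) + weight * score
    return fused
```

The rank-based variant ignores how far apart the raw scores are, while the score-based variant preserves those gaps, so the two can order the same candidates differently at the same alpha.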
