BUG - Duplicate and inconsistent results of BM25 search

Tomas · April 18, 2024, 10:49am

Hi,

we are storing non-English texts in WCS. Because the Stopwords Preset for the inverted index (BM25) is set to None, many of our queries contain high-occurrence keywords, and the keyword search gets compromised.

We can replicate it for English texts and demonstrate it on the Quickstart Tutorial in the Weaviate Docs (Quickstart Tutorial | Weaviate - Vector Database). Using the same code and the same data (10 entries from the TV quiz show “Jeopardy!” with properties ‘category’, ‘question’, ‘answer’), a BM25 search for the query “the science is” returns 6 entries, all with the same score 0.35819104313850403.

Once you set Stopwords Preset to None in the collection definition (and so allow searching for high-occurrence words), the query returns more entries than there are in the original dataset (i.e. more than 10), meaning it returns duplicate entries with different scores. To make things worse, the result is very different every time you make a fresh import of the dataset (this applies to both insert_many() and batch import).

An example result in full (returns 14 objects, 4 duplicates - answers: DNA, Liver, Antelope, species):

0.6708551645278931
, BM25F_is_frequency:1, BM25F_is_propLength:14, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'wire', 'question': 'A metal that is ductile can be pulled into this while cold & under pressure', 'category': 'SCIENCE'}

0.5260506272315979
, BM25F_is_frequency:1, BM25F_is_propLength:10, BM25F_the_frequency:1, BM25F_the_propLength:3
{'answer': 'the diamondback rattler', 'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'category': 'ANIMALS'}

0.4687879681587219
, BM25F_science_propLength:1, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_science_frequency:1
{'answer': 'the atmosphere', 'question': 'Changes in the tropospheric layer of this are what gives us weather', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'Sound barrier', 'question': 'In 70-degree air, a plane traveling at about 1,130 feet per second breaks it', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:12
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.31266409158706665
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.1350332498550415
, BM25F_the_frequency:2, BM25F_the_propLength:9
{'answer': 'Elephant', 'question': "It's the only living mammal in the order Proboseidea", 'category': 'ANIMALS'}

0.11059693247079849
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.10673391073942184
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.09976458549499512
, BM25F_the_frequency:2, BM25F_the_propLength:17
{'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'}

0.07754258811473846
, BM25F_the_propLength:12, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.05944186821579933
, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}
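One way to compare the duplicated rows above is to parse the explain_score string into a dictionary. The following is only a minimal sketch: the helper name parse_explain_score is hypothetical, and it assumes the string always has the comma-separated "key:value" shape shown in the output above. Applied to the two 'Antelope' rows, it shows that the term statistics are identical even though the scores differ.

```python
def parse_explain_score(explain: str) -> dict[str, float]:
    """Parse ', BM25F_is_frequency:1, BM25F_is_propLength:14' into a dict.

    Assumes the comma-separated 'key:value' format seen in this post.
    """
    fields = {}
    for part in explain.split(","):
        part = part.strip()
        if not part:
            continue  # skip the empty piece before the leading comma
        key, _, value = part.rpartition(":")
        fields[key] = float(value)
    return fields


# explain_score strings copied from the two 'Antelope' rows
# (scores 0.31266409158706665 and 0.11059693247079849).
first = parse_explain_score(
    ", BM25F_is_propLength:14, BM25F_the_frequency:2, "
    "BM25F_the_propLength:14, BM25F_is_frequency:1"
)
second = parse_explain_score(
    ", BM25F_is_propLength:14, BM25F_the_frequency:2, "
    "BM25F_the_propLength:14, BM25F_is_frequency:1"
)

print(first == second)  # True: same term statistics, different scores
```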

Example results of 3 different imports of the same dataset:

Import 1 (answer - score)	Import 2 (answer - score)	Import 3 (answer - score)
‘wire’ - 0.6708551645278931	‘wire’ - 0.6708551645278931	‘wire’ - 0.6708551645278931
‘the diamondback rattler’ - 0.5260506272315979	‘the diamondback rattler’ - 0.5260506272315979	‘the diamondback rattler’ - 0.5260506272315979
‘the atmosphere’ - 0.4687879681587219	‘the atmosphere’ - 0.4687879681587219	‘the atmosphere’ - 0.4687879681587219
‘Antelope’ - 0.42326104640960693	‘DNA’ - 0.35819104313850403	‘Liver’ - 0.4357336163520813
‘species’ - 0.41763290762901306	‘Liver’ - 0.35819104313850403	‘Sound barrier’ - 0.35819104313850403
‘DNA’ - 0.35819104313850403	‘Sound barrier’ - 0.35819104313850403	‘species’ - 0.35819104313850403
‘Liver’ - 0.35819104313850403	‘species’ - 0.35819104313850403	‘DNA’ - 0.35819104313850403
‘Sound barrier’ - 0.35819104313850403	‘Antelope’ - 0.31266409158706665	‘Antelope’ - 0.31266409158706665
‘Elephant’ - 0.1350332498550415	‘Elephant’ - 0.1350332498550415	‘Elephant’ - 0.1350332498550415
‘DNA’ - 0.10673391073942184	‘Antelope’ - 0.11059693247079849	‘Antelope’ - 0.11059693247079849
‘the nose or snout’ - 0.09976458549499512	‘DNA’ - 0.10673391073942184	‘DNA’ - 0.10673391073942184
‘Liver’ - 0.07754258811473846	‘the nose or snout’ - 0.09976458549499512	‘the nose or snout’ - 0.09976458549499512
	‘Liver’ - 0.07754258811473846	‘species’ - 0.05944186821579933
	‘species’ - 0.05944186821579933

12 results (2 duplicates)	14 results (4 duplicates)	13 results (3 duplicates)
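The duplicate counts above can be checked with a few lines of Python. This is a sketch under one stated assumption: objects are identified by their answer text, which works for this 10-entry dataset but not in general (against a live collection, comparing object UUIDs would be the safer choice). The helper name duplicate_answers is hypothetical; the data is the second import from the table.

```python
from collections import Counter


def duplicate_answers(results):
    """Return {answer: count} for every answer that occurs more than once."""
    counts = Counter(answer for answer, _score in results)
    return {answer: n for answer, n in counts.items() if n > 1}


# Second import from the table above, as (answer, score) pairs.
second_import = [
    ("wire", 0.6708551645278931),
    ("the diamondback rattler", 0.5260506272315979),
    ("the atmosphere", 0.4687879681587219),
    ("DNA", 0.35819104313850403),
    ("Liver", 0.35819104313850403),
    ("Sound barrier", 0.35819104313850403),
    ("species", 0.35819104313850403),
    ("Antelope", 0.31266409158706665),
    ("Elephant", 0.1350332498550415),
    ("Antelope", 0.11059693247079849),
    ("DNA", 0.10673391073942184),
    ("the nose or snout", 0.09976458549499512),
    ("Liver", 0.07754258811473846),
    ("species", 0.05944186821579933),
]

dupes = duplicate_answers(second_import)
print(len(second_import))  # 14
print(sorted(dupes))       # ['Antelope', 'DNA', 'Liver', 'species']
```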

When you dig deeper into the details, you find that when you query the properties separately (i.e. ‘category’, ‘question’, ‘answer’) via the query_properties parameter of bm25(), the results have consistent scores for each import and no duplicates occur. So the BM25 search itself works, but the algorithm that fuses the per-property scores into a score for the whole object is broken.
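The per-property check can be sketched roughly as follows. The helper name per_property_report is hypothetical; it only relies on the query, query_properties and return_metadata arguments of bm25() that already appear in this post, and it again assumes that the answer text identifies an object.

```python
from collections import Counter


def per_property_report(collection, query, properties, return_metadata=None):
    """Run one BM25 query per property and report duplicates in each result.

    `collection` is expected to behave like a weaviate-client v4 collection,
    i.e. to expose `collection.query.bm25(...)`.
    """
    report = {}
    for prop in properties:
        response = collection.query.bm25(
            query=query,
            query_properties=[prop],  # restrict BM25 to a single property
            return_metadata=return_metadata,
        )
        answers = Counter(obj.properties["answer"] for obj in response.objects)
        report[prop] = {
            "results": len(response.objects),
            "duplicates": sorted(a for a, n in answers.items() if n > 1),
        }
    return report
```

With the collection from the code below, a call could look like per_property_report(questions, "the science is", ["category", "question", "answer"], wvc.query.MetadataQuery(score=True)). An empty "duplicates" list for every property would match the behaviour described above.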

Also, we don’t think there is a problem with the Stopwords Preset itself. More likely the problem lies in the keyword search for high-occurrence words, which is merely hidden when the Stopwords Preset is set to EN.

The same problem occurs with larger datasets; the small dataset was chosen for brevity.

It might seem like a small bug, but it prevents any production use for non-English texts and for domain-specific texts in which the same words occur very frequently without being defined as stopwords (e.g. legal texts).

Weaviate Server Version: 1.24.8
Deployment Method: WCS
Client Language and Version: Python weaviate-client 4.5.5

Code used:

import weaviate
import weaviate.classes as wvc
import os
import requests
import json

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
    headers={
        # Replace with your inference API key
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)

try:
    # ===== define collection =====
    questions = client.collections.create(
        name="Question",
        # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        # Ensure the `generative-openai` module is used for generative queries
        generative_config=wvc.config.Configure.Generative.openai(),
        inverted_index_config=wvc.config.Configure.inverted_index(
            stopwords_preset=wvc.config.StopwordsPreset.NONE,
        ),
    )

    # ===== import data =====
    resp = requests.get(
        'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
    data = json.loads(resp.text)  # Load data

    question_objs = list()
    for i, d in enumerate(data):
        question_objs.append({
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        })

    questions = client.collections.get("Question")
    questions.data.insert_many(question_objs)

    response = questions.query.bm25(
        query="the science is",
        # explain_score is requested as well, since it is printed below
        return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    )

    for obj in response.objects:
        print(obj.metadata.score)
        print(obj.metadata.explain_score)
        print(obj.properties)
        print()

finally:
    client.close()  # Close client gracefully

DudaNogueira · April 18, 2024, 2:29pm

hi @Tomas !! Welcome to our community!

Nice catch. Looks like a bug indeed. I was also able to reproduce it on my end.

Thanks a lot for this finding and awesome report!

Do you mind opening an issue on our GitHub?

I can also do it, if you can’t.

Thanks again!

DudaNogueira · April 19, 2024, 7:05pm

by the way, the issue was created. Thanks a lot.

Link for reference: Duplicate and inconsistent results of BM25 search · Issue #4719 · weaviate/weaviate · GitHub

Tomas · August 11, 2024, 12:53pm

As far as I can see, no one has even been assigned to this problem.

DudaNogueira · August 12, 2024, 9:37pm

hi @Tomas !

Sorry for the delay.

Unfortunately, our team had a “laser focus” on some implementations delivered in 1.25 and 1.26.

As those are now released, we’ll have a sprint to deal with these kinds of issues, as well as reviewing bounties, etc.

Thanks for your patience!

Neil · October 2, 2024, 6:19pm

Hi @DudaNogueira, I think your team is doing a great job on following up.

I was curious: is there a plan to work on this issue before the end of the calendar year?

Also, will hybrid search be tested for duplicate and inconsistent results alongside BM25 as part of this bug fix?

Dirk · October 4, 2024, 9:45am

There has been a fix in the latest version for BM25 - could you have a look if it still happens for you?

If yes - could you provide an example for us to reproduce it?
