BUG - Duplicate and inconsistent results of BM25 search

Hi,

We are storing non-English texts in WCS. Because the Stopwords Preset for the inverted index (BM25) is set to None, many of our searches involve high-occurrence keywords, and the keyword search results become unreliable.

We can replicate the problem with English texts and demonstrate it on the Quickstart Tutorial in the Weaviate docs. Using the same code and the same data (10 entries from the TV quiz show “Jeopardy!” with properties ‘category’, ‘question’, ‘answer’), a BM25 search for the query “the science is” returns 6 entries, all with the same score 0.35819104313850403.

Once you set the Stopwords Preset to None in the collection definition (which allows searching for high-occurrence words), the search returns more entries than the original dataset contains (i.e. more than 10), meaning it returns duplicate entries with different scores. To make things worse, the result differs every time you make a fresh import of the dataset (this applies to both insert_many() and batch import).

An example result in full (returns 14 objects, 4 duplicates - answers: DNA, Liver, Antelope, species):

0.6708551645278931
, BM25F_is_frequency:1, BM25F_is_propLength:14, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'wire', 'question': 'A metal that is ductile can be pulled into this while cold & under pressure', 'category': 'SCIENCE'}

0.5260506272315979
, BM25F_is_frequency:1, BM25F_is_propLength:10, BM25F_the_frequency:1, BM25F_the_propLength:3
{'answer': 'the diamondback rattler', 'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'category': 'ANIMALS'}

0.4687879681587219
, BM25F_science_propLength:1, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_science_frequency:1
{'answer': 'the atmosphere', 'question': 'Changes in the tropospheric layer of this are what gives us weather', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'Sound barrier', 'question': 'In 70-degree air, a plane traveling at about 1,130 feet per second breaks it', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:12
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.31266409158706665
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.1350332498550415
, BM25F_the_frequency:2, BM25F_the_propLength:9
{'answer': 'Elephant', 'question': "It's the only living mammal in the order Proboseidea", 'category': 'ANIMALS'}

0.11059693247079849
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.10673391073942184
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.09976458549499512
, BM25F_the_frequency:2, BM25F_the_propLength:17
{'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'}

0.07754258811473846
, BM25F_the_propLength:12, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.05944186821579933
, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}

Example results of 3 different imports of the same dataset:

| Import 1: answer + score | Import 2: answer + score | Import 3: answer + score |
|---|---|---|
| ‘wire’ - 0.6708551645278931 | ‘wire’ - 0.6708551645278931 | ‘wire’ - 0.6708551645278931 |
| ‘the diamondback rattler’ - 0.5260506272315979 | ‘the diamondback rattler’ - 0.5260506272315979 | ‘the diamondback rattler’ - 0.5260506272315979 |
| ‘the atmosphere’ - 0.4687879681587219 | ‘the atmosphere’ - 0.4687879681587219 | ‘the atmosphere’ - 0.4687879681587219 |
| ‘Antelope’ - 0.42326104640960693 | ‘DNA’ - 0.35819104313850403 | ‘Liver’ - 0.4357336163520813 |
| ‘species’ - 0.41763290762901306 | ‘Liver’ - 0.35819104313850403 | ‘Sound barrier’ - 0.35819104313850403 |
| ‘DNA’ - 0.35819104313850403 | ‘Sound barrier’ - 0.35819104313850403 | ‘species’ - 0.35819104313850403 |
| ‘Liver’ - 0.35819104313850403 | ‘species’ - 0.35819104313850403 | ‘DNA’ - 0.35819104313850403 |
| ‘Sound barrier’ - 0.35819104313850403 | ‘Antelope’ - 0.31266409158706665 | ‘Antelope’ - 0.31266409158706665 |
| ‘Elephant’ - 0.1350332498550415 | ‘Elephant’ - 0.1350332498550415 | ‘Elephant’ - 0.1350332498550415 |
| ‘DNA’ - 0.10673391073942184 | ‘Antelope’ - 0.11059693247079849 | ‘Antelope’ - 0.11059693247079849 |
| ‘the nose or snout’ - 0.09976458549499512 | ‘DNA’ - 0.10673391073942184 | ‘DNA’ - 0.10673391073942184 |
| ‘Liver’ - 0.07754258811473846 | ‘the nose or snout’ - 0.09976458549499512 | ‘the nose or snout’ - 0.09976458549499512 |
| | ‘Liver’ - 0.07754258811473846 | ‘species’ - 0.05944186821579933 |
| | ‘species’ - 0.05944186821579933 | |
| 12 results (2 duplicates) | 14 results (4 duplicates) | 13 results (3 duplicates) |
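
To make the import-to-import differences easier to spot, the best score per answer can be compared across the three imports. The sketch below only covers four answers as an excerpt, with values copied from the listing above; the helper name `unstable_answers` is hypothetical.

```python
# Highest score per answer in each of the three imports, copied from the
# listing above. Only four answers are included here as an excerpt.
best_scores = {
    "DNA": [0.35819104313850403, 0.35819104313850403, 0.35819104313850403],
    "Liver": [0.35819104313850403, 0.35819104313850403, 0.4357336163520813],
    "species": [0.41763290762901306, 0.35819104313850403, 0.35819104313850403],
    "Antelope": [0.42326104640960693, 0.31266409158706665, 0.31266409158706665],
}


def unstable_answers(scores_by_answer):
    """Return the answers whose best score is not identical in every import."""
    return sorted(
        answer
        for answer, scores in scores_by_answer.items()
        if len(set(scores)) > 1
    )


print(unstable_answers(best_scores))
```

With the same data and the same query, one would expect this list to be empty.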

When you dig deeper, you find that querying the properties separately (i.e. ‘category’, ‘question’, ‘answer’) via the query_properties parameter of bm25() gives consistent scores for each import and no duplicates. So the BM25 search itself works, but the algorithm that fuses the per-property scores into a score for the whole object appears to be broken.
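
The per-property check can be sketched roughly as follows. The helper `scores_per_property` and its `metadata` argument are made up for illustration. It assumes `collection` behaves like the `questions` collection from the code below, and that `metadata` is something like `wvc.query.MetadataQuery(score=True)`.

```python
def scores_per_property(collection, query, properties, metadata):
    """Run one BM25 query per property and collect (uuid, score) pairs.

    `collection` is assumed to expose `query.bm25(...)` like a
    weaviate-client v4 collection; `metadata` is passed on unchanged
    as `return_metadata`.
    """
    results = {}
    for prop in properties:
        response = collection.query.bm25(
            query=query,
            query_properties=[prop],  # restrict the search to one property
            return_metadata=metadata,
        )
        results[prop] = [
            (str(obj.uuid), obj.metadata.score) for obj in response.objects
        ]
    return results


# Hypothetical usage with the collection from the code below:
# scores = scores_per_property(
#     questions,
#     "the science is",
#     ["category", "question", "answer"],
#     wvc.query.MetadataQuery(score=True),
# )
```

Running this after each fresh import and comparing the returned dictionaries is one way to confirm that the per-property scores stay the same.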

Also, we don’t think there is a problem with the Stopwords Preset itself. More likely the problem lies in keyword search for high-occurrence words, and it is merely hidden when the Stopwords Preset is set to EN.

The same problem occurs with larger datasets; the small dataset was chosen for brevity.

It might seem like a small bug, but it prevents any production use for non-English texts and for domain-specific texts in which the same words occur frequently without being defined as stopwords (e.g. legal texts).

  • Weaviate Server Version: 1.24.8
  • Deployment Method: WCS
  • Client Language and Version: Python weaviate-client 4.5.5

Code used:

import weaviate
import weaviate.classes as wvc
import os
import requests
import json

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
    headers={
        # Replace with your inference API key
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)

try:
    # ===== define collection =====
    questions = client.collections.create(
        name="Question",
        # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        # Ensure the `generative-openai` module is used for generative queries
        generative_config=wvc.config.Configure.Generative.openai(),
        inverted_index_config=wvc.config.Configure.inverted_index(
            stopwords_preset=wvc.config.StopwordsPreset.NONE,
        ),
    )

    # ===== import data =====
    resp = requests.get(
        'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
    data = json.loads(resp.text)  # Load data

    question_objs = list()
    for i, d in enumerate(data):
        question_objs.append({
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        })

    questions = client.collections.get("Question")
    questions.data.insert_many(question_objs)

    response = questions.query.bm25(
        query="the science is",
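        # explain_score must be requested too, otherwise obj.metadata.explain_score is None
        return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),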
    )

    for obj in response.objects:
        print(obj.metadata.score)
        print(obj.metadata.explain_score)
        print(obj.properties)
        print()

finally:
    client.close()  # Close client gracefully

hi @Tomas !! Welcome to our community! :hugs:

Nice catch. Looks like a bug indeed. I was also able to reproduce it on my end.

Thanks a lot for this finding and awesome report! :sunglasses:

Do you mind opening an issue on our GitHub?

I can also do it, if you can’t.

Thanks again!

By the way, the issue was created. Thanks a lot.

Link for reference: Duplicate and inconsistent results of BM25 search · Issue #4719 · weaviate/weaviate · GitHub