Unable to fully comprehend the computed score

rjalex · August 6, 2024, 5:20pm

Description

I have a collection which is declared as follows:

import weaviate.classes.config as wvcc  # alias assumed from the wvcc.* usage below

wv_client.collections.create(
    name=wv_artcollname,
    description="A collection of Articles with only a custom list of stopwords",
    vectorizer_config=None,
    inverted_index_config=wvcc.Configure.inverted_index(**BM25_PARAMS),
    properties=[
        wvcc.Property(
            name="kg_article_id",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvcc.Tokenization.FIELD,
        ),  # {GRAPH_BASE}/article/{isoEditionDate}-{slug}
        wvcc.Property(
            name="articletitle",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # as displayed on the page
        wvcc.Property(
            name="isoEditionDate",
            data_type=wvcc.DataType.DATE,  # DATE for RFC3339 ISO8601 date
            skip_vectorization=True,
        ),  # alternative, declare it as TEXT and tokenization=wvcc.Tokenization.FIELD,
        wvcc.Property(
            name="author",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # the author string as displayed on the page
        wvcc.Property(
            name="category",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvcc.Tokenization.FIELD,
        ),  # the category string
        wvcc.Property(
            name="prose",
            data_type=wvcc.DataType.TEXT,
        ),  # title+excerpt+kicker
        wvcc.Property(
            name="keywords",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # category+tag+topic+namedentities
    ],
)

and filled with around 650K objects along with custom embeddings for the “prose” property. See the bottom of this post for the BM25 parameters if needed.

I am querying this collection with the following example: “attentato trump” and use the same embedding model for the query.

I then build a hybrid query as follows (since I set alpha to 0, I should be doing a pure BM25 keyword search, right?):

from weaviate.classes.query import MetadataQuery

response = wv_artcoll.query.hybrid(
    query="attentato trump",
    query_properties=["keywords^1.3", "prose"],
    vector=query_vector,  # the query embedding
    target_vector=graphql_model_name,  # embedding name
    limit=60,
    alpha=0,  # 0 = keyword (BM25) side only
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

From the above query I fetch 60 results, of which I show the two that matter to me. The LAST of the 60 results is as follows:

        {
            "properties": {
                "kg_article_id": "https://ilmanifesto.it/mema/article/2000-10-06-sri-lanka-attentato-pre-elettorale",
                "articletitle": "Sri lanka,attentato pre elettorale",
                "isoEditionDate": "2000-10-06",
                "author": "Redazione",
                "category": "Mondo",
                "prose": "Sri lanka,attentato pre elettorale; Attentato kamikaze indipendentista; ",
                "keywords": "Redazione, Sri Lanka, Medawachchiya"
            },
            "score": 0.24272824823856354,
            "explain_score": "\nHybrid (Result Set keyword,bm25) Document cbd3bb20-4570-543c-9ac9-972ac559e480: original score 4.068611, normalized score: 0.24272825"
        }

Now if I repeat the very same query with a limit raised from 60 to 200 I get the same object as above with the following different score and related explanation:

        {
            "properties": {
                "kg_article_id": "https://ilmanifesto.it/mema/article/2000-10-06-sri-lanka-attentato-pre-elettorale",
                "articletitle": "Sri lanka,attentato pre elettorale",
                "isoEditionDate": "2000-10-06",
                "author": "Redazione",
                "category": "Mondo",
                "prose": "Sri lanka,attentato pre elettorale; Attentato kamikaze indipendentista; ",
                "keywords": "Redazione, Sri Lanka, Medawachchiya"
            },
            "score": 0.46517714858055115,
            "explain_score": "\nHybrid (Result Set keyword,bm25) Document cbd3bb20-4570-543c-9ac9-972ac559e480: original score 4.068611, normalized score: 0.46517715"
        },

So the BM25 search with the input terms “attentato” and “trump” is matching TWO instances of the string “attentato” in the “prose” property, yielding that score. Right?
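A possible reading of the two normalized scores above (same original score 4.068611, different normalized score), sketched under assumptions: if the hybrid path rescales raw BM25 scores relative to the lowest and highest score among the candidates it fetched (min-max normalization, as relative score fusion does), then the normalized value depends on which other candidates are in the set, and therefore on the limit. Only the raw score 4.068611 comes from this thread; the other scores and the helper names are made up for illustration.

```python
def min_max_normalize(scores):
    """Scale raw scores to [0, 1] relative to the lowest and highest score in the set."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_score(norm_keyword, norm_vector, alpha):
    """Weighted fusion: alpha=0 keeps only the keyword side, alpha=1 only the vector side."""
    return alpha * norm_vector + (1 - alpha) * norm_keyword


target = 4.068611  # raw BM25 score reported for the Sri Lanka article

# Hypothetical candidate sets; a larger limit pulls in lower-scoring candidates.
small_set = [5.2, 4.6, target, 3.7]
large_set = [5.2, 4.6, target, 3.7, 3.1]

small = min_max_normalize(small_set)[small_set.index(target)]
large = min_max_normalize(large_set)[large_set.index(target)]

# Same raw score, different normalized score, because the minimum moved.
print(round(small, 4), round(large, 4))

# With alpha=0 the vector side contributes nothing to the fused score.
print(hybrid_score(small, norm_vector=0.9, alpha=0) == small)
```

Under this assumption the normalized score is only comparable within one result set; the original score in explain_score is the value to compare across queries with different limits.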

What I’m not understanding is that in the limit=200 version of the search I also fetch the following object:

            "properties": {
                "kg_article_id": "http://ilmanifesto.it/mema/article/2024-07-14-attentato-a-trump-spari-durante-un-comizio",
                "articletitle": "Attentato a Trump: spari durante un comizio",
                "isoEditionDate": "2024-07-14",
                "author": "Marina Catucci",
                "category": "Internazionale",
                "prose": "Attentato a Trump: spari durante un comizio; Alle 18.20 ora locale il comizio di Donald Trump a Butler in Pennsylvania era cominciato da poco, quando sono esplosi gli spari. Quando il tycoon si è abbassato dietro il […]; L'ex presidente ferito a un orecchio. L'attentatore, un ventenne, è morto. Biden: «Non c'è posto in America per questo tipo di violenza»",
                "keywords": "Marina Catucci; Thomas Matthew Crooks; Chuck Schumer; Nancy Pelosi; Mike Johnson; Donald Trump; Trump; Noé Chartier; Butler; Truth; Obama; Biden; Rehoboth Beach; Pennsylvania; New Jersey; Bedminster; Milwaukee; America; Associated Press; polizia; Usa Usa; Camera; Senato; Fbi;"
            },
            "score": 0.3737032413482666,
            "explain_score": "\nHybrid (Result Set keyword,bm25) Document cc993586-255d-5084-8bf9-f2c69fc34ec1: original score 3.9525168, normalized score: 0.37370324"
        },

with a 0.373 score, which is lower than that of the “Sri Lanka” object. Yet if I match the two search terms (attentato and trump) against the strings in this object, I find “trump” twice in the “prose” property and twice in the “keywords” property (which I also weight more with the ^1.3 modifier).

So why is this score lower for this object even though apparently it has more matches? Thank you for clarifying

Server Setup Information

Weaviate Server Version: 1.25.4
Deployment Method: docker compose
Multi Node? Number of Running Nodes: 1
Client Language and Version: python 4.6.4
Multitenancy?: no

other info

BM25_PARAMS = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": italian_stopwords,
    "stopwords_removals": None,
}

DudaNogueira · August 7, 2024, 6:33pm

Ciao @rjalex!!

Awesome question! Thanks!

I noticed that articletitle is also tokenized as word, so “attentato” has 3 hits.
Two in prose, and one in articletitle.

The other object has “2.3” hits (1.3 in prose and 1 in articletitle) and an “attentatore” that I don’t believe matches.

But that doesn’t explain the “trump” part

One wild guess: if you change the order of the words, do you get the same results?

Also, if you run a plain bm25 query, do you get the same scoring?

Thanks!

rjalex · August 7, 2024, 7:22pm

As usual, thanks a lot @DudaNogueira. It’s late now, but tomorrow I will check whether a pure BM25 query behaves the same way and report back.

The “prose” and “keywords” properties are indexed case insensitive and word tokenized (Keep only alpha-numeric characters, lowercase them, and split by whitespace.) right?

The same holds true for the query string, right?
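To make the tokenization question concrete, here is a rough approximation of word tokenization as described above (lowercase, keep alphanumeric runs). It is not Weaviate's implementation; in particular, splitting on every non-alphanumeric character rather than only on whitespace is an assumption, and `word_tokenize` is a hypothetical helper name.

```python
import re


def word_tokenize(text):
    """Lowercase the text and return its runs of alphanumeric characters."""
    # [^\W_]+ matches letters and digits (including accented ones) but not "_".
    return re.findall(r"[^\W_]+", text.lower())


query_tokens = word_tokenize("attentato trump")
title_tokens = word_tokenize("Attentato a Trump: spari durante un comizio")
other_tokens = word_tokenize("L'attentatore, un ventenne, è morto.")

# Count exact token matches per query term.
for term in query_tokens:
    print(term, title_tokens.count(term), other_tokens.count(term))
```

Under this approximation “attentatore” is a different token from “attentato”, so it would not count as a match for the query term.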

DudaNogueira · August 7, 2024, 7:25pm

Oops, sorry.

I missed that you passed query_properties, so articletitle will not count towards the score.

I am pretty sure it lowercases both the property content and the query.

rjalex · August 8, 2024, 8:16am

Ok so the first test tells us that the “trump attentato” or “attentato trump” queries give identical results (as expected).

Still no clue as to why this object:

            "properties": {
                "kg_article_id": "http://ilmanifesto.it/mema/article/2024-07-14-attentato-a-trump-spari-durante-un-comizio",
                "articletitle": "Attentato a Trump: spari durante un comizio",
                "isoEditionDate": "2024-07-14",
                "author": "Marina Catucci",
                "category": "Internazionale",
                "prose": "Attentato a Trump: spari durante un comizio; Alle 18.20 ora locale il comizio di Donald Trump a Butler in Pennsylvania era cominciato da poco, quando sono esplosi gli spari. Quando il tycoon si è abbassato dietro il […]; L'ex presidente ferito a un orecchio. L'attentatore, un ventenne, è morto. Biden: «Non c'è posto in America per questo tipo di violenza»",
                "keywords": "Marina Catucci; Thomas Matthew Crooks; Chuck Schumer; Donald Trump; Trump; Mike Johnson; Nancy Pelosi; Noé Chartier; Butler; Truth; Biden; Obama; Rehoboth Beach; Pennsylvania; Bedminster; New Jersey; Milwaukee; America; Associated Press; polizia; Usa Usa; Camera; Senato; Fbi;"
            },
            "score": 0.3737860321998596,
            "explain_score": "\nHybrid (Result Set keyword,bm25) Document cc993586-255d-5084-8bf9-f2c69fc34ec1: original score 3.952838, normalized score: 0.37378603"
        },

has a lower score (0.374). Counting manually for the “attentato” and “trump” keywords, there should be 3 matches in the prose property (one attentato and two trump) and 2 matches in the keywords property (trump twice). As the keywords property is queried with a 1.3 factor, the overall naive score should be 5.6 (2.6 for the keywords and 3 for the prose).

With the same query the following object:

        {
            "properties": {
                "kg_article_id": "https://ilmanifesto.it/mema/article/2000-10-06-sri-lanka-attentato-pre-elettorale",
                "articletitle": "Sri lanka,attentato pre elettorale",
                "isoEditionDate": "2000-10-06",
                "author": "Redazione",
                "category": "Mondo",
                "prose": "Sri lanka,attentato pre elettorale; Attentato kamikaze indipendentista; ",
                "keywords": "Redazione, Sri Lanka, Medawachchiya"
            },
            "score": 0.46509456634521484,
            "explain_score": "\nHybrid (Result Set keyword,bm25) Document cbd3bb20-4570-543c-9ac9-972ac559e480: original score 4.068744, normalized score: 0.46509457"
        },

gets a higher score even though, to my untrained eye, I only see “attentato” matching twice in the prose property, so a naive score of 2.
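For comparison with the naive match counts, here is a minimal sketch of the textbook single-field BM25 term score, using the k1=1.2 and b=0.75 from BM25_PARAMS. The IDF values, the average field length, and the length of the longer field are invented for illustration, and Weaviate's actual BM25F computation may differ in detail.

```python
def bm25_term_score(tf, field_len, avg_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term in one field."""
    # Fields shorter than average get norm < 1, which boosts the score.
    norm = 1 - b + b * field_len / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


AVG_LEN = 40          # hypothetical average prose length in tokens
IDF_ATTENTATO = 3.0   # hypothetical: a rarer term
IDF_TRUMP = 1.5       # hypothetical: a more common term

# Short field: "attentato" twice in 8 tokens, no "trump".
short_doc = bm25_term_score(tf=2, field_len=8, avg_len=AVG_LEN, idf=IDF_ATTENTATO)

# Longer field: "attentato" once and "trump" twice in 56 tokens.
long_doc = (
    bm25_term_score(tf=1, field_len=56, avg_len=AVG_LEN, idf=IDF_ATTENTATO)
    + bm25_term_score(tf=2, field_len=56, avg_len=AVG_LEN, idf=IDF_TRUMP)
)

print(round(short_doc, 3), round(long_doc, 3))
```

With these made-up numbers, two matches in a very short field outscore three matches in a longer one, because the score depends on term rarity and field length and not only on the number of matches.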

Now the second test is even more interesting. I changed the query to a pure BM25 one instead of a hybrid with alpha=0:

response = wv_artcoll.query.bm25(
    query=request.query_text,
    query_properties=["keywords^1.3", "prose"],
    limit=request.result_limit,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

and with this the “attentato trump” query changes the score and explains it in a different way:

        {
            "properties": {
                "kg_article_id": "http://ilmanifesto.it/mema/article/2024-07-14-attentato-a-trump-spari-durante-un-comizio",
                "articletitle": "Attentato a Trump: spari durante un comizio",
                "isoEditionDate": "2024-07-14",
                "author": "Marina Catucci",
                "category": "Internazionale",
                "prose": "Attentato a Trump: spari durante un comizio; Alle 18.20 ora locale il comizio di Donald Trump a Butler in Pennsylvania era cominciato da poco, quando sono esplosi gli spari. Quando il tycoon si è abbassato dietro il […]; L'ex presidente ferito a un orecchio. L'attentatore, un ventenne, è morto. Biden: «Non c'è posto in America per questo tipo di violenza»",
                "keywords": "Marina Catucci; Thomas Matthew Crooks; Chuck Schumer; Donald Trump; Trump; Mike Johnson; Nancy Pelosi; Noé Chartier; Butler; Truth; Biden; Obama; Rehoboth Beach; Pennsylvania; Bedminster; New Jersey; Milwaukee; America; Associated Press; polizia; Usa Usa; Camera; Senato; Fbi;"
            },
            "score": 4.247416973114014,
            "explain_score": ", BM25F_attentato_frequency:1, BM25F_attentato_propLength:48, BM25F_trump_frequency:8, BM25F_trump_propLength:82"
        },

but I cannot understand why the following object scores higher:

           "properties": {
                "kg_article_id": "http://ilmanifesto.it/mema/article/2017-05-21-il-silenzio-usa-su-julian-assange",
                "articletitle": "Il silenzio Usa su Julian Assange",
                "isoEditionDate": "2017-05-21",
                "author": "BenOld",
                "category": "Internazionale",
                "prose": "Il silenzio Usa su Julian Assange; Caduta l’accusa di stupro, per Julian Assange la partita più complicata da giocare è con le autorità americane. Negli Stati Uniti il fondatore di Wikileaks è accusato di attentato alla […]; Archiviata l'accusa di stupro, per il fondatore di Wikileaks la partita più importate da giocare è con l'Amministrazione Trump ",
                "keywords": "BenOld; Hillary Clinton; Julian Assange; Assange; Donald Trump; Trump; Usa; Usa; Casa Bianca; Stati Uniti; Wikileaks; Pentagono; Onu; Fbi;"
            },
            "score": 4.8405938148498535,
            "explain_score": ", BM25F_attentato_frequency:1, BM25F_attentato_propLength:35, BM25F_trump_frequency:7, BM25F_trump_propLength:51"
        },

Visually, I only see “attentato” once in “prose” and “trump” once in “prose” plus twice in “keywords”, the latter with a weight of 1.3.

Why is BM25F_trump_frequency 7?
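To compare these explain_score strings side by side, a small parser helps. This is a sketch that assumes the `BM25F_<term>_<stat>:<value>` layout seen in the outputs above; `parse_explain_score` is a hypothetical helper, not part of the Weaviate client.

```python
def parse_explain_score(explain):
    """Turn ', BM25F_trump_frequency:8, ...' into {'trump': {'frequency': 8.0}, ...}."""
    terms = {}
    for part in explain.split(","):
        part = part.strip()
        if not part.startswith("BM25F_"):
            continue  # skip the empty leading chunk and anything unexpected
        key, _, value = part.rpartition(":")
        # Split on the last underscore so terms containing "_" stay intact.
        term, _, stat = key[len("BM25F_"):].rpartition("_")
        terms.setdefault(term, {})[stat] = float(value)
    return terms


explain = (
    ", BM25F_attentato_frequency:1, BM25F_attentato_propLength:48, "
    "BM25F_trump_frequency:8, BM25F_trump_propLength:82"
)
parsed = parse_explain_score(explain)
for term, stats in parsed.items():
    print(term, stats)
```

Applied to both objects, this makes it easy to see that the higher-scoring one reports smaller propLength values (35 and 51 versus 48 and 82).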

rjalex · August 8, 2024, 11:02am

OK, I have been reading some literature on BM25F and it is shedding some light.

I also asked ChatGPT to explain and it might be an interesting read.

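The short summary is that the score does not only count matches: it also depends on how rare each search term is across the collection, on the length of the matched property relative to the average, and on the property weights. (BM25F itself does not take search term proximity into account.)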

Enjoy

The BM25F algorithm is a popular ranking function used in search engines to evaluate the relevance of documents based on the query terms. It is an extension of the BM25 algorithm, incorporating field weighting to handle documents with structured fields, such as titles, keywords, and body text. Here’s a detailed breakdown of why the search “attentato trump” yields a higher score for the first object than the second, even though the second object seems more directly relevant at first glance:

BM25F Scoring Factors

Term Frequency (TF): How often the query terms appear in the document.
Inverse Document Frequency (IDF): How common or rare the query terms are across all documents.
Field Length Normalization: Adjusts the influence of terms based on the length of the field they appear in.
Field Weights: Different fields (e.g., title, keywords, body) can have different weights assigned, affecting their impact on the final score.
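The four factors above can be combined in a short sketch of one common BM25F formulation: per-field term frequencies are length-normalized, multiplied by the field weight, summed, and then saturated with k1. The document frequencies, field lengths, and average lengths below are hypothetical; only k1, b, the 1.3 keywords weight, and the collection size of about 650K come from this thread. Weaviate's implementation may differ.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency: rarer terms get larger values."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25f_term_score(field_stats, weights, term_idf, k1=1.2, b=0.75):
    """field_stats maps field name -> (tf, field_len, avg_field_len)."""
    pseudo_tf = 0.0
    for field, (tf, length, avg_len) in field_stats.items():
        norm = 1 - b + b * length / avg_len               # field length normalization
        pseudo_tf += weights.get(field, 1.0) * tf / norm  # field weight
    return term_idf * pseudo_tf / (k1 + pseudo_tf)        # saturation


N_DOCS = 650_000                       # approximate collection size from the thread
trump_idf = idf(N_DOCS, 20_000)        # hypothetical document frequency
attentato_idf = idf(N_DOCS, 2_000)     # hypothetical document frequency

weights = {"keywords": 1.3, "prose": 1.0}
trump_stats = {"keywords": (2, 20, 30), "prose": (1, 45, 40)}  # hypothetical lengths
attentato_stats = {"prose": (1, 45, 40)}

score = (
    bm25f_term_score(trump_stats, weights, trump_idf)
    + bm25f_term_score(attentato_stats, weights, attentato_idf)
)
print(round(score, 3))
```

Because the weighted frequency is saturated, raising the keywords weight or adding more occurrences increases the score by less and less, while a shorter field increases the contribution of every match in it.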

Analysis of the Provided Objects

First Object:

Prose:

Il silenzio Usa su Julian Assange; Caduta l’accusa di stupro, per Julian Assange la partita più complicata da giocare è con le autorità americane. Negli Stati Uniti il fondatore di Wikileaks è accusato di attentato alla […]; Archiviata l'accusa di stupro, per il fondatore di Wikileaks la partita più importate da giocare è con l'Amministrazione Trump

Keywords:

BenOld; Hillary Clinton; Julian Assange; Assange; Donald Trump; Trump; Usa; Usa; Casa Bianca; Stati Uniti; Wikileaks; Pentagono; Onu; Fbi;

Second Object:

Prose:

Attentato a Trump: spari durante un comizio; Alle 18.20 ora locale il comizio di Donald Trump a Butler in Pennsylvania era cominciato da poco, quando sono esplosi gli spari. Quando il tycoon si è abbassato dietro il […]; L'ex presidente ferito a un orecchio. L'attentatore, un ventenne, è morto. Biden: «Non c'è posto in America per questo tipo di violenza»

Keywords:

Marina Catucci; Thomas Matthew Crooks; Chuck Schumer; Donald Trump; Trump; Mike Johnson; Nancy Pelosi; Noé Chartier; Butler; Truth; Biden; Obama; Rehoboth Beach; Pennsylvania; Bedminster; New Jersey; Milwaukee; America; Associated Press; polizia; Usa Usa; Camera; Senato; Fbi;

Detailed Explanation

Term Frequency in Keywords:
- Both objects’ keywords contain “Trump” twice (“Donald Trump” and “Trump”), and neither object’s keywords contain “attentato”. Term frequency in the weighted keywords field is therefore about the same for both.
Co-occurrence in Prose:
- In the first object’s prose, “attentato” and “Trump” each appear once, far apart. BM25F does not take term proximity into account; it only counts occurrences per field.
- The second object has a direct mention of an “attentato” involving Trump and contains “Trump” twice in its prose, which seems more relevant contextually. However, term frequency saturates in BM25, so one extra occurrence adds little to the score.
Field Length Normalization:
- The first object has shorter fields (its explain_score reports propLength 35 and 51), making the occurrences of the terms more significant relative to the field length.
- The second object has longer fields (propLength 48 and 82), diluting the impact of the term frequencies.
Field Weights:
- The keywords field is queried with a 1.3 weight, but since both objects match “Trump” in keywords and neither matches “attentato” there, the weight does little to separate them.

Conclusion

The BM25F score is a function of term frequencies, inverse document frequencies, field lengths, and field weights. In this case both objects match the query terms in the same fields; the first object’s shorter prose and keywords fields make each match count for more after length normalization, which is the most likely reason for its higher score. This shows that counting matches alone does not predict the ranking in BM25F.

DudaNogueira · August 12, 2024, 9:22pm

Oh wow!!!

Also learned this hahaha.

Thanks!!!
