Description
I have a collection which is declared as follows:
wv_client.collections.create(
    name=wv_artcollname,
    description="A collection of Articles with only a custom list of stopwords",
    vectorizer_config=None,
    inverted_index_config=wvcc.Configure.inverted_index(**BM25_PARAMS),
    properties=[
        wvcc.Property(
            name="kg_article_id",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvcc.Tokenization.FIELD,
        ),  # {GRAPH_BASE}/article/{isoEditionDate}-{slug}
        wvcc.Property(
            name="articletitle",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # as displayed on the page
        wvcc.Property(
            name="isoEditionDate",
            data_type=wvcc.DataType.DATE,  # DATE for RFC3339 ISO8601 date
            skip_vectorization=True,
        ),  # alternative, declare it as TEXT and tokenization=wvcc.Tokenization.FIELD,
        wvcc.Property(
            name="author",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # the author string as displayed on the page
        wvcc.Property(
            name="category",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvcc.Tokenization.FIELD,
        ),  # the category string
        wvcc.Property(
            name="prose",
            data_type=wvcc.DataType.TEXT,
        ),  # title+excerpt+kicker
        wvcc.Property(
            name="keywords",
            data_type=wvcc.DataType.TEXT,
            skip_vectorization=True,
        ),  # category+tag+topic+namedentities
    ],
)
The collection is filled with around 650K objects, along with custom embeddings for the “prose” property. See the bottom of the post for the BM25 parameters if needed.
I query this collection with the example “attentato trump”, using the same embedding model for the query vector.
I then build a hybrid query as follows (since I set alpha to 0, I should be doing a pure BM25 keyword search, right?):
response = wv_artcoll.query.hybrid(
    query="attentato trump",
    query_properties=[
        "keywords^1.3",
        "prose",
    ],
    vector=query_vector,  # this has the embedding
    target_vector=graphql_model_name,  # embedding name
    limit=60,
    alpha=0,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)
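On the alpha question, here is a minimal sketch of how I understand the weighting, assuming the fused score is a weighted sum of the normalized keyword and vector scores. The function name and the sample scores are hypothetical and not taken from Weaviate's code:

```python
def fused_score(keyword_norm: float, vector_norm: float, alpha: float) -> float:
    """Weighted sum: alpha weights the vector side, (1 - alpha) the keyword side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_norm + (1.0 - alpha) * keyword_norm


# Hypothetical normalized scores for a single document.
keyword_norm, vector_norm = 0.24, 0.91

pure_keyword = fused_score(keyword_norm, vector_norm, alpha=0.0)
pure_vector = fused_score(keyword_norm, vector_norm, alpha=1.0)
balanced = fused_score(keyword_norm, vector_norm, alpha=0.5)
print(pure_keyword, pure_vector, balanced)
```

Under this assumption, alpha=0 leaves only the keyword contribution, so the vector passed to the query should not affect the ranking.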
From the above query I fetch 60 results, of which I show the ones that matter to me. The LAST of the 60 results is as follows:
{
    "properties": {
        "kg_article_id": "https://ilmanifesto.it/mema/article/2000-10-06-sri-lanka-attentato-pre-elettorale",
        "articletitle": "Sri lanka,attentato pre elettorale",
        "isoEditionDate": "2000-10-06",
        "author": "Redazione",
        "category": "Mondo",
        "prose": "Sri lanka,attentato pre elettorale; Attentato kamikaze indipendentista; ",
        "keywords": "Redazione, Sri Lanka, Medawachchiya"
    },
    "score": 0.24272824823856354,
    "explain_score": "\nHybrid (Result Set keyword,bm25) Document cbd3bb20-4570-543c-9ac9-972ac559e480: original score 4.068611, normalized score: 0.24272825"
}
Now, if I repeat the very same query with the limit raised from 60 to 200, I get the same object as above, but with a different score and explanation:
{
    "properties": {
        "kg_article_id": "https://ilmanifesto.it/mema/article/2000-10-06-sri-lanka-attentato-pre-elettorale",
        "articletitle": "Sri lanka,attentato pre elettorale",
        "isoEditionDate": "2000-10-06",
        "author": "Redazione",
        "category": "Mondo",
        "prose": "Sri lanka,attentato pre elettorale; Attentato kamikaze indipendentista; ",
        "keywords": "Redazione, Sri Lanka, Medawachchiya"
    },
    "score": 0.46517714858055115,
    "explain_score": "\nHybrid (Result Set keyword,bm25) Document cbd3bb20-4570-543c-9ac9-972ac559e480: original score 4.068611, normalized score: 0.46517715"
},
So the BM25 search with the input terms “attentato” and “trump”, as I performed it, is matching TWO instances of the string “attentato” in the “prose” property, yielding that score. Right?
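One possible reading of the changing normalized score, sketched below, assumes the hybrid fusion applies min-max normalization over whatever candidate set the keyword search returned. The normalization function and both candidate lists are hypothetical; only the original scores 4.068611 and 3.9525168 come from the results in this post:

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] relative to the min and max of this set."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


target = 4.068611  # original BM25 score of the "Sri Lanka" object

# Hypothetical candidate sets: the larger limit pulls in lower-scoring objects.
small_set = [4.75, 4.30, target, 3.85]
large_set = [4.75, 4.30, target, 3.9525168, 3.48]

norm_small = min_max_normalize(small_set)[small_set.index(target)]
norm_large = min_max_normalize(large_set)[large_set.index(target)]
print(norm_small, norm_large)
```

If this assumption holds, the original score stays the same while the normalized score rises, simply because a larger limit lowers the minimum of the set being normalized.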
What I’m not understanding is that in the limit=200 version of the search I also fetch the following object:
{
    "properties": {
        "kg_article_id": "http://ilmanifesto.it/mema/article/2024-07-14-attentato-a-trump-spari-durante-un-comizio",
        "articletitle": "Attentato a Trump: spari durante un comizio",
        "isoEditionDate": "2024-07-14",
        "author": "Marina Catucci",
        "category": "Internazionale",
        "prose": "Attentato a Trump: spari durante un comizio; Alle 18.20 ora locale il comizio di Donald Trump a Butler in Pennsylvania era cominciato da poco, quando sono esplosi gli spari. Quando il tycoon si è abbassato dietro il […]; L'ex presidente ferito a un orecchio. L'attentatore, un ventenne, è morto. Biden: «Non c'è posto in America per questo tipo di violenza»",
        "keywords": "Marina Catucci; Thomas Matthew Crooks; Chuck Schumer; Nancy Pelosi; Mike Johnson; Donald Trump; Trump; Noé Chartier; Butler; Truth; Obama; Biden; Rehoboth Beach; Pennsylvania; New Jersey; Bedminster; Milwaukee; America; Associated Press; polizia; Usa Usa; Camera; Senato; Fbi;"
    },
    "score": 0.3737032413482666,
    "explain_score": "\nHybrid (Result Set keyword,bm25) Document cc993586-255d-5084-8bf9-f2c69fc34ec1: original score 3.9525168, normalized score: 0.37370324"
},
This object has a score of 0.373, which is lower than that of the “Sri Lanka” object. But if I match the two search terms (attentato and trump) against the strings in the object, I find “trump” twice in the “prose” property and also twice in the “keywords” property (which I also weight more with the ^1.3 modifier).
So why is the score lower for this object, even though it apparently has more matches? Thank you for clarifying.
Server Setup Information
- Weaviate Server Version: 1.25.4
- Deployment Method: docker compose
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: Python (client v4.6.4)
- Multitenancy?: no
Other info
BM25_PARAMS = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": italian_stopwords,
    "stopwords_removals": None,
}
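For reference, here is a rough sketch of how bm25_k1 and bm25_b enter the textbook BM25 term score. It shows one way a short property can outscore a long one at the same term frequency. The IDF value and the lengths are hypothetical, and Weaviate's actual implementation (for example how it combines several properties and applies the ^1.3 boost) may differ:

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 contribution of a single query term to one document."""
    # Length normalization: longer-than-average documents are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


idf = 2.0        # hypothetical inverse document frequency of the term
avg_len = 40.0   # hypothetical average property length in tokens

# Same term frequency, different property lengths (both hypothetical).
short_doc = bm25_term_score(tf=2, doc_len=8, avg_doc_len=avg_len, idf=idf)
long_doc = bm25_term_score(tf=2, doc_len=60, avg_doc_len=avg_len, idf=idf)
print(short_doc, long_doc)
```

With b=0.75, two matches in a very short “prose” value count for more than two matches in a long one, which might be part of why the short “Sri Lanka” object ranks higher.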