Pure BM25 search is wrong

Description

Running this code:
import weaviate.classes as wvc  # assumed import; wvc is the usual alias for weaviate.classes

# artigo is the collection; query, vector, alpha and return_properties are defined earlier
print("query", query)
print("vector", vector)
print("alpha", alpha)
resp = artigo.query.hybrid(
    query=query,
    alpha=alpha,
    vector=vector,
    return_properties=return_properties,
    query_properties=['article_number^10', 'law_title^5', 'law_number^5'],
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=75,
)
# Print the three searched properties of each hit on one line
for i, obj in enumerate(resp.objects):
    print(f"{i}: ", end='')
    for prop in ['article_number', 'law_title', 'law_number']:
        print(f'{prop}: {obj.properties[prop]}', end=', ')
    print('\n')

Getting this output:

query CÓDIGO PENAL 152
vector None
alpha 0
0: article_number: 152, law_title: CÓDIGO DO DIREITO DE AUTOR E DOS DIREITOS CONEXOS, law_number: Decreto-Lei n.º 63/85 , 

1: article_number: 152, law_title: Orçamento do Estado para 2009, law_number: Lei n.º 64-A/2008 , 

2: article_number: 152, law_title: Regime jurídico das instituições de ensino superior, law_number: Lei n.º 62/2007 , 

3: article_number: 152, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

All the query args are printed above.

Let’s compare #1 with #3.

Both have article_number 152, so that field is a tie.

But #3 also has CÓDIGO PENAL in law_title, while #1 has no other match, so #3 should rank above #1.
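To make the expectation concrete, here is a minimal sketch of boosted per-property BM25 scoring over a toy corpus built from the rows in this thread. It is not Weaviate's actual BM25F implementation: the whitespace tokenizer, the K1 and B values, the four-document corpus, and summing the boosted per-property scores are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus copied from rows in this thread (not the real index).
DOCS = [
    {"article_number": "152",
     "law_title": "CÓDIGO DO DIREITO DE AUTOR E DOS DIREITOS CONEXOS",
     "law_number": "Decreto-Lei n.º 63/85"},
    {"article_number": "152",
     "law_title": "Orçamento do Estado para 2009",
     "law_number": "Lei n.º 64-A/2008"},
    {"article_number": "152",
     "law_title": "CÓDIGO PENAL",
     "law_number": "Lei n.º 59/2007"},
    {"article_number": "61",
     "law_title": "CÓDIGO PENAL",
     "law_number": "Lei n.º 59/2007"},
]
# Same boosts as query_properties in the report.
BOOSTS = {"article_number": 10.0, "law_title": 5.0, "law_number": 5.0}
K1, B = 1.2, 0.75  # common BM25 defaults, assumed here
QUERY = "CÓDIGO PENAL 152"


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_property(term, prop, doc):
    """Classic BM25 score of one term against one property of one document."""
    tokens = tokenize(doc[prop])
    tf = Counter(tokens)[term]
    if tf == 0:
        return 0.0
    n = len(DOCS)
    df = sum(term in tokenize(d[prop]) for d in DOCS)
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
    avg_len = sum(len(tokenize(d[prop])) for d in DOCS) / n
    norm = tf + K1 * (1 - B + B * len(tokens) / avg_len)
    return idf * tf * (K1 + 1) / norm


def score(query, doc):
    """Sum the boosted per-property scores over all query terms."""
    return sum(
        boost * bm25_property(term, prop, doc)
        for term in tokenize(query)
        for prop, boost in BOOSTS.items()
    )


ranked = sorted(DOCS, key=lambda d: score(QUERY, d), reverse=True)
for d in ranked:
    print(f"{score(QUERY, d):.3f}  art. {d['article_number']}  {d['law_title']}")
```

Under these assumptions the CÓDIGO PENAL article 152 row collects the article_number match plus both law_title terms, so it outranks rows that only match 152.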

If I remove the boosts and just do:

query_properties=['article_number', 'law_title', 'law_number'],

Then the correct article is not even in the top 10.

Top 10 here:

0: article_number: 61, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

1: article_number: 66, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

2: article_number: 32, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

3: article_number: 33, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

4: article_number: 8, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

5: article_number: 51, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

6: article_number: 55, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

7: article_number: 82, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

8: article_number: 11, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

9: article_number: 30, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

10: article_number: 89, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

Server Setup Information

  • Weaviate Server Version: Using weaviate cloud. Database version: 1.33.4
  • Client Language and Version: Python 4.15.0

Hi @jpiabrantes!

For the hybrid, can you also return the explain_score?

Also, can you try artigo.query.bm25 instead of hybrid and also post the results?

Thanks!

It seems that bm25 is not equivalent to hybrid with alpha equal to 0, yet both give wrong results.

Hybrid Search With Explain Score and Boosted Fields:

0: explain_score:
Hybrid (Result Set keyword,bm25) Document fc7f6afc-3852-58b8-b187-e146d973aa87: original score 28.79685, normalized score: 1, article_number: 152, law_title: CÓDIGO DO DIREITO DE AUTOR E DOS DIREITOS CONEXOS, law_number: Decreto-Lei n.º 63/85 ,

1: explain_score:
Hybrid (Result Set keyword,bm25) Document fb043967-ea5f-5aa2-9372-7c4590bcf4be: original score 28.79685, normalized score: 1, article_number: 152, law_title: Orçamento do Estado para 2009, law_number: Lei n.º 64-A/2008 ,

2: explain_score:
Hybrid (Result Set keyword,bm25) Document f92d79db-850b-558e-b1e2-67eb6cc36332: original score 28.79685, normalized score: 1, article_number: 152, law_title: Regime jurídico das instituições de ensino superior, law_number: Lei n.º 62/2007 ,

3: explain_score:
Hybrid (Result Set keyword,bm25) Document f8dd5103-f921-505f-a5a8-e0f95cff6039: original score 28.79685, normalized score: 1, article_number: 152, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 ,
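One thing the explain_score lines above make visible is that all four hits carry the same original score. A small sketch that parses those strings and groups document IDs by score, assuming the explain_score text always has the format pasted in this thread (the regex and helper names are hypothetical, and the format may differ between versions):

```python
import re
from collections import defaultdict

# explain_score prefixes copied from the boosted hybrid results above.
EXPLAIN = [
    "Hybrid (Result Set keyword,bm25) Document fc7f6afc-3852-58b8-b187-e146d973aa87: original score 28.79685, normalized score: 1",
    "Hybrid (Result Set keyword,bm25) Document fb043967-ea5f-5aa2-9372-7c4590bcf4be: original score 28.79685, normalized score: 1",
    "Hybrid (Result Set keyword,bm25) Document f92d79db-850b-558e-b1e2-67eb6cc36332: original score 28.79685, normalized score: 1",
    "Hybrid (Result Set keyword,bm25) Document f8dd5103-f921-505f-a5a8-e0f95cff6039: original score 28.79685, normalized score: 1",
]

# Assumed format: "Document <uuid>: original score <x>, normalized score: <y>"
PATTERN = re.compile(
    r"Document ([0-9a-f-]+): original score ([\d.]+), normalized score: ([\d.]+)"
)


def parse_explain(explain):
    """Return (document id, original score, normalized score) from one string."""
    match = PATTERN.search(explain)
    if match is None:
        raise ValueError(f"unrecognized explain_score: {explain!r}")
    return match.group(1), float(match.group(2)), float(match.group(3))


# Group document ids by original score to spot ties.
by_score = defaultdict(list)
for line in EXPLAIN:
    doc_id, original, _normalized = parse_explain(line)
    by_score[original].append(doc_id)

for original, doc_ids in by_score.items():
    print(f"original score {original}: {len(doc_ids)} document(s)")
```

If every hit lands in a single group, as here, the law_title match contributed nothing to the score, which points at the article_number term being the only one that was scored.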

No Boosted Fields:

0: explain_score: 
Hybrid (Result Set keyword,bm25) Document ffbc7b57-59f8-541c-9f66-2fb3622544cf: original score 4.99375, normalized score: 1, article_number: 61, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

1: explain_score: 
Hybrid (Result Set keyword,bm25) Document fd84d4cf-1ef7-5b3c-a50b-6a9a9739536f: original score 4.99375, normalized score: 1, article_number: 66, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

2: explain_score: 
Hybrid (Result Set keyword,bm25) Document fc0cfae3-9ac6-5c0c-8fe4-5ab52be659a0: original score 4.99375, normalized score: 1, article_number: 32, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

3: explain_score: 
Hybrid (Result Set keyword,bm25) Document f53be831-2ce5-5b67-ab41-d16c2384fc3f: original score 4.99375, normalized score: 1, article_number: 33, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

BM25 Search With Boosted Fields:

0: explain_score: , BM25F_152_frequency:1, BM25F_152_propLength:1, article_number: 152, law_title: Orçamento do Estado para 2009, law_number: Lei n.º 64-A/2008 , 

1: explain_score: , BM25F_152_frequency:1, BM25F_152_propLength:1, article_number: 152, law_title: CÓDIGO DAS SOCIEDADES COMERCIAIS, law_number: Decreto-Lei n.º 262/86 , 

2: explain_score: , BM25F_152_frequency:1, BM25F_152_propLength:1, article_number: 152, law_title: Orçamento do Estado para 2009, law_number: Lei n.º 64-A/2008 , 

3: explain_score: , BM25F_152_frequency:1, BM25F_152_propLength:1, article_number: 152, law_title: Orçamento do Estado para 2009, law_number: Lei n.º 64-A/2008 , 

BM25 Search No Boosted Fields:

0: explain_score: , BM25F_código_frequency:1, BM25F_código_propLength:2, BM25F_penal_frequency:1, BM25F_penal_propLength:2, article_number: 8, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

1: explain_score: , BM25F_código_frequency:1, BM25F_código_propLength:2, BM25F_penal_frequency:1, BM25F_penal_propLength:2, article_number: 3, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

2: explain_score: , BM25F_código_frequency:1, BM25F_código_propLength:2, BM25F_penal_frequency:1, BM25F_penal_propLength:2, article_number: 7, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

3: explain_score: , BM25F_código_frequency:1, BM25F_código_propLength:2, BM25F_penal_frequency:1, BM25F_penal_propLength:2, article_number: 4, law_title: CÓDIGO PENAL, law_number: Lei n.º 59/2007 , 

Hey @jpiabrantes ,

Thanks for doing the extra tests!

To help us debug the issue, can you send the WCS cluster ID?

You can send it as a DM here or on the community Slack.

Thanks.

André Mourão