Keyword search always results in score 0.0

kun432 · March 12, 2024, 1:53am

This might be a newbie question.

I am trying keyword search, not semantic search. The keyword search itself seems to work, but the scores are always 0.0. Is this the correct behavior, or am I missing something?

example:

from datasets import load_dataset

dataset = load_dataset('jeopardy', split='train').shuffle(seed=42)
sample_dataset = dataset.select(range(500))
sample_data = [{col: row[col] for col in ["category", "question", "answer"]} for row in sample_dataset]

import weaviate
import weaviate.classes as wvc
import os
from google.colab import userdata

client = weaviate.connect_to_wcs(
    cluster_url=userdata.get('WEAVIATE_CLUSTER_URL'),
    auth_credentials=weaviate.auth.AuthApiKey(userdata.get('WEAVIATE_API_KEY')),
)

questions = client.collections.create(
    name="Question",
)

# Build the objects to insert; each one carries the three text properties.
question_objs = [
    {"answer": d["answer"], "question": d["question"], "category": d["category"]}
    for d in sample_data
]

questions.data.insert_many(question_objs)

response = questions.query.bm25(
    query="america",
    query_properties=["question","answer","category"],
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    limit=10
)

for r in response.objects:
    print(r.metadata.score)
    print(r.metadata.explain_score)
    print(r.properties)
    print()

result:

0.0
, BM25F_america_frequency:1, BM25F_america_propLength:2
{'answer': 'Jean Lafitte', 'question': "'A Natl. Historical Park & Preserve named for this pirate includes the site of the Battle of New Orleans'", 'category': 'HISTORIC AMERICA'}

0.0
, BM25F_america_frequency:1, BM25F_america_propLength:3
{'answer': 'Rockford', 'question': '\'The "files" on this large Illinois city include its historic leadership in screw production\'', 'category': 'AMERICA THE BEAUTIFUL'}

0.0
, BM25F_america_propLength:3, BM25F_america_frequency:1
{'answer': '"God Bless America"', 'question': '\'This Irving Berlin song has been called "The Nation\'s Unofficial Second National Anthem"\'', 'category': 'IRVING BERLIN'}

0.0
, BM25F_america_frequency:3, BM25F_america_propLength:17
{'answer': 'law', 'question': '\'"America!  America!  God mend thine every flaw/ Confirm thy soul in self-control, thy liberty in" this\'', 'category': 'AMERICA THE BEAUTIFUL'}

0.0
, BM25F_america_frequency:1, BM25F_america_propLength:11
{'answer': 'Milwaukee', 'question': '\'This city on Lake Michigan is "The Beer Capital of America"\'', 'category': 'AMERICAN CITIES'}

0.0
, BM25F_america_frequency:1, BM25F_america_propLength:16
{'answer': 'CNN Headline News', 'question': "'Every half hour since 1982, this CNN network has updated America on news, sports, business & entertainment'", 'category': 'CNN'}

Thank you.

DudaNogueira · March 12, 2024, 1:40pm

Hi @kun432 !!

Welcome to our community

Thank you very much! I believe you have found a regression bug.

I could not reproduce this in 1.23.12, only in 1.24.0 and 1.24.1.

What is the version you are using?

Thanks!

kun432 · March 12, 2024, 5:10pm

Thank you!

I tested with server v1.24.1 (both WCS and Docker Compose) and Python client v4.5.1.
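
For anyone comparing their own setup against the versions reported in this thread, here is a minimal sketch of a version check. The helper names `parse_version` and `in_reported_range` are hypothetical, and the default bounds are assumptions taken only from reports in this thread (reproduced on 1.24.0 and 1.24.1, confirmed fixed on 1.24.5), not from official release notes. Reading the versions via `client.get_meta()["version"]` and `weaviate.__version__` is shown only in comments, so the block runs without a server.

```python
import re

def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.24.1' (or 'v1.24.1') into (1, 24, 1) so versions compare as tuples."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])

def in_reported_range(version: str,
                      first_bad: str = "1.24.0",
                      first_fixed: str = "1.24.5") -> bool:
    """True if `version` falls in [first_bad, first_fixed).

    The default bounds are assumptions based on this thread only.
    """
    return parse_version(first_bad) <= parse_version(version) < parse_version(first_fixed)

# With a live connection, the versions could be read roughly like this
# (assuming the v4 Python client):
#   server_version = client.get_meta()["version"]
#   client_version = weaviate.__version__

for v in ("1.23.12", "1.24.1", "1.24.5"):
    print(v, in_reported_range(v))
```

If the check returns `True` and BM25 scores come back as 0.0, upgrading the server is the first thing to try.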

DudaNogueira · March 20, 2024, 6:31pm

Hi @kun432 !

Thank you very much for your report and for opening an issue about it.

Our team has already fixed this, and the PR has been merged.

Thanks!

kun432 · March 22, 2024, 2:17pm

I'm using WCS on my end, and the cluster version is still 1.24.4, so I confirmed that it has been fixed using embedded v1.24.5. Thank you!

import weaviate
import weaviate.classes as wvc
import os
from datasets import load_dataset

dataset = load_dataset('jeopardy', split='train').shuffle(seed=42)
sample_dataset = dataset.select(range(500))
sample_data = [{col: row[col] for col in ["category", "question", "answer"]} for row in sample_dataset]

client = weaviate.connect_to_embedded(
    version="1.24.5",
)

questions = client.collections.create(
    name="Question",
)

# Build the objects to insert; each one carries the three text properties.
question_objs = [
    {"answer": d["answer"], "question": d["question"], "category": d["category"]}
    for d in sample_data
]

questions.data.insert_many(question_objs)

response = questions.query.bm25(
    query="america",
    query_properties=["question","answer","category"],
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    limit=10
)

for r in response.objects:
    print(r.metadata.score)
    print(r.metadata.explain_score)
    print(r.properties)
    print()

result:

2.748971939086914
, BM25F_america_frequency:1, BM25F_america_propLength:2
{'answer': 'Jean Lafitte', 'question': "'A Natl. Historical Park & Preserve named for this pirate includes the site of the Battle of New Orleans'", 'category': 'HISTORIC AMERICA'}

2.52490496635437
, BM25F_america_propLength:3, BM25F_america_frequency:1
{'answer': 'Rockford', 'question': '\'The "files" on this large Illinois city include its historic leadership in screw production\'', 'category': 'AMERICA THE BEAUTIFUL'}

2.52490496635437
, BM25F_america_frequency:1, BM25F_america_propLength:3
{'answer': '"God Bless America"', 'question': '\'This Irving Berlin song has been called "The Nation\'s Unofficial Second National Anthem"\'', 'category': 'IRVING BERLIN'}

2.293008327484131
, BM25F_america_frequency:3, BM25F_america_propLength:17
{'answer': 'law', 'question': '\'"America!  America!  God mend thine every flaw/ Confirm thy soul in self-control, thy liberty in" this\'', 'category': 'AMERICA THE BEAUTIFUL'}

1.5283229351043701
, BM25F_america_frequency:1, BM25F_america_propLength:11
{'answer': 'Milwaukee', 'question': '\'This city on Lake Michigan is "The Beer Capital of America"\'', 'category': 'AMERICAN CITIES'}

1.225906491279602
, BM25F_america_frequency:1, BM25F_america_propLength:16
{'answer': 'CNN Headline News', 'question': "'Every half hour since 1982, this CNN network has updated America on news, sports, business & entertainment'", 'category': 'CNN'}
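
To see why nonzero scores are expected from the `frequency` and `propLength` values in `explain_score`, here is a rough sketch of the textbook BM25 term score. It is not Weaviate's actual BM25F implementation. The parameters `k1 = 1.2` and `b = 0.75` are the commonly used defaults and are assumed to match the server's settings. `AVG_LEN` and `IDF` are hand-fitted to the output above and are assumptions, not values read from Weaviate.

```python
def bm25_term_score(freq: int, prop_len: int, avg_len: float, idf: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 score for a single query term in a single document."""
    # Longer-than-average properties are penalised, shorter ones are boosted.
    length_norm = 1 - b + b * prop_len / avg_len
    return idf * freq * (k1 + 1) / (freq + k1 * length_norm)

# Assumed constants, back-fitted to the scores printed above.
AVG_LEN = 6.42   # hypothetical average property length
IDF = 1.975      # hypothetical inverse document frequency of "america"

# (frequency, propLength) pairs copied from the explain_score lines above.
pairs = [(1, 2), (1, 3), (3, 17), (1, 11), (1, 16)]
scores = [bm25_term_score(f, n, AVG_LEN, IDF) for f, n in pairs]

for (f, n), s in zip(pairs, scores):
    print(f"frequency={f} propLength={n} score={s:.3f}")
```

Under these assumptions the computed values track the fixed output closely: every match scores above zero, and with equal frequency a shorter property scores higher. That is why a constant 0.0 pointed to a bug rather than to the query.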
