Description
I’m encountering unexpected behavior when using BM25 (BM25F) for keyword retrieval in a Chinese + English mixed corpus. I’d like to confirm whether this is expected behavior or a tokenizer/index configuration issue.
Relevant text properties use:
tokenization = gse
indexSearchable = true
Observed behavior
When the query contains English tokens (e.g. SELECT 文本段落), documents that contain large amounts of English text receive disproportionately high BM25 scores, even when they are semantically irrelevant.
Explain score findings
Using explain_score, I observed that the query appears to be tokenized into both word-level and character-level tokens.
Example (simplified):
```
BM25F_select_frequency:100
BM25F_s_frequency:39
BM25F_e_frequency:29
BM25F_l_frequency:108
BM25F_c_frequency:10
BM25F_段落_frequency:145
BM25F_本_frequency:48
```
This suggests:
- English words are indexed both as whole tokens (select) and as individual characters (s, e, l, c, t)
- Chinese text is also partially indexed at the character level
Because single-character tokens occur very frequently, English-heavy documents accumulate large BM25 scores.
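To illustrate the mechanism I suspect, here is a toy, self-contained BM25 sketch (plain Python with made-up corpus statistics, not Weaviate's actual BM25F implementation). In a mostly-Chinese corpus, single English letters appear in relatively few documents, so their IDF is not negligible; an English-heavy document then matches many letter tokens at once, and the per-token contributions add up past a relevant document that matches only the meaningful word tokens:

```python
import math

# Toy BM25 sketch (not Weaviate's implementation). All corpus statistics
# below (N, df, dl) are assumptions chosen to illustrate the accumulation effect.
K1, B, N, AVGDL = 1.2, 0.75, 1000, 400

def bm25_term(tf, df, dl):
    """Contribution of one matched query token to a document's score."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * dl / AVGDL))

# Query "SELECT 文本段落" tokenized (as explain_score suggests) into both
# word-level and character-level tokens.

# Irrelevant English-heavy doc: matches only the five letter tokens, but with
# huge term frequencies, and each letter has moderate IDF (df=300 of 1000).
letters = {"s": 39, "e": 29, "l": 108, "c": 10, "t": 60}
score_english = sum(bm25_term(tf, df=300, dl=800) for tf in letters.values())

# Relevant Chinese doc: matches the two meaningful word tokens a few times.
words = {"文本": 3, "段落": 4}
score_chinese = sum(bm25_term(tf, df=50, dl=300) for tf in words.values())

print(f"English-heavy doc: {score_english:.1f}")   # many small contributions
print(f"Relevant doc:      {score_chinese:.1f}")   # two large contributions
print("English-heavy doc outranks:", score_english > score_chinese)
```

Under these assumed statistics the irrelevant document wins purely by accumulating letter-token contributions, which matches what I observe.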
I suspect the GSE tokenization setting causes the issue, but GSE is the recommended tokenization for Chinese or Japanese text.
What I already tested
- Spaces in the query do not affect results significantly
- Lowercasing English tokens does not solve the issue
- Behavior appears strongly correlated with English token presence
Expected behavior
I would expect BM25 ranking to rely primarily on meaningful tokens (words or segmented phrases), not character-level matches that dominate scoring.
Questions
- Is it expected that gse tokenization produces character-level tokens that participate in BM25 scoring?
- Is BM25F designed to use both word-level and character-level tokens simultaneously?
- What is the recommended tokenizer for BM25 in mixed Chinese + English corpora? gse, trigram, word, or a separate BM25 field without GSE?
- Is it a recommended pattern to maintain separate fields for:
- semantic embedding (GSE)
- keyword BM25 (non-GSE tokenizer)?
- Are there ways to reduce the impact of single-character tokens on BM25 scoring?
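For context on the separate-fields question, here is a sketch of the pattern I have in mind using the v4 Python client. The collection and property names are hypothetical, and I have not verified that this resolves the scoring issue:

```python
import weaviate
import weaviate.classes.config as wc

client = weaviate.connect_to_local()

# Store the same text under two properties with different tokenizers:
# - "content" keeps GSE for Chinese-aware matching
# - "content_word" uses word tokenization and is the only property BM25
#   searches, so character-level GSE tokens never enter keyword scoring.
client.collections.create(
    "Paragraphs",  # hypothetical collection name
    properties=[
        wc.Property(name="content", data_type=wc.DataType.TEXT,
                    tokenization=wc.Tokenization.GSE),
        wc.Property(name="content_word", data_type=wc.DataType.TEXT,
                    tokenization=wc.Tokenization.WORD),
    ],
)

# Restrict BM25 to the word-tokenized copy at query time.
paragraphs = client.collections.get("Paragraphs")
res = paragraphs.query.bm25(
    query="SELECT 文本段落",
    query_properties=["content_word"],
    limit=10,
)
```

One concern with this sketch: word tokenization may treat a contiguous Chinese run as a single token, hurting Chinese recall, so trigram might be the better choice for the keyword field. I would appreciate confirmation either way.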
Goal
I want keyword retrieval to:
- Work well for Chinese queries
- Not over-score documents simply because they contain large amounts of English text
- Avoid heavy query routing or custom filtering logic
Any guidance or best practices would be greatly appreciated.
Thanks!
Server Setup Information
- Weaviate Server Version: 1.31.2
- Deployment Method: Docker
- Multi Node? Number of Running Nodes: 1 node
- Client Language and Version: Python client 4.15.2
- Multitenancy?: No
Any additional Information
Code blocks:
```python
res_keywords = self.db_instance.child_collection.query.bm25(
    query="SELECT 文本段落",  # also tried "SELECT文本段落" to check whether the space caused the issue
    return_properties=["parent_uuid", "content", "file_id", "content_chapter_info", "file_name"],
    return_metadata=MetadataQuery(distance=True, score=True, explain_score=True),
    limit=max_return_limit,
    filters=file_id_filter,
)

for i, obj in enumerate(res_keywords.objects[:10], 1):
    print("=" * 10)
    print(f"[{i}] score={getattr(obj.metadata, 'score', None)} distance={getattr(obj.metadata, 'distance', None)}")
    explain = getattr(obj.metadata, "explain_score", None) or getattr(obj.metadata, "explainScore", None)
    print("EXPLAIN:\n", explain)
```