BM25 returns high scores for English-heavy documents when the query contains English tokens in mixed Chinese-English chunks (GSE enabled)

Description

I’m encountering unexpected behavior when using BM25 (BM25F) for keyword retrieval in a Chinese + English mixed corpus. I’d like to confirm whether this is expected behavior or a tokenizer/index configuration issue.

Relevant text properties use:

tokenization = gse
indexSearchable = true
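
For context, the property definition in use looks roughly like this in the Python client v4 (collection and property names are illustrative, and this assumes an already-connected `client` and a server image built with GSE support):

```python
from weaviate.classes.config import DataType, Property, Tokenization

# Sketch of the relevant property configuration; "ChildCollection" and
# "content" are placeholder names, not necessarily those in my schema.
client.collections.create(
    name="ChildCollection",
    properties=[
        Property(
            name="content",
            data_type=DataType.TEXT,
            tokenization=Tokenization.GSE,
            index_searchable=True,
        ),
    ],
)
```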

Observed behavior

When the query contains English tokens (e.g. SELECT 文本段落), documents that contain large amounts of English text receive disproportionately high BM25 scores, even when they are semantically irrelevant.

Explain score findings

Using explain_score, I observed that the query appears to be tokenized into both word-level and character-level tokens.

Example (simplified):

BM25F_select_frequency:100
BM25F_s_frequency:39
BM25F_e_frequency:29
BM25F_l_frequency:108
BM25F_c_frequency:10
BM25F_段落_frequency:145
BM25F_本_frequency:48

This suggests:

  • English words are indexed both as whole tokens (select) and individual characters (s, e, l, c, t)
  • Chinese text is also partially indexed at character level

Because single-character tokens occur very frequently, English-heavy documents accumulate large BM25 scores.
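
The inflation is easy to reproduce with a toy BM25 scorer in plain Python — a sketch of the scoring mechanics only, not Weaviate's actual BM25F implementation, and the tokenizer outputs below are simulated:

```python
import math
from collections import Counter

def bm25(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Plain BM25 score of one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Simulated tokenizer output: the query yields word tokens plus the
# individual characters of the English word, as seen in explain_score.
query = ["select", "文本", "段落", "s", "e", "l", "e", "c", "t"]
# Relevant doc: matches each meaningful token once.
relevant = ["select", "文本", "段落", "示例"]
# Irrelevant English-heavy doc, indexed at character level only.
english_heavy = list("steelcellsselectlesstests")
corpus = [relevant, english_heavy]

print(bm25(query, relevant, corpus))
print(bm25(query, english_heavy, corpus))  # higher, purely from char frequency
```

The English-heavy document outscores the relevant one even though it never contains the token `select`, because the single characters `s`, `e`, `l`, `c`, `t` each match many times.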

I suspect the GSE tokenization setting is causing the issue, but GSE is the recommended tokenization for Chinese and Japanese text.

What I already tested

  • Spaces in the query do not affect results significantly
  • Lowercasing English tokens does not solve the issue
  • Behavior appears strongly correlated with English token presence

Expected behavior

I would expect BM25 ranking to rely primarily on meaningful tokens (words or segmented phrases), not character-level matches that dominate scoring.

Questions

  1. Is it expected that gse tokenization produces character-level tokens that participate in BM25 scoring?
  2. Is BM25F designed to use both word-level and character-level tokens simultaneously?
  3. What is the recommended tokenizer for BM25 in mixed Chinese + English corpora?
  • gse
  • trigram
  • word
  • Separate BM25 field without GSE?
  4. Is it a recommended pattern to maintain separate fields for:
  • semantic embedding (GSE)
  • keyword BM25 (non-GSE tokenizer)?
  5. Are there ways to reduce the impact of single-character tokens on BM25 scoring?

Goal

I want keyword retrieval to:

  • Work well for Chinese queries
  • Not over-score documents simply because they contain large amounts of English text
  • Avoid heavy query routing or custom filtering logic

Any guidance or best practices would be greatly appreciated.

Thanks!

Server Setup Information

  • Weaviate Server Version: 1.31.2
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1 node
  • Client Language and Version: 4.15.2
  • Multitenancy?: No

Any additional Information

Code blocks:

        res_keywords = self.db_instance.child_collection.query.bm25(
            query="SELECT 文本段落",  # also tried "SELECT文本段落" to check whether the space caused this issue
            return_properties=["parent_uuid", "content", "file_id", "content_chapter_info", "file_name"],
            return_metadata=MetadataQuery(score=True, explain_score=True),  # distance is not populated for BM25
            limit=max_return_limit,
            filters=file_id_filter,
        )

        for i, obj in enumerate(res_keywords.objects[:10], 1):
            print("=" * 10)
            print(f"[{i}] score={getattr(obj.metadata, 'score', None)}")
            explain = getattr(obj.metadata, "explain_score", None)
            print("EXPLAIN:\n", explain)

Hi @Carloszone !!

This is the expected behavior. GSE produces both word-level and character-level tokens, and BM25F uses all tokens for scoring. Character tokens like ‘s’, ‘e’, ‘l’ appear frequently in English text, causing high scores for English-heavy documents.

So I believe a good approach would be:

1. Use a Different Tokenizer for BM25

For mixed Chinese-English corpora where you want to avoid character-level token scoring, consider using:

  • word tokenization for BM25 (better for mixed languages)
  • Keep GSE only for semantic embedding fields

2. Separate Fields Pattern

Yes, maintaining separate fields is a recommended pattern:

  • One field with GSE tokenization for semantic search/embeddings
  • Another field with word or trigram tokenization for BM25 keyword search
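
A two-field layout along these lines could be declared roughly as follows (property names here are illustrative, not a prescribed schema):

```python
from weaviate.classes.config import DataType, Property, Tokenization

# Sketch: same text stored twice, tokenized differently per purpose.
properties = [
    # Semantic field: GSE segmentation feeds the vectorizer / embeddings.
    Property(
        name="content",
        data_type=DataType.TEXT,
        tokenization=Tokenization.GSE,
    ),
    # Keyword field: word tokenization for BM25, so no single-character
    # tokens from English words participate in scoring.
    Property(
        name="content_bm25",
        data_type=DataType.TEXT,
        tokenization=Tokenization.WORD,
        index_searchable=True,
    ),
]
```

At query time, BM25 can then be restricted to the keyword field via `query_properties=["content_bm25"]`.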

Let me know if this helps!

Thanks!

Hi, @DudaNogueira

Thank you for your reply.

I have redesigned my collection structure to solve the issue.

In short, I added two additional attributes: content_zh_bm25 (which stores only Chinese content with GSE tokenization) and content_en_bm25 (which stores only English content with LOWERCASE tokenization).

Now the performance of the new structure is very good. It only costs a little more storage space but provides more accurate keyword search results.
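
For anyone landing here later, a minimal sketch of the ingest-time split that this structure implies — the function name and the character ranges are my assumptions, not exact code from my pipeline:

```python
import re

def split_zh_en(text: str) -> tuple[str, str]:
    """Split mixed text into Chinese-only and English-only strings,
    destined for content_zh_bm25 (GSE) and content_en_bm25 (LOWERCASE).
    The CJK range below covers common Chinese characters only."""
    zh = "".join(re.findall(r"[\u4e00-\u9fff]+", text))
    en = " ".join(re.findall(r"[A-Za-z0-9]+", text))
    return zh, en

zh, en = split_zh_en("SELECT 文本段落 from the report")
print(zh)  # 文本段落
print(en)  # SELECT from the report
```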


Awesome! Glad it helped!