BM25 returns high scores for English-heavy documents when the query contains English tokens in mixed Chinese-English chunks (GSE enabled)

Description

I’m encountering unexpected behavior when using BM25 (BM25F) for keyword retrieval in a Chinese + English mixed corpus. I’d like to confirm whether this is expected behavior or a tokenizer/index configuration issue.

Relevant text properties use:

tokenization = gse
indexSearchable = true
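
For context, the property definition in use looks roughly like this in the Python client v4 (collection and property names are illustrative, and this assumes an already-connected `client` and a server image built with GSE support):

```python
from weaviate.classes.config import DataType, Property, Tokenization

# Sketch of the relevant property configuration; "ChildCollection" and
# "content" are placeholder names, not necessarily those in my schema.
client.collections.create(
    name="ChildCollection",
    properties=[
        Property(
            name="content",
            data_type=DataType.TEXT,
            tokenization=Tokenization.GSE,
            index_searchable=True,
        ),
    ],
)
```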

Observed behavior

When the query contains English tokens (e.g. SELECT 文本段落), documents that contain large amounts of English text receive disproportionately high BM25 scores, even when they are semantically irrelevant.

Explain score findings

Using explain_score, I observed that the query appears to be tokenized into both word-level and character-level tokens.

Example (simplified):

BM25F_select_frequency:100
BM25F_s_frequency:39
BM25F_e_frequency:29
BM25F_l_frequency:108
BM25F_c_frequency:10
BM25F_段落_frequency:145
BM25F_本_frequency:48

This suggests:

  • English words are indexed both as whole tokens (select) and individual characters (s, e, l, c, t)
  • Chinese text is also partially indexed at character level

Because single-character tokens occur very frequently, English-heavy documents accumulate large BM25 scores.
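
The inflation is easy to reproduce with a toy BM25 scorer in plain Python — a sketch of the scoring mechanics only, not Weaviate's actual BM25F implementation, and the tokenizer outputs below are simulated:

```python
import math
from collections import Counter

def bm25(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Plain BM25 score of one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Simulated tokenizer output: the query yields word tokens plus the
# individual characters of the English word, as seen in explain_score.
query = ["select", "文本", "段落", "s", "e", "l", "e", "c", "t"]
# Relevant doc: matches each meaningful token once.
relevant = ["select", "文本", "段落", "示例"]
# Irrelevant English-heavy doc, indexed at character level only.
english_heavy = list("steelcellsselectlesstests")
corpus = [relevant, english_heavy]

print(bm25(query, relevant, corpus))
print(bm25(query, english_heavy, corpus))  # higher, purely from char frequency
```

The English-heavy document outscores the relevant one even though it never contains the token `select`, because the single characters `s`, `e`, `l`, `c`, `t` each match many times.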

I suspect the GSE tokenization setting is causing the issue, but GSE is the recommended tokenization for Chinese and Japanese text.

What I already tested

  • Spaces in the query do not affect results significantly
  • Lowercasing English tokens does not solve the issue
  • Behavior appears strongly correlated with English token presence

Expected behavior

I would expect BM25 ranking to rely primarily on meaningful tokens (words or segmented phrases), not character-level matches that dominate scoring.

Questions

  1. Is it expected that gse tokenization produces character-level tokens that participate in BM25 scoring?
  2. Is BM25F designed to use both word-level and character-level tokens simultaneously?
  3. What is the recommended tokenizer for BM25 in mixed Chinese + English corpora?
  • gse
  • trigram
  • word
  • Separate BM25 field without GSE?
  4. Is it a recommended pattern to maintain separate fields for:
  • semantic embedding (GSE)
  • keyword BM25 (non-GSE tokenizer)?
  5. Are there ways to reduce the impact of single-character tokens on BM25 scoring?

Goal

I want keyword retrieval to:

  • Work well for Chinese queries
  • Not over-score documents simply because they contain large amounts of English text
  • Avoid heavy query routing or custom filtering logic

Any guidance or best practices would be greatly appreciated.

Thanks!

Server Setup Information

  • Weaviate Server Version: 1.31.2
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1 node
  • Client Language and Version: 4.15.2
  • Multitenancy?: No

Any additional Information

Code blocks:

        res_keywords = self.db_instance.child_collection.query.bm25(
            query="SELECT 文本段落",  # also tried "SELECT文本段落" to check whether the space caused this issue
            return_properties=["parent_uuid", "content", "file_id", "content_chapter_info", "file_name"],
            return_metadata=MetadataQuery(score=True, explain_score=True),  # distance is not populated for BM25
            limit=max_return_limit,
            filters=file_id_filter,
        )

        for i, obj in enumerate(res_keywords.objects[:10], 1):
            print("=" * 10)
            print(f"[{i}] score={getattr(obj.metadata, 'score', None)}")
            explain = getattr(obj.metadata, "explain_score", None)
            print("EXPLAIN:\n", explain)

Hi @Carloszone !!

This is the expected behavior. GSE produces both word-level and character-level tokens, and BM25F uses all tokens for scoring. Character tokens like ‘s’, ‘e’, ‘l’ appear frequently in English text, causing high scores for English-heavy documents.

So I believe a good approach would be:

1. Use a Different Tokenizer for BM25

For mixed Chinese-English corpora where you want to avoid character-level token scoring, consider using:

  • word tokenization for BM25 (better for mixed languages)
  • Keep GSE only for semantic embedding fields

2. Separate Fields Pattern

Yes, maintaining separate fields is a recommended pattern:

  • One field with GSE tokenization for semantic search/embeddings
  • Another field with word or trigram tokenization for BM25 keyword search
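
A two-field layout along these lines could be declared roughly as follows (property names here are illustrative, not a prescribed schema):

```python
from weaviate.classes.config import DataType, Property, Tokenization

# Sketch: same text stored twice, tokenized differently per purpose.
properties = [
    # Semantic field: GSE segmentation feeds the vectorizer / embeddings.
    Property(
        name="content",
        data_type=DataType.TEXT,
        tokenization=Tokenization.GSE,
    ),
    # Keyword field: word tokenization for BM25, so no single-character
    # tokens from English words participate in scoring.
    Property(
        name="content_bm25",
        data_type=DataType.TEXT,
        tokenization=Tokenization.WORD,
        index_searchable=True,
    ),
]
```

At query time, BM25 can then be restricted to the keyword field via `query_properties=["content_bm25"]`.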

Let me know if this helps!

Thanks!

Hi, @DudaNogueira

Thank you for your reply.

I have redesigned my collection structure to solve the issue.

In short, I added two additional attributes: content_zh_bm25 (which stores only Chinese content with GSE tokenization) and content_en_bm25 (which stores only English content with LOWERCASE tokenization).

Now the performance of the new structure is very good. It only costs a little more storage space but provides more accurate keyword search results.
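
For anyone landing here later, a minimal sketch of the ingest-time split that this structure implies — the function name and the character ranges are my assumptions, not exact code from my pipeline:

```python
import re

def split_zh_en(text: str) -> tuple[str, str]:
    """Split mixed text into Chinese-only and English-only strings,
    destined for content_zh_bm25 (GSE) and content_en_bm25 (LOWERCASE).
    The CJK range below covers common Chinese characters only."""
    zh = "".join(re.findall(r"[\u4e00-\u9fff]+", text))
    en = " ".join(re.findall(r"[A-Za-z0-9]+", text))
    return zh, en

zh, en = split_zh_en("SELECT 文本段落 from the report")
print(zh)  # 文本段落
print(en)  # SELECT from the report
```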


Awesome! Glad it helped!