BM25 CJK (Chinese, Japanese, Korean) Support

henry_m · October 18, 2024, 8:03am

Description

I’m exploring keyword search using BM25.

From the topic "Multiple keywords using BM25" I see that multiple words can be supplied. However, we’re handling Chinese text:

>>> collection.query.bm25("水痘是怎樣形成的？")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])
>>> collection.query.bm25("水 痘 是 怎 樣 形 成 的 ？")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘 是 怎 樣 形 成 的 ？")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])

Can Weaviate expose the underlying BM25 tokenizer so that CJK text is handled properly?
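
For anyone reading along, the four results above fit a tokenizer that only splits on whitespace and punctuation. The sketch below is a simplified stand-in for that behavior and is not Weaviate's actual implementation. The indexed sentence is hypothetical, since the real documents are not shown in this thread; the assumption is that 水痘 appears in them delimited by punctuation or spaces.

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Simplified stand-in for a word-style tokenizer (an assumption, not
    Weaviate's code): lowercase, then split on anything that is not a
    letter or digit."""
    return re.findall(r"[^\W_]+", text.lower())


# Hypothetical indexed text; the real documents are not shown in the thread.
doc_tokens = set(word_tokenize("水痘，又稱水花，是一種傳染病。"))

queries = [
    "水痘是怎樣形成的？",
    "水痘",
    "水 痘 是 怎 樣 形 成 的 ？",
    "水痘 是 怎 樣 形 成 的 ？",
]


def matches(query: str) -> bool:
    # A keyword search can only return documents that share at least
    # one whole token with the query.
    return bool(doc_tokens & set(word_tokenize(query)))


for q in queries:
    print(q, word_tokenize(q), matches(q))
```

Under these assumptions the unsegmented question becomes one long token, and the fully spaced version becomes single characters, so neither shares a whole token with the document. Only the queries that contain 水痘 as its own token match.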

Server Setup Information

Weaviate Server Version: 1.25.2
Deployment Method: docker (semitechnologies/weaviate:1.25.2)
Multi Node? Number of Running Nodes: 1

Any additional Information

Dirk · October 18, 2024, 8:12am

Hello!

We have special tokenizers for Chinese and Japanese, and for Korean.

Please have a look here: Collection schema | Weaviate
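
To illustrate what a CJK-aware tokenizer changes, here is a toy dictionary-based segmenter using greedy forward maximum matching. It is only a sketch of the general idea. The tiny dictionary, the `segment` helper, and the sample document are all hypothetical, and this is not the algorithm the Weaviate tokenizers actually use. In Weaviate itself, the tokenizer is chosen per property in the collection schema; I believe the relevant options are named `gse` and `kagome_kr`, but please confirm the names and availability for your server version on the linked page.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Toy greedy forward maximum matching: at each position take the
    longest dictionary word, falling back to a single character.
    Punctuation-only tokens are dropped at the end."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return [t for t in tokens if t.isalnum()]


# Hypothetical mini dictionary; real segmenters ship far larger ones.
DICTIONARY = {"水痘", "怎樣", "形成", "傳染病"}

query_tokens = segment("水痘是怎樣形成的？", DICTIONARY)
doc_tokens = segment("水痘是一種傳染病。", DICTIONARY)  # hypothetical document

print(query_tokens)
print(doc_tokens)
print(set(query_tokens) & set(doc_tokens))
```

With segmentation applied to both the indexed text and the query, the unsegmented question shares the token 水痘 with the document, so the query no longer depends on where the user happens to put spaces.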

