BM25 CJK (Chinese, Japanese, Korean) Support

Description

I’m exploring keyword search using BM25.

From the “Multiple keywords using BM25” example I see that multiple words can be supplied. However, we’re handling Chinese text:

>>> collection.query.bm25("水痘是怎樣形成的?")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])
>>> collection.query.bm25("水 痘 是 怎 樣 形 成 的 ?")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘 是 怎 樣 形 成 的 ?")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])

Can Weaviate expose the underlying BM25 tokenizer so that CJK text is handled properly?
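
For anyone trying to make sense of the four results above, here is a rough sketch of what I suspect is happening. It assumes the property uses a word-style tokenizer that lowercases the text and splits on anything that is not a letter or digit. The `word_tokenize` helper is my own emulation for illustration, not Weaviate's actual code.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Hypothetical emulation of a word-style tokenizer:
    lowercase, then split on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[\W_]+", text.lower()) if tok]

queries = [
    "水痘是怎樣形成的?",
    "水痘",
    "水 痘 是 怎 樣 形 成 的 ?",
    "水痘 是 怎 樣 形 成 的 ?",
]

# CJK characters count as letters, so unspaced text stays one long token.
for q in queries:
    print(repr(q), "->", word_tokenize(q))
```

Under that assumption the first query becomes the single token 水痘是怎樣形成的, which would only match a document containing exactly that token. The fourth query contains 水痘 as a token of its own, which would explain why it returns the same objects as the second query.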

Server Setup Information

  • Weaviate Server Version: 1.25.2
  • Deployment Method: docker (semitechnologies/weaviate:1.25.2)
  • Multi Node? Number of Running Nodes: 1

Any additional Information

Hello!

We have special tokenizers for Chinese and Japanese, and for Korean.

Please have a look at the “Collection schema” page in the Weaviate documentation.
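
As a minimal sketch of what that could look like, the tokenizer is chosen per property in the collection definition. The collection name `Article` and the property name `body` below are made up for illustration. The `gse` tokenization value (Chinese and Japanese) and the `ENABLE_TOKENIZER_GSE` server environment variable are what I recall from the documentation, so please check them against the page for your server version.

```python
import json

# Hypothetical collection definition in the shape of the REST schema payload.
# "Article" and "body" are placeholder names, not from the original post.
collection_definition = {
    "class": "Article",
    "properties": [
        {
            "name": "body",
            "dataType": ["text"],
            # Assumption: "gse" selects the Chinese/Japanese tokenizer and
            # requires ENABLE_TOKENIZER_GSE=true on the server.
            "tokenization": "gse",
        }
    ],
}

# Print the JSON that would be sent when creating the collection.
print(json.dumps(collection_definition, ensure_ascii=False, indent=2))
```

As far as I know the tokenization is fixed when the property is created, so existing data would most likely need to be imported again into a collection defined this way.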