Description
I’m exploring keyword search using BM25.
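From the "Multiple keywords using BM25" documentation I see that multiple words can be supplied. However, we’re handling Chinese text, which has no spaces between words, and the results depend on how the query is segmented: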
>>> collection.query.bm25("水痘是怎樣形成的?")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])
>>> collection.query.bm25("水 痘 是 怎 樣 形 成 的 ?")
QueryReturn(objects=[])
>>> collection.query.bm25("水痘 是 怎 樣 形 成 的 ?")
QueryReturn(objects=[Object(uuid=...),Object(uuid=...),Object(uuid=...)])
Can Weaviate expose the underlying BM25 tokenizer so that CJK text is handled properly?
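In the meantime, one possible client-side stopgap is to segment the query before sending it. The sketch below is only an illustration. It assumes the indexed text contains 水痘 as a standalone token, which the results above suggest but do not prove. `VOCAB`, `segment`, and `to_bm25_query` are hypothetical names, and the greedy longest-match approach stands in for a real Chinese word segmenter.

```python
# Hypothetical vocabulary of terms assumed to exist as tokens in the index.
VOCAB = {"水痘"}


def segment(text, vocab):
    """Greedy forward maximum matching.

    At each position, take the longest substring found in `vocab`;
    if none matches, fall back to a single character.
    """
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_bm25_query(text, vocab):
    """Join segmented tokens with spaces, dropping punctuation and whitespace."""
    return " ".join(tok for tok in segment(text, vocab) if tok.isalnum())


query = to_bm25_query("水痘是怎樣形成的?", VOCAB)
# With the `collection` object from the session above, the call would then be:
# collection.query.bm25(query)
```

Here `query` comes out as `水痘 是 怎 樣 形 成 的`, which is the spaced form that returned results above, minus the trailing question mark. This only helps if the query-side segmentation matches the way the stored text was tokenized at index time, so it is a workaround rather than a fix.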
Server Setup Information
- Weaviate Server Version: 1.25.2
- Deployment Method: docker (semitechnologies/weaviate:1.25.2)
- Multi Node? Number of Running Nodes: 1