GSE.CutAll does not work well for some Chinese text

gfwgfw · November 22, 2024, 1:54pm

How to reproduce this bug?

query Get {
    Get {
        NewspaperArticle_V2(
            limit: 10000
            nearVector: {
                vector: [... ]
            }
            where: {
                operator: Or
                operands: [{ path: ["content"], operator: ContainsAll, valueText: ["黄海峰","刘捷"] }]
            }
        ) {
            title
            articleId
            content
        }
    }
}

In some cases, GSE.CutAll does not generate correct tokens from Chinese input text.
For example:
source: “本报讯（首席记者赵芳洲）平安杭州建设20周年大会昨日下午召开。省委副书记、市委书记刘捷”

results in:
GSE.CutAll: [本报本报讯（首席首席记者记者赵芳洲）平安杭州建设 2 0 周年大会昨日下午召开。省委副书记、市委市委书记书记刘捷]
// use DAG and HMM GSE.Cut(text, true): [本报讯（首席记者赵芳洲）平安杭州建设 20 周年大会昨日下午召开。省委副书记、市委书记刘捷]
//cut search use hmm: GSE.CutSearch(text, true): [本报本报讯（首席记者首席记者赵芳洲）平安杭州建设 20 周年大会昨日下午召开。省委副书记、市委书记市委书记刘捷]

Some Chinese personal names and other terms are tokenized incorrectly: ‘刘捷’ should be ‘刘捷’, ‘赵芳洲’ should be ‘赵芳洲’, and ‘2 0’ should be ‘20’ (there is an unnecessary space between the two characters).

The difference is visible even in go-ego/gse's own example:

go-ego/gse/blob/627fa87efa481d4f734d6e06798363a4e1dde1d8/examples/main.go#L99C3-L99C4

// cut all:  [《复仇者联盟3：无限战争》 复仇 复仇者 仇者 联盟 3 ： 无限 战争 》 是 全片 使用 i m a x 摄影 摄影机 拍摄 摄制 制作 的 的 科幻 科幻片 .]

‘imax’ is tokenized as ‘i m a x’ by the CutAll method, which is not correct.

What is the expected behavior?

Tokens should be generated with GSE.Cut(text, true) (DAG and HMM) or GSE.CutSearch(text, true) (search mode with HMM).

What is the actual behavior?

The GSE.CutAll method generates incorrect tokens.

Supporting information

No response

Server Version

1.27.0

Weaviate Setup

Single Node
