How can I configure GSE tokenization?

Charlie · November 16, 2024, 11:40am

Description

I’m trying to use Weaviate to handle Chinese documents. I know Weaviate supports GSE as a tokenization option, but how can I configure it? I need to load a Chinese dictionary into GSE so that particular terms are segmented correctly. (The official gse library provides a LoadDict method for this.)

The schema looks like this:

{
    "class": self.index_name,
    "description": "Chunks of Documentations",
    "vectorizer": "none",
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "description": "Content of the document",
            "tokenization": "gse",
            "indexSearchable": True,
        },
    ],
}
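
For anyone who wants to reproduce this, here is a minimal sketch of the same definition as a standalone function. The helper name `build_chunk_class` and the class name "DocChunk" are placeholders made up for illustration; the function only builds the dictionary and does not contact a Weaviate server.

```python
import json


def build_chunk_class(index_name: str) -> dict:
    """Return the class definition above, with the class name passed in."""
    return {
        "class": index_name,
        "description": "Chunks of Documentations",
        "vectorizer": "none",
        "properties": [
            {
                "name": "text",
                "dataType": ["text"],
                "description": "Content of the document",
                # Ask Weaviate to tokenize this property with GSE.
                "tokenization": "gse",
                "indexSearchable": True,
            },
        ],
    }


if __name__ == "__main__":
    # "DocChunk" is a placeholder class name, not taken from the original post.
    print(json.dumps(build_chunk_class("DocChunk"), indent=2))
```

Note that this only selects GSE for the property; it says nothing about loading a custom dictionary, which is the open question here.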

Thanks.

DudaNogueira · November 18, 2024, 1:11pm

Hi @Charlie!

Welcome to our community!

To use GSE tokenization, you first need to enable it on your server.

For that, you need to set the environment variable ENABLE_TOKENIZER_GSE to true as documented here:

Let me know if that helps!

Thanks!

Charlie · November 20, 2024, 10:05am

Thanks, I have already enabled GSE. What I don’t know is how to customize the GSE options: I need to load a Chinese dictionary, as in the gse example below:

var seg1 gse.Segmenter
seg1.DictSep = ","
err := seg1.LoadDict("./testdata/test_cn.txt")
