How can I configure the gse tokenization?

Description

I’m trying to use Weaviate to handle Chinese documents. I know Weaviate supports gse as a tokenization option, but how can I configure gse itself? I need to load a Chinese dictionary into gse so that particular terms are recognized. (The official gse library supports a LoadDict method.)

The schema looks like this:

{
    "class": self.index_name,
    "description": "Chunks of Documentations",
    "vectorizer": "none",
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "description": "Content of the document",
            "tokenization": "gse",
            "indexSearchable": True,
        },
    ],
}
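
For anyone who wants to reproduce this outside my class, here is a minimal self-contained sketch of the same definition. The helper name `build_class_definition` and the class name `DocChunk` are placeholders made up for this example; they are not part of Weaviate or of my real code. It only builds and prints the dictionary and does not talk to a server.

```python
import json


def build_class_definition(index_name: str) -> dict:
    """Return a class definition with one gse-tokenized text property.

    `index_name` stands in for `self.index_name` in the fragment above.
    """
    return {
        "class": index_name,
        "description": "Chunks of Documentations",
        "vectorizer": "none",
        "properties": [
            {
                "name": "text",
                "dataType": ["text"],
                "description": "Content of the document",
                # Only selects the tokenizer; it says nothing about
                # which dictionary gse loads on the server side.
                "tokenization": "gse",
                "indexSearchable": True,
            },
        ],
    }


if __name__ == "__main__":
    # "DocChunk" is a hypothetical class name used for illustration.
    definition = build_class_definition("DocChunk")
    print(json.dumps(definition, indent=2, ensure_ascii=False))
```

Note that nothing in this definition carries a dictionary path, which is exactly the part I don't know how to configure.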

Thanks.

Hi @Charlie!

Welcome to our community :hugs:

To use GSE tokenization, you first need to enable it on your server.

For that, you need to set the environment variable ENABLE_TOKENIZER_GSE to true as documented here:

Let me know if that helps!

Thanks!

Thanks, I have already enabled gse. What I don’t know is how to customize the gse options: I need to load a Chinese dictionary, as in the gse example below:

var seg1 gse.Segmenter
seg1.DictSep = ","
err := seg1.LoadDict("./testdata/test_cn.txt")
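
To show why the dictionary matters for my use case, here is a toy sketch in Python. It is not how gse works internally and it does not use gse or Weaviate at all; it is a simple forward maximum matching segmenter, and the function name `segment` and the sample terms are made up for this example. The point is only that adding a domain term to the dictionary changes which tokens come out, and therefore what a keyword search can match.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Toy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    base_dict = {"向量", "数据库"}
    custom_dict = base_dict | {"向量数据库"}  # hypothetical domain term

    print(segment("向量数据库", base_dict))    # ['向量', '数据库']
    print(segment("向量数据库", custom_dict))  # ['向量数据库']
```

With the base dictionary the term is split in two; with the custom entry it stays one token. That is the behavior I am hoping to get from gse inside Weaviate by loading my own dictionary.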