How to enable GSE tokenization for keyword search?

Description

I am testing with an English-language document. It works perfectly if I specify the "tokenization" as "word", but I get nothing if I switch to "gse".

Server Setup Information

  weaviate:
    image: semitechnologies/weaviate:1.24.11
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    restart: always
    volumes:
      # Mount the Weaviate data directory to the container.
      - ./volumes/weaviate:/var/lib/weaviate
    environment:
      # The Weaviate configurations
      # You can refer to the [Weaviate](https://weaviate.io/developers/weaviate/config-refs/env-vars) documentation for more information.
      QUERY_DEFAULTS_LIMIT: 25
      USE_GSE: 'true'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      CLUSTER_HOSTNAME: 'node1'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'xxx'
      AUTHENTICATION_APIKEY_USERS: 'hello@xxx'
      AUTHORIZATION_ADMINLIST_ENABLED: 'true'
      AUTHORIZATION_ADMINLIST_USERS: 'hello@xxx'
    ports:
      - 8080:8080
      - 50051:50051
  • Weaviate Server Version:
    1.24.11

  • Deployment Method:
    docker

  • Multi Node? Number of Running Nodes:
    single node

  • Client Language and Version:
name = "weaviate-client"
version = "4.5.5"
description = "A python native Weaviate client"
optional = false
python-versions = "3.10"

Any additional Information

The schema looks like this (built in Python, hence `self.index_name` and `True`):

{
    "class": self.index_name,
    "description": "Chunks of Documentations",
    "vectorizer": "none",
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "description": "Content of the document",
            "tokenization": "gse",   # <--- works if it is "word"
            "indexSearchable": True,
        },
    ],
}

I am wondering if there is any further configuration needed for GSE in the Weaviate Docker image?

Thanks,
Hao

I do want to process Chinese/Japanese documents, which is why I chose "gse" for tokenization. Searching for GSE, the only reference I found was https://github.com/weaviate/weaviate/blob/257b2d366201faabf8bd0256f7e21a386bfdf08b/adapters/repos/db/helpers/tokenizer.go#L44, so I tried setting USE_GSE as an environment variable in the YAML file; however, it still doesn't work.
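My understanding is that "word" tokenization splits on whitespace-like boundaries, which would explain why it returns nothing useful for CJK text. A rough plain-Python illustration of the general idea (not Weaviate's actual tokenizer):

```python
def word_style_tokens(text: str) -> list[str]:
    # Naive whitespace-based "word"-style tokenization
    return [t for t in text.split() if t]

# English splits into individually searchable terms...
print(word_style_tokens("be kind to others"))  # ['be', 'kind', 'to', 'others']

# ...but Japanese has no spaces, so the whole phrase becomes one opaque token
print(word_style_tokens("幸せになる"))  # ['幸せになる']
```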

I have spent two days on this issue; hopefully there is a quick fix from the forum.

BRs.
Hao

hi @Cheng_Hao !

Check if this helps:

import weaviate
from weaviate import classes as wvc
client = weaviate.connect_to_local()
client.collections.delete("Test")
collection = client.collections.create(
    name="Test",
    description="gse test",
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.GSE),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai()
)
print(collection.config.get().properties[0].tokenization)
collection.data.insert({"text": "親切にしてください"})
collection.data.insert({"text": "幸せになる"})

Otherwise, it would help if you could share reproducible code, with search examples, so we can start from there.

Thanks!

Also, Welcome to our community :hugs:

Thank you so much DudaNogueira for the quick response. The demo code works well for me. I have another question, though: will the "insert" API wait until the index building has finished on the Weaviate server, or will the index building kick off automatically on the server some time after the insertion calls?

Hi! Glad to hear that!

insert will not wait. It will insert one object with one request.

When you want to insert many objects with a single request, you can leverage insert_many.

Now, if you want to insert a truckload of data, you will need to use batching:

A good starting point is the dynamic-size batch; then play around with fixed sizes and grow as needed.
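As a sketch of dynamic batching with the v4 client (assuming a local instance and the `Test` collection from the example above):

```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Test")

# The dynamic batch context manager adjusts batch size based on server
# feedback; queued objects are flushed when the context exits
with collection.batch.dynamic() as batch:
    for text in ["親切にしてください", "幸せになる"]:
        batch.add_object(properties={"text": text})

# After the batch completes, check whether any objects failed to import
if collection.batch.failed_objects:
    print(collection.batch.failed_objects)

client.close()
```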

The ingestion performance will depend on a lot of factors (cluster replication and resource allocation), so it is advisable to plan your resources accordingly.

The feature you mentioned is ASYNC_INDEXING, which you must first enable server-wide. That way, whenever a new object comes in, Weaviate will write the data and queue the object for index building.
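In a Docker Compose setup like the one above, enabling it is a matter of adding the environment variable (a sketch; check the env-vars reference for your Weaviate version, as the feature may be experimental):

```yaml
    environment:
      # Opt in to asynchronous vector index building (server-wide)
      ASYNC_INDEXING: 'true'
```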

Note that only the indexing stage is async, not vectorization: vectorization still happens at ingestion time.

Let me know if this helps!

Thanks!

Thank you very much, that’s really helpful.

I guess the previous failure on keyword search was probably due to the index being built asynchronously. Now it works perfectly.


Glad to hear that @Cheng_Hao !!

If you have any other issues in your Weaviate journey, we are here to help :slight_smile: