How to enable GSE tokenization for keyword search?

Description

I am testing with an English-language document. It works perfectly if I specify the "tokenization" as "word", but I get nothing if I switch to "gse".

Server Setup Information

  weaviate:
    image: semitechnologies/weaviate:1.24.11
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    restart: always
    volumes:
      # Mount the Weaviate data directory to the container.
      - ./volumes/weaviate:/var/lib/weaviate
    environment:
      # The Weaviate configurations
      # You can refer to the [Weaviate](https://weaviate.io/developers/weaviate/config-refs/env-vars) documentation for more information.
      QUERY_DEFAULTS_LIMIT: 25
      USE_GSE: 'true'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      CLUSTER_HOSTNAME: 'node1'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'xxx'
      AUTHENTICATION_APIKEY_USERS: 'hello@xxx'
      AUTHORIZATION_ADMINLIST_ENABLED: 'true'
      AUTHORIZATION_ADMINLIST_USERS: 'hello@xxx'
    ports:
      - 8080:8080
      - 50051:50051
  • Weaviate Server Version:
    1.24.11

  • Deployment Method:
    docker

  • Multi Node? Number of Running Nodes:
    single node

  • Client Language and Version:
name = "weaviate-client"
version = "4.5.5"
description = "A python native Weaviate client"
optional = false
python-versions = "3.10"

Any additional Information

The schema looks like this (built in Python, hence `self.index_name` and `True`):

{
    "class": self.index_name,
    "description": "Chunks of Documentations",
    "vectorizer": "none",
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "description": "Content of the document",
            "tokenization": "gse",   # <--- works if it is "word"
            "indexSearchable": True,
        },
    ],
}

I am wondering if there is any further configuration needed for GSE in the Weaviate Docker image?

Thanks,
Hao

I do want to process Chinese/Japanese documents, which is why I chose "gse" for tokenization. Searching for GSE, the only reference I found was https://github.com/weaviate/weaviate/blob/257b2d366201faabf8bd0256f7e21a386bfdf08b/adapters/repos/db/helpers/tokenizer.go#L44, so I tried setting USE_GSE as an environment variable in the YAML file; however, it still doesn't work.
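My understanding is that "word" tokenization splits on whitespace-like boundaries, which would explain why it returns nothing useful for CJK text. A rough plain-Python illustration of the general idea (not Weaviate's actual tokenizer):

```python
def word_style_tokens(text: str) -> list[str]:
    # Naive whitespace-based "word"-style tokenization
    return [t for t in text.split() if t]

# English splits into individually searchable terms...
print(word_style_tokens("be kind to others"))  # ['be', 'kind', 'to', 'others']

# ...but Japanese has no spaces, so the whole phrase becomes one opaque token
print(word_style_tokens("幸せになる"))  # ['幸せになる']
```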

I have spent two days on this issue; hopefully there is a quick fix from the forum.

BRs.
Hao

hi @Cheng_Hao !

Check if this helps:

import weaviate
from weaviate import classes as wvc
client = weaviate.connect_to_local()
client.collections.delete("Test")
collection = client.collections.create(
    name="Test",
    description="gse test",
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.GSE),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai()
)
print(collection.config.get().properties[0].tokenization)
collection.data.insert({"text": "親切にしてください"})
collection.data.insert({"text": "幸せになる"})

Otherwise, it would help if you could share reproducible code, with search examples, so we can start from there.

Thanks!

Also, Welcome to our community :hugs:

Thank you so much DudaNogueira for the quick response. The demo code works well for me. I have another question, though: will the "insert" API wait until the index building has finished on the Weaviate server, or will the index building kick off automatically on the server some time after the insertion calls?

Hi! Glad to hear that!

insert will not wait. It will insert one object with one request.

When you want to insert many objects with a single request, you can leverage insert_many.

Now, if you want to insert a truckload of data, you will need to use batching:

A good starting point is the dynamic-size batch; then play around with fixed sizes and grow as needed.
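As a sketch of dynamic batching with the v4 client (assuming a local instance and the `Test` collection from the example above):

```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Test")

# The dynamic batch context manager adjusts batch size based on server
# feedback; queued objects are flushed when the context exits
with collection.batch.dynamic() as batch:
    for text in ["親切にしてください", "幸せになる"]:
        batch.add_object(properties={"text": text})

# After the batch completes, check whether any objects failed to import
if collection.batch.failed_objects:
    print(collection.batch.failed_objects)

client.close()
```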

The ingestion performance will depend on a lot of factors (cluster replication and resource allocation), so it is advisable to plan your resources accordingly.

The feature you mentioned is ASYNC_INDEXING, which you must first enable server-wide. That way, whenever a new object comes in, Weaviate will write the data and queue the object for index building.
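In a Docker Compose setup like the one above, enabling it is a matter of adding the environment variable (a sketch; check the env-vars reference for your Weaviate version, as the feature may be experimental):

```yaml
    environment:
      # Opt in to asynchronous vector index building (server-wide)
      ASYNC_INDEXING: 'true'
```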

Note that only the indexing stage is async, not vectorization: vectorization still happens at ingestion time.

Let me know if this helps!

Thanks!

Thank you very much, that’s really helpful.

I guess the previous failure on keyword search was probably due to the index being built asynchronously. Now it works perfectly.


Glad to hear that @Cheng_Hao !!

If you have any other issues in your Weaviate journey, we are here to help :slight_smile: