Description
I am using an English Word document for testing. It works perfectly when I set “tokenization” to “word”, but I get no results when I switch to “gse”.
Server Setup Information
weaviate:
  image: semitechnologies/weaviate:1.24.11
  command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
  restart: always
  volumes:
    # Mount the Weaviate data directory to the container.
    - ./volumes/weaviate:/var/lib/weaviate
  environment:
    # The Weaviate configurations
    # You can refer to the [Weaviate](https://weaviate.io/developers/weaviate/config-refs/env-vars) documentation for more information.
    QUERY_DEFAULTS_LIMIT: 25
    USE_GSE: 'true'
    AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
    PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
    CLUSTER_HOSTNAME: 'node1'
    AUTHENTICATION_APIKEY_ENABLED: 'true'
    AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'xxx'
    AUTHENTICATION_APIKEY_USERS: 'hello@xxx'
    AUTHORIZATION_ADMINLIST_ENABLED: 'true'
    AUTHORIZATION_ADMINLIST_USERS: 'hello@xxx'
  ports:
    - 8080:8080
    - 50051:50051
- Weaviate Server Version: 1.24.11
- Deployment Method: docker
- Multi Node? Number of Running Nodes: single node
- Client Language and Version:
name = "weaviate-client"
version = "4.5.5"
description = "A python native Weaviate client"
optional = false
python-versions = "3.10"
Any additional Information
The schema looks like this:
{
    "class": self.index_name,
    "description": "Chunks of Documentations",
    "vectorizer": "none",
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "description": "Content of the document",
            "tokenization": "gse",  # works if this is "word"
            "indexSearchable": True,
        },
    ]
}
Is any further configuration needed for GSE in the Weaviate Docker image?
Thanks,
Hao
I do want to process Chinese/Japanese documents, which is why I set the tokenization to “gse”. When I searched for GSE, the only thing I found was https://github.com/weaviate/weaviate/blob/257b2d366201faabf8bd0256f7e21a386bfdf08b/adapters/repos/db/helpers/tokenizer.go#L44, so I tried setting USE_GSE as an environment variable in the YAML file. However, it still doesn’t work.
I have spent 2 days on this issue, so I am hoping the forum can point me to a quick fix.
BRs.
Hao
Hi @Cheng_Hao!
Check if this helps:
import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

# Start from a clean collection so the tokenization setting is applied.
client.collections.delete("Test")
collection = client.collections.create(
    name="Test",
    description="gse test",
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            tokenization=wvc.config.Tokenization.GSE,
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
)

# Confirm that the property was created with GSE tokenization.
print(collection.config.get().properties[0].tokenization)

collection.data.insert({"text": "親切にしてください"})
collection.data.insert({"text": "幸せになる"})

# Release the connection when done.
client.close()
Otherwise, it would help if you could share reproducible code, including search examples, so we can start from there.
Thanks!
Also, welcome to our community!
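For the kind of search example asked for above, a minimal sketch of a keyword (BM25) query against the "Test" collection could look like the following. The helper name `keyword_search` and the query string are illustrative assumptions, not part of the original reply, and `main()` assumes a local Weaviate instance where the "Test" collection already exists.

```python
from typing import Any, List


def keyword_search(collection: Any, query: str, limit: int = 5) -> List[str]:
    """Run a BM25 keyword search on the "text" property and return the matching texts."""
    response = collection.query.bm25(
        query=query,
        query_properties=["text"],  # restrict the search to the GSE-tokenized property
        limit=limit,
    )
    return [obj.properties["text"] for obj in response.objects]


def main() -> None:
    # Imported here so the helper above can be exercised without a running server.
    import weaviate

    client = weaviate.connect_to_local()
    try:
        collection = client.collections.get("Test")
        # Hypothetical query term; use any term that should appear in your documents.
        print(keyword_search(collection, "幸せ"))
    finally:
        client.close()
```

Calling `main()` against a running instance prints whatever texts the keyword search returns; an empty list right after insertion is a useful data point when reporting the issue.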
Thank you so much, DudaNogueira, for the quick response. The demo code works well for me. I have another question: does the “insert” API wait until index building has finished on the Weaviate server? Or does index building kick off automatically some time after the insertion calls?
Hi! Glad to hear that!
insert will not wait. It will insert one object with one request.
When you want to insert many objects with a single request, you can leverage insert_many.
Now, if you want to insert a truckload of data, you will need to use batching.
A good starting point is a dynamically sized batch; from there you can experiment with a fixed size and grow as needed.
Ingestion performance depends on many factors (such as cluster replication and resource allocation), so it is advisable to plan your resources accordingly.
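The two ingestion options mentioned above can be sketched roughly as follows with the v4 Python client. The function names `insert_small` and `insert_large` are hypothetical, and `collection` is assumed to be a collection handle obtained from a connected client (for example via `client.collections.get("Test")`).

```python
from typing import Any, Dict, List


def insert_small(collection: Any, rows: List[Dict[str, str]]) -> int:
    """Send all rows in a single request with insert_many; return how many were stored."""
    result = collection.data.insert_many(rows)
    if result.has_errors:
        # errors maps the index of each failed row to its error details
        print("insert_many errors:", result.errors)
    return len(result.uuids)


def insert_large(collection: Any, rows: List[Dict[str, str]]) -> int:
    """Stream rows through a dynamically sized batch; return the number of failed objects."""
    with collection.batch.dynamic() as batch:
        for row in rows:
            batch.add_object(properties=row)
    # The batch is flushed when the context manager exits.
    return len(collection.batch.failed_objects)
```

If the dynamic batch does not suit the workload, swapping `collection.batch.dynamic()` for `collection.batch.fixed_size(batch_size=...)` is the usual next step; the right size depends on the deployment.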
The feature you mentioned is ASYNC_INDEXING, which you must first enable server-wide. With it enabled, whenever a new object comes in, Weaviate writes the data and queues the object for index building.
Note that the vectorization stage is not async, only the indexing is. Vectorization still happens at ingestion time.
Let me know if this helps!
Thanks!
Thank you very much, that’s really helpful.
I guess the earlier keyword search failure was probably due to the index being built asynchronously. Now it works perfectly.
Glad to hear that @Cheng_Hao !!
If you have any other issues in your Weaviate journey, we are here to help!