Description
I'm seeking help with an issue regarding keyword search for Chinese text. I am using Weaviate v1.25.3 with the latest weaviate-client library in Python.
My goal is to index Chinese documents and use bm25 keyword search with the gse tokenizer. However, despite setting index_searchable=True and tokenization=Tokenization.GSE in my collection schema, keyword searches consistently return zero results. Vector search works fine, and I can fetch objects by their UUID, which confirms that the data is being inserted correctly. The issue seems to be specific to the keyword search/tokenization functionality.
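To rule out the schema itself, one thing I could check is whether the tokenization and index_searchable settings were actually stored on the property. The helper below is a minimal sketch of that check. The function name check_bm25_settings is hypothetical, and it assumes the property objects expose .name, .tokenization and .index_searchable, which I believe matches what the v4 client returns from collection.config.get().properties. The stand-in objects are for illustration only and are not real server output.

```python
from types import SimpleNamespace


def check_bm25_settings(properties, prop_name, expected_tokenization="gse"):
    """Return a list of problems found for `prop_name`; an empty list means OK.

    `properties` is any iterable of objects exposing `.name`, `.tokenization`
    and `.index_searchable` (assumed attribute names, see the note above).
    """
    for prop in properties:
        if prop.name != prop_name:
            continue
        problems = []
        # The tokenization may be an enum member or a plain string.
        tok = getattr(prop.tokenization, "value", prop.tokenization)
        if tok != expected_tokenization:
            problems.append(
                f"tokenization is {tok!r}, expected {expected_tokenization!r}"
            )
        if not prop.index_searchable:
            problems.append("index_searchable is not enabled")
        return problems
    return [f"property {prop_name!r} not found"]


# Stand-in objects for illustration only; not real server output.
good = [SimpleNamespace(name="content", tokenization="gse", index_searchable=True)]
bad = [SimpleNamespace(name="content", tokenization="word", index_searchable=True)]
print(check_bm25_settings(good, "content"))
print(check_bm25_settings(bad, "content"))
```

Against a live instance this would presumably be called as check_bm25_settings(collection.config.get().properties, "content"). An empty list would suggest the schema is stored as intended and the problem lies on the server side.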
Server Setup Information
My .yml file (I have set ENABLE_TOKENIZER_GSE to true):
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.3
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_TOKENIZER_GSE: 'true'
      CLUSTER_HOSTNAME: 'node1'
Any additional Information
My test code:
import weaviate
import time
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.exceptions import WeaviateQueryException

COLLECTION_NAME = "KeywordSearchDiagnosticTest"
TEST_SENTENCE = "Weaviate是一个开源的向量数据库,非常适合RAG系统。"  # Chinese content
KEYWORD_SHOULD_MATCH = "数据库"  # this keyword definitely appears in the content above
KEYWORD_SHOULD_NOT_MATCH = "人工智能"  # this keyword does not appear in the content above


def run_diagnostic():
    client = None
    try:
        # --- step 1: connect to Weaviate and delete the old test collection ---
        print("--- step 1: connect to Weaviate and delete old test collection ---")
        client = weaviate.connect_to_local()
        if client.collections.exists(COLLECTION_NAME):
            client.collections.delete(COLLECTION_NAME)
            print(f"✅ delete the old test collection: '{COLLECTION_NAME}'。")
        else:
            print("✅ old test collection doesn't exist。")

        # --- step 2: create a new collection ---
        print("\n--- step 2: create a new Collection ---")
        client.collections.create(
            name=COLLECTION_NAME,
            properties=[
                Property(
                    name="content",
                    data_type=DataType.TEXT,
                    tokenization=Tokenization.GSE,  # set tokenization to GSE
                    index_searchable=True,
                )
            ],
            vectorizer_config=Configure.Vectorizer.none(),
        )
        print(f"✅ collection '{COLLECTION_NAME}' has been created successfully。")

        # --- step 3: add a test object ---
        print("\n--- step 3: add a test info ---")
        collection = client.collections.get(COLLECTION_NAME)
        uuid = collection.data.insert({
            "content": TEST_SENTENCE
        })
        print(f"✅ add test info successfully,UUID: {uuid}...")

        # --- step 4: wait a moment for the index to update ---
        print("\n--- step 4: wait for 2s for index update ---")
        time.sleep(2)

        # --- step 5: run the diagnostic tests ---
        print("\n--- step 5: start to run diagnostic task ---")

        # test A: fetch the object directly by UUID
        print("\n[test A] get info by uuid ...")
        fetched_object = collection.query.fetch_object_by_id(uuid)
        if fetched_object and fetched_object.properties.get("content") == TEST_SENTENCE:
            print(" -> ✅ get the info")
            test_a_success = True
        else:
            print(" -> ❌ cannot get the info by uuid。")
            test_a_success = False

        # test B: keyword search (should match)
        print(f"\n[test B] use keywords '{KEYWORD_SHOULD_MATCH}' to search ...")
        response_should_match = collection.query.bm25(
            query=KEYWORD_SHOULD_MATCH,
            query_properties=["content"]
        )
        if response_should_match.objects:
            print(f" -> ✅ find {len(response_should_match.objects)} records。")
            test_b_success = True
        else:
            print(" -> ❌ cannot find any records。")
            test_b_success = False

        # test C: keyword search (should fail)
        print(f"\n[test C] se keywords '{KEYWORD_SHOULD_NOT_MATCH}' to search...")
        response_should_not_match = collection.query.bm25(
            query=KEYWORD_SHOULD_NOT_MATCH,
            query_properties=["content"]
        )
        if not response_should_not_match.objects:
            print(" -> ✅ no records found")
            test_c_success = True
        else:
            print(f" -> ❌ find {len(response_should_not_match.objects)} records wrongly。")
            test_c_success = False
    except Exception as e:
        print(f"\n❌ meet issue or error: {e}")
    finally:
        # client stays None if the connection attempt itself failed
        if client is not None and client.is_connected():
            client.close()
        print("\n task completed, close the connection。")


if __name__ == "__main__":
    run_diagnostic()
My output:
[test A] get info by uuid ...
-> ✅ get the info
[test B] use keywords '数据库' to search ...
-> ❌ cannot find any records。
[test C] se keywords '人工智能' to search...
-> ✅ no records found
task completed, close the connection。
My guess is that the GSE tokenizer is not actually being applied, which is why the bm25 search returns no matching records from the collection. However, I don't know how to test this assumption or how to fix the issue.
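To make this guess concrete, here is a small pure-Python illustration of what I suspect could be happening. It is not Weaviate's actual tokenizer; word_style_tokens is a hypothetical stand-in for a tokenizer that lowercases text and splits only on non-alphanumeric characters, and whether the server really behaves like this when GSE is not loaded is an assumption on my part. The point is that without word segmentation, the Chinese sentence becomes a couple of long tokens, so an exact-token lookup for a shorter keyword finds nothing.

```python
TEST_SENTENCE = "Weaviate是一个开源的向量数据库,非常适合RAG系统。"


def word_style_tokens(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    tokens, current = [], []
    for ch in text.lower():
        # str.isalnum() is True for CJK characters, so only punctuation
        # (and whitespace) acts as a separator here.
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


tokens = word_style_tokens(TEST_SENTENCE)
print(tokens)
# Keyword search matches whole tokens, not substrings.
print("数据库" in tokens)
```

In this sketch the keyword is only a substring of the first token, so the membership test prints False. That matches the symptom I see in test B, although it does not prove that this is the cause.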
My Python version: 3.10.18
My OS: Ubuntu 24.04.2