Cannot do keyword search for Chinese content in Python

Description

I'm seeking help with an issue regarding keyword search for Chinese text. I am using Weaviate v1.25.3 with the latest weaviate-client library in Python.

My goal is to index Chinese documents and use bm25 keyword search with the gse tokenizer. However, despite setting index_searchable=True and tokenization=Tokenization.GSE in my collection schema, keyword searches consistently return zero results. Vector search works fine, and I can fetch objects by their UUID, which confirms the data is being inserted correctly. The issue seems to be specific to the keyword search/tokenization functionality.
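To illustrate why the tokenizer matters for this kind of query, here is a minimal sketch in plain Python. It is not Weaviate's or GSE's actual implementation: `split_on_non_alnum`, `toy_segment`, and `TOY_VOCAB` are hypothetical names made up for this example, and the dictionary is hand-written. It only shows that an exact-token match for "数据库" fails when the sentence is split on punctuation alone, and succeeds once the text is segmented into words.

```python
SENTENCE = "Weaviate是一个开源的向量数据库,非常适合RAG系统。"
QUERY = "数据库"

# Hypothetical mini-dictionary, for illustration only; a real segmenter ships a large one.
TOY_VOCAB = {"weaviate", "是", "一个", "开源", "的", "向量", "数据库",
             "非常", "适合", "rag", "系统"}


def split_on_non_alnum(text):
    """Lowercase the text and split it wherever a non-alphanumeric character occurs."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum():  # CJK characters count as alphanumeric in Unicode
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def toy_segment(text, vocab):
    """Greedy longest-match segmentation against `vocab`, skipping punctuation."""
    text = text.lower()
    longest = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        if not text[i].isalnum():
            i += 1
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


coarse = split_on_non_alnum(SENTENCE)
segmented = toy_segment(SENTENCE, TOY_VOCAB)
print(QUERY in coarse)     # False: the keyword is buried inside one long token
print(QUERY in segmented)  # True: the keyword is a token of its own
```

With punctuation-only splitting the sentence becomes two long tokens, so a keyword lookup has nothing to match against. This is why a word-segmenting tokenizer is needed for Chinese text, and why zero results point at the tokenization step rather than at the inserted data.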

Server Setup Information

My .yml file (ENABLE_TOKENIZER_GSE is set to 'true'):

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.25.3
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_TOKENIZER_GSE: 'true' 
      CLUSTER_HOSTNAME: 'node1'

Any additional Information

My test code:

import weaviate
import time
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.exceptions import WeaviateQueryException


COLLECTION_NAME = "KeywordSearchDiagnosticTest"
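# Chinese test sentence: "Weaviate is an open-source vector database, well suited to RAG systems."
TEST_SENTENCE = "Weaviate是一个开源的向量数据库,非常适合RAG系统。"

KEYWORD_SHOULD_MATCH = "数据库"  # "database"; this keyword appears in the sentence above

KEYWORD_SHOULD_NOT_MATCH = "人工智能"  # "artificial intelligence"; this keyword does not appear in the sentence above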


def run_diagnostic():
    client = None
    try:
        # --- step 1: connect to Weaviate and delete old test collection ---
        print("--- step 1: connect to Weaviate and delete old test collection ---")
        client = weaviate.connect_to_local()
        if client.collections.exists(COLLECTION_NAME):
            client.collections.delete(COLLECTION_NAME)
            print(f"✅ delete the old test collection: '{COLLECTION_NAME}'。")
        else:
            print("✅ old test collection doesn't exist。")

        # --- step 2: create a new collection ---
        print("\n--- step 2: create a new collection ---")
        client.collections.create(
            name=COLLECTION_NAME,
            properties=[
                Property(
                    name="content",
                    data_type=DataType.TEXT,
                    tokenization=Tokenization.GSE,  # set tokenization to GSE
                    index_searchable=True          
                )
            ],
            vectorizer_config=Configure.Vectorizer.none()
        )
        print(f"✅ collection '{COLLECTION_NAME}' has been created successfully。")

        # --- step 3: add a test info ---
        print("\n--- step 3: add a test info ---")
        collection = client.collections.get(COLLECTION_NAME)
        uuid = collection.data.insert({
            "content": TEST_SENTENCE
        })
        print(f"✅ add test info successfully,UUID: {uuid}...")

        # --- step 4: wait briefly for the index to update ---
        print("\n--- step 4: wait 2s for the index to update ---")
        time.sleep(2)

        # --- step 5: start to run diagnostic task  ---
        print("\n--- step 5: start to run diagnostic task ---")
        
        # test A: get the info directly by uuid
        print("\n[test A] get info by uuid ...")
        fetched_object = collection.query.fetch_object_by_id(uuid)
        if fetched_object and fetched_object.properties.get("content") == TEST_SENTENCE:
            print("  -> ✅ get the info")
            test_a_success = True
        else:
            print("  -> ❌ cannot get the info by uuid。")
            test_a_success = False

        # test B: keyword search (should match)
        print(f"\n[test B] use keywords '{KEYWORD_SHOULD_MATCH}' to search ...")
        response_should_match = collection.query.bm25(
            query=KEYWORD_SHOULD_MATCH,
            query_properties=["content"]
        )
        if response_should_match.objects:
            print(f"  -> ✅ find {len(response_should_match.objects)} records。")
            test_b_success = True
        else:
            print(f"  -> ❌ cannot find any records。")
            test_b_success = False

        # test C: keyword search (should not match)
        print(f"\n[test C] se keywords '{KEYWORD_SHOULD_NOT_MATCH}' to search...")
        response_should_not_match = collection.query.bm25(
            query=KEYWORD_SHOULD_NOT_MATCH,
            query_properties=["content"]
        )
        if not response_should_not_match.objects:
            print(f"  -> ✅ no records found")
            test_c_success = True
        else:
            print(f"  -> ❌ find {len(response_should_not_match.objects)} records wrongly。")
            test_c_success = False

    except Exception as e:
        print(f"\n❌ meet issue or error: {e}")
    finally:
        # client stays None if connect_to_local() raised, so guard before calling methods on it
        if client is not None and client.is_connected():
            client.close()
            print("\n task completed, close the connection。")

if __name__ == "__main__":
    run_diagnostic()

My output:

[test A] get info by uuid ...
  -> ✅ get the info

[test B] use keywords '数据库' to search ...
  -> ❌ cannot find any records。

[test C] se keywords '人工智能' to search...
  -> ✅ no records found

 task completed, close the connection。

My guess is that the GSE tokenizer is not working correctly, which is why the keyword search returns no matching records. However, I don't know how to verify this assumption or how to fix the issue.

My Python version: 3.10.18
My OS: Ubuntu 24.04.2

hi @Carloszone !!

Welcome to our community :hugs: !!

You are using a very old version, 1.25.3 :grimacing:

I ran your code on a more recent version, 1.31.2, and got no issues.

Can you try changing the version to:

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.31.2
... continues ...
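After switching the image, one way to confirm what the server is actually running and which tokenization it stored for the property is to read both back through the client. The helper below is only a sketch: `describe_text_property` is a hypothetical name, and it assumes a connected weaviate-client v4 client whose `get_meta()` result contains a `version` key and whose collection config exposes `properties` with `name`, `tokenization`, and `index_searchable`.

```python
def describe_text_property(client, collection_name, property_name):
    """Return the server version plus the stored index settings of one property.

    `client` is expected to be a connected weaviate-client v4 client, for
    example the result of weaviate.connect_to_local().
    """
    meta = client.get_meta()  # server metadata dict; assumed to include "version"
    config = client.collections.get(collection_name).config.get()
    for prop in config.properties:
        if prop.name == property_name:
            tokenization = prop.tokenization
            return {
                "server_version": meta.get("version"),
                # Enum members are reduced to their string value, e.g. "gse"
                "tokenization": getattr(tokenization, "value", tokenization),
                "index_searchable": prop.index_searchable,
            }
    raise KeyError(f"property {property_name!r} not found in {collection_name!r}")


# Usage against a running instance (not executed here):
#   import weaviate
#   client = weaviate.connect_to_local()
#   try:
#       print(describe_text_property(client, "KeywordSearchDiagnosticTest", "content"))
#   finally:
#       client.close()
```

If the reported version is still 1.25.3, the container was not recreated with the new image. If the tokenization is not `gse`, the collection was created with a different setting than intended.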

Thanks!
