Cannot do keyword search for Chinese content in Python

Description

I'm seeking help with an issue regarding keyword search for Chinese text. I am using Weaviate v1.25.3 with the latest weaviate-client library in Python.

My goal is to index Chinese documents and use bm25 keyword search with the gse tokenizer. However, despite setting index_searchable=True and tokenization=Tokenization.GSE in my collection schema, keyword searches consistently return zero results. Vector search works fine, and I can fetch objects by their UUID, which confirms the data is being inserted correctly. The issue seems to be specific to the keyword search/tokenization functionality.

Server Setup Information

My .yml file (with ENABLE_TOKENIZER_GSE set to 'true'):

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.25.3
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_TOKENIZER_GSE: 'true' 
      CLUSTER_HOSTNAME: 'node1'

Any additional Information

My test code:

import weaviate
import time
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.exceptions import WeaviateQueryException


COLLECTION_NAME = "KeywordSearchDiagnosticTest"
TEST_SENTENCE = "Weaviate是一个开源的向量数据库,非常适合RAG系统。"  # Chinese content

KEYWORD_SHOULD_MATCH = "数据库"  # This keyword definitely appears in the sentence above

KEYWORD_SHOULD_NOT_MATCH = "人工智能"  # This keyword does not appear in the sentence above


def run_diagnostic():
    client = None
    try:
        # --- step 1: connect to Weaviate and delete old test collection ---
        print("--- step 1: connect to Weaviate and delete old test collection ---")
        client = weaviate.connect_to_local()
        if client.collections.exists(COLLECTION_NAME):
            client.collections.delete(COLLECTION_NAME)
            print(f"✅ delete the old test collection: '{COLLECTION_NAME}'。")
        else:
            print(f"✅ old test collection doesn't exist。")

        # --- step 2: create a new Collection ---
        print("\n--- step 2: create a new Collection ---")
        client.collections.create(
            name=COLLECTION_NAME,
            properties=[
                Property(
                    name="content",
                    data_type=DataType.TEXT,
                    tokenization=Tokenization.GSE,  # set tokenization to GSE
                    index_searchable=True          
                )
            ],
            vectorizer_config=Configure.Vectorizer.none()
        )
        print(f"✅ collection '{COLLECTION_NAME}' has been created successfully。")

        # --- step 3: add a test info ---
        print("\n--- step 3: add a test info ---")
        collection = client.collections.get(COLLECTION_NAME)
        uuid = collection.data.insert({
            "content": TEST_SENTENCE
        })
        print(f"✅ add test info successfully,UUID: {uuid}...")

        # --- step4: wait for a while for index update ---
        print("\n---  step4: wait for 2s for index update ---")
        time.sleep(2)

        # --- step 5: start to run diagnostic task  ---
        print("\n--- step 5: start to run diagnostic task ---")
        
        # test A: get the info directly by uuid
        print("\n[test A] get info by uuid ...")
        fetched_object = collection.query.fetch_object_by_id(uuid)
        if fetched_object and fetched_object.properties.get("content") == TEST_SENTENCE:
            print("  -> ✅ get the info")
            test_a_success = True
        else:
            print("  -> ❌ cannot get the info by uuid。")
            test_a_success = False

        # test B: keyword search (should match)
        print(f"\n[test B] use keywords '{KEYWORD_SHOULD_MATCH}' to search ...")
        response_should_match = collection.query.bm25(
            query=KEYWORD_SHOULD_MATCH,
            query_properties=["content"]
        )
        if response_should_match.objects:
            print(f"  -> ✅ find {len(response_should_match.objects)} records。")
            test_b_success = True
        else:
            print(f"  -> ❌ cannot find any records。")
            test_b_success = False

        # test C: keyword search (should return no results)
        print(f"\n[test C] se keywords '{KEYWORD_SHOULD_NOT_MATCH}' to search...")
        response_should_not_match = collection.query.bm25(
            query=KEYWORD_SHOULD_NOT_MATCH,
            query_properties=["content"]
        )
        if not response_should_not_match.objects:
            print(f"  -> ✅ no records found")
            test_c_success = True
        else:
            print(f"  -> ❌ find {len(response_should_not_match.objects)} records wrongly。")
            test_c_success = False

    except Exception as e:
        print(f"\n❌ meet issue or error: {e}")
    finally:
        # client stays None if the connection attempt itself failed
        if client is not None and client.is_connected():
            client.close()
            print("\n task completed, close the connection。")

if __name__ == "__main__":
    run_diagnostic()

My output:

[test A] get info by uuid ...
  -> ✅ get the info

[test B] use keywords '数据库' to search ...
  -> ❌ cannot find any records。

[test C] se keywords '人工智能' to search...
  -> ✅ no records found

 task completed, close the connection。

My guess is that the GSE tokenizer is not working correctly, which is why no matching records come back from the collection. However, I don't know how to test this assumption or how to fix the issue.

My python ver: 3.10.18
My OS: Ubuntu 24.04.2
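
One way to test the assumption above might be to ask the server what it actually stored: its version and the tokenization recorded for the `content` property. The following is a minimal sketch, assuming the v4 Python client's `client.get_meta()` and `collection.config.get()` calls; `describe_property` and `inspect_collection` are hypothetical helper names, not part of the library.

```python
def describe_property(prop):
    """Summarize one property config as a single readable line."""
    # Tokenization may be an enum member or a plain string; show its raw value.
    tokenization = getattr(prop.tokenization, "value", prop.tokenization)
    return (
        f"{prop.name}: tokenization={tokenization}, "
        f"index_searchable={prop.index_searchable}"
    )


def inspect_collection(name):
    """Print the server version and the per-property index settings."""
    import weaviate  # imported here so the helper above works without the client

    client = weaviate.connect_to_local()
    try:
        # get_meta() returns a dict of server metadata, including the version
        print("server version:", client.get_meta().get("version"))
        config = client.collections.get(name).config.get()
        for prop in config.properties:
            print(describe_property(prop))
    finally:
        client.close()
```

Calling `inspect_collection("KeywordSearchDiagnosticTest")` against the running instance should print the server version and one line per property. If the tokenization shown for `content` is not `gse`, the schema setting did not take effect on the server.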

hi @Carloszone !!

Welcome to our community :hugs: !!

You are using a very old version, 1.25.3 :grimacing:

I ran your code on the latest version, 1.31.2, and had no issues.

Can you try changing the version to:

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.31.2
... continues ...

Thanks!


Hi @DudaNogueira! Could you find time to look at the issue "GSE.CutAll not work well for some Chinese text" (Issue #6115, weaviate/weaviate on GitHub) and the pull request "Better tokenize the Chinese word, especially if the word is not in dictionary, for example, people's name" by smoothdvd (Pull Request #8025, weaviate/weaviate on GitHub)?
The main issue is that CutAll does not tokenize Chinese well unless the relevant words are in the dictionary (and there is no way to update the dictionary at runtime through the Weaviate SDK).
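
To illustrate why dictionary coverage matters, here is a toy forward-maximum-matching segmenter. It is not gse's actual algorithm or its CutAll mode, only a sketch of the general idea that a dictionary-based tokenizer falls back to single characters for words it does not know. The `segment` function and the small dictionary are made up for illustration.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching.

    At each position take the longest dictionary word that matches;
    if none matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical, tiny dictionary used only for this illustration.
small_dict = {"开源", "向量", "数据库"}

print(segment("开源的向量数据库", small_dict))  # ['开源', '的', '向量', '数据库']
print(segment("张伟", small_dict))              # ['张', '伟'] (name not in dictionary)
```

In this toy model, a word that is in the dictionary survives as one token, while an out-of-dictionary word such as a person's name is broken into individual characters.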


hi @gfwgfw !!

I have raised it to our team! Thanks!!!

@DudaNogueira Thanks a lot!
