Cannot do keywords search for Chinese content in Python

Carloszone · June 23, 2025, 9:31am

Description

I'm seeking help with an issue regarding keyword search for Chinese text. I am using Weaviate v1.25.3 with the latest weaviate-client library in Python.

My goal is to index Chinese documents and use bm25 keyword search with the gse tokenizer. However, despite setting index_searchable=True and tokenization=Tokenization.GSE in my collection schema, keyword searches consistently return zero results. Vector search works fine, and I can fetch objects by their UUID, which confirms the data is being inserted correctly. The issue seems to be specific to the keyword search/tokenization functionality.

Server Setup Information

My .yml file (with ENABLE_TOKENIZER_GSE set to 'true'):

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.25.3
    ports:
      - "8080:8080"
      - "50051:50051"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_TOKENIZER_GSE: 'true' 
      CLUSTER_HOSTNAME: 'node1'

Any additional Information

My test code:

import weaviate
import time
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.exceptions import WeaviateQueryException


COLLECTION_NAME = "KeywordSearchDiagnosticTest"
TEST_SENTENCE = "Weaviate是一个开源的向量数据库，非常适合RAG系统。"  # Chinese content

KEYWORD_SHOULD_MATCH = "数据库"  # this keyword definitely appears in the sentence above

KEYWORD_SHOULD_NOT_MATCH = "人工智能" # this keyword is not in the above content


def run_diagnostic():
    client = None
    try:
        # --- step 1: connect to Weaviate and delete old test collection ---
        print("--- step 1: connect to Weaviate and delete old test collection ---")
        client = weaviate.connect_to_local()
        if client.collections.exists(COLLECTION_NAME):
            client.collections.delete(COLLECTION_NAME)
            print(f"✅ delete the old test collection: '{COLLECTION_NAME}'。")
        else:
            print("✅ old test collection doesn't exist。")

        # --- step 2: create a new Collection ---
        print("\n--- step 2: create a new Collection ---")
        client.collections.create(
            name=COLLECTION_NAME,
            properties=[
                Property(
                    name="content",
                    data_type=DataType.TEXT,
                    tokenization=Tokenization.GSE,  # set tokenization to GSE
                    index_searchable=True          
                )
            ],
            vectorizer_config=Configure.Vectorizer.none()
        )
        print(f"✅ collection '{COLLECTION_NAME}' has been created successfully。")

        # --- step 3: add a test info ---
        print("\n--- step 3: add a test info ---")
        collection = client.collections.get(COLLECTION_NAME)
        uuid = collection.data.insert({
            "content": TEST_SENTENCE
        })
        print(f"✅ add test info successfully，UUID: {uuid}...")

        # --- step 4: wait for a while for index update ---
        print("\n--- step 4: wait for 2s for index update ---")
        time.sleep(2)

        # --- step 5: start to run diagnostic task  ---
        print("\n--- step 5: start to run diagnostic task ---")
        
        # test A: get the info directly by uuid
        print("\n[test A] get info by uuid ...")
        fetched_object = collection.query.fetch_object_by_id(uuid)
        if fetched_object and fetched_object.properties.get("content") == TEST_SENTENCE:
            print("  -> ✅ get the info")
            test_a_success = True
        else:
            print("  -> ❌ cannot get the info by uuid。")
            test_a_success = False

        # test B: keyword search (should match)
        print(f"\n[test B] use keywords '{KEYWORD_SHOULD_MATCH}' to search ...")
        response_should_match = collection.query.bm25(
            query=KEYWORD_SHOULD_MATCH,
            query_properties=["content"]
        )
        if response_should_match.objects:
            print(f"  -> ✅ find {len(response_should_match.objects)} records。")
            test_b_success = True
        else:
            print(f"  -> ❌ cannot find any records。")
            test_b_success = False

        # test C: keyword search (should not match)
        print(f"\n[test C] se keywords '{KEYWORD_SHOULD_NOT_MATCH}' to search...")
        response_should_not_match = collection.query.bm25(
            query=KEYWORD_SHOULD_NOT_MATCH,
            query_properties=["content"]
        )
        if not response_should_not_match.objects:
            print(f"  -> ✅ no records found")
            test_c_success = True
        else:
            print(f"  -> ❌ find {len(response_should_not_match.objects)} records wrongly。")
            test_c_success = False

    except Exception as e:
        print(f"\n❌ meet issue or error: {e}")
    finally:
        # client stays None if connect_to_local() raised, so guard before calling methods on it
        if client is not None and client.is_connected():
            client.close()
            print("\n task completed, close the connection。")

if __name__ == "__main__":
    run_diagnostic()

My output:

[test A] get info by uuid ...
  -> ✅ get the info

[test B] use keywords '数据库' to search ...
  -> ❌ cannot find any records。

[test C] se keywords '人工智能' to search...
  -> ✅ no records found

 task completed, close the connection。

My guess is that the GSE tokenizer is not working correctly, which is why no matching records are returned from the collection. However, I don't know how to test this assumption or how to fix the issue.

My python ver: 3.10.18
My OS: Ubuntu 24.04.2
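
One way to reason about the guess above, as a rough sketch rather than Weaviate's actual implementation: if the text is not segmented by GSE and is instead split only on non-alphanumeric characters (roughly how Weaviate's `word` tokenization is documented to behave), the Chinese sentence collapses into two long tokens, and a BM25 query for "数据库" has nothing to match. The helper `approx_word_tokenize` is hypothetical and exists only for this illustration; whether this is what happened on 1.25.3 is not confirmed in this thread.

```python
def approx_word_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on every non-alphanumeric character.

    Hypothetical approximation of a 'word'-style tokenizer, for illustration only.
    """
    tokens: list[str] = []
    current: list[str] = []
    for ch in text.lower():
        if ch.isalnum():  # CJK characters count as alphanumeric in Python
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sentence = "Weaviate是一个开源的向量数据库，非常适合RAG系统。"
tokens = approx_word_tokenize(sentence)
print(tokens)             # only the full-width comma and period split the sentence
print("数据库" in tokens)  # the keyword is buried inside a longer token
```

If the real index behaves like this, zero BM25 hits would be the expected result. Reading the stored schema back, for example with `collection.config.get()` in the v4 Python client, should show which tokenization the server actually applied to the `content` property.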

DudaNogueira · June 24, 2025, 3:20pm

Hi @Carloszone!

Welcome to our community!

You are using a very old version, 1.25.3.

I ran your code on a recent version, 1.31.2, and had no issues.

Can you try changing the version to:

version: '3.4'

services:
  weaviate:
    image: semitechnologies/weaviate:1.31.2
... continues ...
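
After changing the image tag, it can help to confirm that the running server really is the newer version. The sketch below only compares version strings; the helper names `parse_version` and `is_older_than` are made up for this example, and the server version itself would have to come from the client (in the v4 Python client, `client.get_meta()` returns a dict that includes a "version" entry). The reference version 1.31.2 is simply the one reported to work in this reply, not a known minimum.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.25.3' (optionally 'v1.25.3' or '1.25.3-rc.0') into (1, 25, 3)."""
    core = version.removeprefix("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))


def is_older_than(server_version: str, reference: str) -> bool:
    """Compare versions numerically; plain string comparison breaks on '1.9' vs '1.31'."""
    return parse_version(server_version) < parse_version(reference)


# Hypothetical usage: pass in the "version" value reported by the server.
print(is_older_than("1.25.3", "1.31.2"))
print(is_older_than("1.31.2", "1.31.2"))
```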

Thanks!

gfwgfw · June 30, 2025, 6:11am

Hi @DudaNogueira! Could you find time to look at the issue "GSE.CutAll not work well for some Chinese text" (Issue #6115, weaviate/weaviate on GitHub) and the pull request "Better tokenize the Chinese word, especially if the word is not in dictionary, for example, people's name" by smoothdvd (Pull Request #8025, weaviate/weaviate on GitHub)?
The main problem is that CutAll does not tokenize Chinese well unless the words in question are already in the dictionary, and there is no way to update the dictionary at runtime through the Weaviate SDK.
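
To make the dictionary dependence concrete, here is a toy "cut all" segmenter. It is not gse's algorithm, and the dictionary, the sample sentence, and the function name `toy_cut_all` are all invented for illustration. It emits every dictionary word found in the text and falls back to single characters elsewhere, so a personal name that is missing from the dictionary never becomes a searchable token.

```python
def toy_cut_all(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Return every dictionary word found in text, then uncovered single characters."""
    tokens: list[str] = []
    covered = [False] * len(text)
    for start in range(len(text)):
        # Try every candidate word of length 2..max_len beginning at this position.
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                tokens.append(word)
                for i in range(start, end):
                    covered[i] = True
    # Characters not covered by any dictionary word become single-character tokens.
    tokens.extend(ch for ch, hit in zip(text, covered) if not hit)
    return tokens


dictionary = {"使用", "向量", "数据", "数据库"}
text = "王小明使用向量数据库"  # "王小明" is a made-up personal name

without_name = toy_cut_all(text, dictionary)
with_name = toy_cut_all(text, dictionary | {"王小明"})
print("王小明" in without_name)  # name is split into single characters
print("王小明" in with_name)     # name survives once it is in the dictionary
```

In this toy model, a keyword query for the full name only matches after the name is added to the dictionary, which is why the inability to extend the dictionary at runtime matters.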

DudaNogueira · July 1, 2025, 7:14pm

hi @gfwgfw !!

I have raised it to our team! Thanks!!!

gfwgfw · July 2, 2025, 12:41am

@DudaNogueira Thanks a lot!
