Panic when using GSE tokenizer with replication (nil Segmenter, EOF on /replicas/…:commit) in v1.34.10

build_wv_version: 1.34.10
build_image_tag: v1.34.10
build_go_version: go1.24.11
Docker image: semitechnologies/weaviate:latest (resolves to 1.34.10)


Deployment

  • 3‑node Docker Compose cluster (multi‑node with replication)
  • Example node environment (simplified):
environment:
  RAFT_JOIN: node1,node2,node3
  DEFAULT_VECTORIZER_MODULE: text2vec-ollama
  LOG_LEVEL: debug
  ENABLE_TOKENIZER_GSE: 'true'
volumes:
  - ./data-node-3:/home/vector/lib/weaviate
ports:
  - "8082:8080"
  - "6062:6060"
  - "50015:50051"
  • Internal Docker network: 192.168.16.0/20
  • Internal data ports (cluster‑internal API) like 7103, 7105, 7111 are used for /replicas/...:commit calls.

Schema / collection definition

Using Python client v4 to create a collection Test_media with:

  • Replication factor: 3
  • Multiple named vectors (text2vec‑ollama)
  • Several Chinese text fields using GSE tokenization
from weaviate.classes.config import (
    Configure,
    Property,
    DataType,
    Tokenization,
    VectorDistances,
    VectorFilterStrategy,
)

client.collections.create(
    "Test_media",
    replication_config=Configure.replication(
        factor=3,
    ),
    vector_config=[
        Configure.Vectors.text2vec_ollama(
            name="title",
            api_endpoint=ollama_api_endpoint,
            model="quentinz/bge-large-zh-v1.5:latest",
            source_properties=["title"],
            quantizer=Configure.VectorIndex.Quantizer.rq(),
            vector_index_config=Configure.VectorIndex.hnsw(
                ef_construction=300,
                distance_metric=VectorDistances.COSINE,
                filter_strategy=VectorFilterStrategy.ACORN,
            ),
        ),
        # additional named vectors elided (example is simplified)
    ],
    properties=[
        Property(name="title",   data_type=DataType.TEXT, tokenization=Tokenization.GSE),
        Property(name="location",     data_type=DataType.TEXT_ARRAY, skip_vectorization=True, tokenization=Tokenization.FIELD),
        Property(name="p_id",         data_type=DataType.TEXT,       skip_vectorization=True),
    ],
)

GSE is used as the tokenizer for several Chinese text fields. GSE is documented as a language‑specific tokenizer for Japanese/Chinese, enabled via ENABLE_TOKENIZER_GSE. [Language-specific tokenization]


Steps to reproduce

  1. Start a 3‑node Docker Compose Weaviate cluster with replication and ENABLE_TOKENIZER_GSE=true.
  2. Create the Test_media collection as above.
  3. Insert objects via POST /v1/objects?consistency_level=ONE (or the Python client v4), where the GSE‑tokenized properties (title in the simplified schema above; content, summary, catalog in the full schema) contain Chinese text.
  4. During import, observe:
    • 500 errors on the client;
    • EOF errors on internal /replicas/...:commit calls;
    • panics in the logs on the replica nodes.
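A minimal reproduction sketch of the insert loop, assuming the 3‑node cluster above is running, weaviate-client v4 is installed, and Test_media was created as shown (the helper and function names here are mine, for illustration):

```python
# Hedged reproduction sketch; run_repro() must be called manually against the
# cluster described above and is not executed on import.
def make_test_objects(n: int) -> list[dict]:
    """Build sample objects whose `title` holds Chinese text, which is what
    drives inserts through the GSE tokenization path on the replicas."""
    return [
        {"title": f"测试标题 {i}", "location": ["北京"], "p_id": str(i)}
        for i in range(n)
    ]

def run_repro() -> None:
    # Network part; requires `pip install weaviate-client` and the cluster up.
    import weaviate

    with weaviate.connect_to_local(port=8080) as client:
        media = client.collections.get("Test_media")
        for obj in make_test_objects(100):
            media.data.insert(obj)  # 500s / replica EOFs show up here
```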

Observed behavior

Client‑side error example

Object was not added! Unexpected status code: 500, with response body:
{
  "error": [{
    "message": "put object: import into index test_media: replicate insertion:
      shard=\"d2X2cUCHZ3WK\": 192.168.16.4:7111: connect:
      Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF"
  }]
}

Similar errors appear for other shards and nodes, always on internal /replicas/indices/Test_media/shards/<shard>:commit calls to the cluster‑internal data port (e.g. 7111). [Multi-node config]

Server‑side logs around the error

{"class":"Test_media","level":"error",
 "msg":"192.168.16.4:7111: connect: Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF",
 "op":"put","shard":"d2X2cUCHZ3WK"}
{"action":"requests_total","api":"rest","class_name":"Test_media","level":"error",
 "error":"put object: import into index test_media: replicate insertion: shard=\"d2X2cUCHZ3WK\": 192.168.16.4:7111: connect: Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF",
 "msg":"unexpected error","query_type":"objects"}

Panic stack trace on the replica node

2026/01/20 09:26:32 http: panic serving 192.168.16.4:51360: runtime error: invalid memory address or nil pointer dereference
goroutine 471281 [running]:
net/http.(*conn).serve.func1()
        net/http/server.go:1947 +0xbe
panic({0x2e39640?, 0x7dd97c0?})
        runtime/panic.go:792 +0x132
github.com/go-ego/gse.(*Segmenter).Find(...)
        github.com/go-ego/gse@v0.80.3/dag.go:40
github.com/go-ego/gse.(*Segmenter).getDag(0x0, {0xc002741d10, 0x17, 0x20})
        github.com/go-ego/gse@v0.80.3/dag.go:142 +0x1f1
github.com/go-ego/gse.(*Segmenter).cutAll(0x0, {0xc0017eab70, 0x2f})
        github.com/go-ego/gse@v0.80.3/dag.go:318 +0xd5
github.com/go-ego/gse.(*Segmenter).CutAll(...)
        github.com/go-ego/gse@v0.80.3/gse.go:92
github.com/weaviate/weaviate/entities/tokenizer.tokenizeGSE({0xc0017eab70, 0x2f})
        github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:278 +0xea
github.com/weaviate/weaviate/entities/tokenizer.Tokenize({0xc001104b69?, 0x49c90e0?}, {0xc0017eab70, 0x2f})
        github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:163 +0xbe
github.com/weaviate/weaviate/entities/tokenizer.TokenizeForClass({0xc001104b69, 0x3}, {0xc0017eab70, 0x2f}, {0xc0013c42e0?, 0x6?})
        github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:144 +0x24e
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).TextArray(0xc0027428d0, {0xc001104b69, 0x3}, {0xc002742480?, ...})
        github.com/weaviate/weaviate/adapters/repos/db/inverted/analyzer.go:81 +0xfd
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).Text(...)
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).analyzePrimitiveProp(...)
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).Object(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).AnalyzeObject(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putOne(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).preparePutObject.func1(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).commitReplication(...)
github.com/weaviate/weaviate/adapters/repos/db.(*LazyLoadShard).commitReplication(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Index).CommitReplication(...)
github.com/weaviate/weaviate/usecases/replica.(*RemoteReplicaIncoming).CommitReplication(...)
github.com/weaviate/weaviate/adapters/handlers/rest/clusterapi.(*replicatedIndices).executeCommitPhase.func12(...)
...
net/http.(*conn).serve(...)

Key points from the stack trace:

  • github.com/go-ego/gse.(*Segmenter).getDag(0x0, ...) — the receiver is 0x0, i.e. the *gse.Segmenter is nil.
  • The nil segmenter is reached via TokenizeForClass → Tokenize → tokenizeGSE while the inverted index analyzer runs inside Shard.commitReplication.
  • The panic happens while handling the internal /replicas/...:commit request, which explains the EOF seen by the coordinator node.
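The failure mode can be illustrated with a minimal Python analogue (the real code is Go in entities/tokenizer; the class and behavior here are assumptions for illustration only): if dictionary loading fails, the segmenter is left unset, and the first tokenize call on the commit path dereferences it.

```python
class GSETokenizer:
    """Illustrative stand-in for the GSE tokenizer wiring; not Weaviate code."""

    def __init__(self, segmenter=None):
        # If dictionary loading failed, segmenter stays None -- the analogue
        # of the nil *gse.Segmenter seen in the stack trace.
        self._segmenter = segmenter

    def tokenize(self, text: str) -> list[str]:
        if self._segmenter is None:
            # Fail fast with a descriptive error instead of crashing the
            # replication-commit handler with a nil dereference.
            raise RuntimeError(
                "GSE segmenter not initialized (dictionaries not loaded?)"
            )
        return self._segmenter.cut_all(text)
```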

Expected behavior

  • When tokenization=GSE is configured on TEXT properties and ENABLE_TOKENIZER_GSE=true is set, GSE should initialize correctly and tokenize text for inverted index / BM25 as documented. [Language-specific tokenization]
  • Replication commits (/replicas/...:commit) should not panic, and clients should not see 500 errors due to EOF on internal replica calls.

Actual behavior

  • With tokenization=GSE on Chinese text fields in a replicated, multi‑node cluster:
    • Replica nodes panic in tokenizeGSE because Segmenter is nil.
    • Coordinator nodes see EOF on /replicas/indices/<class>/shards/<shard>:commit calls to the cluster‑internal data port.
    • Clients receive 500 errors and objects are not added.

Request

  • Please confirm whether this is the same underlying problem as the known GSE dictionary loading issue in 1.34.x, or a separate bug in GSE initialization.
  • Ideally:
    • Ensure GSE dictionaries are loaded correctly in 1.34.x+;
    • Or, if initialization fails, avoid leaving Segmenter as nil so that tokenizeGSE does not panic in the replication commit path.
  • I can provide full startup logs (including any Could not load dictionaries messages) and a minimal Docker Compose + Python script to reproduce, if needed.
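For collecting those diagnostics, a small helper that scans captured startup logs for the "Could not load dictionaries" message mentioned above (exact wording may differ between versions; adjust the pattern to your logs):

```python
import re

# Case-insensitive match on the message fragment; the exact phrasing of the
# dictionary-loading warning is an assumption based on the report above.
DICT_PATTERN = re.compile(r"could not load dictionar", re.IGNORECASE)

def find_dictionary_errors(log_text: str) -> list[str]:
    """Return log lines suggesting GSE dictionary loading failed."""
    return [line for line in log_text.splitlines() if DICT_PATTERN.search(line)]
```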

Hey @longfei_yao,

You should definitely open a GitHub issue for this. Could you please copy your detailed report and stack trace into a new issue here: https://github.com/weaviate/weaviate/issues, and include the startup logs as well?

The stack trace confirms a nil pointer dereference panic.

Best regards,
Mohamed Shahin
Weaviate Admin
(Ireland, UTC±00:00 / +01:00)
