build_wv_version: 1.34.10
build_image_tag: v1.34.10
build_go_version: go1.24.11
Docker image: semitechnologies/weaviate:latest (resolves to 1.34.10)
Deployment
- 3‑node Docker Compose cluster (multi‑node with replication)
- Example node environment (simplified):
environment:
RAFT_JOIN: node1,node2,node3
DEFAULT_VECTORIZER_MODULE: text2vec-ollama
LOG_LEVEL: debug
ENABLE_TOKENIZER_GSE: 'true'
volumes:
- ./data-node-3:/home/vector/lib/weaviate
ports:
- "8082:8080"
- "6062:6060"
- "50015:50051"
- Internal Docker network: 192.168.16.0/20
- Internal data ports (cluster-internal API) such as 7103, 7105, and 7111 are used for `/replicas/...:commit` calls.
Schema / collection definition
Using Python client v4 to create a collection Test_media with:
- Replication factor: 3
- Multiple named vectors (text2vec‑ollama)
- Several Chinese text fields using GSE tokenization
from weaviate.classes.config import (
Configure,
Property,
DataType,
Tokenization,
VectorDistances,
VectorFilterStrategy,
)
client.collections.create(
"Test_media",
replication_config=Configure.replication(
factor=3,
),
vector_config=[
Configure.Vectors.text2vec_ollama(
name="title",
api_endpoint=ollama_api_endpoint,
model="quentinz/bge-large-zh-v1.5:latest",
source_properties=["title"],
quantizer=Configure.VectorIndex.Quantizer.rq(),
vector_index_config=Configure.VectorIndex.hnsw(
ef_construction=300,
distance_metric=VectorDistances.COSINE,
filter_strategy=VectorFilterStrategy.ACORN,
),
),
],
properties=[
Property(name="title", data_type=DataType.TEXT, tokenization=Tokenization.GSE),
Property(name="location", data_type=DataType.TEXT_ARRAY, skip_vectorization=True, tokenization=Tokenization.FIELD),
Property(name="p_id", data_type=DataType.TEXT, skip_vectorization=True),
],
)
GSE is used as the tokenizer for several Chinese text fields. GSE is documented as a language‑specific tokenizer for Japanese/Chinese, enabled via ENABLE_TOKENIZER_GSE. [Language-specific tokenization]
Steps to reproduce
- Start a 3-node Docker Compose Weaviate cluster with replication and `ENABLE_TOKENIZER_GSE=true`.
- Create the `Test_media` collection as above.
- Insert objects via `POST /v1/objects?consistency_level=ONE` (or Python client v4), where `title`, `content`, `summary`, `catalog` contain Chinese text.
- During import, observe:
  - 500 errors on the client;
  - EOF errors on internal `/replicas/...:commit` calls;
  - panics in the logs on the replica nodes.
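The insert step above can be sketched as follows. The collection name, endpoint, and query parameter come from this report; the host, port, and property values are placeholders:

```python
import json

# Build the body for POST /v1/objects?consistency_level=ONE.
# "Test_media" and the property names come from the schema above;
# the title/p_id values are illustrative.
def build_object_payload(title: str, p_id: str) -> dict:
    return {
        "class": "Test_media",
        "properties": {"title": title, "p_id": p_id},
    }

payload = build_object_payload("上海商业数据分析报告", "p-0001")
body = json.dumps(payload, ensure_ascii=False)

# Against a live cluster (not run here):
# import requests
# requests.post(
#     "http://localhost:8080/v1/objects?consistency_level=ONE",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
print(body)
```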
Observed behavior
Client‑side error example
Object was not added! Unexpected status code: 500, with response body:
{
"error": [{
"message": "put object: import into index test_media: replicate insertion:
shard=\"d2X2cUCHZ3WK\": 192.168.16.4:7111: connect:
Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF"
}]
}
Similar errors appear for other shards and nodes, always on internal /replicas/indices/Test_media/shards/<shard>:commit calls to the cluster‑internal data port (e.g. 7111). [Multi-node config]
Server‑side logs around the error
{"class":"Test_media","level":"error",
"msg":"192.168.16.4:7111: connect: Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF",
"op":"put","shard":"d2X2cUCHZ3WK"}
{"action":"requests_total","api":"rest","class_name":"Test_media","level":"error",
"error":"put object: import into index test_media: replicate insertion: shard=\"d2X2cUCHZ3WK\": 192.168.16.4:7111: connect: Post \"http://192.168.16.4:7111/replicas/indices/Test_media/shards/d2X2cUCHZ3WK:commit?request_id=node1-01-19bdaba28e0-28\": EOF",
"msg":"unexpected error","query_type":"objects"}
Panic stack trace on the replica node
2026/01/20 09:26:32 http: panic serving 192.168.16.4:51360: runtime error: invalid memory address or nil pointer dereference
goroutine 471281 [running]:
net/http.(*conn).serve.func1()
net/http/server.go:1947 +0xbe
panic({0x2e39640?, 0x7dd97c0?})
runtime/panic.go:792 +0x132
github.com/go-ego/gse.(*Segmenter).Find(...)
github.com/go-ego/gse@v0.80.3/dag.go:40
github.com/go-ego/gse.(*Segmenter).getDag(0x0, {0xc002741d10, 0x17, 0x20})
github.com/go-ego/gse@v0.80.3/dag.go:142 +0x1f1
github.com/go-ego/gse.(*Segmenter).cutAll(0x0, {0xc0017eab70, 0x2f})
github.com/go-ego/gse@v0.80.3/dag.go:318 +0xd5
github.com/go-ego/gse.(*Segmenter).CutAll(...)
github.com/go-ego/gse@v0.80.3/gse.go:92
github.com/weaviate/weaviate/entities/tokenizer.tokenizeGSE({0xc0017eab70, 0x2f})
github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:278 +0xea
github.com/weaviate/weaviate/entities/tokenizer.Tokenize({0xc001104b69?, 0x49c90e0?}, {0xc0017eab70, 0x2f})
github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:163 +0xbe
github.com/weaviate/weaviate/entities/tokenizer.TokenizeForClass({0xc001104b69, 0x3}, {0xc0017eab70, 0x2f}, {0xc0013c42e0?, 0x6?})
github.com/weaviate/weaviate/entities/tokenizer/tokenizer.go:144 +0x24e
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).TextArray(0xc0027428d0, {0xc001104b69, 0x3}, {0xc002742480?, ...})
github.com/weaviate/weaviate/adapters/repos/db/inverted/analyzer.go:81 +0xfd
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).Text(...)
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).analyzePrimitiveProp(...)
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*Analyzer).Object(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).AnalyzeObject(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putOne(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).preparePutObject.func1(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).commitReplication(...)
github.com/weaviate/weaviate/adapters/repos/db.(*LazyLoadShard).commitReplication(...)
github.com/weaviate/weaviate/adapters/repos/db.(*Index).CommitReplication(...)
github.com/weaviate/weaviate/usecases/replica.(*RemoteReplicaIncoming).CommitReplication(...)
github.com/weaviate/weaviate/adapters/handlers/rest/clusterapi.(*replicatedIndices).executeCommitPhase.func12(...)
...
net/http.(*conn).serve(...)
Key points from the stack trace:
- `github.com/go-ego/gse.(*Segmenter).getDag(0x0, ...)` → the `Segmenter` is nil.
- Called via `tokenizeGSE` → `TokenizeForClass` → inverted index analyzer → `Shard.commitReplication`.
- The panic happens while handling the internal `/replicas/...:commit` request, which explains the EOF seen by the coordinator node.
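The failure mode can be mimicked outside of Go. A minimal Python analogue (all names hypothetical) shows why a tokenizer slot left unset at startup only blows up later, inside whichever request happens to tokenize first:

```python
# Hypothetical analogue: dictionary loading fails at startup, but the
# registry still holds an entry for "gse" -- just a None (nil) segmenter.
_segmenters = {"gse": None}

def tokenize_gse(text: str):
    seg = _segmenters["gse"]
    # Like the Go call chain tokenizeGSE -> (*Segmenter).CutAll on a nil
    # receiver, this only fails when the first tokenization actually runs.
    return seg.cut_all(text)

result = None
try:
    tokenize_gse("商业数据")
except AttributeError as exc:
    # In Weaviate the equivalent panic aborts the in-flight
    # /replicas/...:commit request, which the coordinator observes as EOF.
    result = f"tokenize failed: {exc}"

print(result)
```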
Expected behavior
- When `tokenization=GSE` is configured on TEXT properties and `ENABLE_TOKENIZER_GSE=true` is set, GSE should initialize correctly and tokenize text for the inverted index / BM25 as documented. [Language-specific tokenization]
- Replication commits (`/replicas/...:commit`) should not panic, and clients should not see 500 errors due to EOF on internal replica calls.
Actual behavior
- With `tokenization=GSE` on Chinese text fields in a replicated, multi-node cluster:
  - Replica nodes panic in `tokenizeGSE` because the `Segmenter` is nil.
  - Coordinator nodes see EOF on `/replicas/indices/<class>/shards/<shard>:commit` calls to the cluster-internal data port.
  - Clients receive 500 errors and objects are not added.
Request
- Please confirm whether this is the same underlying problem as the known GSE dictionary loading issue in 1.34.x, or a separate bug in GSE initialization.
- Ideally:
- Ensure GSE dictionaries are loaded correctly in 1.34.x+;
- Or, if initialization fails, avoid leaving the `Segmenter` as nil so that `tokenizeGSE` does not panic in the replication commit path.
- I can provide full startup logs (including any `Could not load dictionaries` messages) and a minimal Docker Compose + Python script to reproduce, if needed.
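The second option could look roughly like the guard below. This is a sketch of the requested behavior, not Weaviate's actual code; the function and exception names are hypothetical:

```python
class TokenizerNotInitialized(RuntimeError):
    """Raised when a tokenizer is requested but never finished initializing."""

def tokenize_for_class(seg, text: str):
    # Fail with a descriptive error instead of dereferencing a nil/None
    # segmenter deep inside the replication commit path.
    if seg is None:
        raise TokenizerNotInitialized(
            "GSE segmenter is not initialized; check ENABLE_TOKENIZER_GSE "
            "and the dictionary-loading logs at startup"
        )
    return seg.cut_all(text)

msg = None
try:
    tokenize_for_class(None, "商业数据")  # simulate the failed-init case
except TokenizerNotInitialized as exc:
    msg = str(exc)

print("graceful error:", msg)
```

A descriptive error like this would surface as a 500 with a clear message at the coordinator rather than a panic, a dropped connection, and an opaque EOF.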