Helm cluster, node spontaneously stops and restarts, shard unavailable for 1 min

Using the Helm chart to deploy 3 nodes to a self-hosted K8s cluster.

Tried Node.js weaviate-ts-client 1.4 and 2.0; same behaviour.

While ingesting data and running NN (nearest-neighbour) queries, a node randomly disappears and starts again (sometimes 2 h between occurrences, sometimes 10 minutes).

K8s doesn’t restart the node - the pod restart count is 0.

The node is unavailable to the client for about 1 minute, at which point (after retrying, roughly as sketched below) it becomes available again and ingestion/NN continues as normal.
The frequency of restarts increases over time.
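For context, the retry around our client calls looks roughly like the sketch below (an illustration only; the helper name, attempt count and delay are made up, not our exact code):

```typescript
// Minimal illustration of the retry we wrap around client calls
// (hypothetical helper; attempt count and delay are made up, not our real values).
async function withRetry<T>(fn: () => Promise<T>, attempts = 10, delayMs = 10_000): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(); // succeeds once the node is reachable again
    } catch (err) {
      lastErr = err; // node temporarily unavailable; wait and try again
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastErr; // give up after roughly attempts * delayMs
}
```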

Weaviate version: 1.22.3
Transformer: None (client supplies vectors)
Shards: 3, each with 8GB of which <3GB used
Replication: None
Number of elements in each shard: <1.5M
Number of clients: 16

Logs below.

Questions:
Is this normal? What’s going on?
Is any data lost during the ‘restart’?

Thanks for your help

Greg

  • Logs of weaviate-0 (during weaviate-1 restart)
{"level":"debug","msg":" memberlist: Failed UDP ping: weaviate-1 (timeout reached)","time":"2023-11-28T17:07:37Z"}
{"level":"info","msg":" memberlist: Suspect weaviate-1 has failed, no acks received","time":"2023-11-28T17:07:37Z"}
{"level":"debug","msg":" memberlist: Initiating push/pull sync with: weaviate-0 172.19.246.32:7000","time":"2023-11-28T17:07:39Z"}
...
{"level":"debug","msg":" memberlist: Failed UDP ping: weaviate-1 (timeout reached)","time":"2023-11-28T17:07:40Z"}
{"level":"info","msg":" memberlist: Marking weaviate-1 as failed, suspect timeout reached (0 peer confirmations)","time":"2023-11-28T17:07:40Z"}
{"level":"info","msg":" memberlist: Suspect weaviate-1 has failed, no acks received","time":"2023-11-28T17:07:40Z"}
{"level":"debug","msg":" memberlist: Stream connection from=172.19.246.32:49008","time":"2023-11-28T17:07:50Z"}
{"level":"debug","msg":" memberlist: Stream connection from=172.19.246.239:34340","time":"2023-11-28T17:08:08Z"}
{"level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 172.19.246.140:7000 Theirs: 172.19.246.239:7000 Old state: 2","time":"2023-11-28T17:08:08Z"}
{"level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 172.19.246.140:7000 Theirs: 172.19.246.239:7000 Old state: 2","time":"2023-11-28T17:08:08Z"}
{"level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 172.19.246.140:7000 Theirs: 172.19.246.239:7000 Old state: 2","time":"2023-11-28T17:08:08Z"}
{"level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 172.19.246.140:7000 Theirs: 172.19.246.239:7000 Old state: 2","time":"2023-11-28T17:08:09Z"}
{"level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 172.19.246.140:7000 Theirs: 172.19.246.239:7000 Old state: 2","time":"2023-11-28T17:08:09Z"}
{"level":"debug","msg":" memberlist: Initiating push/pull sync with: weaviate-0 172.19.246.32:7000","time":"2023-11-28T17:08:09Z"}
  • Logs of weaviate-1 (after restart)
{"action":"startup","level":"debug","msg":"created startup context, nothing done so far","startup_time_left":"59m59.998481609s","time":"2023-11-28T17:08:08Z"}
{"action":"config_load","config_file_path":"/weaviate-config/conf.yaml","level":"info","msg":"Usage of the weaviate.conf.json file is deprecated and will be removed in the future. Please use environment variables.","time":"2023-11-28T17:08:08Z"}
{"deprecation":{"apiType":"Configuration","id":"config-files","locations":["--config-file=\"\""],"mitigation":"Configure Weaviate using environment variables.","msg":"use of deprecated command line argument --config-file","sinceTime":"2020-09-08T09:46:00.000Z","
sinceVersion":"0.22.16","status":"deprecated"},"level":"warning","msg":"use of deprecated command line argument --config-file","time":"2023-11-28T17:08:08Z"}
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-11-28T17:08:08Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"config loaded","startup_time_left":"59m59.997738065s","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"configured OIDC and anonymous access client","startup_time_left":"59m59.997723864s","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"initialized schema","startup_time_left":"59m59.997716464s","time":"2023-11-28T17:08:08Z"}
{"level":"debug","msg":" memberlist: Initiating push/pull sync with:  172.19.246.32:7000","time":"2023-11-28T17:08:08Z"}
{"level":"warning","msg":" memberlist: Refuting a suspect message (from: weaviate-1)","time":"2023-11-28T17:08:08Z"}
{"level":"debug","msg":" memberlist: Initiating push/pull sync with:  172.19.246.239:7000","time":"2023-11-28T17:08:08Z"}
{"level":"debug","msg":" memberlist: Stream connection from=172.19.246.239:44148","time":"2023-11-28T17:08:08Z"}
{"level":"debug","msg":" memberlist: Initiating push/pull sync with:  172.19.246.108:7000","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"startup routine complete","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"start registering modules","time":"2023-11-28T17:08:08Z"}
{"action":"startup","level":"debug","msg":"completed registering modules","time":"2023-11-28T17:08:08Z"}
{"action":"cluster_api_startup","level":"debug","msg":"serving cluster api on port 7001","port":7001,"time":"2023-11-28T17:08:08Z"}
{"action":"hnsw_grow_index","level":"debug","msg":"index grown from 1000 to 3000, took 48.103µs\n","new_size":3000,"previous_size":1000,"time":"2023-11-28T17:08:08Z","took":48103}
{"action":"hnsw_grow_index","level":"debug","msg":"index grown from 3000 to 5000, took 19.401µs\n","new_size":5000,"previous_size":3000,"time":"2023-11-28T17:08:08Z","took":19401}
...
{"action":"hnsw_grow_index","level":"debug","msg":"index grown from 975616 to 1219520, took 9.448062ms\n","new_size":1219520,"previous_size":975616,"time":"2023-11-28T17:08:11Z","took":9448062}
{"action":"hnsw_grow_index","level":"debug","msg":"index grown from 1219520 to 1524400, took 14.119639ms\n","new_size":1524400,"previous_size":1219520,"time":"2023-11-28T17:08:13Z","took":14119639}
{"action":"lsm_init_disk_segment_build_bloom_filter_secondary","class":"MyClass","index":"myclass","level":"debug","msg":"building bloom filter took 8.865927ms\n","path":"/var/lib/weaviate/myclass_dRVK6rneWdGH_lsm/objects/segment-1701191194984974204.db","secondary_index_position":0,"shard":"dRVK6rneWdGH","time":"2023-11-28T17:08:15Z","took":8865927}
...
{"action":"lsm_init_disk_segment_build_bloom_filter_primary","class":"MyClass","index":"myclass","level":"debug","msg":"building bloom filter took 24.338447ms\n","path":"/var/lib/weaviate/myclass_dRVK6rneWdGH_lsm/property_commonclusterid/segment-1701191195173181719.db","shard":"dRVK6rneWdGH","time":"2023-11-28T17:08:16Z","took":24338447}
{"action":"startup","level":"debug","msg":"shard=dRVK6rneWdGH is ready","time":"2023-11-28T17:08:16Z"}
{"action":"startup","level":"debug","msg":"shard=dRVK6rneWdGH is ready","time":"2023-11-28T17:08:16Z"}

Hi!

This is definitely not the expected behavior.

Are you using our Helm chart?

Hi,

Yes, I’m using the latest weaviate/weaviate chart with the value overrides shown in the diff below.

I suspect a lack of resources has something to do with the restarts (increasing CPU and RAM delayed the behaviour), but I still don’t know why there are no previous logs and why K8s shows 0 restarts.

I’ll be running a test with plenty of RAM this afternoon and should know then whether that helps.

$ diff helm/values.yaml helm/values-orig.yaml
64c64
< replicas: 3
---
> replicas: 1
67,73c67,73
< resources:
<   requests:
<     cpu: '2000m'
<     memory: '4Gi'
< #  limits:
< #    cpu: '2000m'
< #    memory: '4Gi'
---
> resources: {}
>   # requests:
>   #   cpu: '500m'
>   #   memory: '300Mi'
>   # limits:
>   #   cpu: '1000m'
>   #   memory: '1Gi'
97c97
<   name: myproject-weaviate
---
>   name: weaviate
103c103
<   type: NodePort
---
>   type: LoadBalancer
105a106
>   clusterIP:
239,240d239
<
<   LOG_LEVEL: debug

Sure.

Let me know about your findings. I still have only very basic knowledge of k8s.

Bear in mind that when you ingest data, indexing it will take a lot of CPU.

And Weaviate will eat all the data you throw at it.

So usually this first “big” ingestion should have some strategy around it; otherwise, you will need a lot of resources just to deal with the initial load.
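As an example, here is a minimal batched-import sketch with weaviate-ts-client that also checks per-object results; the class name `MyClass`, the host, and the result handling are illustrative assumptions, not a drop-in for your setup:

```typescript
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({ scheme: 'http', host: 'localhost:8080' });

// Import objects in fixed-size batches and inspect the per-object results,
// so failed inserts surface instead of being silently dropped.
async function importBatch(
  items: { properties: Record<string, unknown>; vector: number[] }[]
) {
  let batcher = client.batch.objectsBatcher();
  for (const item of items) {
    batcher = batcher.withObject({
      class: 'MyClass',            // assumed class name
      properties: item.properties,
      vector: item.vector,         // vectors supplied by the client, as in this setup
    });
  }
  const results = await batcher.do();
  for (const r of results) {
    // each element carries a per-object result; log anything that failed
    if (r.result?.errors) {
      console.error('object failed:', JSON.stringify(r.result.errors));
    }
  }
}
```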

We have some best practices on data import here:

Let me know if that helps :slight_smile:

If we need some k8s experts here, I can “ring some bells” internally :wink:

OK, I maxed out the resources we have (given our rather puny cluster) and managed to finish the ingestion. So it probably was a resource issue, but I saw nothing in the K8s logs or the pod event list.

Some details:
K8s cluster nodes: 4 CPU x 12 GB each; scales up to 7 nodes on deployment
Weaviate cluster: 5 nodes, each requesting 2 CPU and 8 GB

Estimated RAM required: 36M items x 20-dimensional vectors x 8 B ≈ 6 GB (see the quick check after this list)
Watching RAM during the original (failed) run showed all 3 x 8 GB continuously full

  • might be related to GC; in any case I decided to increase to 5 x 8 GB ‘to be sure’
  • I wanted 3 x 10 GB but K8s wouldn’t allow it
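A quick back-of-the-envelope check of that estimate (raw vectors only, using the 8 bytes per dimension assumed above; HNSW graph links and object metadata are not included):

```typescript
// Rough estimate for raw vector storage only (HNSW graph links and object
// metadata add more on top of this).
const items = 36_000_000; // number of elements
const dims = 20;          // vector dimensionality
const bytesPerDim = 8;    // assumption used in the estimate above
const bytes = items * dims * bytesPerDim;
console.log((bytes / 1024 ** 3).toFixed(2) + ' GiB'); // ~5.36 GiB, i.e. roughly 6 GB
```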

Regarding your best-practices link:

  • We need to run an NN query before deciding whether to add a new element, so batch loading is not possible (see the sketch after this list)
  • We are running 16 clients, so the CPUs are kept fairly busy
  • We will need to revisit error handling; maybe that needs tightening up
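For reference, our check-then-insert flow looks roughly like the sketch below (a sketch only; `MyClass`, the field names, and the 0.05 distance threshold are illustrative, not our real values):

```typescript
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({ scheme: 'http', host: 'localhost:8080' });

// Look up the nearest neighbour first; only insert when nothing is close enough.
// 'MyClass' and the 0.05 distance threshold are illustrative, not our real values.
async function insertIfNovel(properties: Record<string, unknown>, vector: number[]) {
  const res = await client.graphql
    .get()
    .withClassName('MyClass')
    .withFields('_additional { id distance }')
    .withNearVector({ vector })
    .withLimit(1)
    .do();

  const hits = res.data?.Get?.MyClass ?? [];
  const nearestDistance = hits[0]?._additional?.distance;
  if (nearestDistance !== undefined && nearestDistance < 0.05) {
    return null; // a near-duplicate already exists; skip the insert
  }

  return client.data
    .creator()
    .withClassName('MyClass')
    .withProperties(properties)
    .withVector(vector)
    .do();
}
```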

Thanks for your help with this. At the moment I’m happy that we can complete our tests and get results, as long as we’re careful about resources.

The mode of failure when under-resourced is a little disconcerting, in that there seems to be no record of the problem after the fact.

If I have time I’ll try to experiment a little to see what exactly is happening - for all our sakes.

Other than that I’m happy this was resolved and you can consider the question closed.

Thanks,

Greg
