HNSW load commit log corruption error

Description

Weaviate crashed.
The log showed the message below, and I’m completely clueless:
write-ahead-log ended abruptly

Server Setup Information

  • Weaviate Server Version: 1.23.7
  • Deployment Method:
  • Multi Node? Number of Running Nodes:
  • Client Language and Version: V4

Any additional Information

Hi, we’ve got the same error since yesterday

{"action":"hnsw_load_commit_log_corruption","level":"error","msg":"write-ahead-log ended abruptly, some elements may not have been recovered","path":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1707929440","time":"2024-02-14T17:00:55Z"}
{"level":"info","msg":"Completed loading shard aiskillv2_1efo8r43N3tC in 55.950096536s","time":"2024-02-14T17:00:55Z"}
{"level":"info","msg":"Completed loading shard aiskillv2_5SQtIKAXTNQ6 in 56.630300501s","time":"2024-02-14T17:01:52Z"}
{"action":"requests_total","api":"graphql","class_name":"","error":"context canceled","level":"error","msg":"unexpected error","query_type":"","time":"2024-02-14T17:03:36Z"}
{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.9.22.24:8080-\u003e127.0.0.6:39361: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2024-02-14T17:04:48Z"}
{"action":"requests_total","api":"graphql","class_name":"","error":"context canceled","level":"error","msg":"unexpected error","query_type":"","time":"2024-02-14T17:06:20Z"}
{"action":"requests_total","api":"graphql","class_name":"","error":"context canceled","level":"error","msg":"unexpected error","query_type":"","time":"2024-02-14T17:07:48Z"}
{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.9.22.24:8080-\u003e127.0.0.6:46245: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2024-02-14T17:08:31Z"}
{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.9.22.24:8080-\u003e127.0.0.6:36071: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2024-02-14T17:13:40Z"}
{"action":"requests_total","api":"graphql","class_name":"","error":"context canceled","level":"error","msg":"unexpected error","query_type":"","time":"2024-02-14T17:13:43Z"}
{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.9.22.24:8080-\u003e127.0.0.6:34179: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2024-02-14T17:14:17Z"}
{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.9.22.24:8080-\u003e127.0.0.6:38333: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/graphql","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2024-02-14T17:25:32Z"}
{"level":"error","msg":" memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.9.22.24:44408-\u003e10.9.23.28:7000: i/o timeout","time":"2024-02-14T17:30:48Z"}
{"level":"info","msg":" memberlist: Suspect weaviate-0 has failed, no acks received","time":"2024-02-14T17:30:48Z"}
{"level":"error","msg":" memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.9.22.24:44428-\u003e10.9.23.28:7000: i/o timeout","time":"2024-02-14T17:30:50Z"}
{"level":"info","msg":" memberlist: Suspect weaviate-0 has failed, no acks received","time":"2024-02-14T17:30:50Z"}
{"level":"info","msg":" memberlist: Marking weaviate-0 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-02-14T17:30:52Z"}
{"level":"error","msg":" memberlist: Failed fallback TCP ping: timeout 1s: read tcp 10.9.22.24:45276-\u003e10.9.23.28:7000: i/o timeout","time":"2024-02-14T17:30:52Z"}
{"level":"info","msg":" memberlist: Suspect weaviate-0 has failed, no acks received","time":"2024-02-14T17:30:52Z"}

Hi! Welcome to our community, @Grzegorz_Pasieka! :hugs:

What are the versions you are running?

I believe this is a bug our team was recently able to sort out, one that was really hard to reproduce.

If it is the same one, it only happens in very specific circumstances (crash, restart, and crash/restart in a specific start cycle) with some deletes in the mix.

There are some PRs already. Follow this issue for more on the related problems being worked on after this discovery:

Thanks!

Thanks for your response!

weaviate: v1.23.3
helm-chart: v16.8.0
Multi Node: Single node
Client Language: Python, version 4.4.4

Ok.

A patch to fix this is being developed.

Is your server in a crash loop? For one “variant” of this bug, upgrading will keep the server from crash-looping for now. I’m not sure this is the same situation, but it’s always good to run the latest version.

Here:

We’ll keep you posted here with any news.

Thanks!