Runtime error: makeslice: len out of range

Description

Hello, I started using your database. At first everything was fine, but after ingesting vectors for a long time (I inserted 1,730,168 vectors; the Estimated Sizes metric puts the LSM stores at 333 GB on disk), this error started appearing in the logs:

{"action":"cyclemanager","build_git_commit":"","build_go_version":"go1.22.0","build_image_tag":"","build_wv_version":"1.28.2","callback_id":"segmentgroup/compaction//home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects","callbacks_id":"store/compaction/..","class":"Tksad","index":"tksad","level":"error","msg":"callback panic: runtime error: makeslice: len out of range","shard":"K5j5EQ9XTNCU","time":"2025-01-19T07:56:44Z"}

I thought that after a restart the database would recover and the error would disappear, but it did not.

Server Setup Information

I use a multi-node configuration with 2 nodes, each running on its own physical server.
Configuration of each server:

  • 128 GB DDR4 RAM
  • Xeon E5-2678 v3
  • 4 TB NVMe SSD in RAID 0
There were no such errors on the second node.
I launch Weaviate manually from the binary; here is an example launch script:
export LOG_LEVEL="trace"
export CLUSTER_HOSTNAME="wv1"
export CLUSTER_GOSSIP_BIND_PORT=7100
export CLUSTER_DATA_BIND_PORT=7101
export AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
export PERSISTENCE_DATA_PATH="/home/user/rdata/weaviate"
export ASYNC_INDEXING=true
export RAFT_BOOTSTRAP_EXPECT=1
export RAFT_JOIN="wv1:8300"
export RAFT_BOOTSTRAP_TIMEOUT=3600
export PROMETHEUS_MONITORING_ENABLED=true
export LIMIT_RESOURCES=true
export TOMBSTONE_DELETION_MIN_PER_CYCLE=30000
export TOMBSTONE_DELETION_MAX_PER_CYCLE=300000
export QUERY_DEFAULTS_LIMIT=40
export PERSISTENCE_LSM_MAX_SEGMENT_SIZE="100GB"
export REPLICATION_MINIMUM_FACTOR=1

export IMAGE_INFERENCE_API="http://192.168.88.246:8111"
export DEFAULT_VECTORIZER_MODULE="img2vec-neural"
export ENABLE_MODULES="img2vec-neural"
export GO_PROFILING_DISABLE=true

/home/user/app/weaviate --host 0.0.0.0 --port 8080 --scheme http
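
For reference, a quick way to confirm the node is up after launching is the nodes endpoint (the URL assumes the port from the command above; the verbose output parameter for per-shard statistics is optional):

curl -s "http://localhost:8080/v1/nodes?output=verbose"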

Why does this error occur?
And how does it affect the functionality of the database?
How can I fix this error?

Hello @ilsg, the error log seems to indicate there is a corrupted .db file at /home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects. If such a .db file does not have an associated .wal file, that suggests the file got corrupted after having been successfully written to disk. (The panic itself is the Go runtime rejecting a negative or impossibly large slice length passed to make(), which here was most likely read from a corrupted length field during compaction.)
There may not be a way to recover such a file in isolation, but you can restore from a backup or, in a multi-node setup, move the .db file out of that path; the data will then be replicated back automatically as it is queried (via a replication mechanism called read-repair).
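
As a quick check, you can test whether a suspect segment still has its write-ahead log; the directory below is taken from your error log, while the segment filename is purely hypothetical:

db="/home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects/segment-1700000000.db"  # hypothetical name
if [ -e "${db%.db}.wal" ]; then
  echo "wal present: the segment may not have been fully flushed yet"
else
  echo "no wal: the segment was flushed successfully, so corruption likely happened on disk later"
fi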

There is already an ongoing effort to add integrity checking to .db files so that this situation can be detected automatically and handled in a better way.

And how will this file corruption affect the operation of the database itself?
Right now the database appears to work normally: it writes data and serves searches.
Is there perhaps a way to delete the damaged file?
Unfortunately, I did not have a replica of this collection in my multi-node configuration to restore from.

If the collection was created with a replication factor greater than one, the same data is also stored on other nodes; in that case, removing the corrupted .db files may be the simplest solution. If no backup is available and the replication factor is one, the data stored in those files won't be recoverable and re-ingestion will be required.
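
You can check which factor the collection was created with via the schema endpoint, for example (class name taken from your logs; jq is only used for readability):

curl -s "http://localhost:8080/v1/schema/Tksad" | jq '.replicationConfig'
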
Currently, a corrupted file can produce exactly the kind of issues you are seeing, likely preventing those operations from succeeding, so I'd recommend removing the file. The filename is not shown in that log line, but these situations will be handled in a better manner once integrity checking for this type of file is completed.
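
Once the corrupted segment has been identified (see below for one way to find its name), the removal itself is straightforward. A sketch, with a hypothetical segment name, that quarantines the file rather than deleting it outright:

# Stop the weaviate process on this node first.
mkdir -p /home/user/quarantine
# The glob also quarantines any auxiliary files sharing the segment's prefix, if present.
mv /home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects/segment-1700000000.* /home/user/quarantine/
# Restart the node; in a replicated setup, read-repair will restore the data as it is queried.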

Note: if possible, it would be better to use a three-node setup, as it makes it possible to continue normal operations when a node goes down.
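
For reference, a minimal sketch of how the cluster-related settings from your launch script would change for three nodes (hostnames wv1, wv2, wv3 are placeholders):

# Same on every node, except CLUSTER_HOSTNAME, which must be unique per node.
export CLUSTER_HOSTNAME="wv1"
export RAFT_BOOTSTRAP_EXPECT=3                  # wait for all three voters before bootstrapping
export RAFT_JOIN="wv1:8300,wv2:8300,wv3:8300"
export REPLICATION_MINIMUM_FACTOR=2             # e.g., so every collection gets at least one replica
# On wv2 and wv3, also point gossip at an already-running node:
export CLUSTER_JOIN="wv1:7100"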

Am I right in understanding that there is currently no way to find out which file is damaged?
Also, do I understand correctly that newly inserted data will work correctly in the database?

Newly inserted data will work correctly. You can currently identify the corrupted .db file from certain error log lines. The compaction line doesn't include the filename, but if you run a search, e.g. for a non-existing object UUID, Weaviate may attempt to read from all the .db files, and the same error may then appear including the filename.
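
Something along these lines, for example (host, port, and the zero UUID are illustrative; the class name comes from your logs, and where you grep depends on where you redirect the server's stdout):

# Probe a non-existing object so the read path touches the segment files:
curl -s "http://localhost:8080/v1/objects/Tksad/00000000-0000-0000-0000-000000000000"
# Then search the server output for the panic; the matching line should include the filename:
grep "makeslice" /home/user/weaviate.log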

OK, thank you for your prompt assistance in resolving my issue.
