Hello, I started using your database. At first everything was fine, but after a long period of writing vectors (I inserted 1,730,168 vectors; the estimated size of the LSM stores on disk is 333 GB), it started producing an error in the logs:
{"action":"cyclemanager","build_git_commit":"","build_go_version":"go1.22.0","build_image_tag":"","build_wv_version":"1.28.2","callback_id":"segmentgroup/compaction//home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects","callbacks_id":"store/compaction/..","class":"Tksad","index":"tksad","level":"error","msg":"callback panic: runtime error: makeslice: len out of range","shard":"K5j5EQ9XTNCU","time":"2025-01-19T07:56:44Z"}
I thought that after a reboot the database would be cured and the error would disappear, but that did not happen.
Server Setup Information
I use a multi-node configuration with 2 nodes, each running on its own physical server.
Configuration of each server:
128 GB DDR4 RAM
Xeon E5-2678 v3
4 TB NVMe SSD in RAID 0
There were no such errors on the second node.
I launch it manually from the binary; here is an example of how I launch it:
Hello @ilsg, the error log seems to indicate there is a corrupted .db file at /home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects. If such a .db file does not have an associated .wal file, it may indicate the file got corrupted after being successfully written to disk.
There may not be a way to recover such a file in an isolated manner, but you can restore from a backup or move the .db file out of that path; in a multi-node setup the data will then be replicated back automatically as it is queried (there is a replication mechanism called read-repair).
There is already an ongoing effort to add integrity checking for .db files, so this situation can be detected automatically and handled in a better way.
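The heuristic above can be sketched as a small script. This is only an illustration built on assumptions from this thread (the `lsm/objects` directory layout and the `.db`/`.wal` extensions are taken from the log path; no official Weaviate tool is implied):

```python
import os


def db_files_without_wal(lsm_dir):
    """List .db segment files in lsm_dir that lack a same-named .wal companion.

    Mirrors the heuristic described above: a .db file without an associated
    .wal file *may* have been corrupted after being flushed to disk.
    This is only a hint, not a definitive integrity check.
    """
    entries = os.listdir(lsm_dir)
    wal_stems = {os.path.splitext(f)[0] for f in entries if f.endswith(".wal")}
    return sorted(
        f
        for f in entries
        if f.endswith(".db") and os.path.splitext(f)[0] not in wal_stems
    )
```

For example, running it against `/home/user/rdata/weaviate/tksad/K5j5EQ9XTNCU/lsm/objects` would print candidate segments to inspect first.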
And how will this file corruption affect the operation of the database itself?
Because right now the database works normally: it writes data and performs searches.
Maybe there is a way to delete this damaged file?
Unfortunately, I did not have a replica of this collection in the multi-node configuration to restore it from.
If the collection was created with a replication factor greater than one, the same data is stored on other nodes; in that case, removing the corrupted .db files may be the simplest solution. If a backup is not available and the replication factor is one, the data stored in those files won't be recoverable and re-ingestion will be required.
Currently, a corrupted file can generate the kind of issues you are seeing, probably preventing the affected operations from succeeding, so I'd recommend removing the file. The filename is not shown in that log line, but this situation will be handled in a better manner once integrity checking for this type of file is completed.
Note: if possible, it would be better to use a three-node setup, as it makes it possible to continue normal operations if a node goes down.
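Rather than deleting the corrupted segment outright, a cautious approach is to move it out of the shard directory so it can be put back if removing it turns out to be wrong. A minimal sketch, assuming the node is stopped first and that sibling files sharing the segment's stem should travel with it (an assumption about the on-disk layout, not documented behavior):

```python
import os
import shutil


def quarantine_segment(db_path, quarantine_dir):
    """Move a suspect .db segment, plus any sibling files sharing its stem,
    out of the shard directory into quarantine_dir.

    Run this only while the Weaviate node is offline. Moving (instead of
    deleting) keeps the option of restoring the files later.
    """
    os.makedirs(quarantine_dir, exist_ok=True)
    seg_dir = os.path.dirname(db_path)
    stem = os.path.splitext(os.path.basename(db_path))[0]
    moved = []
    for name in os.listdir(seg_dir):
        if os.path.splitext(name)[0] == stem:
            shutil.move(os.path.join(seg_dir, name),
                        os.path.join(quarantine_dir, name))
            moved.append(name)
    return sorted(moved)
```

After restarting the node, in a replicated setup read-repair should fill the gap back in as the data is queried; with replication factor one, the objects stored in the quarantined segment stay unavailable.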
Am I right in understanding that right now there is no way to find out which file is damaged?
Also, do I understand correctly that the newly inserted data will work correctly in the database?
Newly inserted data will work correctly. Currently you can identify the corrupted .db file from certain error log lines; the compaction log line does not include it, but if you perform a search, e.g. for a non-existent object UUID, the database may attempt to read from all the .db files, and the error may then appear with the filename included.
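One way to trigger such a lookup is Weaviate's REST endpoint `GET /v1/objects/{className}/{id}` with a freshly generated UUID, which almost certainly does not exist. A hedged sketch that only builds the curl command (the host/port `http://localhost:8080` and the class name `Tksad` are assumptions about this particular deployment):

```python
import uuid


def probe_request(base_url, class_name):
    """Build a curl command fetching a random (almost certainly
    non-existent) object UUID from a class.

    Running the command against the affected node may force reads across
    the shard's .db files; the resulting error log line may then include
    the corrupted segment's filename.
    """
    missing_id = str(uuid.uuid4())
    return f"curl -s {base_url}/v1/objects/{class_name}/{missing_id}"


cmd = probe_request("http://localhost:8080", "Tksad")
```

Run the printed command on the node showing the compaction panic, then check that node's logs for an error line mentioning a .db path.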