Startup failure with "mmap file: invalid argument" error on v1.24.6

After a Weaviate reboot (v1.24.6) we hit the shard corruption issue:

{"error":"init shard \"confluence20240405022809_YpL8JtBifDId\": init shard \"confluence20240405022809_YpL8JtBifDId\": shard db: create objects bucket: init disk segments: init segment segment-1712297123505677595.db: mmap file: invalid argument","level":"error","msg":"Unable to load shard YpL8JtBifDId: init shard \"confluence20240405022809_YpL8JtBifDId\": init shard \"confluence20240405022809_YpL8JtBifDId\": shard db: create objects bucket: init disk segments: init segment segment-1712297123505677595.db: mmap file: invalid argument","time":"2024-04-05T20:50:57Z"}

A similar issue seems to have been discussed in Startup Failure - #7 by msj242, but there it sounded as though v1.24 had improved file integrity, so a file created with v1.24+ shouldn’t really run into this.

Well, this file was created with v1.24.6 (it was created yesterday), and today, after a reboot/crash, it causes the above error and prevents Weaviate from starting up.

What is the recovery here?
Manually dropping the corrupted .db file and rebuilding it from zero?

I moved the folder of the offending class into a subfolder within the Weaviate data directory. Weaviate then recreated the original class folder (with 1 MB of data) and things suddenly started to work, including the class in question, which is now accessible with all its data.
I did not expect that. I thought I had lost this class.
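
For anyone hitting the same error: one well-known reason for mmap to fail with "invalid argument" (EINVAL) is a zero-length file, which can be left behind when a process crashes before a file is fully written. Whether that is what happened here is not confirmed in this thread, so treat the following as a minimal sketch under that assumption. It only lists empty segment files and changes nothing. The function name and the default path are hypothetical; point it at your own data directory.

```python
import sys
from pathlib import Path


def find_empty_segments(data_dir):
    """Return all segment-*.db files under data_dir that are 0 bytes long.

    Assumption: a zero-length segment file is the cause of the mmap error.
    This only reports candidates; it does not move or delete anything.
    """
    root = Path(data_dir)
    return sorted(
        p
        for p in root.rglob("segment-*.db")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # "/var/lib/weaviate" is an assumed location, not taken from this thread.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/weaviate"
    for path in find_empty_segments(data_dir):
        print(path)
```

If this prints nothing, the file is probably damaged in some other way and this check does not apply.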

Hi @Zoltan_Fedor !

Is this a multi node setup?

Indeed, that’s not expected :thinking:

Also, what is the number of objects you are running on this Weaviate cluster?

This may be a scale-related issue, so it would help to understand your scale here.

Thanks!

Hi @DudaNogueira ,
Yes, this is a multi-node setup (2 nodes) with object numbers in the 15-30 range (variable, as most objects get rebuilt daily).
For the larger ones we do additional sharding (on top of what the multi-node setup already implies).

15-30 objects only?
Or millions of objects?

:thinking:

Sorry, I meant 15-30 classes, but yes, many million records (about 15-20 million)

Hi!

This kind of issue seems to arise when you start reaching the limits of an individual machine.

So maybe you’ll need to resize your cluster and provision more nodes.

:thinking:

It is possible.
But if that is the case, shouldn’t there be a better way to tell us that the machines are reaching their individual limits than bricking the system?

There are some, as stated here:

but this may happen “outside” of Weaviate, so it is hard to catch those cases.
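
Since the limit may be hit outside of Weaviate, an external check on the host is one way to get an earlier warning. Below is a minimal sketch, assuming disk usage is the limit worth watching (the thread does not say which resource was exhausted). The function name, the 80% threshold, and the path are all hypothetical, and this is not a Weaviate feature.

```python
import shutil


def check_disk(path, warn_at=0.8):
    """Return a warning string if the filesystem holding path is at least
    warn_at (a fraction between 0 and 1) full, otherwise None."""
    usage = shutil.disk_usage(path)
    # Guard against filesystems that report a total size of zero.
    fraction = usage.used / usage.total if usage.total else 0.0
    if fraction >= warn_at:
        return f"WARNING: {path} is {fraction:.0%} full"
    return None


if __name__ == "__main__":
    # "/" is a placeholder; use the volume that holds your Weaviate data.
    message = check_disk("/")
    if message:
        print(message)
```

Run from cron or a sidecar, a check like this could alert before the node gets into a state where segment files end up damaged.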