Constant storage drain without increasing number of vectors

Description

Hello. Currently our Weaviate instance has a stable number of vectors (~10.5 million) that is not increasing. However, we are hitting storage limits because something is constantly eating free space on our nodes.

At first glance it looks like some failure in handling the HNSW commit logs.

The disk usage growth looks linear.

We cannot find any useful logs about a commit log cleanup failure.

Server Setup Information

  • Weaviate Server Version: 1.24.14
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3 replicas x 2 shards
  • Client Language and Version:

Any additional Information

Hey @Grzegorz_Pasieka thanks for raising this.

Is this something that has only become a problem since upgrading to 1.24.14? Or has it persisted since pre-upgrade?

Hi @Grzegorz_Pasieka, do you know which files are growing or being created over time?

Thanks!

We have had these problems since v1.24.2; that's why we decided to upgrade to 1.24.14 to check whether it helps.

I think main.hnsw.commitlog.d is growing in each shard.
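
To pin down which directories are actually growing, a small standalone Go sketch like the one below can walk the persistence directory and report the size of every main.hnsw.commitlog.d directory per shard; comparing the output over time shows where the space goes. This is only a diagnostic helper, not Weaviate code, and it assumes the default /var/lib/weaviate data path, so adjust it to your volume mount.

```
// Diagnostic helper (not part of Weaviate): prints the total size of every
// main.hnsw.commitlog.d directory under the given persistence directory.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the sizes of all regular files below path.
func dirSize(path string) (int64, error) {
	var total int64
	err := filepath.WalkDir(path, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

func main() {
	root := "/var/lib/weaviate" // assumption: default persistence path; override via argument
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && d.Name() == "main.hnsw.commitlog.d" {
			size, err := dirSize(path)
			if err != nil {
				return err
			}
			fmt.Printf("%s\t%.1f MiB\n", path, float64(size)/(1<<20))
			return filepath.SkipDir // already summed; no need to descend further
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Running it periodically on each node and diffing the output makes it easy to tell whether the growth comes from one shard or from all of them.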

What is also worth mentioning is that one of my replicas constantly crashes because of goroutine panics:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-21T22:52:45Z"}
goroutine 7255736 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42840000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc0148948c0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc02fe30d20, {0x21084e6, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc02fe30d20, {0xc1023935a0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc01be01080, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc02fe30d20)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 7155590
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716332109","old_file_name":"1716328911","old_file_size":105340363,"time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:11Z"}
{"action":"hnsw_commit_logger_combine_condensed_logs","id":"main","input_first":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649.condensed","input_second":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716328911.condensed","level":"info","msg":"successfully combined previously condensed commit log files","output":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:39Z"}

@Grzegorz_Pasieka we have pushed fixes for the HNSW combining logic, so if you are using v1.24.14 you should already have all of them deployed.

Do you observe that your main.hnsw.commitlog.d is already getting smaller?

We are still monitoring this deployment; however, I can see another file size increase in the second shard of one of the replicas.

We also keep observing crashes in the first replica:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-22T14:42:28Z"}
goroutine 22835974 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42a00000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc048353dc0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc17759ec30, {0x212cc81, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc17759ec30, {0xc15ba27dc0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc00449f300, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc17759ec30)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 22408192
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716388985","old_file_name":"1716388910","old_file_size":119471320,"time":"2024-05-22T14:43:05Z"}
```

A post was split to a new topic: Adding/removing objects with the same vectors

Today we noticed storage usage drops in each replica. Surprisingly, they were not due to the commit logs, which are still growing, but to the objects in the second shard.

It's great that the usage is decreasing and that compaction works for the objects. I think I will be able to share a preview Docker image with fixes that might help in your situation. I'm gathering everything internally right now and will then share a preview image with you (built on top of v1.24).

We recorded another drop in storage usage in the first shard. We also performed the migration to 1.25.1, which I already described in another topic: Panic after migration to 1.25.1.

@Grzegorz_Pasieka I also want to let you know that tomorrow (Wednesday at the latest) we will be releasing new patch releases of v1.24 and v1.25 containing fixes that should improve your situation.

We have identified those bugs and are now pushing all of the fixes to our code base, so please stay tuned. I will let you know once those patch releases are ready.

Hi! Thanks for addressing the fixes in 1.25.2!

Today we migrated from 1.25.1 to 1.25.2, and currently none of our instances are healthy. We observe logs with "restoring snapshots", then I/O timeouts, restarts, and the same process all over again.

Hi @antas-marcin - after successfully migrating to 1.25.4, our weaviate-0 replica is constantly crashing due to the following error:

```
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x19b6980?, 0x2116d40?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonShardMetaData).TrackProperty(0xc2c850d280, {0xc0038567d0, 0x7}, 0x41e80000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:218 +0x408
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc002aa8280, {0xc1580d9500?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x95
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc002aa8280, 0xc0207e10e0, {0x249985d, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:413 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc002aa8280, 0xc0207e10e0, {0xc0e59d3ce0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:289 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc1a195fa80, {0x2135e70, 0x2f6ca20}, 0xc1c126a460?, 0xc0207e10e0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:208 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:178 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 104762307
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
```

We have been observing this “new_prop_length_tracker” error for a very long time. Do you think it could be fixed somehow?

@Grzegorz_Pasieka for this panic, we currently have only this way of fixing it (look here).

The RECOUNT_PROPERTIES_AT_STARTUP setting can be set only on the node where you see these panics. It recreates the proplen file; regenerating it may take some time, but once it's done the panic should go away. At the end, remember to remove this setting.

Thanks for this suggestion - I've just restarted the node with this setting and got:

{"action":"recount","level":"warning","msg":"Recounted 0 objects. Recounting properties complete. Please remove environment variable \tRECOUNT_PROPERTIES_AT_STARTUP before next startup","time":"2024-06-24T17:13:15Z"}

Should I worry about the “Recounted 0 objects” part?

I think it’s OK, did the panic disappear?

Also, I remember that you had a problem with increased storage usage - do you still experience it, or did the usage start to drop when you upgraded to v1.25.4?

The panic did not disappear.

As for the increased storage usage, it is still growing constantly, but storage is reclaimed more often now.

Unfortunately, in your case you will need to fix this manually.

I think these are the steps you need to take:

  1. Log into the node where this exception occurs, find the proplengths files whose BucketedData is nil, and delete those files (a rough checker sketch follows below).
  2. Restart this node with RECOUNT_PROPERTIES_AT_STARTUP set to true.

This operation should re-create the missing proplengths files. You should then see a log stating Recounted X objects rather than Recounted 0 objects.
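
For step 1, a rough Go sketch along the lines below may help to locate candidate files. The file naming (anything containing "proplengths") and the assumption that the tracker file is plain JSON are guesses based on this thread, so please verify what the files look like on your node before deleting anything; the BucketedData field name is taken from the steps above.

```
// Rough helper (not part of Weaviate): lists property-length tracker files
// whose BucketedData field is null or missing, as candidates for manual review
// before restarting the node with RECOUNT_PROPERTIES_AT_STARTUP=true.
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "/var/lib/weaviate" // assumption: default persistence path; override via argument
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		// Assumption: tracker files have "proplengths" somewhere in their name.
		if !strings.Contains(d.Name(), "proplengths") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Only the BucketedData field is inspected; everything else is ignored.
		var meta struct {
			BucketedData json.RawMessage `json:"BucketedData"`
		}
		if err := json.Unmarshal(data, &meta); err != nil {
			fmt.Printf("%s\tnot valid JSON (%v)\n", path, err)
			return nil
		}
		if len(meta.BucketedData) == 0 || string(meta.BucketedData) == "null" {
			fmt.Printf("%s\tBucketedData is nil -> candidate for deletion\n", path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The sketch only prints candidates; the actual deletion is left to you so nothing is removed by accident.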

Please let me know if that worked.