Constant storage drain without increasing number of vectors

Description

Hello. Our Weaviate instance currently has a stable number of vectors (~10.5 million) that is not increasing. However, we are hitting storage limits because something is constantly eating free space on our nodes.

At first glance it looks like failures in handling the HNSW commit logs.

The disk usage growth looks linear.

We cannot find any useful logs about commit log cleanup failures.
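As a quick sanity check on the "stable number of vectors" observation, the per-class object count can be read from Weaviate's GraphQL Aggregate endpoint. A minimal sketch (the class name `Article` and the `localhost:8080` address are placeholders for your own setup):

```go
// Poll Weaviate's GraphQL Aggregate endpoint and print the per-class
// object count. "Article" is a placeholder class name; substitute your own.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	query := map[string]string{
		"query": `{ Aggregate { Article { meta { count } } } }`,
	}
	body, _ := json.Marshal(query)

	resp, err := http.Post("http://localhost:8080/v1/graphql", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	// Prints the raw GraphQL response, including Aggregate.Article.meta.count.
	fmt.Printf("%+v\n", result)
}
```

If the count stays flat while disk usage climbs, the growth is coming from index or commit log files rather than from new objects.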

Server Setup Information

  • Weaviate Server Version: 1.24.14
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3 replicas x 2 shards
  • Client Language and Version:

Any additional Information

Hey @Grzegorz_Pasieka thanks for raising this.

Is this something that has only become a problem since upgrading to 1.24.14? Or has it persisted since pre-upgrade?

Hi @Grzegorz_Pasieka, do you know which files are growing or being created over time?

Thanks!

We have had these problems since v1.24.2; that's why we decided to upgrade to 1.24.14 to see whether it would help.

I think main.hnsw.commitlog.d is becoming bigger in each shard
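To pin down which shard's commit log directory is growing, one option is to sum the file sizes under every main.hnsw.commitlog.d directory and compare the output between runs. A rough sketch, assuming the /var/lib/weaviate data path that appears in the commit log messages later in this thread:

```go
// Walk the Weaviate data directory and report the total size of every
// main.hnsw.commitlog.d directory; re-running it shows which shard's
// commit log keeps growing. The root path below is an assumption.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "/var/lib/weaviate" // adjust to your PERSISTENCE_DATA_PATH
	sizes := map[string]int64{}

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		// Only count regular files that live inside a commit log directory.
		if d.IsDir() || !strings.Contains(path, "main.hnsw.commitlog.d") {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		sizes[filepath.Dir(path)] += info.Size()
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for dir, size := range sizes {
		fmt.Printf("%8.1f MiB  %s\n", float64(size)/(1<<20), dir)
	}
}
```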

What is also worth mentioning is that one of my replicas constantly crashes because of goroutine panics:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-21T22:52:45Z"}
goroutine 7255736 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42840000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc0148948c0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc02fe30d20, {0x21084e6, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc02fe30d20, {0xc1023935a0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc01be01080, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc02fe30d20)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 7155590
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716332109","old_file_name":"1716328911","old_file_size":105340363,"time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:11Z"}
{"action":"hnsw_commit_logger_combine_condensed_logs","id":"main","input_first":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649.condensed","input_second":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716328911.condensed","level":"info","msg":"successfully combined previously condensed commit log files","output":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:39Z"}

@Grzegorz_Pasieka we have pushed fixes for the HNSW commit log combining logic, so if you are using v1.24.14 you should already have all of them deployed.

Do you observe that your main.hnsw.commitlog.d is already becoming smaller?


We are still observing this deployment; however, I can see another file size increase in the second shard of one of the replicas.

Also we constantly observe crashes in the first replica:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-22T14:42:28Z"}
goroutine 22835974 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42a00000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc048353dc0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc17759ec30, {0x212cc81, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc17759ec30, {0xc15ba27dc0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc00449f300, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc17759ec30)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 22408192
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716388985","old_file_name":"1716388910","old_file_size":119471320,"time":"2024-05-22T14:43:05Z"}
```


Today we noticed storage usage drops in each replica. Surprisingly, it was not because of the commit logs, which are still growing, but because of the objects in the second shard.

It’s great that the usage is decreasing and that compaction works for objects. I think I will be able to share a preview Docker image (built on top of v1.24) with fixes that might help in your situation. I’m gathering everything internally now and will then share the preview image with you.

We recorded another drop in storage usage in the first shard. We also performed the migration to 1.25.1, which I already described in another topic: Panic after migration to 1.25.1

@Grzegorz_Pasieka I also want to let you know that tomorrow (Wednesday at the latest) we will be releasing new patch releases of v1.24 and v1.25 that contain fixes that should improve your situation.

We have identified those bugs and are now pushing all of the fixes to our code base, so please stay tuned. I will let you know once those patch releases are ready.

Hi! Thanks for the fixes in 1.25.2!

Today we did the migration from 1.25.1 to 1.25.2, and currently none of our instances are healthy. We observe logs with “restoring snapshots”, then I/O timeouts, restarts, and the same process again.
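One way to follow the restart loop from outside the logs is to poll Weaviate's readiness endpoint, /v1/.well-known/ready, and record when it flips. A small sketch, assuming the node is reachable on localhost:8080 (e.g. via kubectl port-forward):

```go
// Watcher that polls Weaviate's readiness endpoint and prints state
// transitions, to correlate the "restoring snapshots" / I/O timeout
// loop with readiness flaps. Host and port are assumptions.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const url = "http://localhost:8080/v1/.well-known/ready"
	client := &http.Client{Timeout: 5 * time.Second}

	first := true
	lastReady := false
	for {
		ready := false
		if resp, err := client.Get(url); err == nil {
			resp.Body.Close()
			ready = resp.StatusCode == http.StatusOK
		}
		// Print the initial state and every subsequent transition.
		if first || ready != lastReady {
			fmt.Printf("%s ready=%v\n", time.Now().Format(time.RFC3339), ready)
			lastReady = ready
			first = false
		}
		time.Sleep(10 * time.Second)
	}
}
```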