Constant storage drain without increasing number of vectors

Description

Hello. Currently our Weaviate instance has a stable number of vectors (~10.5 million) that is not increasing. However, we are hitting storage limits because something is constantly eating free space on our nodes.

At first glance it looks like some failure in handling the HNSW commit logs.

The disk usage growth looks linear.

We cannot find any useful logs about a commit log cleanup failure.

Server Setup Information

  • Weaviate Server Version: 1.24.14
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3 replicas x 2 shards
  • Client Language and Version:

Any additional Information

Hey @Grzegorz_Pasieka thanks for raising this.

Is this something that has only become a problem since upgrading to 1.24.14? Or has it persisted since pre-upgrade?

Hi @Grzegorz_Pasieka, do you know which files are growing or being created over time?

Thanks!

We have had these problems since v1.24.2; that's why we decided to upgrade to 1.24.14 to check whether it helps.

I think main.hnsw.commitlog.d is growing in each shard.
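
To pin down which directories are actually growing, a small standalone Go sketch like the one below can walk the persistence directory and report the size of every main.hnsw.commitlog.d directory per shard; comparing the output over time shows where the space goes. This is only a diagnostic helper, not Weaviate code, and it assumes the default /var/lib/weaviate data path, so adjust it to your volume mount.

```
// Diagnostic helper (not part of Weaviate): prints the total size of every
// main.hnsw.commitlog.d directory under the given persistence directory.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the sizes of all regular files below path.
func dirSize(path string) (int64, error) {
	var total int64
	err := filepath.WalkDir(path, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	return total, err
}

func main() {
	root := "/var/lib/weaviate" // assumption: default persistence path; override via argument
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && d.Name() == "main.hnsw.commitlog.d" {
			size, err := dirSize(path)
			if err != nil {
				return err
			}
			fmt.Printf("%s\t%.1f MiB\n", path, float64(size)/(1<<20))
			return filepath.SkipDir // already summed; no need to descend further
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Running it periodically on each node and diffing the output makes it easy to tell whether the growth comes from one shard or from all of them.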

What is also worth mentioning is that one of my replicas constantly crashes because of goroutine panics:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-21T22:52:45Z"}
goroutine 7255736 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42840000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc0148948c0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc02fe30d20, {0x21084e6, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc02fe30d20, {0xc1023935a0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc01be01080, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc02fe30d20)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 7155590
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716332109","old_file_name":"1716328911","old_file_size":105340363,"time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:09Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:11Z"}
{"action":"hnsw_commit_logger_combine_condensed_logs","id":"main","input_first":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649.condensed","input_second":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716328911.condensed","level":"info","msg":"successfully combined previously condensed commit log files","output":"/var/lib/weaviate/aiskillv2/1efo8r43N3tC/main.hnsw.commitlog.d/1716246649","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing","level":"info","msg":"start hnsw condensing","time":"2024-05-21T22:55:14Z"}
{"action":"hnsw_condensing_complete","level":"info","msg":"completed hnsw condensing","time":"2024-05-21T22:55:39Z"}

@Grzegorz_Pasieka we have pushed fixes for the HNSW combining logic, so if you are using v1.24.14 you should already have all of them deployed.

Do you observe that your main.hnsw.commitlog.d is already getting smaller?

We are still monitoring this deployment; however, I can see another file size increase in the second shard of one of the replicas.

We also keep observing crashes in the first replica:

{"level":"error","msg":"Recovered from panic: assignment to entry in nil map","time":"2024-05-22T14:42:28Z"}
goroutine 22835974 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x18b86a0?, 0x1fc9980?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonPropertyLengthTracker).TrackProperty(0xc0fd697800, {0xc000124ea0, 0x7}, 0x42a00000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:181 +0x3c8
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc000215000, {0xc048353dc0?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x8f
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc000215000, 0xc17759ec30, {0x212cc81, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:412 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc000215000, 0xc17759ec30, {0xc15ba27dc0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:288 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc00449f300, {0x1fe7538, 0x2d8dbc0}, 0x0?, 0xc17759ec30)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:207 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:177 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 22408192
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
{"action":"commit_log_file_switched","id":"main","level":"info","msg":"commit log size crossed threshold, switching to new file","new_file_name":"1716388985","old_file_name":"1716388910","old_file_size":119471320,"time":"2024-05-22T14:43:05Z"}
```

A post was split to a new topic: Adding/removing objects with the same vectors

Today we noticed storage usage drops in each replica. Surprisingly, they were not due to the commit logs, which are still growing, but to the objects in the second shard.

It's great that the usage is decreasing and that compaction works for the objects. I think I will be able to share a preview Docker image with fixes that might help in your situation. I'm gathering everything internally right now and will then share a preview image with you (built on top of v1.24).

We recorded another drop in storage usage in the first shard. We also performed the migration to 1.25.1, which I already described in another topic: Panic after migration to 1.25.1.

@Grzegorz_Pasieka I also want to let you know that tomorrow (Wednesday at the latest) we will be releasing new patch releases of v1.24 and v1.25 containing fixes that should improve your situation.

We have identified those bugs and are now pushing all of the fixes to our code base, so please stay tuned. I will let you know once those patch releases are ready.

Hi! Thanks for addressing the fixes in 1.25.2!

Today we migrated from 1.25.1 to 1.25.2, and currently none of our instances are healthy. We observe logs with "restoring snapshots", then I/O timeouts, restarts, and the same process all over again.

Hi @antas-marcin - after successfully migrating to 1.25.4, our weaviate-0 replica is constantly crashing due to the following error:

```
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x13
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1.1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:30 +0x110
panic({0x19b6980?, 0x2116d40?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*JsonShardMetaData).TrackProperty(0xc2c850d280, {0xc0038567d0, 0x7}, 0x41e80000)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/new_prop_length_tracker.go:218 +0x408
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).SetPropertyLengths(0xc002aa8280, {0xc1580d9500?, 0x7, 0x41a848?})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_inverted_lsm.go:205 +0x95
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).updateInvertedIndexLSM(0xc002aa8280, 0xc0207e10e0, {0x249985d, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:413 +0x38e
github.com/weaviate/weaviate/adapters/repos/db.(*Shard).putObjectLSM(0xc002aa8280, 0xc0207e10e0, {0xc0e59d3ce0, 0x10, 0x10})
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_put.go:289 +0x3cc
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeObjectOfBatchInLSM(0xc1a195fa80, {0x2135e70, 0x2f6ca20}, 0xc1c126a460?, 0xc0207e10e0)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:208 +0xd4
github.com/weaviate/weaviate/adapters/repos/db.(*objectsBatcher).storeSingleBatchInLSM.func1()
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/shard_write_batch_objects.go:178 +0x11a
github.com/weaviate/weaviate/entities/errors.GoWrapper.func1()
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:34 +0x62
created by github.com/weaviate/weaviate/entities/errors.GoWrapper in goroutine 104762307
	/go/src/github.com/weaviate/weaviate/entities/errors/go_wrapper.go:25 +0x79
```

We have been observing this “new_prop_length_tracker” error for a very long time. Do you think it could be fixed somehow?

@Grzegorz_Pasieka for this panic, we currently have only this way of fixing it (look here).

The RECOUNT_PROPERTIES_AT_STARTUP setting can be set only on the node where you see these panics. It recreates the proplen file; regenerating it may take some time, but once it's done the panic should go away. At the end, remember to remove this setting.

Thanks for this suggestion - I've just restarted the node with this setting and got:

{"action":"recount","level":"warning","msg":"Recounted 0 objects. Recounting properties complete. Please remove environment variable \tRECOUNT_PROPERTIES_AT_STARTUP before next startup","time":"2024-06-24T17:13:15Z"}

Should I worry about the “Recounted 0 objects” part?

I think it’s OK, did the panic disappear?

Also, I remember that you had a problem with increased storage usage - do you still experience it, or did the usage start to drop when you upgraded to v1.25.4?

The panic did not disappear.

As for the increased storage usage, it is still growing constantly, but storage is reclaimed more often now.

Unfortunately, in your case you will need to fix this manually.

I think these are the steps you need to take:

  1. Log into the node where this exception occurs, find the proplengths files whose BucketedData is nil, and delete those files (a rough checker sketch follows below).
  2. Restart this node with RECOUNT_PROPERTIES_AT_STARTUP set to true.

This operation should re-create the missing proplengths files. You should then see a log stating Recounted X objects rather than Recounted 0 objects.
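
For step 1, a rough Go sketch along the lines below may help to locate candidate files. The file naming (anything containing "proplengths") and the assumption that the tracker file is plain JSON are guesses based on this thread, so please verify what the files look like on your node before deleting anything; the BucketedData field name is taken from the steps above.

```
// Rough helper (not part of Weaviate): lists property-length tracker files
// whose BucketedData field is null or missing, as candidates for manual review
// before restarting the node with RECOUNT_PROPERTIES_AT_STARTUP=true.
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "/var/lib/weaviate" // assumption: default persistence path; override via argument
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		// Assumption: tracker files have "proplengths" somewhere in their name.
		if !strings.Contains(d.Name(), "proplengths") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Only the BucketedData field is inspected; everything else is ignored.
		var meta struct {
			BucketedData json.RawMessage `json:"BucketedData"`
		}
		if err := json.Unmarshal(data, &meta); err != nil {
			fmt.Printf("%s\tnot valid JSON (%v)\n", path, err)
			return nil
		}
		if len(meta.BucketedData) == 0 || string(meta.BucketedData) == "null" {
			fmt.Printf("%s\tBucketedData is nil -> candidate for deletion\n", path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The sketch only prints candidates; the actual deletion is left to you so nothing is removed by accident.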

Please let me know if that worked.