Volume and object size going up instead of down after removing >50% of objects

Description

Due to a bug in our update handling, for a while we were creating new objects instead of updating existing ones, without deleting the old ones. We noticed this a few days ago and have since fixed the issue.
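For reference, an update keyed on a stable UUID (so repeated updates overwrite the same object instead of creating duplicates) looks roughly like this with the Python v3 client. This is only an illustrative sketch: the URL, the deterministic-UUID helper, and the property names are examples, not our actual schema.

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")  # example URL

# Illustrative properties, not our real schema
doc = {"source_id": "doc-123", "text": "updated chunk text"}

# Derive a stable UUID from our own document ID, so a repeated update
# targets the existing object instead of creating a new one.
uid = generate_uuid5(doc["source_id"])

client.data_object.replace(
    data_object=doc,
    class_name="WeaviateDocumentChunk",
    uuid=uid,
)
```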
This left us with 160GB of Weaviate data (according to df), which didn't make Weaviate very happy on our 120GB RAM server.
We then ran a script that removed all objects that should have been deleted. For some reason, the overall file size increased from 160GB to 185GB, making the resource issues even worse.
Is this expected and just a matter of time until it actually deletes objects? Is there a way to speed up that deletion?

Server Setup Information

  • Weaviate Version: 1.28.2
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python v3
  • Multitenancy?: nope

Any additional Information

Our setup is quite normal, but we noticed this issue because our AWS costs became significant overnight, apparently due to Weaviate constantly swapping objects in and out of our EFS volume.

We did a quick calculation based on our raw object size: if we stored the objects raw on disk, we'd expect about 30GB in total. We're not sure how much overhead Weaviate adds on top of that, but ~30GB seems like a realistic number.

hi @afstkla !!

The disk space should be freed as compaction runs.

Can you confirm whether the disk usage still hasn't gone down even one day after the object deletion?
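If it helps, here is a quick way to track it over time. This is just a sketch and assumes the volume behind PERSISTENCE_DATA_PATH is mounted on the host at ./weaviate-data; adjust the path to your own setup.

```python
from pathlib import Path

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in GB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1024**3

# Example: the host directory mounted as Weaviate's data volume.
print(f"Weaviate data dir: {dir_size_gb('./weaviate-data'):.1f} GB")
```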

Thanks!

Hi @DudaNogueira, no. Even though we ran the cleanup script Tuesday night (CET), this morning (48+ hours later) it's still at 184.3GB, which is ± the same as when I made this post.

I've already tried re-running our deletion script, and it no longer finds any of the objects we deleted, nor any new objects to delete since the last run (we used the Python v3 client's client.data_object.delete(<uuid>, "WeaviateDocumentChunk")). So we're quite at a loss here. Is there anything you can propose to fix it, or even to debug it? We don't see any unusual logs, other than the fact that our service is clearly struggling, probably due to the large amount of storage.
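For completeness, the cleanup boils down to roughly this (simplified sketch; the connection URL is an example and stale_uuids comes from our own bookkeeping of the duplicate objects):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # example URL

def purge(stale_uuids: list[str]) -> None:
    """Delete the duplicate objects we identified, one UUID at a time."""
    for uid in stale_uuids:
        # Second argument is the class name in the Python v3 client.
        client.data_object.delete(uid, "WeaviateDocumentChunk")
```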

(Can't upload more than one image, so trying it like this: terminal info from our prod instance, plus AWS logs from roughly the past hour in DEBUG mode.)

hi @afstkla !!

I will be investigating this further to escalate with our team.

I will get back here when I get more info.

Thanks!

Hi @DudaNogueira , thanks!

The reason this initially worried us was a sudden, huge increase in AWS spend caused by our EFS volume being bombarded with read requests (which is what triggered our investigation and our data purge).
Since yesterday we've moved our Weaviate away from ECS/EFS, which made things cool off cost-wise. So the urgency on our side is slightly lower now, but maybe this context helps you figure out what's going on.


Thanks for sharing!

We did some investigation, and there does indeed seem to be room for improvement here.

We may be able to adjust some configs to free that space faster.

Thanks for reporting!

Thanks, I’ve subscribed to the issue to get updates.

As we're running a workload with a lot of deletions and creations (we update many of our documents daily or weekly), is it expected that the impact of this issue is also bigger for us?

In our case we went from ±160GB on disk before the deletion to ±185GB after it (and, for some reason, our issues actually seem to have gotten worse since the deletion), and it has stayed at that 185GB for a week now.

Indeed, EFS is not recommended.

Our team is analyzing this kind of scenario (lots of deletions). We may be able to improve configurations like PERSISTENCE_LSM_SEGMENTS_CLEANUP_INTERVAL_HOURS, which should speed up the cleaning/merging of segments after deletions.
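If that setting applies to your version, it would be set as an environment variable on the Weaviate container; for example, in a docker-compose file (the interval value below is only an illustration, not a recommendation, and please check the release notes for availability):

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.28.2
    environment:
      # Run the LSM segment cleanup more frequently (value in hours).
      PERSISTENCE_LSM_SEGMENTS_CLEANUP_INTERVAL_HOURS: "6"
```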

Thanks!