I would like to inquire about storage space occupied by Weaviate when embedding txt file

Description

Hello,
I am measuring how much Weaviate's storage grows when I embed txt files. Embedding a 1 MB txt file increases Weaviate's storage by about 6 to 10 MB, and uploading a 4.2 MB txt file increases it by 21 to 115 MB.

I tested with the same embedding model, the same chunk size, and the same txt files, but the storage increase after embedding is not consistent between runs.

Is there a way to calculate the storage space that a file's contents will occupy once they are embedded in Weaviate, or is there any reference material on this?

Server Setup Information

  • Weaviate Server Version:
  • Deployment Method:
  • Multi Node? Number of Running Nodes:
  • Client Language and Version: Python 3.10

Any additional Information

Hi @pu007!

Welcome to our community :hugs:

There are a lot of variables in play for this calculation.

For example, how many dimensions do these embeddings have? This has a big impact, as Weaviate stores those dimensions on disk so that they are available when it restarts.
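To give a feel for how dimensions drive disk usage, here is a rough back-of-the-envelope sketch. It only estimates the raw vector payload, not total disk usage (indexes, logs, and the stored text come on top of it). The function names, the 1,000-character chunk size, the 1536 dimensions, and the 4 bytes per value (32-bit floats) are all assumptions for illustration; plug in your own values.

```python
import math


def estimate_num_chunks(text_bytes: int, chunk_size_bytes: int) -> int:
    """Approximate chunk count for non-overlapping, fixed-size chunks (assumption)."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk_size_bytes must be positive")
    return math.ceil(text_bytes / chunk_size_bytes)


def estimate_raw_vector_bytes(num_chunks: int, dimensions: int,
                              bytes_per_value: int = 4) -> int:
    """Raw vector payload only: one vector per chunk, 4 bytes per dimension (assumed float32)."""
    return num_chunks * dimensions * bytes_per_value


# Hypothetical numbers: a 1 MB file, 1,000-byte chunks, 1536-dimensional embeddings.
chunks = estimate_num_chunks(1_000_000, 1_000)
raw_bytes = estimate_raw_vector_bytes(chunks, 1536)
print(f"{chunks} chunks, about {raw_bytes / 1_000_000:.1f} MB of raw vectors")
```

With these made-up values, the vectors alone are several times larger than the source text, which is why the embedding dimensionality and chunk size matter so much.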

Also, are you using any compression algorithm, such as PQ? This also has an impact, since Weaviate will then store not only the original vectors but also the trained (compressed) ones.

On top of that, have you done any ingestion recently? I ask because a cleanup process runs under the hood that deals with write-ahead logs and commit logs. This is probably what causes the variation in disk usage you are seeing.

For example, when you delete objects (and not the entire collection), Weaviate at first only marks them as deleted. Because deleting an object is computationally expensive, the objects are actually removed later, when this cleanup cycle runs.
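Since background cleanup can change disk usage over time, one way to get comparable numbers is to measure the data directory several times after an import and wait until the size stops changing. Below is a minimal sketch using only the Python standard library. The `/var/lib/weaviate` path and the 30-second interval are assumptions; use whatever your `PERSISTENCE_DATA_PATH` points to.

```python
import os
import time


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file may vanish between listing and stat,
                # e.g. when a log segment is cleaned up in the background.
                continue
    return total


if __name__ == "__main__":
    data_path = "/var/lib/weaviate"  # assumption: adjust to your deployment
    for _ in range(5):
        size_mb = directory_size_bytes(data_path) / 1_000_000
        print(f"{time.strftime('%H:%M:%S')}  {size_mb:.1f} MB")
        time.sleep(30)  # arbitrary interval between samples
```

Comparing the readings taken once the size has settled should give a steadier picture than a single measurement taken right after ingestion.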

Unfortunately there isn’t a formula to calculate the storage, but I believe that knowing those details will give you a better understanding of this :slight_smile:

Please let us know if this helps!

thanks!
