Weaviate Disk Usage Question

Hi, I’m trying to use Weaviate for large-scale text data, but its actual disk usage is much larger than I expected.

I stored about 50GB of artificial text as a test, with no extra metadata per document. The total number of chunks is 10,436,101, and each vector has 1024 dimensions. As far as I know, each vector needs about 2 * vector_size bytes, so each vector takes 1024 * 2 * 8 bytes, and with 10,436,101 chunks the prediction is about 170GB.
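For reference, here is the arithmetic as a quick sanity check (a minimal sketch; the 8 bytes per dimension and the 2x factor are just the assumptions from the estimate above):

```python
# Back-of-the-envelope check of the vector storage prediction above.
num_chunks = 10_436_101   # total number of chunks
dims = 1024               # vector dimensions
bytes_per_dim = 8         # assumed in the estimate above
overhead = 2              # the "2 * vector_size" rule of thumb

bytes_per_vector = dims * bytes_per_dim * overhead     # 16,384 bytes
total_gb = num_chunks * bytes_per_vector / 1e9         # ~171 GB

print(f"{bytes_per_vector} bytes per vector -> {total_gb:.0f} GB total")
```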

However, the insert failed due to lack of storage. We checked the Weaviate data directory and found about 380GB in use, with property_text and property_text_searchable being the largest folders. Since we only inserted 50GB of text, 380GB is far more than we expected.
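For anyone who wants to reproduce the check, this is roughly how we measured per-folder usage (a sketch; `/var/lib/weaviate` is an assumed PERSISTENCE_DATA_PATH, and depending on your layout the property folders may sit deeper inside collection/shard directories):

```python
from pathlib import Path

# Assumed persistence path; adjust to your PERSISTENCE_DATA_PATH.
DATA_ROOT = Path("/var/lib/weaviate")

def dir_size(path: Path) -> int:
    """Total size in bytes of all files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

# Report each top-level folder (e.g. property_text_searchable) by size.
sizes = sorted(
    ((dir_size(p), p.name) for p in DATA_ROOT.iterdir() if p.is_dir()),
    reverse=True,
)
for size, name in sizes:
    print(f"{size / 1e9:8.1f} GB  {name}")
```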

Could I get a brief explanation of how the storage is being used?

hi @jaehyoyoo !!

The artificially generated content can contribute to the size of the index, as it may contain many unique words that each appear in only a few documents, so the inverted index ends up holding a huge number of small posting lists along with their per-term overhead.

We are working on a new, more efficient implementation called BlockMax WAND-based BM25, which will be released as experimental in 1.29.

The current on-disk format is not very efficient and wastes a lot of disk space on most properties.
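In the meantime, if you don't need keyword (BM25) or filter queries on the raw text property, you can avoid most of the inverted-index overhead by disabling those indexes when you create the collection. A minimal sketch with the Python v4 client; the collection and property names here are just examples:

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

# "Document" / "text" are example names; vectors are provided by the
# application, so no vectorizer module is configured.
client.collections.create(
    "Document",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(
            name="text",
            data_type=DataType.TEXT,
            index_searchable=False,  # skip the BM25 inverted index
            index_filterable=False,  # skip the filterable index too
        ),
    ],
)

client.close()
```

Note that with these settings you can no longer run BM25 or filtered queries on that property, so only do this if vector search alone covers your use case.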

Let me know if this helps.

Thanks!