How much resource is needed for 30M 1536d vector records index with bm25 index?

And I need to use scalar fields (at least 3 fields) to handle access control.

Hi @shadowlin !! Welcome to our community :hugs:

Here we have our docs regarding resource planning:

Here you can check the calculation needed.

But to make things easier, we can use Verba for answering this :wink:

this was the answer to:
How much memory do I need for 30 million objects with 1536 dimensions?

To calculate the memory needed for 30 million objects with 1536 dimensions, we can use the information provided about the memory requirements for vectors in Weaviate.

Each vector dimension is a float32, which requires 4 bytes. So for a single vector with 1536 dimensions, the memory required is:

1536 dimensions * 4 bytes/dimension = 6144 bytes/vector

For 30 million vectors, the memory required is:

30,000,000 vectors * 6144 bytes/vector = 184,320,000,000 bytes

Convert bytes to gigabytes (1 GB = 2^30 bytes):

184,320,000,000 bytes / (2^30) = 171.66 GB ≈ 171.7 GB

This calculation only accounts for the raw vector data. However, the actual memory usage will be higher due to the overhead of the data structures used to store the vectors, the graph representation for the k-nearest neighbors, and the garbage collection overhead in Go.

The rule of thumb provided in the context is to estimate memory usage as twice the memory footprint of all vectors. Therefore, using this rule, we would estimate:

2 * 171.7 GB = 343.3 GB

This estimate does not include the memory required for the maxConnections setting, which adds additional memory usage per vector. If we assume maxConnections is 64, and each connection uses 8 bytes, the additional memory required for the graph would be:

30,000,000 vectors * 64 connections/vector * 8 bytes/connection = 15,360,000,000 bytes

Convert bytes to gigabytes:

15,360,000,000 bytes / (2^30) = 14.305114746 GB β‰ˆ 14.3 GB

Adding this to the previous estimate, we get:

343.3 GB (vectors) + 14.3 GB (graph) = 357.6 GB
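The whole estimate above can be sketched in a few lines of Python. Note that the 2x rule of thumb and the maxConnections=64 / 8 bytes-per-connection figures are the assumptions from this answer, not exact numbers:

```python
# Rough memory estimate for 30M 1536-d float32 vectors in an HNSW index.
# Assumptions (from the discussion above): 2x rule of thumb for vector
# overhead, maxConnections=64, 8 bytes per graph connection.

GB = 2**30  # using 1 GB = 2^30 bytes, as above

n_vectors = 30_000_000
dims = 1536
bytes_per_float = 4  # float32

raw_vectors = n_vectors * dims * bytes_per_float            # raw vector data
rule_of_thumb = 2 * raw_vectors                             # 2x footprint

max_connections = 64
bytes_per_connection = 8
graph = n_vectors * max_connections * bytes_per_connection  # HNSW graph

total = rule_of_thumb + graph

print(f"raw vectors:  {raw_vectors / GB:.1f} GB")   # -> 171.7 GB
print(f"with 2x rule: {rule_of_thumb / GB:.1f} GB") # -> 343.3 GB
print(f"graph:        {graph / GB:.1f} GB")         # -> 14.3 GB
print(f"total:        {total / GB:.1f} GB")         # -> 357.6 GB
```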

However, this still does not account for the garbage collection overhead. The GOGC environment variable affects how often garbage collection runs, and the default value of 100 means the heap size can double before garbage collection is triggered. If your application has a large heap, like in this case, you may need to adjust GOGC to prevent running out of memory.
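As a quick illustration (the right value depends entirely on your workload, so treat this as a sketch), GOGC is just an environment variable you pass to the process. Assuming you run Weaviate via Docker, lowering it from the Go default of 100 might look like:

```shell
# Assumption: Weaviate running in Docker; GOGC=100 is the Go runtime default.
# A lower value triggers garbage collection more often, keeping peak heap
# closer to the live data size at the cost of extra CPU.
docker run -d -e GOGC=50 -p 8080:8080 semitechnologies/weaviate:1.23.0
```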

In summary, you would need at least 357.6 GB of memory to hold 30 million 1536-dimensional vectors in memory, according to the rule of thumb plus the additional memory for the graph. Due to garbage collection and other overheads, the actual memory required could be significantly higher, so it's important to run empirical tests to determine the exact memory requirements for your specific use case.

That is a lot of memory. That said, you can also look into Product Quantization (PQ) to reduce the memory footprint:

In our recent 1.23 version, we have a new feature, Auto PQ, that will help you with this process.
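To give a rough sense of the savings, here is a sketch assuming a common PQ setup of one segment per dimension with 1-byte (8-bit) codes; the actual reduction depends on your PQ settings, and the compressed index still needs some extra memory for the codebooks:

```python
# Back-of-the-envelope PQ savings for the vector data itself.
# Assumption: 1536 segments x 1 byte per segment (8-bit PQ codes).

GB = 2**30

n_vectors = 30_000_000
dims = 1536

uncompressed = n_vectors * dims * 4  # float32: 4 bytes per dimension
compressed = n_vectors * dims * 1    # PQ: ~1 byte per segment

print(f"uncompressed: {uncompressed / GB:.1f} GB")       # -> 171.7 GB
print(f"with PQ:      {compressed / GB:.1f} GB")         # -> 42.9 GB
print(f"reduction:    {uncompressed // compressed}x")    # -> 4x
```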

Let me know if this helps :slight_smile:

Thank you for the quick reply.
@DudaNogueira
Is there any data for latency and QPS wise for such amount of 1536d vectors?
The benchmark doc only has datasets of 960 dimensions or lower.

I don't think there is. :frowning: