How to get the document count and disk size of Weaviate volumes and collections?

Description

I created two collections using default parameters in a Weaviate Docker container. How do I get:

  1. The number of documents in each collection
  2. The disk usage for weaviate in total
  3. The disk usage for weaviate data for each collection

I tried this:

sudo du -sh $(docker volume inspect --format '{{ .Mountpoint }}' weaviate_weaviate_data)

where weaviate_weaviate_data is the name of my Weaviate volume. However, it reports a size of 26M, which is about 1000x smaller than what I am expecting (based on vector indexing I've done with the same dataset in multiple other vector DBs).

In the local volume storage directory I can see the two separate collections; however, they are also about 1000x smaller than expected.
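To get per-collection numbers for (3), I also summed the collection directory sizes with a quick standard-library script (the mountpoint path below is just what docker volume inspect reports on my machine, and I am assuming each collection gets its own subdirectory under it):

from pathlib import Path

# Assumed mountpoint of the weaviate_weaviate_data volume (reading it needs root).
data_root = Path("/var/lib/docker/volumes/weaviate_weaviate_data/_data")

for coll_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
    # Sum the sizes of all files under this collection's directory.
    size_bytes = sum(f.stat().st_size for f in coll_dir.rglob("*") if f.is_file())
    print(f"{coll_dir.name}: {size_bytes / 1024**2:.1f} MiB")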

When I query the collections, I can see that there are documents inside, so the indexing appears to have been successful.

Server Setup Information

  • Weaviate Server Version:
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python 4.5.3

Any additional Information

Maybe there are zero vectors stored, and the index is solely based on BM25 information. How do I find out?
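To check, I am planning to fetch a single object together with its vector (a minimal sketch; I am assuming include_vector=True on fetch_objects returns the stored vector alongside the object):

client.connect()
try:
    collection = client.collections.get("collection1")
    # Fetch one object and include its stored vector in the response.
    response = collection.query.fetch_objects(limit=1, include_vector=True)
    for obj in response.objects:
        # An empty or missing vector here would suggest nothing was vectorized.
        print(obj.uuid, obj.vector)
finally:
    client.close()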

This is how I created the collection:

...
client.connect()

try:
    collection1 = client.collections.create(
        name="collection1",
        properties=[
            wvcc.Property(name="URL", data_type=wvcc.DataType.TEXT, skip_vectorization=True),
            wvcc.Property(name="CONTENT", data_type=wvcc.DataType.TEXT, skip_vectorization=False),
        ],
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(),
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            passage_inference_url="http://t2v-transformers-passage:8080",
            query_inference_url='http://t2v-transformers-query:8080'
        ),
        reranker_config=wvc.config.Configure.Reranker.transformers(),
    )

    with collection1.batch.dynamic() as batch:
        for data_row in doc_objs:
            batch.add_object(
                properties=data_row,
            )

finally:
    client.close()

Using this hack I was able to see that there are fewer than 1000 docs in the collection. However, I was expecting to see on the order of 1 million.

client.connect()
try:
    collection = client.collections.get("collection1")
    count = 0
    for item in collection.iterator():
        count += 1
        if count > 100000:
            break
finally:
    client.close()
print(count)
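A less hacky way to get the count is probably the aggregate API (assuming collection.aggregate.over_all(total_count=True) does what I think it does):

client.connect()
try:
    collection = client.collections.get("collection1")
    # Ask the server for the total object count instead of iterating client-side.
    result = collection.aggregate.over_all(total_count=True)
    print(result.total_count)
finally:
    client.close()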

Originally, when I populated the vector DB, I used the code below. It took a few hours, which is in line with the timing I've observed with other vector DBs on the same dataset.

try:
    collection = client.collections.create(
        name="collection1",
        properties=[
            wvcc.Property(name="URL", data_type=wvcc.DataType.TEXT, skip_vectorization=True),
            wvcc.Property(name="CONTENT", data_type=wvcc.DataType.TEXT, skip_vectorization=False),
        ],
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(),
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            passage_inference_url="http://t2v-transformers-passage:8080",
            query_inference_url='http://t2v-transformers-query:8080'
        ),
        reranker_config=wvc.config.Configure.Reranker.transformers(),
    )
    with collection.batch.dynamic() as batch:
        for data_row in doc_objs:
            batch.add_object(
                properties=data_row,
            )
finally:
    client.close()

Hi!

Maybe the majority of those objects were not added and you didn’t catch those errors?

It is good practice to do error handling on batch imports. Here is how:
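A minimal sketch (using the batch object's error counter and the collection's failed_objects list):

with collection.batch.dynamic() as batch:
    for data_row in doc_objs:
        batch.add_object(properties=data_row)
        # Stop early if too many objects are failing.
        if batch.number_errors > 10:
            print("Too many batch errors, aborting import")
            break

failed_objects = collection.batch.failed_objects
if failed_objects:
    print(f"{len(failed_objects)} objects failed to import")
    print("First failure:", failed_objects[0].message)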

Let me know if this helps.

Thanks!

I added some of the error-handling code:

    with collection.batch.dynamic() as batch:
        for data_row in doc_objs:
            if batch.number_errors > 100:
                print(f"WARNING: ERROR COUNT = {batch.number_errors}")
            batch.add_object(
                properties=data_row,
            )

    failed_objs_a = collection.batch.failed_objects
    failed_refs_a = collection.batch.failed_references

I tried to insert 10000 records. Only 1000 records were inserted, and at the end of insertion, batch.number_errors is 9000.

The print statement for “WARNING: ERROR COUNT” does not execute.

The first error object is:

print(failed_objs_a[0])

ErrorObject(message="WeaviateBatchError('Query call with protocol GRPC batch failed with message Deadline Exceeded.')", object_=_BatchObject(collection= ...

Based on advice someone gave for a similar error, I added this to the connect_to_custom() arguments:

additional_config=weaviate.config.AdditionalConfig(timeout=(300, 300))  # (connect timeout, read timeout)

However, it still fails with errors for 9000 of the docs.
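For reference, the connection now looks roughly like this (hosts and ports are placeholders for my setup; I am assuming AdditionalConfig/Timeout from weaviate.classes.init is the right way to raise the timeouts):

import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

client = weaviate.connect_to_custom(
    http_host="localhost",   # placeholder host/ports for my Docker setup
    http_port=8080,
    http_secure=False,
    grpc_host="localhost",
    grpc_port=50051,
    grpc_secure=False,
    additional_config=AdditionalConfig(
        # init / query / insert timeouts in seconds
        timeout=Timeout(init=30, query=300, insert=300)
    ),
)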

Is there a batch size limit of 1000? Should I split my collection into smaller batches, say 500 passages each, and insert them one at a time? I assumed that collection.batch.dynamic() would automatically create batches and insert them. Instead, it can spend 2 hours calculating embeddings and then fail for 99.9% of the objects…

It appears to be successful with fixed_size batching:

    with collection.batch.fixed_size(batch_size=100) as batch:

This has number_errors=0 and len(failed_objects)=0.
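For completeness, the full loop that works for me (roughly the same error checks as before, just with fixed-size batches):

with collection.batch.fixed_size(batch_size=100) as batch:
    for data_row in doc_objs:
        batch.add_object(properties=data_row)
        # Report as soon as any error shows up.
        if batch.number_errors > 0:
            print(f"Errors so far: {batch.number_errors}")

failed_objs = collection.batch.failed_objects
print(f"failed objects: {len(failed_objs)}")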

Hi @moruga123!

Interesting. This error message means that the client tried to pass too many objects in a single request. However, dynamic batching should take care of that limit.

I will loop in our dev team; maybe they can help us here.