Description
I created two collections using default parameters and a weaviate docker container. How do I get:
- The number of documents in each collection
- The disk usage for weaviate in total
- The disk usage for weaviate data for each collection
I tried this:
sudo du -sh $(docker volume inspect --format '{{ .Mountpoint }}' weaviate_weaviate_data)
where weaviate_weaviate_data is the name of my Weaviate volume. However, it reports a size of 26M, which is about 1000x smaller than what I am expecting (based on vector indexing I’ve done with the same dataset in several other vector DBs).
In the local volume storage directory I can see the two separate collections, but they too are about 1000x smaller than expected.
When I query the collections, I can see that there are documents inside so the indexing appears to be successful.
Server Setup Information
- Weaviate Server Version:
- Deployment Method: docker
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: Python (weaviate-client 4.5.3)
Any additional Information
Maybe there are zero vectors stored, and the index is solely based on BM25 information. How do I find out?
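One way to check (a minimal sketch, assuming the default local connection and the collection name from below) is to fetch a single object with its vector included:

import weaviate

client = weaviate.connect_to_local()  # assumes the default local docker setup
try:
    collection = client.collections.get("collection1")
    # include_vector=True returns the stored vector(s) alongside the object
    response = collection.query.fetch_objects(limit=1, include_vector=True)
    if response.objects:
        # In client v4 this is a dict of named vectors, e.g. {"default": [...]}
        print(response.objects[0].vector)
finally:
    client.close()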
This is how I created the collection:
...
client.connect()
try:
    collection1 = client.collections.create(
        name="collection1",
        properties=[
            wvcc.Property(name="URL", data_type=wvcc.DataType.TEXT, skip_vectorization=True),
            wvcc.Property(name="CONTENT", data_type=wvcc.DataType.TEXT, skip_vectorization=False),
        ],
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(),
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            passage_inference_url="http://t2v-transformers-passage:8080",
            query_inference_url="http://t2v-transformers-query:8080",
        ),
        reranker_config=wvc.config.Configure.Reranker.transformers(),
    )
    with collection1.batch.dynamic() as batch:
        for data_row in doc_objs:
            batch.add_object(
                properties=data_row,
            )
finally:
    client.close()
Using this hack I was able to see that there are fewer than 1000 docs in the collection. However, I was expecting to see on the order of 1 million.
client.connect()
try:
    collection = client.collections.get("collection1")
    count = 0
    for item in collection.iterator():
        count += 1
        if count > 100000:
            break
finally:
    client.close()
print(count)
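A more direct way to get the count, rather than iterating, is an aggregate query (a sketch reusing the connected client from above):

collection = client.collections.get("collection1")
# total_count=True asks Weaviate for the number of objects in the collection
result = collection.aggregate.over_all(total_count=True)
print(result.total_count)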
Originally, when I populated the vector DB, I used this code. It took a few hours, in line with what I’ve observed with other vector DBs on the same dataset.
try:
    collection = client.collections.create(
        name="collection1",
        properties=[
            wvcc.Property(name="URL", data_type=wvcc.DataType.TEXT, skip_vectorization=True),
            wvcc.Property(name="CONTENT", data_type=wvcc.DataType.TEXT, skip_vectorization=False),
        ],
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(),
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            passage_inference_url="http://t2v-transformers-passage:8080",
            query_inference_url="http://t2v-transformers-query:8080",
        ),
        reranker_config=wvc.config.Configure.Reranker.transformers(),
    )
    with collection.batch.dynamic() as batch:
        for data_row in doc_objs:
            batch.add_object(
                properties=data_row,
            )
finally:
    client.close()
Hi!
Maybe the majority of those objects were not added and you didn’t catch those errors?
It is good practice to do error handling. Here is a sketch of how (the collection variable and doc_objs are assumed from your code above):
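with collection.batch.dynamic() as batch:
    for data_row in doc_objs:
        batch.add_object(properties=data_row)
        # Stop the import early if too many objects have failed so far
        if batch.number_errors > 10:
            print("Batch import stopped due to excessive errors.")
            break

failed_objects = collection.batch.failed_objects
if failed_objects:
    print(f"Number of failed imports: {len(failed_objects)}")
    print(f"First failed object: {failed_objects[0]}")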
Let me know if this helps.
Thanks!
I added some of the error-handling code:
with collection.batch.dynamic() as batch:
    for data_row in doc_objs:
        if batch.number_errors > 100:
            print(f"WARNING: ERROR COUNT = {batch.number_errors}")
        batch.add_object(
            properties=data_row,
        )
failed_objs_a = collection.batch.failed_objects
failed_refs_a = collection.batch.failed_references
I tried to insert 10000 records. Only 1000 records were inserted, and at the end of the insertion batch.number_errors was 9000.
The “WARNING: ERROR COUNT” print statement never executed.
The first error object is:
print(failed_objs_a[0])
ErrorObject(message="WeaviateBatchError('Query call with protocol GRPC batch failed with message Deadline Exceeded.')", object_=_BatchObject(collection= ...
Based on advice someone had for a similar error, I added this to the connect_to_custom() arguments:
additional_config=weaviate.config.AdditionalConfig(timeout=(300, 300))  # (connect timeout, read timeout)
However, it still fails with errors for 9000 of the docs.
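For reference, a connection call with raised timeouts can look like this (a sketch assuming the default local ports; everything except AdditionalConfig and Timeout is specific to my setup):

import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

client = weaviate.connect_to_custom(
    http_host="localhost", http_port=8080, http_secure=False,
    grpc_host="localhost", grpc_port=50051, grpc_secure=False,
    # Raise the per-call timeouts (seconds), since import-time vectorization is slow
    additional_config=AdditionalConfig(timeout=Timeout(init=30, query=300, insert=300)),
)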
Is there a batch size limit of 1000? Should I chunk my collection into smaller batches, say 500 passages per batch, and insert them separately? I assumed that collection.batch.dynamic() would automatically size and submit the batches. Instead, it can spend 2 hours calculating embeddings and then fail for 99.9% of the objects…
It appears to be successful with fixed_size batching:
with collection.batch.fixed_size(batch_size=100) as batch:
This has number_errors=0 and len(failed_objects)=0.
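The full working import loop, for anyone else who hits this (a sketch; doc_objs is the same list of property dicts as above):

# Fixed-size batches of 100 objects avoid the gRPC "Deadline Exceeded" errors
with collection.batch.fixed_size(batch_size=100) as batch:
    for data_row in doc_objs:
        batch.add_object(properties=data_row)

print(f"Errors: {batch.number_errors}, failed: {len(collection.batch.failed_objects)}")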
Hi @moruga123!
Interesting. This error message means that the client tried to pass too many objects in a single request. However, the dynamic batch sizing should take care of that limit.
I will loop in our dev team; maybe they can help us here.