Weaviate Holding Locks on EFS Files, Causing "disk quota exceeded" Errors

Description

We are ingesting and chunking data from files and creating tens of thousands of tenants under a single Weaviate class. Our process runs fine for a short while, but Weaviate eventually starts throwing "disk quota exceeded" errors. AWS support has told me that Weaviate is creating locks on EFS files and not releasing them.
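
For context, our ingestion pattern looks roughly like this (a simplified sketch, not our production code; the endpoint, class name, tenant names, and property are placeholders):

import weaviate
from weaviate import Tenant

client = weaviate.Client("http://weaviate.internal:8080")  # placeholder endpoint

# A single multi-tenant class; tens of thousands of tenants live under it.
client.schema.create_class({
    "class": "CatalogItem",  # placeholder class name
    "multiTenancyConfig": {"enabled": True},
})

# Tenants are added in batches as new files arrive.
client.schema.add_class_tenants(
    class_name="CatalogItem",
    tenants=[Tenant(name=f"tenant-{i}") for i in range(1000)],
)

# Each chunk is saved as an object in its tenant's shard.
client.data_object.create(
    data_object={"text_chunk": "chunk contents here"},
    class_name="CatalogItem",
    tenant="tenant-0",
)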

{"action":"hnsw_commit_log_condensing","error":"open commit log to be condensed: open /var/lib/weaviate/catalogsearch/CatalogItem_BERT_MultiLingual_e5/main.hnsw.commitlog.d/1736210947: disk quota exceeded","level":"error","msg":"hnsw commit log maintenance (condensing) failed","time":"2025-01-07T03:38:27Z"}
{"action":"hnsw_commit_log_maintenance","error":"stat /var/lib/weaviate/catalogsearch/CatalogItem_BERT_MultiLingual_e5/main.hnsw.commitlog.d/1736212133: use of closed file","level":"error","msg":"hnsw commit log maintenance failed","time":"2025-01-07T03:38:28Z"}

UnexpectedStatusCodeError: Creating object! Unexpected status code: 500, with response body: {'error': [{'message': 'put object: import into index contractexcellence: put local object: shard="e0fc9bf8b418ce94ce0614609813b123f84dd9d4fc2c4a11c6b2214c6a0333ff": flush prop length tracker to disk: open /var/lib/weaviate/contractexcellence/e0fc9bf8b418ce94ce0614609813b123f84dd9d4fc2c4a11c6b2214c6a0333ff/proplengths.tmp: disk quota exceeded'}]}.
Traceback (most recent call last):
  File "/var/task/lib/python3.9/site-packages/gpai_document_ragification/handlers/rag_creation_event_lambda_handler.py", line 10, in handle_s3_creation_event
    to_return = HandlerModule.get_document_create_handler().handle_s3_event(event)
  File "/var/task/lib/python3.9/site-packages/gpai_document_ragification/handlers/document_handler.py", line 32, in handle_s3_event
    return self._handle_event(state_machine_input, document_rag_record)
  File "/var/task/lib/python3.9/site-packages/gpai_document_ragification/handlers/document_create_handler.py", line 53, in _handle_event
    self.save_chunks_to_vector_db(
  File "/var/task/lib/python3.9/site-packages/gpai_document_ragification/handlers/document_create_handler.py", line 120, in save_chunks_to_vector_db
    self.weaviate_proxy.save_object(
  File "/var/task/lib/python3.9/site-packages/gpai_document_ragification/proxy/weaviate_proxy.py", line 26, in save_object
    self.weaviate_client.data_object.create(
  File "/var/task/lib/python3.9/site-packages/weaviate/data/crud_data.py", line 160, in create
    raise UnexpectedStatusCodeException("Creating object", response)

{"action":"hnsw_commit_log_maintenance","error":"stat /var/lib/weaviate/catalogsearch/CatalogItem_BERT_MultiLingual_e5/main.hnsw.commitlog.d/1736212133: use of closed file","level":"error","msg":"hnsw commit log maintenance failed","time":"2025-01-07T03:36:51Z"}
{"action":"lsm_compaction","class":"Contractexcellence","error":"open /var/lib/weaviate/contractexcellence/9994450152d1268c6f6e6c64c10c31168b4d880906a69ce2bc6c410deeb3b22d/lsm/property_text_chunk/segment-1736207677691312573_1736210936799682015.db.tmp: disk quota exceeded","index":"contractexcellence","level":"error","msg":"compaction failed","path":"/var/lib/weaviate/contractexcellence/9994450152d1268c6f6e6c64c10c31168b4d880906a69ce2bc6c410deeb3b22d/lsm/property_text_chunk","shard":"9994450152d1268c6f6e6c64c10c31168b4d880906a69ce2bc6c410deeb3b22d","time":"2025-01-07T03:36:51Z"}

Server Setup Information

  • Weaviate Server Version: 1.24.8
  • Deployment Method: EKS on AWS
  • Multi Node? Number of Running Nodes: 1 ECS Node
  • Client Language and Version: PythonV3
  • Multitenancy?: Yes

Any additional Information

Hi @JLiz2803!!

EBS volumes provide low-latency, high-performance block storage optimized for fast reads and writes. Weaviate needs quick access to vector data and other storage, where high throughput and low latency are crucial, especially under query-heavy workloads. EFS, on the other hand, is a network file system, which inherently has higher latency due to network overhead and does not meet those performance requirements as efficiently. EBS volumes also offer consistent, predictable performance, which is critical for a database like Weaviate that relies on intensive I/O.

Can you try this same scenario on EBS?

Thanks!

Hi Duda,
Thanks for your reply. We were using EBS in the beginning, but due to scaling needs we switched over to EFS to take advantage of the autoscaling that EFS offers and EBS does not. We did so per the recommendation in the Weaviate documentation: Kubernetes | Weaviate

It would now be a big and costly effort to switch back to EBS, as we would have to backfill millions of records and, given the static nature of EBS, over-provision our storage to accommodate our needs.

Is it Weaviate's recommendation not to use EFS, but rather EBS?

I was able to confirm with the AWS EFS team that Weaviate, for some reason, is opening 65K+ files and holding locks on them on EFS. This pushes us up against EFS limits and is causing our Weaviate instance to start failing.
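
For anyone who wants to check this on their own node, counting the descriptors the Weaviate process holds open is enough to see the problem (a hypothetical snippet; assumes Linux, access to the node, and that you know the Weaviate process ID):

import os

WEAVIATE_PID = 1234  # placeholder; find the real PID with e.g. `pidof weaviate`

# Every entry in /proc/<pid>/fd is one open file descriptor.
fd_dir = f"/proc/{WEAVIATE_PID}/fd"
print(f"weaviate is holding {len(os.listdir(fd_dir))} open file descriptors")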

Hi!

In cases where the machine running Weaviate hits a limit, in this case the EFS lock limit, the recommended way around it is to use more nodes.

While our team is constantly working on improving Weaviate's resource usage, adding more nodes and distributing the load across them is the way to go.

Also, tenant offloading can help here, as you can offload tenants that are not being actively used.
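
On 1.24.x there is no full offload-to-cloud-storage yet, but you can already deactivate idle tenants so that their shards are shut down and their files released. A minimal sketch with the v3 client (placeholders for the endpoint, class, and tenant names; assumes a recent 3.x weaviate-client that exports Tenant and TenantActivityStatus):

import weaviate
from weaviate import Tenant, TenantActivityStatus

client = weaviate.Client("http://localhost:8080")  # placeholder endpoint

# Mark tenants that are not actively queried as inactive (COLD); Weaviate
# shuts down their shards, releasing the files they were holding open.
client.schema.update_class_tenants(
    class_name="CatalogItem",  # placeholder class name
    tenants=[Tenant(name="tenant-0", activity_status=TenantActivityStatus.COLD)],
)

Reactivating a tenant is the same call with TenantActivityStatus.HOT; fully offloading tenant data to cloud storage is only available in newer Weaviate versions.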

Upgrading will also help, as you can leverage the latest improvements in file and memory management.