Weaviate using tons of CPU during "tombstone_cleanup_begin"

Description

I am running a local Weaviate instance in a Docker container on a Mac. I have been testing it for the past week and have >1.2M documents with embedding vectors, using the "bring your own vectors" approach. What I noticed is that one day, after starting the Weaviate instance, CPU usage sat in the single digits for roughly 4-5 minutes and then ramped up to 500% (I have 12 cores). Looking at the log, I strongly suspect it is running the tombstone cleanup ("tombstone_cleanup_begin", see below for more details). I have never seen it use this much CPU or keep at it for this long (still waiting...). Is this normal? I did make a lot of deletions. This is only a dev environment; we eventually expect to run on Weaviate Cloud or roll our own on GCP, but I would like to understand what is causing this and whether it could disrupt service in a production deployment. Any advice or debugging tips would be appreciated. (We are new and still poking around in dev, and only touching a little on scalability matters.)

Server Setup Information

  • Weaviate Server Version: 1.27.0
  • Deployment Method: Docker Desktop 4.34.3 (170107) on macOS 14.2.1 (23C71)
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: ?
  • Multitenancy?: No

Any additional Information

2024-10-29 21:49:29 {"action":"lsm_recover_from_active_wal","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","class":"Listing_Text","index":"listing_text","level":"warning","msg":"empty write-ahead-log found. Did weaviate crash prior to this or the tenant on/loaded from the cloud? Nothing to recover from this file.","path":"/var/lib/weaviate/listing_text/48GOdFIN20rh/lsm/property_baths_searchable/segment-1730088224356409675","shard":"48GOdFIN20rh","time":"2024-10-30T01:49:29Z"}
2024-10-29 21:49:30 {"action":"hnsw_prefill_cache_async","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-30T01:49:30Z","wait_for_cache_prefill":false}
2024-10-29 21:49:30 {"build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","level":"info","msg":"Completed loading shard listing_text_48GOdFIN20rh in 1.691448875s","time":"2024-10-30T01:49:30Z"}
2024-10-29 21:49:33 {"action":"hnsw_prefill_cache_async","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-30T01:49:33Z","wait_for_cache_prefill":false}
2024-10-29 21:49:33 {"build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","level":"info","msg":"Completed loading shard listing_image_48d5dFITJv7f in 4.504469669s","time":"2024-10-30T01:49:33Z"}
2024-10-29 21:49:57 {"action":"hnsw_vector_cache_prefill","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","count":401613,"index_id":"main","level":"info","limit":2000000,"msg":"prefilled vector cache","time":"2024-10-30T01:49:57Z","took":26793010096}
2024-10-29 21:50:37 {"action":"hnsw_vector_cache_prefill","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","count":1221520,"index_id":"main","level":"info","limit":2000000,"msg":"prefilled vector cache","time":"2024-10-30T01:50:37Z","took":64299448321}
2024-10-29 21:54:29 {"action":"tombstone_cleanup_begin","build_git_commit":"6c571ff","build_go_version":"go1.22.8","build_image_tag":"","build_wv_version":"","class":"Listing_Image","level":"info","msg":"class Listing_Image: shard 48d5dFITJv7f: starting tombstone cleanup","shard":"48d5dFITJv7f","time":"2024-10-30T01:54:29Z","tombstones_in_cycle":22162,"tombstones_total":22162}

Focusing on the last two lines: CPU was low until tombstone_cleanup_begin, when it ramped up to 500%, and after >20 minutes it is still apparently working on it, since I don't see a tombstone_cleanup_complete entry.

hi @00.lope.naughts !!

You can tune how aggressive the tombstone cleanup cycles are:

You will want to tweak the TOMBSTONE_DELETION_* environment variables.
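For example, something like this in your docker-compose.yml. This is only a sketch: the variable names are the TOMBSTONE_DELETION_* settings documented in the environment variables reference for recent Weaviate versions, and the values here are purely illustrative, not recommendations:

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.27.0
    environment:
      # Cap how many CPU cores the async tombstone cleanup may use
      # (the default is half of the available cores).
      TOMBSTONE_DELETION_CONCURRENCY: "2"
      # Skip a cleanup cycle if fewer than this many tombstones have accumulated.
      TOMBSTONE_DELETION_MIN_PER_CYCLE: "10000"
      # Upper bound on how many tombstones are processed in a single cycle.
      TOMBSTONE_DELETION_MAX_PER_CYCLE: "500000"
```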

This blog post can also help you understand how Weaviate uses GC and runs these cycles:

Let me know if that helps!

Thanks, I will check out those pages. I looked back at older logs and found that previous tombstone delete operations took only a few seconds to a few minutes, even with >10k tombstones, so the one that got stuck was not out of the norm in terms of size. What I remember is that I shut down the container, maybe too soon after a large delete operation, or maybe I was unlucky and did so during a preset cleanup cycle/schedule. I ended up exporting all the data, deleting the instance, and creating a new one. I suspect something got corrupted; I still have the old collections saved on disk.
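In case it helps someone else, this is roughly what the export looked like, sketched here with the Python client v4 cursor iterator (I'm assuming the v4 client for illustration since our actual client code isn't shown; Listing_Text is one of my collections, the output file name is a placeholder, and the properties are assumed to be JSON-serializable):

```python
import json

import weaviate

# Connect to the local dockerized Weaviate on the default ports.
client = weaviate.connect_to_local()

try:
    # Placeholder: repeat for each collection you want to export.
    collection = client.collections.get("Listing_Text")

    # The iterator pages through every object via the cursor API,
    # so it also works for large collections (>1M objects).
    with open("listing_text_export.jsonl", "w") as f:
        for obj in collection.iterator(include_vector=True):
            f.write(json.dumps({
                "uuid": str(obj.uuid),
                "properties": obj.properties,  # assumed JSON-serializable
                "vector": obj.vector,          # bring-your-own vectors come back here
            }) + "\n")
finally:
    client.close()
```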

So far, I haven't seen this happen again. Perhaps I shouldn't have just stopped the Weaviate container, but "paused" it instead.
