swetag
July 22, 2024, 4:19am
1
Hi,
I batch uploaded around 20k objects and backed them up to S3, which worked fine. I can see the backup JSON file and the node1 folder in S3.
Moments later I deleted all the data and batch uploaded the objects again, which also worked fine, but this time while creating the backup I am seeing a lot of these messages in my logs:
```
HTTP Request: GET http://------:8080/v1/backups/s3/{backup_id} "HTTP/1.1 200 OK"
```
When I check the Weaviate logs, there is a lot of this kind of logging:
```
"action": "backup_status", "level": "info", "msg": ""
```
Before these, I got the following error logs (in sequence):
```
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property__id/segment-1721615078510284041.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_title_searchable/segment-1721615075238438328.scratch.d: directory not empty"
"action": "lsm_compaction", "error": "write index: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_annotations_id_searchable/segment-1721615076605784103.dbcompaction.scratch.d: directory not empty"
"action": "lsm_compaction", "error": "write index: clean up previous scratch space: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_annotations_id_searchable/segment-1721615076605784103.dbcompaction.scratch.d/.nfs3f46fabddd01f71200000009: device or resource busy"
"action": "lsm_compaction", "error": "write indices: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/objects/segment-1721615326755105246.dbcompaction.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_collections/segment-1721615316392568869.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_url_searchable/segment-1721615317703263522.scratch.d: directory not empty"
```
After around 15 minutes of the same messages, my server timed out and restarted, so the backup never completed.
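For reference, the backup creation and the status polling that produces those GET requests look roughly like this with the v4 Python client (a sketch, not my exact script; the connection and backup ID are placeholders, and the import path may differ slightly between client versions):

```python
import time

import weaviate
from weaviate.backup.backup import BackupStatus, BackupStorage  # import path may vary by client version

client = weaviate.connect_to_local()  # placeholder connection

# Trigger an S3 backup of all collections (requires the backup-s3 module).
client.backup.create(
    backup_id="my-backup-id",  # placeholder ID
    backend=BackupStorage.S3,
    wait_for_completion=False,
)

# Poll the status; each call issues GET /v1/backups/s3/{backup_id},
# which is the request that shows up repeatedly in the logs above.
while True:
    status = client.backup.get_create_status(
        backup_id="my-backup-id",
        backend=BackupStorage.S3,
    )
    print(status.status)
    if status.status in (BackupStatus.SUCCESS, BackupStatus.FAILED):
        break
    time.sleep(10)

client.close()
```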
Server Setup Information
Weaviate Server Version: weaviate:1.25.6
Deployment Method: docker
Client Language and Version: python 3.10.9, weaviate-client: 4.6.5
Environment variables:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
DEFAULT_VECTORIZER_MODULE: 'none'
ENABLE_MODULES: 'backup-filesystem,backup-s3'
CLUSTER_HOSTNAME: 'node1'
QUERY_MAXIMUM_RESULTS: 1000000
DISK_USE_READONLY_PERCENTAGE: 100
DISABLE_TELEMETRY: true
DISABLE_LAZY_LOAD_SHARDS: true
Could you please guide me on how to resolve this issue? TIA.
hi @swetag !!
Welcome to our community
Nice! This looks like one of our chaos pipelines
How have you deleted those objects in step 2?
When you delete the entire collection, it’s simpler, as no object deletion calculation is needed.
If you delete the objects themselves, it will generate tombstones, and then a whole new cleanup cycle is triggered.
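For example, if your use case allows it, dropping and recreating the whole collection avoids tombstones entirely. A rough sketch with the v4 Python client (the collection name and connection are placeholders):

```python
import weaviate

client = weaviate.connect_to_local()  # adjust connection details as needed

# Deleting the whole collection removes its data without creating per-object
# tombstones; delete_many() on individual objects creates one tombstone per
# object, which is cleaned up asynchronously later.
client.collections.delete("Test")  # placeholder collection name

# Recreate the collection before re-importing (schema/vector config omitted here).
client.collections.create("Test")

client.close()
```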
This seems related to this issue:
(GitHub issue, opened 26 Jun 2024, label: bug)
### How to reproduce this bug?
Install Weaviate version 1.25.5 with the latest Helm chart on a bare-metal cluster as a single instance, then run the backup script below; the Weaviate Helm settings are also below.
### What is the expected behavior?
Backup should not be stuck in state STARTED forever.
### What is the actual behavior?
On a very small dataset (roughly 7k objects) in one of our Weaviate test instances, the backup seems to get stuck for hours. We experience this both in a single-node test instance and in our cluster test instance.
After starting the backup and noticing that it never finishes, Weaviate keeps producing this output as we check the status every 10 seconds:
```
2024-06-26T07:54:43.157558168Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:43Z","took":31544}
2024-06-26T07:54:53.185646557Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:53Z","took":25112}
2024-06-26T07:55:03.224599217Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:03Z","took":23890}
2024-06-26T07:55:13.258652773Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:13Z","took":25426}
2024-06-26T07:55:23.290172417Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:23Z","took":22835}
2024-06-26T07:55:33.332576022Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:33Z","took":25835}
```
Manually checking the backup status with curl confirms that the backup is indeed still stuck in state STARTED:
```
curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTHENTICATION_APIKEY_ALLOWED_KEYS" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/2024-06-26_01-00-14"
```
`{"backend":"backup-s3","id":"2024-06-26_01-00-14","path":"s3://xxxxx-weaviate-backups/staging/2024-06-26_01-00-14","status":"STARTED"}`
I then stop/delete the Kubernetes CronJob (resp. Job) and restart it. The curl command that the job executes then returns a 422 error:
`curl: (22) The requested URL returned error: 422`
On the Weaviate side, the logs show this:
```
2024-06-26T07:57:17.896867926Z weaviate {"action":"try_backup","backend":"s3","backup_id":"2024-06-26_07-57-12","level":"error","msg":"backup 2024-06-26_01-00-14 already in progress","time":"2024-06-26T07:57:17Z","took":128102813}
```
After restarting Weaviate, the backup works again and is fast as well (this no longer seems to be the case with the latest 1.25.5, where it immediately gets stuck again). However, after some time we see the same behaviour, and again only a restart of Weaviate helps.
### Supporting information
Backup script
```
# Prerequisites
backup_id=$(date +%Y-%m-%d_%H-%M-%S)
KEEP_BACKUPS_COUNT="${KEEP_BACKUPS_COUNT:=10}"

# Backup
json=$(printf '{ "id": "%s" }' "$backup_id")
curl --silent --fail --show-error -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  "http://weaviate.weaviate.svc.cluster.local/v1/backups/s3" -d "$json"

state=""
printf "Waiting for backup to finish"
while [[ "$state" != "SUCCESS" ]]; do
  sleep 10
  printf "."
  state=$(curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $API_KEY" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/$backup_id" | jq -r ".status")
  if [[ "$state" == "FAILED" ]]; then
    echo "Backup failed"
    exit 1
  fi
done
printf "\n"
```
Helm Config:
```
backups:
  s3:
    enabled: true
    envconfig:
      BACKUP_S3_BUCKET: xxxxx-weaviate-backups
      AWS_REGION: eu-central-1

authentication:
  anonymous_access:
    enabled: false
  oidc:
    enabled: false

authorization:
  admin_list:
    enabled: false
    users:
      %{~ for user in admin_users ~}
      - ${user}
      %{ endfor }
    read_only_users:
      %{~ for user in readonly_users ~}
      - ${user}
      %{ endfor }

# NOTE: 524288 is the default value on Weaviate. The Elasticsearch value is 262144.
# So for now we can simply set the value to 524288 on both sides.
# Setting this here even if the default value is used, to make sure it is set and known.
initContainers:
  sysctlInitContainer:
    enabled: true
    sysctVmMaxMapCount: 524288

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: deploy/weaviate
              operator: In
              values:
                - "true"

env:
  ##########################
  # API Keys with ENV Vars #
  ##########################
  # If using ENV Vars to set up API keys, make sure to have the `authentication.apikey` block commented out
  # to avoid any future changes. ENV Vars have priority over the `authentication.apikey` config above.
  # If using `authentication.apikey`, the below ENV Vars will be used because they have priority,
  # so comment them out to avoid any future changes.

  # Enables API key authentication. If it is set to 'false', AUTHENTICATION_APIKEY_ALLOWED_KEYS
  # and AUTHENTICATION_APIKEY_USERS will not have any effect.
  AUTHENTICATION_APIKEY_ENABLED: 'true'

  # Expose metrics on port 2112 for Prometheus to scrape
  PROMETHEUS_MONITORING_ENABLED: true

  # List one or more keys, separated by commas. Each key corresponds to a specific user identity below.
  # If you want to use a Kubernetes secret for the API keys, comment out this variable and use the one in `envSecrets` below.
  # AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'jane-secret-key,ian-secret-key' (plain text)

  # List one or more user identities, separated by commas. You can have only one user for all the keys or one user per key.
  # The user(s) can be a simple name or an email, no matter whether it exists or not.
  # NOTE: Make sure to add the users to the authorization above, otherwise they will not be allowed to interact with Weaviate.
  # AUTHENTICATION_APIKEY_USERS: ''

  LOG_LEVEL: info

envSecrets:
  # Create a Kubernetes secret with the AUTHENTICATION_APIKEY_ALLOWED_KEYS key and its respective value.
  # NOTE: set from the set block in main.tf
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: weaviate

service:
  type: ClusterIP

grpcService:
  enabled: false

resources:
  requests:
    cpu: 1m
    memory: 100Mi
  limits:
    memory: ${memory}Gi

annotations:
  ad.datadoghq.com/weaviate.checks: |
    {
      "weaviate": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics",
            "weaviate_api_endpoint": "http://%%host%%:8080",
            "headers": {"Authorization": "Bearer ${api_key}"}
          }
        ]
      }
    }
```
### Server Version
1.25.5
Can you confirm this issue persists in 1.25.8?
Thanks!
swetag
July 24, 2024, 6:21am
3
Hey @DudaNogueira, thanks for your response.
I collect the UUIDs of the objects by running a conditional search operation. In this scenario the UUIDs of all the objects (around 27k) are returned, but that will not always be the case, so I cannot delete the collection and have to delete only the matching objects. After that, I run the delete operation with this code:
`collection.data.delete_many(where=Filter.by_id().contains_any(uuids_batch))`
where each batch has 5k UUIDs, and this call runs inside a for loop over all the UUID batches.
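Written out, the loop is roughly this (a sketch; `uuids` is the list returned by the search above, the collection name is a placeholder, and the printed fields are the counts the v4 client reports back):

```python
from weaviate.classes.query import Filter

collection = client.collections.get("Test")  # placeholder collection name

# uuids: the ~27k object UUIDs collected from the conditional search above.
batch_size = 5000
uuid_batches = [uuids[i:i + batch_size] for i in range(0, len(uuids), batch_size)]

for uuids_batch in uuid_batches:
    result = collection.data.delete_many(
        where=Filter.by_id().contains_any(uuids_batch)
    )
    # Per-batch counts of matched / deleted / failed objects reported by the client.
    print(result.matches, result.successful, result.failed)
```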
After around 5 seconds, the delete operation returns and indexing starts.
The issue persists in 1.25.8 as well.
What should I do to resolve this issue?