### How to reproduce this bug?
Install Weaviate 1.25.5 with the latest… Helm chart on a bare-metal cluster as a single instance, using the Helm settings below, and run the backup script below.
### What is the expected behavior?
The backup should not stay stuck in state STARTED forever.
### What is the actual behavior?
On very small data (roughly 7k objects) in one of our Weaviate test instances, the backup appears to get stuck for hours. We see this in a single-instance test setup as well as in our cluster test instance.
After starting the backup and noticing that it never finishes, we see the following continuous output in the Weaviate logs, since we poll the status every 10 seconds:
```
2024-06-26T07:54:43.157558168Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:43Z","took":31544}
2024-06-26T07:54:53.185646557Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:53Z","took":25112}
2024-06-26T07:55:03.224599217Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:03Z","took":23890}
2024-06-26T07:55:13.258652773Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:13Z","took":25426}
2024-06-26T07:55:23.290172417Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:23Z","took":22835}
2024-06-26T07:55:33.332576022Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:33Z","took":25835}
```
Checking the backup status manually with curl confirms that the backup is still stuck in state STARTED:
```
curl --silent --fail --show-error \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $AUTHENTICATION_APIKEY_ALLOWED_KEYS" \
  "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/2024-06-26_01-00-14"
```
`{"backend":"backup-s3","id":"2024-06-26_01-00-14","path":"s3://xxxxx-weaviate-backups/staging/2024-06-26_01-00-14","status":"STARTED"}`
I then stop and delete the Kubernetes CronJob (or its Job) and restart it. The curl command that the job runs then returns a 422 error:
`curl: (22) The requested URL returned error: 422`
On the Weaviate side, the logs show:
```
2024-06-26T07:57:17.896867926Z weaviate {"action":"try_backup","backend":"s3","backup_id":"2024-06-26_07-57-12","level":"error","msg":"backup 2024-06-26_01-00-14 already in progress","time":"2024-06-26T07:57:17Z","took":128102813}
```
After restarting Weaviate, backups work again and are fast (THIS SEEMS NOT TO BE THE CASE IN THE LATEST 1.25.5, where it gets stuck again immediately). However, after some time we see the same behaviour, and again only a restart of Weaviate helps.
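Because the job uses `curl --fail`, every HTTP error collapses into exit code 22, so the 422 above cannot be told apart from other failures inside the script. The following is a minimal sketch of how the job could tell the cases apart. The function name `handle_create_status` and the return codes are hypothetical, and treating 422 as "another backup is still in progress" is an assumption based only on the log line in this report.

```shell
# Map the HTTP status of the backup-create request to a decision.
# Assumption: 422 means a previous backup is still in progress, as the
# server log in this report suggests; other causes of 422 are possible.
handle_create_status() {
  local http_code="$1"
  case "$http_code" in
    2[0-9][0-9]) echo "started";  return 0 ;;
    422)         echo "rejected"; return 3 ;;
    *)           echo "error";    return 1 ;;
  esac
}

# Hypothetical usage (not executed here): drop --fail and capture the
# status code and response body separately.
# http_code=$(curl --silent --show-error \
#   --output /tmp/backup_response.json --write-out '%{http_code}' \
#   -X POST -H 'Content-Type: application/json' \
#   -H "Authorization: Bearer $API_KEY" \
#   "http://weaviate.weaviate.svc.cluster.local/v1/backups/s3" -d "$json")
# handle_create_status "$http_code" || cat /tmp/backup_response.json
```

With this, a rejected request can be logged together with the response body instead of only `curl: (22)`.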
### Supporting information
Backup script:
```
# Prerequisites
backup_id=$(date +%Y-%m-%d_%H-%M-%S)
KEEP_BACKUPS_COUNT="${KEEP_BACKUPS_COUNT:=10}"
# Backup
json=$(printf '{ "id": "%s" }' "$backup_id")
curl --silent --fail --show-error -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  "http://weaviate.weaviate.svc.cluster.local/v1/backups/s3" -d "$json"
state=""
printf "Waiting for backup to finish"
while [[ "$state" != "SUCCESS" ]]; do
  sleep 10
  printf "."
  state=$(curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $API_KEY" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/$backup_id" | jq -r ".status")
  if [[ "$state" == "FAILED" ]]; then
    echo "Backup failed"
    exit 1
  fi
done
printf "\n"
```
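Note that the loop in the script above only exits on SUCCESS or FAILED, so a backup that stays in STARTED keeps the job running indefinitely. A minimal sketch of a bounded variant follows. The function name `wait_for_backup`, its arguments, and the return codes are hypothetical and not part of the original script; the status command is passed in, so any command that prints the status can be used.

```shell
# Poll a status command until it prints SUCCESS or FAILED, or until
# max_attempts checks have been made.
# Usage: wait_for_backup <max_attempts> <sleep_seconds> <command...>
# The command must print the backup status (e.g. STARTED) on stdout.
wait_for_backup() {
  local max_attempts="$1" interval="$2" attempt=1 state=""
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing status command is treated as an unknown state and retried.
    state=$("$@") || state="UNKNOWN"
    case "$state" in
      SUCCESS) echo "Backup finished (check $attempt)"; return 0 ;;
      FAILED)  echo "Backup failed" >&2; return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Backup still in state '$state' after $max_attempts checks, giving up" >&2
  return 2
}
```

For example, wrapping the `curl ... | jq -r ".status"` pipeline from the script in a function `fetch_status` and calling `wait_for_backup 60 10 fetch_status` would give up after roughly ten minutes instead of waiting forever.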
Helm Config:
```
backups:
  s3:
    enabled: true
    envconfig:
      BACKUP_S3_BUCKET: xxxxx-weaviate-backups
      AWS_REGION: eu-central-1
authentication:
  anonymous_access:
    enabled: false
  oidc:
    enabled: false
authorization:
  admin_list:
    enabled: false
    users:
    %{~ for user in admin_users ~}
    - ${user}
    %{ endfor }
    read_only_users:
    %{~ for user in readonly_users ~}
    - ${user}
    %{ endfor }
# NOTE: 524288 is default value on Weaviate. Elasticsearch value is 262144
# So for now we can simply set the value to 524288 on both sides.
# Setting this here even if default value is used to make sure it is and known.
initContainers:
  sysctlInitContainer:
    enabled: true
    sysctVmMaxMapCount: 524288
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: deploy/weaviate
          operator: In
          values:
          - "true"
env:
  ##########################
  # API Keys with ENV Vars #
  ##########################
  # If using ENV Vars to set up API Keys make sure to have `authentication.apikey` block commented out
  # to avoid any future changes. ENV Vars has priority over the config above `authentication.apikey`.
  # If using `authentication.apikey `the below ENV Vars will be used because they have priority,
  # so comment them out to avoid any future changes.
  # Enables API key authentication. If it is set to 'false' the AUTHENTICATION_APIKEY_ALLOWED_KEYS
  # and AUTHENTICATION_APIKEY_USERS will not have any effect.
  AUTHENTICATION_APIKEY_ENABLED: 'true'
  # Expose metrics on port 2112 for Prometheus to scrape
  PROMETHEUS_MONITORING_ENABLED: true
  # List one or more keys, separated by commas. Each key corresponds to a specific user identity below.
  # If you want to use a kubernetes secret for the API Keys comment out this Variable and use the one in `envSecrets` below
  # AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'jane-secret-key,ian-secret-key' (plain text)
  # List one or more user identities, separated by commas. You can have only one User for all the keys or one user per key.
  # The User/s can be a simple name or an email, no matter if it exists or not.
  # NOTE: Make sure to add the users to the authorization above overwise they will not be allowed to interact with Weaviate.
  # AUTHENTICATION_APIKEY_USERS: ''
  LOG_LEVEL: info
envSecrets:
  # create a Kubernetes secret with AUTHENTICATION_APIKEY_ALLOWED_KEYS key and its respective value
  # NOTE: set from set block in main.tf
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: weaviate
service:
  type: ClusterIP
grpcService:
  enabled: false
resources:
  requests:
    cpu: 1m
    memory: 100Mi
  limits:
    memory: ${memory}Gi
annotations:
  ad.datadoghq.com/weaviate.checks: |
    {
      "weaviate": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics",
            "weaviate_api_endpoint": "http://%%host%%:8080",
            "headers": {"Authorization": "Bearer ${api_key}"}
          }
        ]
      }
    }
```
### Server Version
1.25.5
### Code of Conduct
- [X] I have read and agree to the Weaviate's [Contributor Guide](https://weaviate.io/developers/contributor-guide) and [Code of Conduct](https://weaviate.io/service/code-of-conduct)