swetag
July 22, 2024, 4:19am
1
Hi,
I batch uploaded around 20k objects and backed them up to S3, which worked fine. I can see the backup JSON file and the node1 folder in S3.
Moments later I deleted all the data and batch uploaded the objects again, which also worked fine, but this time while creating the backup I am seeing a lot of these messages in my logs:
```
HTTP Request: GET http://------:8080/v1/backups/s3/{backup_id} "HTTP/1.1 200 OK"
```
When I check the Weaviate logs, there is a lot of this kind of logging:
```
"action": "backup_status", "level": "info", "msg": ""
```
Before these, I got the following error logs (in sequence):
```
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property__id/segment-1721615078510284041.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_title_searchable/segment-1721615075238438328.scratch.d: directory not empty"
"action": "lsm_compaction", "error": "write index: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_annotations_id_searchable/segment-1721615076605784103.dbcompaction.scratch.d: directory not empty"
"action": "lsm_compaction", "error": "write index: clean up previous scratch space: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_annotations_id_searchable/segment-1721615076605784103.dbcompaction.scratch.d/.nfs3f46fabddd01f71200000009: device or resource busy"
"action": "lsm_compaction", "error": "write indices: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/objects/segment-1721615326755105246.dbcompaction.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_collections/segment-1721615316392568869.scratch.d: directory not empty"
"action": "lsm_memtable_flush", "error": "flush: unlinkat /var/lib/weaviate/test/{index_name}/w8G6ci6OWwEL/lsm/property_url_searchable/segment-1721615317703263522.scratch.d: directory not empty"
```
After around 15 minutes of the same messages, my server timed out and restarted, so the backup never completed.
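For reference, the backup creation and the status polling that produces those GET requests look roughly like this with the v4 Python client (a sketch, not my exact script; the connection and backup ID are placeholders, and the import path may differ slightly between client versions):

```python
import time

import weaviate
from weaviate.backup.backup import BackupStatus, BackupStorage  # import path may vary by client version

client = weaviate.connect_to_local()  # placeholder connection

# Trigger an S3 backup of all collections (requires the backup-s3 module).
client.backup.create(
    backup_id="my-backup-id",  # placeholder ID
    backend=BackupStorage.S3,
    wait_for_completion=False,
)

# Poll the status; each call issues GET /v1/backups/s3/{backup_id},
# which is the request that shows up repeatedly in the logs above.
while True:
    status = client.backup.get_create_status(
        backup_id="my-backup-id",
        backend=BackupStorage.S3,
    )
    print(status.status)
    if status.status in (BackupStatus.SUCCESS, BackupStatus.FAILED):
        break
    time.sleep(10)

client.close()
```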
Server Setup Information
Weaviate Server Version: weaviate:1.25.6
Deployment Method: docker
Client Language and Version: python 3.10.9, weaviate-client: 4.6.5
Environment variables:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
DEFAULT_VECTORIZER_MODULE: 'none'
ENABLE_MODULES: 'backup-filesystem,backup-s3'
CLUSTER_HOSTNAME: 'node1'
QUERY_MAXIMUM_RESULTS: 1000000
DISK_USE_READONLY_PERCENTAGE: 100
DISABLE_TELEMETRY: true
DISABLE_LAZY_LOAD_SHARDS: true
Could you please guide me on how to resolve this issue? TIA.
hi @swetag !!
Welcome to our community
Nice! This looks like one of our chaos pipelines
How have you deleted those objects in step 2?
When you delete the entire collection, it’s simpler, as no object deletion calculation is needed.
If you delete the objects themselves, it will generate tombstones, and then a whole new cleanup cycle is triggered.
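For example, if your use case allows it, dropping and recreating the whole collection avoids tombstones entirely. A rough sketch with the v4 Python client (the collection name and connection are placeholders):

```python
import weaviate

client = weaviate.connect_to_local()  # adjust connection details as needed

# Deleting the whole collection removes its data without creating per-object
# tombstones; delete_many() on individual objects creates one tombstone per
# object, which is cleaned up asynchronously later.
client.collections.delete("Test")  # placeholder collection name

# Recreate the collection before re-importing (schema/vector config omitted here).
client.collections.create("Test")

client.close()
```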
This seems related to this issue:
(GitHub issue, opened 26 Jun 2024, label: bug)
### How to reproduce this bug?
Install Weaviate version 1.25.5 with the latest Helm chart on a bare-metal cluster as a single instance, then run the backup script below; the Weaviate Helm settings are also below.
### What is the expected behavior?
Backup should not be stuck in state STARTED forever.
### What is the actual behavior?
On a very small dataset (roughly 7k objects) in one of our Weaviate test instances, the backup seems to get stuck for hours. We experience this both in a single-node test instance and in our cluster test instance.
After starting the backup and noticing that it never finishes, Weaviate keeps producing this output as we check the status every 10 seconds:
```
2024-06-26T07:54:43.157558168Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:43Z","took":31544}
2024-06-26T07:54:53.185646557Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:53Z","took":25112}
2024-06-26T07:55:03.224599217Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:03Z","took":23890}
2024-06-26T07:55:13.258652773Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:13Z","took":25426}
2024-06-26T07:55:23.290172417Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:23Z","took":22835}
2024-06-26T07:55:33.332576022Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:33Z","took":25835}
```
Manually checking the backup status with curl confirms that the backup is indeed still stuck in state STARTED:
```
curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTHENTICATION_APIKEY_ALLOWED_KEYS" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/2024-06-26_01-00-14"
```
`{"backend":"backup-s3","id":"2024-06-26_01-00-14","path":"s3://xxxxx-weaviate-backups/staging/2024-06-26_01-00-14","status":"STARTED"}`
I then stop/delete the Kubernetes CronJob (resp. Job) and restart it. The curl command that the job executes then returns a 422 error:
`curl: (22) The requested URL returned error: 422`
On the Weaviate side, the logs show this:
```
2024-06-26T07:57:17.896867926Z weaviate {"action":"try_backup","backend":"s3","backup_id":"2024-06-26_07-57-12","level":"error","msg":"backup 2024-06-26_01-00-14 already in progress","time":"2024-06-26T07:57:17Z","took":128102813}
```
After restarting Weaviate, the backup works again and is fast as well (this no longer seems to be the case with the latest 1.25.5, where it immediately gets stuck again). However, after some time we see the same behaviour, and again only a restart of Weaviate helps.
### Supporting information
Backup script
```
# Prerequisites
backup_id=$(date +%Y-%m-%d_%H-%M-%S)
KEEP_BACKUPS_COUNT="${KEEP_BACKUPS_COUNT:=10}"

# Backup
json=$(printf '{ "id": "%s" }' "$backup_id")
curl --silent --fail --show-error -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  "http://weaviate.weaviate.svc.cluster.local/v1/backups/s3" -d "$json"

state=""
printf "Waiting for backup to finish"
while [[ "$state" != "SUCCESS" ]]; do
  sleep 10
  printf "."
  state=$(curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $API_KEY" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/$backup_id" | jq -r ".status")
  if [[ "$state" == "FAILED" ]]; then
    echo "Backup failed"
    exit 1
  fi
done
printf "\n"
```
Helm Config:
```
backups:
  s3:
    enabled: true
    envconfig:
      BACKUP_S3_BUCKET: xxxxx-weaviate-backups
      AWS_REGION: eu-central-1

authentication:
  anonymous_access:
    enabled: false
  oidc:
    enabled: false

authorization:
  admin_list:
    enabled: false
    users:
      %{~ for user in admin_users ~}
      - ${user}
      %{ endfor }
    read_only_users:
      %{~ for user in readonly_users ~}
      - ${user}
      %{ endfor }

# NOTE: 524288 is the default value on Weaviate. The Elasticsearch value is 262144.
# So for now we can simply set the value to 524288 on both sides.
# Setting this here even if the default value is used, to make sure it is set and known.
initContainers:
  sysctlInitContainer:
    enabled: true
    sysctVmMaxMapCount: 524288

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: deploy/weaviate
              operator: In
              values:
                - "true"

env:
  ##########################
  # API Keys with ENV Vars #
  ##########################
  # If using ENV Vars to set up API keys, make sure to have the `authentication.apikey` block commented out
  # to avoid any future changes. ENV Vars have priority over the `authentication.apikey` config above.
  # If using `authentication.apikey`, the below ENV Vars will be used because they have priority,
  # so comment them out to avoid any future changes.

  # Enables API key authentication. If it is set to 'false', AUTHENTICATION_APIKEY_ALLOWED_KEYS
  # and AUTHENTICATION_APIKEY_USERS will not have any effect.
  AUTHENTICATION_APIKEY_ENABLED: 'true'

  # Expose metrics on port 2112 for Prometheus to scrape
  PROMETHEUS_MONITORING_ENABLED: true

  # List one or more keys, separated by commas. Each key corresponds to a specific user identity below.
  # If you want to use a Kubernetes secret for the API keys, comment out this variable and use the one in `envSecrets` below.
  # AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'jane-secret-key,ian-secret-key' (plain text)

  # List one or more user identities, separated by commas. You can have only one user for all the keys or one user per key.
  # The user(s) can be a simple name or an email, no matter whether it exists or not.
  # NOTE: Make sure to add the users to the authorization above, otherwise they will not be allowed to interact with Weaviate.
  # AUTHENTICATION_APIKEY_USERS: ''

  LOG_LEVEL: info

envSecrets:
  # Create a Kubernetes secret with the AUTHENTICATION_APIKEY_ALLOWED_KEYS key and its respective value.
  # NOTE: set from the set block in main.tf
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: weaviate

service:
  type: ClusterIP

grpcService:
  enabled: false

resources:
  requests:
    cpu: 1m
    memory: 100Mi
  limits:
    memory: ${memory}Gi

annotations:
  ad.datadoghq.com/weaviate.checks: |
    {
      "weaviate": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics",
            "weaviate_api_endpoint": "http://%%host%%:8080",
            "headers": {"Authorization": "Bearer ${api_key}"}
          }
        ]
      }
    }
```
### Server Version
1.25.5
Can you confirm this issue persists in 1.25.8?
Thanks!
swetag
July 24, 2024, 6:21am
3
Hey @DudaNogueira, thanks for your response.
I collect the UUIDs of the objects by running a conditional search operation. In this scenario the UUIDs of all the objects (around 27k) are returned, but that will not always be the case, so I cannot delete the collection and have to delete only the matching objects. After that, I run the delete operation with this code:
`collection.data.delete_many(where=Filter.by_id().contains_any(uuids_batch))`
where each batch has 5k UUIDs, and this call runs inside a for loop over all the UUID batches.
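Written out, the loop is roughly this (a sketch; `uuids` is the list returned by the search above, the collection name is a placeholder, and the printed fields are the counts the v4 client reports back):

```python
from weaviate.classes.query import Filter

collection = client.collections.get("Test")  # placeholder collection name

# uuids: the ~27k object UUIDs collected from the conditional search above.
batch_size = 5000
uuid_batches = [uuids[i:i + batch_size] for i in range(0, len(uuids), batch_size)]

for uuids_batch in uuid_batches:
    result = collection.data.delete_many(
        where=Filter.by_id().contains_any(uuids_batch)
    )
    # Per-batch counts of matched / deleted / failed objects reported by the client.
    print(result.matches, result.successful, result.failed)
```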
After around 5 seconds, the delete operation returns and indexing starts.
The issue persists in 1.25.8 as well.
What should I do to resolve this issue?