Error when Data Stored in AWS EFS

saurbhhsharrma · July 1, 2024, 10:53am

I set up Weaviate on AWS ECS with auto-scaling. After a load spike, the Weaviate instance restarted, and now I get the error below when accessing the data.

{
  "data": {
    "Get": {
      "WeaviateDemo3": null
    }
  },
  "errors": [
    {
      "locations": [
        {
          "column": 6,
          "line": 1
        }
      ],
      "message": "explorer: list class: search: object search at index weaviatedemo3: local shard object search weaviatedemo3_PXksOv6syq7H: Unable to load shard PXksOv6syq7H: init shard \"weaviatedemo3_PXksOv6syq7H\": init shard \"weaviatedemo3_PXksOv6syq7H\": shard db: create objects bucket: init disk segments: init segment segment-1719829104648996364.db: mmap file: invalid argument",
      "path": [
        "Get",
        "WeaviateDemo3"
      ]
    }
  ]
}

@DudaNogueira @jphwang

DudaNogueira · July 1, 2024, 9:02pm

Hi!

Do you have any logs from the server side?

Note that, for Weaviate, as long as there is writable disk space, it should work properly.

saurbhhsharrma · July 2, 2024, 12:25am

I found the following in the logs. Please let me know if this helps.

{"action":"lsm_memtable_flush","class":"WeaviateDemo3","error":"flush: unlinkat /var/lib/weaviate/weaviatedemo3/PXksOv6syq7H/lsm/objects/segment-1719829104648996364.scratch.d: directory not empty","index":"weaviatedemo3","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/weaviatedemo3/PXksOv6syq7H/lsm/objects","shard":"PXksOv6syq7H","time":"2024-07-01T10:18:50Z"}

@DudaNogueira

DudaNogueira · July 3, 2024, 6:43pm

It seems the error was raised because this folder was not empty.

Please, when opening a technical support thread, select the Support category and answer the following:

Weaviate Server Version:
Deployment Method:
Multi Node? Number of Running Nodes:
Client Language and Version:
Multitenancy?:

Mainly: what version are you running?

DudaNogueira · July 3, 2024, 6:44pm

For reference, this error traces back to:

github.com

weaviate/weaviate/blob/6cf3e16edf797e34b4f55ca1fa38943627dd4e5a/adapters/repos/db/lsmkv/bucket.go#L990


      
	}

	b.flushLock.RUnlock()
	if shouldSwitch {
		b.haltedFlushTimer.Reset()
		cycleLength := b.active.ActiveDuration()
		if err := b.FlushAndSwitch(); err != nil {
			b.logger.WithField("action", "lsm_memtable_flush").
				WithField("path", b.dir).
				WithError(err).
				Errorf("flush and switch failed")
		}

		if b.memtableResizer != nil {
			next, ok := b.memtableResizer.NextTarget(int(b.memtableThreshold), cycleLength)
			if ok {
				b.memtableThreshold = uint64(next)
			}
		}
		return true
	}

saurbhhsharrma · July 8, 2024, 7:53am

Thanks for your response.

andrewisplinghoff · July 23, 2024, 1:10pm

I did some further research on this as we encountered the same problem (Some objects not readable after batch import / flush and switch failed - Support - Weaviate Community Forum).

The only code path I was able to find that could lead to this exact error output (other calls to os.Remove() / os.RemoveAll() should produce more output in the log) is:
weaviate/adapters/repos/db/lsmkv/bucket.go at main · weaviate/weaviate (github.com)
weaviate/adapters/repos/db/lsmkv/memtable_flush.go at main · weaviate/weaviate (github.com)
weaviate/adapters/repos/db/lsmkv/segmentindex/indexes.go at main · weaviate/weaviate (github.com)

So in the end os.RemoveAll() is called on the scratch directory, and that call fails with "directory not empty".

Searching for when that can happen, I found
os: "os.RemoveAll" sometimes returns error "remove files: directory not empty" · Issue #23452 · golang/go (github.com), which raises the question: are there any goroutines potentially still accessing the scratch directory while it is being removed?

Hope this helps a bit in figuring out the root cause of this.

DudaNogueira · July 24, 2024, 8:56pm

Hi @andrewisplinghoff! Thanks for this.

I have escalated this issue with our core developers.

Hope to bring more info soon.

Thanks!

saurbhhsharrma · July 31, 2024, 7:56am

Thanks @andrewisplinghoff for looking into it.

@DudaNogueira Whenever you get an update from the developers, please let us know, as this is critical for us.
Thanks in advance.

DudaNogueira · July 31, 2024, 2:46pm

Hi! I have asked our team to concentrate the discussion on this issue, as it seems related:

github.com/weaviate/weaviate

Backup gets stuck and Weaviate as well

opened 11:38AM - 26 Jun 24 UTC

phlegx

bug

### How to reproduce this bug?

Install Weaviate in version 1.25.5 with latest… Helm chart on a bare metal cluster as a single instance and run the backup script below and the weaviate helm settings also see below.

### What is the expected behavior?

Backup should not be stuck in state STARTED forever,.

### What is the actual behavior?

On our very small data (roughly 7k objects) in one of our Weaviate test instances, the Backup seems to get stuck for hours. We experience this in a single test instance, as well as also in our cluster test instance. After starting the backup and noticing that it is never finishing, in Weaviate we get this continuous output, as we check every 10 seconds for the status

```
2024-06-26T07:54:43.157558168Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:43Z","took":31544}
2024-06-26T07:54:53.185646557Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:54:53Z","took":25112}
2024-06-26T07:55:03.224599217Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:03Z","took":23890}
2024-06-26T07:55:13.258652773Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:13Z","took":25426}
2024-06-26T07:55:23.290172417Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:23Z","took":22835}
2024-06-26T07:55:33.332576022Z weaviate {"action":"backup_status","backend":"backup-s3","backup_id":"2024-06-26_01-00-14","level":"info","msg":"","time":"2024-06-26T07:55:33Z","took":25835}
```

Manually checking with curl what the status of the backup is indeed still shows us the Backup is stuck in state STARTED:

```
curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTHENTICATION_APIKEY_ALLOWED_KEYS " "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/2024-06-26_01-00-14"
```

`{"backend":"backup-s3","id":"2024-06-26_01-00-14","path":"s3://xxxxx-weaviate-backups/staging/2024-06-26_01-00-14","status":"STARTED"}`

I then stop/delete the kubernetes cronjob resp. job and restart it. The curl command that executes the job then gives back 422 error:

`curl: (22) The requested URL returned error: 422`

On weaviate side in the logs we get this info:

```
2024-06-26T07:57:17.896867926Z weaviate {"action":"try_backup","backend":"s3","backup_id":"2024-06-26_07-57-12","level":"error","msg":"backup 2024-06-26_01-00-14 already in progress","time":"2024-06-26T07:57:17Z","took":128102813}
```

After restarting Weaviate the backup seems to be again working and it is fast as well (SEEMS NOT TO BE THE CASE IN LATEST 1.25.5, there it gets stuck immediately again). However after some time we see the same behaviour. Then again only a restart of weaviate is helping.

### Supporting information

Backup script

```
# Prerequisites
backup_id=$(date +%Y-%m-%d_%H-%M-%S)
KEEP_BACKUPS_COUNT="${KEEP_BACKUPS_COUNT:=10}"

# Backup
json=$(printf '{ "id": "%s" }' "$backup_id")
curl --silent --fail --show-error -X POST \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  "http://weaviate.weaviate.svc.cluster.local/v1/backups/s3" -d "$json"

state=""
printf "Waiting for backup to finish"
while [[ "$state" != "SUCCESS" ]]; do
  sleep 10
  printf "."
  state=$(curl --silent --fail --show-error -H 'Content-Type: application/json' -H "Authorization: Bearer $API_KEY" "http://weaviate.weaviate.svc.cluster.local/v1/backups/backup-s3/$backup_id" | jq -r ".status")
  if [[ "$state" == "FAILED" ]]; then
    echo "Backup failed"
    exit 1
  fi
done
printf "\n"
```

Helm Config:

```
backups:
  s3:
    enabled: true
    envconfig:
      BACKUP_S3_BUCKET: xxxxx-weaviate-backups
      AWS_REGION: eu-central-1

authentication:
  anonymous_access:
    enabled: false
  oidc:
    enabled: false

authorization:
  admin_list:
    enabled: false
    users:
    %{~ for user in admin_users ~}
    - ${user}
    %{ endfor }
    read_only_users:
    %{~ for user in readonly_users ~}
    - ${user}
    %{ endfor }

# NOTE: 524288 is default value on Weaviate. Elasticsearch value is 262144
# So for now we can simply set the value to 524288 on both sides.
# Setting this here even if default value is used to make sure it is and known.
initContainers:
  sysctlInitContainer:
    enabled: true
    sysctVmMaxMapCount: 524288

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: deploy/weaviate
          operator: In
          values:
          - "true"

env:
  ##########################
  # API Keys with ENV Vars #
  ##########################
  # If using ENV Vars to set up API Keys make sure to have `authentication.apikey` block commented out
  # to avoid any future changes. ENV Vars has priority over the config above `authentication.apikey`.
  # If using `authentication.apikey `the below ENV Vars will be used because they have priority,
  # so comment them out to avoid any future changes.
  # Enables API key authentication. If it is set to 'false' the AUTHENTICATION_APIKEY_ALLOWED_KEYS
  # and AUTHENTICATION_APIKEY_USERS will not have any effect.
  AUTHENTICATION_APIKEY_ENABLED: 'true'

  # Expose metrics on port 2112 for Prometheus to scrape
  PROMETHEUS_MONITORING_ENABLED: true

  # List one or more keys, separated by commas. Each key corresponds to a specific user identity below.
  # If you want to use a kubernetes secret for the API Keys comment out this Variable and use the one in `envSecrets` below
  # AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'jane-secret-key,ian-secret-key' (plain text)

  # List one or more user identities, separated by commas. You can have only one User for all the keys or one user per key.
  # The User/s can be a simple name or an email, no matter if it exists or not.
  # NOTE: Make sure to add the users to the authorization above overwise they will not be allowed to interact with Weaviate.
  # AUTHENTICATION_APIKEY_USERS: ''
  LOG_LEVEL: info

envSecrets:
  # create a Kubernetes secret with AUTHENTICATION_APIKEY_ALLOWED_KEYS key and its respective value
  # NOTE: set from set block in main.tf
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: weaviate

service:
  type: ClusterIP
grpcService:
  enabled: false

resources:
  requests:
    cpu: 1m
    memory: 100Mi
  limits:
    memory: ${memory}Gi

annotations:
  ad.datadoghq.com/weaviate.checks: |
    {
      "weaviate": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics",
            "weaviate_api_endpoint": "http://%%host%%:8080",
            "headers": {"Authorization": "Bearer ${api_key}"}
          }
        ]
      }
    }
```

### Server Version

1.25.5

### Code of Conduct

- [X] I have read and agree to the Weaviate's [Contributor Guide](https://weaviate.io/developers/contributor-guide) and [Code of Conduct](https://weaviate.io/service/code-of-conduct)

saurbhhsharrma · September 11, 2024, 8:15am

Hi @DudaNogueira
Could you please help us with this? We are facing the same issue again, and it is blocking our development.

DudaNogueira · September 11, 2024, 12:36pm

Hi @saurbhhsharrma!

Are all the backups failing? Can you reproduce this scenario on a fresh, clean install?

We will release a feature to cancel backups in 1.27 (it may also be backported to previous versions).

This can help in this scenario while we identify the root cause of those issues.

Let me know if this helps!
