Some objects not readable after batch import / flush and switch failed

andrewisplinghoff · July 9, 2024, 7:21pm

Description

We are performing a batch import where, in the last step, we create cross-references between the objects. While doing so, we run consistency checks that all objects written in the previous step can also be retrieved. In our last run, this was not always the case. We did not receive a client-side error during the batch import, but at the time of the import the server logs contained (among others) the following error:

{"action":"lsm_memtable_flush","class":"PageNode_v3","error":"flush: unlinkat /var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404.scratch.d: directory not empty","index":"pagenode_v3","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects","shard":"kUuKzkTaWVxi","time":"2024-07-08T17:25:48Z"}

The following query did not return any objects although they had been inserted before:

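# Fetch all PageNode_v3 objects for one page (v3 Python client).
# The outer parentheses are needed so the chained calls can span several lines.
result = (
    weaviate_client.query.get(
        'PageNode_v3',
        ['page_id', 'node_index']
    )
    .with_where({
        "path": ["page_id"],
        "operator": "Equal",
        "valueText": page_id
    })
    .with_limit(100_000)
    .do()
)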
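
For illustration, a consistency check along these lines could look roughly like the sketch below. It is not our actual import code: `find_missing_nodes` and `make_weaviate_fetcher` are hypothetical helper names, `node_index` is assumed to be an integer, and the response shape (`data` / `Get` / class name) is the one the v3 Python client returns for `query.get`.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Node = Tuple[str, int]  # (page_id, node_index); node_index assumed to be an int


def find_missing_nodes(
    expected: Iterable[Node],
    fetch_page: Callable[[str], List[Dict]],
) -> Set[Node]:
    """Return the nodes that were written but are not returned by a read."""
    expected_set = set(expected)
    found: Set[Node] = set()
    # Query once per distinct page_id, mirroring the where-filter above.
    for page_id in sorted({pid for pid, _ in expected_set}):
        for obj in fetch_page(page_id):
            found.add((obj["page_id"], obj["node_index"]))
    return expected_set - found


def make_weaviate_fetcher(weaviate_client, class_name: str = "PageNode_v3"):
    """Hypothetical adapter: wraps the v3 client query as a fetch_page callable."""

    def fetch_page(page_id: str) -> List[Dict]:
        response = (
            weaviate_client.query.get(class_name, ["page_id", "node_index"])
            .with_where({
                "path": ["page_id"],
                "operator": "Equal",
                "valueText": page_id,
            })
            .with_limit(100_000)
            .do()
        )
        # Surface GraphQL errors instead of treating them as "no objects".
        if response.get("errors"):
            raise RuntimeError(f"query failed: {response['errors']}")
        data = response.get("data") or {}
        return (data.get("Get") or {}).get(class_name) or []

    return fetch_page
```

Keeping the comparison separate from the client call makes it possible to distinguish "the query failed" from "the query succeeded but returned nothing", which is the case we ran into.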

Interestingly, after a server restart (we upgraded to 1.25.7 during that restart, but I do not think that that made a difference), the objects are now retrievable. During server startup, the following messages related to this shard were printed to the log:

{"action":"lsm_segment_init","class":"PageNode_v3","index":"pagenode_v3","level":"info","msg":"discarded (partially written) LSM segment, because an active WAL for the same segment was found. A recovery from the WAL will follow.","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404.db","shard":"kUuKzkTaWVxi","time":"2024-07-09T10:49:47Z","wal_path":"segment-1720459484309792404.wal"}
{"action":"lsm_recover_from_active_wal","class":"PageNode_v3","index":"pagenode_v3","level":"warning","msg":"active write-ahead-log found. Did weaviate crash prior to this? Trying to recover...","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404","shard":"kUuKzkTaWVxi","time":"2024-07-09T10:49:47Z"}
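
Since the server writes one JSON object per log line, entries like the ones above can be filtered per shard with a few lines of Python. This is only a sketch: `lsm_problems` is a hypothetical helper, and it assumes the fields seen in the messages above (`action`, `level`, `shard`) are present on the lines of interest.

```python
import json
from typing import Dict, Iterable, List


def lsm_problems(lines: Iterable[str], shard: str) -> List[Dict]:
    """Collect error/warning log entries for one shard whose action starts with 'lsm_'."""
    problems: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("shard") != shard:
            continue
        is_problem = entry.get("level") in ("error", "warning")
        is_lsm = str(entry.get("action", "")).startswith("lsm_")
        if is_problem and is_lsm:
            problems.append(entry)
    return problems
```

For example, `lsm_problems(open("weaviate.log"), "kUuKzkTaWVxi")` would return the `lsm_memtable_flush` error and the `lsm_recover_from_active_wal` warning but not the `info`-level message (the file name here is just a placeholder).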

Obviously, having to restart the server to make all objects readable is not optimal. Could you please help us understand what is happening here, and whether there is a way to prevent it from happening in the first place?

Server Setup Information

Weaviate Server Version: 1.25.6
Deployment Method: k8s using helm
Multi Node? Number of Running Nodes: 1
Client Language and Version: Python, 3.26.2
Multitenancy?: no

andrewisplinghoff · July 10, 2024, 8:56am

BTW, we have the text2vec-openai vectorizer enabled for the collection, in case that is related to the problem. I assume that Weaviate waits for the vectorization to complete before writing the final object to the file system.

This was a fresh install of Weaviate 1.25; we did not perform an upgrade from 1.24 on this cluster.

DudaNogueira · July 11, 2024, 8:11pm

Hi @andrewisplinghoff!

How big is this cluster?

I have searched internally and found some discussions on this very same error log. This may be a hardware limit.

Let me know about this.

Considering you only have 1 node, if you have a lot of objects, maybe it is time to think about scaling your cluster.

andrewisplinghoff · July 12, 2024, 9:34am

There’s not really that much data in the cluster: 1.7G overall (size of the PVC weaviate-data-weaviate-0).

Collection Counts:
Page_v3: 13199
PageNode_v3: 64392
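
Counts like these can be read with the v3 Python client's aggregate query. A minimal sketch, assuming the usual `data` / `Aggregate` response shape; `count_from_response` and `object_count` are hypothetical helper names:

```python
from typing import Dict


def count_from_response(response: Dict, class_name: str) -> int:
    """Extract meta.count from an Aggregate response; raise if the query reported errors."""
    if response.get("errors"):
        raise RuntimeError(f"aggregate failed: {response['errors']}")
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


def object_count(weaviate_client, class_name: str) -> int:
    """Ask the server for the number of objects in one collection."""
    response = weaviate_client.query.aggregate(class_name).with_meta_count().do()
    return count_from_response(response, class_name)
```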

Are there recommendations for when multiple nodes should be used?

DudaNogueira · July 15, 2024, 6:48pm

Oh, that’s not a lot of objects.

Where are you running this? One thing to look at is the hardware of that server, especially the hard drive specs.

andrewisplinghoff · July 16, 2024, 3:22pm

This is running on Azure, with disk space provided by a NetApp Cloud Volume (CVO) via NFS3.

DudaNogueira · July 16, 2024, 4:25pm

With that amount of objects, can you try reindexing on a new collection?

This could be an indexing error, and reindexing shouldn’t take a lot of effort.

Let me know if this is possible.

There is a migration guide that can get your data from one collection to the new one.

andrewisplinghoff · July 16, 2024, 4:44pm

I'm not quite sure how reindexing would help with the original issue. The data is available now after a server restart; the question is just why it required a server restart to recover it. So what would reindexing do?

DudaNogueira · July 17, 2024, 3:43pm

Oh ok.

My guess is that after the restart, it recovered the missing batch objects.

I thought it was still missing, or broken.

Glad it’s solved, then.
