Some objects not readable after batch import / flush and switch failed

Description

We are performing a batch import where in the last step we create cross-references between the objects. While doing so, we perform consistency checks that all objects that were written in the previous step can also be retrieved. In our last run, this was not always the case. We did not receive a client-side error during the batch import, but at the time of the import the server logs contained (among others) the following error:

{"action":"lsm_memtable_flush","class":"PageNode_v3","error":"flush: unlinkat /var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404.scratch.d: directory not empty","index":"pagenode_v3","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects","shard":"kUuKzkTaWVxi","time":"2024-07-08T17:25:48Z"}

The following query did not return any objects although they had been inserted before:

result = (
    weaviate_client.query.get(
        'PageNode_v3',
        ['page_id', 'node_index']
    )
    .with_where({
        "path": ["page_id"],
        "operator": "Equal",
        "valueText": page_id
    })
    .with_limit(100_000)
    .do()
)
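For illustration, the consistency check around this query can be sketched roughly as follows. This is a minimal sketch, not our exact code: `find_missing` and its parameters are hypothetical names, and `fetch_ids` stands in for whatever callable runs the query above and returns the identifiers it found (for example the `node_index` values).

```python
import time
from typing import Callable, Hashable, Iterable, Set


def find_missing(
    expected_ids: Iterable[Hashable],
    fetch_ids: Callable[[], Iterable[Hashable]],
    retries: int = 3,
    delay_s: float = 1.0,
) -> Set[Hashable]:
    """Return the expected IDs that are still not retrievable after all retries."""
    expected = set(expected_ids)
    missing = set(expected)
    for attempt in range(retries):
        # Re-run the lookup and compare against what was written earlier.
        missing = expected - set(fetch_ids())
        if not missing:
            break
        # Wait before the next attempt, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay_s)
    return missing
```

With the v3 Python client, `fetch_ids` could for instance read the values out of `result["data"]["Get"]["PageNode_v3"]`. Retrying helps distinguish a short delay from objects that stay unreadable, which is what we saw here until the restart.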

Interestingly, after a server restart (we upgraded to 1.25.7 during that restart, but I do not think that that made a difference), the objects are now retrievable. During server startup, the following messages related to this shard were printed to the log:

{"action":"lsm_segment_init","class":"PageNode_v3","index":"pagenode_v3","level":"info","msg":"discarded (partially written) LSM segment, because an active WAL for the same segment was found. A recovery from the WAL will follow.","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404.db","shard":"kUuKzkTaWVxi","time":"2024-07-09T10:49:47Z","wal_path":"segment-1720459484309792404.wal"}
{"action":"lsm_recover_from_active_wal","class":"PageNode_v3","index":"pagenode_v3","level":"warning","msg":"active write-ahead-log found. Did weaviate crash prior to this? Trying to recover...","path":"/var/lib/weaviate/pagenode_v3/kUuKzkTaWVxi/lsm/objects/segment-1720459484309792404","shard":"kUuKzkTaWVxi","time":"2024-07-09T10:49:47Z"}
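To line up the flush failure with the later WAL recovery, the log entries for one shard can be pulled out and ordered by time. A minimal sketch, assuming the server log contains one JSON object per line as in the excerpts above; `events_for_shard` is a hypothetical helper name, not part of Weaviate or its client.

```python
import json
from typing import Dict, Iterable, List


def events_for_shard(log_lines: Iterable[str], shard: str) -> List[Dict[str, str]]:
    """Collect time, level, action and msg of all JSON log lines for one shard."""
    events = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("shard") == shard:
            events.append(
                {key: entry.get(key, "") for key in ("time", "level", "action", "msg")}
            )
    # Timestamps in the same ISO 8601 UTC format sort correctly as plain strings.
    return sorted(events, key=lambda event: event["time"])
```

Called with the shard name `kUuKzkTaWVxi`, this would list the `lsm_memtable_flush` error from 2024-07-08 ahead of the `lsm_segment_init` and `lsm_recover_from_active_wal` entries from the restart on 2024-07-09.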

Obviously having to restart the server to have all objects readable is not optimal. Could you please help us understand what is happening here and if there is a chance of having this not happen in the first place?

Server Setup Information

  • Weaviate Server Version: 1.25.6
  • Deployment Method: k8s using helm
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python, 3.26.2
  • Multitenancy?: no

BTW, we have the text2vec-openai vectorizer enabled for the collection, in case that is related to the problem; I assume that Weaviate waits for the vectorization to complete before writing the final object to the file system.

This was a fresh install of Weaviate 1.25, we did not perform an upgrade from 1.24 on this cluster.

hi @andrewisplinghoff !!

How big is this cluster?

I have searched internally and have found some discussions on this very same error log. This may be a hardware limit :thinking:

Let me know about this.

Considering you only have 1 node, if you have a lot of objects, maybe it is time to think about scaling your cluster.

There’s not really so much data in the cluster, overall 1.7G (Size of PVC weaviate-data-weaviate-0).

Collection Counts:
Page_v3: 13199
PageNode_v3: 64392

Are there recommendations for when multiple nodes should be used?

Oh, that’s not a lot of objects.

Where are you running this? One thing to look at is the hardware of that server, especially the hard drive specs.

This is running on Azure, with disk space provided by a NetApp Cloud Volume (CVO) via NFS3.

With that number of objects, can you try reindexing into a new collection?

This could be an indexing error, and reindexing shouldn’t take a lot of effort.

Let me know if this is possible.

There is a migration guide here:

that can get your data from one collection to the new one.

Not quite sure how reindexing would help with the original issue. The data is available now after a server restart; the question is just why it required a server restart to recover it. So what would reindexing do?

Oh ok.

My guess is that after the restart, it recovered the missing batch objects.

I thought it was still missing, or broken.

Glad it’s solved, then.