Python client v4 batch create reference issue

Description

Python client weaviate-client==4.4.1

I have a schema of Documents and Chunks, with a two-way link: Document->hasChunks and Chunk->ofDocument. The document I’m testing has 260 Chunks.

If I use client.batch.dynamic(), all of the Document->hasChunks references are valid, but only about 40 of the Chunk->ofDocument references are added; the rest are null. No errors are being reported.

If I use client.batch.fixed_size(100), I get more Chunks with ofDocument (usually around 200).

If I use client.batch.fixed_size(50), I get more Chunks (around 220).

I’m stumped.

Code:

with client.batch.dynamic() as batch:

    for index, element in enumerate(chunks):
        wv_chunk = Chunk.from_element(element, index)
        chunk_uuid = batch.add_object(
            collection="Chunk",
            properties=wv_chunk.get_data(),
        )
        print(f"Cross ref Doc id: {doc_uuid} hasChunks-> Chunk id: {chunk_uuid}")
        ref = batch.add_reference(
            from_collection="Document",
            from_uuid=doc_uuid,
            from_property="hasChunks",
            to=chunk_uuid
        )
        print(f"Cross ref Chunk id: {chunk_uuid} ofDocument-> Doc id: {doc_uuid}")
        ref = batch.add_reference(
            from_collection="Chunk",
            from_uuid=chunk_uuid,
            from_property="ofDocument",
            to=doc_uuid,
        )
        print('-' * 80)
  
failed_objs = client.batch.failed_objects
failed_refs = client.batch.failed_references
print(f"Failed batch objects: {failed_objs}")
print(f"Failed batch refs: {failed_refs}")
client.close()

Output:

Cross ref Doc id: 69438102-4464-4edb-a2d4-32887b5281e4 hasChunks-> Chunk id: 6e2c4288-5538-42c5-b525-0838329125f0
Cross ref Chunk id: 6e2c4288-5538-42c5-b525-0838329125f0 ofDocument-> Doc id: 69438102-4464-4edb-a2d4-32887b5281e4
--------------------------------------------------------------------------------
Cross ref Doc id: 69438102-4464-4edb-a2d4-32887b5281e4 hasChunks-> Chunk id: c3a53462-f70d-4290-b0b9-90869790cd8b
Cross ref Chunk id: c3a53462-f70d-4290-b0b9-90869790cd8b ofDocument-> Doc id: 69438102-4464-4edb-a2d4-32887b5281e4
--------------------------------------------------------------------------------
Failed batch objects: []
Failed batch refs: []

Now count the chunks with ofDocument

{
  Aggregate {
    Chunk(where: {
      operator: Equal,
      path: ["ofDocument","Document","id"],
      valueString: "69438102-4464-4edb-a2d4-32887b5281e4"
    }) {
      content {
        count
      }
    }
  }
}

{
  "data": {
    "Aggregate": {
      "Chunk": [
        {
          "content": {
            "count": 45
          }
        }
      ]
    }
  }
}

Here’s a run changing the code to client.batch.fixed_size(50):

Cross ref Doc id: 69438102-4464-4edb-a2d4-32887b5281e4 hasChunks-> Chunk id: 0aa245c0-7cba-4623-91de-3936503def2d
Cross ref Chunk id: 0aa245c0-7cba-4623-91de-3936503def2d ofDocument-> Doc id: 69438102-4464-4edb-a2d4-32887b5281e4
--------------------------------------------------------------------------------
Cross ref Doc id: 69438102-4464-4edb-a2d4-32887b5281e4 hasChunks-> Chunk id: 269d625e-ef04-46a4-99f3-bcf2e63d54b1
Cross ref Chunk id: 269d625e-ef04-46a4-99f3-bcf2e63d54b1 ofDocument-> Doc id: 69438102-4464-4edb-a2d4-32887b5281e4
--------------------------------------------------------------------------------
Failed batch objects: []
Failed batch refs: []

Now count:

{
  Aggregate {
    Chunk(where: {
      operator: Equal,
      path: ["ofDocument","Document","id"],
      valueString: "69438102-4464-4edb-a2d4-32887b5281e4"
    }) {
      content {
        count
      }
    }
  }
}

{
  "data": {
    "Aggregate": {
      "Chunk": [
        {
          "content": {
            "count": 240
          }
        }
      ]
    }
  }
}

That was a good run.

Any idea what is going on?

Server Setup Information

  • Weaviate Server Version: 1.23.7
  • Deployment Method: Weaviate cluster
  • Multi Node? Number of Running Nodes:

Hi @Brian_Money ! Welcome to our community :hugs:

If you import the same content using the Python v3 client, do you get all objects? I'm trying to isolate whether this is specific to the v4 client.

No errors are being reported.

This is on both the server and the client, right?

Thanks!

This is something I sometimes noticed while developing the v4 client, and I am 99% sure that it is a Weaviate bug (the reference batch add code is almost 100% copied from v3). However, I could not reliably reproduce it and am unsure about the cause.

We limited the ref batch size internally (regardless of user input) and that seemed to have fixed it in our testing, but it might not have been enough :frowning:

Could you test with fixed_size(50,1)? I suspect that it is caused by the concurrency, but I am not really sure.
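
For reference, a minimal sketch of that test, assuming the two positional arguments map to `batch_size` and `concurrent_requests` in the v4 client's `fixed_size` call. The function name `import_with_single_request` and the `chunk_properties` list are made up for illustration; the batch calls mirror the code in the original post.

```python
def import_with_single_request(client, doc_uuid, chunk_properties):
    """Interleaved import as in the original post, but with one request at a time."""
    chunk_uuids = []
    # Assumption: fixed_size(50, 1) means batch_size=50, concurrent_requests=1.
    with client.batch.fixed_size(batch_size=50, concurrent_requests=1) as batch:
        for properties in chunk_properties:
            chunk_uuid = batch.add_object(collection="Chunk", properties=properties)
            # Document -> Chunk
            batch.add_reference(
                from_collection="Document",
                from_uuid=doc_uuid,
                from_property="hasChunks",
                to=chunk_uuid,
            )
            # Chunk -> Document (the direction that goes missing)
            batch.add_reference(
                from_collection="Chunk",
                from_uuid=chunk_uuid,
                from_property="ofDocument",
                to=doc_uuid,
            )
            chunk_uuids.append(chunk_uuid)
    return chunk_uuids
```

If the ofDocument count comes out complete with this setting, that would point at concurrency; if not, the cause is probably elsewhere.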

Ok, I dug a bit more and I think I found the issue:

We are sending references too early, before Weaviate has processed the from-object. In this case, the reference is silently discarded. Because Weaviate does not return an error, we did not notice the wrong behaviour.

Not 100% sure how to solve this, but I’ll keep you up to date.

Until then, try adding all objects first and then all references.
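
A rough sketch of that workaround, using only the batch calls from the original post. The function name `import_chunks_then_refs` and the `chunk_properties` argument are hypothetical, and this assumes that leaving the first `with` block flushes all queued objects before the second block starts.

```python
def import_chunks_then_refs(client, doc_uuid, chunk_properties):
    """Add every Chunk object first, then add the references in a second batch."""
    chunk_uuids = []

    # Phase 1: objects only. Assumption: exiting the context manager
    # sends everything that is still queued.
    with client.batch.dynamic() as batch:
        for properties in chunk_properties:
            chunk_uuids.append(
                batch.add_object(collection="Chunk", properties=properties)
            )

    # Phase 2: references in both directions, once the chunks exist.
    with client.batch.dynamic() as batch:
        for chunk_uuid in chunk_uuids:
            batch.add_reference(
                from_collection="Document",
                from_uuid=doc_uuid,
                from_property="hasChunks",
                to=chunk_uuid,
            )
            batch.add_reference(
                from_collection="Chunk",
                from_uuid=chunk_uuid,
                from_property="ofDocument",
                to=doc_uuid,
            )
    return chunk_uuids
```

It would be called as `import_chunks_then_refs(client, doc_uuid, [Chunk.from_element(e, i).get_data() for i, e in enumerate(chunks)])`, followed by the same `client.batch.failed_references` check and Aggregate query as before to confirm the count.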


@Brian_Money Can you please try again with 4.4.3?