I’m experiencing an issue with duplicate UUID handling in Weaviate when using batch import.
In my batch import code, I generate a deterministic UUID using generate_uuid5(guid) for each object. This ensures that the same GUID consistently produces the same UUID. However, despite this, the batch import doesn’t seem to detect duplicates.
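To illustrate the determinism I'm relying on, here is a minimal sketch that uses the standard library's uuid.uuid5 as a stand-in. The helper name deterministic_uuid and the DNS namespace are assumptions for illustration; the values it produces may not match what generate_uuid5 returns, but the property (same input, same UUID) is the same.

```python
import uuid


def deterministic_uuid(identifier: str) -> str:
    # Hypothetical stand-in for generate_uuid5: a name-based (version 5) UUID,
    # so the same input string always maps to the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, identifier))


first = deterministic_uuid("guid-123")
second = deterministic_uuid("guid-123")
other = deterministic_uuid("guid-456")

# Same GUID gives the same UUID; a different GUID gives a different one.
print(first == second)
print(first == other)
```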
Here’s a simplified version of my code:
with collection.batch.fixed_size(batch_size=100) as batch:
    for _, row in processed_data.iterrows():
        for guid in row["GUIDs"]:
            try:
                batch.add_object(
                    properties={
                        "GUID": guid,
                        "a": row["a"]
                    },
                    vector={
                        key + "_embeddings": embeddings_dict.get(
                            row[key], [0.0] * 1536)
                        for key in ["a"]
                    },
                    uuid=generate_uuid5(guid)
                )
                records_processed += 1
            except weaviate.exceptions.UnexpectedStatusCodeError as e:
                skipped_details.append(
                    {
                        "GUID": guid,
                        "message": "Duplicate GUID found"
                        if e.status_code == 422
                        else str(e),
                    }
                )
When I run this code, it processes records with duplicate GUIDs without raising any errors. However, when I use the insert method with the same UUID generation logic, it correctly identifies duplicates and raises a 422 Unprocessable Entity error.
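For reference, a client-side check along the following lines would flag repeated GUIDs before they ever reach the batch. This is only a sketch under assumptions: split_duplicates is a hypothetical helper, uuid.uuid5 again stands in for generate_uuid5, and it only catches duplicates within the data being imported in one run, not objects already stored in the collection.

```python
import uuid
from typing import Iterable


def split_duplicates(guids: Iterable[str]) -> tuple[list[str], list[str]]:
    # Hypothetical helper: separates first occurrences from repeats,
    # comparing the deterministic UUID derived from each GUID.
    seen: set[str] = set()
    unique: list[str] = []
    duplicates: list[str] = []
    for guid in guids:
        object_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, guid))
        if object_id in seen:
            duplicates.append(guid)
        else:
            seen.add(object_id)
            unique.append(guid)
    return unique, duplicates


unique, duplicates = split_duplicates(["g1", "g2", "g1", "g3", "g2"])
print(unique)
print(duplicates)
```

Only the GUIDs in unique would then be passed to batch.add_object, and the ones in duplicates could be recorded in skipped_details.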
Why does the batch import not detect duplicates while the insert method does? Am I missing something in the batch import process?
Any insights or suggestions would be greatly appreciated.