No error during indexing yet aggregate is off

Hi,

I’m trying to insert roughly 147k passages into my local Weaviate 1.19.6 instance.
I’m using the following batching code:

with client.batch(
    batch_size=5,
    timeout_retries=20,
    connection_error_retries=20,
    weaviate_error_retries=WeaviateErrorRetryConf(number_retries=3),
    dynamic=True,
) as batch:
    # Batch import all passages
    for i, d in df.iterrows():
        if i % 1000 == 0:
            print(f"importing passage: {i}")
        
        properties = {
            "title": d["title"],
            "lemmatized_title": lemmatize(d["title"], lemma_dict),
            "text": d["text"],
            "lemmatized_text": lemmatize(d["text"], lemma_dict),
        }

        batch.add_data_object(
            properties,
            "Passage",
            uuid=create_uuid_from_string(d["id"])
        )

I’m getting a few retries for timeouts, but they seem to succeed on the second attempt. There are also too few of them to account for the missing docs.
I’m not seeing any other errors during indexing, yet the count I get at the end is roughly 1,400 docs short.

Any ideas on what might be causing this?
Are there silent errors that I’d be missing given my code?

Any help would be much appreciated.

Hi @lnatspacy

That sounds a bit frustrating. Let me see if I can help.

Individual object-level errors during batch insertion will not show up as “errors” as such, as long as each request itself was fulfilled. They are only reported inside the batch response.

I wonder if that’s what is going on here.

Since you are using the Python client, you could make use of the batch callback to check for this during insertion (see “Python | Weaviate - vector database” in the docs).

I see that your uuid is deterministic, so if you don’t want to re-import your data you could use that to loop through your data and check for missing objects. Then you could try importing those.
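
A minimal sketch of that check. The helper name `find_missing_ids` and its `to_uuid` and `exists` parameters are hypothetical: `to_uuid` stands in for whatever function built the UUIDs at import time, and `exists` is any callable that reports whether a UUID is present. With the v3 Python client I would expect `client.data_object.exists(uuid, class_name="Passage")` to fit that role, but treat that as an assumption and check it against your client version.

```python
from typing import Callable, Hashable, Iterable, List


def find_missing_ids(
    ids: Iterable[Hashable],
    to_uuid: Callable[[Hashable], str],
    exists: Callable[[str], bool],
) -> List[Hashable]:
    """Return the source ids whose derived UUID is not found in the store."""
    missing = []
    for source_id in ids:
        # Re-derive the same deterministic UUID that was used at import time
        if not exists(to_uuid(source_id)):
            missing.append(source_id)
    return missing


# Hypothetical usage against a v3 client (assumption, not executed here):
# missing = find_missing_ids(
#     df["id"],
#     create_uuid_from_string,
#     lambda u: client.data_object.exists(u, class_name="Passage"),
# )
```

If this comes back empty while the counts still differ, nothing failed to import and the gap has to come from somewhere else, for example several rows mapping to the same UUID.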

Cheers!
JP

Thank you @jphwang for responding so quickly.
I took your advice, and adapted my code to look like this:

def check_batch_result(results):    
    if results is not None:
        for result in results:
            if "result" in result and "errors" in result["result"]:
                if "error" in result["result"]["errors"]:
                    print(result["result"])

# Configure a batch process
with client.batch(
    batch_size=5, 
    timeout_retries=20, 
    connection_error_retries=20, 
    weaviate_error_retries=WeaviateErrorRetryConf(number_retries=3), 
    dynamic=True,
    callback=check_batch_result
) as batch:
    # Batch import all passages
    for i, d in df.iterrows():
        if i % 1000 == 0:
            print(f"importing passage: {i}")
        
        properties = {
            "title": d["title"],
            "lemmatized_title": lemmatize(d["title"], lemma_dict),
            "text": d["text"],
            "lemmatized_text": lemmatize(d["text"], lemma_dict),
        }

        batch.add_data_object(
            properties,
            "Passage",
            uuid=create_uuid_from_string(d["id"])
        )

No errors were printed during import, and there is again a difference between the number of rows in my dataframe and what Weaviate is reporting.
I made sure that the callback was actually working by testing it with a smaller dataset first, printing even on success.

Any other ideas?

If that’s the case, is it possible that you have duplicate IDs, or that the create_uuid_from_string function is somehow generating duplicates?

What happens if you run:

len(df.drop_duplicates(subset=['id']))
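
The reason this matters: the import derives each UUID from `id`, and a batch write whose UUID already exists overwrites the earlier object instead of adding a new one, so duplicate ids shrink the final count without raising any error. Here is a rough sketch of a slightly more detailed check, using only the standard library. `duplicate_id_report` is a hypothetical helper; it accepts any iterable, so something like `df["id"]` should work as input.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def duplicate_id_report(
    ids: Iterable[Hashable],
) -> Tuple[int, int, Dict[Hashable, int]]:
    """Return (total rows, unique ids, ids that occur more than once)."""
    counts = Counter(ids)
    total = sum(counts.values())
    repeated = {key: n for key, n in counts.items() if n > 1}
    return total, len(counts), repeated


# Toy data for illustration only
total, unique, repeated = duplicate_id_report(["a", "b", "a", "c", "b", "a"])
# total - unique is the number of rows that would overwrite an earlier object
print(total - unique, repeated)
```

If `total - unique` matches the gap between the dataframe length and the aggregate count, duplicate ids explain the difference.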

Damn, of course! Thank you for pointing me in the right direction!

No worries! Happy to help and even happier we got it resolved together 🙂