No error during indexing yet aggregate is off

lnatspacy · July 18, 2023, 5:08am

Hi,

I’m trying to insert roughly 147k passages into my local weaviate 1.19.6 instance.
I’m using the following batching code:

with client.batch(
    batch_size=5,
    timeout_retries=20,
    connection_error_retries=20,
    weaviate_error_retries=WeaviateErrorRetryConf(number_retries=3),
    dynamic=True,
) as batch:
    # Batch import all Questions
    for i, d in df.iterrows():
        if i % 1000 == 0:
            print(f"importing passage: {i}")
        
        properties = {
            "title": d["title"],
            "lemmatized_title": lemmatize(d["title"], lemma_dict),
            "text": d["text"],
            "lemmatized_text": lemmatize(d["text"], lemma_dict),
        }

        client.batch.add_data_object(
            properties,
            "Passage",
            uuid=create_uuid_from_string(d["id"])
        )

I’m getting a few retries for timeouts, but they seem to succeed on the second attempt, and there are too few of them to account for the missing docs.
I’m not seeing any other errors during indexing, yet the count I get at the end is roughly 1,400 docs short.

Any ideas on what might be causing this?
Are there silent errors that I’d be missing given my code?

Any help would be much appreciated.

jphwang · July 18, 2023, 9:24am

Hi @lnatspacy

That sounds a bit frustrating. Let me see if I can help.

Individual object-level errors during batch insertion will not show up as “errors” as such, as long as each batch request itself was fulfilled. They are only reported inside the per-object results of the response.

I wonder if that’s what is going on here.

Since you are using the Python client, you could make use of the batch callback to check for this during insertion (see “Python | Weaviate - vector database”).

I see that your uuid is deterministic, so if you don’t want to re-import your data you could use that to loop through your data and check for missing objects. Then you could try importing those.
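A minimal sketch of that check, under stated assumptions: `find_missing_ids`, `to_uuid` and the in-memory `stored` set are hypothetical names used for illustration only. `to_uuid` stands in for `create_uuid_from_string` (whose definition is not shown in this thread), and `stored` stands in for the database so the snippet runs on its own.

```python
import uuid
from typing import Callable, Iterable, List


def find_missing_ids(
    ids: Iterable[str],
    to_uuid: Callable[[str], str],
    exists: Callable[[str], bool],
) -> List[str]:
    """Return the source ids whose derived UUID is not found in the store."""
    return [source_id for source_id in ids if not exists(to_uuid(source_id))]


def to_uuid(source_id: str) -> str:
    # Hypothetical stand-in for create_uuid_from_string: any deterministic
    # mapping from the source id to a UUID works for this check.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, source_id))


# In-memory stand-in for the objects that actually made it into the database.
stored = {to_uuid("a"), to_uuid("c")}

missing = find_missing_ids(["a", "b", "c"], to_uuid, lambda u: u in stored)
print(missing)  # ['b']
```

Against a real instance, the `exists` argument would be a lookup by UUID. With the v3 Python client, something like `lambda u: client.data_object.exists(u, class_name="Passage")` should work, but treat that as an assumption and check it against your client version. The ids returned are the ones to re-import.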

Cheers!
JP

lnatspacy · July 19, 2023, 4:45am

Thank you @jphwang for responding so quickly.
I took your advice, and adapted my code to look like this:

def check_batch_result(results):    
    if results is not None:
        for result in results:
            if "result" in result and "errors" in result["result"]:
                if "error" in result["result"]["errors"]:
                    print(result["result"])

# Configure a batch process
with client.batch(
    batch_size=5, 
    timeout_retries=20, 
    connection_error_retries=20, 
    weaviate_error_retries=WeaviateErrorRetryConf(number_retries=3), 
    dynamic=True,
    callback=check_batch_result
) as batch:
    # Batch import all Questions
    for i, d in df.iterrows():
        if i % 1000 == 0:
            print(f"importing passage: {i}")
        
        properties = {
            "title": d["title"],
            "lemmatized_title": lemmatize(d["title"], lemma_dict),
            "text": d["text"],
            "lemmatized_text": lemmatize(d["text"], lemma_dict),
        }

        client.batch.add_data_object(
            properties,
            "Passage",
            uuid=create_uuid_from_string(d["id"])
        )

No errors were printed during import and there is again a difference between how many rows were in my dataframe and what weaviate is reporting.
I made sure that the callback was actually working by testing it with a smaller dataset first, printing even on success.

Any other ideas?

jphwang · July 19, 2023, 7:53pm

If that’s the case - is it possible that you have duplicate IDs, or somehow the create_uuid_from_string function is generating duplicates?

What happens if you run:

len(df.drop_duplicates(subset=['id']))
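To see why duplicates would explain the gap, here is a small sketch using only the standard library. It assumes the UUID is derived deterministically from the `id` value, so rows that share an `id` map to the same UUID and end up as a single object. The `duplicate_ids` helper, the sample ids and the use of `uuid.uuid5` as a stand-in for `create_uuid_from_string` are all illustrative assumptions.

```python
import uuid
from collections import Counter


def duplicate_ids(ids):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative data; with a DataFrame you could pass df["id"] instead.
ids = ["p1", "p2", "p2", "p3", "p3", "p3"]

dupes = duplicate_ids(ids)
# Every occurrence beyond the first collapses onto an existing UUID.
collapsed = sum(n - 1 for n in dupes.values())

# Stand-in for create_uuid_from_string: same input string, same UUID.
unique_uuids = {str(uuid.uuid5(uuid.NAMESPACE_DNS, i)) for i in ids}

print(dupes)              # {'p2': 2, 'p3': 3}
print(collapsed)          # 3
print(len(unique_uuids))  # 3
```

If `len(ids) - collapsed` matches the count the database reports, duplicate ids account for the whole difference. In pandas, `df[df["id"].duplicated(keep=False)]` lists the affected rows.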

lnatspacy · July 20, 2023, 5:08am

Damn, of course! Thank you for pointing me in the right direction!

jphwang · July 20, 2023, 7:49am

No worries! Happy to help and even happier we got it resolved together.
