Description
I am using Weaviate locally with a Docker container and the Weaviate Python client. I encounter a “Deadline Exceeded” error when trying to insert a large batch of data.
Code:
import weaviate
import os

client = weaviate.Client(
    url="http://localhost:8080",
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # replace with your inference API key
    },
)

client.schema.create_class({
    "class": "work_steps",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {  # note: schema keys are camelCase, not snake_case
        "generative-openai": {}
    }
})
work_steps_data = [
    {"wtd_text": d["wtd_text"], "wta_text": d["wta_text"]}
    for d in data_json
]
# len(work_steps_data) == 106954

try:
    client.batch.create_objects(work_steps_data)
except weaviate.exceptions.WeaviateBatchError as e:
    print(f"Error: {e}")
Error:
{
"name": "WeaviateBatchError",
"message": "Query call with protocol GRPC batch failed with message <AioRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = \"Deadline Exceeded\"
debug_error_string = \"UNKNOWN:Error received from peer {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2024-08-01T18:14:40.555441469+04:00\"}\"
>.",
...
}
docker-compose.yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.6
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      CLIP_INFERENCE_API: 'http://multi2vec-clip:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'multi2vec-clip'
      ENABLE_MODULES: 'multi2vec-clip,generative-openai,generative-cohere,text2vec-openai,text2vec-huggingface,text2vec-cohere,reranker-cohere'
      CLUSTER_HOSTNAME: 'node1'
    restart: on-failure:0
  multi2vec-clip:
    image: semitechnologies/multi2vec-clip:sentence-transformers-clip-ViT-B-32-multilingual-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:
Additional Information
Docker Logs:
weaviate-1 | {"action":"startup","default_vectorizer_module":"multi2vec-clip","level":"info","msg":"the default vectorizer modules is set to \"multi2vec-clip\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-08-01T07:04:33Z"}
...
weaviate-1 | {"level":"warning","msg":"prop len tracker file /var/lib/weaviate/work_steps/iPkMMMILWoTR/proplengths does not exist, creating new tracker","time":"2024-08-01T08:54:43Z"}
...
multi2vec-clip-1 | INFO: Model initialization complete
...
weaviate-1 | {"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-01T07:04:38Z"}
Problem
I am trying to insert a large dataset (106,954 records) into Weaviate, but I keep encountering a "Deadline Exceeded" error when using the batch insert functionality.
Questions
- How can I avoid the “Deadline Exceeded” error during batch insertion?
- Are there any recommended configurations or settings for handling large batch inserts?
- Is there a way to increase the timeout settings for GRPC batch operations in Weaviate?
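Regarding the last question: with the v4 weaviate-client, timeouts (including the one applied to gRPC batch calls) can be raised at connection time. Below is a sketch assuming the v4 client; the timeout values are placeholders, not recommended settings:

```python
import os
import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

# Timeouts are in seconds; `insert` covers batch/insert operations,
# so raising it should give large imports more headroom.
client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    grpc_port=50051,
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=300)
    ),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)
```

This only widens the window per call; for very large imports, chunking the data (as in the workaround below) is still advisable.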
Any assistance or recommendations would be greatly appreciated. Thank you!
P.S:
I used the workaround below for the batch upsert to avoid any errors:
work_step_col = client.collections.get("work_steps")
# work_step_col.data.insert_many(work_steps_data)

import time

batch_size = 1000  # adjust the batch size as needed
for i in range(0, len(work_steps_data), batch_size):
    batch = work_steps_data[i:i + batch_size]
    work_step_col.data.insert_many(batch)
    time.sleep(2)
It has been 15 minutes and counting, so I am posting this anyway.
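For what it's worth, the loop above can be factored into a small reusable helper with retries. This is a plain-Python sketch: `insert_fn` is a hypothetical callback standing in for `work_step_col.data.insert_many`, and the retry/backoff numbers are assumptions:

```python
import time
from typing import Callable, Sequence


def insert_in_chunks(rows: Sequence[dict],
                     insert_fn: Callable[[Sequence[dict]], None],
                     chunk_size: int = 1000,
                     retries: int = 3,
                     pause: float = 2.0) -> int:
    """Insert `rows` in fixed-size chunks, retrying each chunk on failure.

    Returns the number of rows successfully handed to `insert_fn`.
    """
    sent = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        for attempt in range(retries):
            try:
                insert_fn(chunk)  # e.g. work_step_col.data.insert_many(chunk)
                sent += len(chunk)
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # give up after the last retry
                time.sleep(pause * (attempt + 1))  # simple linear backoff
    return sent
```

Usage would be `insert_in_chunks(work_steps_data, work_step_col.data.insert_many)`; a sleep between chunks (as in the loop above) could be added the same way if throttling helps.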