Issue During Batch Insert

Description

I am running Weaviate locally in a Docker container, using the Weaviate Python client (v4). When I try to insert a large batch of data, I get a “Deadline Exceeded” error.

Code:

import os

import weaviate
from weaviate.classes.config import Configure

# Connect to the local Docker instance (REST on 8080, gRPC on 50051)
client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # replace with your inference API key
    },
)

client.collections.create(
    name="work_steps",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    generative_config=Configure.Generative.openai(),
)

work_steps_data = [
    {"wtd_text": d["wtd_text"], "wta_text": d["wta_text"]}
    for d in data_json  # data_json is loaded elsewhere
]

# len(work_steps_data)  # 106954

work_steps = client.collections.get("work_steps")
try:
    work_steps.data.insert_many(work_steps_data)  # one gRPC call for all ~107k objects
except weaviate.exceptions.WeaviateBatchError as e:
    print(f"Error: {e}")

Error:

{
    "name": "WeaviateBatchError",
    "message": "Query call with protocol GRPC batch failed with message <AioRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = \"Deadline Exceeded\"
    debug_error_string = \"UNKNOWN:Error received from peer  {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2024-08-01T18:14:40.555441469+04:00\"}\"
>.",
    ...
}

docker-compose.yaml

version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.6
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      CLIP_INFERENCE_API: 'http://multi2vec-clip:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'multi2vec-clip'
      ENABLE_MODULES: 'multi2vec-clip,generative-openai,generative-cohere,text2vec-openai,text2vec-huggingface,text2vec-cohere,reranker-cohere'
      CLUSTER_HOSTNAME: 'node1'
    restart: on-failure:0
  multi2vec-clip:
    image: semitechnologies/multi2vec-clip:sentence-transformers-clip-ViT-B-32-multilingual-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:

Additional Information

Docker Logs:

weaviate-1        | {"action":"startup","default_vectorizer_module":"multi2vec-clip","level":"info","msg":"the default vectorizer modules is set to \"multi2vec-clip\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-08-01T07:04:33Z"}
...
weaviate-1        | {"level":"warning","msg":"prop len tracker file /var/lib/weaviate/work_steps/iPkMMMILWoTR/proplengths does not exist, creating new tracker","time":"2024-08-01T08:54:43Z"}
...
multi2vec-clip-1  | INFO:     Model initialization complete
...
weaviate-1        | {"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-01T07:04:38Z"}

Problem

I am trying to insert a large dataset (106,954 records) into Weaviate, but I keep encountering a “Deadline Exceeded” error when using the batch insert functionality.

Questions

  1. How can I avoid the “Deadline Exceeded” error during batch insertion?
  2. Are there any recommended configurations or settings for handling large batch inserts?
  3. Is there a way to increase the timeout settings for gRPC batch operations in Weaviate? (See the sketch after this list.)
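
On question 3: the v4 Python client accepts per-call timeouts at connect time, which covers the gRPC batch path. A minimal sketch, assuming the v4 client; the timeout values below are illustrative, not recommendations:

import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

# Give inserts (the gRPC batch path) more time before the deadline is hit
client = weaviate.connect_to_local(
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=300)  # seconds; illustrative values
    ),
)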

Any assistance or recommendations would be greatly appreciated. Thank you!

P.S.:

I used the workaround below for the batch insert to avoid these errors:

import time

work_step_col = client.collections.get("work_steps")
# work_step_col.data.insert_many(work_steps_data)  # a single call on the full list fails

batch_size = 1000  # adjust the batch size as needed
for i in range(0, len(work_steps_data), batch_size):
    batch = work_steps_data[i:i + batch_size]
    work_step_col.data.insert_many(batch)
    time.sleep(2)  # short pause between chunks to let the server catch up
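
One refinement worth noting: in the v4 client, insert_many returns a result object whose errors field maps each failed object's index to its error, so failures can be surfaced per chunk rather than silently dropped. A sketch using the same names as above:

for i in range(0, len(work_steps_data), batch_size):
    chunk = work_steps_data[i:i + batch_size]
    result = work_step_col.data.insert_many(chunk)
    if result.has_errors:
        # result.errors maps an object's index within the chunk to its error
        print(f"Chunk at offset {i}: {len(result.errors)} failed objects")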

It has been 15 minutes and counting, so I am posting this anyway.

If I understand correctly, you tried to import more than 100,000 objects into Weaviate in a single request? It is normal to run into problems with requests that large. What about using dynamic batching, as proposed in the documentation? It will automatically choose appropriate batch sizes for an efficient import; see the sketch below the link.

Batch import | Weaviate - Vector Database
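
For reference, a minimal sketch of dynamic batching with the v4 Python client, assuming the same client and work_steps_data from the original post:

with client.batch.dynamic() as batch:
    for obj in work_steps_data:
        # The client sizes and flushes requests automatically based on server feedback
        batch.add_object(
            collection="work_steps",
            properties=obj,
        )

# Objects that could not be imported after retries end up here
failed = client.batch.failed_objects
if failed:
    print(f"{len(failed)} objects failed to import")

Because each underlying request stays small, no single gRPC call has to fit 100,000 objects within the deadline.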


Cool! I somehow completely missed dynamic batching. Didn't know it existed. Will use it next time.
Thanks 🙂