Description
I am using Weaviate locally with a Docker container and the Weaviate Python client. I encounter a “Deadline Exceeded” error when trying to insert a large batch of data.
Code:
import weaviate
import os

client = weaviate.Client(
    url="http://localhost:8080",
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # replace with your inference API key
    },
)

client.schema.create_class({
    "class": "work_steps",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {  # note: schema keys are camelCase, not snake_case
        "generative-openai": {}
    }
})
work_steps_data = [
    {"wtd_text": d["wtd_text"], "wta_text": d["wta_text"]}
    for d in data_json
]
# len(work_steps_data) == 106954

try:
    client.batch.create_objects(work_steps_data)
except weaviate.exceptions.WeaviateBatchError as e:
    print(f"Error: {e}")
Error:
{
"name": "WeaviateBatchError",
"message": "Query call with protocol GRPC batch failed with message <AioRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = \"Deadline Exceeded\"
debug_error_string = \"UNKNOWN:Error received from peer {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2024-08-01T18:14:40.555441469+04:00\"}\"
>.",
...
}
docker-compose.yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.6
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      CLIP_INFERENCE_API: 'http://multi2vec-clip:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'multi2vec-clip'
      ENABLE_MODULES: 'multi2vec-clip,generative-openai,generative-cohere,text2vec-openai,text2vec-huggingface,text2vec-cohere,reranker-cohere'
      CLUSTER_HOSTNAME: 'node1'
    restart: on-failure:0
  multi2vec-clip:
    image: semitechnologies/multi2vec-clip:sentence-transformers-clip-ViT-B-32-multilingual-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:
Additional Information
Docker Logs:
weaviate-1 | {"action":"startup","default_vectorizer_module":"multi2vec-clip","level":"info","msg":"the default vectorizer modules is set to \"multi2vec-clip\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-08-01T07:04:33Z"}
...
weaviate-1 | {"level":"warning","msg":"prop len tracker file /var/lib/weaviate/work_steps/iPkMMMILWoTR/proplengths does not exist, creating new tracker","time":"2024-08-01T08:54:43Z"}
...
multi2vec-clip-1 | INFO: Model initialization complete
...
weaviate-1 | {"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-01T07:04:38Z"}
Problem
I am trying to insert a large dataset (106,954 records) into Weaviate, but I keep encountering a "Deadline Exceeded" error when using the batch insert functionality.
Questions
- How can I avoid the “Deadline Exceeded” error during batch insertion?
- Are there any recommended configurations or settings for handling large batch inserts?
- Is there a way to increase the timeout settings for GRPC batch operations in Weaviate?
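Regarding the last question: with the v4 weaviate-client, timeouts (including the one applied to gRPC batch calls) can be raised at connection time. Below is a sketch assuming the v4 client; the timeout values are placeholders, not recommended settings:

```python
import os
import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

# Timeouts are in seconds; `insert` covers batch/insert operations,
# so raising it should give large imports more headroom.
client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    grpc_port=50051,
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=300)
    ),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)
```

This only widens the window per call; for very large imports, chunking the data (as in the workaround below) is still advisable.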
Any assistance or recommendations would be greatly appreciated. Thank you!
P.S:
I used the workaround below for the batch upsert to avoid any errors:
work_step_col = client.collections.get("work_steps")
# work_step_col.data.insert_many(work_steps_data)

import time

batch_size = 1000  # adjust the batch size as needed
for i in range(0, len(work_steps_data), batch_size):
    batch = work_steps_data[i:i + batch_size]
    work_step_col.data.insert_many(batch)
    time.sleep(2)
It has been 15 minutes and counting, so I am posting this anyway.
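For what it's worth, the loop above can be factored into a small reusable helper with retries. This is a plain-Python sketch: `insert_fn` is a hypothetical callback standing in for `work_step_col.data.insert_many`, and the retry/backoff numbers are assumptions:

```python
import time
from typing import Callable, Sequence


def insert_in_chunks(rows: Sequence[dict],
                     insert_fn: Callable[[Sequence[dict]], None],
                     chunk_size: int = 1000,
                     retries: int = 3,
                     pause: float = 2.0) -> int:
    """Insert `rows` in fixed-size chunks, retrying each chunk on failure.

    Returns the number of rows successfully handed to `insert_fn`.
    """
    sent = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        for attempt in range(retries):
            try:
                insert_fn(chunk)  # e.g. work_step_col.data.insert_many(chunk)
                sent += len(chunk)
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # give up after the last retry
                time.sleep(pause * (attempt + 1))  # simple linear backoff
    return sent
```

Usage would be `insert_in_chunks(work_steps_data, work_step_col.data.insert_many)`; a sleep between chunks (as in the loop above) could be added the same way if throttling helps.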