gRPC connection failure when processing data-intensive batches

Good morning!
A question: we are batch processing a set of data, and we have noticed that after processing the first batch we started to receive this error:

 Query call with protocol GRPC search failed with message <AioRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-08-15T22:09:21.067965744-04:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>.  [level: ERROR]
Exception in thread Thread-30 (worker_thread):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/weaviate/collections/grpc/query.py", line 762, in __call
    res = await self._connection.grpc_stub.Search(
  File "/usr/local/lib/python3.10/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-08-15T22:09:21.067965744-04:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"

We have tried changing the batch size, but something similar happens: after processing a small number of batches, the errors start to appear. What we do is find, for each vector, its nearest vectors (query.near_object).
Does anybody know if it could be because of the number of vectors? Do you know how we can optimize this process?

  • weaviate-client = "4.7.1"
  • Weaviate version = "1.24.9"
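
For reference, the lookup loop looks roughly like this (simplified; the collection handle and UUID variable are placeholders):

def worker_thread(object_uuids):
    # For each already-ingested object, fetch its nearest neighbours.
    for obj_uuid in object_uuids:
        response = vectors_collection.query.near_object(
            near_object=obj_uuid,  # UUID of the source object
            limit=10,
        )
        # ... process response.objects ...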

Hi @Nancy_Viviana_Espino !!

What is the batch configuration you are using?

We suggest using something like this as a base, then tweaking the batch size and concurrent requests according to the resources you have for your cluster:

# `movies` is your collection handle; `df` is a pandas DataFrame whose
# columns of interest are listed in `data_columns`.
with movies.batch.fixed_size(batch_size=20, concurrent_requests=2) as batch:
    for i, row in df.iterrows():
        obj_body = {c: row[c] for c in data_columns}
        batch.add_object(properties=obj_body)
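
Also, since the error is a client-side DEADLINE_EXCEEDED on query calls, you could try raising the gRPC timeouts when you create the client. A sketch, assuming you connect with one of the connect_to_* helpers (adjust the values to your workload):

import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

# A longer query timeout gives heavy near_object searches more headroom.
client = weaviate.connect_to_local(
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=120, insert=120)  # seconds
    )
)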

Let me know if this helps.

Thanks!

Thank you for reviewing this case.
I would like to clarify that we use batches mainly to process our vectors, not to add data to the collection. However, we have run into a problem when searching for nearby vectors: the gRPC communication disconnects. I should also mention that we process each batch on a different thread, and I want to determine whether I might be overloading the system by using the connection this way.

You mean that you ingest not only the data but also the vectors, right?

You can also do that with batch:

# `collection` is the target collection; `data_rows` holds the object
# properties and `vectors` the pre-computed embedding for each row.
with collection.batch.dynamic() as batch:
    for i, data_row in enumerate(data_rows):
        batch.add_object(
            properties=data_row,
            vector=vectors[i],
        )
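
dynamic() adjusts the batch size and number of concurrent requests on the fly based on how the server is keeping up, so it is a good starting point if you don't want to hand-tune fixed_size.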

Do you have any readings on this cluster's memory and CPU usage? Does it still have resources to spare?

How is this cluster deployed? Have you tried changing some of the options as stated here?

That's right: we generate the vectors in a separate, independent flow. During this process we download them and upload them to a collection in order to process them and find the closest one for each vector. I have been monitoring the process and have noticed some warning messages, such as the following:

/usr/local/lib/python3.10/asyncio/selector_events.py:701: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=598 read=idle write=<idle, bufsize=0>>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback

I am attaching an image of the memory and CPU monitor.

Hi @Nancy_Viviana_Espino !

How are you uploading the vectors? Have you tried changing some config as per the resource planning doc?

One option is to try using ASYNC_INDEXING.
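
For example, if you deploy with docker-compose, that would be an environment entry on the weaviate service (a sketch; async indexing is still an experimental flag in 1.24):

environment:
  ASYNC_INDEXING: 'true'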

My guess here is that the cluster is having a hard time both indexing your content and ingesting new data.
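
One more thing: that unclosed transport ResourceWarning usually means a client is not being closed when a worker thread finishes. If you create one client per thread, something like this makes sure the underlying connections get released (a sketch; the connection helper and collection name are placeholders):

import weaviate

def worker_thread(object_uuids):
    client = weaviate.connect_to_local()  # adjust to your deployment
    try:
        vectors = client.collections.get("Vectors")  # placeholder name
        for obj_uuid in object_uuids:
            vectors.query.near_object(near_object=obj_uuid, limit=10)
    finally:
        client.close()  # releases the gRPC/asyncio transports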