Timing out on a batch of 1

Hi all,

I am working on a script (and testing it locally) to load a very large dataset into my self-hosted Weaviate database (using self-hosted transformer models as well for vectorizing and reranking). For now, I’m testing this script by port-forwarding my hosted Weaviate pod to my local machine.

While this works fine when manually adding single entries (nearly instantaneously, I might add), I’m having a lot of trouble importing batches due to connection timeouts.

Here is my code for configuring the Weaviate client’s batch and importing objects, based on the public documentation:

import logging
import time

import weaviate
from weaviate.util import generate_uuid5


def configure_batch(client: weaviate.Client, batch_size: int, batch_target_rate: int):
    """
    Configure the Weaviate client's batch so it creates objects at `batch_target_rate`.

    Parameters
    ----------
    client : weaviate.Client
        The Weaviate client instance.
    batch_size : int
        The batch size.
    batch_target_rate : int
        The batch target rate as # of objects per second.
    """

    def callback(batch_results: dict) -> None:
        # You could also inspect `batch_results` for per-object errors here.
        # Estimate how long the last batch took, then sleep just long enough
        # to stay at or below the target creation rate.
        time_took_to_create_batch = batch_size * (
            client.batch.creation_time / client.batch.recommended_num_objects
        )
        time.sleep(
            max(batch_size / batch_target_rate - time_took_to_create_batch + 1, 0)
        )

    client.batch.configure(
        batch_size=batch_size,
        timeout_retries=5,
        callback=callback,
        num_workers=1,
        creation_time=300,
        dynamic=True,
    )


counter = 0
interval = 1000


def import_papers(papers: list[PaperMetadata], batch_size: int = 100):
    """
    Imports the papers into Weaviate.
    """
    global counter

    class_name = PaperMetadata.schema_name

    # While testing, force a batch size of 1 at ~5 objects/second.
    configure_batch(client, 1, 5)
    with client.batch as batch:
        for paper in papers:
            try:
                batch.add_data_object(
                    paper.to_map(),
                    class_name,
                    uuid=generate_uuid5(paper.corpusId),
                )

                counter += 1
                if counter % interval == 0:
                    logging.info(f"Imported {counter} articles...")
            except Exception as e:
                logging.error(f"Error adding paper: {e}")
                raise
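
For reference, this is roughly how I create the client and drive the import while testing (the URL and timeout values below are placeholders for my local port-forward; PaperMetadata is defined elsewhere in my script):

# Placeholder test driver: the URL points at the port-forwarded pod, and
# timeout_config is (connect timeout, read timeout) in seconds.
client = weaviate.Client(
    "http://localhost:8080",
    timeout_config=(10, 120),
)

papers = [PaperMetadata(...)]  # a single test paper, fields elided
import_papers(papers)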

For now, I’m only ever sending 1 paper per call to “import_papers”. And yet, every single call I make to this function results in a stream of errors like:

[ERROR] Batch ConnectTimeout Exception occurred! Retrying in 2s. [1/3]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: timed out

How can I get around this? I know my port-forwarded pod is able to receive and apply changes to the database, since this code works perfectly:

paper = PaperMetadata(
    ...
)

response = client.data_object.create(
    paper.to_map(),
    class_name,
    uuid=generate_uuid5(paper.corpusId),
)

Hi!

This seems like a networking error in your infra stack.

You are running it with k8s, right? Have you seen this document?

Have you experimented with docker-compose?
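
In the meantime, a quick client-side sanity check is to hit the readiness endpoint over the same connection your batch import uses; a minimal sketch, assuming your port-forward maps to localhost:8080:

import weaviate

# Connect through the same port-forward the batch import uses
# (the URL is an assumption; adjust it to your forwarded port).
client = weaviate.Client("http://localhost:8080")

# is_ready() issues a plain GET against /v1/.well-known/ready, so if this
# call hangs or times out, the network path itself is the problem rather
# than the batching code.
print(client.is_ready())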