Hi all,
I am working on a script (and testing it locally) to deploy a very large dataset into my self-hosted Weaviate database (using self-hosted transformer models as well, for vectorizing and reranking). For now, I'm testing this script by port-forwarding my hosted Weaviate pod to my local machine.
While this works fine when I add single entries manually (nearly instantaneously, I might add), I'm having a lot of trouble importing batches due to connection timeouts.
Here is my code for configuring and importing batches into the weaviate client, based on some of the public documentation:
import time

import weaviate


def configure_batch(client: weaviate.Client, batch_size: int, batch_target_rate: int):
    """
    Configure the Weaviate client's batch so it creates objects at `batch_target_rate`.

    Parameters
    ----------
    client : Client
        The Weaviate client instance.
    batch_size : int
        The batch size.
    batch_target_rate : int
        The batch target rate as # of objects per second.
    """
    def callback(batch_results: dict) -> None:
        # you could print batch errors here
        time_took_to_create_batch = batch_size * (
            client.batch.creation_time / client.batch.recommended_num_objects
        )
        time.sleep(
            max(batch_size / batch_target_rate - time_took_to_create_batch + 1, 0)
        )

    client.batch.configure(
        batch_size=batch_size,
        timeout_retries=5,
        callback=callback,
        num_workers=1,
        creation_time=300,
        dynamic=True,
    )
counter = 0
interval = 1000


def import_papers(papers: list[PaperMetadata], batch_size: int = 100):
    """
    Imports the papers into Weaviate.
    """
    global counter
    class_name = PaperMetadata.schema_name
    configure_batch(client, 1, 5)
    with client.batch as batch:
        for paper in papers:
            try:
                batch.add_data_object(
                    paper.to_map(),
                    class_name,
                    uuid=generate_uuid5(paper.corpusId),
                )
                counter += 1
                if counter % interval == 0:
                    logging.info(f"Imported {counter} articles...")
            except Exception as e:
                logging.error(f"error adding paper: {e}")
                raise
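As an aside, the "you could print batch errors here" comment in the callback could be filled in with something like the sketch below. It assumes the v3 Python client's batch-result shape (a list of per-object dicts, each with a nested result -> errors -> error list on failure); the function name is mine, and I have not confirmed every field name against my client version:

```python
import logging


def collect_batch_errors(results) -> list:
    """Extract and log error messages from a Weaviate batch-callback
    result list (assumed v3 result shape). Returns the messages."""
    messages = []
    for item in results or []:
        # On failure, each item is expected to carry
        # {"result": {"errors": {"error": [{"message": ...}, ...]}}}.
        errors = item.get("result", {}).get("errors") or {}
        for err in errors.get("error", []):
            messages.append(err.get("message", ""))
    for msg in messages:
        logging.error("Batch object failed: %s", msg)
    return messages
```

This could then be passed as the `callback` argument to `client.batch.configure` (or called from inside the existing throttling callback).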
For now, I'm only ever sending one paper per call to `import_papers`. And yet, every call I make to this function immediately results in a stream of:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: timed out
[ERROR] Batch ConnectTimeout Exception occurred! Retrying in 2s. [1/3]
How can I get around this? I know my port-forwarded pod is able to receive and apply changes to the database, since this code works perfectly:
paper = PaperMetadata(
    ...
)
response = client.data_object.create(
    paper.to_map(),
    class_name,
    uuid=generate_uuid5(paper.corpusId),
)