Not able to ingest the batches of data

Vipul_Maheshwari · July 10, 2024, 5:43am

Hi everyone, I am running into a lot of issues adding data in batches during ingestion into Weaviate. I have tried everything I could find, but there is no single start-to-end script that shows how to ingest all the data in batches…

I tried skimming through this link: Batch import | Weaviate - Vector Database

but nothing seems to be working.

Can anyone help me with it? I really need it…

DudaNogueira · July 10, 2024, 1:31pm

hi @Vipul_Maheshwari !!

Welcome to our community.

Please, when opening a thread, fill in the requested info, like server version, deployment, etc.

Do you see any error logs? Can you share any code we can use to reproduce the issue?

Thanks!

Vipul_Maheshwari · July 11, 2024, 12:52pm

Hey @DudaNogueira, thanks for getting back to me.

Next time, I will make sure to fill in the requested info and other details.

I have completed this script for batch ingestion. Could you skim through it and let me know if there are any errors in it?

import numpy as np
import logging
import time
import weaviate
from tqdm import tqdm
import weaviate.classes.config as wc

# Constants
COLLECTION_NAME = "weaviate_test_collection_part6"
NUM_BATCHES = 10
VECTORS_PER_BATCH = 100
VECTOR_SIZE = 1536

# Setup logging
logging.basicConfig(level=logging.INFO)

# Connect to Weaviate
client = weaviate.connect_to_embedded()

# Create Weaviate collection
weaviate_collection = client.collections.create(
    name=COLLECTION_NAME,
    properties=[
        wc.Property(name="item", data_type=wc.DataType.TEXT),
    ],
    vectorizer_config=None
)

# Define the batch generation function
def make_batches(num_batches, vectors_per_batch, vector_size):
    for i in range(num_batches):
        try:
            vectors = np.random.rand(vectors_per_batch, vector_size).astype(np.float32)
            vectors_list = vectors.tolist()
            items = [str(i * vectors_per_batch + j + 1) for j in range(vectors_per_batch)]
            batch = list(zip(items, vectors_list))
            logging.info(f"Successfully generated batch {i+1}/{num_batches}")
            yield batch
        except Exception as e:
            logging.error(f"Error in batch {i+1}: {str(e)}")
            raise

# Main processing loop
try:
    total_time = 0.0
    batch_times = []
    for _batch_index, _batch in enumerate(
        tqdm(
            make_batches(
                num_batches=NUM_BATCHES,
                vectors_per_batch=VECTORS_PER_BATCH,
                vector_size=VECTOR_SIZE,
            ),
            desc="Processing batches",
            total=NUM_BATCHES,
        )
    ):
        ct = 0
        # Start the timer before opening the batch context
        batch_start_time = time.time()
        with weaviate_collection.batch.fixed_size(VECTORS_PER_BATCH) as batch:
            for item, vector in _batch:
                batch.add_object(
                    properties={"item": item},
                    vector=vector
                )
                ct += 1

        # Stop the timer only after the `with` block has exited: add_object() queues
        # objects, and the context manager flushes whatever is still pending on exit.
        duration = time.time() - batch_start_time
        batch_times.append(duration)
        total_time += duration
        print(f"Processed {ct} vectors in batch {_batch_index + 1} of {NUM_BATCHES} in {duration:.2f}s")
    
    print(f"Total processing time: {total_time:.2f}s")
    print(f"Average time per batch: {np.mean(batch_times):.2f}s")

except Exception as e:
    logging.error(f"An error occurred during processing: {str(e)}")
    raise

finally:
    # Close the client so the embedded instance shuts down cleanly
    client.close()

DudaNogueira · July 15, 2024, 1:22pm

Hi!

Can you try catching those errors?

Check here:
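As a rough sketch of what catching batch errors could look like with the v4 Python client: per-object failures generally do not raise inside the loop, so they have to be read back afterwards. The helper name `import_with_error_check` and the `max_errors` threshold are made up for illustration; `batch.number_errors` and `collection.batch.failed_objects` are the client attributes this relies on, and `collection` is assumed to be the object returned by `client.collections.create(...)`.

```python
def import_with_error_check(collection, rows, batch_size=100, max_errors=10):
    """Add (properties, vector) pairs in fixed-size batches and return the failures.

    `collection` is assumed to behave like a v4 collection object, i.e. it
    exposes `batch.fixed_size(...)` and `batch.failed_objects`.
    """
    with collection.batch.fixed_size(batch_size=batch_size) as batch:
        for properties, vector in rows:
            batch.add_object(properties=properties, vector=vector)
            # number_errors is a running count of errors reported so far
            if batch.number_errors > max_errors:
                # Hypothetical policy: give up once too many objects have failed
                break
    # Failed objects are collected on the collection and read after the context exits
    return list(collection.batch.failed_objects)
```

With the script above, this could be called as `failed = import_with_error_check(weaviate_collection, [({"item": item}, vector) for item, vector in _batch])`, then logging `len(failed)` and the first entry if the list is not empty.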

Vipul_Maheshwari · July 21, 2024, 7:08am

Hi! I think I figured it out. Can you confirm whether this looks good to you? Thanks in advance:

import numpy as np
import logging
import time
import weaviate
from tqdm import tqdm
import weaviate.classes.config as wc

# Constants
COLLECTION_NAME = "weaviate_test_collection_part6"
NUM_BATCHES = 10
VECTORS_PER_BATCH = 100
VECTOR_SIZE = 1536

# Setup logging
logging.basicConfig(level=logging.INFO)
# Connect to Weaviate
client = weaviate.connect_to_embedded()

# Create Weaviate collection
weaviate_collection = client.collections.create(
    name=COLLECTION_NAME,
    properties=[
        wc.Property(name="item", data_type=wc.DataType.TEXT),
    ],
    vectorizer_config=None
)

# Define the batch generation function
def make_batches(num_batches, vectors_per_batch, vector_size):
    for i in range(num_batches):
        try:
            vectors = np.random.rand(vectors_per_batch, vector_size).astype(np.float32)
            vectors_list = vectors.tolist()
            items = [str(i * vectors_per_batch + j + 1) for j in range(vectors_per_batch)]
            batch = list(zip(items, vectors_list))
            logging.info(f"Successfully generated batch {i+1}/{num_batches}")
            yield batch
        except Exception as e:
            logging.error(f"Error in batch {i+1}: {str(e)}")
            raise

# Main processing loop
try:
    total_time = 0.0
    batch_times = []
    for _batch_index, _batch in enumerate(
        tqdm(
            make_batches(
                num_batches=NUM_BATCHES,
                vectors_per_batch=VECTORS_PER_BATCH,
                vector_size=VECTOR_SIZE,
            ),
            desc="Processing batches",
            total=NUM_BATCHES,
        )
    ):
        ct = 0
        # Start the timer before opening the batch context
        batch_start_time = time.time()
        with weaviate_collection.batch.fixed_size(VECTORS_PER_BATCH) as batch:
            for item, vector in _batch:
                batch.add_object(
                    properties={"item": item},
                    vector=vector
                )
                ct += 1

        # Stop the timer only after the `with` block has exited: add_object() queues
        # objects, and the context manager flushes whatever is still pending on exit.
        duration = time.time() - batch_start_time
        batch_times.append(duration)
        total_time += duration
        print(f"Processed {ct} vectors in batch {_batch_index + 1} of {NUM_BATCHES} in {duration:.2f}s")
    
    print(f"Total processing time: {total_time:.2f}s")
    print(f"Average time per batch: {np.mean(batch_times):.2f}s")

except Exception as e:
    logging.error(f"An error occurred during processing: {str(e)}")
    raise

finally:
    # Close the client so the embedded instance shuts down cleanly
    client.close()

DudaNogueira · July 23, 2024, 12:45pm

Hi!

This seems fine.

Now the idea is that you can experiment with different batch sizes.

Here you can have more info on that:
https://weaviate-python-client.readthedocs.io/en/stable/weaviate.batch.html#module-weaviate.batch.crud_batch
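One possible way to experiment with batch sizes is to time the same data once per size. This is only a sketch: `time_fixed_size_import` and `sweep_batch_sizes` are hypothetical helper names, `collection` is assumed to be a v4 collection object, and the timer is deliberately stopped after the `with` block so that the flush on exit is part of the measurement.

```python
import time


def time_fixed_size_import(collection, rows, batch_size):
    """Return the seconds taken to import `rows` with one fixed batch size."""
    start = time.perf_counter()
    with collection.batch.fixed_size(batch_size=batch_size) as batch:
        for properties, vector in rows:
            batch.add_object(properties=properties, vector=vector)
    # The context has exited here, so pending objects have been flushed
    return time.perf_counter() - start


def sweep_batch_sizes(collection, rows, batch_sizes=(50, 100, 200)):
    """Time the same rows once per batch size; returns {batch_size: seconds}."""
    rows = list(rows)  # materialise so every run sees identical data
    return {
        size: time_fixed_size_import(collection, rows, size)
        for size in batch_sizes
    }
```

Note that this sketch imports into the same collection on every run; for a cleaner comparison you would probably want a fresh collection per batch size.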

Thanks!

Vipul_Maheshwari · July 23, 2024, 1:07pm

Yes! But I wanted to use the VECTORS_PER_BATCH variable to decide how many vectors I ingest per batch…

Thanks for looking into this, and sorry for the inconvenience! I am glad you reviewed the snippet and that I am good to go…

DudaNogueira · July 23, 2024, 1:10pm

That’s ok.

You can probably get interesting results with dynamic batching, as it adjusts the batch size according to what the server reports back, taking the current server load into account.
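Switching between the two modes is a small change in the v4 Python client. The sketch below assumes `collection` is a v4 collection object; `open_batch` and its `mode` argument are hypothetical names used only to show the two calls side by side.

```python
def open_batch(collection, mode="fixed", batch_size=100):
    """Return a batch context manager for the chosen batching mode."""
    if mode == "dynamic":
        # The client picks the batch size based on server feedback
        return collection.batch.dynamic()
    if mode == "fixed":
        # The batch size stays constant for the whole import
        return collection.batch.fixed_size(batch_size=batch_size)
    raise ValueError(f"Unknown batching mode: {mode!r}")
```

Usage would then be `with open_batch(weaviate_collection, mode="dynamic") as batch:` followed by the same `batch.add_object(...)` calls as before.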

Vipul_Maheshwari · July 23, 2024, 1:29pm

Well, to be honest, I am running a benchmark across various DBs to understand how long ingestion takes as well as where the bottleneck is under server load…

So it would be unfair to change the batch size dynamically.

DudaNogueira · July 23, 2024, 1:31pm

Fair enough

You can also try enabling ASYNC_INDEXING, so you don’t need to wait for the indexing step.
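`ASYNC_INDEXING` is a server-side environment variable. For an embedded instance like the one in the script, one way to set it might be through the `environment_variables` argument of `connect_to_embedded`. This is a sketch under that assumption; `embedded_env` and `connect_with_async_indexing` are hypothetical helper names.

```python
def embedded_env(async_indexing=True):
    """Build the environment variables passed to the embedded Weaviate server."""
    return {"ASYNC_INDEXING": "true" if async_indexing else "false"}


def connect_with_async_indexing(async_indexing=True):
    """Start an embedded instance with ASYNC_INDEXING set as requested."""
    # Imported lazily so embedded_env() can be used without the client installed
    import weaviate

    return weaviate.connect_to_embedded(
        environment_variables=embedded_env(async_indexing)
    )
```

For a benchmark it is worth keeping in mind that with async indexing the measured import time would no longer include building the vector index, so the numbers are not directly comparable to a run without it.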
