Indexing embeddings taking too long. What am I doing wrong?

Hi
Just starting out with Weaviate. I have 1,000 documents that I split into chunks of size 200, and then attempted to import them into Weaviate, mostly following the tutorial and getting-started guides.
Below are the schema and the import code.

class_obj = {
    'class': 'className',
    'description': 'description',
    'properties': [
        {
            'name': 'title',
            'description': 'Title',
            'dataType': ['text']
        },
        {
            'name': 'source',
            'description': 'Source',
            'dataType': ['text']
        },
        {
            'name': 'content',
            'description': 'Content',
            'dataType': ['text']
        },
    ],
    'vectorizer': 'text2vec-openai',
    'moduleConfig': {
        'text2vec-openai': {  # this must match the vectorizer used
            'vectorizeClassName': False,
            'model': 'ada',
            'modelVersion': '002',
            'type': 'text'
        }
    }
}

# ===== Import data =====
# Configure the batch import
client.batch.configure(
    batch_size=100,
)

for document in documents:
    properties = {
        "title": document.metadata["title"],
        "content": document.page_content,
        "source": document.metadata["source"]
    }
    try:
        client.batch.add_data_object(properties, "className")
    except Exception as e:
        print(e)
        print(document.metadata["title"])

client.batch.flush()

However, this took quite a bit of time: after 30+ minutes, only about 60% of the documents had been added to Weaviate. I am using OpenAI to generate the embeddings.
What is the bottleneck in this situation?
My theory is that I'm being rate-limited by OpenAI.

Hi @alt-glitch - welcome!

That does seem like it’s taking a while. Some questions to help us diagnose:

  • What version of Weaviate are you running?
  • What version of the Weaviate client are you running?
  • Are you running Weaviate locally or on the cloud?
  • How many objects are you importing, given the chunk size? (or, how long are the documents?)

Additionally, I see that the batch process is not instantiated explicitly. We recommend using a context manager: does it make a difference if you instantiate the batch via with client.batch() as batch: and add the data accordingly?
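For example, something like this (a minimal sketch adapting your snippet to the v3 client's context manager; batch_size kept at 100 from your configuration):

with client.batch(batch_size=100) as batch:
    for document in documents:
        batch.add_data_object(
            {
                "title": document.metadata["title"],
                "content": document.page_content,
                "source": document.metadata["source"],
            },
            "className",
        )
# Exiting the with-block flushes any remaining objects automatically,
# so an explicit client.batch.flush() is not needed.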

Cheers,
JP

a) Is there any output from the client?
b) Could you try a smaller batch size?
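If nothing is printed, one way to surface per-object errors is to attach a callback to the batch (a sketch assuming the v3 Python client; the result-dict shape follows Weaviate's standard batch response):

def print_batch_errors(results):
    # Each batch response is a list of per-object result dicts;
    # failures show up under result -> errors.
    if results is None:
        return
    for item in results:
        errors = item.get("result", {}).get("errors")
        if errors:
            print(errors)

client.batch.configure(
    batch_size=20,                # smaller batch size to test
    callback=print_batch_errors,  # called with the results of every batch
)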

Hello! Thanks for the response!
I am running Weaviate version 1.19.11 in a trial cluster on the cloud, with the Weaviate Python client version 3.22.1.
I instantiated the batch as you suggested, and reduced the number of documents and the number of workers (so as not to hit OpenAI's rate limit).
That seems to have solved the problem. The documents now get indexed within a couple of minutes.
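For reference, the working setup now looks roughly like this (a sketch with approximate values; num_workers controls how many concurrent batch requests the v3 client sends):

client.batch.configure(
    batch_size=50,   # approximate value, tuned down from 100
    num_workers=1,   # fewer concurrent requests, to stay under OpenAI's rate limit
)

with client.batch as batch:
    for document in documents:
        batch.add_data_object(
            {
                "title": document.metadata["title"],
                "content": document.page_content,
                "source": document.metadata["source"],
            },
            "className",
        )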

Thank you!

Hi @Dirk, I am using Weaviate with LlamaIndex, running a custom Weaviate instance on EC2. When I upload a document of around 300 pages and create embeddings, it runs for around 10 minutes and still keeps going. But when I run my Flask code and the Weaviate instance locally, it completes the embeddings for the same file in about a minute.