Indexing embeddings taking too long. What am I doing wrong?

Hi
Just starting out with Weaviate. I have 1,000 documents that I split into chunks of size 200, and then attempted to import them into Weaviate, mostly following the tutorial and getting-started guides.
Below are the schema and the import code.

class_obj = {
    'class': 'className',
    'description': 'description',
    'properties': [
        {
            'name': 'title',
            'description': 'Title',
            'dataType': ['text']
        },
        {
            'name': 'source',
            'description': 'Source',
            'dataType': ['text']
        },
        {
            'name': 'content',
            'description': 'Content',
            'dataType': ['text']
        },
    ],
    'vectorizer': 'text2vec-openai',
    'moduleConfig': {
        'text2vec-openai': {  # this must match the vectorizer used
            'vectorizeClassName': False,
            'model': 'ada',
            'modelVersion': '002',
            'type': 'text'
        }
    }
}

# ===== Import data =====
# Configure the batch import
client.batch.configure(
    batch_size=100,
)

for document in documents:
    properties = {
        "title": document.metadata["title"],
        "content": document.page_content,
        "source": document.metadata["source"]
    }
    try:
        client.batch.add_data_object(properties, "className")
    except Exception as e:
        print(e)
        print(document.metadata["title"])

client.batch.flush()

However, this took quite a bit of time: after 30+ minutes, only about 60% of the documents had been added to Weaviate. I am using OpenAI to generate the embeddings.
What is the bottleneck in this situation?
My theory is that I'm being rate-limited by OpenAI.

Hi @alt-glitch - welcome!

That does seem like it’s taking a while. Some questions to help us diagnose:

  • What version of Weaviate are you running?
  • What version of the Weaviate client are you running?
  • Are you running Weaviate locally or on the cloud?
  • How many objects are you importing, given the chunk size? (or, how long are the documents?)

Additionally, I see that the batch process is not instantiated explicitly. We recommend using a context manager: does it make a difference if you instantiate the batch via with client.batch() as batch: and add the data accordingly?
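For example, something like this (a minimal sketch adapting your snippet to the v3 client's context manager; batch_size kept at 100 from your configuration):

with client.batch(batch_size=100) as batch:
    for document in documents:
        batch.add_data_object(
            {
                "title": document.metadata["title"],
                "content": document.page_content,
                "source": document.metadata["source"],
            },
            "className",
        )
# Exiting the with-block flushes any remaining objects automatically,
# so an explicit client.batch.flush() is not needed.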

Cheers,
JP

a) Is there any output from the client?
b) Could you try a smaller batch size?
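If nothing is printed, one way to surface per-object errors is to attach a callback to the batch (a sketch assuming the v3 Python client; the result-dict shape follows Weaviate's standard batch response):

def print_batch_errors(results):
    # Each batch response is a list of per-object result dicts;
    # failures show up under result -> errors.
    if results is None:
        return
    for item in results:
        errors = item.get("result", {}).get("errors")
        if errors:
            print(errors)

client.batch.configure(
    batch_size=20,                # smaller batch size to test
    callback=print_batch_errors,  # called with the results of every batch
)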

Hello! Thanks for the response!
I am running Weaviate version 1.19.11 in a trial cluster on the cloud, with the Weaviate Python client version 3.22.1.
I instantiated the batch as you suggested, and reduced the number of documents and the number of workers (so as not to hit OpenAI's rate limit).
That seems to have solved the problem. The documents now get indexed within a couple of minutes.
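For reference, the working setup now looks roughly like this (a sketch with approximate values; num_workers controls how many concurrent batch requests the v3 client sends):

client.batch.configure(
    batch_size=50,   # approximate value, tuned down from 100
    num_workers=1,   # fewer concurrent requests, to stay under OpenAI's rate limit
)

with client.batch as batch:
    for document in documents:
        batch.add_data_object(
            {
                "title": document.metadata["title"],
                "content": document.page_content,
                "source": document.metadata["source"],
            },
            "className",
        )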

Thank you!

Hi @Dirk, I am using Weaviate with LlamaIndex, running a custom Weaviate instance on EC2. When I upload a document of around 300 pages and create embeddings, it runs for around 10 minutes and still keeps going. But when I run my Flask code and the Weaviate instance locally, it completes the embeddings for the same file in about a minute.