Strange batch import behavior

a-arteria · July 5, 2024, 4:14am

Description

Hi, I’m trying to import ~30 small chunks of text into a collection. The code runs without visible errors, but only 2 or 3 objects are actually created when I check via http://localhost:8080/v1/objects or through code.

Code is below

import weaviate

client = weaviate.connect_to_local()

schema = {
    "class": "WeaviateBlogChunk",
    "description": "A snippet from a Weaviate blogpost.",
    "vectorIndexType": "hnsw",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "skip": False,
            "vectorizeClassName": False,
            "vectorizePropertyName": False,
            "apiVersion": "<api version>",
            "baseURL": "<base url>",
            "deploymentId": "<model>",
            "resourceName": "<resource name>"
        },
        "generative-openai": {
            "apiVersion": "<api version>",
            "baseURL": "<base url>",
            "deploymentId": "<model>",
            "resourceName": "<resource name>"
        }
    },
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "description": "The text content of the podcast clip",
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": False,
                    "vectorizePropertyName": False,
                    "vectorizeClassName": False,
                    "apiVersion": "<api version>",
                    "baseURL": "<base url>",
                    "deploymentId": "<model>",
                    "resourceName": "<resource name>"
                }
            }
        }
    ]
}

client.collections.create_from_dict(schema)

collection = client.collections.get("WeaviateBlogChunk")
with collection.batch.fixed_size(batch_size=1) as batch:
    for idx, blog_chunk in enumerate(blog_chunks):
        batch.add_object(
            properties={"content": blog_chunk},
        )

This code is very similar to the Weaviate tutorial script import_blogs.py in the weaviate-tutorials/Hurricane repository on GitHub. What’s also strange is that if I modify some of the module configs, for example by removing the generative-openai config, the number of successfully imported text chunks changes, and I don’t know why.

Overall I’m pretty confused by this behavior: everything seems to run fine, there are no errors or warnings, and yet only a very small number of text chunks are imported.

Here’s an example of a piece of text that doesn’t get imported, although I don’t think the issue has anything to do with the text itself; it occurs regardless of the text:

The Hurricane front-end user story is illustrated below:

<figure>
  <video width="100%" autoplay loop muted controls>
    <source src={demo} type="video/mp4" />
    Your browser does not support the video tag. </video>
  <figcaption>A walkthrough of the Hurricane user story</figcaption>
</figure>

- A user enters the question they want to write a blog post about. - Hurricane acknowledges the request and streams its progress while writing. - Hurricane returns a blog post to the user and updates the counter of AI-generated blog posts in Weaviate. As illuminated by [Arize Phoenix](https://docs.arize.com/phoenix), running Hurricane with GPT-3.5-Turbo takes about 45 seconds to convert a question into a blog post.

Server Setup Information

Weaviate Server Version: 1.25.6
Deployment Method: local docker container
Multi Node? Number of Running Nodes: only one node
Client Language and Version: Python, 4.6.5
Multitenancy?: not specified in collection schema

Any additional Information

DudaNogueira · July 5, 2024, 6:46pm

hi @a-arteria !!

Welcome to our community

Can you check if there is any error message as a result of this batch?

Here we have a doc on how to properly handle the error messages:
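As a rough sketch of that check: this assumes the v4 Python client, where batch errors do not raise but are collected in `collection.batch.failed_objects` once the `with` block exits, and where each entry carries a `message`. The helper names `summarize_failures` and `import_and_report` are made up for illustration.

```python
def summarize_failures(failed_objects, limit=5):
    """Return one readable line per failed object, up to `limit` lines."""
    lines = []
    for failed in failed_objects[:limit]:
        # Assumption: each entry has a `message` attribute; fall back to
        # repr() if the client version exposes something different.
        message = getattr(failed, "message", None)
        lines.append(str(message) if message is not None else repr(failed))
    return lines


def import_and_report(collection, chunks):
    """Batch-import `chunks` and return a summary of any failed objects."""
    with collection.batch.fixed_size(batch_size=1) as batch:
        for chunk in chunks:
            batch.add_object(properties={"content": chunk})
    # Errors are collected rather than raised; read them after the batch ends.
    failed = collection.batch.failed_objects
    print(f"{len(failed)} of {len(chunks)} objects failed to import")
    return summarize_failures(failed)
```

Calling `import_and_report(collection, blog_chunks)` with the collection from your script should print how many objects failed, and the returned messages will likely point at the cause (for example, a vectorizer configuration or API key problem).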

This has worked for me. I have copied the same schema from the code you linked:

import weaviate

client = weaviate.connect_to_local()

schema = {
    "class": "WeaviateBlogChunk",
    "description": "A snippet from a Weaviate blogpost.",
    "moduleConfig": {
        "text2vec-openai": {
                   "skip": False,
                   "vectorizeClassName": False,
                   "vectorizePropertyName": False
        },
        "generative-openai": {
            "model": "gpt-3.5-turbo"
        }
    },
    "vectorIndexType": "hnsw",
    "vectorizer": "text2vec-openai",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "description": "The text content of the podcast clip",
            "moduleConfig": {
                       "text2vec-transformers": {
                           "skip": False,
                           "vectorizePropertyName": False,
                           "vectorizeClassName": False
                       }
            }
        },
        {
            "name": "author",
            "dataType": ["text"],
            "description": "The author of the blog post.",
            "moduleConfig": {
                "text2vec-openai": {
                           "skip": True,
                           "vectorizePropertyName": False,
                           "vectorizeClassName": False
                }
            }
        }
    ]
}

client.collections.create_from_dict(schema)

blog_chunks = [
    '''
The Hurricane front-end user story is illustrated below:

<figure>
  <video width="100%" autoplay loop muted controls>
    <source src={demo} type="video/mp4" />
    Your browser does not support the video tag. </video>
  <figcaption>A walkthrough of the Hurricane user story</figcaption>
</figure>

- A user enters the question they want to write a blog post about. - Hurricane acknowledges the request and streams its progress while writing. - Hurricane returns a blog post to the user and updates the counter of AI-generated blog posts in Weaviate. As illuminated by [Arize Phoenix](https://docs.arize.com/phoenix), running Hurricane with GPT-3.5-Turbo takes about 45 seconds to convert a question into a blog post.
    ''',
    "another blog chunk"
]

collection = client.collections.get("WeaviateBlogChunk")
with collection.batch.fixed_size(batch_size=1) as batch:
    for idx, blog_chunk in enumerate(blog_chunks):
        batch.add_object(
            properties={"content": blog_chunk},
        )

print(len(collection.query.fetch_objects().objects))
# outputs: 2
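
One caveat on the count above: `fetch_objects()` returns a limited page of results unless `limit` is set, so with ~30 chunks `len(...)` may under-report. A small sketch of an exact count, assuming the v4 client's `aggregate.over_all(total_count=True)` call (the helper name `count_objects` is just for illustration):

```python
def count_objects(collection):
    """Return the total number of objects in a collection via aggregation."""
    # The aggregation counts every object, independent of query page limits.
    result = collection.aggregate.over_all(total_count=True)
    return result.total_count
```

With the collection from the snippet above, `print(count_objects(collection))` should report the full number of stored objects.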

Let me know if this works!
