Handling Varying Vector Sizes in Weaviate Indexing: Ensuring Consistency and Error Prevention

I’m creating an index in Weaviate named “sample_index” and populating it with the following content and vectors:

content1 = [
    {
        "title": "title1",
        "article_id": "id1"
    },
    {
        "title": "title2",
        "article_id": "id2"
    }
]

vector1 = {
    "id1": [0.1, 0.2],
    "id2": [0.3, 0.4]
}

Now, when attempting to push another set of data into the same “sample_index” class, I encounter an error due to the varying vector sizes:

content2 = [
    {
        "title": "title3",
        "article_id": "id3"
    },
    {
        "title": "title4",
        "article_id": "id4"
    }
]

vector2 = {
    "id3": [0.1, 0.2, 0.3, 0.4],
    "id4": [0.5, 0.6, 0.7, 0.8]
}

The error message states:

{'error': [{'message': "insert to vector index: insert doc id 3 to vector index: find best entrypoint: calculate distance between insert node and entry point at level 1: vector lengths don't match: 2 vs 4"}]}
{'error': [{'message': "insert to vector index: insert doc id 4 to vector index: find best entrypoint: calculate distance between insert node and entry point at level 1: vector lengths don't match: 2 vs 4"}]}

Despite the error, the new data appears to have been indexed in the “sample_index” class, as I can see when extracting all "article_id"s from the index.

To avoid this scenario, the vector size (or schema) should be validated before indexing. Adding a validation step prior to indexing ensures that all vectors match the expected dimension, so such errors can be prevented by enforcing consistent vector dimensions across the index.
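As a minimal sketch of such a client-side check (the helper name `validate_vector_dimensions` is my own, not part of the Weaviate client), one could verify every vector's length before pushing a batch:

```python
def validate_vector_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for article_id, vector in vectors.items():
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector for {article_id!r} has length {len(vector)}, "
                f"expected {expected_dim}"
            )

vector1 = {"id1": [0.1, 0.2], "id2": [0.3, 0.4]}
vector2 = {"id3": [0.1, 0.2, 0.3, 0.4], "id4": [0.5, 0.6, 0.7, 0.8]}

validate_vector_dimensions(vector1, 2)   # passes silently

try:
    validate_vector_dimensions(vector2, 2)
except ValueError as err:
    print(err)   # vector for 'id3' has length 4, expected 2
```

Running this check before the batch import would have rejected vector2 up front instead of surfacing the mismatch as an index-level error.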

Does anyone have suggestions on how to effectively manage such discrepancies in vector sizes within Weaviate indexing? Any insights or best practices would be greatly appreciated. Thank you.

Hi! It does check for dimensional consistency.

Check this code, for instance:

import weaviate

client = weaviate.connect_to_local()
client.collections.delete("MyCollection")
collection = client.collections.create("MyCollection")

# the first insert sets the expected dimensionality (5)
collection.data.insert({"name": "Duda"}, vector=[1, 2, 3, 4, 5])
# this second insert has 10 dimensions and raises an error
collection.data.insert({"name": "Bob"}, vector=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

It will raise this error:

UnexpectedStatusCodeError: Object was not added! 
Unexpected status code: 500, with response body: 
{'error': [{'message': 'put object: import into index mycollection: put local object: shard="lKNQZjKZzAke": Validate vector index for 2050d246-e87a-4147-a58c-56073f4e607e: new node has a vector with length 10. Existing nodes have vectors with length 5'}]}.

Interesting part: new node has a vector with length 10. Existing nodes have vectors with length 5

Can you share the code on how you are providing the vectors yourself? Maybe you are never passing the vector, so it’s not raising this same error.

Thanks!

@DudaNogueira Thanks for your reply. I don’t use insert; I add the content in batches using the following code:

import weaviate

client = weaviate.Client("http://localhost:8080")

weaviate_class = "sample_index"

class_definition = {
    "class": weaviate_class
}

client.schema.create_class(class_definition)

# This is where I read content1 first and content2 mentioned above,
# and also read the vectors for the corresponding ids.
batch_size = 50
with client.batch as batch:
    batch.batch_size = batch_size
    for item in content:
        batch.add_data_object(
            data_object=item,
            vector=vectors[item["article_id"]],
            class_name=weaviate_class,
        )

@DudaNogueira So while collection.data.insert handles this varying vector size condition, client.batch.add_data_object apparently doesn’t.

Follow-up question: If I want to implement vector size validation on my side before indexing new data into an existing class, is there any way to extract the vector size of already indexed data from Weaviate? I have looked into the schema for the current class using:

schema = client.schema.get(weaviate_class)

but I am unable to find the vector size information in that object.

Hi @Sandip

Indeed, it is not doing the dimensions check on batch imports, only in insert and insert_many. I was able to reproduce this; a GitHub issue should follow soon.

For doing the validation yourself at the client level, you can fetch one object from that collection, asking to include its vector, then count its dimensions.

Let me know if that helps.

Thanks @DudaNogueira. For now I have implemented it the way you suggested, i.e. fetching the vector size from the first record in the Weaviate class ‘sample_index’ using the following code:

vector_length = None
data_object = client.data_object.get(class_name="sample_index", with_vector=True)
if data_object["objects"]:
    vector_length = len(data_object["objects"][0]["vector"])

I use this vector_length (which is 2 in my example) to validate future indexing operations.
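For anyone following along, the validation step before batching can be sketched like this (the helper name `split_by_vector_length` and the skip-rather-than-raise behavior are my own choices, not a Weaviate API):

```python
def split_by_vector_length(content, vectors, expected_length):
    """Split items into those whose vector matches expected_length and
    those that would trigger the dimension-mismatch error."""
    valid, rejected = [], []
    for item in content:
        vector = vectors.get(item["article_id"])
        if vector is not None and len(vector) == expected_length:
            valid.append(item)
        else:
            rejected.append(item)
    return valid, rejected

content2 = [
    {"title": "title3", "article_id": "id3"},
    {"title": "title4", "article_id": "id4"},
]
vector2 = {
    "id3": [0.1, 0.2, 0.3, 0.4],
    "id4": [0.5, 0.6, 0.7, 0.8],
}

# vector_length fetched from the existing class is 2, so both items
# with 4-dimensional vectors are rejected here
valid, rejected = split_by_vector_length(content2, vector2, expected_length=2)
print(len(valid), len(rejected))   # 0 2
```

Only the `valid` items are then passed to `batch.add_data_object`; the `rejected` ones can be logged or re-embedded with the correct model.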