Handling Varying Vector Sizes in Weaviate Indexing: Ensuring Consistency and Error Prevention

I’m creating an index in Weaviate named “sample_index” and populating it with the following content and vectors:

content1 = [
    {
        "title": "title1",
        "article_id": "id1"
    },
    {
        "title": "title2",
        "article_id": "id2"
    }
]

vector1 = {
    "id1": [0.1, 0.2],
    "id2": [0.3, 0.4]
}

Now, when attempting to push another set of data into the same “sample_index” class, I encounter an error due to the varying vector sizes:

content2 = [
    {
        "title": "title3",
        "article_id": "id3"
    },
    {
        "title": "title4",
        "article_id": "id4"
    }
]

vector2 = {
    "id3": [0.1, 0.2, 0.3, 0.4],
    "id4": [0.5, 0.6, 0.7, 0.8]
}

The error message states:

{'error': [{'message': "insert to vector index: insert doc id 3 to vector index: find best entrypoint: calculate distance between insert node and entry point at level 1: vector lengths don't match: 2 vs 4"}]}
{'error': [{'message': "insert to vector index: insert doc id 4 to vector index: find best entrypoint: calculate distance between insert node and entry point at level 1: vector lengths don't match: 2 vs 4"}]}

Despite the error, the new data appears to have been indexed in the “sample_index” class, as I can see when extracting all "article_id"s from the index.

To avoid this scenario, the vector size (or schema) should be validated before indexing. Adding a validation step prior to indexing ensures that all vectors match the expected dimension, so such errors can be prevented by enforcing consistent vector dimensions across the index.
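As a minimal sketch of such a client-side check (the helper name `validate_vector_dimensions` is my own, not part of the Weaviate client), one could verify every vector's length before pushing a batch:

```python
def validate_vector_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for article_id, vector in vectors.items():
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector for {article_id!r} has length {len(vector)}, "
                f"expected {expected_dim}"
            )

vector1 = {"id1": [0.1, 0.2], "id2": [0.3, 0.4]}
vector2 = {"id3": [0.1, 0.2, 0.3, 0.4], "id4": [0.5, 0.6, 0.7, 0.8]}

validate_vector_dimensions(vector1, 2)   # passes silently

try:
    validate_vector_dimensions(vector2, 2)
except ValueError as err:
    print(err)   # vector for 'id3' has length 4, expected 2
```

Running this check before the batch import would have rejected vector2 up front instead of surfacing the mismatch as an index-level error.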

Does anyone have suggestions on how to effectively manage such discrepancies in vector sizes within Weaviate indexing? Any insights or best practices would be greatly appreciated. Thank you.

Hi! It does check for dimensional consistency.

Check this code, for instance:

import weaviate

client = weaviate.connect_to_local()
client.collections.delete("MyCollection")
collection = client.collections.create("MyCollection")

# the first insert sets the expected dimensionality (5)
collection.data.insert({"name": "Duda"}, vector=[1, 2, 3, 4, 5])
# this second insert has 10 dimensions and raises an error
collection.data.insert({"name": "Bob"}, vector=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

It will raise this error:

UnexpectedStatusCodeError: Object was not added! 
Unexpected status code: 500, with response body: 
{'error': [{'message': 'put object: import into index mycollection: put local object: shard="lKNQZjKZzAke": Validate vector index for 2050d246-e87a-4147-a58c-56073f4e607e: new node has a vector with length 10. Existing nodes have vectors with length 5'}]}.

Interesting part: new node has a vector with length 10. Existing nodes have vectors with length 5

Can you share the code on how you are providing the vectors yourself? Maybe you are never passing the vector, so it’s not raising this same error.

Thanks!

@DudaNogueira Thanks for your reply. I don’t use insert; I add the content in batches using the following code:

import weaviate

client = weaviate.Client("http://localhost:8080")

weaviate_class = "sample_index"

class_definition = {
    "class": weaviate_class
}

client.schema.create_class(class_definition)

# This is where I read content1 first and content2 mentioned above,
# and also read the vectors for the corresponding ids.
batch_size = 50
with client.batch as batch:
    batch.batch_size = batch_size
    for item in content:
        batch.add_data_object(
            data_object=item,
            vector=vectors[item["article_id"]],
            class_name=weaviate_class,
        )

@DudaNogueira So while collection.data.insert handles this varying vector size condition, client.batch.add_data_object apparently doesn’t.

Follow-up question: If I want to implement vector size validation on my side before indexing new data into an existing class, is there any way to extract the vector size of already indexed data from Weaviate? I have looked into the schema for the current class using:

schema = client.schema.get(weaviate_class)

but I am unable to find the vector size information in that object.

Hi @Sandip

Indeed, it is not doing the dimensions check on batch imports, only in insert and insert_many. I was able to reproduce this; a GitHub issue should follow soon.

For doing the validation yourself at the client level, you can fetch one object from that collection, asking to include its vector, then count its dimensions.

Let me know if that helps.

Thanks @DudaNogueira. For now I have implemented it the way you suggested, i.e. fetching the vector size from the first record in the Weaviate class ‘sample_index’ using the following code:

vector_length = None
data_object = client.data_object.get(class_name="sample_index", with_vector=True)
if data_object["objects"]:
    vector_length = len(data_object["objects"][0]["vector"])

I use this vector_length (which is 2 in my example) to validate future indexing operations.
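For anyone following along, the validation step before batching can be sketched like this (the helper name `split_by_vector_length` and the skip-rather-than-raise behavior are my own choices, not a Weaviate API):

```python
def split_by_vector_length(content, vectors, expected_length):
    """Split items into those whose vector matches expected_length and
    those that would trigger the dimension-mismatch error."""
    valid, rejected = [], []
    for item in content:
        vector = vectors.get(item["article_id"])
        if vector is not None and len(vector) == expected_length:
            valid.append(item)
        else:
            rejected.append(item)
    return valid, rejected

content2 = [
    {"title": "title3", "article_id": "id3"},
    {"title": "title4", "article_id": "id4"},
]
vector2 = {
    "id3": [0.1, 0.2, 0.3, 0.4],
    "id4": [0.5, 0.6, 0.7, 0.8],
}

# vector_length fetched from the existing class is 2, so both items
# with 4-dimensional vectors are rejected here
valid, rejected = split_by_vector_length(content2, vector2, expected_length=2)
print(len(valid), len(rejected))   # 0 2
```

Only the `valid` items are then passed to `batch.add_data_object`; the `rejected` ones can be logged or re-embedded with the correct model.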