Description
Hi, we are using Weaviate as a vector database and backend for a chatbot creation and management suite.
When setting up a chatbot, we let the user add text material from various sources, let them inspect that material, and only then decide to actually vectorize / embed it. This means that not all chunks we store in the Weaviate database contain a vector yet; some do not.
This seems to lead to many problems when updating metadata for text chunks, as well as relations between chunks (we use a cross-reference next_chunk to order chunks relative to each other).
As long as no chunks have been vectorized yet, we can modify chunk metadata, delete chunks, etc. without problems. But as soon as some chunks have non-null vectors, we can no longer use the update (PATCH) endpoints to change chunk metadata and always have to use replace (PUT) instead, or else we get vector validation errors saying that some vectors have length 0 and some have a length > 0.
Using replace, we can work around the metadata update errors.
However, we also discovered that deleting cross-references between chunks that do not yet have vectors, while there are already chunks with vectors in the database, also leads to vector validation errors:
ERROR | Delete property reference to object! Unexpected status code: 500, with response body: {'error': [{'message': 'msg:repo.putobject code:500 err:import into index chunk: shard chunk_2PtBzUxc1rZT: Validate vector index for [153 207 213 64 2 143 76 30 183 0 58 51 194 50 244 74]: new node has a vector with length 0. Existing nodes have vectors with length 1536'}]}.
I therefore wanted to ask whether there is any way to turn off this kind of validation or to avoid this problem when deleting cross-references?
And more generally: suppose a collection of text chunks has already been vectorized / embedded with one embedding model, and then we / our customer decide to switch embedding models and re-calculate all vectors for all chunks. If the new vectors have a different length than the old ones, is it even possible to replace the old vectors in place, for example by first deleting all vectors (setting them to null / None) and then adding the new ones?
Thank you very much!
Server Setup Information
- Weaviate Server Version: 1.19.6 (but also tried with newest version)
- Deployment Method: k8s
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: python (weaviate-client 3.26.2)
Any additional Information
Hi @jan.strunk !
Sorry for the delay here.
I am pretty sure this is not possible as of now.
Considering your use case, this would be a really cool feature.
I am not sure how to bypass this other than inserting those objects with a fake vector and marking them somehow. Ugly, I know, and it would use unnecessary resources.
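A minimal sketch of that fake-vector idea with the Python v3 client (the Chunk class, the is_vectorized flag, and the 1536 dimensions are assumptions for illustration, not your actual schema):

```python
DIMENSIONS = 1536  # must match the dimensionality of the real embedding model


def make_placeholder_vector(dimensions: int) -> list:
    # A dummy vector of the correct length so index validation passes.
    # Near-zero but not all-zero: an all-zero vector has an undefined
    # cosine distance to any query vector.
    return [1e-6] * dimensions


if __name__ == "__main__":
    import weaviate  # weaviate-client 3.x

    client = weaviate.Client("http://localhost:8080")

    # Insert a not-yet-embedded chunk; the is_vectorized flag marks it
    # so the real vector can be swapped in later via replace (PUT).
    client.data_object.create(
        data_object={"text": "not embedded yet", "is_vectorized": False},
        class_name="Chunk",
        vector=make_placeholder_vector(DIMENSIONS),
    )
```

Queries would then have to filter on is_vectorized to keep the placeholder objects out of search results.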
The other solution would require a second collection that, once a chunk is approved, you copy over to the vectorized collection, which in turn requires two queries.
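Roughly, that two-collection approach could look like this (the collection names ChunkStaging and Chunk, and the underscore convention for staging-only fields, are made up for illustration):

```python
def staged_to_chunk(staged_properties: dict) -> dict:
    # Drop staging-only bookkeeping fields (here: anything starting with
    # "_") before copying the object into the vectorized collection.
    return {k: v for k, v in staged_properties.items() if not k.startswith("_")}


if __name__ == "__main__":
    import weaviate  # weaviate-client 3.x

    client = weaviate.Client("http://localhost:8080")

    # Fetch chunks from the staging collection (which has no vectorizer).
    staged = client.data_object.get(class_name="ChunkStaging", with_vector=False)
    for obj in staged.get("objects", []):
        # Insert into the vectorized collection; its vectorizer
        # computes the embedding on insert.
        client.data_object.create(
            data_object=staged_to_chunk(obj["properties"]),
            class_name="Chunk",
        )
        client.data_object.delete(uuid=obj["id"], class_name="ChunkStaging")
```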
I will try to build a notebook to illustrate this use case.
It would be interesting to create a feature request on this.
Thanks!
Hi @DudaNogueira,
thank you very much for your feedback!
Both fake vectors and a second collection would be good possibilities. I’ll see which one I choose for now.
One additional possibility I found was to not store the “next_chunk” information as a cross-reference, but to simply store the UUID as a string and then use the replace (PUT) endpoint to change the stored “next_chunk” property of chunks.
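For reference, a rough sketch of that string-UUID workaround with the v3 client (the Chunk class and property names are assumptions; since PUT replaces the whole object, all properties have to be resent):

```python
def with_next_chunk(properties: dict, next_chunk_uuid) -> dict:
    # replace (PUT) overwrites the entire object, so start from the
    # current properties and only change the next_chunk string.
    updated = dict(properties)
    updated["next_chunk"] = next_chunk_uuid or ""  # empty string clears the link
    return updated


if __name__ == "__main__":
    import weaviate  # weaviate-client 3.x

    client = weaviate.Client("http://localhost:8080")

    chunk_uuid = "..."  # uuid of the chunk to re-link (placeholder)
    next_uuid = "..."   # uuid of its successor, stored as a plain string

    current = client.data_object.get_by_id(chunk_uuid, class_name="Chunk")
    client.data_object.replace(
        data_object=with_next_chunk(current["properties"], next_uuid),
        class_name="Chunk",
        uuid=chunk_uuid,
    )
```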
I will certainly try to formulate a feature request. Thanks for the additional ideas!
Best regards,
Jan
Is there any update here? I am inserting JSON objects with text fields that can vary wildly in length, and I am getting errors when the vectors are not the same size. My challenge is that these JSON objects change all the time, I do not know which one is the largest, and I need to be able to add them without running into different vector size issues.
Thank you!
hi @Ken_Tola !
This is about the number of dimensions of the vector, not the object itself.
Whenever you vectorize a text, the embedding model should always return the same number of dimensions.
Check this code, for example. We are requesting two vectors for two different texts, and they will always have 1536 dimensions (the default for text-embedding-ada-002):
import os
import requests

headers = {
    'Authorization': 'Bearer ' + os.getenv('OPENAI_API_KEY', ''),
    'Content-Type': 'application/json',
}

json_data = {
    'input': 'text here',
    'model': 'text-embedding-ada-002',
    'encoding_format': 'float',
}

# embedding1
json_data["input"] = "pet animals"
response1 = requests.post('https://api.openai.com/v1/embeddings', headers=headers, json=json_data)
embedding1 = response1.json().get("data")[0].get("embedding")
print(len(embedding1))

# embedding2
json_data["input"] = "something about dogs"
response2 = requests.post('https://api.openai.com/v1/embeddings', headers=headers, json=json_data)
embedding2 = response2.json().get("data")[0].get("embedding")
print(len(embedding2))
Outputs:
1536
1536