Description
First off, thanks to the Weaviate team for providing this forum and the resources for the product. It’s been a great experience revisiting Weaviate and seeing all the new features.
I’ve implemented a connection to Weaviate for my AI pipeline system. During this integration, I focused on using the batch import functions. I’m bringing my own vectors, and testing is currently being done with instructor-large embeddings.
To get vectors in, I was using this batch code (edited for simplicity):
import weaviate
from uuid import uuid4

def weaviate_batch_insert(weaviate_url, weaviate_token, weaviate_collection_name, text, embeddings):
    client = weaviate.connect_to_wcs(
        cluster_url=weaviate_url,
        auth_credentials=weaviate.auth.AuthApiKey(weaviate_token),
    )
    uuids = []
    errors = []  # never populated in this simplified version
    try:
        # Dynamic batching
        collection = client.collections.get(weaviate_collection_name)
        with collection.batch.dynamic() as batch:
            num_objects = len(text)
            for i in range(num_objects):
                uuid = str(uuid4())
                uuids.append(uuid)
                data_objects = {
                    "text": text[i]
                }
                # Passing the vector as a dict makes it a named vector
                vector = {
                    "text_embedding": embeddings[i]
                }
                # Add to the Weaviate batch
                batch.add_object(properties=data_objects, uuid=uuid, vector=vector)
    except Exception as ex:
        raise Exception(f"Weaviate insert failed: {ex}") from ex
    finally:
        client.close()
    return {
        'uuids': uuids,
        'status': errors
    }
This code seems to work fine for inserts. I don’t get errors, other than a warning that the dynamic batch-size could not be refreshed.
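Since the insert above reports no errors, one thing worth checking is whether individual objects failed silently. In the v4 Python client, objects rejected inside a batch context do not raise; they are collected on `failed_objects`. Below is a minimal sketch that takes an already opened collection and returns those failures alongside the UUIDs. The helper name `batch_insert_and_report` and its parameters are hypothetical, not part of the original code.

```python
from uuid import uuid4


def batch_insert_and_report(collection, texts, embeddings, vector_name="text_embedding"):
    """Insert texts with caller-supplied vectors; return (uuids, failure_messages).

    `collection` is assumed to be a v4 client collection object, e.g. the
    result of client.collections.get(...).
    """
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")

    uuids = []
    with collection.batch.dynamic() as batch:
        for text, embedding in zip(texts, embeddings):
            object_uuid = str(uuid4())
            uuids.append(object_uuid)
            batch.add_object(
                properties={"text": text},
                uuid=object_uuid,
                vector={vector_name: embedding},  # dict form = named vector
            )

    # Per-object failures are not raised as exceptions; read them after the
    # context manager has flushed the batch.
    failures = [str(err.message) for err in collection.batch.failed_objects]
    return uuids, failures
```

An empty `failures` list only means the server accepted the objects; it does not by itself confirm that the named vector was indexed.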
When I went to implement the hybrid search function, I noticed that results were intermittent. With some keywords (query+vector), I would get results. Other keywords would return empty results ([]) with no indication of why. I then tried using similarity_search for doing near_vector and got empty results across the board.
I suspected that it had something to do with named vectors, as other implementations I have done worked fine. Those were on older versions, however.
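A quick way to test that suspicion is to fetch a few stored objects with their vectors and see whether the named vector is actually present. This is a rough sketch, assuming a v4 client collection where `fetch_objects(..., include_vector=True)` returns objects whose `vector` attribute is a dict keyed by vector name. The function name `count_objects_with_vector` is hypothetical.

```python
def count_objects_with_vector(collection, vector_name="text_embedding", sample_size=10):
    """Return (sampled, with_vector) for a small sample of stored objects."""
    response = collection.query.fetch_objects(limit=sample_size, include_vector=True)
    sampled = len(response.objects)
    with_vector = sum(
        1
        for obj in response.objects
        # A missing key or an empty list both count as "no vector stored".
        if isinstance(obj.vector, dict) and obj.vector.get(vector_name)
    )
    return sampled, with_vector
```

If `with_vector` is lower than `sampled`, vector-based queries against that name would have little or nothing to match, which would be consistent with empty near_vector results.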
In the above code I name the vector by passing vector in as a dict. I also use target_vector in the near_vector or hybrid query functions (just showing the hybrid one, as that is the one I plan on using most):
import weaviate.classes as wvc

# Prepare the query parameters
query_params = {
    'limit': limit,
    'offset': offset,
    'alpha': 0.70
}
# Perform the similarity search using hybrid
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    target_vector="text_embedding",
    query_properties=["text"],
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    **query_params
)
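To see why a given hybrid query returned what it did, it can help to flatten the response into plain rows that include the score and its explanation. A small sketch, assuming the response shape of the v4 client (`response.objects`, each with `uuid`, `properties`, and `metadata`); the helper name `summarize_hybrid_response` is hypothetical.

```python
def summarize_hybrid_response(response):
    """Flatten a hybrid query response into a list of plain dicts."""
    rows = []
    for obj in response.objects:
        rows.append(
            {
                "uuid": str(obj.uuid),
                "text": obj.properties.get("text"),
                # Populated only when requested via MetadataQuery.
                "score": obj.metadata.score,
                "explain_score": obj.metadata.explain_score,
            }
        )
    return rows
```

An empty list here means the server returned no objects at all, as opposed to objects with low scores.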
What I discovered, by looking at the test_named_vectors.py integration test in the weaviate-python-client GitHub repository, was that the batch import tests define a collection first. Since I already suspected named vectors were the issue, and had begun to suspect the missing definition in particular, I used that test to write this create statement:
client.collections.create(
    weaviate_collection_name,
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
    ],
    vectorizer_config=[
        wvc.config.Configure.NamedVectors.none(name="text_embedding"),
    ],
)
I will note here that the documentation for batch inserts, for named vectors in particular, lacks these create functions.
By adding that create function to the batch insert, the problems with searching went away and I’m now getting good results.
I’m speculating here, but I think my assumption about auto-schema caused me to skip over the collection definition in my code (coupled with it not being in the docs). Then, when I ran some tests, I got results back, which led me to work on the querying rather than on what appears to be an insertion issue.
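One way to check this speculation on an existing collection is to read back its configuration and see whether the named vector is declared at all. The sketch below assumes the v4 client exposes named vectors on the returned config as a `vector_config` mapping keyed by vector name; that attribute name is an assumption on my part, and the helper name `declares_named_vector` is hypothetical.

```python
def declares_named_vector(collection, vector_name="text_embedding"):
    """Return True if the collection's config lists the given named vector."""
    config = collection.config.get()
    # Assumption: `vector_config` is a dict of named vectors, or None/absent
    # when the collection has no named vectors.
    named_vectors = getattr(config, "vector_config", None) or {}
    return vector_name in named_vectors
```

Running this against a collection created implicitly by the batch insert, and again against one created with the explicit create statement, would show whether the two definitions differ.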
I’m still not 100% certain that this is exactly what was causing the issue, but I can say right now that it’s working as I have it, so I thought I’d throw this in the forum for others, just in case.
I would comment on auto-schema further, but I’m not certain enough this was the exact problem, so I will let the experts say what they will on this! There may well be a reason the batch import example for named vectors does not have a collection create statement.
Server Setup Information
- Weaviate Server Version: 1.24.1 (on cloud)
- Deployment Method: Cloud
- Multi Node? Number of Running Nodes: No and one single node.
- Client Language and Version: Python 3.10.13
Any additional Information
The dynamic batch-size warning:
C:\Users\kord\miniconda3\envs\slothai\lib\site-packages\weaviate\warnings.py:219: UserWarning: Bat003: The dynamic batch-size could not be refreshed successfully with error RemoteProtocolError('Server disconnected without sending a response.')