Description
I have an API connected to a Weaviate index. The API runs HuggingFace Transformers to convert my search terms into query vectors - I’m not using Weaviate vectorisation.
My API has three search modes: vector, keyword (BM25), and hybrid. I convert the user’s search term into a vector with the transformer model and then send it to Weaviate (only vector and hybrid search take the vector).
In each of these three modes, I retrieve only 25 results to display to the user. The user can see the remaining results by going to page 2, page 3, etc.
The problem is that we have no way to know how many pages there are in the result (except for BM25). The only way I have found is to use aggregate queries to get the size of the total result set. I tried using the aggregate queries with filters, but the count never matches the number of results I get when I run the query with a high limit, like 100,000.
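To make the requirement concrete, the page arithmetic that depends on the total count can be sketched as follows. This is only an illustration: `PAGE_SIZE`, `page_to_offset` and `total_pages` are hypothetical names, not part of the API described here, and the total is assumed to be known already, which is exactly the number that is hard to obtain for vector and hybrid search.

```python
import math

PAGE_SIZE = 25  # results shown per page, as described above


def page_to_offset(page: int, page_size: int = PAGE_SIZE) -> int:
    """Return the zero-based offset for a one-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


def total_pages(total_results: int, page_size: int = PAGE_SIZE) -> int:
    """Return how many pages are needed to show total_results items."""
    return math.ceil(total_results / page_size)


# With the 423 results reported for "anxiety", the UI would need 17 pages.
print(total_pages(423))   # 17
print(page_to_offset(3))  # 50
```

The offset can always be computed from the page number alone; only the page count needs the size of the full result set.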
Question 1: if I want to know the size of a bm25 result set, should I use an aggregate with Filter.by_property like I am doing? It feels like I am reverse engineering/trying to re-implement my search in an entirely different way
Question 2: if I want to know the size of a vector or hybrid result set, how can I get the same number that I would get if I request all results? I cannot see any way to get this number other than requesting the entire result set
For example, the hybrid and vector queries for my keyword “anxiety” each return 423 results when I set distance=0.5. I cannot find any way to get the number 423 apart from running this query with offset=0 and limit=10000. The function aggregate.near_vector with the same vector and filter fields and distance=0.5 yields a count of 260. I could not get the aggregate hybrid search to work at all. Perhaps it works when Weaviate is deployed together with its own vectorisation instead of me doing the vectorisation on my end, but my deployment is just bare Weaviate without a transformer model so I have not verified this.
In short, I have found no way to discover the size of a vector or hybrid result set other than running the query and requesting the entire result set from Weaviate. Nothing in the documentation that looked like it should do this actually did. I could run the Filter aggregation (an attempt to reproduce the BM25 query, although I am not sure it does the same thing under the hood and I could not see an alternative), and I could run the vector aggregation.
For vector aggregation, no matter what parameters I input, I got numbers which were lower than the size of the actual corresponding result set for that search mode. For the hybrid search it was even worse as I could not get the hybrid aggregation to work at all. I got the error “vector index: missing target vector”, although I am supplying that argument.
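The one workaround that has produced the right number, requesting everything and counting, can at least be done in bounded batches instead of a single very large request. The sketch below is an illustration under assumptions: `fetch_page` is a hypothetical callable wrapping whichever query is being counted (it takes `offset` and `limit` and returns a list of objects), and `max_results` is a hypothetical safety cap. It also assumes the query returns results in a stable order across calls, which has not been verified for vector or hybrid search.

```python
from typing import Callable, Sequence


def count_by_paging(
    fetch_page: Callable[[int, int], Sequence[object]],
    batch_size: int = 1000,
    max_results: int = 10000,
) -> int:
    """Count results by fetching batches until a short batch is returned.

    fetch_page(offset, limit) is a hypothetical wrapper around the real query.
    Counting stops at max_results so a very broad query cannot run unbounded.
    """
    total = 0
    while total < max_results:
        limit = min(batch_size, max_results - total)
        batch = fetch_page(total, limit)
        total += len(batch)
        if len(batch) < limit:
            break  # fewer than requested means the result set is exhausted
    return total


# Stand-in for a real query: a fixed list of 423 fake results.
fake_results = list(range(423))
print(count_by_paging(lambda offset, limit: fake_results[offset:offset + limit], batch_size=100))  # 423
```

This still transfers every matching object, so it does not solve the underlying problem; it only keeps each individual request small.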
TRYING BM25 - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF FILTER AGGREGATION CAN GET THE SAME RESULTS
- Number of results from BM25 query: 85 (collection.query.bm25(query=query[0], limit=10000, query_properties=["name", "all_text"]))
- Result of Filter aggregation (should be the same as number of results from original BM25 query): 85 (collection.aggregate.over_all(filters=Filter.any_of([Filter.by_property("all_text").contains_any(query), Filter.by_property("name").contains_any(query)])))
This appears to be the correct number. However I am a bit worried that it won’t always be the case, since I am assembling this number in an entirely different way - how can I tell that the filter in the aggregation will behave the same as bm25? In fact I would expect them to be different since BM25 works on n-grams and is tolerant of spelling mistakes and the filter might not be. I have not verified this either way.
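One way to check whether the filter selects the same objects as BM25, rather than just the same number of them, is to compare the two result sets by object ID. The sketch below assumes each result exposes a `uuid` attribute, as the objects returned by the Python client's query methods do. `compare_result_ids` is a hypothetical helper, and the stand-in `Obj` class exists only so the example runs without a server.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Obj:
    """Stand-in for a returned object; only the uuid attribute is used."""
    uuid: str


def compare_result_ids(
    bm25_objects: Iterable[Obj], filter_objects: Iterable[Obj]
) -> Tuple[Set[str], Set[str]]:
    """Return (ids only in the BM25 results, ids only in the filter results)."""
    bm25_ids = {str(o.uuid) for o in bm25_objects}
    filter_ids = {str(o.uuid) for o in filter_objects}
    return bm25_ids - filter_ids, filter_ids - bm25_ids


only_bm25, only_filter = compare_result_ids(
    [Obj("a"), Obj("b"), Obj("c")],
    [Obj("b"), Obj("c"), Obj("d")],
)
print(only_bm25)    # {'a'}
print(only_filter)  # {'d'}
```

For a real comparison the filter side would need a query that returns objects (for example `fetch_objects` with the same filter) instead of the aggregate call, and both queries would need a limit high enough to return every match. Two empty sets would mean the filter and BM25 matched exactly the same objects for that search term.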
TRYING VECTOR - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF VECTOR AGGREGATION CAN GET THE SAME RESULTS
- Number of results from vector query: 423
- Result of vector aggregation (should be the same as number of results from original vector query): 260
TRYING HYBRID - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF HYBRID AGGREGATION CAN GET THE SAME RESULTS
- Number of results from hybrid query: 423
- [I cannot get the hybrid aggregation function to work with any parameters - it just gives errors (see stack trace below)]
Server Setup Information
- Weaviate Server Version: cr.weaviate.io/semitechnologies/weaviate:1.29.0
- Deployment Method: docker
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: python 4.11.0
- Multitenancy?: no
Any additional Information
I have included my complete program, minus passwords, below. I have included the stack trace of my hybrid aggregation as well. Outputs are included in the script below as comments after #.
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import Filter
from weaviate.connect import ConnectionParams

COLLECTION_NAME = "COLLECTION_NAME"
APIKEY = "APIKEY"  # placeholder; the real key has been removed

# Connect to the remote Weaviate instance over HTTPS and secure gRPC
client = weaviate.WeaviateClient(
    connection_params=ConnectionParams.from_params(
        http_host="HOST",
        http_port=443,
        http_secure=True,
        grpc_host="grpc.HOST",
        grpc_port=50051,
        grpc_secure=True,
    ),
    auth_client_secret=Auth.api_key(APIKEY),
    skip_init_checks=False
)
client.connect()
print(client.is_ready())

harmony_index = client.collections.get(COLLECTION_NAME)
# Total number of objects in the collection, independent of any search
harmony_count = harmony_index.aggregate.over_all().total_count

# This is the search term and the vector of the search term
query = ["anxiety"]
query_vector = [[3.35463315e-01, 1.51534006e-01, -5.64120829e-01,
6.77912474e-01, 6.65759966e-02, -9.33390558e-02,
6.84192479e-01, 3.64687294e-01, -2.82174069e-02,
-1.72709897e-01, -1.24643125e-01, -7.15998327e-03,
2.47695729e-01, -3.59601736e-01, 2.22267076e-01,
3.26149940e-01, 5.84262311e-02, -5.45905173e-01,
-1.03112139e-01, 2.28038147e-01, -1.02381267e-01,
-4.71212059e-01, 3.69310945e-01, -3.54012638e-01,
-3.22700113e-01, 2.76131988e-01, 3.73639278e-02,
-1.86167836e-01, -1.00431032e-01, -8.66742358e-02,
3.16369802e-01, -6.70351207e-01, 2.54377663e-01,
-4.57104146e-02, 4.37935501e-01, 1.20846249e-01,
-1.73799232e-01, 1.06407702e-01, 3.93517256e-01,
6.50031045e-02, 1.12434894e-01, 1.95664406e-01,
-2.26766005e-01, 1.35269193e-02, 1.97353542e-01,
6.61122575e-02, 6.88953474e-02, 1.33001581e-02,
2.70984143e-01, 1.45878449e-01, -5.78154763e-03,
-3.71801071e-02, 1.28119335e-01, -4.98251289e-01,
-1.40645728e-01, 2.06611276e-01, -3.05665553e-01,
4.08114076e-01, -2.22795025e-01, 1.79221079e-01,
-1.72730044e-01, 1.81372121e-01, -4.06667352e-01,
4.18124676e-01, 1.18704803e-01, -2.25808918e-01,
-4.81796235e-01, -2.71317065e-01, 2.07416728e-01,
-2.88918447e-02, 4.86879677e-01, 2.99828529e-01,
1.59070536e-01, 2.79036492e-01, 7.94327259e-01,
-2.98845440e-01, 1.18322439e-01, 9.32147503e-02,
-7.91112855e-02, 2.15424255e-01, 5.55794299e-01,
5.85238636e-01, 3.98371607e-01, -3.43910068e-01,
1.40659079e-01, -2.13161573e-01, 1.81872055e-01,
1.79136679e-01, -1.64859459e-01, -1.60634145e-02,
3.05600613e-01, -6.29799068e-02, -8.68289918e-03,
-3.23678628e-02, 5.36237299e-01, 2.47043207e-01,
-3.54895979e-01, -2.95386344e-01, -6.76021099e-01,
2.05918527e+00, 8.84405151e-02, 1.51585415e-01,
-1.44557595e-01, 5.76418638e-01, -2.93688744e-01,
3.82334799e-01, -1.09732658e-01, -5.13578117e-01,
-3.35707664e-01, -3.60577740e-02, -4.99324650e-01,
-3.37365627e-01, 7.23077580e-02, -2.92712152e-01,
1.69516385e-01, -5.71813107e-01, 7.82266632e-02,
1.92815512e-02, 4.70578641e-01, 3.97872686e-01,
-1.14258938e-01, -3.79356623e-01, 3.37463945e-01,
-2.72577286e-01, 1.92604437e-01, -4.72456068e-01,
-1.11911498e-01, -1.22308590e-01, -4.67121303e-02,
1.54250726e-01, -1.78190425e-01, 4.95246530e-01,
-1.21425010e-01, -4.81173359e-02, -1.71974421e-01,
-1.93188950e-01, 4.11586851e-01, 2.13519856e-01,
-6.98918551e-02, -1.64113548e-02, -1.66204765e-01,
5.80064058e-02, 4.60767269e-01, -2.06856892e-01,
-3.07465702e-01, 9.16019306e-02, -1.42542109e-01,
1.41571788e-02, 2.34147549e-01, -6.63373843e-02,
-1.45412341e-01, 5.88088147e-02, 5.98725118e-02,
5.80832899e-01, 1.10482417e-01, -4.00147140e-02,
-5.17827451e-01, 2.83491373e-01, -6.09525084e-01,
2.85991211e-03, -2.26733282e-01, -1.80871442e-01,
-1.34952947e-01, 2.15342566e-01, -3.50693278e-02,
-3.21026266e-01, -5.27675807e-01, -1.06256664e-01,
2.25965619e-01, -3.39970998e-02, 6.01748489e-02,
-1.82754114e-01, -1.67253256e-01, 2.42962301e-01,
-2.47286516e-03, -2.72219539e-01, -2.88920581e-01,
-2.69849330e-01, -3.10567617e-01, -1.86400875e-01,
4.76120114e-01, -3.54339242e-01, -9.87442508e-02,
1.93104431e-01, 4.08082902e-02, -5.26979230e-02,
-8.46949741e-02, -6.33822232e-02, -9.45824087e-02,
-5.73025882e-01, -3.20404060e-02, -2.63877124e-01,
5.47255278e-02, 3.21177870e-01, -3.75654370e-01,
2.93490261e-01, -1.98497340e-01, 7.39022717e-02,
-3.57463270e-01, 4.37498003e-01, 2.27000758e-01,
-4.13726419e-01, 2.68085927e-01, 7.86525756e-02,
2.10468993e-01, -3.84509176e-01, -1.36251524e-01,
-1.09019764e-01, 6.01922512e-01, -3.22339207e-01,
3.79671961e-01, 4.81130034e-01, 4.77516413e-01,
-1.24746181e-01, -3.12690362e-02, -1.51122734e-01,
-2.85211772e-01, 2.57927299e-01, -4.35654335e-02,
3.41354579e-01, 3.38322520e-01, 2.46698424e-01,
-4.61305946e-01, -2.62644142e-01, 1.66483477e-01,
-4.70261246e-01, -1.16948433e-01, -4.56474632e-01,
-2.25364730e-01, 4.70466167e-01, -1.05019033e-01,
8.42545554e-02, 1.75342754e-01, 2.95024484e-01,
2.46478513e-01, 1.48589671e-01, 1.11533143e-01,
1.23442240e-01, -4.58077103e-01, -2.12503076e-01,
-6.49597943e-01, 7.34895840e-02, 1.60925761e-01,
4.05711792e-02, -3.46383661e-01, -3.52607250e-01,
1.33997217e-01, -8.34354758e-02, 1.37121633e-01,
1.15843736e-01, -3.00232116e-02, -5.18483341e-01,
4.86333460e-01, -1.37704790e-01, -4.72671628e-01,
4.74346548e-01, 8.64914358e-02, 9.49817002e-02,
1.52383149e-01, 5.01141727e-01, -1.95603967e-01,
-3.83717567e-01, -2.10875809e-01, 5.01872338e-02,
-2.80028824e-02, 7.68501699e-01, 1.22991778e-01,
1.95511863e-01, 4.09941465e-01, 6.97474420e-01,
3.94268751e-01, 6.15394674e-02, 9.33347344e-02,
3.45813543e-01, 2.50950933e-01, 3.77290249e-01,
-5.23763359e-01, 4.43115622e-01, 7.18046948e-02,
-4.36774850e-01, -2.27288529e-01, -4.41563755e-01,
-2.82906681e-01, 4.00768183e-02, -1.43475056e-01,
-4.46148068e-01, -4.15665954e-01, 1.14555739e-01,
8.07124153e-02, 4.43762839e-02, -2.42462948e-01,
1.43958583e-01, 9.72241834e-02, 3.11084121e-01,
1.92619618e-02, -3.49334240e-01, 2.54832387e-01,
2.42490962e-01, 2.62845784e-01, 4.95936066e-01,
2.90113181e-01, -5.81976712e-01, 1.18311614e-01,
1.88573822e-01, -1.30133718e-01, -2.37258554e-01,
2.65680328e-02, -3.49867195e-01, -4.33916420e-01,
-1.50669709e-01, -8.72264281e-02, -3.71248759e-02,
-2.51314729e-01, -8.26448575e-02, -4.18445468e-02,
-1.39517665e-01, -1.89415123e-02, 4.96251583e-01,
-1.63653284e-01, -9.27618742e-01, 2.94972267e-02,
-2.33644783e-01, -1.28878215e-02, 4.84860688e-01,
2.28712127e-01, -7.29070827e-02, -3.19305658e-01,
-5.18278658e-01, -5.38706221e-02, -4.65666622e-01,
2.82632947e-01, -4.47521992e-02, -2.38389149e-02,
1.20173834e-01, -4.23557848e-01, 2.83777714e-01,
1.47676304e-01, 1.23942643e-02, 3.76267523e-01,
-1.78621426e-01, 8.82371902e-01, 2.29514942e-01,
5.78418225e-02, -8.75021040e-04, 1.40977785e-01,
-2.94940919e-03, -3.16139102e-01, -4.55647141e-01,
-7.91564584e-02, 4.89958078e-01, -1.12918414e-01,
-5.48998594e-01, -7.82390416e-04, -2.33876958e-01,
-3.26787204e-01, 4.62798566e-01, -1.41035780e-01,
1.59310680e-02, -6.85367510e-02, -3.57112795e-01,
1.09244324e-01, 1.86645865e-01, 3.71286362e-01,
1.18220029e-02, -1.85265467e-01, 1.43055975e-01,
1.80511877e-01, -3.01655918e-01, 4.11505476e-02,
-2.68879205e-01, -9.79132354e-02, -4.60742623e-01,
2.85828471e-01, -5.15420735e-01, 2.14212790e-01,
1.70870617e-01, -2.43845642e-01, 4.26682234e-01,
-4.69405860e-01, -3.87645394e-01, 1.72637045e-01,
2.99873471e-01, -5.24893463e-01, 3.09333056e-01]][0]
print(
    "TRYING BM25 - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF BM25 AGGREGATION CAN GET THE SAME RESULTS")
harmony_query_response_anxiety_bm25_no_limit = harmony_index.query.bm25(
    query=query[0],
    limit=10000,
    query_properties=["name", "all_text"]
)
print("Number of results from BM25 query:", len(harmony_query_response_anxiety_bm25_no_limit.objects))
# Outputs: Number of results from BM25 query: 706

aggregation_filter = harmony_index.aggregate.over_all(
    filters=Filter.any_of(
        [Filter.by_property("all_text").contains_any(query), Filter.by_property("name").contains_any(query)])
)
print("Result of Filter aggregation (should be the same as number of results from original BM25 query):",
      aggregation_filter.total_count)
# Outputs: Result of Filter aggregation: 85

print(
    "TRYING VECTOR - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF VECTOR AGGREGATION CAN GET THE SAME RESULTS")
harmony_query_response_vector = harmony_index.query.near_vector(
    near_vector=query_vector,
    target_vector=["all_text", "name"],
    limit=10000,
    distance=0.5
)
print("Number of results from vector query:", len(harmony_query_response_vector.objects))
# Outputs: Number of results from vector query: 423

aggregation_vector = harmony_index.aggregate.near_vector(
    near_vector=query_vector,
    distance=0.5,
    target_vector=["all_text", "name"]
)
print("Result of vector aggregation (should be the same as number of results from original vector query):",
      aggregation_vector.total_count)
# Outputs: Result of vector aggregation: 260

print(
    "TRYING HYBRID - TRY FIRST NORMAL WEAVIATE QUERY, TAKE NUMBER OF OBJECTS, AND SEE IF HYBRID AGGREGATION CAN GET THE SAME RESULTS")
harmony_query_response_hybrid = harmony_index.query.hybrid(
    vector=query_vector,
    target_vector=["all_text", "name"],
    query=query[0],
    limit=10000,
    max_vector_distance=0.5
)
print("Number of results from hybrid query:", len(harmony_query_response_hybrid.objects))
# Outputs: Number of results from hybrid query: 423

# This call raises the WeaviateQueryError shown in the stack trace below
aggregation_vector = harmony_index.aggregate.hybrid(
    query=query[0],
    vector=query_vector,
    max_vector_distance=0.5,
    target_vector=["all_text", "name"]
)
print("Result of vector aggregation (should be the same as number of results from original vector query):",
      aggregation_vector.total_count)
'''
---------------------------------------------------------------------------
AioRpcError Traceback (most recent call last)
File ~/anaconda3/lib/python3.12/site-packages/weaviate/collections/grpc/aggregate.py:284, in _AggregateGRPC.__call(self, request)
283 assert self._connection.grpc_stub is not None
--> 284 res = await _Retry(4).with_exponential_backoff(
285 0,
286 f"Searching in collection {request.collection}",
287 self._connection.grpc_stub.Aggregate,
288 request,
289 metadata=self._connection.grpc_headers(),
290 timeout=self._connection.timeout_config.query,
291 )
292 return cast(aggregate_pb2.AggregateReply, res)
File ~/anaconda3/lib/python3.12/site-packages/weaviate/collections/grpc/retry.py:31, in _Retry.with_exponential_backoff(self, count, error, f, *args, **kwargs)
30 if e.code() != StatusCode.UNAVAILABLE:
---> 31 raise e
32 logger.info(
33 f"{error} received exception: {e}. Retrying with exponential backoff in {2**count} seconds"
34 )
File ~/anaconda3/lib/python3.12/site-packages/weaviate/collections/grpc/retry.py:28, in _Retry.with_exponential_backoff(self, count, error, f, *args, **kwargs)
27 try:
---> 28 return await f(*args, **kwargs)
29 except AioRpcError as e:
File ~/anaconda3/lib/python3.12/site-packages/grpc/aio/_call.py:327, in _UnaryResponseMixin.__await__(self)
326 else:
--> 327 raise _create_rpc_error(
328 self._cython_call._initial_metadata,
329 self._cython_call._status,
330 )
331 else:
AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "aggregate: shard 8PTHwX6x2tay: vector index: missing target vector"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-04-11T22:08:14.335334099+01:00", grpc_status:2, grpc_message:"aggregate: shard 8PTHwX6x2tay: vector index: missing target vector"}"
>
During handling of the above exception, another exception occurred:
WeaviateQueryError Traceback (most recent call last)
Cell In[37], line 1
----> 1 aggregation_vector = harmony_index.aggregate.hybrid(query=query[0], vector=query_vector,
2 max_vector_distance=0.5,
3 target_vector=["all_text", "name"]
4 )
6 print ("Result of vector aggregation (should be the same as number of results from original vector query):", aggregation_vector.total_count)
File ~/anaconda3/lib/python3.12/site-packages/weaviate/syncify.py:23, in convert.<locals>.sync_method(self, __new_name, *args, **kwargs)
20 @wraps(method) # type: ignore
21 def sync_method(self, *args, __new_name=new_name, **kwargs):
22 async_func = getattr(cls, __new_name)
---> 23 return _EventLoopSingleton.get_instance().run_until_complete(
24 async_func, self, *args, **kwargs
25 )
File ~/anaconda3/lib/python3.12/site-packages/weaviate/event_loop.py:42, in _EventLoop.run_until_complete(self, f, *args, **kwargs)
40 raise WeaviateClosedClientError()
41 fut = asyncio.run_coroutine_threadsafe(f(*args, **kwargs), self.loop)
---> 42 return fut.result()
File ~/anaconda3/lib/python3.12/concurrent/futures/_base.py:456, in Future.result(self, timeout)
454 raise CancelledError()
455 elif self._state == FINISHED:
--> 456 return self.__get_result()
457 else:
458 raise TimeoutError()
File ~/anaconda3/lib/python3.12/concurrent/futures/_base.py:401, in Future.__get_result(self)
399 if self._exception:
400 try:
--> 401 raise self._exception
402 finally:
403 # Break a reference cycle with the exception in self._exception
404 self = None
File ~/anaconda3/lib/python3.12/site-packages/weaviate/collections/aggregations/hybrid.py:100, in _HybridAsync.hybrid(self, query, alpha, vector, query_properties, object_limit, filters, group_by, target_vector, max_vector_distance, total_count, return_metrics)
93 return (
94 self._to_aggregate_result(res, return_metrics)
95 if group_by is None
96 else self._to_group_by_result(res, return_metrics)
97 )
98 else:
99 # use grpc
--> 100 reply = await self._grpc.hybrid(
101 query=query,
102 alpha=alpha,
103 vector=vector,
104 properties=query_properties,
105 object_limit=object_limit,
106 target_vector=target_vector,
107 distance=max_vector_distance,
108 aggregations=(
109 [metric.to_grpc() for metric in return_metrics]
110 if return_metrics is not None
111 else []
112 ),
113 filters=_FilterToGRPC.convert(filters) if filters is not None else None,
114 group_by=group_by._to_grpc() if group_by is not None else None,
115 limit=group_by.limit if group_by is not None else None,
116 objects_count=total_count,
117 )
118 return self._to_result(reply)
File ~/anaconda3/lib/python3.12/site-packages/weaviate/collections/grpc/aggregate.py:296, in _AggregateGRPC.__call(self, request)
294 if e.code().name == PERMISSION_DENIED:
295 raise InsufficientPermissionsError(e)
--> 296 raise WeaviateQueryError(str(e), "GRPC search") # pyright: ignore
297 except WeaviateRetryError as e:
298 raise WeaviateQueryError(str(e), "GRPC search")
WeaviateQueryError: Query call with protocol GRPC search failed with message <AioRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "aggregate: shard 8PTHwX6x2tay: vector index: missing target vector"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-04-11T22:08:14.335334099+01:00", grpc_status:2, grpc_message:"aggregate: shard 8PTHwX6x2tay: vector index: missing target vector"}"
'''