Description
Hi, I’m seeing a flaky issue when searching for an object via a text property filter after updating/inserting new items. I first saw this on 1.26.3, but I upgraded to 1.27.1 and still see the same issue. I’ll provide code at the bottom to reproduce, but the general process is:
1. Collection setup:
   - Multitenant
   - Single property foldername, using field tokenization
   - No vectorizer
2. Go through various update/insertion operations in rapid succession, renaming existing folders and adding new ones
3. Test the ability to correctly query with a filter:
   - Fetch all items without using a filter (no problem here, works every time)
   - Iterate through the list and attempt to query each item individually by filtering on its foldername
4. Wait 1-2 minutes
5. Repeat the test in step 3
This is where it fails inconsistently. Immediately after the insertion/update operations, the test works and I’m able to individually retrieve each item via a filtered query. But after a 1-2 minute delay, the index appears to break and some items (usually the ones inserted earlier in the script) cannot be found via a filtered query. This issue also breaks deletion operations, e.g. “Delete folders not equal to x” deletes everything including x.
Unfortunately this issue is flaky and I can’t guarantee my example code will break the index every time. Here is example output from this morning:
- Immediately after the update/insertion operations:
Starting individual queries at 2024-10-30 07:29:46.896713
515 found
0 not found
- After sleeping for one minute and repeating the exact same set of queries:
Starting individual queries at 2024-10-30 07:31:46.411442
464 found
51 not found
Not found: ['/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a/a', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z']
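The not-found list contains many repeats, so a tally makes it easier to see which folder names are affected. Here is a minimal sketch using only the standard library. `summarize_not_found` is a hypothetical helper that is not part of the script below, and the sample list is a shortened illustration rather than the real output:

```python
from collections import Counter


def summarize_not_found(not_found):
    """Return (foldername, count) pairs, most frequent first."""
    return Counter(not_found).most_common()


# Shortened, illustrative sample; not the real output above
sample = ["/b/z", "/b/b/z", "/b/z", "", "/b/z"]
for name, count in summarize_not_found(sample):
    print(f"{count:3d}  {name!r}")
```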
One weird observation, which may just be anecdotal: if I ran the validation queries concurrently (using asyncio.gather() on 515 filtered queries), it didn’t seem to break as much as when I ran the 515 queries one after another.
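To make the two validation modes concrete, here is a rough sketch of the sequential and concurrent variants. `run_sequential`, `run_concurrent`, and `fake_query` are hypothetical names for illustration; `fake_query` stands in for the filtered fetch_objects call so the snippet runs without a Weaviate instance:

```python
import asyncio


async def run_sequential(query, names):
    """Await one filtered query at a time, in order."""
    return [await query(name) for name in names]


async def run_concurrent(query, names):
    """Issue all filtered queries at once and wait for every result."""
    return await asyncio.gather(*(query(name) for name in names))


async def fake_query(name):
    # Hypothetical stand-in for a filtered fetch_objects call;
    # it reports a hit for every name.
    await asyncio.sleep(0)
    return True


if __name__ == "__main__":
    names = ["/a/a", "/b", "/b/z"]
    print(asyncio.run(run_sequential(fake_query, names)))
    print(asyncio.run(run_concurrent(fake_query, names)))
```

Both variants return results in the same order as `names`, so the found/not-found bookkeeping stays identical between the two modes.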
Server Setup Information
- Weaviate Server Version: Seen on both 1.26.3 & 1.27.1
- Deployment Method: k8s
- Number of Running Nodes: 1
- Client Language and Version: Python client 4.8.1
- Multitenancy: Yes
Any additional Information
Here is the script to reproduce. It takes ~4 minutes to run:
import weaviate
import weaviate.classes as wvc
import asyncio
from datetime import datetime
client = weaviate.use_async_with_weaviate_cloud(...)
await client.connect()
# Cleanup from previous tests as needed
if await client.collections.exists("Folders"):
    await client.collections.delete("Folders")
# Create collection and tenant
folders = await client.collections.create(
    name="Folders",
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(
        True, auto_tenant_creation=True
    ),
    properties=[
        wvc.config.Property(
            name="foldername",
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
)
await folders.tenants.create("me")
tenant_collection = folders.with_tenant("me")
# Load up some starting data
starting_foldernames = ["", "/a/a", "/a/a/a", "/b", "/b/b"]
data = [{"foldername": name} for name in starting_foldernames]
await tenant_collection.data.insert_many(objects=data)
# Go through several iterations of updating and inserting data
# Here we toggle characters to "c" and back to "b" repeatedly
# And also insert new items with a "/z" appended
iterations = 8
for i in range(iterations):
    all_objects = await tenant_collection.query.fetch_objects(limit=10000)
    for obj in all_objects.objects:
        foldername = obj.properties["foldername"]
        new_foldername = None
        if "b" in foldername:
            new_foldername = foldername.replace("b", "c")
        elif "c" in foldername:
            new_foldername = foldername.replace("c", "b")
        if new_foldername is not None:
            print(
                f"Iteration {i+1}/{iterations}, updating {foldername} -> {new_foldername}"
                + " " * 30,
                end="\r",
            )
            # Update the record, and also insert a new one
            await asyncio.gather(
                tenant_collection.data.update(
                    uuid=obj.uuid, properties={"foldername": new_foldername}
                ),
                tenant_collection.data.insert(
                    properties={"foldername": new_foldername + "/z"}
                ),
            )
print("")
# Define a function to fetch all items, and then perform an individualized
# search for each item
async def do_individualized_searches():
    print(" " * 50)
    print(f"Starting individual queries at {datetime.now()}")
    all_objects = await tenant_collection.query.fetch_objects(
        return_metadata=wvc.query.MetadataQuery(last_update_time=True), limit=10000
    )
    found = []
    not_found = []
    total = len(all_objects.objects)
    for i, obj in enumerate(all_objects.objects):
        print(f"Performing filtered query {i+1}/{total}" + " " * 20, end="\r")
        foldername = obj.properties["foldername"]
        individual_query_result = await tenant_collection.query.fetch_objects(
            filters=wvc.query.Filter.by_property("foldername").equal(foldername)
        )
        if individual_query_result.objects:
            found.append(foldername)
        else:
            not_found.append(foldername)
    print(" " * 50, end="\r")
    print(f"{len(found)} found")
    print(f"{len(not_found)} not found")
    if not_found:
        print(f"Not found: {not_found}")
# Do the searches immediately, which is usually successful for all records
await do_individualized_searches()
# Wait some time
delay = 60
for i in range(int(delay)):
    print(f"Sleeping {delay - i} seconds" + " " * 50, end="\r")
    await asyncio.sleep(delay=1)
# Do the searches again - here is where we sometimes see the index fail
await do_individualized_searches()
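Since the failure only shows up after a delay, it may help to poll instead of sleeping a fixed 60 seconds, so the time at which the index first breaks gets recorded. Below is a minimal sketch; `poll_until_failure` is a hypothetical helper, and it assumes the check function returns the list of names it could not find (do_individualized_searches above only prints them, so it would need a `return not_found` added):

```python
import asyncio
import time


async def poll_until_failure(check, interval=10.0, timeout=180.0):
    """Call `check` every `interval` seconds until it reports missing items.

    `check` is an async callable returning the list of names that could not
    be found. Returns (elapsed_seconds, missing) on the first failure, or
    None if `timeout` seconds pass without one.
    """
    start = time.monotonic()
    while True:
        missing = await check()
        elapsed = time.monotonic() - start
        if missing:
            return elapsed, missing
        if elapsed >= timeout:
            return None
        await asyncio.sleep(interval)
```

In a notebook this could be called as `await poll_until_failure(do_individualized_searches)`, given the return-value change mentioned above. Outside a notebook, the top-level `await` calls in the script would need to be wrapped in an `async def main()` and started with `asyncio.run(main())`.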