Filter index breaks after updating/inserting new records

Description

Hi, I’m seeing a flaky issue where objects can no longer be found via a text property filter after updating/inserting new items. I first saw this on 1.26.3, and it still occurs after upgrading to 1.27.1. I’ll provide code at the bottom to reproduce, but the general process is:

  1. Collection setup:
    • Multitenant
    • Single property foldername, using field tokenization
    • No vectorizer
  2. Go through various updating/insertion operations in rapid succession for renaming existing folders and adding new ones
  3. Test the ability to correctly query with a filter:
    • Fetch all items without using a filter (no problem here, works every time)
    • Iterate through the list and attempt to query each item individually by filtering on its foldername
  4. Wait 1-2 minutes
  5. Repeat the test in step 3

This is where it fails inconsistently. Immediately after the insertion/update operations, the test works and I’m able to individually retrieve each item via a filtered query. But after a 1-2 minute delay, the index appears to break and some items (usually the ones inserted earlier in the script) cannot be found via filtered query. This issue also breaks deletion operations - e.g. “Delete folders not equal to x” deletes everything including x.

Unfortunately this issue is flaky and I can’t guarantee my example code will break the index every time. Here is example output from this morning:

  • Immediately after the update/insertion operations:
Starting individual queries at 2024-10-30 07:29:46.896713
515 found
0 not found
  • After sleeping for one minute and repeating the exact same set of queries:
Starting individual queries at 2024-10-30 07:31:46.411442
464 found
51 not found
Not found: ['/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a/a', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z']
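Since the "Not found" list above contains many repeated names, it can help to collapse it into counts per folder name. Here is a minimal sketch of that, using only the first six entries of the list as sample input (an excerpt, not the full 51 entries); the variable names are just for illustration.

```python
from collections import Counter

# Excerpt: the first six entries of the "Not found" list from the output above.
not_found = [
    "/b/z",
    "/b/b/z/z/z/z/z/z",
    "/b/z",
    "/a/a/a",
    "/b/b/z/z/z/z/z/z",
    "/b/z",
]

# Count how often each folder name failed the filtered lookup.
counts = Counter(not_found)

# Print the most frequently missing names first.
for name, n in counts.most_common():
    print(f"{n:>3}  {name!r}")
```

Pasting the full list into `not_found` gives a compact view of which names are affected and how often.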

One weird observation which may just be anecdotal - if I run the validation queries concurrently (using asyncio.gather() on 515 filtered queries), the index doesn’t seem to break as often as when I run the 515 queries one after another.
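For reference, the concurrent variant looks roughly like the sketch below. It is standalone and does not talk to Weaviate: `query_one` is a hypothetical coroutine function standing in for the filtered `fetch_objects` call, and `fake_query` is a made-up stub that pretends one name is missing.

```python
import asyncio


async def run_checks_concurrently(foldernames, query_one):
    """Run one filtered lookup per folder name, all at once.

    `query_one` is a hypothetical coroutine function that returns True
    when the filtered query finds at least one object for the name.
    """
    # gather() preserves input order, so results line up with foldernames.
    results = await asyncio.gather(*(query_one(name) for name in foldernames))
    found = [name for name, ok in zip(foldernames, results) if ok]
    not_found = [name for name, ok in zip(foldernames, results) if not ok]
    return found, not_found


async def fake_query(name):
    # Stand-in for the real filtered query; pretends "/b/z" cannot be found.
    await asyncio.sleep(0)
    return name != "/b/z"


found, not_found = asyncio.run(
    run_checks_concurrently(["", "/a/a", "/b/z"], fake_query)
)
print(f"{len(found)} found, {len(not_found)} not found: {not_found}")
```

In the real script, `query_one` would wrap the filtered query from the reproduction code and return whether `objects` is non-empty.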

Server Setup Information

  • Weaviate Server Version: Seen on both 1.26.3 & 1.27.1
  • Deployment Method: k8s
  • Number of Running Nodes: 1
  • Client Language and Version: Python client 4.8.1
  • Multitenancy: Yes

Any additional Information

Here is the script to reproduce. It takes ~4 minutes to run:

import weaviate
import weaviate.classes as wvc
import asyncio
from datetime import datetime

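# Note: the top-level `await` calls below need an async-aware environment,
# e.g. a Jupyter notebook or the `python -m asyncio` REPL.
client = weaviate.use_async_with_weaviate_cloud(...)
await client.connect()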

# Cleanup from previous tests as needed
if await client.collections.exists("Folders"):
    await client.collections.delete("Folders")

# Create collection and tenant
folders = await client.collections.create(
    name="Folders",
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(
        True, auto_tenant_creation=True
    ),
    properties=[
        wvc.config.Property(
            name="foldername",
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
)
await folders.tenants.create("me")
tenant_collection = folders.with_tenant("me")

# Load up some starting data
starting_foldernames = ["", "/a/a", "/a/a/a", "/b", "/b/b"]
data = [{"foldername": name} for name in starting_foldernames]
await tenant_collection.data.insert_many(objects=data)

# Go through several iterations of updating and inserting data
# Here we toggle characters to "c" and back to "b" repeatedly
# And also insert new items with a "/z" appended
iterations = 8
for i in range(iterations):
    all_objects = await tenant_collection.query.fetch_objects(limit=10000)
    for obj in all_objects.objects:
        foldername = obj.properties["foldername"]
        new_foldername = None
        if "b" in foldername:
            new_foldername = foldername.replace("b", "c")
        elif "c" in foldername:
            new_foldername = foldername.replace("c", "b")
        if new_foldername is not None:
            print(
                f"Iteration {i+1}/{iterations}, updating {foldername} -> {new_foldername}"
                + " " * 30,
                end="\r",
            )
            # Update the record, and also insert a new one
            await asyncio.gather(
                tenant_collection.data.update(
                    uuid=obj.uuid, properties={"foldername": new_foldername}
                ),
                tenant_collection.data.insert(
                    properties={"foldername": new_foldername + "/z"}
                ),
            )
print("")


# Define a function to fetch all items, and then perform an individualized
# search for each item
async def do_individualized_searches():
    print(" " * 50)
    print(f"Starting individual queries at {datetime.now()}")
    all_objects = await tenant_collection.query.fetch_objects(
        return_metadata=wvc.query.MetadataQuery(last_update_time=True), limit=10000
    )
    found = []
    not_found = []
    total = len(all_objects.objects)

    for i, obj in enumerate(all_objects.objects):
        print(f"Performing filtered query {i+1}/{total}" + " " * 20, end="\r")
        foldername = obj.properties["foldername"]
        individual_query_result = await tenant_collection.query.fetch_objects(
            filters=wvc.query.Filter.by_property("foldername").equal(foldername)
        )

        if individual_query_result.objects:
            found.append(foldername)
        else:
            not_found.append(foldername)

    print(" " * 50, end="\r")
    print(f"{len(found)} found")
    print(f"{len(not_found)} not found")
    if not_found:
        print(f"Not found: {not_found}")


# Do the searches immediately, which is usually successful for all records
await do_individualized_searches()

# Wait some time
delay = 60
for i in range(int(delay)):
    print(f"Sleeping {delay - i} seconds" + " " * 50, end="\r")
    await asyncio.sleep(delay=1)

# Do the searches again - here is where we sometimes see the index fail
await do_individualized_searches()
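To narrow down when the index starts failing, the single 60-second sleep could be replaced by repeated checks at a fixed interval. Below is a minimal standalone sketch of that idea. It assumes a hypothetical `check` coroutine that returns the list of names the filtered query could not find (the `do_individualized_searches()` function above would need to return `not_found` for this to work); `poll_until_failure` and `fake_check` are illustrative names, not part of the Weaviate client.

```python
import asyncio
from datetime import datetime


async def poll_until_failure(check, interval_s=10.0, max_rounds=12):
    """Call `check` repeatedly and stop at the first round with misses.

    `check` is a hypothetical coroutine function returning the list of
    folder names that the filtered query could not find.
    Returns (round_index, not_found), or (None, []) if every round passed.
    """
    for round_index in range(max_rounds):
        not_found = await check()
        print(f"{datetime.now()} round {round_index}: {len(not_found)} not found")
        if not_found:
            return round_index, not_found
        await asyncio.sleep(interval_s)
    return None, []


calls = {"n": 0}


async def fake_check():
    # Stand-in for the real validation: passes twice, then reports one miss.
    calls["n"] += 1
    return ["/b/z"] if calls["n"] >= 3 else []


result = asyncio.run(poll_until_failure(fake_check, interval_s=0.01))
print(result)
```

The timestamp of the first failing round then gives a rough upper bound on how long the index stayed consistent after the writes.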

hi @g_parki !!

Welcome to our community :hugs:

Thanks for the report and experiments :sunglasses: !!!

This is Chaos material! hehehe

Do you think that, on top of that script, you could provide some dataset we could iterate on?

I will escalate this internally.

Thanks!

Hello @g_parki ,

First of all, thanks for reporting this issue. My name is Jose Luis and I work in QA at Weaviate.
I have been trying to reproduce your issue on 1.27.1 in a local kind Kubernetes cluster, but the script worked for me without any failures, so I guess it has something to do with the configuration parameters. Could you please share your values.yaml or the environment variables you enabled in your k8s cluster?

Thanks in advance!