Filter index breaks after updating/inserting new records

g_parki · October 30, 2024, 4:42pm

Description

Hi, I’m seeing a flaky issue when searching for an object via a text property filter after updating/inserting new items. I first saw this on 1.26.3, then upgraded to 1.27.1 and still see the same issue. I’ll provide code at the bottom to reproduce, but the general process is:

1. Collection setup:
   - Multitenant
   - Single property foldername, using field tokenization
   - No vectorizer
2. Go through various updating/insertion operations in rapid succession for renaming existing folders and adding new ones
3. Test the ability to correctly query with a filter:
   - Fetch all items without using a filter (no problem here, works every time)
   - Iterate through the list and attempt to query each item individually by filtering on its foldername
4. Wait 1-2 minutes
5. Repeat the test in step 3

This is where it fails inconsistently. Immediately after the insertion/update operations, the test works and I’m able to individually retrieve each item via a filtered query. But after a 1-2 minute delay, the index appears to break and some items (usually the ones inserted earlier in the script) cannot be found via filtered query. This issue also breaks deletion operations - e.g. “Delete folders not equal to x” deletes everything including x.
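To illustrate why a broken lookup could also break "not equal" deletions, here is a minimal sketch. It assumes a generic inverted index where "not equal to x" is computed as "all ids minus the ids that match x". That evaluation strategy, and the names `index`, `all_ids` and `not_equal`, are assumptions for illustration only, not confirmed Weaviate internals.

```python
# Toy inverted index: property value -> set of object ids.
# This models a generic filter index, not Weaviate's actual implementation.
all_ids = {1, 2, 3}
index = {"/a": {1}, "/b": {2}, "/x": {3}}


def not_equal(index, all_ids, value):
    """Ids whose value differs from `value`: everything minus the exact matches."""
    return all_ids - index.get(value, set())


# Healthy index: the object holding "/x" is excluded from the deletion set.
healthy = not_equal(index, all_ids, "/x")

# Simulate the broken state: the entry for "/x" can no longer be read.
broken_index = {k: v for k, v in index.items() if k != "/x"}
broken = not_equal(broken_index, all_ids, "/x")

print(sorted(healthy))  # [1, 2]
print(sorted(broken))   # [1, 2, 3]
```

Under that assumption, an equality lookup that silently returns nothing turns "delete everything except x" into "delete everything".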

Unfortunately this issue is flaky and I can’t guarantee my example code will break the index every time. Here is example output from this morning:

Immediately after the update/insertion operations:

Starting individual queries at 2024-10-30 07:29:46.896713
515 found                                            
0 not found

After sleeping for one minute and repeating the exact same set of queries:

Starting individual queries at 2024-10-30 07:31:46.411442
464 found                                            
51 not found
Not found: ['/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a/a', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/z', '/a/a', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z', '/b/z/z/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z', '/b/b/z/z/z/z/z/z/z']

One weird observation which may just be anecdotal - if I run the validation queries concurrently (using asyncio.gather() on 515 filtered queries), it doesn’t seem to break as often as when I run the 515 queries one after another.
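For reference, the two validation styles compared above look roughly like this. `query_one` is a hypothetical stand-in for a single filtered query so the sketch runs without a server; in the real script it would wrap the `fetch_objects` call with the foldername filter.

```python
import asyncio


async def query_one(foldername: str) -> bool:
    """Hypothetical stand-in for one filtered query; True means 'found'."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return True


async def check_sequential(names):
    """Run the queries one after another."""
    return [await query_one(name) for name in names]


async def check_concurrent(names):
    """Schedule all queries at once and wait for every result."""
    return await asyncio.gather(*(query_one(name) for name in names))


names = ["", "/a/a", "/b/z"]
print(asyncio.run(check_sequential(names)))
print(asyncio.run(check_concurrent(names)))
```

Both variants return results in the same order as `names`, so the found/not found bookkeeping can stay identical.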

Server Setup Information

Weaviate Server Version: Seen on both 1.26.3 & 1.27.1
Deployment Method: k8s
Number of Running Nodes: 1
Client Language and Version: Python 4.8.1
Multitenancy: Yes

Any additional Information

Here is the script to reproduce. It takes ~4 minutes to run:

import weaviate
import weaviate.classes as wvc
import asyncio
from datetime import datetime

client = weaviate.use_async_with_weaviate_cloud(...)
await client.connect()

# Cleanup from previous tests as needed
if await client.collections.exists("Folders"):
    await client.collections.delete("Folders")

# Create collection and tenant
folders = await client.collections.create(
    name="Folders",
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(
        True, auto_tenant_creation=True
    ),
    properties=[
        wvc.config.Property(
            name="foldername",
            data_type=wvc.config.DataType.TEXT,
            skip_vectorization=True,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
)
await folders.tenants.create("me")
tenant_collection = folders.with_tenant("me")

# Load up some starting data
starting_foldernames = ["", "/a/a", "/a/a/a", "/b", "/b/b"]
data = [{"foldername": name} for name in starting_foldernames]
await tenant_collection.data.insert_many(objects=data)

# Go through several iterations of updating and inserting data
# Here we toggle characters to "c" and back to "b" repeatedly
# And also insert new items with a "/z" appended
iterations = 8
for i in range(iterations):
    all_objects = await tenant_collection.query.fetch_objects(limit=10000)
    for obj in all_objects.objects:
        foldername = obj.properties["foldername"]
        new_foldername = None
        if "b" in foldername:
            new_foldername = foldername.replace("b", "c")
        elif "c" in foldername:
            new_foldername = foldername.replace("c", "b")
        if new_foldername is not None:
            print(
                f"Iteration {i+1}/{iterations}, updating {foldername} -> {new_foldername}"
                + " " * 30,
                end="\r",
            )
            # Update the record, and also insert a new one
            await asyncio.gather(
                tenant_collection.data.update(
                    uuid=obj.uuid, properties={"foldername": new_foldername}
                ),
                tenant_collection.data.insert(
                    properties={"foldername": new_foldername + "/z"}
                ),
            )
print("")


# Define a function to fetch all items, and then perform an individualized
# search for each item
async def do_individualized_searches():
    print(" " * 50)
    print(f"Starting individual queries at {datetime.now()}")
    all_objects = await tenant_collection.query.fetch_objects(
        return_metadata=wvc.query.MetadataQuery(last_update_time=True), limit=10000
    )
    found = []
    not_found = []
    total = len(all_objects.objects)

    for i, obj in enumerate(all_objects.objects):
        print(f"Performing filtered query {i+1}/{total}" + " " * 20, end="\r")
        foldername = obj.properties["foldername"]
        individual_query_result = await tenant_collection.query.fetch_objects(
            filters=wvc.query.Filter.by_property("foldername").equal(foldername)
        )

        if individual_query_result.objects:
            found.append(foldername)
        else:
            not_found.append(foldername)

    print(" " * 50, end="\r")
    print(f"{len(found)} found")
    print(f"{len(not_found)} not found")
    if not_found:
        print(f"Not found: {not_found}")


# Do the searches immediately, which is usually successful for all records
await do_individualized_searches()

# Wait some time
delay = 60
for i in range(int(delay)):
    print(f"Sleeping {delay - i} seconds" + " " * 50, end="\r")
    await asyncio.sleep(delay=1)

# Do the searches again - here is where we sometimes see the index fail
await do_individualized_searches()

DudaNogueira · November 1, 2024, 1:55pm

hi @g_parki !!

Welcome to our community

Thanks for the report and experiments !!!

This is Chaos material! hehehe

Do you think that, on top of that script, you could provide some dataset we could iterate on?

I will escalate this internally.

Thanks!

Jose_Luis_Franco · November 7, 2024, 9:32am

Hello @g_parki ,

First of all, thanks for reporting this issue. My name is Jose Luis and I am QA at Weaviate.
I have been trying to reproduce your issue on 1.27.1 in a local kind Kubernetes cluster, but everything worked for me and I could not trigger the failure, so I guess it has something to do with the configuration parameters. Could you please share your values.yaml or the environment variables you enabled in your k8s cluster?

Thanks in advance!

DudaNogueira · November 12, 2024, 2:48pm

hi @g_parki !!

Thank you VERY MUCH for this bug report!

And great news: we already have a fix cooked:

github.com/weaviate/weaviate

fix: byteops reads buf of length 0 as empty slice instead of nil

weaviate:stable/v1.25 ← weaviate:fix_byteops_reading_empty_slice

opened 10:58AM - 12 Nov 24 UTC

aliszka

+20 -10

### What's being changed:

**ReadWriter::ReadBytesFromBufferWithUint32LengthIn…dicator** used to replace empty `[]byte` with `nil` when length indicator was == 0. That replacement caused `PrimaryKey` of `SegmentNode` of roaringset index being incorrectly read as `nil` instead of empty `[]byte`, making SegmentCursor also return key = nil, whenever empty string was indexed and stored in segment. As `key == nil` is a stop condition for looping through segment by cursor and `""` is alphabetically first, segment accessed by cursor seemed empty. That could cause data lost in case of compaction segment containing `""` with other segment, resulting in merged segment being just copy of the 2nd one.

Note: **ReadWriter::ReadBytesFromBufferWithUint64LengthIndicator** does not have empty `[]byte` to `nil` replacement.

Bug identified thanks to **g_parki**, who reported the issue: https://forum.weaviate.io/t/filter-index-breaks-after-updating-inserting-new-records/7348

### Review checklist

- [ ] Documentation has been updated, if necessary. Link to changed documentation:
- [x] Chaos pipeline run or not necessary. Link to pipeline: https://github.com/weaviate/weaviate-chaos-engineering/actions/runs/11796272571
- [x] All new code is covered by tests where it is reasonable.
- [ ] Performance tests have been run or not necessary.

It was merged already and should be released soon!

Thanks!
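To make the mechanism in the PR description easier to follow, here is a simplified Python model of it. It is not the actual Go code: the segment layout (sorted keys, each prefixed with a little-endian uint32 length) and the names `read_bytes_u32`, `encode_segment` and `scan` are assumptions made for illustration.

```python
import struct


def read_bytes_u32(buf: bytes, offset: int, empty_as_none: bool):
    """Read a uint32 length indicator, then that many bytes."""
    (length,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    if length == 0 and empty_as_none:
        return None, offset  # old behaviour: empty key collapses to "no key"
    return buf[offset:offset + length], offset + length


def encode_segment(keys):
    """Serialize keys in sorted order as length-prefixed byte strings."""
    return b"".join(struct.pack("<I", len(k)) + k for k in sorted(keys))


def scan(segment: bytes, empty_as_none: bool):
    """Cursor-style loop that stops as soon as the key is None."""
    keys, offset = [], 0
    while offset < len(segment):
        key, offset = read_bytes_u32(segment, offset, empty_as_none)
        if key is None:
            break
        keys.append(key)
    return keys


segment = encode_segment([b"/b", b"", b"/a/a"])
other = encode_segment([b"/c"])

buggy = scan(segment, empty_as_none=True)
fixed = scan(segment, empty_as_none=False)
# A merge that trusts the cursor only sees the keys each scan yields.
merged_buggy = sorted(set(buggy) | set(scan(other, empty_as_none=True)))

print(buggy)         # []
print(fixed)         # [b'', b'/a/a', b'/b']
print(merged_buggy)  # [b'/c']
```

Because `""` sorts first, the buggy scan stops before yielding anything, so the segment looks empty and the merge result contains only the second segment's keys. That matches the repro script, which inserts an empty foldername as its first starting value.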

g_parki · November 12, 2024, 3:24pm

Woohoo! Thank you @DudaNogueira @Jose_Luis_Franco and team for the quick action

g_parki · November 13, 2024, 5:30pm

For anybody finding this in the future, the fix is included in v1.27.3, v1.26.10, and v1.25.25
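If you want to check a server version string against those releases, a small helper along these lines could work. `has_fix` and `FIXED_IN` are hypothetical names, and treating any release line newer than 1.27 as fixed is an assumption; release lines older than 1.25 are reported as not fixed simply because no fixed release is listed for them here.

```python
# First patch release containing the fix, per release line (from the post above).
FIXED_IN = {(1, 25): 25, (1, 26): 10, (1, 27): 3}


def has_fix(version: str) -> bool:
    """Return True if `version` (e.g. "1.27.1" or "v1.26.10") includes the fix."""
    core = version.lstrip("v").split("-")[0]  # drop a leading "v" and any suffix
    major, minor, patch = (int(part) for part in core.split(".")[:3])
    line = (major, minor)
    if line in FIXED_IN:
        return patch >= FIXED_IN[line]
    # Assumption: release lines after 1.27 were cut with the fix already merged.
    return line > (1, 27)


for v in ["1.26.3", "1.27.1", "v1.27.3", "1.26.10", "1.25.25"]:
    print(v, has_fix(v))
```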
