I am getting hnsw_vector_cache_prefill frequently

Description

Hello, I am getting hnsw_vector_cache_prefill in the logs very frequently.

I can't find any information on this log.
In addition to the above, I am also getting "Query call with protocol GRPC batch failed with message Deadline Exceeded", and also:
image

Can anyone help me?

Server Setup Information

  • Weaviate Server Version: 1.27.1
  • Deployment Method:
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python-latest
  • Multitenancy?: Yes

Any additional Information

hi @2020ashish !!

Welcome to our community :hugs:

The hnsw_vector_cache_prefill message is just an INFO log, so no worries here.

Can you paste the entire stack trace error for the second issue?

Apologies for the late reply.
It was caused by an OOMKill issue.
I have a few questions.
In our case we ingest all of our data, which is semi-structured.
During ingestion there are many read/write operations to update previous data.

  1. During ingestion, I first check whether the data exists and then update certain fields. Inserting takes much less time than updating. What would be the best approach?

Hi! The best approach is doing a batch with deterministic IDs.

If you want to define a fixed ID for each object, that’s the way to go:

Remember to use batch instead of insert or insert_many.

Let me know if this helps!

Thanks!

We are using batch.
During ingestion we need to check whether a record exists before inserting it.
Sample code is provided below, showing how we ingest data in some cases.
The update time is huge with this approach.
Can you suggest a more optimized way?

import weaviate.classes as wvc
import weaviate.classes.config as wcc

class_name = "Book"
if client.collections.exists(class_name):
    client.collections.delete(class_name)
client.collections.create(
    name=class_name,
    vectorizer_config=wcc.Configure.Vectorizer.text2vec_transformers(),
    multi_tenancy_config=wcc.Configure.multi_tenancy(enabled=True),
    inverted_index_config=wcc.Configure.inverted_index(
        index_timestamps = True
    ),
    properties=[
        wcc.Property(
            name="Book_name",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
        ),
        wcc.Property(
            name="Author",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.WORD,
            skip_vectorization=True,
        ),
        wcc.Property(
            name="Book_Summary",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
            
        ),
        wcc.Property(
            name="Update_date",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
            skip_vectorization=True,
        ),
    ],
)

def check_book_exist(client, class_name, value, tenant):
    filter = wvc.query.Filter.by_property("Book_name").equal(value)
    class_obj = client.collections.get(class_name).with_tenant(tenant)
    response = class_obj.query.fetch_objects(filters=filter, limit=1)
    uuid = None
    for o in response.objects:
        uuid = o.uuid
    return uuid

def update_records(uuid, collection, data, tenant, client):
    class_obj = client.collections.get(collection).with_tenant(tenant)
    class_obj.data.update(
        uuid=uuid,
        properties=data,
    )

def parser(data, tenant):
    with client.batch.fixed_size(batch_size=100) as batch:
        book_uuid = check_book_exist(
            client=client,
            class_name="Book",
            value=data.get("Book_name"),
            tenant=tenant,
        )
        if book_uuid is None:
            book_data = {
                "Book_name": data.get("Book_name"),
                "Author": data.get("Author"),
                "Book_Summary": data.get("Book_Summary"),
                "Update_date": data.get("Update_date"),
            }
            batch.add_object(
                properties=book_data,
                collection="Book",
                tenant=tenant,
            )
            print("Added the record for book name", data.get("Book_name"))
        else:
            update_records(
                tenant=tenant,
                client=client,
                uuid=book_uuid,
                collection="Book",
                data={"Update_date": data.get("Update_date")},
            )
            print("Update the record for book name", data.get("Book_name"))

Hi!

If you can define an id per book, you can do this:

import numpy as np

from weaviate.util import generate_uuid5
x = 1000
collection_name = "Book"

collection = client.collections.get(collection_name)
with collection.batch.dynamic() as batch:
    for i in range(x):
        batch.add_object(
            {"text": f"{collection_name}, object {i}"},
            vector=np.random.rand(1536),
            uuid=generate_uuid5(f"book-id-{i}")
        )

if collection.batch.failed_objects:
    print("Failed Objects: ", collection.batch.failed_objects)

So passing a fixed UUID (generated from a book ID from your system, for example) makes the batch process insert or update: if an object with the same ID already exists, it is updated (or left unchanged), and if there isn't one, it is created.
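
To make this concrete for the Book example, here is a minimal sketch, assuming Book_name uniquely identifies a book within a tenant (an assumption, the thread does not confirm it). It uses the standard library's uuid.uuid5 so it runs without a Weaviate instance; weaviate.util.generate_uuid5 from the snippet above serves the same purpose. BOOK_NAMESPACE and the book_uuid helper are hypothetical names.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
BOOK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "books.example.com")


def book_uuid(book_name: str) -> str:
    """Derive a stable UUID from the book name (assumed unique per tenant)."""
    # Normalise so that "Dune" and "  dune " map to the same object.
    key = book_name.strip().lower()
    return str(uuid.uuid5(BOOK_NAMESPACE, key))


# Same name, same UUID: no lookup query is needed before writing.
first = book_uuid("Dune")
second = book_uuid("  dune ")
other = book_uuid("Emma")
print(first == second, first == other)  # prints: True False
```

With something like this, parser could pass uuid=book_uuid(data["Book_name"]) to batch.add_object and drop the check_book_exist query. One caveat, as far as I know: writing an object with an existing ID through batch replaces the whole object, so the full set of properties should be sent each time, not only Update_date.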

:slight_smile:

{"action":"lsm_replace_compacted_segments_blocking","build_git_commit":"05de0db","build_go_version":"go1.22.8","build_image_tag":"v1.27.1","build_wv_version":"1.27.1","class":"Book","index":"Book","level":"warning","msg":"replacing compacted segments took 343.492803ms","path_left":"/var/lib/weaviate/Book/4921/lsm/property_exploit_available_nullState/segment-1731491661482528037.db","path_right":"/var/lib/weaviate/Book/4921/lsm/property_exploit_available_nullState/segment-1731491804128129752.db","segment_index":7,"shard":"4921","time":"2024-11-13T09:59:06Z","took":343492803}

Hey @DudaNogueira
In this log, what is "took":343492803? Is there a problem with Weaviate?
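
On the "took" field: comparing it with the "msg" field of the same log line suggests it is the same duration expressed in nanoseconds. This is inferred from the log line itself, not from Weaviate documentation. A quick check, with the log line trimmed to the relevant fields:

```python
import json

# Trimmed copy of the log line above, keeping only the fields being compared.
log_line = (
    '{"level": "warning", '
    '"msg": "replacing compacted segments took 343.492803ms", '
    '"took": 343492803}'
)

record = json.loads(log_line)
took_ms = record["took"] / 1_000_000  # nanoseconds -> milliseconds
print(f"{took_ms:.6f}ms")  # prints: 343.492803ms
```

So this compaction step took roughly a third of a second, and the entry is logged at warning level, not as an error.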

I have multiple cases and need help optimizing the code:

  1. Case 1: Suppose I have an additional property, "is_new", in the Book schema. Initially, is_new will be set to True when a book is added for the first time. If the same book title appears again, is_new should be set to False.
wcc.Property(
            name="is_new",
            data_type=wcc.DataType.BOOL         
        )

Case 2: I have another collection called Publication, which has its own properties and a cross-reference between Book and Publication.

 properties=[
        wcc.Property(
            name="Publication_Title",
            data_type=wcc.DataType.TEXT,
        ),
        wcc.Property(
            name="Publication_Date",
            data_type=wcc.DataType.TEXT,
        ),
    ],
    # In the v4 Python client, cross-references go under `references`, not `properties`
    references=[
        wcc.ReferenceProperty(
            name="Book_Reference",
            target_collection="Book",
        ),
    ],

I have two separate files for ingestion: book.csv, which contains author details, and publication.csv, which contains publication details. Each collection is ingested separately, with book.csv ingested first, followed by publication.csv.
During publication.csv ingestion, we need to check if each book exists in the Book collection (mapping by a non-UUID field), fetch its UUID, and then link it to Publication.
How can we verify that the Book exists, then link the Publication to the Book, and ingest the data efficiently?

Hey @2020ashish, as per thread Metadata properties - #3 by 2020ashish,

My answer is based on your last message in this thread: since you have two files and are batching each separately, I would recommend going with a cross-reference approach. It will work well in your case, especially when you need to query and filter metadata. I wouldn’t go for a boolean property (is_new) unless there’s a compelling reason for it.

With two collections and one cross-reference, performance should not be an issue unless your queries become unusually complex. Based on my experience with similar use cases, a cross-reference is a suitable solution for this scenario. This approach is clean and maintainable for a batching workflow.
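
A minimal sketch of how the link could be built without one lookup query per row, assuming Book IDs were derived deterministically from the book name at ingestion time (an assumption; if the Book objects were created with random UUIDs this does not apply). The Book_name column in publication.csv, DEMO_NAMESPACE, and the helpers book_uuid and publication_payload are all hypothetical. It uses only the standard library so it runs without a server.

```python
import csv
import io
import uuid

# Hypothetical namespace; it must match the one used when ingesting books.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "books.example.com")


def book_uuid(book_name: str) -> str:
    """Recompute the Book UUID from its name instead of querying for it."""
    return str(uuid.uuid5(DEMO_NAMESPACE, book_name.strip().lower()))


def publication_payload(row: dict) -> dict:
    """Build the keyword arguments for one Publication object."""
    return {
        "properties": {
            "Publication_Title": row["Publication_Title"],
            "Publication_Date": row["Publication_Date"],
        },
        # Cross-reference: reference property name -> UUID of the target Book.
        "references": {"Book_Reference": book_uuid(row["Book_name"])},
    }


# Stand-in for publication.csv (column names are assumed).
sample_csv = io.StringIO(
    "Publication_Title,Publication_Date,Book_name\n"
    "First Edition,1965-08-01,Dune\n"
)
payloads = [publication_payload(row) for row in csv.DictReader(sample_csv)]
print(payloads[0]["references"]["Book_Reference"] == book_uuid("Dune"))  # prints: True
```

Each payload could then be handed to the batch, for example batch.add_object(**payload) inside a with block on the tenant-scoped Publication collection. If you still need to confirm that the book exists before linking, a check by ID (collection.data.exists(uuid) in the v4 Python client, as far as I know) should be cheaper than a filter on Book_name.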

Does that help?