I am getting hnsw_vector_cache_prefill frequently

2020ashish · November 8, 2024, 8:43am

Description

Hello, I am getting hnsw_vector_cache_prefill in the logs very frequently.

I can't find any information about this log message.
In addition, I am also getting "Query call with protocol GRPC batch failed with message Deadline Exceeded".

Can anyone help me?

Server Setup Information

Weaviate Server Version: 1.27.1
Deployment Method:
Multi Node? Number of Running Nodes: 1
Client Language and Version: Python-latest
Multitenancy?: Yes

Any additional Information

DudaNogueira · November 8, 2024, 11:11am

hi @2020ashish !!

Welcome to our community

The hnsw_vector_cache_prefill message is just an INFO-level log, so no worries there.

Can you paste the entire stack trace error for the second issue?

2020ashish · November 12, 2024, 11:26am

Apologies for the late reply.
It was caused by an OOMKill issue.
I have a few questions.
In our case we ingest all of our data, which is semi-structured.
During ingestion there are many read/write operations to update previously ingested data.

During ingestion, I first check whether a record exists, and if it does, I update certain fields. Inserting takes much less time than updating. What would be the best approach?

DudaNogueira · November 12, 2024, 12:40pm

Hi! The best approach is doing a batch with deterministic IDs.

If you want to define a fixed ID for each object, that’s the way to go:

Remember to use batch instead of insert or insert_many

Let me know if this helps!

Thanks!

2020ashish · November 12, 2024, 1:13pm

We are using batch.
During ingestion we need to check whether a record exists before inserting it.
The sample code below shows how we ingest data in some cases.
Update time is very high with this approach.
Can you suggest a more optimized way?

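import weaviate.classes as wvc
import weaviate.classes.config as wcc

# `client` is assumed to be an already-connected weaviate.WeaviateClient
class_name = "Book"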
if client.collections.exists(class_name):
    client.collections.delete(class_name)
client.collections.create(
    name=class_name,
    vectorizer_config=wcc.Configure.Vectorizer.text2vec_transformers(),
    multi_tenancy_config=wcc.Configure.multi_tenancy(enabled=True),
    inverted_index_config=wcc.Configure.inverted_index(
        index_timestamps = True
    ),
    properties=[
        wcc.Property(
            name="Book_name",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
        ),
        wcc.Property(
            name="Author",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.WORD,
            skip_vectorization=True,
        ),
        wcc.Property(
            name="Book_Summary",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
            
        ),
        wcc.Property(
            name="Update_date",
            data_type=wcc.DataType.TEXT,
            tokenization=wcc.Tokenization.FIELD,
            skip_vectorization=True,
        ),
    ],
)


def check_book_exist(client, class_name, value, tenant):
    filter = wvc.query.Filter.by_property("Book_name").equal(value)
    class_obj = client.collections.get(class_name).with_tenant(tenant)
    response = class_obj.query.fetch_objects(filters=filter, limit=1)
    uuid = None
    for o in response.objects:
        uuid = o.uuid
    return uuid

def update_records(uuid, collection, data, tenant, client):
    class_obj = client.collections.get(collection).with_tenant(tenant)
    class_obj.data.update(
        uuid=uuid,
        properties=data,
    )

def parser(data,tenant):
    with client.batch.fixed_size(batch_size=100) as batch:
        book_uuid = check_book_exist(
            client = client,
            class_name = "Book",
            value = data.get("Book_name"),
            tenant = tenant,
        )
        if book_uuid is None:
            book_data = { "Book_name": data.get("Book_name"),
                        "Author": data.get("Author"),
                        "Book_Summary": data.get("Book_Summary"),
                        "Update_date": data.get("Update_date"),
                }
            batch.add_object(
                properties=book_data,
                collection="Book",
                tenant=tenant,
            )
            print("Added the record for book name", data.get("Book_name"))
        else:
            update_records(
                tenant=tenant,
                client=client,
                uuid=book_uuid,
                collection="Book",
                data={"Update_date": data.get("Update_date")},
            )
            print("Update the record for book name", data.get("Book_name"))

DudaNogueira · November 12, 2024, 1:40pm

Hi!

If you can define an id per book, you can do this:

import numpy as np

from weaviate.util import generate_uuid5
x = 1000
collection_name = "Book"

collection = client.collections.get(collection_name)
with collection.batch.dynamic() as batch:
    for i in range(x):
        batch.add_object(
            {"text": f"{collection_name}, object {i}"},
            vector=np.random.rand(1536),
            uuid=generate_uuid5(f"book-id-{i}")
        )

if collection.batch.failed_objects:
    print("Failed Objects: ", collection.batch.failed_objects)

Passing a fixed UUID (generated from a book ID in your system, for example) makes the batch process behave as an insert-or-update: if an object with the same ID already exists, it is updated (or left unchanged), and if it doesn't, it is created.
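Applied to the Book collection from the earlier post, the same idea might look like the sketch below. It is only a sketch under a few assumptions: Book_name is taken to identify a book uniquely within a tenant, the standard-library `uuid.uuid5` stands in for `weaviate.util.generate_uuid5`, and `book_uuid` and `upsert_books` are hypothetical helper names. As far as I know, a batch write to an existing UUID replaces the stored object rather than merging into it, so the full set of properties is sent each time.

```python
import uuid


def book_uuid(book_name: str) -> str:
    """Derive a stable UUID from the book name (same name -> same UUID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"book-{book_name}"))


def upsert_books(batch, rows, tenant):
    """Queue every row with a deterministic UUID, with no existence query.

    `batch` is assumed to be the object yielded by
    `client.batch.fixed_size(...)` or `client.batch.dynamic()`.
    """
    for row in rows:
        batch.add_object(
            collection="Book",
            tenant=tenant,
            uuid=book_uuid(row["Book_name"]),
            # Assumption: the write replaces any object already stored
            # under this UUID, so all properties are sent, not just
            # Update_date.
            properties={
                "Book_name": row["Book_name"],
                "Author": row["Author"],
                "Book_Summary": row["Book_Summary"],
                "Update_date": row["Update_date"],
            },
        )
```

With a real client this would be called once for the whole file, for example inside `with client.batch.fixed_size(batch_size=100) as batch:`, instead of opening a new batch and running a filter query for every record.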

2020ashish · November 13, 2024, 10:04am

{"action":"lsm_replace_compacted_segments_blocking","build_git_commit":"05de0db","build_go_version":"go1.22.8","build_image_tag":"v1.27.1","build_wv_version":"1.27.1","class":"Book","index":"Book","level":"warning","msg":"replacing compacted segments took 343.492803ms","path_left":"/var/lib/weaviate/Book/4921/lsm/property_exploit_available_nullState/segment-1731491661482528037.db","path_right":"/var/lib/weaviate/Book/4921/lsm/property_exploit_available_nullState/segment-1731491804128129752.db","segment_index":7,"shard":"4921","time":"2024-11-13T09:59:06Z","took":343492803}

Hey @DudaNogueira
In this log, what does "took":343492803 mean? Is there a problem with Weaviate?
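Going only by the line itself, the `msg` text ("took 343.492803ms") and the `took` field agree if `took` is read as a duration in nanoseconds. That reading is an inference from the two numbers matching, not something confirmed in this thread. A quick check on a trimmed copy of the log line:

```python
import json

# Trimmed copy of the log line quoted above (only the relevant fields).
log_line = (
    '{"level":"warning",'
    '"msg":"replacing compacted segments took 343.492803ms",'
    '"took":343492803}'
)

entry = json.loads(log_line)

# Assumption: `took` is in nanoseconds, so dividing by 1e6 gives milliseconds.
took_ms = entry["took"] / 1_000_000
print(f"took = {took_ms} ms; message says: {entry['msg']}")
```

Under that reading, the entry reports that one compaction step took roughly a third of a second, logged at warning level.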

2020ashish · November 13, 2024, 12:08pm

I have multiple cases and need help optimizing the code:

Case 1: Suppose I have an additional property, "is_new", in the Book schema. Initially, is_new will be set to True when a book is added for the first time. If the same book title appears again, is_new should be set to False.

wcc.Property(
    name="is_new",
    data_type=wcc.DataType.BOOL,
)

Case 2: I have another collection called Publication, which has its own properties and a cross-reference between Book and Publication.

properties=[
    wcc.Property(
        name="Publication_Title",
        data_type=wcc.DataType.TEXT,
    ),
    wcc.Property(
        name="Publication_Date",
        data_type=wcc.DataType.TEXT,
    ),
],
# In the v4 Python client, cross-references go in `references`, not `properties`
references=[
    wcc.ReferenceProperty(
        name="Book_Reference",
        target_collection="Book",
    ),
],

I have two separate files for ingestion: book.csv, which contains author details, and publication.csv, which contains publication details. Each collection is ingested separately, with book.csv ingested first, followed by publication.csv.
During publication.csv ingestion, we need to check whether each book exists in the Book collection (mapping by a non-UUID field), fetch its UUID, and then link it to the Publication.
How can we verify that a Book exists, link the Publication to it, and ingest the data efficiently?

Mohamed_Shahin · November 15, 2024, 10:45am

Hey @2020ashish, as per thread Metadata properties - #3 by 2020ashish,

My answer is based on your last message in this thread. Since you have two files and are batching each separately, I would recommend going with a cross-reference approach. It will work well in your case, especially when you need to query and filter metadata. I wouldn't go for a boolean property (is_new) unless there's a compelling reason for it.

With two collections and one cross-reference, performance should not be an issue unless your queries become unusually complex. Based on my experience with similar use cases, a cross-reference is a suitable solution for this scenario. This approach is clean and maintainable for a batching workflow.

Does that help?
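A rough sketch of how the linking step could avoid a lookup query per row, under these assumptions: Book objects were imported with UUIDs derived from the mapping field (the deterministic-ID approach suggested earlier in the thread), publication.csv carries that field in a column called Book_name here (hypothetical), and Publication is multi-tenant like Book. The standard-library `uuid.uuid5` stands in for `weaviate.util.generate_uuid5`, and `book_uuid` and `add_publications` are hypothetical helper names.

```python
import uuid


def book_uuid(book_name: str) -> str:
    """Recompute the UUID the Book was imported with (same name -> same UUID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"book-{book_name}"))


def add_publications(batch, rows, tenant):
    """Queue Publication objects, each linked to its Book by a computed UUID.

    `batch` is assumed to be the object yielded by `client.batch.dynamic()`
    or `client.batch.fixed_size(...)`.
    """
    for row in rows:
        batch.add_object(
            collection="Publication",
            tenant=tenant,
            properties={
                "Publication_Title": row["Publication_Title"],
                "Publication_Date": row["Publication_Date"],
            },
            # The target UUID is computed locally, so the Book collection
            # does not have to be queried for every publication row.
            references={"Book_Reference": book_uuid(row["Book_name"])},
        )
```

If publications whose book is missing must be skipped, an existence check by UUID on the Book collection (`data.exists(uuid)` in the v4 client, if I remember the API correctly) should be cheaper than a filter query on a text property.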
