Increased processing time on Weaviate v3 to v4 migration

Good morning, everyone!

I recently migrated from Weaviate v3 to v4 and made some adjustments during the process. However, I’ve noticed that processing times in v4 seem to be longer compared to v3 for the same files.

Has anyone else encountered this issue? I would greatly appreciate any recommendations for best practices or tips on how to validate and improve processing times. Your insights would be invaluable as I work to optimize performance.

Thank you very much for your help!

weaviate-client = 4.7.1
Weaviate server version = 1.24.9

Hi Nancy!

Welcome to our community :hugs:

I believe you have the same Weaviate Server version (for instance 1.24.9) for both the v3 and v4 clients, right?

Can you share the code you are using for both clients?

The Python v4 client should deliver significant improvements, as it uses gRPC instead of pure REST.

PS: the latest release in 1.24.X is 1.24.21 as I write. We strongly recommend keeping the server updated to avoid known issues in previous versions.

Thanks!

The main process we use looks like this:

with ProcessPoolExecutor(mp_context=mp.get_context("spawn")) as executor:
    for post in collection.iterator(include_vector=False, after=cursor, return_properties=columns_to_retrieve):
        ... # get post filters
        post_requests = {
            "uid": str(post.uuid),
            "weaviate_post_id": post.properties["weaviate_post_id"],
            "where_filters": rel_filters_cache[rel_filters_combination],
        }
        batch_posts.append(post_requests)

        if len(batch_posts) % batch_size == 0:
            res = vector_db.process_batch_posts(batch_posts, max_recommendations, executor)
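
The snippet above does not show whether batch_posts is cleared after each call or how a trailing partial batch is handled. As a minimal sketch of one way to batch an iterator so that each item is processed exactly once, assuming that is the intent, the following uses a hypothetical helper `batched` and a hypothetical `posts` list standing in for the results of collection.iterator(...):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, including a final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes at most batch_size items from the shared iterator
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for the posts returned by collection.iterator(...)
posts = [{"uid": str(i)} for i in range(7)]
batch_lengths = [len(batch) for batch in batched(posts, batch_size=3)]
print(batch_lengths)  # [3, 3, 1]
```

Each yielded list is a fresh object, so no batch is sent twice and the last, shorter batch is not dropped.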

Inside process_batch_posts, we call executor.map(partial(…)) to get the nearest neighbors for each batch. Each worker call does the following:

import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout
from weaviate.classes.query import MetadataQuery

with weaviate.connect_to_local(
    port=DEFAULT_PORT,
    grpc_port=DEFAULT_GRPC_PORT,
    additional_config=AdditionalConfig(
        # Avoid timeouts by setting one minute
        timeout=Timeout(init=60, query=60, insert=120),  # Values in seconds
    ),
    skip_init_checks=True,
) as thread_client:
    collection = thread_client.collections.get(collection_name)
    ann_result = collection.query.near_object(  # type: ignore
        near_object=post_uid,
        limit=n_neighbors,
        return_metadata=MetadataQuery(distance=True),
        filters=filters,
        return_properties=["weaviate_post_id"],
    )
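
To validate where the time goes, one option is to time the connection step and the query step separately. The following is only a sketch using the standard library: `timed`, `summarize`, and the labels are hypothetical names, and the sleeps are placeholders for the real connect and query calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Dict, Iterator, List

# Durations in seconds, grouped by label
timings: DefaultDict[str, List[float]] = defaultdict(list)


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)


def summarize() -> Dict[str, Dict[str, float]]:
    """Return the call count and total seconds for each label."""
    return {
        label: {"calls": len(values), "total_s": sum(values)}
        for label, values in timings.items()
    }


# Placeholders: replace the sleeps with the real connect and query calls.
for _ in range(3):
    with timed("connect"):
        time.sleep(0.01)
    with timed("query"):
        time.sleep(0.005)

summary = summarize()
print(summary)
```

Because the workers run in separate processes, each process would hold its own `timings` dictionary, so the measurements would need to be returned to the parent along with the results before they can be compared.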