Optimizing Weaviate Class Indexing for Efficient Bulk Uploads and Hybrid Searches

mnkasikci · February 6, 2024, 1:51pm

I’m experiencing slow data upload rates when migrating a large number of objects (24,500) to the same class in my Weaviate docker instance version 1.23.7, with each object taking over 30 seconds. I suspect this might be related to my class definition and indexing settings.

I do batch uploads with 200 objects per batch (which takes ~10 min), with async indexing enabled.
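
For reference, a minimal sketch of the chunking arithmetic behind this setup: 24,500 objects split into batches of 200. The `chunk` helper is a hypothetical name, not part of the Weaviate client, and the upload call in the trailing comment is assumed usage of the `weaviate-ts-client` batcher that is not run here.

```typescript
// Hypothetical helper: split an array into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Placeholder objects standing in for the 24,500 real ones.
const objects = Array.from({ length: 24500 }, (_, i) => ({ dataId: String(i) }));
const batches = chunk(objects, 200);

// Assumed usage with weaviate-ts-client, once per batch:
//   let batcher = client.batch.objectsBatcher();
//   for (const obj of batch) {
//     batcher = batcher.withObject({ class: 'DataCollection', properties: obj });
//   }
//   await batcher.do();
```

With these numbers the migration comes to 123 batches, the last one holding the remaining 100 objects.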

I perform two types of searches:

Hybrid search to retrieve multiple fields
Vector search to retrieve content and vector only

Given these searches, I need some guidance:

1 - Which fields should have indexFilterable and indexSearchable set to true or false?
2 - For hybrid searches, is there a way to restrict the keyword search to only the “content” field?
3 - Does the withFields parameter in hybrid search influence the keyword search?
4 - If indexing is disabled for certain fields, will hybrid search ignore those fields, or will it perform slower searches without indices?

Here’s a sanitized version of the relevant code:

Search for multiple fields using a hybrid query:

const hybridResults = await client.graphql
  .get()
  .withClassName('DataCollection')
  .withFields('detail dataId metadata {properties}')
  .withWhere({
    path: ["dataId"],
    operator: "ContainsAny",
    valueTextArray: idArray,
  })
  .withHybrid({
    query: searchQuery,
    vector: searchVector,
    alpha: semanticWeight,
  })
  .withLimit(limitResults ?? defaultLimit)
  .do();

Vector search to retrieve content and vector:

const vectorResults = await client.graphql
  .get()
  .withClassName('DataCollection')
  .withNearVector({ vector: searchVector })
  .withWhere({
    path: ["dataId"],
    operator: "ContainsAny",
    valueTextArray: idArray,
  })
  .withFields("detail _additional {vector}")
  .withLimit(chunkSize)
  .do();

Class definition:

const classDefinition = {
  class: 'DataCollection',
  properties: [
    { name: "dataId", dataType: ["text"] },
    { name: "detail", dataType: ["text"] },
    { name: "metadata", dataType: ["object"], nestedProperties: [
        { name: "pageNo", dataType: ["text"] },
        { name: "webAddress", dataType: ["text"] },
        { name: "heading", dataType: ["text"] },
        { name: "creator", dataType: ["text"] },
        { name: "pageNumber", dataType: ["int"] },
        { name: "fileExtension", dataType: ["text"] },
        { name: "originType", dataType: ["text"] },
      ],
    },
  ],
  vectorIndexConfig: { distance: "cosine" },
};

Would greatly appreciate any insights to optimize my Weaviate setup for better performance.

DudaNogueira · February 6, 2024, 6:47pm

Hi @mnkasikci !! Welcome to our community

First, please make sure to use the new Python v4 client. It uses gRPC and will be much more performant than v3, which uses HTTP REST endpoints.

Also, keep in mind that async indexing is experimental.

With that said:

1 - When a property in Weaviate is marked as indexFilterable, its data is indexed using a Roaring Bitmap index, which is designed for fast filtering operations. indexSearchable indicates that the property's data is indexed to support BM25 and hybrid search.
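
As a rough sketch of how this could map onto the class from the question (in TypeScript, since that is the client being used): `dataId` only appears in where filters, so it keeps the filterable index and drops the searchable one, while `detail` is the only property the keyword search needs. Which flags to switch off is an assumption based on the two queries shown in the question; the `metadata` object is left out for brevity, and the `PropertyDef` type is defined here only for illustration.

```typescript
// Illustrative type; the real client ships its own schema types.
interface PropertyDef {
  name: string;
  dataType: string[];
  indexFilterable?: boolean; // Roaring Bitmap index used by where filters
  indexSearchable?: boolean; // index used by BM25 / the keyword side of hybrid search
}

const properties: PropertyDef[] = [
  // Filtered with ContainsAny, never keyword-searched (assumption).
  { name: "dataId", dataType: ["text"], indexFilterable: true, indexSearchable: false },
  // Keyword-searched, never filtered (assumption).
  { name: "detail", dataType: ["text"], indexFilterable: false, indexSearchable: true },
];

const classDefinition = {
  class: "DataCollection",
  properties,
  vectorIndexConfig: { distance: "cosine" },
};

// Creating the class with weaviate-ts-client would look roughly like this
// (assumed usage, needs a running instance):
//   await client.schema.classCreator().withClass(classDefinition).do();
```

If a property is later needed in a filter or in keyword search, the corresponding flag would have to be true for it in the class definition.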

2 - You can use query_properties, for example:

    jeopardy = client.collections.get("JeopardyQuestion")
    response = jeopardy.query.hybrid(
        query="food",
        query_properties=["question"],
        alpha=0.25,
        limit=3
    )
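
In the TypeScript client used in the question, the counterpart should be the `properties` field of the hybrid arguments. The sketch below only builds the arguments object; the `buildHybridArgs` helper is a hypothetical name, the local `HybridArgs` interface is a simplified stand-in for the client's own type, and the call to `withHybrid` is shown as a comment because it needs a running instance.

```typescript
// Simplified stand-in for the argument shape accepted by withHybrid.
interface HybridArgs {
  query: string;
  vector?: number[];
  alpha?: number;
  properties?: string[]; // restricts the keyword (BM25) side to these properties
}

// Hypothetical helper: hybrid arguments whose keyword search only looks at "detail".
function buildHybridArgs(query: string, vector: number[], alpha: number): HybridArgs {
  return { query, vector, alpha, properties: ["detail"] };
}

const hybridArgs = buildHybridArgs("food", [0.1, 0.2, 0.3], 0.25);

// Assumed usage with weaviate-ts-client:
//   await client.graphql.get()
//     .withClassName('DataCollection')
//     .withFields('detail dataId')
//     .withHybrid(hybridArgs)
//     .withLimit(3)
//     .do();
```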

3 - withFields only tells the GraphQL endpoint which properties to return from the query, so it does not influence the keyword search. In the Python v4 client, all properties are returned by default unless you specify the ones to return using return_properties.

4 - If you disable indexing for some fields, it will affect the BM25 side of a hybrid search. To change which properties affect the vector search side, you will need to set skip_vectorization=True for that property.

Let me know if this answers your questions, and thanks for joining our community and helping around.

mnkasikci · February 7, 2024, 6:26am

Hi @DudaNogueira. Thank you so much for your detailed answer. It was really helpful.
Regarding the client, I’d love to use the Python v4 client as a fan of gRPC, but my whole codebase is in TypeScript, so I am stuck with the TypeScript client for now unless I write my own gRPC requests.

Is there any plan to upgrade the TypeScript client in the near future?

Dirk · February 7, 2024, 10:53am

We are working on it, but it will take a bit longer. Hopefully there will be a public beta soonish (within the next 1-2 months).
