Optimizing Weaviate Class Indexing for Efficient Bulk Uploads and Hybrid Searches

I’m experiencing slow data upload rates when migrating a large number of objects (24,500) to the same class in my Weaviate Docker instance (version 1.23.7), with each object taking over 30 seconds. I suspect this might be related to my class definition and indexing settings.

I do batch uploads with 200 objects per batch (and it takes ~10 min), with async indexing enabled.

I perform two types of searches:

  1. Hybrid search to retrieve multiple fields
  2. Vector search to retrieve content and vector only

Given these searches, I need some guidance:

  1. Which fields should have indexFilterable and indexSearchable configured as true or false?
  2. For hybrid searches, is there a way to restrict the keyword search to only the “content” field?
  3. Does the withFields parameter in hybrid search influence keyword search?
  4. If indexing is disabled for certain fields, will hybrid search ignore those fields or perform slower searches without indices?

Here’s a sanitized version of the relevant code:

Search for multiple fields using a hybrid query:

const hybridResults = await client.graphql
  .get()
  .withClassName('DataCollection')
  .withFields('detail dataId metadata {properties}')
  .withWhere({
    path: ["dataId"],
    operator: "ContainsAny",
    valueTextArray: idArray,
  })
  .withHybrid({
    query: searchQuery,
    vector: searchVector,
    alpha: semanticWeight,
  })
  .withLimit(limitResults ?? defaultLimit)
  .do();

Vector search to retrieve content and vector:

const vectorResults = await client.graphql
  .get()
  .withClassName('DataCollection')
  .withNearVector({ vector: searchVector })
  .withWhere({
    path: ["dataId"],
    operator: "ContainsAny",
    valueTextArray: idArray,
  })
  .withFields("detail _additional {vector}")
  .withLimit(chunkSize)
  .do();

Class definition:

const classDefinition = {
  class: 'DataCollection',
  properties: [
    { name: "dataId", dataType: ["text"] },
    { name: "detail", dataType: ["text"] },
    { name: "metadata", dataType: ["object"], nestedProperties: [
        { name: "pageNo", dataType: ["text"] },
        { name: "webAddress", dataType: ["text"] },
        { name: "heading", dataType: ["text"] },
        { name: "creator", dataType: ["text"] },
        { name: "pageNumber", dataType: ["int"] },
        { name: "fileExtension", dataType: ["text"] },
        { name: "originType", dataType: ["text"] },
      ],
    },
  ],
  vectorIndexConfig: { distance: "cosine" },
};

Would greatly appreciate any insights to optimize my Weaviate setup for better performance.

Hi @mnkasikci !! Welcome to our community :hugs:

First, please make sure to use the new Python v4 client. It uses gRPC and will be far more performant than v3, which uses HTTP REST endpoints.

Also, consider that async index is experimental :grimacing:

With that said :slight_smile:

1 - When a property in Weaviate is marked as indexFilterable, the data stored in it is indexed using a Roaring Bitmap index, which is designed for fast filtering operations. indexSearchable indicates that the data stored in the property is indexed to support BM25 keyword search, which is also the keyword side of hybrid search.
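As a rough illustration of how those two flags could map onto the queries in the question, here is a minimal sketch. It assumes that dataId is only ever used in where filters and that detail is the only property that needs keyword search; the PropertySketch type and the variable names are hypothetical, not part of the client, and the nested metadata property is left out.

```typescript
// Hypothetical index settings for the 'DataCollection' class from the question.
// Assumption: dataId is only filtered on, detail is only keyword-searched.
type PropertySketch = {
  name: string;
  dataType: string[];
  indexFilterable?: boolean;
  indexSearchable?: boolean;
};

const indexedProperties: PropertySketch[] = [
  // Used in the `where` filter (ContainsAny): filterable index only.
  { name: 'dataId', dataType: ['text'], indexFilterable: true, indexSearchable: false },
  // Target of the BM25 side of hybrid search: searchable index only.
  { name: 'detail', dataType: ['text'], indexFilterable: false, indexSearchable: true },
];

// Names of the properties that BM25 could search under these settings.
const keywordSearchable = indexedProperties
  .filter((p) => p.indexSearchable)
  .map((p) => p.name);

console.log(keywordSearchable);
```

The property objects would go into the properties array of the class definition. Whether turning off the unused index noticeably speeds up imports depends on the data, so it is worth measuring on a small batch first.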

2 - You can use query_properties, for example:

    jeopardy = client.collections.get("JeopardyQuestion")
    response = jeopardy.query.hybrid(
        query="food",
        query_properties=["question"],
        alpha=0.25,
        limit=3
    )
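Since the question uses the TypeScript client, a rough equivalent is sketched below. It assumes the installed weaviate-ts-client version accepts a properties array in the withHybrid arguments, which is worth checking against the client's typings. HybridArgsSketch and buildHybridArgs are hypothetical names used only for illustration.

```typescript
// Sketch: arguments for .withHybrid() that limit the keyword (BM25) side
// to the 'detail' property.
// Assumption: the installed weaviate-ts-client accepts `properties` here.
type HybridArgsSketch = {
  query: string;
  vector?: number[];
  alpha?: number;
  properties?: string[];
};

function buildHybridArgs(
  query: string,
  vector: number[],
  alpha: number,
): HybridArgsSketch {
  return { query, vector, alpha, properties: ['detail'] };
}

const hybridArgs = buildHybridArgs('food', [0.1, 0.2, 0.3], 0.25);

// Passed to the query builder from the question: .withHybrid(hybridArgs)
console.log(hybridArgs.properties);
```

The rest of the query builder chain from the question (withClassName, withFields, withWhere, withLimit) would stay unchanged.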

3 - withFields only tells the GraphQL endpoint which properties to return from the query, so it does not influence the keyword search. In the Python v4 client, all properties are returned by default unless you specify the ones you want with return_properties.

4 - If you disable indexing for some fields, it will affect the BM25 side of a hybrid search: those fields are no longer covered by the keyword search. To change that on the vector search side, you will need to set skip_vectorization=True for that property.

Let me know if this answers your questions, and thanks for joining our community and helping around :wink:

Hi @DudaNogueira. Thank you so much for your detailed answer. It was really helpful.
Regarding the client, I’d love to use the Python v4 client as a fan of gRPC, but my whole codebase is in TypeScript, so for now I am stuck with the TypeScript client unless I write my own gRPC requests.

Is there any plan to upgrade the TypeScript client in the near future?

We are working on it, but it will take a bit longer. Hopefully there will be a public beta soonish (in the next 1-2 months).
