Help Needed: Explain top scored documents, Increase query speed

Hi all,

We are storing approximately 25 million (768d) vectors in a class. I have two questions regarding the performance of the search.

1. Explanation for the results and their scores:
We use hybrid search to get the results and our vector index configuration is as follows:

"vectorIndexConfig": {
    "skip": false,
    "cleanupIntervalSeconds": 300,
    "maxConnections": 64,
    "efConstruction": 128,
    "ef": -1,
    "dynamicEfMin": 100,
    "dynamicEfMax": 500,
    "dynamicEfFactor": 8,
    "vectorCacheMaxObjects": 1000000000000,
    "flatSearchCutoff": 40000,
    "distance": "cosine",
    "pq": {
        "enabled": false,
        "bitCompression": false,
        "segments": 0,
        "centroids": 256,
        "trainingLimit": 100000,
        "encoder": {
            "type": "kmeans",
            "distribution": "log-normal"
        }
    }
}

Since we configured ef to be dynamic (ef: -1) with a dynamicEfFactor of 8, any limit of 100 or more should not have any effect on the query-time ef: 8*100 = 800 already exceeds dynamicEfMax, so ef will be capped at the maximum value of 500.
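
To make that reasoning concrete, here is a minimal Python sketch of how I understand the query-time ef to be derived from the limit. The function name dynamic_ef and the floor_at_limit flag are hypothetical and not part of any Weaviate API. The flag models a rule I believe the dynamicEfMax documentation describes (ef falls back to the limit when the limit exceeds dynamicEfMax); I have not confirmed it for our version, so treat it as an assumption.

```python
def dynamic_ef(limit, ef_min=100, ef_max=500, factor=8, floor_at_limit=False):
    """Sketch of the query-time ef when "ef" is -1 (dynamic).

    Defaults mirror dynamicEfMin, dynamicEfMax and dynamicEfFactor from the
    config above. floor_at_limit is a hypothetical switch for the assumed
    rule that ef is never smaller than the requested limit.
    """
    # limit * factor, clamped into [ef_min, ef_max]
    ef = max(ef_min, min(limit * factor, ef_max))
    if floor_at_limit:
        # Assumption: a search list shorter than the limit would cut results off
        ef = max(ef, limit)
    return ef


for limit in (10, 100, 1000, 2000):
    print(limit, dynamic_ef(limit), dynamic_ef(limit, floor_at_limit=True))
```

With the flag off, every limit from 100 up maps to ef = 500; with it on, limits of 1000 and 2000 map to ef = 1000 and 2000.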

But the results I get vary a lot if I increase the limit to say 1000, 2000.

  • limit 100: 0.013114754 (score of the top result)
  • limit 1000: 0.01558389 (score of the top result)
  • limit 2000: 0.016393442 (score of the top result)

My question is: why is this happening, and how can we control it without modifying the vector config values (efConstruction and maxConnections), as that would require me to re-process the whole dataset?

2. Query speed

If I increase the limit it’s taking longer to get the response (which is expected I suppose).

limit 10: 4.88 seconds
limit 100: 4.58 seconds
limit 1000: 7.54 seconds
limit 2000: 11.69 seconds

Sometimes depending on the query, it takes up to 1 min 30 seconds to respond.

Is there anything that can be done to improve these speeds? (current hardware specs: 128GB RAM, 16 vcpus)

Thank you

Hey @vamsi - I don’t know what the answer to this is at all, but I am passing it on to the team and hopefully someone will be in touch soon.

Thanks!

@vamsi thanks for running this experiment!

Is this in AWS, GCP, or another cloud provider, or is it on site? Is this Linux or another OS?

Also, what sort of disks/backing store are you using? We recommend gp3 for AWS, for example, as it’s the best price/performance and Weaviate does make use of the disk IOPS and throughput/bandwidth.

With that in mind, can you see if you are blocked on iowait? For example, on the pod or machine, you can run iostat -x 2 or similar. If there is non-zero iowait, the disk(s) may be the culprit and you would want more performant storage. We can dive further into that if need be (diagnosing and increasing IOPS, etc).

If not, you may be able to see if CPU is the limiting factor. For example, the htop utility is a fairly user-friendly way to see what each CPU/vCPU/core is doing. If you see that one thread/process/core is >=100% utilization during the query, we may be blocking on single-thread performance concerns.

Finally, you can use the weaviate diagnostics utility to grab information that can help us further. Specifically, the profiling information can help indicate what Weaviate is spending time doing. Sharing that with us will help a ton!


Hi @kcm ,

Thank you for outlining the steps.

Here are the details you requested:

  1. Cloud provider: Digital Ocean

    Volume performance details for DigitalOcean (found here - Volume Features :: DigitalOcean Documentation):

    Type                     IOPS      Throughput
    Standard                 7,500     300 MB/s
    Standard (burst)         10,000    450 MB/s
    CPU-Optimized            10,000    450 MB/s
    CPU-Optimized (burst)    15,000    525 MB/s

  2. OS: Linux

  3. When I run iostat -x 2, %iowait is mostly close to 0 (idle) but increases while retrieving. Example stats captured while retrieving are shown below:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          43.17    0.00    2.78    4.10    0.00   49.95

  4. I monitored the CPU usage using htop. Most of the time the usage stays below 100%. Some vCPUs were reaching 100%, but isn’t this expected while I am importing the data?

  5. I have generated the report using weaviate diagnostics. Do you want me to share this over email?
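
In case it helps with comparing runs, here is a small Python sketch that pulls the avg-cpu figures out of iostat text output so the %iowait value can be logged or checked automatically. The function name parse_avg_cpu is my own, and the sketch assumes the two-line header/value layout shown in the sample above; other iostat versions or locales may format it differently.

```python
def parse_avg_cpu(report: str) -> dict[str, float]:
    """Return the avg-cpu block of an iostat report as {column: percent}.

    Assumes a header line starting with "avg-cpu:" followed directly by a
    line of numeric values, as in the sample captured above.
    """
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    for header, values in zip(lines, lines[1:]):
        if header.startswith("avg-cpu:"):
            # Drop the "avg-cpu:" label and the leading % of each column name
            names = [name.lstrip("%") for name in header.split()[1:]]
            numbers = [float(value) for value in values.split()]
            return dict(zip(names, numbers))
    raise ValueError("no avg-cpu block found")


SAMPLE = """
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          43.17    0.00    2.78    4.10    0.00   49.95
"""

stats = parse_avg_cpu(SAMPLE)
print(f"iowait={stats['iowait']}%  idle={stats['idle']}%")
```

Feeding it successive iostat reports taken during a slow query would show whether %iowait climbs while the query runs.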

@kcm @jphwang

We are facing an issue while migrating the data and setting up the VM in Azure. Can you please suggest a way to fix this?

We backed up and imported 30 million vectors+objects (1.2TB) into an Azure VM. We are using an Azure disk with the following specs: P40 (7500 IOPS, 250 MB/s).

The query speed is terrible. It’s taking 1 minute on average to fetch the documents.
I see that %iowait is non-zero and always above 20, while the CPU is idle almost all the time.

Upgrading the disk to P80 (20000 IOPS, 900 MB/s) did not have any impact.

Is it expected to have a high initial load on I/O operations as soon as the backup data is restored? I see a continuous stream of lsm_compaction entries in the logs. This has been going on for the last 2 days.

Any help is appreciated.

Thank you.