Am I thinking correctly about quantisation?

Description

I am a beginner with no real experience with AI, so deep technical knowledge on the topic escapes me. I just want to clarify some details at a high level, because there are some questions that I find difficult to answer from the documentation.

My primary goal is to use a quantised model for CPU inference (the ONNX images); I need to squeeze as much as I can out of the CPU for semantic search.

This is my proposed setup:

  1. quantised model: Snowflake Arctic embed sm (30m; 384d; Q8) – the ONNX image provided by Weaviate
  2. bring your own vectors: the vectoriser is set to none because it’s a static database, and I will vectorise off-site
  3. scalar quantisation: I think I need this, right? If the model produces int8, then it should compare against an int8 index?

My setup reflects what I think is required; these questions will help clarify my confusion:

a. do I bring my own vectors as f32?
- weaviate-go accepts only f32, so I assume that it quantises the vectors as it indexes them
- would the unquantised model yield suitable vectors? (If I understand it correctly Weaviate quantises them before indexing – but are the int8 quantised vectors from the model suitable to match against Weaviate’s int8 index vectors?)
b. do I need to use SQ if the model that I use produces int8 vectors?
- if not, how does that work? How does it compare vectors (are they int8, float32, or something else)? I assumed that a Q8 model produces int8, and that it needs to be compared against int8 (an SQ index)
c. if the vectoriser is “none”: do I vectorise my own queries (via an HTTP request to the container)? (see the sketch after this list)
- I will set the vectoriser to none, because it’s a static database, and I intend to vectorise everything off-site
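
For what it’s worth, this is how I picture vectorising a query by calling the inference container directly. The endpoint path (/vectors), the response shape, and the host/port are assumptions about the transformers-inference image rather than something I have confirmed:

// Sketch: vectorising a query by POSTing directly to the t2v-transformers
// container. The /vectors endpoint, port mapping, and response shape are
// assumptions; verify against the container's documentation.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

type vectorizeResponse struct {
    Vector []float32 `json:"vector"`
}

func vectorizeQuery(text string) ([]float32, error) {
    payload, err := json.Marshal(map[string]string{"text": text})
    if err != nil {
        return nil, err
    }
    // Host/port depend on how the container is exposed in docker-compose.
    resp, err := http.Post("http://localhost:8081/vectors", "application/json", bytes.NewReader(payload))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("vectorizer returned %s", resp.Status)
    }
    var out vectorizeResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return out.Vector, nil
}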

Generally speaking, my approach to (and understanding of) batch loading the database is:

  1. vectorise off-site using the unquantised Snowflake Arctic embed model – producing f32 vectors
  2. send those to Weaviate via my gRPC server (as f32) – sketched below
  3. Weaviate creates the collection and stores the uncompressed vectors, but indexes them as int8 (SQ)
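
As a concrete illustration of step 2, a minimal sketch with the v4 Go client (class name, property names, and the exact client calls are from memory and should be treated as assumptions, not a verified recipe):

// Sketch: sending pre-computed float32 vectors to Weaviate with the v4 Go
// client. The class "Document" and the "text" property are made up for
// illustration.
package main

import (
    "context"

    "github.com/weaviate/weaviate-go-client/v4/weaviate"
    "github.com/weaviate/weaviate/entities/models"
)

func importBatch(ctx context.Context, texts []string, vectors [][]float32) error {
    client, err := weaviate.NewClient(weaviate.Config{Scheme: "http", Host: "localhost:8080"})
    if err != nil {
        return err
    }

    objects := make([]*models.Object, 0, len(texts))
    for i, text := range texts {
        objects = append(objects, &models.Object{
            Class:      "Document",
            Properties: map[string]interface{}{"text": text},
            Vector:     models.C11yVector(vectors[i]), // bring-your-own f32 vector
        })
    }

    _, err = client.Batch().ObjectsBatcher().WithObjects(objects...).Do(ctx)
    return err
}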

Now, a query comes in:

  1. I send it to the Snowflake Arctic embed model for vectorisation, and it returns an int8 vector
  2. it’s then sent to Weaviate, and a near-vector search commences (see the sketch after these steps)
  3. Weaviate compares the int8 query to the SQ index (int8)
  4. some other magic happens (possibly re-scoring, or not: perhaps that’s only for BQ or PQ)
  5. Weaviate returns the result
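
A near-vector query with a query vector produced off-site might look roughly like this (again, just a sketch; the builder names are the v4 Go client’s as I recall them):

// Sketch: near-vector search using a query vector produced off-site.
package main

import (
    "context"

    "github.com/weaviate/weaviate-go-client/v4/weaviate"
    "github.com/weaviate/weaviate-go-client/v4/weaviate/graphql"
)

func search(ctx context.Context, client *weaviate.Client, queryVec []float32) error {
    nearVector := client.GraphQL().NearVectorArgBuilder().WithVector(queryVec)

    result, err := client.GraphQL().Get().
        WithClassName("Document").
        WithFields(graphql.Field{Name: "text"}).
        WithNearVector(nearVector).
        WithLimit(10).
        Do(ctx)
    if err != nil {
        return err
    }
    _ = result // result.Data holds the GraphQL response
    return nil
}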

Is my understanding way off? I just want a general feel for what’s occurring; I don’t want to waste a lot of CPU cycles doing something stupid. It doesn’t need to be perfect.

Server Setup Information

  • Weaviate Server Version: 1.28.4
  • Deployment Method: Docker compose
  • Multi Node? No, single, for the time being (perhaps a year or two down the line this will change)
  • Client Language and Version: Go 1.23
  • Multitenancy?: Yes, but set to a single value (only during testing do I set this to a random UUID per test)

Any additional Information

  • I understand and have considered binary quantisation: a consideration for the future, if I need it

Hi @xbc5 !! Welcome to our community :hugs:

One thing to consider here is that Weaviate will quantize the vectors for you, so you should bring your vectors as they are (float32).

Even if your model produces int8 vectors, it’s still recommended to use SQ in Weaviate. Here’s why:

  1. Weaviate’s SQ implementation is designed to work with float32 input vectors and optimize them for storage and comparison within Weaviate’s index.

  2. SQ in Weaviate analyzes your data and distributes dimension values into 256 buckets, which may differ from the quantization scheme used by your model.
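
To illustrate the general idea (this is not Weaviate’s actual implementation, just a sketch of scalar quantization): each float32 dimension is mapped into one of 256 buckets between a minimum and maximum learned from a sample of your data.

// Illustrative only: mapping one float32 dimension value into one of 256
// buckets, given a min/max estimated from a training sample.
func quantizeDim(x, minVal, maxVal float32) uint8 {
    if maxVal <= minVal {
        return 0
    }
    bucket := (x - minVal) / (maxVal - minVal) * 255
    if bucket < 0 {
        bucket = 0
    } else if bucket > 255 {
        bucket = 255
    }
    return uint8(bucket)
}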

Your understanding is generally correct, but with a few adjustments:

  1. Vectorize off-site using the unquantized Snowflake Arctic embed model, producing float32 vectors.

  2. Send these float32 vectors to Weaviate.

  3. If SQ is enabled, Weaviate will store the uncompressed vectors but index them using SQ (8-bit integers).
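
For reference, enabling SQ when creating the collection could look roughly like this with the Go client (a sketch; the exact vectorIndexConfig keys should be checked against the SQ docs for your version):

// Sketch: collection with vectorizer "none" and SQ enabled on the HNSW index.
package main

import (
    "context"

    "github.com/weaviate/weaviate-go-client/v4/weaviate"
    "github.com/weaviate/weaviate/entities/models"
)

func createCollection(ctx context.Context, client *weaviate.Client) error {
    class := &models.Class{
        Class:           "Document",
        Vectorizer:      "none",
        VectorIndexType: "hnsw",
        VectorIndexConfig: map[string]interface{}{
            // Key names under "sq" are assumptions; verify against the docs.
            "sq": map[string]interface{}{
                "enabled":       true,
                "trainingLimit": 100000, // objects used to learn per-dimension min/max
            },
        },
    }
    return client.Schema().ClassCreator().WithClass(class).Do(ctx)
}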

For querying:

  1. Vectorize the query off-site, producing a float32 vector (not int8).

  2. Send this float32 query vector to Weaviate.

  3. Weaviate will compare the query vector against the SQ-compressed index.

  4. Weaviate may perform additional steps like over-fetching and re-scoring to improve recall.

One thing to consider is using, if possible, the dynamic index, which lets you define a threshold at which Weaviate moves from a flat (disk-based) index to an HNSW (in-memory) index.
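
A rough sketch of what a dynamic index configuration could look like (the exact keys and availability depend on your Weaviate version, and async indexing needs to be enabled, so please treat this as an outline rather than a verified config):

// Sketch: a collection using the dynamic vector index, which starts flat
// and switches to HNSW once the threshold is crossed. Key names are
// assumptions; check the dynamic index docs for your version.
package main

import (
    "context"

    "github.com/weaviate/weaviate-go-client/v4/weaviate"
    "github.com/weaviate/weaviate/entities/models"
)

func createDynamicCollection(ctx context.Context, client *weaviate.Client) error {
    class := &models.Class{
        Class:           "Document",
        Vectorizer:      "none",
        VectorIndexType: "dynamic",
        VectorIndexConfig: map[string]interface{}{
            "threshold": 10000, // object count at which the index switches from flat to HNSW
            "hnsw": map[string]interface{}{
                "sq": map[string]interface{}{"enabled": true},
            },
            "flat": map[string]interface{}{},
        },
    }
    return client.Schema().ClassCreator().WithClass(class).Do(ctx)
}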

Let me know if this helps!

Thanks!

That does help a lot; thanks for the quick response.

Vectorize the query off-site, producing a float32 vector (not int8).

SQ in Weaviate analyzes your data and distributes dimension values into 256 buckets, which may differ from the quantization scheme used by your model.

I misunderstood. It turns out that the model that I use[1] produces f32 vectors (despite being a quantised model).

Are these suitable for both query and initial vectorisation? Essentially I am asking: If I use it for both, will the SQ index work? Or should I use the original (unquantised) model[2] for initial vectorisation?

Thanks for putting up with my stupid questions.

Model Details


Snowflake arctic-embed sm ONNX, provided by Weaviate[1:1].

This model produces floats, but it’s clearly a quantised model.

Its ort_config.json:

Full config
{
  "one_external_file": true,
  "opset": null,
  "optimization": {},
  "quantization": {
    "activations_dtype": "QUInt8",
    "activations_symmetric": false,
    "format": "QOperator",
    "is_static": false,
    "mode": "IntegerOps",
    "nodes_to_exclude": [],
    "nodes_to_quantize": [],
    "operators_to_quantize": [
      "Conv",
      "MatMul",
      "Attention",
      "LSTM",
      "Gather",
      "Transpose",
      "EmbedLayerNormalization"
    ],
    "per_channel": false,
    "qdq_add_pair_to_weight": false,
    "qdq_dedicated_pair": false,
    "qdq_op_type_per_channel_support_to_axis": {
      "MatMul": 1
    },
    "reduce_range": false,
    "weights_dtype": "QUInt8",
    "weights_symmetric": true
  },
  "use_external_data_format": false
}
The relevant quantisation settings:

{
  ...
  "quantization": {
    "activations_dtype": "QUInt8",
    "mode": "IntegerOps",
    "weights_dtype": "QUInt8",
    ...
  },
}

t2v-transformers-1 | INFO: Running on CPU
t2v-transformers-1 | INFO: Running ONNX vectorizer with quantized model for amd64 (AVX2)


  1. Snowflake arctic-embed sm ONNX (quantised for CPU inference): semitechnologies/transformers-inference:snowflake-snowflake-arctic-embed-s-onnx-1.10.1

  2. Snowflake/snowflake-arctic-embed-s · Hugging Face (unquantised; f32)

Hi!! Not stupid at all! Quite the opposite :slight_smile:

And please, feel free to reach out to us and ask any questions at any time. We are here to help!

If you have a vectorizer configured, Weaviate will vectorize your content. Or, if you bring your own vector, Weaviate will quantize the vector for you.

As you noted, the model you have chosen will vectorize using f32, so it should work as expected. Now you can either generate your own vectors and provide them while creating the object, or let Weaviate vectorize it for you (see the sketch below).
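
For example, both options could look roughly like this with the Go client (method names as I recall them; please double-check against the client docs):

// Sketch: creating one object with your own vector vs. letting the
// configured vectorizer module produce it.
package main

import (
    "context"

    "github.com/weaviate/weaviate-go-client/v4/weaviate"
)

func createExamples(ctx context.Context, client *weaviate.Client, vec []float32) error {
    // Option 1: bring your own vector (works with vectorizer "none").
    _, err := client.Data().Creator().
        WithClassName("Document").
        WithProperties(map[string]interface{}{"text": "hello"}).
        WithVector(vec).
        Do(ctx)
    if err != nil {
        return err
    }

    // Option 2: omit WithVector and let the configured module vectorize the text.
    _, err = client.Data().Creator().
        WithClassName("Document").
        WithProperties(map[string]interface{}{"text": "hello again"}).
        Do(ctx)
    return err
}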

Let me know if this clarifies!

Thanks!

By the way, every week we host some nice events that you will certainly benefit from.

Pro Tip: if you can’t join, make sure to subscribe and you’ll get the recording delivered to your email :slight_smile:

Thanks.

I was concerned that the ONNX model would introduce a small loss in accuracy, because it does inference internally in int8 (while outputting f32), and that further quantising the vectors during indexing would exacerbate that loss.

By the way, every week we host some nice events that you will certainly benefit from

Looks good. I’ll give it a try. The one on Mar 6th looks interesting.

1 Like