Description
I am a beginner with no real experience in AI, so deep technical knowledge of the topic escapes me. I just want to clarify some details at a high level, because some of these questions are difficult to answer from the documentation alone.
My primary goal is to run a quantised model for CPU inference (the ONNX images); I need to squeeze as much as I can out of the CPU for semantic search.
This is my proposed setup:
- quantised model: Snowflake Arctic embed sm (30m; 384d; Q8) – the ONNX image provided by Weaviate
- bring your own vectors: the vectoriser is set to none because it’s a static database, and I will vectorise off-site
- scalar quantisation: I think I need this, right? If the model produces int8 output, shouldn't it be compared against an int8 index?
My setup reflects what I think is required; these questions will help clarify my confusion:
a. do I bring my own vectors as f32?
- weaviate-go accepts only f32, so I assume that it quantises the vectors as it indexes them
- would the unquantised model yield suitable vectors? (If I understand correctly, Weaviate quantises them before indexing – but are the int8 vectors produced by the quantised model suitable to match against Weaviate's int8 index vectors?)
b. do I need to use SQ if the model that I use produces int8 vectors?
- if not, how does that work? How does it compare vectors (are they int8, float32, or something else)? I assumed that the Q8 model produces int8, which needs to be compared against int8 (the SQ index)
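To show where my confusion sits, here is my mental model of scalar quantisation, sketched in Go. This is only my conceptual guess, not Weaviate's actual implementation (my assumptions: each float32 dimension is clamped into a [min, max] range and mapped onto one of 256 int8 buckets; I believe Weaviate derives that range from a training sample of the indexed vectors):

```go
package main

import (
	"fmt"
	"math"
)

// quantiseSQ maps each float32 dimension onto one of 256 int8 buckets
// between min and max. Conceptual sketch only, not Weaviate's code.
func quantiseSQ(v []float32, min, max float32) []int8 {
	out := make([]int8, len(v))
	scale := float64(max - min)
	for i, x := range v {
		// clamp into [0, 1], scale into [0, 255], shift into [-128, 127]
		r := (float64(x) - float64(min)) / scale
		r = math.Min(1, math.Max(0, r))
		out[i] = int8(math.Round(r*255) - 128)
	}
	return out
}

func main() {
	v := []float32{-1, 0, 1}
	fmt.Println(quantiseSQ(v, -1, 1)) // min → -128, midpoint → 0, max → 127
}
```

If this picture is roughly right, then the int8 vectors produced directly by the Q8 model would use a different scale than the int8 codes in the SQ index, which is the core of my question.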
c. if the vectoriser is “none”: do I vectorise my own queries (via an HTTP request to the inference container)?
- I will set the vectoriser to none, because it’s a static database, and I intend to vectorise everything off-site
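To make that concrete, this is how I imagine calling the inference container directly. I'm assuming it exposes a POST /vectors endpoint that accepts {"text": ...} and returns a JSON body containing a "vector" field – please correct me if the endpoint or payload shape is different:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// vectorRequest is the payload shape I assume the container accepts.
type vectorRequest struct {
	Text string `json:"text"`
}

// vectorResponse is the shape I assume comes back; only Vector matters here.
type vectorResponse struct {
	Vector []float32 `json:"vector"`
}

// buildVectorRequest marshals the query text into the request body.
func buildVectorRequest(text string) ([]byte, error) {
	return json.Marshal(vectorRequest{Text: text})
}

// vectorize POSTs the query text to the container's /vectors endpoint
// (an assumption on my part) and decodes the returned embedding.
func vectorize(baseURL, text string) ([]float32, error) {
	body, err := buildVectorRequest(text)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/vectors", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var vr vectorResponse
	if err := json.NewDecoder(resp.Body).Decode(&vr); err != nil {
		return nil, err
	}
	return vr.Vector, nil
}

func main() {
	body, _ := buildVectorRequest("what is semantic search?")
	fmt.Println(string(body))
}
```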
Generally speaking, my approach to (and understanding of) batch loading the database is:
- vectorise off-site using the unquantised Snowflake Arctic embed model, producing f32 vectors
- send those to Weaviate via my gRPC server (as f32)
- Weaviate creates the collection and stores the uncompressed vectors, but indexes them as int8 (SQ)
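My batch-loading step, as I picture it, would package each text with its pre-computed f32 vector, roughly matching what I understand Weaviate's batch endpoint (POST /v1/batch/objects) expects. A sketch, with the "Doc" class name and "text" property as my own placeholder assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// batchObject mirrors my understanding of one entry in Weaviate's REST
// batch endpoint: the pre-computed f32 vector travels alongside the
// object's properties. Field names are my assumption about the API shape.
type batchObject struct {
	Class      string                 `json:"class"`
	Properties map[string]interface{} `json:"properties"`
	Vector     []float32              `json:"vector"`
}

type batchRequest struct {
	Objects []batchObject `json:"objects"`
}

// buildBatch packages texts and their off-site f32 embeddings for upload.
func buildBatch(class string, texts []string, vectors [][]float32) ([]byte, error) {
	req := batchRequest{}
	for i, t := range texts {
		req.Objects = append(req.Objects, batchObject{
			Class:      class,
			Properties: map[string]interface{}{"text": t},
			Vector:     vectors[i],
		})
	}
	return json.Marshal(req)
}

func main() {
	body, _ := buildBatch("Doc", []string{"hello"}, [][]float32{{0.1, 0.2}})
	fmt.Println(string(body))
}
```

In practice I'd send this through the Go client's batcher rather than raw HTTP, but the key point is the same: the vectors leave my side as f32.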
Now, a query comes in:
- I send it to the Snowflake Arctic embed model for vectorisation, which returns an int8 vector
- It’s then sent to Weaviate, and a near-vector search commences
- Weaviate compares the int8 query to the SQ index (int8)
- some other magic happens (possibly re-scoring, or not: perhaps that’s only for BQ or PQ)
- Weaviate returns the result
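The comparison step in the middle of that list, as I naively picture it (again, a guess at the concept, not Weaviate's code): distances over int8 vectors can be accumulated in a wider integer type, which is why I expect the quantised index to be cheap on the CPU:

```go
package main

import "fmt"

// dotInt8 computes a dot product over int8 vectors, accumulating in int32
// so the intermediate products cannot overflow. Conceptual sketch only of
// why int8-vs-int8 comparison is cheap, not Weaviate's implementation.
func dotInt8(a, b []int8) int32 {
	var acc int32
	for i := range a {
		acc += int32(a[i]) * int32(b[i])
	}
	return acc
}

func main() {
	q := []int8{127, -128, 0}
	d := []int8{127, 127, 64}
	fmt.Println(dotInt8(q, d)) // 127*127 + (-128)*127 + 0*64
}
```

If Weaviate actually re-expands the index codes to float for comparison, then this picture is wrong, which is exactly what I'd like confirmed.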
Is my understanding way off? I just want a general feel for what’s occurring; I don’t want to waste a lot of CPU cycles doing something stupid. It doesn’t need to be perfect.
Server Setup Information
- Weaviate Server Version: 1.28.4
- Deployment Method: Docker compose
- Multi Node? No, single, for the time being (perhaps a year or two down the line this will change)
- Client Language and Version: Go 1.23
- Multitenancy?: Yes, but set to a single value (only during testing do I set this to a random UUID per test)
Any additional Information
- I understand and have considered binary quantisation: a consideration for the future, if I need it