Search by numbers (RAG)

Description

Hello, everybody. Currently I’m trying to find a way to search records by numbers.
For example:
Records are represented as strings, e.g.:

"Report for Transaction ID 777:
Amount: 6.50 USD
Merchant: Demo merchant
Status: COMPLETE"

It works well when I search for a record by merchant name or status. But when I search by transaction ID, it returns many records with the wrong IDs (like 776 or 787).
I have tried several phrasings of the query, such as:

  1. Provide me info about transaction with ID 777
  2. Transaction ID 777
  3. 777

But I still can’t retrieve suitable data.

Server Setup Information

  • Weaviate Server Version: 1.26.0
  • Deployment Method: docker
  • Client Language and Version: v4
  • qna-transformers: installed via docker

Any additional Information

I’m creating a flow in langflow, so I tried to use different components:

  1. The base one (regular similarity search)
  2. Custom component with raw GraphQL query

hi @Nazarii_Zabolotnyi !!

The results will depend mostly on how you index the data, and on what you search for and how.

For example, by default a text field uses word tokenization (see the Weaviate docs on tokenization). This means the values of your properties are broken down into words, and the tokens stored in the index are derived from those words.

So if the middle of your text contains "data for transaction ID 1234", you will end up with roughly 4 tokens (some words are ignored, etc.). Let's say data, transaction, id (id may also be removed) and 1234.
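
To make the tokenization step concrete, here is a minimal sketch in plain Python. It only approximates word tokenization (lowercase, split on anything that is not a letter or digit) and does not model stopword handling or Weaviate's actual implementation; the function name `word_tokenize` is made up for this illustration.

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Approximate word tokenization: lowercase, split on non-alphanumeric chars.

    This is an illustration only, not Weaviate's code; stopwords are not removed.
    """
    return [tok for tok in re.split(r"[^0-9a-zA-Z]+", text.lower()) if tok]


record = "Report for Transaction ID 777: Amount: 6.50 USD"
tokens = word_tokenize(record)
print(tokens)

# Keyword matching works on whole tokens, so "777" matches this record
# while "776" or "787" would not.
print("777" in tokens, "776" in tokens)
```

Note that under this approximation "6.50" is split into two tokens, which is one reason numeric values are often easier to handle as separate typed properties than as part of a text blob.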

Content from the other indexed properties will also go into the inverted index.

Whenever you run a bm25 (keyword) search, this inverted index is what gets searched.

Also, your content can be embedded/vectorized. The content used to generate this vector will be a concatenation of some (or all) of your properties. You control this per property with the skip option (skip_vectorization in the Python v4 client).

The “problem” with tools like Langflow, LangChain and LlamaIndex is that they create the collection, index everything and make it work like magic.

But there may be some automatically created properties that you do not want to index. So what you can do is inspect the created collection, export the schema, and fine-tune it.

Now, with all the data indexed, you need to experiment to find what gives the best results: searching with bm25 alone, with near_text alone, and then mixing them with hybrid search.
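
Since you already have a custom component that sends raw GraphQL, one way to run this comparison is to generate the queries from a small helper. The sketch below only builds query strings; the collection name `Transaction` and the property name `text` are assumptions (use whatever Langflow actually created), and the hybrid variant assumes the collection has a vectorizer configured.

```python
import json


def _get(collection: str, search: str, properties: list[str], limit: int) -> str:
    """Wrap a search argument in a Weaviate GraphQL Get query."""
    fields = " ".join(properties)
    return (
        f"{{ Get {{ {collection}({search}, limit: {limit}) "
        f"{{ {fields} _additional {{ score }} }} }} }}"
    )


def bm25_query(collection: str, query: str, properties: list[str], limit: int = 5) -> str:
    """Keyword-only search restricted to the given properties."""
    search = f"bm25: {{query: {json.dumps(query)}, properties: {json.dumps(properties)}}}"
    return _get(collection, search, properties, limit)


def hybrid_query(collection: str, query: str, alpha: float,
                 properties: list[str], limit: int = 5) -> str:
    """Hybrid search: alpha=0 is pure keyword, alpha=1 is pure vector."""
    search = f"hybrid: {{query: {json.dumps(query)}, alpha: {alpha}}}"
    return _get(collection, search, properties, limit)


# "Transaction" and "text" are hypothetical names for this example.
print(bm25_query("Transaction", "777", ["text"]))
print(hybrid_query("Transaction", "Transaction ID 777", 0.25, ["text"]))
```

For an ID lookup, a low alpha (closer to keyword search) is usually the first thing to try, because the embedding of "777" is likely to be very close to that of "776" or "787".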

We recently had a really nice webinar on advanced rag techniques that you will certainly enjoy:

For example, while chunking your data, you can create better metadata, so that the closest object is surfaced more reliably for this kind of query.
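
As a rough sketch of what "better metadata" could look like for the records in the question: parse each record into typed fields before importing it, so the transaction ID becomes its own integer property that can be matched exactly with a filter instead of by similarity. The property names (`transaction_id`, `text`, and so on), the collection name `Transaction`, and the helper functions are assumptions made for this example, not something Langflow creates for you.

```python
import re

RECORD = """Report for Transaction ID 777:
Amount: 6.50 USD
Merchant: Demo merchant
Status: COMPLETE
"""


def extract_metadata(record: str) -> dict:
    """Turn one report string into typed fields (hypothetical property names)."""
    id_match = re.search(r"Transaction ID\s+(\d+)", record)
    if id_match is None:
        raise ValueError("no transaction ID found in record")
    # Collect "Key: value" lines such as "Amount: 6.50 USD".
    fields = dict(re.findall(r"^(\w+):[ \t]*(.+)$", record, flags=re.MULTILINE))
    amount, currency = fields["Amount"].split()
    return {
        "transaction_id": int(id_match.group(1)),
        "amount": float(amount),
        "currency": currency,
        "merchant": fields["Merchant"],
        "status": fields["Status"],
        "text": record.strip(),  # keep the original text for vector/keyword search
    }


def exact_id_query(collection: str, transaction_id: int) -> str:
    """GraphQL Get query that filters on the integer ID instead of searching text."""
    where = f'where: {{path: ["transaction_id"], operator: Equal, valueInt: {transaction_id}}}'
    return f"{{ Get {{ {collection}({where}, limit: 1) {{ text transaction_id }} }} }}"


meta = extract_metadata(RECORD)
print(meta)
print(exact_id_query("Transaction", meta["transaction_id"]))
```

With the ID stored as an integer property, a query like "Provide me info about transaction with ID 777" can be handled by extracting the number and applying the filter, while free-text questions about merchants or statuses still go through similarity or hybrid search.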

Here we have some other events that we are doing:

Let me know if this helps!

Thanks!