Description
I want to build a RAG application on top of a lot of essentially tabular data (taken from a Postgres database). While I understand that this is an empirical question, I wonder if there is any best practice for this:
How should data be formatted for best results in a RAG chatbot application?
Specifically, I wonder whether information of different data types (e.g. datetime, int, float and str) should be concatenated into one long string (perhaps with an additional ID that is not vectorised), or whether the columns should be kept separate in their original types.
Let's say the core element of my data is the str: this is what we want to vectorise. It might contain a comment on a social media post or in a forum. However, we also want the RAG to take into account when the post was made (originally a datetime) or the user_id of its author. Should we concatenate everything into a single string like this:
On YYYY-MM-DD HH:MM:SS TZ user 123456 posted 'hello this is the post' in forum 12345678. The post received 4 likes and 6 comments and was re-posted 1 times.
and then vectorise everything, or should we only vectorise the contents of the post and keep the other data as metadata, stored in Weaviate in its original types, e.g. datetime and int?
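To make the two options concrete, here is a minimal sketch in plain Python of how one row could be prepared either way. The column names (`post_id`, `posted_at`, `reposts`, etc.), the sample values, and both helper functions are made up for illustration; this is not tied to any particular Weaviate client call.

```python
from datetime import datetime, timezone

# Hypothetical row as it might come out of Postgres; all names and values
# here are assumptions for illustration only.
row = {
    "post_id": 1,
    "posted_at": datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc),
    "user_id": 123456,
    "forum_id": 12345678,
    "content": "hello this is the post",
    "likes": 4,
    "comments": 6,
    "reposts": 1,
}


def to_concatenated_text(row: dict) -> str:
    """Option A: flatten every column into one string that gets vectorised."""
    posted = row["posted_at"].strftime("%Y-%m-%d %H:%M:%S %Z")
    return (
        f"On {posted} user {row['user_id']} posted '{row['content']}' "
        f"in forum {row['forum_id']}. The post received {row['likes']} likes "
        f"and {row['comments']} comments and was re-posted {row['reposts']} times."
    )


def to_text_and_metadata(row: dict) -> tuple[str, dict]:
    """Option B: vectorise only the post text; keep other columns typed."""
    metadata = {key: value for key, value in row.items() if key != "content"}
    return row["content"], metadata


text_a = to_concatenated_text(row)
text_b, metadata_b = to_text_and_metadata(row)

# With option B the date is still a real datetime, so it can be compared
# exactly (e.g. as a filter) instead of relying on embedding similarity.
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
is_recent = metadata_b["posted_at"] >= cutoff
```

In option A the date and IDs only influence retrieval through whatever the embedding model makes of them as text. In option B they stay as typed values that could be used for exact filtering, which, as far as I understand, is what storing them as non-vectorised properties would allow.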
I’ve been experimenting with this, but haven’t really had promising results either way.
Server Setup Information
- Weaviate Server Version:
- Deployment Method:
- Multi Node? Number of Running Nodes:
- Client Language and Version:
- Multitenancy?: