How to format tabular data for optimal RAG?

Description

I want to build a RAG application on top of a lot of essentially tabular data (taken from a Postgres database). While I understand that this is an empirical question, I wonder if there is any best practice on this:

How should data be formatted for best results in a RAG chatbot application?

Specifically, I wonder whether information of different data types (e.g. datetime, int, float and str) should be concatenated into a long string, perhaps with an additional ID which is not to be vectorised; or if these different columns should be left as is.

Let’s say the core element of my data is the str: this is what we want to vectorise. The string might contain a comment on a social media post or a forum. But we also want the RAG to be able to take into account the date the post was made (originally a datetime) or the user_id of its author. Should we concatenate everything into a string like this:

On YYYY-MM-DD HH:MM:SS TZ user 123456 posted 'hello this is the post' in forum 12345678. The post received 4 likes and 6 comments and was re-posted 1 times.

and then vectorise everything, or should we only vectorise the contents of the post, and keep the other data as metadata, stored in weaviate in its original types, e.g. datetime and int?
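To make the two options concrete, here is a minimal sketch in plain Python of both representations for a single row. The Post dataclass and its column names are hypothetical, chosen only to match the example sentence above; nothing here is Weaviate-specific.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


@dataclass
class Post:
    # Hypothetical row shape; column names are assumptions for illustration.
    post_id: int
    user_id: int
    forum_id: int
    posted_at: datetime
    content: str
    likes: int
    comments: int
    reposts: int


def as_single_string(p: Post) -> str:
    """Option A: flatten every column into one sentence and embed all of it."""
    return (
        f"On {p.posted_at.isoformat()} user {p.user_id} posted '{p.content}' "
        f"in forum {p.forum_id}. The post received {p.likes} likes and "
        f"{p.comments} comments and was re-posted {p.reposts} times."
    )


def as_text_plus_metadata(p: Post) -> Tuple[str, Dict[str, Any]]:
    """Option B: embed only the post text; keep other columns in native types."""
    metadata = {
        "post_id": p.post_id,
        "user_id": p.user_id,
        "forum_id": p.forum_id,
        "posted_at": p.posted_at,  # stays a datetime, so range filters remain possible
        "likes": p.likes,
        "comments": p.comments,
        "reposts": p.reposts,
    }
    return p.content, metadata


row = Post(1, 123456, 12345678, datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
           "hello this is the post", 4, 6, 1)
text_a = as_single_string(row)
text_b, meta_b = as_text_plus_metadata(row)
```

With option A the IDs and counts become part of the embedded text; with option B they stay typed and can be used for exact or range filtering alongside the vector search.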

I’ve been experimenting with this, but haven’t really had promising results either way.

Hi @nik!

I believe this is the kind of route to follow.

Instead of concatenating everything yourself, you can map those columns from PG to properties in Weaviate.

Then, whenever Weaviate ingests your data, it will use those properties and concatenate them for you.

See here for more on the vectorization part:

So, for example, if you map your data, you will not only end up with a vectorizable text like:

CollectionName property1_name property1_value property2_name property2_value

You can now also filter your vector searches on those same properties. And if any of those properties changes, the object gets vectorized again :slight_smile:
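As a rough illustration of filtering on mapped properties, the sketch below builds a filter as a plain Python dict. The layout follows, to my understanding, the where filter format of Weaviate's GraphQL API (path, operator, valueInt, valueDate, operands); the property names user_id and posted_at and the helper function are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def posts_by_user_since(user_id: int, since: datetime) -> Dict[str, Any]:
    """Build a filter: author equals user_id AND posted_at >= since."""
    if since.tzinfo is None:
        # Date values are passed as RFC 3339 strings, which need an explicit offset.
        raise ValueError("since must be timezone-aware")
    return {
        "operator": "And",
        "operands": [
            {"path": ["user_id"], "operator": "Equal", "valueInt": user_id},
            {
                "path": ["posted_at"],
                "operator": "GreaterThanEqual",
                "valueDate": since.isoformat(),
            },
        ],
    }


where_filter = posts_by_user_since(123456, datetime(2024, 1, 1, tzinfo=timezone.utc))
```

A filter like this would be combined with a vector (near-text) query, so that the semantic search only ranks posts that already match the author and date constraints.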

Note that you can select which fields get vectorized, using the skip configuration at the per-property level:
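A sketch of what such a per-property skip setting can look like, written as a schema definition in a plain Python dict. The class name Post, the property names, and the choice of text2vec-openai as the vectorizer module are assumptions for illustration; the skip, vectorizePropertyName and vectorizeClassName keys are the module settings as I understand them, so please check them against the docs for your Weaviate version.

```python
import json
from typing import Any, Dict

VECTORIZER = "text2vec-openai"  # assumption: swap in whichever text2vec module you use


def prop(name: str, data_type: str, skip: bool) -> Dict[str, Any]:
    """Describe one property and whether the vectorizer should ignore it."""
    return {
        "name": name,
        "dataType": [data_type],
        "moduleConfig": {
            VECTORIZER: {"skip": skip, "vectorizePropertyName": False}
        },
    }


# Hypothetical collection: only the post text is embedded,
# everything else is kept as typed, filterable metadata.
post_class = {
    "class": "Post",
    "vectorizer": VECTORIZER,
    "moduleConfig": {VECTORIZER: {"vectorizeClassName": False}},
    "properties": [
        prop("content", "text", skip=False),
        prop("user_id", "int", skip=True),
        prop("forum_id", "int", skip=True),
        prop("posted_at", "date", skip=True),
        prop("likes", "int", skip=True),
    ],
}

vectorized = [
    p["name"]
    for p in post_class["properties"]
    if not p["moduleConfig"][VECTORIZER]["skip"]
]
schema_json = json.dumps(post_class, indent=2)
```

To my understanding the text2vec modules only embed text properties in the first place, so marking the int and date columns as skipped mostly serves to make the intent explicit.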

Let me know if this helps :slight_smile: