How to format tabular data for optimal RAG?

nik · June 28, 2024, 10:00am

Description

I want to build a RAG application with a lot of essentially tabular data (taken from a Postgres database). While I understand that this is an empirical question, I wonder if there is any best practice on this:

How should data be formatted for best results in a RAG chatbot application?

Specifically, I wonder whether information of different data types (e.g. datetime, int, float and str) should be concatenated into a long string, perhaps with an additional ID which is not to be vectorised; or if these different columns should be left as is.

Let’s say the core element of my data is the str – this is what we want to vectorise. The string might contain a comment on a social media post or a forum. BUT, we also want this post to be considered by the RAG based on the date of its posting (originally a datetime), or the user_id of its author. Should we concatenate everything into a string which is like this:

On YYYY-MM-DD HH:MM:SS TZ user 123456 posted 'hello this is the post' in forum 12345678. The post received 4 likes and 6 comments and was re-posted 1 times.

and then vectorise everything, or should we only vectorise the contents of the post, and keep the other data as metadata, stored in Weaviate in its original types, e.g. datetime and int?
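
To make the two options concrete, here is a minimal sketch in plain Python. The row and its field names (`posted_at`, `user_id`, `forum_id` and so on) are made up for illustration and are not taken from a real schema.

```python
from datetime import datetime, timezone

# Hypothetical row, shaped like the example sentence above.
row = {
    "posted_at": datetime(2024, 6, 28, 10, 0, 0, tzinfo=timezone.utc),
    "user_id": 123456,
    "forum_id": 12345678,
    "content": "hello this is the post",
    "likes": 4,
    "comments": 6,
    "reposts": 1,
}


def as_single_string(r):
    """Option A: flatten every column into one sentence and vectorise all of it."""
    ts = r["posted_at"].strftime("%Y-%m-%d %H:%M:%S %Z")
    return (
        f"On {ts} user {r['user_id']} posted '{r['content']}' "
        f"in forum {r['forum_id']}. The post received {r['likes']} likes "
        f"and {r['comments']} comments and was re-posted {r['reposts']} times."
    )


def as_text_plus_metadata(r):
    """Option B: vectorise only the post text; keep the other columns typed."""
    metadata = {k: v for k, v in r.items() if k != "content"}
    return r["content"], metadata


print(as_single_string(row))
text, metadata = as_text_plus_metadata(row)
print(text, metadata)
```

In option B the datetime and the integers keep their types, so they stay usable for range and equality filters instead of being folded into the embedding.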

I’ve been experimenting with this, but haven’t really had promising results either way.

DudaNogueira · June 28, 2024, 7:40pm

Hi @nik !

I believe this is the kind of route to follow.

Now, instead of concatenating everything yourself, you can map those columns from PG to properties in Weaviate.

So whenever Weaviate ingests your data, it will use those fields and concatenate them for you.

See here for more on the vectorization part:

So, for example, if you map your data, you will end up with a vectorizable text of the form:

CollectionName property1_name property1_value property2_name property2_value
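
As a rough illustration of that pattern, the concatenation can be mimicked in a few lines of Python. This is a simplified sketch, not Weaviate's actual implementation: the real vectorizer modules have their own rules (for example about which data types and names are included), and `vectorizable_text` is a hypothetical helper.

```python
def vectorizable_text(collection, obj, skip=()):
    """Join the collection name with 'name value' pairs for every non-skipped property."""
    parts = [collection]
    for name, value in obj.items():
        if name in skip:
            continue  # skipped properties contribute nothing to the text
        parts.extend([name, str(value)])
    return " ".join(parts)


# Hypothetical object with two text properties.
post = {"content": "hello this is the post", "forum": "general"}

print(vectorizable_text("Post", post))
print(vectorizable_text("Post", post, skip=("forum",)))
```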

You can now also filter your vector searches on those same properties. And if any of those properties changes, the object gets vectorized again.
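
Conceptually, a filtered vector search narrows the candidates by their typed properties and then ranks what is left by similarity. The toy below shows the idea in plain Python; it does not use the Weaviate client, and the posts, two-dimensional vectors and the `filtered_search` helper are invented for illustration.

```python
import math
from datetime import datetime, timezone


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up objects: typed metadata plus a toy embedding of the post text.
posts = [
    {"content": "hello this is the post", "user_id": 123456,
     "posted_at": datetime(2024, 6, 1, tzinfo=timezone.utc), "vector": [0.9, 0.1]},
    {"content": "an older post", "user_id": 123456,
     "posted_at": datetime(2023, 1, 1, tzinfo=timezone.utc), "vector": [0.8, 0.2]},
    {"content": "someone else's post", "user_id": 999,
     "posted_at": datetime(2024, 6, 2, tzinfo=timezone.utc), "vector": [1.0, 0.0]},
]


def filtered_search(query_vector, user_id, since, limit=5):
    """Keep posts matching the metadata filter, then rank them by similarity."""
    candidates = [
        p for p in posts
        if p["user_id"] == user_id and p["posted_at"] >= since
    ]
    candidates.sort(key=lambda p: cosine(query_vector, p["vector"]), reverse=True)
    return candidates[:limit]


hits = filtered_search(
    [1.0, 0.0],
    user_id=123456,
    since=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print([h["content"] for h in hits])
```

The date and author conditions are exact comparisons on typed values, which is something an embedding of the flattened sentence cannot guarantee.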

Note that you can select which fields get vectorized, using the skip configuration at the per-property level:
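
A minimal sketch of what such a collection definition might look like, written as the JSON-style schema built from a Python dict. The collection name `Post`, the property names and the choice of `text2vec-openai` are assumptions for illustration; substitute the vectorizer module you actually use and check the schema reference for your Weaviate version.

```python
import json

# Assumption: replace with the vectorizer module enabled on your instance.
VECTORIZER = "text2vec-openai"


def prop(name, data_type, skip):
    """Build one property entry with a per-property skip setting."""
    return {
        "name": name,
        "dataType": [data_type],
        "moduleConfig": {
            VECTORIZER: {"skip": skip, "vectorizePropertyName": False}
        },
    }


# Hypothetical collection: only the post text is vectorized,
# the other columns are stored as typed, filterable properties.
post_class = {
    "class": "Post",
    "vectorizer": VECTORIZER,
    "properties": [
        prop("content", "text", skip=False),
        prop("posted_at", "date", skip=True),
        prop("user_id", "int", skip=True),
        prop("likes", "int", skip=True),
    ],
}

print(json.dumps(post_class, indent=2))
```

With a definition along these lines, only `content` would feed the embedding, while `posted_at`, `user_id` and `likes` remain available for filtering.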

Let me know if this helps
