Poor retrieval quality while using CSV and XLSX files

While working with CSVs using LangChain’s CSV loader and recursive character text splitter, the retrieval quality is very poor.

A few records from the CSV:
id,first_name,last_name,date_of_birth,ethnicity,gender,status,entry_academic_period,exclusion_type,act_composite,act_math,act_english,act_reading,sat_combined,sat_math,sat_verbal,sat_reading,hs_gpa,hs_city,hs_state,hs_zip,email,entry_age,ged,english_2nd_language,first_generation

111111,John,Doe,01/2000,Hispanic,M,FT,Fall 2008,2.71,Albuquerque,New Mexico,87112,jdoe@example.com,17.9,FALSE,FALSE,TRUE

111112,Jane,Smith,05/2001,Hispanic,F,TRANSFER,Fall 2006,3.73,New York,New York,10009,jsmith@example.com,18.1,FALSE,FALSE,TRUE

If I ask for the date of birth of John Doe, the John Doe entry shows up towards the end of the retrieved results (when sorted by certainty).

We tried lowering the certainty threshold and increasing the number of retrieved results, but it is not helping. What would be the right way to deal with this issue?


I’m also facing a similar issue with CSV data. Any help would be appreciated.

Hi!

I believe the issue here is how you are parsing your objects.

Considering that this is an object:

111111,John,Doe,01/2000,Hispanic,M,FT,Fall 2008,2.71,Albuquerque,New Mexico,87112,jdoe@example.com,17.9,FALSE,FALSE,TRUE

There is no way that it will know that 01/2000 is a birth date. You will need to add the header of that dataset to the content as well, so it has a chance to understand it better.
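
As a rough sketch of that idea, each row can be paired with the column names before it is embedded, so that every chunk reads like "date_of_birth: 01/2000" instead of a bare list of values. This uses only Python's standard csv module. The function name rows_to_documents and the shortened sample data are made up for illustration and are not part of LangChain or of the original dataset.

```python
import csv
import io


def rows_to_documents(csv_text: str) -> list[str]:
    """Turn each CSV row into one text document of "column: value" lines.

    Hypothetical helper for illustration; not a LangChain API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        parts = []
        for col, val in row.items():
            # Skip overflow fields (col is None) and missing or empty values.
            if col is None or val is None or val == "":
                continue
            parts.append(f"{col}: {val}")
        docs.append("\n".join(parts))
    return docs


# Shortened, made-up sample using a few of the columns from the question.
sample = """id,first_name,last_name,date_of_birth,hs_city
111111,John,Doe,01/2000,Albuquerque
111112,Jane,Smith,05/2001,New York
"""

docs = rows_to_documents(sample)
print(docs[0])
```

Each resulting string can then be stored as a single object, one per row. If a text splitter is still applied afterwards, the chunk size should be large enough that a row is not cut apart, otherwise the values get separated from the names again.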