Poor retrieval quality while using CSV and XLSX files

Bijit · April 2, 2024, 10:55am

When working with CSVs using langchain’s CSV loader and recursive character text splitter, the retrieval quality is very poor.

A few records from the CSV:
id,first_name,last_name,date_of_birth,ethnicity,gender,status,entry_academic_period,exclusion_type,act_composite,act_math,act_english,act_reading,sat_combined,sat_math,sat_verbal,sat_reading,hs_gpa,hs_city,hs_state,hs_zip,email,entry_age,ged,english_2nd_language,first_generation

111111,John,Doe,01/2000,Hispanic,M,FT,Fall 2008,2.71,Albuquerque,New Mexico,87112,jdoe@example.com,17.9,FALSE,FALSE,TRUE

111112,Jane,Smith,05/2001,Hispanic,F,TRANSFER,Fall 2006,3.73,New York,New York,10009,jsmith@example.com,18.1,FALSE,FALSE,TRUE

…

If I ask for John Doe's date of birth, the John Doe entry shows up near the end of the retrieved results (when sorted by certainty).

We tried lowering the certainty threshold and increasing the number of retrieved results, but it is not helping. What would be the right way to deal with this issue?

Rohan_Purohit · April 3, 2024, 5:09am

I’m also facing a similar issue with CSV data. Any help would be appreciated.

DudaNogueira · April 17, 2024, 6:48pm

Hi!

I believe the issue here is how you are parsing your objects.

Considering that this is an object:

111111,John,Doe,01/2000,Hispanic,M,FT,Fall 2008,2.71,Albuquerque,New Mexico,87112,jdoe@example.com,17.9,FALSE,FALSE,TRUE

There is no way for it to know that 01/2000 is a birth date. You will need to include the header of the dataset in the content as well, so it has a chance to understand it better.
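
As a rough sketch of that idea (not a definitive implementation), each row can be paired with the header before it is embedded, so the stored text reads `date_of_birth: 01/2000` instead of a bare `01/2000`. The example uses only Python's standard `csv` module. `SAMPLE_CSV` is a shortened, hypothetical subset of the columns above, and `rows_to_documents` is an illustrative helper name, not part of langchain or any other library.

```python
import csv
import io

# Hypothetical, shortened sample: only a few of the columns from the thread.
SAMPLE_CSV = """id,first_name,last_name,date_of_birth,hs_city
111111,John,Doe,01/2000,Albuquerque
111112,Jane,Smith,05/2001,New York
"""


def rows_to_documents(csv_text):
    """Turn each CSV row into one text block of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        # Pair every value with its column name; skip empty cells.
        lines = [f"{column}: {value}" for column, value in row.items() if value]
        documents.append("\n".join(lines))
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(documents[0])
```

Each resulting string is one record with its field names attached. Keeping one record per stored object, rather than letting a text splitter cut a row apart, is a design choice of this sketch and would need to be checked against your own chunking setup.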
