Hi,
We have implemented RAG with the Weaviate vector DB for 100+ of our clients, across different file formats. The problem is with CSV and XLSX files, which were processed with langchain's loaders and its Recursive Character Text Splitter. We inspected the row-wise representation and it looked fine; it was in the format {'col1': val1, 'col2': val2, …}.
While querying the CSV file (screenshot attached), for example with "What is John Doe's date of birth?", the retrieved context was very poor.
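For reference, here is a minimal sketch of the row-wise text that ends up in the `content` property. It uses only the standard library and is an approximation, not our actual langchain pipeline; `row_to_text` and the sample CSV are illustrative names and data.

```python
import csv
import io


def row_to_text(row: dict) -> str:
    """Render one CSV row as space-separated 'column: value' pairs.

    Empty values are kept as a bare 'column:' so the layout matches
    the stored content (illustrative helper, not a langchain API).
    """
    return " ".join(f"{key}: {value}".rstrip() for key, value in row.items())


# Tiny illustrative sample; the real file has many more columns.
sample_csv = "id,first_name,last_name,date_of_birth\n111111,John,Doe,01/2000\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
chunks = [row_to_text(row) for row in rows]
print(chunks[0])
```

Each row becomes one short string of column names and values, which is what gets vectorized for that object.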
Following is the data we are retrieving. The first query is "what is john date of birth":
neartext = {"concepts": ["what is john date of birth"]}
result = (
    client.query.get(
        "BMP",
        ["fileName", "content", "source"],
    )
    .with_tenant("c194e46b-4ef2-4fac-aa2e-655801ed622f")
    .with_limit(10)
    .with_where(
        {
            "path": ["fileName"],
            "operator": "ContainsAny",
            "valueString": ["5eadd227-c950-49a7-8331-8692d979b049"],
        }
    )
    .with_near_text(neartext)
    .do()
)
with the following result:
{'data': {'Get': {'BMP': [{'content': 'id: 111116 first_name: Jennifer last_name: Wilson date_of_birth: 01/2002 ethnicity: Asian gender: M status: TRANSFER entry_academic_period: Fall 2006 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 4.24 hs_city: Denver hs_state: Colorado hs_zip: 80012 email: jwilson@example.com entry_age: 18.5 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111113 first_name: Sarah last_name: Thomas date_of_birth: 21/2002 ethnicity: Hispanic gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type: act_composite: 14 act_math: 6 act_english: 5 act_reading: 3 sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.64 hs_city: Pheonix hs_state: Arizona hs_zip: 85006 email: sthomas@example.com entry_age: 17.6 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111117 first_name: Jessica last_name: Garcia date_of_birth: 01/2000 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type: act_composite: 25 act_math: 10 act_english: 5 act_reading: 10 sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: hs_city: Austin hs_state: Texas hs_zip: 78703 email: jgarcia@example.com entry_age: 18.8 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.71 hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 email: jdoe@example.com entry_age: 17.9 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111112 first_name: Jane last_name: Smith date_of_birth: 05/2001 ethnicity: Hispanic gender: F status: TRANSFER entry_academic_period: Fall 2006 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 3.73 hs_city: New York hs_state: New York hs_zip: 10009 email: jsmith@example.com entry_age: 18.1 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111115 first_name: Mike last_name: Davis date_of_birth: 31/2001 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type: act_composite: 22 act_math: 12 act_english: 5 act_reading: 5 sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 3.46 hs_city: Seattle hs_state: Washington hs_zip: 98106 email: mdavis@example.com entry_age: 18.2 ged: FALSE english_2nd_language: TRUE first_generation: FALSE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111119 first_name: Bob last_name: Lopez date_of_birth: 04/1998 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type: act_composite: 15 act_math: 5 act_english: 5 act_reading: 5 sat_combined: 720 sat_math: 110 sat_verbal: 400 sat_reading: 220 hs_gpa: 3.24 hs_city: Denver hs_state: Colorado hs_zip: 80122 email: blopez@example.com entry_age: 18.5 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111114 first_name: Frank last_name: Brown date_of_birth: 13/2002 ethnicity: Race/ethnicity unknown gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: 1450 sat_math: 520 sat_verbal: 510 sat_reading: 210 hs_gpa: 3.68 hs_city: Pheonix hs_state: Arizona hs_zip: 85015 email: fbrown@example.com entry_age: 19 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: first_name: last_name: date_of_birth: ethnicity: gender: status: entry_academic_period: exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: hs_city: hs_state: hs_zip: email: entry_age: ged: english_2nd_language: first_generation:',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: first_name: last_name: date_of_birth: ethnicity: gender: status: entry_academic_period: exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: hs_city: hs_state: hs_zip: email: entry_age: ged: english_2nd_language: first_generation:',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'}]}}}
With just "john" as the search:
neartext = {"concepts": ["john"]}
{'data': {'Get': {'BMP': [{'content': 'id: 111116 first_name: Jennifer last_name: Wilson date_of_birth: 01/2002 ethnicity: Asian gender: M status: TRANSFER entry_academic_period: Fall 2006 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 4.24 hs_city: Denver hs_state: Colorado hs_zip: 80012 email: jwilson@example.com entry_age: 18.5 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111113 first_name: Sarah last_name: Thomas date_of_birth: 21/2002 ethnicity: Hispanic gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type: act_composite: 14 act_math: 6 act_english: 5 act_reading: 3 sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.64 hs_city: Pheonix hs_state: Arizona hs_zip: 85006 email: sthomas@example.com entry_age: 17.6 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111112 first_name: Jane last_name: Smith date_of_birth: 05/2001 ethnicity: Hispanic gender: F status: TRANSFER entry_academic_period: Fall 2006 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 3.73 hs_city: New York hs_state: New York hs_zip: 10009 email: jsmith@example.com entry_age: 18.1 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111117 first_name: Jessica last_name: Garcia date_of_birth: 01/2000 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type: act_composite: 25 act_math: 10 act_english: 5 act_reading: 10 sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: hs_city: Austin hs_state: Texas hs_zip: 78703 email: jgarcia@example.com entry_age: 18.8 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111119 first_name: Bob last_name: Lopez date_of_birth: 04/1998 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type: act_composite: 15 act_math: 5 act_english: 5 act_reading: 5 sat_combined: 720 sat_math: 110 sat_verbal: 400 sat_reading: 220 hs_gpa: 3.24 hs_city: Denver hs_state: Colorado hs_zip: 80122 email: blopez@example.com entry_age: 18.5 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
{'content': 'id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.71 hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 email: jdoe@example.com entry_age: 17.9 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
'source': '/app/src/uploads/students - Sheet1.csv'},
We had set the retrieval limit to 3 and to 5, and in both cases the result for John ranked below that cutoff.
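To make the ranking concrete, a small helper like the one below can report where the expected row lands in a response. This is only a sketch: `rank_of` is a hypothetical helper, and the sample is a trimmed copy of the first response above with the content shortened and the order preserved.

```python
def rank_of(response: dict, class_name: str, needle: str):
    """Return the 1-based rank of the first hit whose content contains
    `needle`, or None if no hit matches."""
    hits = response["data"]["Get"][class_name]
    for position, hit in enumerate(hits, start=1):
        if needle in hit["content"]:
            return position
    return None


# Trimmed from the first response above (content shortened, same order).
sample = {"data": {"Get": {"BMP": [
    {"content": "id: 111116 first_name: Jennifer last_name: Wilson"},
    {"content": "id: 111113 first_name: Sarah last_name: Thomas"},
    {"content": "id: 111117 first_name: Jessica last_name: Garcia"},
    {"content": "id: 111111 first_name: John last_name: Doe"},
]}}}

john_rank = rank_of(sample, "BMP", "first_name: John")
print(john_rank)
```

Comparing this rank against the `with_limit` value shows whether the expected row would have reached the LLM as context.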
We tried multiple times: either the right contexts were ranked poorly or the retrieval itself was wrong. We also experimented with certainty and with increasing the number of retrievals, but that didn't address the issue. This behaviour was observed across both small and large files.
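For clarity, this is roughly how a certainty threshold can be attached to the nearText argument. It is a sketch only: `build_near_text` is a hypothetical helper, and the 0.7 threshold is an illustrative value, not one of the values we actually tried.

```python
from typing import Optional


def build_near_text(query: str, certainty: Optional[float] = None) -> dict:
    """Build the dict passed to .with_near_text(), optionally with a
    minimum certainty between 0 and 1 (hypothetical helper)."""
    near_text = {"concepts": [query]}
    if certainty is not None:
        if not 0.0 <= certainty <= 1.0:
            raise ValueError("certainty must be between 0 and 1")
        near_text["certainty"] = certainty
    return near_text


# 0.7 is an assumed example threshold.
near_text = build_near_text("what is john date of birth", certainty=0.7)
print(near_text)
```

The resulting dict replaces `neartext` in the query above. Adding `.with_additional(["certainty", "distance"])` to the same query should return the score of each hit, which helps when comparing how close the ranked rows are to each other.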
Please help us with this; it is blocking our critical deliveries.