Inaccurate search results

Hi,

We have implemented RAG with the Weaviate vector DB for 100+ of our clients, covering different file formats. The problem is with CSV and XLSX files, which were processed with langchain's loaders and its Recursive Character Text Splitter. We inspected the row-wise chunks and they looked fine, in the format {'col1': val1, 'col2': val2, …}.

When querying the CSV file (screenshot attached), for example with "What is John Doe's date of birth?", the retrieved context was very poor.

This is the data we are retrieving.

The first query is "what is john date of birth":

neartext = {"concepts": ["what is john date of birth"]}
result = (
    client.query.get(
        "BMP",
        ["fileName", "content", "source"],
    )
    .with_tenant("c194e46b-4ef2-4fac-aa2e-655801ed622f")
    .with_limit(10)
    .with_where(
        {
            "path": ["fileName"],
            "operator": "ContainsAny",
            "valueString": ["5eadd227-c950-49a7-8331-8692d979b049"],
        }
    )
    .with_near_text(neartext)
    .do()
)

with the following result:

{'data': {'Get': {'BMP': [{'content': 'id: 111116 first_name: Jennifer last_name: Wilson date_of_birth: 01/2002 ethnicity: Asian gender: M status: TRANSFER entry_academic_period: Fall 2006 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 4.24 hs_city: Denver hs_state: Colorado hs_zip: 80012 email: jwilson@example.com entry_age: 18.5 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111113 first_name: Sarah last_name: Thomas date_of_birth: 21/2002 ethnicity: Hispanic gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type:  act_composite: 14 act_math: 6 act_english: 5 act_reading: 3 sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 2.64 hs_city: Pheonix hs_state: Arizona hs_zip: 85006 email: sthomas@example.com entry_age: 17.6 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111117 first_name: Jessica last_name: Garcia date_of_birth: 01/2000 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type:  act_composite: 25 act_math: 10 act_english: 5 act_reading: 10 sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa:  hs_city: Austin hs_state: Texas hs_zip: 78703 email: jgarcia@example.com entry_age: 18.8 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 2.71 hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 email: jdoe@example.com entry_age: 17.9 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111112 first_name: Jane last_name: Smith date_of_birth: 05/2001 ethnicity: Hispanic gender: F status: TRANSFER entry_academic_period: Fall 2006 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 3.73 hs_city: New York hs_state: New York hs_zip: 10009 email: jsmith@example.com entry_age: 18.1 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111115 first_name: Mike last_name: Davis date_of_birth: 31/2001 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type:  act_composite: 22 act_math: 12 act_english: 5 act_reading: 5 sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 3.46 hs_city: Seattle hs_state: Washington hs_zip: 98106 email: mdavis@example.com entry_age: 18.2 ged: FALSE english_2nd_language: TRUE first_generation: FALSE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111119 first_name: Bob last_name: Lopez date_of_birth: 04/1998 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type:  act_composite: 15 act_math: 5 act_english: 5 act_reading: 5 sat_combined: 720 sat_math: 110 sat_verbal: 400 sat_reading: 220 hs_gpa: 3.24 hs_city: Denver hs_state: Colorado hs_zip: 80122 email: blopez@example.com entry_age: 18.5 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111114 first_name: Frank last_name: Brown date_of_birth: 13/2002 ethnicity: Race/ethnicity unknown gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined: 1450 sat_math: 520 sat_verbal: 510 sat_reading: 210 hs_gpa: 3.68 hs_city: Pheonix hs_state: Arizona hs_zip: 85015 email: fbrown@example.com entry_age: 19 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id:  first_name:  last_name:  date_of_birth:  ethnicity:  gender:  status:  entry_academic_period:  exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa:  hs_city:  hs_state:  hs_zip:  email:  entry_age:  ged:  english_2nd_language:  first_generation:',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id:  first_name:  last_name:  date_of_birth:  ethnicity:  gender:  status:  entry_academic_period:  exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa:  hs_city:  hs_state:  hs_zip:  email:  entry_age:  ged:  english_2nd_language:  first_generation:',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'}]}}}

With just "john" as the search:

neartext = {"concepts": ["john"]}

{'data': {'Get': {'BMP': [{'content': 'id: 111116 first_name: Jennifer last_name: Wilson date_of_birth: 01/2002 ethnicity: Asian gender: M status: TRANSFER entry_academic_period: Fall 2006 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 4.24 hs_city: Denver hs_state: Colorado hs_zip: 80012 email: jwilson@example.com entry_age: 18.5 ged: TRUE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111113 first_name: Sarah last_name: Thomas date_of_birth: 21/2002 ethnicity: Hispanic gender: M status: FTFT entry_academic_period: Fall 2006 exclusion_type:  act_composite: 14 act_math: 6 act_english: 5 act_reading: 3 sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 2.64 hs_city: Pheonix hs_state: Arizona hs_zip: 85006 email: sthomas@example.com entry_age: 17.6 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111112 first_name: Jane last_name: Smith date_of_birth: 05/2001 ethnicity: Hispanic gender: F status: TRANSFER entry_academic_period: Fall 2006 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 3.73 hs_city: New York hs_state: New York hs_zip: 10009 email: jsmith@example.com entry_age: 18.1 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111117 first_name: Jessica last_name: Garcia date_of_birth: 01/2000 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type:  act_composite: 25 act_math: 10 act_english: 5 act_reading: 10 sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa:  hs_city: Austin hs_state: Texas hs_zip: 78703 email: jgarcia@example.com entry_age: 18.8 ged: FALSE english_2nd_language: FALSE first_generation: FALSE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111119 first_name: Bob last_name: Lopez date_of_birth: 04/1998 ethnicity: White gender: F status: FTFT entry_academic_period: Fall 2007 exclusion_type:  act_composite: 15 act_math: 5 act_english: 5 act_reading: 5 sat_combined: 720 sat_math: 110 sat_verbal: 400 sat_reading: 220 hs_gpa: 3.24 hs_city: Denver hs_state: Colorado hs_zip: 80122 email: blopez@example.com entry_age: 18.5 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},
    {'content': 'id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 exclusion_type:  act_composite:  act_math:  act_english:  act_reading:  sat_combined:  sat_math:  sat_verbal:  sat_reading:  hs_gpa: 2.71 hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 email: jdoe@example.com entry_age: 17.9 ged: FALSE english_2nd_language: FALSE first_generation: TRUE',
     'fileName': '5eadd227-c950-49a7-8331-8692d979b049',
     'source': '/app/src/uploads/students - Sheet1.csv'},

We had set the retrieval limit to 3 and to 5, and in both cases the result for John ranked below the limit.

We tried multiple times: either the right contexts were poorly ranked, or the wrong contexts were retrieved altogether. We also experimented with certainty and with increasing the number of retrievals, but that did not address the issue. This behaviour was observed across both small and large files.

Please help us with this; it is blocking our critical deliveries.

Hi @AmanAda,
When I look at the output:

{
    'content': 'install: 0 user id: v3xT77clHUur8FVv game app: 1yXfAufLTi688uUPvfZm country: AT advertiser: gqFdk16BCaEiyOJZEPWl age: 2 user quality score: 3 createdat: 1675211911 spending: 0.0108862069 earning: 0 carrier: KGfaH mccmnc: 232678',
    'fileName': 'f20a9535-366b-443e-a5e2-5bbbce5c964a',
    'source': '/app/src/uploads/master_data.csv'
  },

The content doesn't contain a person's name or date of birth.
This is probably why Weaviate can't find John Doe and his date of birth.

I would recommend checking the langchain import process to make sure the required info is passed into Weaviate.

Also, do you have multiple collections?
Are you sure you are querying the right collection?

Thanks for replying @sebawita, but my question is about the students CSV data source from my original post (/app/src/uploads/students - Sheet1.csv), which was processed with langchain's loaders and its Recursive Character Text Splitter. We inspected the row-wise chunks and they looked fine, in the format {'col1': val1, 'col2': val2, …}.


Hi @AmanAda,

I suspect the content contains so much information that it dilutes the vectors generated by the ML model you use.

What I mean
When you take a piece of text like:
'id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 exclusion_type: act_composite: act_math: act_english: act_reading: sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.71 hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 email: jdoe@example.com entry_age: 17.9 ged: FALSE english_2nd_language: FALSE first_generation: TRUE'
The ML model will generate a single vector embedding for the whole thing, which blends many different pieces of information. So the vector for a long piece of text that contains the word "John" can be quite different from the vector for a short text like "John's date of birth" or "John" alone, since those short texts carry far less information to encode.
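To make the dilution effect concrete, here is a toy sketch. It does not use a real embedding model: it assumes simple bag-of-words count vectors (the helpers `bow` and `cosine` are made up for this illustration), so it only shows the general tendency that one matching token counts for less as the text around it grows.

```python
import math
from collections import Counter


def bow(text):
    """Toy 'embedding': lower-cased token counts (NOT a real ML model)."""
    return Counter(text.lower().replace(":", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = bow("john date of birth")

# A short chunk that keeps only a few fields (hypothetical, for comparison).
short_doc = bow("first_name: John date_of_birth: 01/2000")

# The full row as it is stored in "content" today.
long_doc = bow(
    "id: 111111 first_name: John last_name: Doe date_of_birth: 01/2000 "
    "ethnicity: Hispanic gender: M status: FT entry_academic_period: Fall 2008 "
    "exclusion_type: act_composite: act_math: act_english: act_reading: "
    "sat_combined: sat_math: sat_verbal: sat_reading: hs_gpa: 2.71 "
    "hs_city: Albuquerque hs_state: New Mexico hs_zip: 87112 "
    "email: jdoe@example.com entry_age: 17.9 ged: FALSE "
    "english_2nd_language: FALSE first_generation: TRUE"
)

short_sim = cosine(query, short_doc)
long_sim = cosine(query, long_doc)
print(f"short chunk: {short_sim:.3f}, full row: {long_sim:.3f}")
```

In both cases the only shared token is "john", but the full row scores lower because the match is spread across many more dimensions. Real embedding models behave differently in detail, so treat this as intuition only.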

Options

Option A – use Hybrid Search

You could use a combination of vector and keyword search with Hybrid.

Note: I would recommend using the new Python client v4 (the collections API used below), as it makes queries a lot easier to write.

bmp = client.collections.get("BMP")
response = bmp.query.hybrid(
    query="John Doe birthday",
    limit=5
)

for o in response.objects:
    print(o.properties)

You can also specify which properties the keyword part of the search should look at.
This would be particularly useful if you imported first_name, last_name, etc. as individual properties.
Here is a query example:

bmp = client.collections.get("BMP")
response = bmp.query.hybrid(
    query="John Doe birthday",
    query_properties=["content"],
    limit=5
)

for o in response.objects:
    print(o.properties)
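As a rough intuition for why hybrid search helps here, the sketch below fuses a vector ranking with a keyword ranking using a simple reciprocal-rank formula. This is a toy illustration, not Weaviate's implementation: the function `fuse`, the constant `k=60`, and the 50/50 `alpha` weighting are assumptions made for the example. The vector ranking is the order of the first five rows from the "what is john date of birth" result above, and the keyword ranking assumes that only John Doe's row contains the token "john".

```python
def fuse(vector_ranking, keyword_ranking, alpha=0.5, k=60):
    """Toy reciprocal-rank fusion of two ranked lists (best match first).

    alpha weights the vector ranking; (1 - alpha) weights the keyword ranking.
    """
    scores = {}
    for rank, doc in enumerate(vector_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank)
    for rank, doc in enumerate(keyword_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Order observed in the near_text result for "what is john date of birth".
vector_ranking = [
    "Jennifer Wilson",
    "Sarah Thomas",
    "Jessica Garcia",
    "John Doe",
    "Jane Smith",
]

# Assumption: a keyword search for "john" matches only this row.
keyword_ranking = ["John Doe"]

fused = fuse(vector_ranking, keyword_ranking)
print(fused)
```

Because John Doe's row appears in both lists, its fused score ends up highest, even though the vector search alone placed it fourth.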

Option B – reduce the content for vectorization

Alternatively, you could be more selective about what goes into "content", which makes the vectorized text more precise. Any other property (like act_composite or act_english) could still be stored in Weaviate as metadata.
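A minimal sketch of that split, done before the data reaches Weaviate. The helper `split_row` and the choice of `VECTOR_FIELDS` are assumptions for illustration; which columns belong in "content" depends on the questions you expect. The sample CSV uses a few of the columns from the students file.

```python
import csv
import io

# Assumption: these are the fields users are most likely to ask about.
VECTOR_FIELDS = ["first_name", "last_name", "date_of_birth"]


def split_row(row, vector_fields=VECTOR_FIELDS):
    """Return (content, metadata) for one CSV row.

    content  -> short text to vectorize, built only from vector_fields
    metadata -> remaining non-empty columns, kept as plain properties
    """
    content = " ".join(f"{k}: {row[k]}" for k in vector_fields if row.get(k))
    metadata = {
        k: v for k, v in row.items() if k not in vector_fields and v != ""
    }
    return content, metadata


sample_csv = (
    "id,first_name,last_name,date_of_birth,hs_gpa,hs_city,act_composite\n"
    "111111,John,Doe,01/2000,2.71,Albuquerque,\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
content, metadata = split_row(rows[0])

print(content)
print(metadata)
```

The resulting `content` string is what you would vectorize, while the `metadata` dict can be stored as separate properties and used for filtering or keyword search.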

Option C – use multi-vector approach

I am not sure whether this is possible directly with Langchain.
We recently introduced Named Vectors, which allow you to define multiple vectors per object.
So you could create one vector on first_name + last_name, another on content, and a third on something else. This would let you search on vectors built from specific properties, as opposed to everything.

Note: if you configure Weaviate to create three vectors per object, each object is vectorized three times, so you will pay 3x per object.

Here is an example of a collection with two Named Vectors:

from weaviate.classes.config import Configure

client.collections.create(
    "BMP",
    vectorizer_config=[
        # Set a named vector; this one uses Cohere
        Configure.NamedVectors.text2vec_cohere(
            name="name",
            source_properties=["first_name", "last_name"],  # Set the source property(ies)
        ),
        # Set another named vector; this one uses OpenAI (you can pick any vectorizer)
        Configure.NamedVectors.text2vec_openai(
            name="content",
            source_properties=["content"],  # Set the source property(ies)
        ),
    ],
)

Then you can query a specific vector space:

bmp = client.collections.get("BMP")
response = bmp.query.near_text(
    query="John Doe birthday",
    target_vector="name",  # search only the vector built from first_name + last_name
    limit=5
)