🔍 Seeking Solutions for Hybrid Search Challenges in Resume Parsing:

:page_facing_up: I’ve embarked on a project where I’ve collected about 20 resumes in PDF format. To efficiently extract details from each resume, I’ve employed an unstructured approach.

:jigsaw: Each document is segmented into chunks of up to 100 characters. These chunks are then loaded into Weaviate, a vector database, with three key properties: “resume_id” to identify the source resume, “content” to store the text chunk, and “paragraph_number” to maintain the sequence of information within each document.
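
For reference, here is a minimal sketch of the chunking step described above. The helper name `chunk_resume` and the choice to split on word boundaries are assumptions for illustration only; the actual pipeline may split the text differently.

```python
def chunk_resume(resume_id, text, max_chars=100):
    """Split text into chunks of at most max_chars, breaking on whitespace.

    Hypothetical helper: a single word longer than max_chars is kept whole,
    so such a chunk could exceed the limit.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding this word would overflow, so close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    # One record per chunk, using the three properties named in the post.
    return [
        {"resume_id": resume_id, "content": chunk, "paragraph_number": i}
        for i, chunk in enumerate(chunks)
    ]
```

With chunks this small, a name and a skill list from the same resume will almost always land in different records, which is the fragmentation problem described below.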

:memo: The content within these chunks varies, typically encompassing details like personal information (e.g., name, email, phone), career objectives, educational background, project experiences, skills, and location.

:mag: When I run a hybrid search to find developers proficient in specific technologies like Python and React, the paragraphs discussing skills receive high scores, but the paragraphs containing the user details needed to identify those developers are omitted. Even within the top 5 search results, paragraphs with user details are missing.

:thinking: How can I ensure the retrieval of relevant chunks, especially when context is fragmented across different chunks?

:thinking: Queries focusing on specific locations, such as “Give me a list of all developers residing in XYZ city,” pose challenges in locating the relevant chunks containing candidate details. While I can identify the paragraphs mentioning the location name, this alone is insufficient as I also require the candidates’ names. How can I address this issue and ensure retrieval of the necessary information?

:hammer_and_wrench: Seeking advice and solutions on optimizing search strategies to ensure retrieval of relevant chunks that encompass both user details and skill-specific information. Any suggestions on refining search parameters or utilizing advanced techniques for hybrid search would be greatly appreciated!

:mag: Despite tweaking parameters like the alpha value, switching between ranked fusion and relative score fusion, and trying various search methods (hybrid, BM25, nearText), I often miss out on chunks containing crucial user details, like names or locations.

Hi @Freddy ! Welcome to our community :hugs:

I think the query “Give me a list of all developers that live in XYZ” is too specific for a similarity search.

That query will give you all the chunks that are similar to it. As those chunks are tied to a candidate through resume_id, you could post-process the results and build a list of the candidates whose resume chunks are closest to the query.

One thing you could try is generating a ref2vec-centroid vector from the chunks that belong to each candidate.

That way you could query the candidate directly, using a centroid of all of their chunks.
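
To illustrate the idea, here is a rough sketch of what a centroid is: the element-wise mean of a candidate's chunk vectors. This is plain Python and not the Weaviate module itself; the ids and the two-dimensional vectors are made up, and I am assuming the mean is the aggregation you want.

```python
def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Made-up chunk vectors, grouped by the resume they came from.
chunk_vectors = {
    "resume-1": [[1.0, 0.0], [0.0, 1.0]],
    "resume-2": [[2.0, 2.0]],
}

# One vector per candidate, which can then be searched instead of the chunks.
candidate_vectors = {rid: centroid(vs) for rid, vs in chunk_vectors.items()}
```

Keep in mind that an average can blur a single strong signal such as a city name, so it may suit "what is this candidate like overall" queries better than exact lookups.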

Let me know if that helps.

PS: I am moving this topic to the General category, as it is better suited there :slight_smile:


Thank you, @DudaNogueira, for your prompt response. While I am receiving chunks that are similar to my query, the document is segmented into different sections. Consequently, I am receiving chunks related to location but missing those containing the developer’s name.

Do I need to perform any post-processing, or can I retrieve those chunks in the top results using only Weaviate? My queries involve “searching for names” and identifying individuals “living in specific locations” or “with a particular set of skills”.

The name should be available considering that resume_id is from a specific candidate, right?

So consider a query for “Candidates that live in Rio de Janeiro”.

Now, say you get these objects back:

resume_id=1, some chunk related to rio de janeiro
resume_id=1, some chunk related to copacabana
resume_id=2, some chunk vaguely related to Brazil
resume_id=3, some chunk unrelated to Brazil

Now you know that resume 1, resume 2, and resume 3, in that order, are the ones closest to your query.
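
As a minimal sketch of that post-processing step, assuming the hits come back as dicts already sorted best-first (the `hits` list and the `rank_resumes` name are hypothetical, mirroring the example above):

```python
def rank_resumes(hits):
    """Return resume ids ordered by their best-ranked chunk, without duplicates."""
    # dict.fromkeys keeps first-seen order, so each resume keeps the rank
    # of its highest-scoring chunk.
    return list(dict.fromkeys(hit["resume_id"] for hit in hits))


# Hits in the order the search returned them (best match first).
hits = [
    {"resume_id": 1, "content": "some chunk related to rio de janeiro"},
    {"resume_id": 1, "content": "some chunk related to copacabana"},
    {"resume_id": 2, "content": "some chunk vaguely related to Brazil"},
    {"resume_id": 3, "content": "some chunk nothing related to Brazil"},
]

ranked = rank_resumes(hits)
```

From the ranked ids you can then fetch the chunk that holds each candidate's name, for example by filtering on resume_id.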

One avenue you could explore is having the candidates as their own collection, with a cross-reference to their resume chunks.

That way you could store candidate-level metadata, such as the name, once on the candidate object and keep only the relations between the candidate and the chunks.
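
Here is a small in-memory sketch of that data model, just to show the shape. The class names, fields, and sample values are all hypothetical, and a plain Python list stands in for the cross-reference; this is not Weaviate client code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    content: str
    paragraph_number: int


@dataclass
class Candidate:
    resume_id: int
    name: str
    # Stands in for the cross-reference from a candidate to its chunks.
    chunks: list = field(default_factory=list)


def candidates_matching(candidates, term):
    """Return names of candidates with at least one chunk mentioning term."""
    term = term.lower()
    return [
        c.name
        for c in candidates
        if any(term in chunk.content.lower() for chunk in c.chunks)
    ]


# Placeholder data: the name lives on the candidate, not inside a chunk.
candidates = [
    Candidate(1, "Candidate A", [Chunk("Based in Rio de Janeiro", 0),
                                 Chunk("Skills: Python, React", 1)]),
    Candidate(2, "Candidate B", [Chunk("Based in Lisbon", 0)]),
]

names = candidates_matching(candidates, "rio de janeiro")
```

Because the name is a property of the candidate, any matching chunk leads back to it, whichever chunk happened to score highest.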
