I’ve embarked on a project where I’ve collected about 20 resumes in PDF format. To efficiently extract details from each resume, I’ve employed an unstructured approach.
Each document is segmented into chunks, with each chunk containing up to 100 characters. These chunks are then inputted into Weaviate, a knowledge graph, with three key properties: “resume_id” to identify the source resume, “content” to store the text chunks, and “paragraph_number” to maintain the sequence of information within each document.
The content within these chunks varies, typically encompassing details like personal information (e.g., name, email, phone), career objectives, educational background, project experiences, skills, and location.
When performing a hybrid search, aiming to find developers proficient in specific technologies like Python and React. Despite receiving high scores for paragraphs discussing skills, the challenge lies in the omission of paragraphs containing crucial user details needed to identify developers proficient in both Python and React. Even within the top 5 search results, paragraphs with user details are missing.
How can I ensure the retrieval of relevant chunks, especially when context is fragmented across different chunks?
Queries focusing on specific locations, such as “Give me a list of all developers residing in XYZ city,” pose challenges in locating the relevant chunks containing candidate details. While I can identify the paragraphs mentioning the location name, this alone is insufficient as I also require the candidates’ names. How can I address this issue and ensure retrieval of the necessary information?
Seeking advice and solutions on optimizing search strategies to ensure retrieval of relevant chunks that encompass both user details and skill-specific information. Any suggestions on refining search parameters or utilizing advanced techniques for hybrid search would be greatly appreciated!
Despite tweaking parameters like alpha values, ranked and relative fusion, and employing various search methods (Hybrid, BM25, Near text), I often miss out on chunks containing crucial user details, like names or locations.