Issues with searching titles

I have an object cluster with a schema where the title is vectorized:

[
  "name" => "title",
  "dataType" => ["text"],
  "description" => "The title of the document",
  "moduleConfig" => [
    "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]
  ]
],

However, when I run nearText or even hybrid queries asking for only the document with a specific title, that document either doesn’t appear in the results returned from Weaviate or appears lower on the list than expected. I know I could simply use the title field in a filtered search, but I am building a Q&A application where I won’t know in advance when a question will be of this type. I need some strategies for addressing this, because it will look awfully strange to have a document with a specific title that users can NOT find by searching for that title. Attached are examples.




Hey,

Others can comment on this as well, but the first modification I can think of is this: instead of querying with “show only content with title equal to ‘Drupal AI SolrAI - CSS’” or “Retrieve documents with title ‘Drupal AI SolrAI - CSS’”, try querying with just the title, e.g. “Drupal AI SolrAI - CSS”. This should yield a vector that is closer to the intended document and thus give better results.

Let me know if this helps improve results, if not we can try other strategies.

Thanks. Using just the title yields even worse results. I now think it fails because most of the titles that do come up share many of the same words, and a title is not really a “semantic” idea so much as a set of keywords signifying the content.

But, outside of filtering on the title itself, is there a way to do this? I mean, I kind of get it, but I also see the problems trying to explain to an end user why, if they put in the exact title to the exact document they are looking for, every other document but that one comes up in the search results. How is that better than keyword search?

This is the result after the document is actually found by Weaviate, but OpenAI still says it isn’t, even though the “title” is contained in the context document:

But, the good news is that I was able to get Weaviate to return the correct document by NOT using an AI generated concept or standalone question. Just sending the posted query directly to Weaviate:

So, with respect to Weaviate, the best response was show only ‘Drupal AI SolrAI - CSS’

Now, how to make that work in an environment where any question can be asked in any format. And, in the case of looking for a particular document, NOBODY is going to think “show only”.

Just to be clear, the semantic search DOES work when querying for the information that the doc Drupal AI SolrAI - CSS discusses.

This is what is on the page itself:

Right now, the strategy is: If you want to search based on the content of the document, use semantic search. If you want to search by titles or specific document names, keyword search.
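
One way to apply that strategy without asking the user to pick a search mode is to check whether the incoming question mentions a known document title and, if so, lean the hybrid search towards keywords. The following is only a minimal sketch under stated assumptions: `choose_alpha` is a hypothetical helper (not part of Weaviate), the list of known titles is assumed to be available to the application, and the two alpha values are arbitrary illustrations.

```python
def normalize(text):
    """Lowercase and collapse whitespace so matching is not tripped up by spacing or case."""
    return " ".join(text.lower().split())


def choose_alpha(query, known_titles, keyword_alpha=0.1, vector_alpha=0.75):
    """Pick a hybrid alpha for a query (hypothetical helper).

    Returns keyword_alpha (keyword-leaning) when the query contains one of the
    known titles verbatim, otherwise vector_alpha (vector-leaning).
    Both values are illustrative, not recommendations.
    """
    normalized_query = normalize(query)
    for title in known_titles:
        if normalize(title) in normalized_query:
            return keyword_alpha
    return vector_alpha


titles = ["Drupal AI SolrAI - CSS"]
print(choose_alpha("show only 'Drupal AI SolrAI - CSS'", titles))    # keyword-leaning
print(choose_alpha("How do I change the module's styling?", titles))  # vector-leaning
```

The returned value would then be passed as the alpha of a hybrid query, so a question that names a title behaves more like a keyword search while everything else stays semantic.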

Have you tried hybrid search and tweaking the value of alpha?


Yes, I did try hybrid search. The document I was seeking appeared higher on the list, depending on the query, but not at the top. What does tweaking the value of alpha mean?

So, I modified my system message to tell users to try keyword searching if their current search fails.

Tweaking alpha means weighting the scorer closer to keyword or vector search results. Alpha ranges from 0 to 1, with 0 meaning keyword search only and 1 meaning vector search only.
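
To build some intuition for the weighting, here is a deliberately simplified sketch in Python. It assumes both scores are already normalized to the 0 to 1 range and blends them with a plain weighted sum; this is an illustration only and not necessarily how Weaviate fuses results internally. The function name and the example scores are made up.

```python
def blend_scores(keyword_score, vector_score, alpha):
    """Weighted sum of two normalized scores (simplified illustration).

    alpha = 0 uses only the keyword score, alpha = 1 only the vector score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * keyword_score + alpha * vector_score


# Made-up scores: the exact-title document matches the keywords strongly,
# while a similar document sits slightly closer in vector space.
exact_title = {"keyword": 1.0, "vector": 0.4}
similar_doc = {"keyword": 0.3, "vector": 0.7}

for alpha in (0.75, 0.25):
    exact = blend_scores(exact_title["keyword"], exact_title["vector"], alpha)
    similar = blend_scores(similar_doc["keyword"], similar_doc["vector"], alpha)
    winner = "exact title" if exact > similar else "similar doc"
    print(f"alpha={alpha}: exact={exact:.3f} similar={similar:.3f} -> {winner}")
```

With these made-up numbers the similar document ranks first at the higher alpha, and the exact-title document ranks first once alpha is lowered.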

The following weights your search closer towards keywords:

{
  Get {
    JeopardyQuestion(
      limit: 3
      hybrid: {
        query: "food"
        alpha: 0.25
      }
    ) {
      question
      answer
    }
  }
}

See https://weaviate.io/developers/weaviate/search/hybrid#weight-keyword-vs-vector-results for more info.
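
If the query is assembled in application code, the alpha value can be made a parameter. The sketch below only builds the GraphQL text in the same shape as the example above; it does not send anything to Weaviate. `build_hybrid_query` is a hypothetical helper, and the class name `Document` and property `title` in the usage lines are assumptions standing in for your own schema.

```python
import json


def build_hybrid_query(class_name, query, alpha, properties, limit=3):
    """Return GraphQL text for a hybrid Get query (hypothetical helper)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # json.dumps yields a quoted, escaped string literal for the user's text.
    quoted_query = json.dumps(query)
    fields = "\n      ".join(properties)
    return (
        "{\n"
        "  Get {\n"
        f"    {class_name}(\n"
        f"      limit: {limit}\n"
        f"      hybrid: {{ query: {quoted_query}, alpha: {alpha} }}\n"
        "    ) {\n"
        f"      {fields}\n"
        "    }\n"
        "  }\n"
        "}"
    )


# "Document" and "title" are placeholders for the real class and property names.
print(build_hybrid_query("Document", "Drupal AI SolrAI - CSS", 0.25, ["title"]))
```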


Ah! Did not know that. Thank you!

Thank you @hsm207 and @felixthekraut

Turns out the solution is Hybrid. Note what now comes up using hybrid with gpt-3.5-turbo:

In each context document is the title of the document, so I don’t know why gpt-3.5-turbo doesn’t see it. However, using gpt-4:

Bottom line is that I have a better strategy to offer the end user to get closer to the answer. Thanks guys!


FYI, this is how I resolved the issue with gpt-3.5-turbo not recognizing the title: