Issues with searching titles

SomebodySysop · June 13, 2023, 2:49am

I have an object cluster with a schema where the title is vectorized:

[
  "name" => "title",
  "dataType" => ["text"],
  "description" => "The title of the document",
  "moduleConfig" => [
    "text2vec-openai" => [ "skip" => false, "vectorizePropertyName" => true ]
  ]
],

However, if I run nearText or even hybrid queries where I ask to return only the document with a specific title, that document either doesn’t appear in the list of documents returned from Weaviate, or appears lower on the list than expected. I know I could simply use the title field in a filtered search, but I am building a Q and A application where I won’t know when the question will be of this type. I need some strategies for addressing this, as it will look awfully strange to have a document with a specific title that users can NOT search for using the title. Attached are examples.

zainhas · June 13, 2023, 4:04am

Hey,

Others can comment on this as well, but the first modification I can think of is this: instead of querying with “show only content with title equal to ‘Drupal AI SolrAI - CSS’” or “Retrieve documents with title ‘Drupal AI SolrAI - CSS’”, try querying with just the title, e.g. “Drupal AI SolrAI - CSS”. This should yield a vector that is closer to the intended document and thus give better results.

Let me know if this helps improve results, if not we can try other strategies.

SomebodySysop · June 13, 2023, 6:42am

Thanks. Using just the title yields even worse results. Now, I am thinking that it fails because most of the titles that do come up contain many of the same words, and it’s not really a “semantic” idea as opposed to key words signifying the content.

But, outside of filtering on the title itself, is there a way to do this? I mean, I kind of get it, but I also see the problems trying to explain to an end user why, if they put in the exact title to the exact document they are looking for, every other document but that one comes up in the search results. How is that better than keyword search?

This is the result after the document is actually found by Weaviate, but OpenAI still says it isn’t, even though the “title” is contained in the context document:

But, the good news is that I was able to get Weaviate to return the correct document by NOT using an AI generated concept or standalone question. Just sending the posted query directly to Weaviate:

So, with respect to Weaviate, the query that gave the best response was: show only ‘Drupal AI SolrAI - CSS’

Now, how to make that work in an environment where any question can be asked in any format. And, in the case of looking for a particular document, NOBODY is going to think “show only”.

SomebodySysop · June 13, 2023, 6:55am

Just to be clear, the semantic search DOES work when querying for the information that the doc Drupal AI SolrAI - CSS discusses.

This is what is on the page itself:

Right now, the strategy is: If you want to search based on the content of the document, use semantic search. If you want to search by titles or specific document names, keyword search.
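One way to apply that strategy automatically could be sketched as follows. This is only an illustration: it assumes the application already holds a list of document titles, and the function name `pick_alpha` and both alpha values are hypothetical choices, not anything from Weaviate. The idea is to lean toward keyword matching when the question contains a known title, and toward semantic search otherwise.

```python
from typing import Iterable, Optional, Tuple


def pick_alpha(
    question: str,
    known_titles: Iterable[str],
    semantic_alpha: float = 0.75,  # assumed value: lean toward vector search
    keyword_alpha: float = 0.25,   # assumed value: lean toward keyword search
) -> Tuple[float, Optional[str]]:
    """Return (alpha, matched_title) for a hybrid query.

    If the question contains one of the known titles (case-insensitive),
    return the keyword-leaning alpha together with that title.
    Otherwise return the semantic-leaning alpha and None.
    """
    lowered = question.casefold()
    for title in known_titles:
        if title.casefold() in lowered:
            return keyword_alpha, title
    return semantic_alpha, None


titles = ["Drupal AI SolrAI - CSS"]
print(pick_alpha("show only 'Drupal AI SolrAI - CSS'", titles))
print(pick_alpha("How do I style the chat widget?", titles))
```

When a title is matched, the returned title could also be used for a filtered search on the title field instead of a hybrid query.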

hsm207 · June 13, 2023, 7:21am

Have you tried hybrid search and tweaking the value of alpha?

SomebodySysop · June 13, 2023, 9:26am

Yes, I did try hybrid search. The document I was seeking appeared higher on the list, depending upon the query, but not at the top. I do not know what this means: tweaking the value of alpha?

So, I modified my system message to tell users to try keyword searching if their current search fails.

felixthekraut · June 13, 2023, 11:48am

It means weighting the hybrid scorer toward either keyword or vector search results. alpha ranges from 0 to 1: 0 means keyword search only, and 1 means vector search only.
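To get a feel for the effect of alpha, here is a deliberately simplified sketch. It assumes both scores are already normalized to the range 0 to 1 and blends them with a plain weighted sum; Weaviate's actual fusion may work on ranks or on normalized scores rather than this exact formula, and the document scores below are made up for illustration.

```python
def blend(keyword_score: float, vector_score: float, alpha: float) -> float:
    """Weighted sum of two normalized scores.

    alpha = 0 uses only the keyword score, alpha = 1 only the vector score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * keyword_score + alpha * vector_score


# Made-up scores: doc_a matches the title words exactly,
# doc_b is semantically close but shares few keywords.
doc_a = {"keyword": 0.9, "vector": 0.4}
doc_b = {"keyword": 0.2, "vector": 0.8}

for alpha in (0.25, 0.75):
    a = blend(doc_a["keyword"], doc_a["vector"], alpha)
    b = blend(doc_b["keyword"], doc_b["vector"], alpha)
    winner = "doc_a" if a > b else "doc_b"
    print(f"alpha={alpha}: doc_a={a:.3f} doc_b={b:.3f} -> {winner} ranks first")
```

With these numbers the exact-title document ranks first at the lower alpha and drops to second at the higher one, which is the kind of shift to expect when searching by title.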

The following weights your search closer towards keywords:

{
  Get {
    JeopardyQuestion(
      limit: 3
      hybrid: {
        query: "food"
        alpha: 0.25
      }
    ) {
      question
      answer
    }
  }
}

See https://weaviate.io/developers/weaviate/search/hybrid#weight-keyword-vs-vector-results for more info.
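If you want to experiment with several alpha values for a title query, a small helper that builds the same kind of GraphQL query as a string might look like this. It is a sketch only: the class name `Document` and the function `build_hybrid_query` are hypothetical, and sending the query to Weaviate is left to whatever client you already use.

```python
import json
from typing import Sequence


def build_hybrid_query(
    class_name: str,
    query: str,
    alpha: float,
    limit: int = 3,
    fields: Sequence[str] = ("title",),
) -> str:
    """Build a GraphQL Get query using hybrid search with the given alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    field_block = "\n      ".join(fields)
    # json.dumps quotes and escapes the search text as a string literal.
    return (
        "{\n"
        "  Get {\n"
        f"    {class_name}(\n"
        f"      limit: {limit}\n"
        f"      hybrid: {{ query: {json.dumps(query)}, alpha: {alpha} }}\n"
        "    ) {\n"
        f"      {field_block}\n"
        "    }\n"
        "  }\n"
        "}"
    )


# "Document" is an assumed class name; substitute your own.
for alpha in (0.0, 0.25, 0.5):
    print(build_hybrid_query("Document", "Drupal AI SolrAI - CSS", alpha))
```

Running the same title through a few alpha values and comparing where the expected document lands is a quick way to pick a default for your data.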

SomebodySysop · June 13, 2023, 7:52pm

Ah! Did not know that. Thank you!

SomebodySysop · June 14, 2023, 1:05am

Thank you @hsm207 and @felixthekraut

Turns out the solution is Hybrid. Note what now comes up using hybrid with gpt-3.5-turbo:

In each context document is the title of the document, so I don’t know why gpt-3.5-turbo doesn’t see it. However, using gpt-4:

Bottom line is that I have a better strategy to offer the end user to get closer to the answer. Thanks guys!

SomebodySysop · June 14, 2023, 4:24am

FYI, this is how I resolved the issue with gpt-3.5-turbo not recognizing the title:
