Semantic search is not as good as I expected

I just started with Verba and I am using the ADAEmbedder.

Query: “find me renovated apartment for less than 1 million”

The search ignores the "1 million" constraint entirely. It also does not match "refurbished apartments", even though refurbished is a synonym of renovated.

How can I make this simple search work well? Should I use a different embedder, or build a custom one myself?

I don't know much about embedders, as I am very new to LLMs.

hi @uma_shankar !!

The results will depend on the vectorizer used, the content you have indexed (how it was chunked, what metadata it carries, etc.), and the query.

I am not sure that expressing a filter in natural language ("less than 1 million") will get you the results you expect. :thinking:

You will, however, get the objects closest to your query based on vector distance.

Let me know if this helps!
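One common workaround is to pull the numeric constraint out of the query yourself and apply it as a structured filter, embedding only the semantic part (in Weaviate this maps to combining a `where` filter with a `nearText` search). Here is a minimal, self-contained sketch of that split; the `split_query` helper and its regex are hypothetical and only recognize "less than / under N million" phrasing, just to illustrate the idea:

```python
import re

def split_query(query: str):
    """Pull a numeric price ceiling out of a natural-language query.

    Hypothetical helper: it only handles 'less than N million' /
    'under N million' phrases, to show the constraint becoming a
    structured filter instead of being embedded with the rest.
    """
    match = re.search(r"(?:less than|under)\s+([\d.]+)\s*million", query, re.I)
    max_price = None
    if match:
        max_price = float(match.group(1)) * 1_000_000
        # Strip the price phrase so only the semantic part gets embedded.
        query = re.sub(
            r"(?:for\s+)?(?:less than|under)\s+[\d.]+\s*million",
            "", query, flags=re.I,
        ).strip()
    return query, max_price

semantic, ceiling = split_query("find me renovated apartment for less than 1 million")
print(semantic)  # "find me renovated apartment"
print(ceiling)   # 1000000.0
```

The semantic part then goes to the vector search, while the ceiling becomes a price filter on the results.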

Thanks!

Thanks, I understand.

I do not know which Vectorizer is best for my case “property recommendation”.

Should I do anything special with metadata during ingestion?
Or do I only need to add metadata at query time?

hi!

Choosing a vectorizer really takes some experimentation: whether it must be multilingual, the number of dimensions, etc.

Regarding metadata, consider for example 20 chunks, 10 for each of two documents. You will want to store at least the main title of the document with each chunk.

Otherwise each chunk carries only its own words, with no context about the document it came from.

With that said, you want to make sure that each chunk has not only its content, but also some metadata that adds context about what that chunk relates to.
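A minimal sketch of that idea: prepend the document title to every chunk before embedding, so the vector carries document-level context too (the helper name and listing text below are made up for illustration):

```python
def build_chunks(doc_title: str, passages: list[str]) -> list[str]:
    """Prefix each chunk with its document title so the embedding
    reflects the document it belongs to, not just the chunk's words."""
    return [f"{doc_title}\n\n{p}" for p in passages]

chunks = build_chunks(
    "2-bedroom renovated apartment, city centre",
    [
        "South-facing patio with garden access.",
        "Walking distance to two primary schools.",
    ],
)
print(chunks[0])
```

Without the title prefix, the second chunk would embed as a sentence about schools with no hint that it describes a renovated apartment.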

Let me know if this helps.

If you want to learn more about chunking, check these docs:

Thanks!

Hi,
I already have experience with this exact situation. After lots of testing, I found that vector search is not the best fit for property recommendations at all. What you need to do is define a set of features like price, size, location, and so on, then use GPT-4 to extract those features from the query as JSON, and use that JSON to search for properties in a SQL database. That gives far better results than using a vector database.
I have already done this, and the difference is huge.
A vector database is better for unstructured data, like large amounts of text; I use Weaviate for that and it works really well.

Hope that helps


thanks for sharing, @Ghattas_Salloum !!

Sure any time

have a great one


I did some more work on this, and it looks like @Ghattas_Salloum is right. I used metadata search (GPT-extracted) for filtering price, nr_bedrooms, etc. I then relied on dense search to match other criteria like amenities, a patio, etc.

But the vector search is not doing its job at all. For instance, with a simple query like "close to school", the apartment that is actually close to a school scores lower than another listing with no mention of a school at all.

This is the only criterion I want to match, but it is not working, and I wonder why.

Hi,
I think I can help with that too, from a different angle.

When I add a property's info, I always include the lat/long coordinates for that property.

So when I do the search, I use those coordinates with the Google Maps API to check the nearby schools, bus stations, restaurants, etc.

That way I don't have to depend on the vector search to find that; the Google Maps API handles it really well.

One other piece of advice for your situation: I would look at the chunking method used, since it can have a huge impact on the results. I would format the data as CSV, with each chunk being a full CSV row.

Hope that helps

Should the csv be like

“type: apartment, bedrooms: 2, bathrooms: 2” ?

Yes, that is correct, and you can add any other search criteria to the CSV.

One more important note: in this case you should not depend on the search score at all. Just run the search and pass the results to GPT-4 along with the chat history and the latest query, and leave the task of choosing the best match to GPT-4. Believe me, you will get the best results.
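Both pieces of the suggestion above can be sketched together: serialize each property as a single self-describing CSV row (header included), then build a prompt that hands the retrieved rows, the chat history, and the latest query to the model so it picks the winner instead of the similarity score. The function names and prompt wording here are my own illustration, not a fixed recipe:

```python
import csv
import io

def property_to_csv_chunk(prop: dict) -> str:
    """Serialize one property as a single CSV row, with header,
    so each chunk carries its own column names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(prop))
    writer.writeheader()
    writer.writerow(prop)
    return buf.getvalue().strip()

def build_rerank_prompt(query: str, history: list[str], chunks: list[str]) -> str:
    """Prompt that lets the model, not the search score, choose
    the best match among the retrieved candidates."""
    return (
        "Chat history:\n" + "\n".join(history)
        + f"\n\nLatest query: {query}"
        + "\n\nCandidate properties (CSV rows):\n" + "\n---\n".join(chunks)
        + "\n\nChoose the single best match and explain why."
    )

chunk = property_to_csv_chunk({"type": "apartment", "bedrooms": 2, "bathrooms": 2})
prompt = build_rerank_prompt(
    "close to school",
    ["user: looking for a 2-bed flat"],
    [chunk],
)
print(chunk)
```

The resulting `prompt` string is what you would send to GPT-4 as the final selection step.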


The CSV solution works perfectly! You’re amazing. :grin:

I’m feeling optimistic again about vector search.


Nice to hear that worked out for you

If you need any other help, let me know :grinning::grinning:

@Ghattas_Salloum I am using GPT-4o for metadata extraction.

I used the OpenAI embedder (text-embedding-3-small, 1536 dimensions) with good results.

I am thinking of switching to Hugging Face sentence-transformers for embeddings, to reduce cost.

Will an open-source model work as well for my property search?

I do not know which model to choose either.

Well, unfortunately I haven't tested much with open-source models, so I can't advise on that one.

I'm sorry.

For what it's worth, I am getting acceptable results with multi-qa-MiniLM-L6-cos-v1.
My use case is primarily technical documentation embedding.

If you are using Docker, Weaviate has a pre-built image for this one. As well as others: GitHub - weaviate/t2v-transformers-models: This is the repo for the container that holds the models for the text2vec-transformers module
