Semantic search is not as good as I expected

I just started with Verba and I am using the ADAEmbedder.

Query: “find me renovated apartment for less than 1 million”

The search ignores the "1 million" constraint entirely. It also does not match "refurbished apartments", even though refurbished is a synonym of renovated.

How can I make this simple search work well? Should I use a different embedder, or build a custom one myself?

I don't know much about embedders, as I am very new to LLMs.

hi @uma_shankar !!

The results will depend on the vectorizer used, the content you have indexed (how it was chunked, what metadata it carries, etc.), and the query.

I am not sure that expressing a filter in natural language ("less than 1 million") will get you the results you expect. :thinking:

You will, however, get the objects closest to your query based on vector distance.

Let me know if this helps!
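One common workaround is to pull the numeric constraint out of the query yourself and apply it as a structured filter, embedding only the semantic part (in Weaviate this maps to combining a `where` filter with a `nearText` search). Here is a minimal, self-contained sketch of that split; the `split_query` helper and its regex are hypothetical and only recognize "less than / under N million" phrasing, just to illustrate the idea:

```python
import re

def split_query(query: str):
    """Pull a numeric price ceiling out of a natural-language query.

    Hypothetical helper: it only handles 'less than N million' /
    'under N million' phrases, to show the constraint becoming a
    structured filter instead of being embedded with the rest.
    """
    match = re.search(r"(?:less than|under)\s+([\d.]+)\s*million", query, re.I)
    max_price = None
    if match:
        max_price = float(match.group(1)) * 1_000_000
        # Strip the price phrase so only the semantic part gets embedded.
        query = re.sub(
            r"(?:for\s+)?(?:less than|under)\s+[\d.]+\s*million",
            "", query, flags=re.I,
        ).strip()
    return query, max_price

semantic, ceiling = split_query("find me renovated apartment for less than 1 million")
print(semantic)  # "find me renovated apartment"
print(ceiling)   # 1000000.0
```

The semantic part then goes to the vector search, while the ceiling becomes a price filter on the results.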

Thanks!

Thanks, I understand.

I do not know which Vectorizer is best for my case “property recommendation”.

Should I do anything special with metadata during ingestion?
Or do I only need to add metadata at query time?

hi!

Choosing a vectorizer really takes some experimentation: whether it must be multilingual, the number of dimensions, etc.

Regarding metadata, consider for example 20 chunks, 10 for each of two documents. You will want to store at least the main title of the document with each chunk.

Otherwise each chunk carries only its own words, with no context about the document it came from.

With that said, you want to make sure that each chunk has not only its content, but also some metadata that adds context about what that chunk relates to.
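A minimal sketch of that idea: prepend the document title to every chunk before embedding, so the vector carries document-level context too (the helper name and listing text below are made up for illustration):

```python
def build_chunks(doc_title: str, passages: list[str]) -> list[str]:
    """Prefix each chunk with its document title so the embedding
    reflects the document it belongs to, not just the chunk's words."""
    return [f"{doc_title}\n\n{p}" for p in passages]

chunks = build_chunks(
    "2-bedroom renovated apartment, city centre",
    [
        "South-facing patio with garden access.",
        "Walking distance to two primary schools.",
    ],
)
print(chunks[0])
```

Without the title prefix, the second chunk would embed as a sentence about schools with no hint that it describes a renovated apartment.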

Let me know if this helps.

If you want to learn more about chunking, check these docs:

Thanks!

Hi,
I already have experience with this exact situation. After lots of testing, I found that vector search is not the best fit for property recommendations at all. What you need to do is define a set of features like price, size, location, and so on, then use GPT-4 to extract those features from the query as JSON, and use that JSON to search for properties in a SQL database. That gives far better results than using a vector database.
I have already done this, and the difference is huge.
A vector database is better for unstructured data, like large amounts of text; I use Weaviate for that and it works really well.

Hope that helps


thanks for sharing, @Ghattas_Salloum !!

Sure any time

have a great one


I did some more work on this, and it looks like @Ghattas_Salloum is right. I used metadata search (GPT-extracted) for filtering price, nr_bedrooms, etc. I then relied on dense search to match other criteria like amenities, a patio, etc.

But the vector search is not doing its job at all. For instance, with a simple query like "close to school", the apartment that is actually close to a school scores lower than another listing with no mention of a school at all.

This is the only criterion I want to match, but it is not working, and I wonder why.

Hi,
I think I can help with that too, from a different angle.

When I add a property's info, I always include the lat/long coordinates for that property.

So when I do the search, I use those coordinates with the Google Maps API to check the nearby schools, bus stations, restaurants, etc.

That way I don't have to depend on the vector search to find that; the Google Maps API handles it really well.

One other piece of advice for your situation: I would look at the chunking method used, since it can have a huge impact on the results. I would format the data as CSV, with each chunk being a full CSV row.

Hope that helps

Should the csv be like

“type: apartment, bedrooms: 2, bathrooms: 2” ?

Yes, that is correct, and you can add any other search criteria to the CSV.

One more important note: in this case you should not depend on the search score at all. Just run the search and pass the results to GPT-4 along with the chat history and the latest query, and leave the task of choosing the best match to GPT-4. Believe me, you will get the best results.
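Both pieces of the suggestion above can be sketched together: serialize each property as a single self-describing CSV row (header included), then build a prompt that hands the retrieved rows, the chat history, and the latest query to the model so it picks the winner instead of the similarity score. The function names and prompt wording here are my own illustration, not a fixed recipe:

```python
import csv
import io

def property_to_csv_chunk(prop: dict) -> str:
    """Serialize one property as a single CSV row, with header,
    so each chunk carries its own column names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(prop))
    writer.writeheader()
    writer.writerow(prop)
    return buf.getvalue().strip()

def build_rerank_prompt(query: str, history: list[str], chunks: list[str]) -> str:
    """Prompt that lets the model, not the search score, choose
    the best match among the retrieved candidates."""
    return (
        "Chat history:\n" + "\n".join(history)
        + f"\n\nLatest query: {query}"
        + "\n\nCandidate properties (CSV rows):\n" + "\n---\n".join(chunks)
        + "\n\nChoose the single best match and explain why."
    )

chunk = property_to_csv_chunk({"type": "apartment", "bedrooms": 2, "bathrooms": 2})
prompt = build_rerank_prompt(
    "close to school",
    ["user: looking for a 2-bed flat"],
    [chunk],
)
print(chunk)
```

The resulting `prompt` string is what you would send to GPT-4 as the final selection step.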


The CSV solution works perfectly! You’re amazing. :grin:

I’m feeling optimistic again about vector search.


Nice to hear that worked out for you

If you need any other help, let me know :grinning::grinning:

@Ghattas_Salloum I am using GPT-4o for metadata extraction.

I used the OpenAI embedder (text-embedding-3-small, 1536 dimensions) with good results.

I am thinking of switching to Hugging Face sentence-transformers for embeddings, to reduce cost.

Will an open-source model work as well for my property search?

I do not know which model to choose either.

Well, unfortunately I haven't tested much with open-source models, so I can't advise on that one.

I'm sorry.

For what it's worth, I am getting acceptable results with multi-qa-MiniLM-L6-cos-v1.
My use case is primarily technical documentation embedding.

If you are using Docker, Weaviate has a pre-built image for this one. As well as others: GitHub - weaviate/t2v-transformers-models: This is the repo for the container that holds the models for the text2vec-transformers module
