If I have data for different companies that all share the same schema, should I use a separate collection for each company, or a single collection filtered by company name or ID?
The total number of companies could range from a few dozen to a few hundred.
This is a classic multi-tenancy scenario:
Check out our awesome CTO giving a great talk on this subject:
Let me know if this helps.
Thank you for pointing me to this video.
But my situation is a bit different. Isolating data by collection or tenant would work well for my application, but I also need to be able to search across all collections or tenants.
I guess if I go the multi-collection or multi-tenancy route, I would have to run the same query against every collection or tenant to do a global search. That sounds like an expensive operation.
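To make the cost concern concrete, here is a minimal sketch of a global search across per-company collections. It is a toy in-memory model, not any database's API: `collections`, `search_collection`, and `global_search` are hypothetical names, and cosine similarity over plain lists stands in for a real vector index. The point is only that one global query becomes one query per collection plus a merge step.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_collection(records, query, k):
    """Top-k (score, doc_id) pairs within a single company's collection."""
    scored = ((cosine(vec, query), doc_id) for doc_id, vec in records.items())
    return heapq.nlargest(k, scored)


def global_search(collections, query, k):
    """Fan out to every collection, then merge the per-collection top-k lists."""
    partial = []
    for company, records in collections.items():
        for score, doc_id in search_collection(records, query, k):
            partial.append((score, company, doc_id))
    return heapq.nlargest(k, partial)


# Hypothetical data: three companies, same schema, 2-d vectors for brevity.
collections = {
    "acme": {"a1": [1.0, 0.0], "a2": [0.0, 1.0]},
    "globex": {"g1": [0.9, 0.1]},
    "initech": {"i1": [-1.0, 0.0]},
}
results = global_search(collections, query=[1.0, 0.0], k=2)
print(results)
```

With a few hundred companies, this pattern means a few hundred index lookups for every global query.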
Right. Sorry, I was not aware of that requirement.
In that case, you could use a single class and specify a field that will be used to filter the data.
I am testing the all-in-one-collection approach.
I found that the filter can impact performance dramatically.
I use an INT field, company_id, and set it to filterable (I think this should build an inverted index, right?). I assigned the values 1, 2, 3 evenly, so filtering on company_id = 1 should narrow the search to 1/3 of my dataset.
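As a rough illustration of what a filterable field provides, the sketch below builds a toy inverted index on `company_id`. It is plain Python, not the database's actual implementation, and `build_inverted_index` is a hypothetical helper. With the values 1, 2, 3 assigned evenly, looking up `company_id = 1` yields one third of the record IDs, which then act as the allow list for the vector search.

```python
from collections import defaultdict


def build_inverted_index(records, field):
    """Map each distinct field value to the set of record IDs that have it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        index[record[field]].add(record_id)
    return index


# Hypothetical dataset: 9,000 records with company_id cycling through 1, 2, 3.
records = {i: {"company_id": i % 3 + 1} for i in range(9000)}
index = build_inverted_index(records, "company_id")

allow_list = index[1]  # IDs a filtered vector search is allowed to return
print(len(allow_list), len(allow_list) / len(records))
```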
I have a text field that uses OpenAI to generate 1536-dimensional vectors.
I tested with 10k, 20k, and 40k records.
The vector-only query times look OK: 0.0028, 0.0038, and 0.0041 sec.
But the query times for vector search with the filter (which uses only 1/3 of the dataset) look bad: 0.0039, 0.0061, and 0.0106 sec. They increase roughly linearly with dataset size.
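A quick check of the growth rates, using only the timings reported above: from 10k to 40k records the dataset grows 4x, and the snippet computes how much each query type slows down over that range. The `growth` helper is just for illustration.

```python
sizes = [10_000, 20_000, 40_000]
vector_only = [0.0028, 0.0038, 0.0041]  # seconds, as reported above
filtered = [0.0039, 0.0061, 0.0106]     # seconds, filter keeps ~1/3 of records


def growth(times):
    """Slowdown factor between the smallest and the largest dataset."""
    return times[-1] / times[0]


data_growth = sizes[-1] / sizes[0]
print(f"data: {data_growth:.1f}x")
print(f"vector only: {growth(vector_only):.2f}x")
print(f"filtered: {growth(filtered):.2f}x")
```

For 4x the data, the vector-only query slows down by about 1.5x, while the filtered query slows down by about 2.7x, so the filtered path scales noticeably worse.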
I found out that the flat_search_cutoff setting could be causing the problem…
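For context, here is a minimal sketch of how such a cutoff could explain the numbers, assuming this refers to the HNSW flat-search cutoff in Weaviate: when a filter matches fewer objects than the cutoff, the HNSW index is bypassed and the matching objects are scanned by brute force, so the cost grows with the number of matches. The default of 40,000 used below is an assumption to verify against your own configuration, and `choose_strategy` is a hypothetical helper, not a library function.

```python
FLAT_SEARCH_CUTOFF = 40_000  # assumed default; verify against your configuration


def choose_strategy(filter_matches, cutoff=FLAT_SEARCH_CUTOFF):
    """Return which search path a filtered query would take under this model."""
    return "flat" if filter_matches < cutoff else "hnsw"


for total in (10_000, 20_000, 40_000):
    matches = total // 3  # the filter keeps about one third of the records
    print(total, matches, choose_strategy(matches))
```

Under these assumptions, all three test sizes fall below the cutoff, so every filtered query would be a linear scan over the matching third. That would be consistent with the filtered timings growing with dataset size.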
Not sure I got it. Can you clarify?