How to design a schema with reference

Description

I need to create a schema that supports vector search with auth restrictions.
My current approach is:
File

  • …file meta info
  • auth_code # to decide if the user has the permission to read the file

TextChunk

  • text
  • has_file # reference to file

The search runs on the TextChunk collection, with a filter like:
Filter.by_ref("has_file").by_property("auth_code").contains_any([auth_code_list])

There could be 100k to 1M files, and roughly 20 to 50 times as many text chunks. Some users could have permission on almost every file (but the results still need to be filtered by auth_code).

Is this a good schema for this kind of situation?
Should I move auth_code to the TextChunk level? The reason I currently keep it on File is that changing a file's auth_code then requires only a single update on File.
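
To put rough numbers on this trade-off, here is a minimal sketch in plain Python (not Weaviate code). It computes the chunk counts implied by the figures above and how many objects must be rewritten when one file's auth_code changes under each layout. The function names are hypothetical, and the sketch assumes that in the chunk-level layout the code lives only on the chunks.

```python
def chunk_count(files: int, chunks_per_file: int) -> int:
    """Total number of TextChunk objects for a given number of files."""
    return files * chunks_per_file


def writes_for_auth_change(chunks_per_file: int, auth_on_chunk: bool) -> int:
    """Objects that must be updated when a single file's auth_code changes."""
    # auth_code on File: only the File object is rewritten.
    # auth_code copied onto every TextChunk: each chunk of that file is rewritten.
    return chunks_per_file if auth_on_chunk else 1


if __name__ == "__main__":
    # Smallest and largest collection sizes implied by the post.
    print("chunks (low): ", chunk_count(100_000, 20))
    print("chunks (high):", chunk_count(1_000_000, 50))
    for per_file in (20, 50):
        print(
            f"{per_file} chunks/file:",
            "File-level update =", writes_for_auth_change(per_file, False),
            "| chunk-level update =", writes_for_auth_change(per_file, True),
        )
```

So the chunk-level layout trades a per-change fan-out of 20 to 50 writes for a filter that no longer has to follow a reference.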

Server Setup Information

  • Weaviate Server Version:
  • Deployment Method:1.24.1
  • Multi Node? Number of Running Nodes:
  • Client Language and Version: 4.5.1

Any additional Information

Hi @shadowlin !

I think running some tests would be the best way to determine the impact of storing this on File vs. on the chunk.

While storing the auth_code on the File model makes changes easier to manage, storing it on the chunk will not require the cross-reference, so it may be faster :thinking:
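
For illustration, the two variants could look roughly like this with the Python client v4. This is only a sketch under assumptions: `client` is an already connected v4 client, the query vector is supplied by the caller, `auth_code` is a single string per file, and the chunk-level `auth_code` property is a hypothetical copy that the current schema does not have. The function names are made up for this example.

```python
from typing import Any


def chunk_properties(text: str, file_auth_code: str) -> dict[str, Any]:
    """Properties for a TextChunk that carries a copy of its file's auth_code."""
    return {"text": text, "auth_code": file_auth_code}


def search_by_reference(client, query_vector, auth_code_list, limit=10):
    """Current layout: filter TextChunk through the has_file cross-reference."""
    # Imported lazily so chunk_properties() works without the client installed.
    from weaviate.classes.query import Filter

    chunks = client.collections.get("TextChunk")
    return chunks.query.near_vector(
        near_vector=query_vector,
        filters=Filter.by_ref("has_file")
        .by_property("auth_code")
        .contains_any(auth_code_list),
        limit=limit,
    )


def search_by_chunk_property(client, query_vector, auth_code_list, limit=10):
    """Alternative layout: auth_code is stored on the chunk, no reference hop."""
    from weaviate.classes.query import Filter

    chunks = client.collections.get("TextChunk")
    return chunks.query.near_vector(
        near_vector=query_vector,
        filters=Filter.by_property("auth_code").contains_any(auth_code_list),
        limit=limit,
    )
```

With the second layout, every chunk would be inserted with `chunk_properties(...)`, and all chunks of a file would need to be updated whenever that file's auth_code changes.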

So what are the limitations of references?

Filter.by_ref("has_file").by_property("auth_code").contains_any([auth_code_list])

If the above filter matches something like 10M or even more eligible text chunks, could it greatly impact performance?

Is there any guide on how to use references properly?

I don't know the exact details of why and how it would impact performance, but considering that you will have a lot of auth_codes, the more you have, the more values the filter needs to match before it can start selecting the files you want to hit.

I'm not aware of any best practices on cross-references. I believe this would be an interesting scenario to compare:

  1. Cross reference of Auth codes
  2. Store the cross references on a File property, and filter directly on it.
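
A comparison like this could be timed with a small harness such as the sketch below. It is plain Python and knows nothing about Weaviate: each candidate is a zero-argument callable that you would replace with a real query against your own data. The names `median_latency_ms` and `compare` are made up for this example, and the placeholder lambdas do no real work.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run_query: Callable[[], object], repeats: int = 20) -> float:
    """Median wall-clock time of run_query() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(
    candidates: dict[str, Callable[[], object]], repeats: int = 20
) -> dict[str, float]:
    """Run every candidate `repeats` times and report its median latency."""
    return {name: median_latency_ms(fn, repeats) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Placeholders: swap in callables that execute the two real queries.
    results = compare(
        {"cross_reference": lambda: None, "direct_property": lambda: None},
        repeats=5,
    )
    for name, ms in results.items():
        print(f"{name}: {ms:.3f} ms")
```

Running both variants against a realistically sized dataset, and with users who hold many auth codes, should give a more reliable answer than a small test collection.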

Unfortunately I don't have an answer for that :grimacing: