Impact of massive cross-reference count on performance?

Description

Hey! As briefly discussed in Slack, moving the question to the forum:
I’m curious if anyone has experience with adding many cross-references to an entity? As an analogous example, imagine an Instagram post that links all the people who liked it via a cross-reference post->user. A single post could then easily get thousands of cross-references. I’d like to use the cross-references to cut down search spaces when querying Weaviate later on. In this example, that could be "give me all posts where post.liked_by_user_ref points to a user with user.name==X". Is there any performance bottleneck that I should watch out for? Has something like this been done before? And out of curiosity: how are cross-reference lookups implemented in Weaviate?
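
For illustration, here is roughly how I’d picture that filter with the Python client (v3 syntax). All the names (Post, User, likedByUsers) are hypothetical, made up just for this example:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical schema: Post has a cross-reference property
# "likedByUsers" pointing at User objects.
where_filter = {
    "path": ["likedByUsers", "User", "name"],  # hop through the reference
    "operator": "Equal",
    "valueText": "X",
}

# Return only posts whose liked-by reference resolves to a user named X.
result = (
    client.query
    .get("Post", ["caption"])
    .with_where(where_filter)
    .do()
)
print(result)
```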

E.g. I read that looking up a cross-reference takes the same time as reading the referenced entity; is it safe to assume then that, given N is the number of entities linked in a coherent chain of cross-references, a query following the cross-references takes N times as long as a single entity lookup?

So let’s say I have posts->users->employers and filter posts down by the employers; that’s 3 entities in this chain of cross-references, which would mean 3x the time of a single entity lookup?
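
To make that chain concrete, I’d expect the filter path to simply grow by one hop per collection. A sketch with made-up property names (likedByUsers, employedBy), assuming multi-hop paths work the way I’d expect:

```python
# Hypothetical 2-hop filter: Post -> User -> Employer.
# Each extra collection in the chain adds one reference hop to the path.
where_filter = {
    "path": [
        "likedByUsers", "User",    # hop 1: posts -> users
        "employedBy", "Employer",  # hop 2: users -> employers
        "name",                    # property on the final collection
    ],
    "operator": "Equal",
    "valueText": "ACME",
}
```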

hi @giorgiogross !! Welcome to our forums :hugs:

I have asked internally. As soon as I get a reply I will keep you posted here.

Thanks!

Thanks, would be amazing to know more! :slight_smile:

Short related follow-up question as well: I’d also use this to filter by different properties of each cross-referenced entity. Tying that back into the Instagram example, something like: show me all posts that were liked by users where user.group_member_of="tech_meetup", who in turn are employed by employers where employer.focus="sustainability".

Not sure if that would also affect performance (and please correct me if that query wouldn’t even be possible with Weaviate, though from what I saw in the docs I’m fairly sure it is). My data model allows me to deterministically cut down my search space quite a lot, and I’d then just run a vector search against the much smaller subset of post entities.
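
In code, I’d picture it roughly like this (again v3 Python client with made-up property names; I’m assuming nested reference paths, the And operator, and combining a where filter with a vector search all behave the way I’d expect):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Both conditions filter on properties of cross-referenced entities.
where_filter = {
    "operator": "And",
    "operands": [
        {
            "path": ["likedByUsers", "User", "groupMemberOf"],
            "operator": "Equal",
            "valueText": "tech_meetup",
        },
        {
            "path": ["likedByUsers", "User", "employedBy", "Employer", "focus"],
            "operator": "Equal",
            "valueText": "sustainability",
        },
    ],
}

# The vector search should then only run against the filtered-down
# subset of posts (requires a vectorizer module to be configured).
result = (
    client.query
    .get("Post", ["caption"])
    .with_where(where_filter)
    .with_near_text({"concepts": ["sustainable tech community"]})
    .with_limit(10)
    .do()
)
```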

hi @giorgiogross !

There is definitely a cost to using a lot of cross-references the way you described in your use case.

The best scenario performance-wise is a denormalized collection (everything you can in one collection).

So the more normalization you have in your schema (a collection cross-referencing another collection, which in turn references a third collection), the more costly it will be to filter the first collection based on a property of the third collection.

One thing to note here is that the third collection in the chain (A->B->C) should not add that much more compute cost than the initial step from having only A to having A->B already does.
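
As a sketch of what that A->B->C chain looks like at the schema level (class and property names are just placeholders for your posts/users/employers example):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# C: the collection at the end of the chain.
client.schema.create_class({
    "class": "Employer",
    "properties": [{"name": "focus", "dataType": ["text"]}],
})

# B: references C.
client.schema.create_class({
    "class": "User",
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "employedBy", "dataType": ["Employer"]},  # cross-reference B -> C
    ],
})

# A: references B. Filtering A on a property of C traverses both hops.
client.schema.create_class({
    "class": "Post",
    "properties": [
        {"name": "caption", "dataType": ["text"]},
        {"name": "likedByUsers", "dataType": ["User"]},  # cross-reference A -> B
    ],
})
```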

Whenever you have a cross-reference, your data is usually stored in multiple shards spread around your cluster (assuming a multi-node deployment), so referenced objects can live on different nodes. That means some network overhead, just for starters.

So on top of those network requests, you will have the compute cost of reading all the referenced objects and merging them, as they are potentially in different shards. This adds up when you have many cross-references.

I believe some tests need to be done, because you not only plan on having cross-references, but on having a massive amount of them.

Let me know if this helps :slight_smile:

Thanks @DudaNogueira, yeah, that helps in understanding the complications. Hmm, I think denormalising my data is going to produce a big overhead for keeping data in sync and will lead to a lot of duplication. E.g. 1000 users referenced from 1 post vs 1000 copies of the post, each with a different username prop… One other option I could see is using arrays; that would be 1 post with something like a liked_by_users array. But that’s simply moving the complexity around, as now I might be dealing with huge arrays :laughing:
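
For the array variant I mean something like this (sketch only; liked_by_user_ids is a made-up property, and I’d still need to verify how filters behave on huge text[] arrays):

```python
# Instead of cross-references, store liker IDs directly on the post
# as a primitive array property.
post_class = {
    "class": "Post",
    "properties": [
        {"name": "caption", "dataType": ["text"]},
        {"name": "liked_by_user_ids", "dataType": ["text[]"]},  # array of user IDs
    ],
}

# Filtering would then match against array elements instead of following
# a reference (if I read the docs right, Equal on an array property
# matches when any element matches).
where_filter = {
    "path": ["liked_by_user_ids"],
    "operator": "Equal",
    "valueText": "user-123",
}
```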

My takeaway right now is: I have to do some benchmarks. I think I should be able to get at least some approximation while my product is still in an invite-only phase. For now I still feel like leaning towards normalised data, as it comes with a big reduction in storage needs.
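
My rough plan is just to time the same cross-reference filter at increasing reference counts, along these lines (same hypothetical schema as above):

```python
import time

import weaviate

client = weaviate.Client("http://localhost:8080")

where_filter = {
    "path": ["likedByUsers", "User", "name"],
    "operator": "Equal",
    "valueText": "X",
}

# Average the latency of the filtered query over a number of runs.
runs = 20
start = time.perf_counter()
for _ in range(runs):
    client.query.get("Post", ["caption"]).with_where(where_filter).do()
elapsed = time.perf_counter() - start
print(f"avg query latency: {elapsed / runs * 1000:.1f} ms")
```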


Awesome! Please, let us know about your findings.

Those use cases and insights are invaluable for us.

Thanks!