How to sync external storage with Weaviate?

Description

How to sync external data sources (e.g. GoogleDrive, Dropbox…) with Weaviate?

For example, let’s say I have a webapp that validates clients’ documents. I could create a form for uploading documents, and when a user submits it, I’d process that data in the backend and ingest the original data and the validation report into Weaviate.

Instead of a user manually selecting every file and uploading it, I want them to grant me access to their storage, e.g. GoogleDrive. Now every time they upload a file to GoogleDrive://some/path/to/documents/ that file gets processed and ingested into Weaviate.
Also, when a file gets deleted from GoogleDrive, it gets deleted from Weaviate.

So, Google Drive and Weaviate are kept in sync, with some minimal latency.

Are there libraries/services that can help with this?

A naive solution

A naive solution would be to implement this myself:

  • For each external data source, create a crawler
  • Every 5 minutes, go through the external storage and check for new, deleted, and modified files
    • Process and ingest new files to Weaviate
    • Remove deleted files from Weaviate
    • Replace modified files with new versions
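The change-detection step of that polling loop could be sketched roughly as below. This is only a minimal sketch under assumptions: `FileMeta`, `diff_snapshots` and `object_uuid` are hypothetical names, the snapshots are assumed to come from whatever listing call the storage provider offers, and no Weaviate client calls are shown. Deriving a deterministic UUID from the source and file ID is one way to make sure a later delete or replace targets the same Weaviate object.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical per-file record built from a storage listing."""
    file_id: str
    version: str  # e.g. a modified timestamp or content hash from the provider


def diff_snapshots(previous, current):
    """Compare two {file_id: FileMeta} snapshots.

    Returns three sorted lists of file IDs: (new, deleted, modified).
    """
    new = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    modified = sorted(
        fid
        for fid in current.keys() & previous.keys()
        if current[fid].version != previous[fid].version
    )
    return new, deleted, modified


def object_uuid(source, file_id):
    """Deterministic UUID so the same file always maps to the same object ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}:{file_id}"))


# Example: "a" was removed, "b" changed, "d" was added since the last poll.
previous = {f: FileMeta(f, "v1") for f in ("a", "b", "c")}
current = {
    "b": FileMeta("b", "v2"),
    "c": FileMeta("c", "v1"),
    "d": FileMeta("d", "v1"),
}
print(diff_snapshots(previous, current))  # (['d'], ['a'], ['b'])
```

Each ID in the three lists would then be processed and ingested, deleted, or replaced in Weaviate, using `object_uuid(...)` as the object ID.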

This seems like a lot of code to write.

Is there an existing and easier way to achieve this?

Additional info

  • I think Elasticsearch supports this with connectors: Connectors references | Enterprise Search documentation [8.13] | Elastic

  • As a next step, how can I make sure that if user A has access to document A in external storage but user B doesn’t, the access rights are the same in Weaviate? If user B searches for that document, they shouldn’t find it. I guess I could store some metadata in every asset in Weaviate, e.g. usersHaveAccess: ["UserA",...] or usersNoAccess: ["UserB"].
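To make the metadata idea concrete, here is a minimal sketch, assuming each object carries a `usersHaveAccess` text-array property as suggested above. `build_access_filter` and `can_access` are hypothetical helpers; the dictionary is meant to follow the shape of Weaviate's `where` filter with the `ContainsAny` operator, which should be checked against the Weaviate and client versions in use.

```python
def build_access_filter(user_id, acl_property="usersHaveAccess"):
    """Build a where-style filter dict that only matches objects whose
    ACL array contains the given user (assumed filter shape)."""
    return {
        "path": [acl_property],
        "operator": "ContainsAny",
        "valueText": [user_id],
    }


def can_access(properties, user_id, acl_property="usersHaveAccess"):
    """Local check of the same rule, e.g. as a safeguard on returned results."""
    return user_id in properties.get(acl_property, [])


doc_a = {"title": "Document A", "usersHaveAccess": ["UserA"]}
print(can_access(doc_a, "UserA"))  # True
print(can_access(doc_a, "UserB"))  # False
```

The filter would be attached to every search made on behalf of a user, and the ACL property would need to be updated by the same sync job whenever sharing settings change in the external storage.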

hi @Luka_Secerovic !! Welcome to our community :hugs:

I haven’t seen anything like that. I imagine that for each external storage you would need to write some integration.

I believe Flowise is a nice open source tool that can fill this gap, as it supports different integrations. There is also n8n, which has some nice integration tools that could be used. Those are low-code options, but you can also code it directly.

At least it can get you some idea on how this tool could be built.

Not sure if this helps, but that’s my 2 cents :slight_smile:

let me know if this helps.

Thanks!