Description
How can I sync external data sources (e.g. Google Drive, Dropbox…) with Weaviate?
For example, let’s say I have a web app that validates clients’ documents. I could create a form for uploading documents, and when a user submits it, I’d process the data in the backend and ingest both the original data and the validation report into Weaviate.
Instead of having users manually select and upload every file, I want them to grant me access to their storage, e.g. Google Drive. Then, every time they upload a file to GoogleDrive://some/path/to/documents/, that file gets processed and ingested into Weaviate.
Also, when a file gets deleted from Google Drive, it gets deleted from Weaviate.
So, Google Drive and Weaviate are kept in sync, with some minimal latency.
Are there libraries/services that can help with this?
A naive solution
A naive solution would be to implement this myself:
- For each external data source, create a crawler
- Every 5 minutes, go through the external storage and check for new, deleted, and modified files
- Process and ingest new files to Weaviate
- Remove deleted files from Weaviate
- Replace modified files with new versions
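The change-detection step in the list above could be sketched roughly like this. It is only a minimal illustration, assuming each crawl produces a snapshot that maps a file ID to some version marker (a modification time or content hash). The names `SyncPlan` and `plan_sync` are hypothetical, and the actual calls to the storage API and to Weaviate are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    """File IDs grouped by the action the sync job should take."""
    new: set[str]
    deleted: set[str]
    modified: set[str]


def plan_sync(previous: dict[str, str], current: dict[str, str]) -> SyncPlan:
    """Compare two crawl snapshots (file ID -> version marker).

    The version marker is an assumption: it could be a modification
    timestamp or a content hash, whatever the storage API exposes.
    """
    prev_ids = previous.keys()
    curr_ids = current.keys()
    new = set(curr_ids - prev_ids)          # present now, absent before
    deleted = set(prev_ids - curr_ids)      # present before, absent now
    modified = {
        file_id
        for file_id in prev_ids & curr_ids  # present in both snapshots
        if previous[file_id] != current[file_id]
    }
    return SyncPlan(new=new, deleted=deleted, modified=modified)


if __name__ == "__main__":
    previous = {"a.pdf": "v1", "b.pdf": "v1", "c.pdf": "v1"}
    current = {"a.pdf": "v1", "b.pdf": "v2", "d.pdf": "v1"}
    plan = plan_sync(previous, current)
    # d.pdf would be ingested, c.pdf removed, b.pdf replaced.
    print(plan)
```

The sync job would then ingest `plan.new`, remove `plan.deleted`, and replace `plan.modified`, storing `current` as the baseline for the next run.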
This seems like a lot of code to write.
Is there an existing and easier way to achieve this?
Additional info
- I think Elasticsearch supports this with connectors: Connectors references | Enterprise Search documentation [8.13] | Elastic
- As a next step, how can I make sure that if user A has access to document A in external storage but user B doesn’t, the access rights are the same in Weaviate? That is, if user B searches for that document, they shouldn’t find it. I guess I could store some metadata on every asset in Weaviate, e.g. usersHaveAccess: ["UserA",...] or usersNoAccess: ["UserB"].
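To make the metadata idea concrete, here is a minimal sketch of the allow-list variant in plain Python. The property name `usersHaveAccess` comes from the question; `can_read`, `visible_documents`, and `access_filter` are hypothetical helpers. The filter dictionary assumes Weaviate's `where` filter accepts a `ContainsAny` operator on a text array property, which should be checked against the Weaviate version in use.

```python
def can_read(user_id: str, users_have_access: list[str]) -> bool:
    """Allow-list check: a user can read only if explicitly listed."""
    return user_id in users_have_access


def visible_documents(user_id: str, documents: list[dict]) -> list[dict]:
    """Keep only the documents whose allow-list contains the user.

    Documents without a usersHaveAccess property are hidden (default deny).
    """
    return [
        doc for doc in documents
        if can_read(user_id, doc.get("usersHaveAccess", []))
    ]


def access_filter(user_id: str) -> dict:
    """Build a where-style filter for the same condition.

    Assumption: the operator name and dictionary shape follow Weaviate's
    `where` filter format; verify before relying on it.
    """
    return {
        "path": ["usersHaveAccess"],
        "operator": "ContainsAny",
        "valueText": [user_id],
    }


if __name__ == "__main__":
    docs = [
        {"title": "Document A", "usersHaveAccess": ["UserA"]},
        {"title": "Document C", "usersHaveAccess": ["UserA", "UserB"]},
    ]
    # UserB should see only Document C.
    print([d["title"] for d in visible_documents("UserB", docs)])
    print(access_filter("UserB"))
```

An allow-list fails closed: if the metadata is missing or stale, the document stays hidden, whereas a deny-list such as usersNoAccess would expose it. Applying the condition as a filter on the search itself, rather than filtering results afterwards, also keeps inaccessible documents out of the result count.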