I am not a data engineer, so I am not familiar with schema design, and I have no prior experience with BigQuery, PySpark, etc. (As a data scientist, I only know SQL.) So I don't really know what schema structure I should write for my multi-PDF querying app.
For example, here are 3 schemas I used for my app:
# Schema 1: class name and vectorizer only
class_obj = {
    "class": "doc-query",
    "vectorizer": "text2vec-openai",
}
# Schema 2: additionally sets vectorizeClassName on the vectorizer module
class_obj = {
    "class": "doc-query",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": True
        }
    }
}
# Schema 3: explicit model settings plus one `content` text property
class_obj = {
    "class": "doc-query",
    "description": "Documents for chatbot",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"model": "ada", "type": "text"},
    },
    "properties": [
        {
            "dataType": ["text"],
            "description": "The content of the paragraph",
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": False,
                    "vectorizePropertyName": False,
                }
            },
            "name": "content",
        },
    ],
}
In my experiments with all three, I see pretty much the same retrieval quality: all 3 give a similar level of correctness in their responses, and all 3 return pretty much the same sources too (when using RetrievalQAWithSourcesChain, for example). So what's the difference between these schema structures? Or does it not matter which schema we use, and it's just a placeholder of sorts?
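One way to put a number on "pretty much the same sources" would be to compare the sets of sources each schema returns for the same question. The snippet below is only a minimal sketch of that idea: it assumes each run yields a plain list of source identifiers, and the function name `source_overlap`, the schema labels, and the file names are all hypothetical placeholders, not output from my app.

```python
from itertools import combinations


def source_overlap(sources_a, sources_b):
    """Jaccard overlap between two collections of source identifiers (0.0 to 1.0)."""
    a, b = set(sources_a), set(sources_b)
    if not a and not b:
        return 1.0  # two empty results are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical sources returned for the same question under each schema.
results = {
    "schema_1": ["report.pdf", "manual.pdf", "faq.pdf"],
    "schema_2": ["report.pdf", "manual.pdf", "faq.pdf"],
    "schema_3": ["report.pdf", "manual.pdf", "notes.pdf"],
}

# Pairwise overlap for every combination of two schemas.
overlaps = {
    (x, y): source_overlap(results[x], results[y])
    for x, y in combinations(results, 2)
}
for (x, y), score in overlaps.items():
    print(f"{x} vs {y}: {score:.2f}")
```

Averaging this score over a fixed set of test questions would give a rough, comparable figure per pair of schemas. It only measures agreement between schemas, not whether the sources are actually correct.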
P.S. There's not much explanation of schemas in the Weaviate documentation either; only a handful of examples are provided, without going into the nitty-gritty and the subtleties of how, when, and where different schema structures can and should be used. To that end, once I understand the schema structure, I want to create a comprehensive comparative analysis of different schema structures: how to know which properties should be present in the schema based on the problem at hand, how to compare different schemas on the same problem to quantify the differences in results, etc. I'd love it if anyone is up for a collab.