I have some original data which is inherently a list of strings (names of places mentioned in an article, for example ['Paris', 'Rome', 'London', 'New York']).
Given the available query operators, would it be handier to just join them into a single string ("Paris Rome London New York"), or is there some advantage to keeping them as a list of separate strings?
Thanks for any suggestions
Is there an example of the Property declaration for such objects and a query?
When it comes to searching, I don't think it really makes a difference. Since the default tokenization is word, either form ends up indexed as the same tokens: paris, rome, london, new, york.
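To make the tokenization point concrete, here is a small pure-Python sketch. It only approximates word tokenization (lowercase, split on anything that is not a letter or digit) and does not call Weaviate at all; `word_tokens` is a hypothetical helper written for this illustration.

```python
import re

def word_tokens(value):
    """Approximate 'word' tokenization: lowercase, split on non-alphanumerics."""
    # Accept either a single string or a list of strings.
    texts = value if isinstance(value, list) else [value]
    tokens = set()
    for text in texts:
        # [^\W_]+ matches runs of letters and digits only.
        tokens.update(re.findall(r"[^\W_]+", text.lower()))
    return tokens

as_list = word_tokens(["Paris", "Rome", "London", "New York"])
as_string = word_tokens("Paris,Rome,London,New York")

print(as_list == as_string)  # True
print(sorted(as_list))       # ['london', 'new', 'paris', 'rome', 'york']
```

Note that under this scheme "New York" is split into new and york, so a filter on "New" would presumably also match an object that only contains "New Hampshire".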
I did a little lab to check this:
import weaviate

client = weaviate.connect_to_local()

# Start from a clean collection; properties are inferred by auto-schema
client.collections.delete("MyCollection")
collection = client.collections.create("MyCollection")

# Same data stored twice: once as a list, once as a joined string
content = [
    {"cities": ['Paris','Rome','London','New York'], "cities_string": "Paris,Rome,London,New York"},
    {"cities": ['Rio','London','New Hampshire'], "cities_string": "Rio,London,New Hampshire"},
]

with collection.batch.dynamic() as batch:
    for data_row in content:
        batch.add_object(
            properties=data_row
        )

# Failures of a collection-level batch are reported on the collection
if collection.batch.failed_objects:
    print(collection.batch.failed_objects)

for i in collection.query.fetch_objects(include_vector=True).objects:
    print(i.properties, i.vector)
I got the same results with, for example:
from weaviate import classes as wvc

collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities_string").contains_any(["Rome", "Rio"])
).objects
and
from weaviate import classes as wvc

collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities").contains_any(["Rome", "Rio"])
).objects
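On the earlier question about a property declaration: the lab above relies on auto-schema, but the two properties could also be declared explicitly. The sketch below writes the declaration as a plain dict in the shape of Weaviate's JSON schema. The tokenization choices are my assumption for illustration (field on the list keeps "New York" as one token, word on the string matches the default), and the `create_from_dict` call in the comment is how I believe the v4 Python client accepts such a dict, so please check it against the client docs.

```python
# Hypothetical explicit schema for the collection used in this thread.
schema = {
    "class": "MyCollection",
    "properties": [
        # List of strings; "field" indexes each entry as a single token.
        {"name": "cities", "dataType": ["text[]"], "tokenization": "field"},
        # Joined string; "word" splits it into lowercase word tokens.
        {"name": "cities_string", "dataType": ["text"], "tokenization": "word"},
    ],
}

# Assumption: with a connected v4 client this could be applied with
# client.collections.create_from_dict(schema)
for prop in schema["properties"]:
    print(prop["name"], prop["dataType"][0], prop["tokenization"])
```

With field tokenization on the list, a filter would have to match a whole entry such as "New York", which is one possible advantage of keeping the list form.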
I have built a small collection of 10,000 entries similar to yours (thank you ChatGPT).
With this larger collection I ran several timing tests. Unless this is a random result, retrieval by filtering on the string is slightly faster than retrieval by filtering on the list.
Fetch Rome Rio from string took 0.12191414833068848 seconds to complete.
Fetch Rio Rome from string took 0.12638115882873535 seconds to complete.
Fetch Rome Rio from list took 0.1296849250793457 seconds to complete.
Fetch Rio Rome from list took 0.12851190567016602 seconds to complete.
The above is just one run. Can you think of any reason for the difference, or is it just random? Thank you
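A gap of a few milliseconds on a single run could easily be noise, so one way to check is to repeat each query many times and compare medians and spread. The helper below is a minimal sketch of that idea; `time_query`, `summarize` and `fake_query` are hypothetical names, and `fake_query` is only a stand-in workload that you would replace with a function wrapping one of the `fetch_objects` calls above.

```python
import statistics
import time

def time_query(run_query, repeats=30, warmup=3):
    """Return per-call wall-clock timings (seconds) for run_query()."""
    # Warm-up calls are discarded so caches and connections settle first.
    for _ in range(warmup):
        run_query()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """Median and spread are more robust than a single measurement."""
    return {
        "median": statistics.median(timings),
        "stdev": statistics.stdev(timings),
        "min": min(timings),
    }

def fake_query():
    # Stand-in workload; swap in the real string or list filter query.
    sum(range(10_000))

stats = summarize(time_query(fake_query))
print(stats)
```

If the difference between the two medians is smaller than the standard deviation of either set of timings, it is probably not meaningful.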