How to manage list of strings

The Data types | Weaviate - Vector Database page lists text as a data type.

I have some original data which is inherently a list of strings (names of places mentioned in an article), for example ['Paris', 'Rome', 'London', 'New York'].

Given the available query operators, would it be handier to just join them into a single string ("Paris Rome London New York"), or is there some advantage to keeping them as a list of separate strings?

Thanks for any suggestions
Is there an example of the Property declaration for such objects and a query?

Hi!

When it comes to searching, I don’t think it really makes a difference. Considering that the default tokenization is word, you will end up with an object with tokens like paris, rome, london, new, york.
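To make the tokenization point concrete, here is a minimal pure-Python sketch. It only approximates word tokenization (lowercase, then split on anything that is not a letter or digit); `word_tokenize` and `tokens_for` are hypothetical helpers written for illustration, not Weaviate functions.

```python
import re

def word_tokenize(value):
    """Approximate `word` tokenization: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[\W_]+", value.lower()) if t]

def tokens_for(prop):
    """Collect tokens for a text value (str) or a text[] value (list of str)."""
    values = [prop] if isinstance(prop, str) else prop
    tokens = []
    for v in values:
        tokens.extend(word_tokenize(v))
    return tokens

as_list = tokens_for(["Paris", "Rome", "London", "New York"])
as_string = tokens_for("Paris,Rome,London,New York")

print(as_list)               # ['paris', 'rome', 'london', 'new', 'york']
print(as_list == as_string)  # True
```

Under this approximation both representations produce the same tokens, which is why a token-based filter should not distinguish between them.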

I did a little lab to check this:

import weaviate

# Connect to a local Weaviate instance and start from a clean collection
client = weaviate.connect_to_local()
client.collections.delete("MyCollection")
collection = client.collections.create("MyCollection")

# The same cities stored both as a list and as a single joined string
content = [
    {"cities": ['Paris','Rome','London','New York'], "cities_string": "Paris,Rome,London,New York"},
    {"cities": ['Rio','London','New Hampshire'], "cities_string": "Rio,London,New Hampshire"},
]
with collection.batch.dynamic() as batch:
    for data_row in content:
        batch.add_object(
            properties=data_row
        )
# Failures from a collection-level batch are reported on collection.batch
if collection.batch.failed_objects:
    print(collection.batch.failed_objects)

# Print every stored object with its vector
for i in collection.query.fetch_objects(include_vector=True).objects:
    print(i.properties, i.vector)

I got the same results with, for example:

from weaviate import classes as wvc
collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities_string").contains_any(["Rome", "Rio"])
).objects

and

from weaviate import classes as wvc
collection.query.fetch_objects(
    filters=wvc.query.Filter.by_property("cities").contains_any(["Rome", "Rio"])
).objects
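As for the property declaration asked about above: the lab relies on auto-schema, but the two properties can also be declared explicitly. The following is only a sketch, assuming the v4 Python client; `create_places_collection` is a hypothetical helper name, and the explicit `Tokenization.WORD` simply spells out the default.

```python
def create_places_collection(client, name="MyCollection"):
    """Create a collection with an explicit text[] property and a text property."""
    # Imported here so the sketch loads even where the client is not installed
    from weaviate.classes.config import DataType, Property, Tokenization

    return client.collections.create(
        name,
        properties=[
            # List of strings, stored as text[]
            Property(
                name="cities",
                data_type=DataType.TEXT_ARRAY,
                tokenization=Tokenization.WORD,
            ),
            # The same values joined into one string, stored as text
            Property(
                name="cities_string",
                data_type=DataType.TEXT,
                tokenization=Tokenization.WORD,
            ),
        ],
    )
```

It would be called as `create_places_collection(weaviate.connect_to_local())` against a running local instance. One possible advantage of the list form, as far as I know: with `Tokenization.FIELD` each list element (for example "New York") should be indexed as a single token, whereas the joined string would become one token as a whole.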

Let me know if this helps :slight_smile:

Dear @DudaNogueira thanks a lot.

I have built a small collection of 10,000 entries similar to yours (thank you ChatGPT).

With this larger collection I ran several timing tests and, unless this is a random result, retrieving from the string is slightly faster than retrieving from the list.

Fetch Rome Rio from string took 0.12191414833068848 seconds to complete.
Fetch Rio Rome from string took 0.12638115882873535 seconds to complete.
Fetch Rome Rio from list took 0.1296849250793457 seconds to complete.
Fetch Rio Rome from list took 0.12851190567016602 seconds to complete.

The above is just one run. Can you think of any reason for this, or is it just random? Obrigado (thanks)!
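One way to separate noise from a real difference would be to repeat each query many times and compare medians rather than single runs. A minimal sketch, where `time_query` is a hypothetical helper and the lambda is only a stand-in workload (in practice it would wrap the `fetch_objects` call):

```python
import statistics
import time

def time_query(run_query, repeats=50, warmup=5):
    """Return (median, standard deviation) of run_query's wall time in seconds."""
    # Warm-up calls are discarded so caches and connections settle first
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), statistics.stdev(samples)

# Stand-in workload; replace with a function that runs the actual query
median, spread = time_query(lambda: sum(range(1000)), repeats=20)
print(f"median={median:.6f}s stdev={spread:.6f}s")
```

If the medians for string and list differ by less than the spread, the difference is probably not meaningful.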

hi there @rjalex

A list will indeed be very slightly slower than a string, because we have to run some extra code to process the list.

The difference is small, though, and shouldn't have much impact.

By the way, we are planning a large rewrite of the keyword search this year. Stay tuned :wink: