Weaviate use case with other languages

Can I use Weaviate with the Japanese language?

Hi!

It will depend on whether your LLM supports it.

Here is a snippet to demonstrate that, using OpenAI.

PS: This is a recipe I am writing for weaviate/recipes on GitHub (a repository of quick notebooks on how to use various Weaviate features). It will be published soon :wink:

# French: "What is the traditional food of this country?" — answer requested in Spanish
generateTask = "Quelle est la nourriture traditionnelle de ce pays ? Answer in Spanish"
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"

result = (
  client.query
  .get("WikipediaLangChain", "text")
  .with_generate(grouped_task=generateTask)
  .with_where({
      "operator": "Equal",
      "path": ["source"],
      "valueText": source_file
  })
  .with_near_text({
    "concepts": ["traditional food"]
  })
  .with_limit(5)
  .do()
)

The generated text, for instance, looks something like this:

{
     "_additional": {
      "generate": {
       "error": null,
       "groupedResult": "La comida tradicional de Brasil incluye farofa (harina de mandioca), papas fritas, yuca frita, plátanos fritos, carne frita y queso frito, que se consumen con frecuencia ..."
      }
...

So the text is indexed and vectorized in English, we asked a question in French, and requested the answer in Spanish!

Isn’t that cool? :sunglasses:

Let me know if this helps :slight_smile:

I have already stored some Japanese data in Weaviate. How do I query it so as to get matches related to the question I ask?

The query in the response above is not clear to me. I want to ask a question in Japanese and expect Weaviate to return the top 5 similar results for that question.

I am also getting the following error message when I try to query Weaviate:

“message”: “explorer: get class: vectorize params: vectorize params: vectorize params: vectorize keywords: vectorizing corpus ‘[\u6d6a\u901f\u5927\u5b66\u30d7\u30ed\u30b8\u30a7\u30af\u30c8\u306e\u30c7\u30a3\u30b9\u30af\u4f7f\u7528\u91cf\u304c\u5236\u9650\u3092\u8d85\u3048\u3066\u3044\u307e\u3059]’: all words in corpus were either stopwords or not present in the contextionary, cannot build vector”

I am using the contextionary module. Is there a vectorizer module specific to the Japanese language?

When I register Japanese text in Weaviate, the characters appear garbled, as shown below.

\u79c1\u306e\u540d\u524d\u306f\u9234\u6728(Suzuki)\u3067\u3059\u3002\u8da3\u5473\u306f\u91ce\u7403\u3067\u3059\u3002

Is it possible to register Japanese character strings in Weaviate?
Or does Weaviate not support Japanese?

Below is the program that was executed.

test_docs = [
    # My name is Suzuki. My hobby is baseball.
    "私の名前は鈴木(Suzuki)です。趣味は野球です。",
    # My name is Sato. My hobby is soccer.
    "私の名前は佐藤(Sato)です。趣味はサッカーです。",
    # My name is Tanaka. My hobby is tennis.
    "私の名前は田中(Tanaka)です。趣味はテニスです。"
]
with client.batch.configure(batch_size=5) as batch:
    for i, doc in enumerate(test_docs):  # Batch import data
        
        print(f"importing question: {i+1}")

        doc_encoded_text = doc.encode('utf-8').decode('utf-8')  # round-trip is a no-op; doc is already a str
        print(doc_encoded_text)

        properties = {
            "content": doc_encoded_text
        }
        
        batch.add_data_object(
            data_object=properties,
            class_name=weaviate_class_name
        )

# print() result
importing question: 1
私の名前は鈴木(Suzuki)です。趣味は野球です。
importing question: 2
私の名前は佐藤(Sato)です。趣味はサッカーです。
importing question: 3
私の名前は田中(Tanaka)です。趣味はテニスです。
count = client.query.aggregate(weaviate_class_name).with_meta_count().do()
json_print(count)

# result. Count and confirm that 3 Japanese sentences are registered.
{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 3
          }
        }
      ]
    }
  }
}
# 
result = (client.query
          .get("Question", ["content"])
          .with_limit(3)
          .do())

json_print(result)

# result. The 3 Japanese sentences are returned, but "content" shows escaped Unicode instead of Japanese characters
{
  "data": {
    "Get": {
      "Question": [
        {
          "content": "\u79c1\u306e\u540d\u524d\u306f\u9234\u6728(Suzuki)\u3067\u3059\u3002\u8da3\u5473\u306f\u91ce\u7403\u3067\u3059\u3002"
        },
        {
          "content": "\u79c1\u306e\u540d\u524d\u306f\u7530\u4e2d(Tanaka)\u3067\u3059\u3002\u8da3\u5473\u306f\u30c6\u30cb\u30b9\u3067\u3059\u3002"
        },
        {
          "content": "\u79c1\u306e\u540d\u524d\u306f\u4f50\u85e4(Sato)\u3067\u3059\u3002\u8da3\u5473\u306f\u30b5\u30c3\u30ab\u30fc\u3067\u3059\u3002"
        }
      ]
    }
  }
}

Ha, those are escaped Unicode characters, but I understand that this is suboptimal.
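To be clear, the data itself is intact: `\uXXXX` escapes denote exactly the same code points as the literal Japanese characters, so only the display differs. A quick check in Python, using the first sentence from your output:

```python
# "\uXXXX" escapes in a Python string literal denote the same code points
# as the literal Japanese characters, so the stored data is not corrupted;
# only the JSON printing is ASCII-escaped.
escaped = "\u79c1\u306e\u540d\u524d\u306f\u9234\u6728(Suzuki)\u3067\u3059\u3002\u8da3\u5473\u306f\u91ce\u7403\u3067\u3059\u3002"
literal = "私の名前は鈴木(Suzuki)です。趣味は野球です。"
print(escaped == literal)  # True: same string, different notation
```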

@iamleonie Maybe you can help with some blog content for Unicode (Japanese, etc) use cases?

@Abhishek_Joshi
According to the text2vec-contextionary docs, Japanese is not supported. There are many vectorizer modules available that support Japanese, e.g., text2vec-openai or text2vec-cohere.
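As a rough sketch (assuming the v3 Python client and an OpenAI API key configured on the Weaviate instance; the `Question` class and `content` property names are taken from the example above), switching the vectorizer might look like this:

```python
# Sketch: recreate the class with an OpenAI vectorizer, which handles
# Japanese, instead of the default text2vec-contextionary.
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "content", "dataType": ["text"]},
    ],
}
# client.schema.create_class(class_obj)  # requires a live Weaviate instance

# After re-importing the data, a Japanese near_text query should return
# the top 5 closest matches:
# result = (
#     client.query
#     .get("Question", ["content"])
#     .with_near_text({"concepts": ["趣味は野球"]})  # "hobby is baseball"
#     .with_limit(5)
#     .do()
# )
```

Note that you need to delete and re-import the data after changing the vectorizer, since the stored vectors were built by the old module.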

@Casper_Morow
The result of your last query returns the data in escaped Unicode format. To display the Japanese text in Japanese characters, you can print your response with the following code.

import json
print(json.dumps(result, indent=4, ensure_ascii=False))

By default, json.dumps() has ensure_ascii=True, which causes the response to be printed in escaped Unicode. By setting ensure_ascii=False, your response should be printed in Japanese characters.
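A minimal self-contained demonstration of the difference, using one of the sentences from the thread:

```python
import json

# With the default ensure_ascii=True, non-ASCII characters are escaped;
# with ensure_ascii=False, they are printed as-is.
data = {"content": "私の名前は鈴木(Suzuki)です。"}
print(json.dumps(data))                      # {"content": "\u79c1\u306e..."}
print(json.dumps(data, ensure_ascii=False))  # {"content": "私の名前は鈴木(Suzuki)です。"}
```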

For anyone interested, we now have a blog post on this topic: Using Weaviate with Non-English Languages | Weaviate - Vector Database
