Learn Japanese 🇯🇵 with Python

Takanori Suzuki

PyCon US 2024 / 2024 May 17

PyCon JP 2024 CfP is Open

  • 2024.pycon.jp

  • Proposal Deadline: May 31 (English is welcome!!)

  • Date: Sep 27-29

  • Place: Tokyo, Japan

Questions 🙋

Have you learned Japanese? 🙋‍♂️

Are you interested in Japanese? 🙋‍♀️

Japanese is difficult 🤔

  • 3 Types of Characters (Hiragana, Katakana, Kanji)

  • No Spaces between Words

  • Multiple Readings of Kanji

3 Types of Characters

Emoji

🐍

🍺

Hiragana

へび

びーる

Katakana

ヘビ

ビール

Kanji

蛇

麦酒
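One rough way to see the three scripts from Python is to ask the standard library's `unicodedata` module for each character's Unicode name. The sketch below is an illustration only: the helper name `script_of` is hypothetical, and it checks name prefixes, so it may misclassify unusual characters (for example, kanji outside the main CJK Unified Ideographs naming).

```python
import unicodedata


def script_of(char: str) -> str:
    """Roughly classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(char, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


# "snake" written in each of the three scripts
for word in ["へび", "ヘビ", "蛇"]:
    print(word, [script_of(c) for c in word])
```

The same word can be written in any of the three scripts, so a learner has to recognize all of them.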

No Spaces between Words

  • すもももももももものうち

  • すもも/も/もも/も/もも/の/うち

  • Both plums and peaches are kinds of peach
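To see why this sentence is hard to split, here is a minimal sketch of a naive greedy longest-match segmenter. The vocabulary `VOCAB` and the function `greedy_segment` are assumptions made up for this illustration; this is not how a real morphological analyzer works.

```python
# Toy vocabulary (an assumption for illustration, not a real dictionary).
VOCAB = {"すもも", "もも", "も", "の", "うち"}
MAX_LEN = max(len(word) for word in VOCAB)


def greedy_segment(text: str) -> list[str]:
    """Split text by always taking the longest vocabulary word that matches."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Unknown single characters fall through as one-character tokens.
            if candidate in VOCAB or length == 1:
                words.append(candidate)
                i += length
                break
    return words


print(greedy_segment("すもももももももものうち"))
# -> ['すもも', 'もも', 'もも', 'もも', 'の', 'うち']
```

The greedy split swallows both も particles, so it differs from the intended すもも/も/もも/も/もも/の/うち. Getting the right answer requires knowledge of grammar, not just a word list.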

Multiple Readings of Kanji

  • 日: day, sun

    • Japanese-style reading (kun'yomi): ひ(hi)、か(ka)

    • Chinese-style reading (on'yomi): にち(nichi)、じつ(jitsu)

  • 日曜日 (nichi you bi): Sunday

  • 前日 (zen jitsu): previous day

😨

Multiple Readings of Kanji

  • Same combination but different readings

  • 一日: first day, one day

    • 一日目 (ichi nichi me): Day 1

    • 一月一日 (ichi gatsu tsuitachi): Jan 1st

😱 😱

Multiple Readings of Kanji

  • Special readings of Kanji idioms

  • 今日 (kyou): today

  • 昨日 (kinou): yesterday

  • 明日 (asu): tomorrow

🤯 🤯 🤯

Learn Japanese with Python

No Spaces between Words

  • すもももももももものうち

  • すもも/も/もも/も/もも/の/うち

Japanese morphological analyzer

$ pip install sudachipy sudachidict_core

Word Segmentation

from sudachipy import Dictionary

tokenizer = Dictionary().create()

text = "すもももももももものうち"
words = [token.surface() for token in tokenizer.tokenize(text)]
print(words)
# -> ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']

Multiple Readings of Kanji

  • 今日は一月一日で日曜日

  • Today is January 1st, Sunday

Morphological Analysis

from sudachipy import Dictionary

tokenizer = Dictionary().create()

text = "今日"  # today
tokens = tokenizer.tokenize(text)

print(tokens[0].surface())  # -> 今日
print(tokens[0].reading_form())  # -> キョウ(kyou)
print(tokens[0].part_of_speech()[0])  # -> 名詞(noun)

Get Readings

from sudachipy import Dictionary

tokenizer = Dictionary().create()

text = "今日は一月一日で日曜日"
readings = []
for token in tokenizer.tokenize(text):
    readings.append(token.reading_form())
print(readings)
# -> ['キョウ', 'ハ', 'イチ', 'ガツ', 'ツイタチ', 'デ', 'ニチヨウビ']

Can’t read Katakana?

Convert Japanese Character

# Requires: pip install jaconv
import jaconv
from sudachipy import Dictionary

tokenizer = Dictionary().create()
text = "今日は一月一日で日曜日"

readings = []
for token in tokenizer.tokenize(text):
    readings.append(token.reading_form())

print(readings)
# -> ['キョウ', 'ハ', 'イチ', 'ガツ', 'ツイタチ', 'デ', 'ニチヨウビ']
print([jaconv.kata2hira(r) for r in readings])
# -> ['きょう', 'は', 'いち', 'がつ', 'ついたち', 'で', 'にちようび']
print([jaconv.kata2alphabet(r) for r in readings])
# -> ['kyou', 'ha', 'ichi', 'gatsu', 'tsuitachi', 'de', 'nichiyoubi']
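The katakana-to-hiragana step can also be sketched with the standard library alone, because the two Unicode blocks are laid out in parallel, 0x60 code points apart. This is a rough illustration under that assumption, not jaconv's actual implementation; the function name `kata_to_hira` is hypothetical.

```python
def kata_to_hira(text: str) -> str:
    """Shift katakana letters (U+30A1..U+30F6) down to their hiragana twins."""
    offset = 0x60  # distance between the katakana and hiragana blocks
    return "".join(
        chr(ord(ch) - offset) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


print(kata_to_hira("ニチヨウビ"))  # -> にちようび
print(kata_to_hira("ビール"))  # -> びーる
```

Characters outside that range, such as the long-vowel mark ー, are left unchanged.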

Want to hear audio? 🗣️

Text to Speech

from contextlib import closing
from pathlib import Path

import boto3

# Assumes AWS credentials and a default region are already configured.
polly = boto3.client("polly")
# I want to drink good beer today and tomorrow.
text = "今日も明日もおいしいビールを飲みたい"

# "Mizuki" is one of Amazon Polly's Japanese voices.
result = polly.synthesize_speech(
    Text=text, OutputFormat="mp3", VoiceId="Mizuki")

# Save the returned audio stream as an MP3 file.
with closing(result["AudioStream"]) as stream:
    Path("japanese.mp3").write_bytes(stream.read())

Sample app

[Figure: screenshot of the sample app]

Thank you 🙏

slides.takanory.net code

@takanory