How to learn Japanese w/ Python

Takanori Suzuki

PyCon APAC 2024 / 2024 Oct

Agenda ✅

Background and Motivation / Goal
Japanese is Difficult
Python supports Japanese leaning

Background and Motivation 🏞️

Background and Motivation

Developing School Textbook Web at work
- Japanese NLP to make it Easier to Learn
Python libs could help people Learn Japanese

Background and Motivation(cont.)

FSI language difficulty
- Japanese is “super-hard languages” for English speakers to learn
- Mandarin, Cantonese, Korean and Arabic

Background and Motivation(cont.)

Call to Action

Goal

What is difficult about Japanese
How to use Japanese NLP libs and APIs
Python could support learning Japanese

Photos 📷 Share 🐦 👍

#pyconapac / @takanory

`slides.takanory.net` 💻

Who am I? 👤

Takanori Suzuki / 鈴木たかのり ( @takanory)
PyCon JP Association: Chair
BeProud Inc.: Director / Python Climber
Python Boot Camp, Python mini Hack-a-thon, Python Bouldering Club
Love: Ferrets, LEGO, 🍺 / Hobby: 🎺, 🧗‍♀️

takanory profile kuro-chan and kuri-chan

PyCon JP Association

www.pycon.jp

Nonprofit organization for Python users in Japan, to promote Python and supports its development. Further it is our goal to hold an annual PyCon JP.

PyCon JP Association

PyCon JP 2025

2025.pycon.jp
Date: 2025 Sep 26(Fri)-27(Sat)
Place: Hiroshima, Japan
There are English talks

Hiroshima? ⛩️

Fukushima - Hiroshima - Kyoto - Tokyo - Hokkaido
Direct flights to Hiroshima - HIJ, Japan
- Seoul, Taipei, Shanghai, Hong Kong, Dalian, Hanoi

Questions 🙋‍♂️

Have you learned Japanese? 🙋‍♀️

Are you interested in Japanese? 🙋‍♂️

Would you like to visit Japan? 🙋‍♀️ 🙋‍♂️

PyCon JP 2025

2025.pycon.jp
Date: 2025 Sep 26(Fri)-27(Sat)
Place: Hiroshima, Japan
There are English talks

Japanese is Difficult 😫

3 Types of Characters
No Spaces between Words
Multiple Readings of Kanji

3 Types of Characters

English	Snake	Beer
Pronounciation	hebi	biːru
Hiragana	へび	びーる
Katakana	ヘビ	ビール
Kanji	蛇	麦酒

No Spaces between Words

すもももももももものうち su mo mo mo mo mo mo mo mo no u chi

No Spaces between Words

すもももももももものうち

↓

すもも/も/もも/も/もも/の/うち

“Plums and peaches are part of peaches”

Multiple Readings of Kanji

日: day, sun

Multiple Readings of Kanji

2 styles of readings
Japanese-style reading(訓読みkun yomi)
Chinese-style reading(音読みon yomi)

Multiple Readings of Kanji

日: day, sun
Japanese-style reading: にち(nichi)、ひ(hi)
Chinese-style reading: じつ(jitsu)、か(ka)

Multiple Readings of Kanji

Japanese-style reading: にち(nichi)、ひ(hi)
Chinese-style reading: じつ(jitsu)、か(ka)
How to read?
- 日曜日 (Sunday)
- 前日 (Previous day)

Multiple Readings of Kanji

日nichi曜yo日bi (Sunday)
- Japanese-style reading: にち(nichi)、ひ(hi)
前zen日jitsu (Previous day)
- Chinese-style reading: じつ(jitsu)、か(ka)

Japanese is Difficult!! 😱

Python supports Japanese leaning

`<ruby>` HTML Tag 💎

What is Ruby ?

ルビruby characters are small annotation
Usually placed above the text
ref: Ruby character - Wikipedia
(Not a Programming Language)

`<ruby>` HTML Tag 💎

<ruby> represents small annotations
<rt> specifies the ruby text component

PyConPython Conference APACAsia Pacific 2024

<ruby>PyCon<rt>Python Conference</rt></ruby>
<ruby>APAC<rt>Asia Pacific</rt></ruby>
2024

ref: <ruby>: The Ruby Annotation element

Indicate pronunciation with `<ruby>`

Alphabet annotation: Pronounciation

パイコンpa i ko n えいぱっくe i pa kku (PyCon APAc)

<ruby>パイコン<rt>pa i ko n</rt></ruby>
<ruby>えいぱっく<rt>e i pa kku</rt></ruby>

Indicate pronunciation with `<ruby>`

Hiragana annotation: Readings
ふりがなfu ri ga na

インドネシアいんどねしあ共和国きょうわこく (Republic of Indonesia)

<ruby>インドネシア<rt>いんどねしあ</rt></ruby>
<ruby>共和国<rt>きょうわこく</rt></ruby>

Understand `<ruby>` Tag 💡

Hiragana and Katakana (あ / ア)

hebi / へび / ヘビ

Hiragana and Katakana

Hiragana and Katakana are phonogram
1 character represent a phoneme(speech sound)
- Like a Japanese alphabet
Hiragana: あかさたな a ka sa ta na…
Katakana: アカサタナ a ka sa ta na…

Hiragana and Katakana

Basically use Hiragana
- いんどねしあi n do ne shi a (Indonesia)
Katakana is used for foreign words
- パイコンpa i ko n (PyCon)

Romanization of Japanese (Romaji)

Alphabet to represent Japanese
Romaji is often used on Information Sign

Ikebukuro station

Learn Hiragana/Katakana using Romaji

jaconv

jaconv: interconverter for Hiragana, Katakana, alphabet and etc.

$ python3.12 -m venv env
$ . env/bin/activate
(env) pip install jaconv

>>> import jaconv
>>> jaconv.kana2alphabet("いんどねしあ")  # Hiragana -> alphabet
'indoneshia'
>>> jaconv.kata2alphabet("パイコン")  # Katakana -> alphabet
'paikon'

Add Romaji annotation

kana2roman.py

import sys
import jaconv

def kana2romaji(kana: str) -> str:
    """Convert Hiragana and Katakana to Romaji"""
    hiragana = jaconv.kata2hira(kana)  # Katakana -> Hiragana
    return jaconv.kana2alphabet(hiragana)  # Hiragana -> alphabet

def kana_with_romaji_ruby(kana: str) -> str:
    """Add romaji ruby to Kana text"""
    romaji = kana2romaji(kana)
    return f"<ruby>{kana}<rt>{romaji}</rt></ruby>"

if __name__ == "__main__":
    print(kana_with_romaji_ruby(sys.argv[1]))

Add Romaji annotation

(env) $ python kana2roman.py パイコンえいぱっく
<ruby>パイコンえいぱっく<rt>paikoneipakku</rt></ruby>

パイコンえいぱっくpaikoneipakku

Can read Hiragana and Katakana 🎉

No Spaces between Words

すもももももももものうち su mo mo mo mo mo mo mo mo no u chi

No Spaces between Words

Japanese has no spaces between words
Use Dictionary to Recognise words
Japanese Morphological Analyzer library required

Japanese Morphological Analyzer

see: taishi-i/awesome-japanese-nlp-resources

Japanese Morphological Analyzer

SudachiPy: pypi.org/project/SudachiPy
SudachiDcit: pypi.org/project/SudachiDict-core

(env) $ pip install sudachipy sudachidict_core

SudachiPy

Made with Rust, Very Fast
Three Types of Dictionaries
- Small: small vocabulary
- Core: basic vocabulary (default)
- Full: miscellaneous proper nouns

Word Segmentation

Split the words using Dictionary

>>> from sudachipy import Dictionary
>>> tokenizer = Dictionary().create()
>>> text = "すもももももももものうち"
>>> for token in tokenizer.tokenize(text):
...     print(token)
... 
すもも
も
もも
も
もも
の
うち

Cannot read Hiragana?

Word Segmentation with Romaji

word_segmentation.py

import sys
from sudachipy import Dictionary
from kana2roman import kana_with_romaji_ruby

tokenizer = Dictionary().create()

def word_segmentation(text: str) -> str:
    result = []
    for token in tokenizer.tokenize(text):
        result.append(kana_with_romaji_ruby(str(token)))
    return " / ".join(result)

if __name__ == "__main__":
    print(word_segmentation(sys.argv[1]))

Word Segmentation with Romaji

(env) $ python word_segmentation.py すもももももももものうち
<ruby>すもも<rt>sumomo</rt></ruby> / <ruby>も<rt>mo</rt></ruby> / <ruby>もも<rt>momo</rt></ruby> / <ruby>も<rt>mo</rt></ruby> / <ruby>もも<rt>momo</rt></ruby> / <ruby>の<rt>no</rt></ruby> / <ruby>うち<rt>uchi</rt></ruby>

すももsumomo / もmo / ももmomo / もmo / ももmomo / のno / うちuchi

Can split into Words 🎊

Multiple Readings of Kanji

日曜日nichi you bi、前日zen jitsu

Multiple Readings of Kanji

日: day, sun
Japanese-style reading(訓読みkun yomi): にちni chi, ひhi
Chinese-style reading(音読みon yomi): じつji tsu, かka

Multiple Readings of Kanji

日曜日 (Sunday): にちni chi ようyo u びbi
前日 (Previous day): ぜんze n じつji tsu

😨

Multiple Readings of Kanji idioms

Same combination but different readings
一日: first day, one day
- 一日目(Day 1)
- 一月一日(Jan 1st)

Multiple Readings of Kanji idioms

Same combination but different readings
一日: first day, one day
- 一日目(Day 1): いちにちi chi ni chi めme
- 一月一日(Jan 1st): いちがつi chi ga tsu ついたちtsu i ta chi

😱 😱

Special readings of Kanji idioms

今日 (today)
昨日 (yesterday)
明日 (tomorrow)

Special readings of Kanji idioms

今日 (today): きょうkyo u
昨日 (yesterday): きのうki no u
明日 (tomorrow): あしたa shi ta

🤯 🤯 🤯

Get Reading of Kanji

今日は一月一日で日曜日
Today is January 1st, Sunday

Get Reading of Kanji

Use SudachiPy and SudachiDict
reading_form(): Reading in Katakana

>>> from sudachipy import Dictionary
>>> tokenizer = Dictionary().create()
>>> text = "今日は一月一日で日曜日"
>>> for token in tokenizer.tokenize(text):
>>>     print(token, token.reading_form())
... 
今日 キョウ
は ハ
一 イチ
月 ガツ
一日 ツイタチ
で デ
日曜日 ニチヨウビ

Get Reading of Kanji

Cannot read Katakana? Use jaconv!

>>> import jaconv
>>> for token in tokenizer.tokenize(text):
...     reading = token.reading_form()
...     hiragana = jaconv.kata2hira(reading)
...     romaji = jaconv.kata2alphabet(reading)
...     print(f"{token}, {reading}, {hiragana}, {romaji}")
... 
今日, キョウ, きょう, kyou
は, ハ, は, ha
一, イチ, いち, ichi
月, ガツ, がつ, gatsu
一日, ツイタチ, ついたち, tsuitachi
で, デ, で, de
日曜日, ニチヨウビ, にちようび, nichiyoubi

Add Reading to Kanji

kanji_reading.py

import sys
import jaconv
from sudachipy import Dictionary

tokenizer = Dictionary().create()

def add_reading(text: str) -> str:
    """Add Hiranaga ruby to text"""
    result = ""
    for token in tokenizer.tokenize(text):
        ruby = jaconv.kata2hira(token.reading_form())
        result += f"<ruby>{token}<rt>{ruby}</rt></ruby>\n"
    return result

if __name__ == "__main__":
    print(add_reading(sys.argv[1]))

Add Reading to Kanji

今日きょうはは一いち月がつ一日ついたちでで日曜日にちようび

(env) $ python kanji_reading.py 今日は一月一日で日曜日
<ruby>今日<rt>きょう</rt></ruby>
<ruby>は<rt>は</rt></ruby>
<ruby>一<rt>いち</rt></ruby>
<ruby>月<rt>がつ</rt></ruby>
<ruby>一日<rt>ついたち</rt></ruby>
<ruby>で<rt>で</rt></ruby>
<ruby>日曜日<rt>にちようび</rt></ruby>

Add Reading to Kanji

kanji_reading_romaji.py

import sys
import jaconv
from sudachipy import Dictionary

tokenizer = Dictionary().create()

def add_reading(text: str) -> str:
    """Add Romaji ruby to text"""
    result = ""
    for token in tokenizer.tokenize(text):
        # ruby = jaconv.kata2hira(token.reading_form())
        ruby = jaconv.kata2alphabet(token.reading_form())
        result += f"<ruby>{token}<rt>{ruby}</rt></ruby>\n"
    return result

if __name__ == "__main__":
    print(add_reading(sys.argv[1]))

Add Reading to Kanji

今日kyou はha 一ichi 月gatsu 一日tsuitachi でde 日曜日nichiyoubi

(env) $ python kanji_reading_romaji.py 今日は一月一日で日曜日
<ruby>今日<rt>kyou</rt></ruby>
<ruby>は<rt>ha</rt></ruby>
<ruby>一<rt>ichi</rt></ruby>
<ruby>月<rt>gatsu</rt></ruby>
<ruby>一日<rt>tsuitachi</rt></ruby>
<ruby>で<rt>de</rt></ruby>
<ruby>日曜日<rt>nichiyoubi</rt></ruby>

Can read Kanji 🥳

Can read but Cannnot Pronouce 🗣️

Readings and Pronounciations are slightly different

Readings: ou / ei
Pronounciaciton: oo / ee
東京とうきょうtou kyou / 英語えいごei go

Text to Speech

Amazon Polly - AWS
- 5 million chars free per month for 12 months
Polly - Boto3 documentation

(env) $ pip install boto3

(env) $ export AWS_ACCESS_KEY_ID=AKIAYI...
(env) $ export AWS_SECRET_ACCESS_KEY=ZoWbpmi...
(env) $ export AWS_DEFAULT_REGION=ap-northeast-1

Text to Speech

text_to_speech.py

import sys
import boto3

polly = boto3.client("polly")

def text_to_speech(text: str) -> None:
    result = polly.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId="Mizuki")
    with open("japanese.mp3", "wb") as f:
        f.write(result["AudioStream"].read())

if __name__ == "__main__":
    text_to_speech(sys.argv[1])

Text to Speech

(env) $ python text_to_speech.py 東京、英語

Can pronounce Japanese 🥳🥳

Sample App

learn-jp-with-python/learn_jp_apac.py

Summary

Japanese is Difficult
- 3 Charcters, No spaces, Kanji readings
Python supports Japanese learning
- jaconv: Interconverter
- SudachiPy: Morphological analyzer
- Amazon Polly: Text to Speech

🇯🇵 ❤️

Learn Japanese with Python

Thank you / Terima kasih 🙏

slides.takanory.net code

@takanory takanory takanory takanory

takanory profile kuro-chan and kuri-chan

How to learn Japanese w/ Python

Agenda ✅

Background and Motivation 🏞️

Background and Motivation

Background and Motivation(cont.)

Background and Motivation(cont.)

Goal

Photos 📷 Share 🐦 👍

slides.takanory.net 💻

Who am I? 👤

PyCon JP Association

PyCon JP 2025

Hiroshima? ⛩️

Questions 🙋‍♂️

Have you learned Japanese? 🙋‍♀️

Are you interested in Japanese? 🙋‍♂️

Would you like to visit Japan? 🙋‍♀️ 🙋‍♂️

PyCon JP 2025

Japanese is Difficult 😫

3 Types of Characters

No Spaces between Words

No Spaces between Words

Multiple Readings of Kanji

Multiple Readings of Kanji

Multiple Readings of Kanji

Multiple Readings of Kanji

Multiple Readings of Kanji

Japanese is Difficult!! 😱

Python supports Japanese leaning

<ruby> HTML Tag 💎

What is Ruby ?

<ruby> HTML Tag 💎

Indicate pronunciation with <ruby>

Indicate pronunciation with <ruby>

Understand <ruby> Tag 💡

Hiragana and Katakana (あ / ア)

Hiragana and Katakana

Hiragana and Katakana

Romanization of Japanese (Romaji)

jaconv

Add Romaji annotation

Add Romaji annotation

Can read Hiragana and Katakana 🎉

No Spaces between Words

No Spaces between Words

Japanese Morphological Analyzer

Japanese Morphological Analyzer

SudachiPy

Word Segmentation

Word Segmentation with Romaji

Word Segmentation with Romaji

Can split into Words 🎊

Multiple Readings of Kanji

Multiple Readings of Kanji

Multiple Readings of Kanji

😨

Multiple Readings of Kanji idioms

Multiple Readings of Kanji idioms

😱 😱

Special readings of Kanji idioms

Special readings of Kanji idioms

🤯 🤯 🤯

Get Reading of Kanji

Get Reading of Kanji

Get Reading of Kanji

Add Reading to Kanji

Add Reading to Kanji

Add Reading to Kanji

Add Reading to Kanji

Can read Kanji 🥳

Can read but Cannnot Pronouce 🗣️

Readings and Pronounciations are slightly different

Text to Speech

Text to Speech

Text to Speech

Can pronounce Japanese 🥳🥳

Sample App

Summary

🇯🇵 ❤️

Thank you / Terima kasih 🙏

`slides.takanory.net` 💻

`<ruby>` HTML Tag 💎

`<ruby>` HTML Tag 💎

Indicate pronunciation with `<ruby>`

Indicate pronunciation with `<ruby>`

Understand `<ruby>` Tag 💡