How to learn Japanese w/ Python
Takanori Suzuki
PyCon US 2025 / 2025 May 16
Agenda ✅
Background and Motivation / Goal
Japanese is Difficult
Python supports Japanese learning
PyCon US 2024
Lightning Talk on the same idea
I will talk in more detail
Background and Motivation 🏞️
Background and Motivation
Developing School Textbook Web at work
Japanese NLP to make it Easier to Learn
Python libs could help people Learn Japanese
Background and Motivation(cont.)
Japanese is one of the “super-hard languages” for English speakers to learn
Category V* (More than 88 weeks)
Language difficulty rankings (for native English speakers)
by u/Homesanto in r/MapPorn
Goal
What is difficult about Japanese
How to use Japanese NLP libs and APIs
How Python can support Japanese learning
Photos 📷 Share 🐦 👍
#pyconus / @takanory
slides.takanory.net
💻

Who am I? 👤
Takanori Suzuki / 鈴木 たかのり ( @takanory)
PyCon JP Association: Chair
BeProud Inc.: Director / Python Climber
Python Boot Camp, Python mini Hack-a-thon, Python Bouldering Club
Love: Ferrets, LEGO, 🍺 / Hobby: 🎺, 🧗♀️
PyCon JP Association
Nonprofit organization for Python users in Japan that promotes Python and supports its development. Another goal is to hold the annual PyCon JP.
PyCon JP 2025
Date: 2025 Sep 26(Fri)-27(Sat)
Place: Hiroshima, Japan
There are English talks

Hiroshima? ⛩️
Fukuoka - Hiroshima - Kyoto - Tokyo - Hokkaido
Direct flights to Hiroshima - HIJ, Japan
Seoul, Taipei, Shanghai, Hong Kong, Dalian, Hanoi

Questions 
Have you learned Japanese? 
Are you interested in Japanese? 
Would you like to visit Japan? 
PyCon JP 2025
2025 Sep 26(Fri)-27(Sat)
Hiroshima, Japan

Japanese is Difficult 
3 Types of Characters
No Spaces between Words
Multiple Readings of Kanji
3 Types of Characters
| English | Peach(🍑) | Snake(🐍) |
|---|---|---|
| Pronunciation | momo | hebi |
| Hiragana | もも | へび |
| Katakana | モモ | ヘビ |
| Kanji | 桃 | 蛇 |
No Spaces between Words
すもももももももものうち
↓
すもも/も/もも/も/もも/の/うち
“Plums and peaches are part of peaches”
Multiple Readings of Kanji
人: person, people
Multiple Readings of Kanji
2 styles of readings
Japanese-style reading(訓読み)
Chinese-style reading(音読み)
Multiple Readings of Kanji
人: person, people
Japanese-style reading: ひと(hito)、びと(bito)
Chinese-style reading: じん(jin)、にん(nin)
Multiple Readings of Kanji
Japanese-style reading: ひと(hito)、びと(bito)
Chinese-style reading: じん(jin)、にん(nin)
Can you read?
小人 (Small person)
日本人 (Japanese)
Multiple Readings of Kanji
小人 (Small person)
Japanese-style reading: ひと(hito)、びと(bito)
日本人 (Japanese)
Chinese-style reading: じん(jin)、にん(nin)
Japanese is Difficult!! 
3 Types of Characters
No Spaces between Words
Multiple Readings of Kanji
Python supports Japanese learning
<ruby> HTML Tag 💎
What is Ruby ?
ルビ characters are small annotation [1]
Usually placed above the text
(Not a Programming Language)
<ruby> HTML Tag 💎
<ruby>: represents small annotations [2]
<rt>: specifies the ruby text component
PyCon US 2025
<ruby>PyCon<rt>Python Conference</rt></ruby>
<ruby>US<rt>United States</rt></ruby>
2025
Indicate pronunciation with <ruby>
Alphabet annotation: Pronunciation
パイコン あめりか (PyCon America)
<ruby>パイコン<rt>pa i ko n</rt></ruby>
<ruby>あめりか<rt>a me ri ka</rt></ruby>
Indicate pronunciation with <ruby>
Hiragana annotation: Readings
ふりがな
アメリカ 合衆国 (The United States of America)
<ruby>アメリカ<rt>あめりか</rt></ruby>
<ruby>合衆国<rt>がっしゅうこく</rt></ruby>
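The markup above can also be generated from Python. The following is a minimal sketch using only the standard library; the helper name `ruby` is hypothetical and not part of the talk's sample code.

```python
from html import escape


def ruby(base: str, annotation: str) -> str:
    """Wrap base text and its annotation in <ruby>/<rt> markup."""
    # escape() protects against "<", ">" and "&" appearing in either argument
    return f"<ruby>{escape(base)}<rt>{escape(annotation)}</rt></ruby>"


if __name__ == "__main__":
    # Reproduces the two annotated words from the slide
    print(ruby("アメリカ", "あめりか"))
    print(ruby("合衆国", "がっしゅうこく"))
```

Escaping is a design choice for safety when the text comes from user input; the slide examples contain no special characters, so their output is unchanged.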
Figured out <ruby> Tag
Hiragana and Katakana (あ / ア)
Snake(🐍) / hebi / へび / ヘビ
Hiragana and Katakana
Hiragana and Katakana are phonograms
1 character represents a speech sound (roughly one syllable)
Like a Japanese alphabet
Hiragana: あかさたな…
Katakana: アカサタナ…
Hiragana and Katakana
Basically use Hiragana
あめりか (America)
Katakana is used for foreign words
パイコン (PyCon)
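Hiragana and Katakana are two spellings of the same sounds, and Unicode places the two blocks at a fixed distance from each other. The sketch below converts between them with the standard library only; the function names are hypothetical, and a library such as jaconv handles more cases (for example half-width Katakana).

```python
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096  # ぁ .. ゖ
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = KATAKANA_FIRST - HIRAGANA_FIRST   # 0x60


def hira_to_kata(text: str) -> str:
    """Shift every Hiragana character to the matching Katakana."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch  # leave everything else (Kanji, spaces, ー) untouched
        for ch in text
    )


def kata_to_hira(text: str) -> str:
    """Shift every Katakana character to the matching Hiragana."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hira_to_kata("へび"))
    print(kata_to_hira("モモ"))
```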
Romanization of Japanese (Romaji)
Alphabet to represent Japanese
Romaji is often used on Information Sign
Learn Hiragana/Katakana using Romaji
jaconv
jaconv: interconverter for Hiragana, Katakana, alphabet, etc.
$ python3.12 -m venv env
$ . env/bin/activate
(env) $ pip install jaconv
>>> import jaconv
>>> jaconv.kana2alphabet("あめりか") # Hiragana -> alphabet
'amerika'
>>> jaconv.kata2alphabet("パイコン") # Katakana -> alphabet
'paikon'
Add Romaji annotation
kana2roman.py
import sys

import jaconv


def kana2romaji(kana: str) -> str:
    """Convert Hiragana and Katakana to Romaji"""
    hiragana = jaconv.kata2hira(kana)  # Katakana -> Hiragana
    return jaconv.kana2alphabet(hiragana)  # Hiragana -> alphabet


def kana_with_romaji_ruby(kana: str) -> str:
    """Add romaji ruby to Kana text"""
    romaji = kana2romaji(kana)
    return f"<ruby>{kana}<rt>{romaji}</rt></ruby>"


if __name__ == "__main__":
    print(kana_with_romaji_ruby(sys.argv[1]))
Add Romaji annotation
(env) $ python kana2roman.py "パイコン あめりか"
<ruby>パイコン あめりか<rt>paikon amerika</rt></ruby>
パイコン あめりか
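The output above wraps the whole phrase in a single <ruby> element. When the input already contains spaces, one possible variation is to annotate each word separately. This is a sketch only: `ruby_per_word` and `toy_romaji` are hypothetical names, and the tiny lookup table stands in for a real converter such as `kana2romaji()` so that the example runs without third-party packages.

```python
from collections.abc import Callable

# Toy table covering only the kana used in this example (an assumption,
# not a complete romanization table).
TOY_ROMAJI = {
    "パ": "pa", "イ": "i", "コ": "ko", "ン": "n",
    "あ": "a", "め": "me", "り": "ri", "か": "ka",
}


def toy_romaji(kana: str) -> str:
    """Romanize kana character by character with the toy table."""
    return "".join(TOY_ROMAJI.get(ch, ch) for ch in kana)


def ruby_per_word(text: str, to_romaji: Callable[[str], str]) -> str:
    """Wrap each whitespace-separated word in its own <ruby> element."""
    return " ".join(
        f"<ruby>{word}<rt>{to_romaji(word)}</rt></ruby>"
        for word in text.split()
    )


if __name__ == "__main__":
    print(ruby_per_word("パイコン あめりか", toy_romaji))
```

Passing the converter as a parameter keeps the wrapping logic independent of the romanization library.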
Can read Hiragana and Katakana 
No Spaces between Words
すもももももももものうち
No Spaces between Words
Japanese has no spaces between words
Use Dictionary to Recognise words
Japanese Morphological Analyzer library required
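To see why a dictionary alone is not enough, here is a naive greedy longest-match segmenter. It is a sketch for illustration, not how a morphological analyzer works internally; the function name and the five-word vocabulary are assumptions made for this example.

```python
def longest_match_segment(text: str, vocab: set[str]) -> list[str]:
    """Greedily take the longest vocabulary word at each position."""
    max_len = max(map(len, vocab))
    words: list[str] = []
    i = 0
    while i < len(text):
        word = text[i]  # fall back to a single character if nothing matches
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                word = candidate
                break
        words.append(word)
        i += len(word)
    return words


if __name__ == "__main__":
    vocab = {"すもも", "もも", "も", "の", "うち"}
    print(" / ".join(longest_match_segment("すもももももももものうち", vocab)))
    # Greedy matching yields すもも / もも / もも / もも / の / うち,
    # which loses the particle も ("also") between the nouns.
```

The greedy result differs from the correct split すもも/も/もも/も/もも/の/うち, which is why a morphological analyzer with grammatical knowledge is used instead of plain dictionary lookup.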
Japanese Morphological Analyzer

Japanese Morphological Analyzer
SudachiPy: pypi.org/project/SudachiPy
SudachiDict: pypi.org/project/SudachiDict-core
(env) $ pip install sudachipy sudachidict_core
SudachiPy
Made with Rust, Very Fast
Three Types of Dictionaries
Small: small vocabulary
Core: basic vocabulary (default)
Full: miscellaneous proper nouns
Word Segmentation
Split the words using Dictionary
>>> from sudachipy import Dictionary
>>> tokenizer = Dictionary().create()
>>> text = "すもももももももものうち"
>>> for token in tokenizer.tokenize(text):
...     print(token)
...
すもも
も
もも
も
もも
の
うち
Word Segmentation
word_segmentation.py
import sys

from sudachipy import Dictionary

tokenizer = Dictionary().create()


def word_segmentation(text: str) -> str:
    result = []
    for token in tokenizer.tokenize(text):
        word = str(token)
        result.append(word)
    return " / ".join(result)


if __name__ == "__main__":
    print(word_segmentation(sys.argv[1]))
Word Segmentation
(env) $ python word_segmentation.py すもももももももものうち
すもも / も / もも / も / もも / の / うち
すもも / も / もも / も / もも / の / うち
Cannot read Hiragana?
Word Segmentation with Romaji
word_segmentation_with_ruby.py
import sys

from sudachipy import Dictionary

from kana2roman import kana_with_romaji_ruby

tokenizer = Dictionary().create()


def word_segmentation(text: str) -> str:
    result = []
    for token in tokenizer.tokenize(text):
        word = kana_with_romaji_ruby(str(token))
        result.append(word)
    return " / ".join(result)


if __name__ == "__main__":
    print(word_segmentation(sys.argv[1]))
Word Segmentation with Romaji
(env) $ python word_segmentation_with_ruby.py すもももももももものうち
<ruby>すもも<rt>sumomo</rt></ruby> / <ruby>も<rt>mo</rt></ruby> / <ruby>もも<rt>momo</rt></ruby> / <ruby>も<rt>mo</rt></ruby> / <ruby>もも<rt>momo</rt></ruby> / <ruby>の<rt>no</rt></ruby> / <ruby>うち<rt>uchi</rt></ruby>
すもも / も / もも / も / もも / の / うち
Can split into Words 
Multiple Readings of Kanji
小人 (Small person)
日本人 (Japanese)
Multiple Readings of Kanji
人: person, people
🇯🇵 Japanese-style reading(訓読み):
ひと、びと
🇨🇳 Chinese-style reading(音読み):
じん、にん
Multiple Readings of Kanji
小人 (Small person): 🇯🇵 こ びと
日本人 (Japanese): 🇨🇳 に ほん じん
Multiple Readings of Kanji idioms
Same combination but different readings
一人: One person
一人 (One person)
一人前 (One serving)
Multiple Readings of Kanji idioms
Same combination but different readings
一人: One person
一人 (One person): ひとり 🇯🇵
一人前 (One serving): いちにん まえ 🇨🇳
Special readings of Kanji idioms
一 人 (One person): ひとり 🇯🇵
二 人 (Two people)
三 人 (Three people)
Special readings of Kanji idioms
一 人 (One person): ひとり 🇯🇵
二 人 (Two people): ふたり 🇯🇵
三 人 (Three people): さんにん 🇨🇳
Special readings of Kanji idioms
Other special readings
大人: おとな (Adult)
玄人: くろうと (Professional)
防人: さきもり (soldiers garrisoned at strategic posts in Kyushu in ancient times)
Get Reading of Kanji
一人の日本人の大人が一人前のラーメンを食べる
One Japanese adult eats one serving of ramen
Get Reading of Kanji
Use SudachiPy and SudachiDict again
reading_form(): Reading in Katakana
>>> from sudachipy import Dictionary
>>> tokenizer = Dictionary().create() # Make tokenizer
>>> text = "一人の日本人の大人が一人前のラーメンを食べる"
>>> for token in tokenizer.tokenize(text):  # Word segmentation
...     (str(token), token.reading_form())  # Get reading
...
('一人', 'ヒトリ')
('の', 'ノ')
('日本人', 'ニホンジン')
('の', 'ノ')
('大人', 'オトナ')
...
Get Reading of Kanji
Looks good
Cannot read Katakana?
('一人', 'ヒトリ')
('の', 'ノ')
('日本人', 'ニホンジン')
('の', 'ノ')
('大人', 'オトナ')
...
Get Reading of Kanji
Cannot read Katakana? Use jaconv!
>>> from jaconv import kata2hira, kata2alphabet
>>> for token in tokenizer.tokenize(text):
...     reading = token.reading_form()
...     hiragana = kata2hira(reading)  # to Hiragana
...     romaji = kata2alphabet(reading)  # to Alphabet(romaji)
...     (str(token), reading, hiragana, romaji)
...
('一人', 'ヒトリ', 'ひとり', 'hitori')
('の', 'ノ', 'の', 'no')
('日本人', 'ニホンジン', 'にほんじん', 'nihonjin')
('の', 'ノ', 'の', 'no')
('大人', 'オトナ', 'おとな', 'otona')
...
Can get Reading of Kanji
Add Reading to Kanji
kanji_reading.py
import sys

from jaconv import kata2hira
from sudachipy import Dictionary

tokenizer = Dictionary().create()  # create tokenizer


def add_reading(text: str) -> str:
    """Add Hiragana ruby to text"""
    result = ""
    for token in tokenizer.tokenize(text):
        ruby = kata2hira(token.reading_form())  # to Hiragana
        result += f"<ruby>{token}<rt>{ruby}</rt></ruby>\n"
    return result


if __name__ == "__main__":
    print(add_reading(sys.argv[1]))
Add Reading to Kanji
一人 の 日本人 の 大人 が 一人前 の ラーメン を 食べる
(env) $ python kanji_reading.py 一人の日本人の大人が一人前のラーメンを食べる
<ruby>一人<rt>ひとり</rt></ruby>
<ruby>の<rt>の</rt></ruby>
<ruby>日本人<rt>にほんじん</rt></ruby>
<ruby>の<rt>の</rt></ruby>
<ruby>大人<rt>おとな</rt></ruby>
...
Add Reading to Kanji
kanji_reading_romaji.py
import sys

from jaconv import kata2alphabet
from sudachipy import Dictionary

tokenizer = Dictionary().create()  # create tokenizer


def add_reading(text: str) -> str:
    """Add Romaji ruby to text"""
    result = ""
    for token in tokenizer.tokenize(text):
        # ruby = kata2hira(token.reading_form())  # to Hiragana
        ruby = kata2alphabet(token.reading_form())  # to Alphabet(romaji)
        result += f"<ruby>{token}<rt>{ruby}</rt></ruby>\n"
    return result


if __name__ == "__main__":
    print(add_reading(sys.argv[1]))
Add Reading to Kanji
一人 の 日本人 の 大人 が 一人前 の ラーメン を 食べる
(env) $ python kanji_reading_romaji.py 一人の日本人の大人が一人前のラーメンを食べる
<ruby>一人<rt>hitori</rt></ruby>
<ruby>の<rt>no</rt></ruby>
<ruby>日本人<rt>nihonjin</rt></ruby>
<ruby>の<rt>no</rt></ruby>
<ruby>大人<rt>otona</rt></ruby>
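To check how the generated markup looks, the ruby lines can be wrapped in a small HTML page and opened in a browser. This is a minimal sketch with the standard library; `to_html_page` and the output file name are hypothetical.

```python
from pathlib import Path


def to_html_page(ruby_markup: str, title: str = "Furigana preview") -> str:
    """Embed ruby markup in a minimal UTF-8 HTML document."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="ja">\n'
        f'<head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><p>{ruby_markup}</p></body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    markup = "<ruby>一人<rt>hitori</rt></ruby>\n<ruby>の<rt>no</rt></ruby>"
    # Write the page next to the script; open preview.html in a browser
    Path("preview.html").write_text(to_html_page(markup), encoding="utf-8")
```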
Can read Kanji 
Kanji level support 

What is the Japanese-Language Proficiency Test? Index | JLPT Japanese-Language Proficiency Test
Readings corresponding to Kanji levels 
Kanji list for each level
jiten has JLPT Kanji lists
Make JLPT Kanji level dict
make_jlpt_kanji_dict.py
import json
from urllib.request import urlopen

BASE_URL = "https://raw.githubusercontent.com/obfusk/jiten/refs/heads/master/jiten/res/jlpt/"

kanji = {}
for level in range(1, 6):  # level 1 to 5
    with urlopen(f"{BASE_URL}N{level}-kanji") as f:
        data = f.read().decode("utf-8")
    kanji[level] = data

with open("JLPT_kanji.json", "w", encoding="utf-8") as f:
    json.dump(kanji, f, ensure_ascii=False, indent=2)
Make JLPT Kanji level dict
Kanji dict is ready!!
% python make_jlpt_kanji_dict.py
{
"1": "丁丑且丘丙丞丹乃之乏乙也亀井亘亜亥亦亨享亭亮仁仙仮仰企伊伍伎伏伐伯伴伶伽但佐佑佳併侃侍侑価侮侯侵促俊俗保修俳俵俸倉倖倣倫倭倹偏健偲偵偽傍傑傘催債傷僕僚僧儀儒償允充克免典兼冒冗冠冴冶准凌凜凝凡凪凱凶凸凹刀刃刈刑削剖剛剣剤剰創功劣励劾勁勅勘勧勲勺匁匠匡匿升卑卓博卯即却卸厄厘厳又及叔叙叡句只叶司吉后吏吐吟呂呈呉哀哉哲唄唆唇唯唱啄啓善喚喝喪喬嗣嘆嘉嘱器噴嚇囚圏圭坑坪垂垣執培基堀堅堕堤堪塀塁塊塑塚塾墓墜墨墳墾壁壇壊壌士壮壱奇奈奉奎奏契奔奨奪奮奴如妃妄妊妙妥妨姫姻姿威娠娯婆婿媒媛嫁嫌嫡嬉嬢孔孟孤宏宗宙宜宣宥宮宰宴宵寂寅密寛寡寧審寮寸射尉尋尚尭就尺尼尽尾尿屈展属履屯岐岬岳峠峡峰峻崇崎崚崩嵐嵩嵯嶺巌巡巣巧己巳巴巽帆帝帥帳幕幣幹幻幽庄序庶康庸廃廉廊廷弁弊弐弓弔弘弥弦弧張弾彗彦彩彪彫彬彰影往征径徐従循微徳徴徹忌忍志応忠怜怠怪恒恕恨恩恭恵悌悔悟悠悦悼惇惑惜惟惣惨惰愁愉愚慈態慎慕慢慧慨慮慰慶憂憤憧憩憲憶憾懇懐懲懸我戒戯房扇扉扱扶批抄把抑抗択披抵抹抽拍拐拒拓拘拙拠拡括拳拷挑挙振挿据捷捺授掌排控推措掲描提揚握揮援揺搬搭携搾摂摘摩撃撤撮撲擁操擦擬攻故敏救敢敦整敵敷斉斎斐斗斜斤斥於施旋旗既旦旨旬旭旺昂昆昌昭是昴晃晋晏晟晨晶智暁暇暉暑暖暢暦暫曙曹朋朔朕朗朱朴朽杉李杏杜条松析枠枢架柄柊某染柚柳柾栓栗栞株核栽桂桃案桐桑桜桟梅梓梢梧梨棄棋棚棟棺椋椎検椰椿楊楓楠楼概榛槙槻槽標模樹樺橘檀欄欣欺欽款歓殉殊殖殴殻毅毬氏汁汐江汰汽沖沙没沢沼沿泌泡泣泰洞津洪洲洵洸派浄浜浦浩浪浸涯淑淡淳添渇渉渋渓渚渥渦湧源溝滅滉滋滑滝滞漂漆漏漠漫漬漱漸潔潜潟潤潮澄澪激濁濫瀬災炉炊炎為烈焦煩煮熊熙熟燎燦燿爵爽爾牧牲犠狂狩独狭猛猟猪献猶猿獄獣獲玄率玖玲珠班琉琢琳琴瑚瑛瑞瑠瑳瑶璃環甚甫甲畔畝異疎疫疾症痘痢痴癒癖皇皐皓盆益盛盟監盤盲盾眉看眸眺眼睡督睦瞬瞭瞳矛矢矯砕砲硝硫碁碑碧碩磁磯礁礎祉祐祥票禄禅禍禎秀秘租秦秩称稀稔稚稜稲稼稿穀穂穏穣穫穴窃窒窮窯竜竣端笙笛第笹筋策箇節範篤簿粋粗粘粛糖糧系糾紀紋納紗紘級紛素紡索紫紬累紳紺絃結絞絢統絹継綜維綱網綸綺綾緊緋締緩緯縁縄縛縦縫縮繁繊織繕繭繰罰罷羅羊義翁翔翠翻翼耀耐耗耶聖聡聴肇肖肝肢肥肪肺胆胎胞胡胤胴脅脈脚脩脱脹腐腸膜膨臨臭至致興舌舎舗舜舶艇艦艶芋芙芝芳芹芽苑苗茂茄茅茉茎茜荘莉莞菊菌菖菫華萌萩葬葵蒔蒼蓄蓉蓮蔦蕉蕗薦薪薫藍藤藩藻蘭虎虐虚虜虞虹蚊蚕蛇蛍蛮蝶融衆街衛衝衡衰衷衿袈裁裂裕裟裸製褐褒襟襲覆覇視覧訂討託訟訳訴診証詐詔評詠詢詩該詳誇誉誓誕誘誠誼諄請諒諭諮諾謀謁謄謙謝謡謹譜譲護豆豚豪貞貢貫貴賀賃賄賊賓賜賠賦購赦赳赴趣距跳践踏躍軌軸較載輔輝輩轄辰辱迅迪迫迭透逐逓逝逮逸遂遇遍遣遥遭遮遵遷遺遼避還邑那邦邪邸郁郎郡郭郷酉酌酔酢酪酬酵酷酸醜醸采釈釣鈴鉛鉢銃銑銘銭鋳鋼錘錠錦錬錯鍛鎌鎖鎮鏡鐘鑑閑閣閥閲闘阻阿附陛陣陥陪陰陳陵陶隆隊随隔障隠隣隷隼雄雅雌雛離雰雷需霊霜霞霧露靖鞠韻響項須頌頑頒頻顕顧颯飢飼飽飾養餓馨駄駆駒駿騎騒騰驚髄鬼魁魂魅魔鮎鮮鯉鯛鯨鳩鳳鴻鵬鶏鶴鷹鹿麗麟麻麿黎黙黛鼓\n",
"2": "並丸久乱乳乾了介仏令仲伸伺低依個倍停傾像億兆児党兵冊再凍刊刷券刺則副劇効勇募勢包匹区卒協占印卵厚双叫召史各含周咲喫営団囲固圧坂均型埋城域塔塗塩境央奥姓委季孫宇宝寺封専将尊導届層岩岸島州巨巻布希帯帽幅干幼庁床底府庫延弱律復快恋患悩憎戸承技担拝拾挟捜捨掃掘採接換損改敬旧昇星普暴曇替札机材村板林枚枝枯柔柱査栄根械棒森植極橋欧武歴殿毒比毛氷永汗汚池沈河沸油況泉泊波泥浅浴涙液涼混清減温測湖湯湾湿準溶滴漁濃濯灯灰炭焼照燃燥爆片版玉珍瓶甘畜略畳療皮皿省県短砂硬磨祈祝祭禁秒移税章童競竹符筆筒算管築簡籍粉粒糸紅純細紹絡綿総緑線編練績缶署群羽翌耕肌肩肯胃胸脂脳腕腰膚臓臣舟航般芸荒荷菓菜著蒸蔵薄虫血衣袋被装裏補複角触訓設詞詰誌課諸講谷豊象貝貨販貯貿賞賢贈超跡踊軍軒軟軽輪輸辛農辺述逆造郊郵量針鈍鉄鉱銅鋭録門防陸隅階隻雇雲零震革順預領額香駐骨麦黄鼻齢\n",
"3": "与両乗予争互亡交他付件任伝似位余例供便係信倒候値偉側偶備働優光全共具内冷処列初判利到制刻割加助努労務勝勤化単危原参反収取受号合向君否吸吹告呼命和商喜回因困園在報増声変夢太夫失好妻娘婚婦存宅守完官定実客害容宿寄富寒寝察対局居差市師席常平幸幾座庭式引当形役彼徒得御必忘忙念怒怖性恐恥息悲情想愛感慣成戦戻所才打払投折抜抱押招指捕掛探支放政敗散数断易昔昨晩景晴暗暮曲更最望期未末束杯果格構様権横機欠次欲歯歳残段殺民求決治法泳洗活流浮消深済渡港満演点然煙熱犯状猫王現球産由申留番疑疲痛登皆盗直相眠石破確示礼祖神福科程種積突窓笑等箱米精約組経給絵絶続緒罪置美老耳職育背能腹舞船良若苦草落葉薬術表要規覚観解記訪許認誤説調談論識警議負財貧責費資賛越路辞込迎返迷追退逃途速連進遅遊過達違遠適選部都配酒閉関降限除険陽際雑難雪静非面靴頂頭頼顔願類飛首馬髪鳴\n",
"4": "不世主事京仕代以会住体作使借元兄公写冬切別力勉動医去口古台同味品員問図地堂場売夏夕多夜妹姉始字安室家少屋工帰広店度建弟強待心思急悪意手持教文料新方旅族早明映春昼曜有服朝業楽歌止正歩死注洋海漢牛物特犬理用田町画界病発目真着知研社私秋究空立答紙終習考者肉自色花英茶親言計試買貸質赤走起足転近送通週運道重野銀開院集青音題風飯飲館駅験魚鳥黒\n",
"5": "一七万三上下中九二五人今休何先入八六円出前北十千午半南友右名四国土外大天女子学小山川左年後日時書月木本来東校母毎気水火父生男白百聞行西見話語読車金長間雨電食高\n"
}
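The JSON above maps each level to one long string of Kanji. For questions such as "which level is this Kanji?", it can help to invert it into a per-character index. The sketch below assumes the same structure (string keys "1" to "5", values ending in a newline); the function name and the small sample dict are made up for illustration.

```python
def build_level_index(kanji_by_level: dict[str, str]) -> dict[str, int]:
    """Map each Kanji to its JLPT level number (5 = easiest)."""
    index: dict[str, int] = {}
    for level, chars in kanji_by_level.items():
        for ch in chars.strip():  # strip() drops the trailing newline
            index[ch] = int(level)
    return index


if __name__ == "__main__":
    # Tiny sample in the same shape as JLPT_kanji.json
    sample = {"4": "勉強\n", "5": "日本語人\n"}
    index = build_level_index(sample)
    for ch in "日本語を勉強":
        print(ch, index.get(ch))  # None for characters not in the lists
```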
Get reading with Kanji level
-a: Alphabet annotation (default: Hiragana)
-l: Kanji level option
% python kanji_reading_with_level.py -h
usage: kanji_reading_with_level.py [-h] [-a] [-l {1,2,3,4,5}] text
Add Furigana to Japanese text
positional arguments:
text text to add furigana annotation
options:
-h, --help show this help message and exit
-a Alphabet(Romaji) annotation(default: Hiragana)
-l {1,2,3,4,5} set kanji level
Get reading with Kanji level
% python kanji_reading_with_level.py 日本語を勉強する
日本語 を 勉強 する (default)
% python kanji_reading_with_level.py -a 日本語を勉強する
日本語 を 勉強 する (Alphabet(romaji))
% python kanji_reading_with_level.py -l 5 日本語を勉強する
日本語 を 勉強 する (N5 level)
Parse arguments
Process -a and -l with argparse
Call add_reading() function
"""
usage: kanji_reading_with_level.py [-h] [-a] [-l {1,2,3,4,5}] text

Add Furigana to Japanese text

positional arguments:
  text            text to add furigana annotation

options:
  -h, --help      show this help message and exit
  -a              Alphabet(Romaji) annotation(default: Hiragana)
  -l {1,2,3,4,5}  set kanji level
"""

import argparse
import json
import re

from jaconv import kata2alphabet, kata2hira
from sudachipy import Dictionary

KANJI = r"[\u3005-\u3007\u4E00-\u9FFF]"  # Kanji pattern

tokenizer = Dictionary().create()  # create tokenizer


def get_kanji_set(level: str | None) -> set[str]:
    """Returns a set of Kanji at the specified JLPT level or easier"""
    if level is None:
        return set()
    with open("JLPT_kanji.json", encoding="utf-8") as f:
        kanji_level_dict = json.load(f)
    kanji_set = set()
    for l, kanji_list in kanji_level_dict.items():
        if l >= level:  # single-digit strings: "5" >= "4" (N5 is easier)
            kanji_set.update(set(kanji_list))
    return kanji_set


def is_ruby_required(surface: str, kanji_set: set[str]) -> bool:
    """Returns whether ruby is required"""
    if not kanji_set:  # no kanji set -> no level
        return True
    kanji_in_surface = set(re.findall(KANJI, surface))
    if not kanji_in_surface:  # word without kanji
        return False
    if kanji_in_surface <= kanji_set:  # Kanji within the level
        return False
    return True


def add_reading(text: str, level: str | None, alphabet: bool) -> str:
    """Add Furigana ruby to text"""
    kanji_set = get_kanji_set(level)
    result = ""
    for token in tokenizer.tokenize(text):
        reading = token.reading_form()
        if is_ruby_required(str(token), kanji_set):
            if alphabet:
                ruby = kata2alphabet(reading)  # to Alphabet
            else:
                ruby = kata2hira(reading)  # to Hiragana
            result += f"<ruby>{token}<rt>{ruby}</rt></ruby>\n"
        else:
            result += f"{token}\n"
    return result


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Add Furigana to Japanese text")
    parser.add_argument(
        "-a",
        action="store_true",
        help="Alphabet(Romaji) annotation(default: Hiragana)",
    )
    parser.add_argument("-l", choices="12345", help="set kanji level")
    parser.add_argument("text", help="text to add furigana annotation")
    args = parser.parse_args()
    result = add_reading(args.text, args.l, args.a)
    print(result)
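The script keeps the level as a string ("1" to "5"). One possible variation, shown here as a sketch and not as the talk's code, lets argparse convert the level to an integer. The `build_parser` helper is a hypothetical name, and the argument list is passed explicitly so the example runs without a command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the same command-line interface, with an integer level."""
    parser = argparse.ArgumentParser(description="Add Furigana to Japanese text")
    parser.add_argument(
        "-a",
        action="store_true",
        help="Alphabet(Romaji) annotation(default: Hiragana)",
    )
    # type=int converts "5" to 5 before it is checked against choices
    parser.add_argument("-l", type=int, choices=range(1, 6), help="set kanji level")
    parser.add_argument("text", help="text to add furigana annotation")
    return parser


# Equivalent to: python script.py -l 5 日本語を勉強する
args = build_parser().parse_args(["-l", "5", "日本語を勉強する"])
print(args.a, args.l, args.text)
```

With an integer level, later comparisons such as "this level or easier" no longer rely on string ordering.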
Get Kanji set with level
Get Kanji set with get_kanji_set(level)
Get Kanji set with level
Load "JLPT_kanji.json"
Create a set of Kanji at the specified level or easier
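JLPT levels run from N5 (easiest) to N1 (hardest), so "this level or easier" means a level number greater than or equal to the requested one. The sketch below makes that rule explicit with integer comparison. It assumes a dict shaped like JLPT_kanji.json; the function name and the sample data are hypothetical.

```python
def kanji_at_or_easier(kanji_by_level: dict[str, str], level: int) -> set[str]:
    """Collect the Kanji of the given level and of all easier levels."""
    known: set[str] = set()
    for key, chars in kanji_by_level.items():
        if int(key) >= level:  # N5 is easier than N4, so larger numbers qualify
            known.update(chars.strip())
    return known


if __name__ == "__main__":
    sample = {"3": "感情\n", "4": "勉強\n", "5": "日本語人\n"}
    print(sorted(kanji_at_or_easier(sample, 5)))  # N5 only
    print(sorted(kanji_at_or_easier(sample, 4)))  # N4 + N5
```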
Is ruby required?
Ruby / not Ruby with is_ruby_required()
Is ruby required?
Get all Kanji in the word -> kanji_in_surface
Check whether every Kanji is within the level (no ruby) or any is above it (ruby)
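The level check comes down to a set comparison: a word needs ruby only if it contains at least one Kanji that is not in the learner's known set. Here is a small worked example of that subset test, using a hand-picked sample of N5 Kanji. The `needs_ruby` name is hypothetical, and unlike `is_ruby_required()` this sketch does not treat an empty known set as "annotate everything".

```python
import re

KANJI = re.compile(r"[\u3005-\u3007\u4E00-\u9FFF]")  # same range as the script

# Hand-picked sample of N5 Kanji (an assumption, not the full list)
N5_SAMPLE = set("日本語人一")


def needs_ruby(word: str, known: set[str]) -> bool:
    """True if the word contains a Kanji outside the known set."""
    kanji_in_word = set(KANJI.findall(word))
    # "<=" on sets is the subset test: every Kanji in the word is known
    return bool(kanji_in_word) and not kanji_in_word <= known


if __name__ == "__main__":
    for word in ["日本語", "を", "勉強", "する"]:
        print(word, needs_ruby(word, N5_SAMPLE))
```

For an N5 learner, 日本語 stays plain while 勉強 (N4 Kanji) gets ruby, and kana-only words are never annotated.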
Add Ruby text
is_ruby_required() == True: add Ruby
alphabet: Alphabet or Hiragana (default)
Can handle Kanji level!! 
No Level: 日本語 を 勉強 する
N5: 日本語 を 勉強 する
N4: 日本語 を 勉強 する
Sample App 
% git clone https://github.com/takanory/learn-jp-with-python.git
% cd learn-jp-with-python/
% python3.12 -m venv env
% . env/bin/activate
(env) % pip install -r requirements.txt
(env) % streamlit run learn_jp_pyconus.py

Summary 
Japanese is Difficult
3 Characters, No spaces, Kanji readings
Python supports Japanese learning
jaconv: Interconverter
SudachiPy: Morphological analyzer
Kanji level support
🇯🇵 ❤️
Learn Japanese with Python
Thank you 
slides.takanory.net sample code
@takanory