Check whether this works or not. I found this website that seems to list all the characters in Unicode that might be used in Japanese text.
The corresponding regex (for a single character) would be:
/[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/
  -------------_____________-------------_____________-------------_____________
   Punctuation   Hiragana     Katakana    Full-width       CJK      CJK Ext. A
                                            Roman/      (Common &     (Rare)
                                          Half-width    Uncommon)
                                           Katakana
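As a quick way to try the character class out, here is a minimal sketch in JavaScript. The helper names `containsJapanese` and `isAllJapanese` are made up for illustration; only the regex itself comes from this answer.

```javascript
// Character class from this answer: matches one character from the listed ranges.
const japaneseChar = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/;

// Hypothetical helper: true if the string contains at least one matching character.
function containsJapanese(text) {
  return japaneseChar.test(text);
}

// Same class, anchored and repeated, so the whole (non-empty) string must match.
const japaneseOnly = new RegExp(`^${japaneseChar.source}+$`);

// Hypothetical helper: true if every character of a non-empty string matches.
function isAllJapanese(text) {
  return japaneseOnly.test(text);
}

console.log(containsJapanese("Hello, 世界")); // true
console.log(isAllJapanese("Hello, 世界"));    // false
console.log(isAllJapanese("こんにちは"));      // true
```

Note that the class also accepts full-width Roman letters and CJK punctuation, so "all Japanese" here only means "all characters fall in the listed ranges".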
The ranges are (as quoted from the site):
3000 - 303f: Japanese-style punctuation
3040 - 309f: Hiragana
30a0 - 30ff: Katakana
ff00 - ff9f: Full-width Roman characters and half-width Katakana
4e00 - 9faf: CJK unified ideographs – Common and uncommon Kanji
3400 - 4dbf: CJK unified ideographs Extension A – Rare Kanji
I have changed the ranges a bit:
- I have changed ff00 - ffef to ff00 - ff9f for full-width Roman characters and half-width Katakana. The code points ffa0 - ffdc contain half-width Hangul characters, which is not what you want. You may want to re-add the code points ffe0 - ffef, but they are mostly half-width punctuation or full-width currency symbols.
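To see what narrowing the range changes in practice, here is a small sketch that compares the site's range with the narrowed one on three sample characters. The variable names are just for illustration.

```javascript
// Full-width / half-width block as listed on the site, and as narrowed in this answer.
const siteRange = /[\uff00-\uffef]/;
const narrowedRange = /[\uff00-\uff9f]/;

const halfWidthKatakana = "\uff76"; // HALFWIDTH KATAKANA LETTER KA
const halfWidthHangul = "\uffa1";   // HALFWIDTH HANGUL LETTER KIYEOK
const fullWidthYen = "\uffe5";      // FULLWIDTH YEN SIGN

// Print each character with the result for the site range and the narrowed range.
for (const ch of [halfWidthKatakana, halfWidthHangul, fullWidthYen]) {
  console.log(ch, siteRange.test(ch), narrowedRange.test(ch));
}
```

The site range matches all three characters, while the narrowed range matches only the half-width Katakana. If you need the yen sign and similar symbols, you would have to add ffe0 - ffef back as a separate range.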
You can check the site and remove any range you don’t want, or that you are sure will not appear in your input.
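If you expect to add or drop ranges often, one option is to build the character class from a table instead of editing the regex by hand. This is only a sketch: the `RANGES` keys and the `buildCharClass` helper are names invented here, and the code points are the ones listed in this answer.

```javascript
// Range table copied from the list in this answer; the key names are made up.
const RANGES = {
  punctuation: [0x3000, 0x303f],
  hiragana: [0x3040, 0x309f],
  katakana: [0x30a0, 0x30ff],
  fullWidthAndHalfWidthKana: [0xff00, 0xff9f],
  kanji: [0x4e00, 0x9faf],
  kanjiExtA: [0x3400, 0x4dbf],
};

// Format a code point as a \uXXXX escape for use in a regex source string.
function escapeCodePoint(cp) {
  return "\\u" + cp.toString(16).padStart(4, "0");
}

// Hypothetical helper: build a single-character class from the chosen range names.
function buildCharClass(names) {
  const body = names
    .map((name) => {
      const [start, end] = RANGES[name];
      return `${escapeCodePoint(start)}-${escapeCodePoint(end)}`;
    })
    .join("");
  return new RegExp(`[${body}]`);
}

// Example: keep only the two kana ranges.
const kanaOnly = buildCharClass(["hiragana", "katakana"]);
console.log(kanaOnly.source);     // [\u3040-\u309f\u30a0-\u30ff]
console.log(kanaOnly.test("あ")); // true
console.log(kanaOnly.test("漢")); // false
```

Passing all six keys reproduces the character class given at the top of this answer.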