cjk – Tarik Billa

PHP – regular expression to check if the string has Chinese chars

January 1, 2024 by Tarik

What is the encoding of Chinese characters on Wikipedia?

December 24, 2023 by Tarik

>>> c = "\xe7\x9a\x84".decode('utf8')
>>> c
u'\u7684'
>>> print c
的

(This is a Python 2 session.) Although the code point U+7684 fits in 16 bits, UTF-8 encodes it as 3 bytes.
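The same check can be sketched in Python 3, where text (str) and encoded bytes are separate types. This is only an illustration of the byte counts described above; the variable names are made up for the example.

```python
# Python 3: str holds Unicode text, bytes holds an encoded form of it.
raw = b"\xe7\x9a\x84"        # the three UTF-8 bytes from the excerpt
char = raw.decode("utf-8")   # one character, code point U+7684

code_point = ord(char)
utf8_len = len(char.encode("utf-8"))        # bytes needed in UTF-8
utf16_len = len(char.encode("utf-16-be"))   # bytes needed in UTF-16 (no BOM)

print(char, hex(code_point), utf8_len, utf16_len)
```

The character takes two bytes (one 16-bit unit) in UTF-16 and three bytes in UTF-8.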

Flutter fetched Japanese character from server decoded wrong

May 25, 2023 by Tarik

If you look in Postman, you will probably see that the Content-Type HTTP header sent by the server is missing the charset parameter. This causes the Dart http client to decode the body as Latin-1 instead of UTF-8. There’s a simple workaround:

http.Response response = await http.get('SOME URL', headers: {'Content-Type': 'application/json'});
List<dynamic> responseJson = json.decode(utf8.decode(response.bodyBytes));
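The failure mode is not specific to Dart. As a rough illustration in Python (not the Dart client's actual code), the sketch below decodes the same bytes once as Latin-1 and once as UTF-8. The payload is a hypothetical response body invented for the example.

```python
import json

# Hypothetical response body: a JSON array with Japanese text, sent as UTF-8.
body_bytes = json.dumps(["日本語"], ensure_ascii=False).encode("utf-8")

# What a Latin-1 fallback yields: every byte becomes its own character.
garbled = body_bytes.decode("latin-1")

# Decoding the raw bytes as UTF-8 before parsing recovers the text.
decoded = json.loads(body_bytes.decode("utf-8"))

print(repr(garbled))
print(decoded)
```

Decoding the raw bytes explicitly, as the workaround does with response.bodyBytes, sidesteps whatever default the client picks when no charset is declared.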

Detect Windows font size (100%, 125%, and 150%)

May 23, 2023 by Tarik

The correct way of handling variable DPI settings is not to detect them and adjust your controls’ sizes manually in a switch statement (for starters, there are far more possibilities than those you show in your sample if statement). Instead, you should set the AutoScaleMode property of your form to AutoScaleMode.Dpi and let the framework … Read more

Convert or extract TTC font to TTF – how to? [closed]

April 18, 2023 by Tarik

Assuming that Windows doesn’t really know how to deal with TTC files (which I honestly find strange), you can “split” the combined fonts in an easy way if you use fontforge. The steps are: Download the file. Unzip it (e.g., unzip “STHeiti Medium.ttc.zip”). Load Fontforge. Open it with Fontforge (e.g., File > Open). Fontforge will … Read more

Java regex for support Unicode?

January 18, 2023 by Tarik

What you are looking for are Unicode properties, e.g. \p{L} is any kind of letter from any language. So a regex to match such a Chinese word could be something like \p{L}+. There are many such properties; for more details see regular-expressions.info. Another option is to use the modifier Pattern.UNICODE_CHARACTER_CLASS. In Java 7 there is … Read more
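Python's built-in re module does not support \p{L}, but the idea behind the property can be sketched with the standard unicodedata module, which reports each character's general category. The helper names is_letter and all_letters are hypothetical, and this is an approximation of the Java pattern rather than an equivalent.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")

def all_letters(text: str) -> bool:
    # Rough analogue of matching the whole string against \p{L}+ :
    # at least one character, and every character is a letter.
    return len(text) > 0 and all(is_letter(ch) for ch in text)

print(all_letters("中文"), all_letters("abc"), all_letters("中文1"))
```

Note that the letter property accepts letters from any script, so it matches Latin text as readily as Chinese.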

Language codes for simplified Chinese and traditional Chinese?

January 18, 2023 by Tarik

@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese: There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would … Read more

What’s the complete range for Chinese characters in Unicode?

December 26, 2022 by Tarik

The definitive list can be found at Unicode Character Code Charts; search the page for “CJK”. The “East Asian Script” document does mention:

Blocks Containing Han Ideographs

Han ideographic characters are found in five main blocks of the Unicode Standard, as shown in Table 18-1.

Table 18-1. Blocks Containing Han Ideographs
Block   Range   Comment
CJK … Read more
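A range-based check for Han characters might look like the sketch below. The block ranges are taken from the Unicode block listing as an assumption for this example. They are not a reproduction of the truncated table above, and later extension blocks are left out. The names HAN_BLOCKS, is_han and contains_han are hypothetical.

```python
# Assumed block ranges (inclusive); not exhaustive, later extensions omitted.
HAN_BLOCKS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

def is_han(ch: str) -> bool:
    # True if the character's code point falls inside any listed block.
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_BLOCKS)

def contains_han(text: str) -> bool:
    # True if at least one character in the text is a Han ideograph.
    return any(is_han(ch) for ch in text)

print(contains_han("hello 世界"), contains_han("ひらがな"))
```

Kana and Hangul sit in other blocks, so a check like this treats Japanese text as "Chinese" only when it contains kanji.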

What are the most common non-BMP Unicode characters in actual use? [closed]

December 20, 2022 by Tarik

Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter’s public stream. It occurs more frequently than the tilde!
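To make "non-BMP" concrete: a character is outside the Basic Multilingual Plane when its code point is above U+FFFF, which means UTF-16 needs a surrogate pair for it. A minimal Python sketch, with the hypothetical helper name is_non_bmp:

```python
def is_non_bmp(ch: str) -> bool:
    # The BMP covers U+0000..U+FFFF; anything above is a supplementary character.
    return ord(ch) > 0xFFFF

emoji = "\U0001F602"  # FACE WITH TEARS OF JOY

utf16_units = len(emoji.encode("utf-16-be")) // 2  # 16-bit code units
utf8_bytes = len(emoji.encode("utf-8"))

print(is_non_bmp(emoji), is_non_bmp("~"), utf16_units, utf8_bytes)
```

The emoji takes two UTF-16 code units and four UTF-8 bytes, while the tilde is a single unit in either encoding.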

How does Chrome decide what to highlight when you double-click Japanese text?

October 12, 2022 by Tarik

So it turns out V8 has a non-standard multi-language word segmenter, and it handles Japanese.

function tokenizeJA(text) {
  var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
  it.adoptText(text)
  var words = []
  var cur = 0, prev = 0
  while (cur < text.length) {
    prev = cur
    cur = it.next()
    words.push(text.substring(prev, cur))
  }
  return words
}

console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", … Read more