unicode – Tarik Billa

UTF-8 Continuation bytes

April 3, 2024 by Tarik

A continuation byte in UTF-8 is any byte whose top two bits are 10. Continuation bytes are the second and subsequent bytes of a multi-byte sequence. The following table may help:

Unicode code points   Encoding   Binary value
-------------------   --------   ------------
U+000000-U+00007f     0xxxxxxx   0xxxxxxx
U+000080-U+0007ff     110yyyxx   00000yyy xxxxxxxx
                      10xxxxxx
U+000800-U+00ffff     1110yyyy   yyyyyyyy xxxxxxxx
                      10yyyyxx
                      10xxxxxx
U+010000-U+10ffff     11110zzz   000zzzzz yyyyyyyy …
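The "top two bits are 10" rule can be checked with a single mask. The following is a minimal sketch using only the Go standard library; the helper names isContinuation and countCodePoints are made up for illustration and do not come from the original answer.

```go
package main

import "fmt"

// isContinuation reports whether b has the bit pattern 10xxxxxx.
// (Hypothetical helper name, chosen for this sketch.)
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// countCodePoints counts the bytes that are NOT continuation bytes.
// For well-formed UTF-8 this equals the number of encoded code points.
func countCodePoints(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if !isContinuation(s[i]) {
			n++
		}
	}
	return n
}

func main() {
	s := "h\u00e9llo" // U+00E9 is encoded as the two bytes 0xC3 0xA9
	for i := 0; i < len(s); i++ {
		fmt.Printf("%#02x continuation=%v\n", s[i], isContinuation(s[i]))
	}
	fmt.Println(countCodePoints(s))
}
```

Only 0xA9 is reported as a continuation byte here, because it is the second byte of the two-byte sequence for U+00E9.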

Remove diacritics using Go

March 11, 2024 by Tarik

You can use the libraries described in Text normalization in Go. Here’s an application of those libraries:

    // Example derived from: http://blog.golang.org/normalization
    package main

    import (
        "fmt"
        "unicode"

        "golang.org/x/text/transform"
        "golang.org/x/text/unicode/norm"
    )

    func isMn(r rune) bool {
        return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
    }

    func main() {
        t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
        result, _, _ …

How to change the color of a unicode character

January 4, 2024 by Tarik

No. The color is inherent to the character. There’s a LARGE BLUE CIRCLE as well (U+1F535 – 🔵), but no other colors were defined by the Unicode standard until Unicode 12.0 (2019) added orange, yellow, green, purple, and brown circles (U+1F7E0 to U+1F7E4).

Character code of unknown character-character, e.g. square or question mark romb

January 4, 2024 by Tarik

Unicode has two symbols for unknown characters:

□ (WHITE SQUARE, U+25A1) – Replaces a missing or unsupported Unicode character.
� (REPLACEMENT CHARACTER, U+FFFD) – Replaces an invalid or unrecognizable character. Indicates a Unicode error.

Sources
Quora – What symbol is the square box shown for non-representable Unicode characters?
FileFormat.Info – Unicode Character ‘WHITE SQUARE’ (U+25A1)
…
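To illustrate the "invalid or unrecognizable" case in practice, here is a small sketch showing how Go's standard library reports malformed UTF-8 as U+FFFD. The sanitize helper is a name invented for this example; utf8.RuneError and strings.ToValidUTF8 are standard library identifiers.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize replaces each run of invalid UTF-8 bytes with U+FFFD.
// (Hypothetical helper name, chosen for this sketch.)
func sanitize(s string) string {
	return strings.ToValidUTF8(s, string(utf8.RuneError))
}

func main() {
	bad := "ab\xffcd" // the byte 0xFF never appears in valid UTF-8

	// Decoding at the bad byte yields utf8.RuneError (U+FFFD) with size 1.
	r, size := utf8.DecodeRuneInString(bad[2:])
	fmt.Printf("%U size=%d\n", r, size)

	// %+q escapes non-ASCII, so the replacement character shows as \ufffd.
	fmt.Printf("%+q\n", sanitize(bad))
}
```

Note that this covers only decoding errors. Whether a renderer shows □ or some other box for a character it has no glyph for is up to the font and text stack, not the decoder.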

What is the difference between EM Dash &#151; and &#8212;?

January 2, 2024 by Tarik

&#151; is wrong. When you use numeric character references, the number refers to the Unicode code point. For numbers below 256 that is the same as the code point in ISO-8859-1. In 8859-1, character 151 is among the “C1 control codes”, and not a dash or any other visible character. The confusion arises because character 151 is …
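The point about code points can be checked directly: treating the two reference numbers as Unicode code points, 151 lands in the C1 control range and 8212 is U+2014. This is a minimal sketch with the Go standard library; isControlRef is a hypothetical helper name, and the sketch deliberately does not model how any particular HTML parser handles such references.

```go
package main

import (
	"fmt"
	"unicode"
)

// isControlRef reports whether the code point named by a numeric
// character reference value is a control character.
// (Hypothetical helper name, chosen for this sketch.)
func isControlRef(n int) bool {
	return unicode.IsControl(rune(n))
}

func main() {
	for _, n := range []int{151, 8212} {
		fmt.Printf("%d -> %U control=%v\n", n, rune(n), isControlRef(n))
	}
}
```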

Unicode Character for Funnel to Signify Filtering

December 27, 2023 by Tarik

Some of the most similar chars I’ve found so far: ∀, ∨, ∇, ▼, Y, Ⴤ, V, ᗊ, ⑂, ツ

Is there a “glyph not found” character?

November 29, 2023 by Tarik

From the Unicode Spec: http://unicode.org/charts/PDF/U25A0.pdf

U+25A1 □ WHITE SQUARE
    may be used to represent a missing ideograph
    → U+20DE $⃞ combining enclosing square

What is the difference between Unicode code points and Unicode scalars?

November 27, 2023 by Tarik

First let’s look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:

D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆.
D10 Code point: Any value in the Unicode codespace.
• A code point is also known as a code position.
…
D10a Code point type: Any of the seven fundamental …
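As a rough illustration of the distinction the title asks about: a code point is any integer in the codespace from D9, while a Unicode scalar value is, per the standard's definition, any code point that is not a surrogate (U+D800 to U+DFFF). The sketch below assumes that definition; the function names are invented for this example.

```go
package main

import "fmt"

// maxCodePoint is the upper end of the Unicode codespace (D9).
const maxCodePoint = 0x10FFFF

// isCodePoint reports whether v lies in the Unicode codespace (D10).
func isCodePoint(v int64) bool {
	return v >= 0 && v <= maxCodePoint
}

// isSurrogate reports whether v is a high- or low-surrogate code point.
func isSurrogate(v int64) bool {
	return v >= 0xD800 && v <= 0xDFFF
}

// isScalarValue reports whether v is a code point that is not a surrogate.
func isScalarValue(v int64) bool {
	return isCodePoint(v) && !isSurrogate(v)
}

func main() {
	for _, v := range []int64{0x41, 0xD800, 0x10FFFF, 0x110000} {
		fmt.Printf("%#x codePoint=%v scalar=%v\n", v, isCodePoint(v), isScalarValue(v))
	}
}
```

In this sketch 0xD800 is a code point but not a scalar value, and 0x110000 is neither.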

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

September 22, 2023 by Tarik

Computers are generally much better at dealing with data on 4-byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries. (I speculate there was also a reluctance to have a limit that was “only what we can currently imagine being useful” when …
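Both halves of the question can be seen in a few lines of Go, where a rune is a 32-bit integer and a []rune behaves like UTF-32 in memory. This is only a sketch: bitsNeeded and nthCodePoint are hypothetical helper names, and the sample string is arbitrary.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitsNeeded returns the minimum number of bits required to represent v.
func bitsNeeded(v uint32) int {
	return bits.Len32(v)
}

// nthCodePoint indexes a UTF-32-style slice directly: every element is
// one 4-byte unit, so no scanning for sequence boundaries is needed.
func nthCodePoint(units []rune, n int) rune {
	return units[n]
}

func main() {
	// The largest code point, U+10FFFF, fits in 21 bits.
	fmt.Println(bitsNeeded(0x10FFFF))

	// One rune (int32) per code point, regardless of how many bytes the
	// character would take in UTF-8.
	units := []rune("a\u00e9\U0001F535")
	fmt.Printf("%U\n", nthCodePoint(units, 2))
}
```

A packed 21-bit or 3-byte layout would save space, but indexing it would need shifts or unaligned loads instead of the plain slice access shown here.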

Unicode mirror character?

September 20, 2023 by Tarik

We’ve talked about attacks using the RLO (U+202E RIGHT TO LEFT OVERRIDE) character in the past, which shifts the ‘visual’ display of a string from the position it’s placed inside that string. So for example: document[U+202E]fdp.exe visually looks like documentexe.pdf I talked about these and other attacks of this sort here http://www.casaba.com/products/UCAPI/. In fact we’re …
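One defensive measure against the file name trick above is to flag names that contain bidirectional control characters before displaying them. The sketch below is an assumption about how such a check might look, not the approach taken by the tool linked in the post; hasBidiControl is a made-up name, while unicode.Bidi_Control is the standard library's table for that Unicode property.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasBidiControl reports whether s contains any character with the
// Bidi_Control property, such as U+202E RIGHT-TO-LEFT OVERRIDE.
// (Hypothetical helper name, chosen for this sketch.)
func hasBidiControl(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Bidi_Control, r) {
			return true
		}
	}
	return false
}

func main() {
	name := "document\u202Efdp.exe"
	fmt.Println(hasBidiControl(name))

	// %+q escapes non-ASCII characters, making the hidden override visible.
	fmt.Printf("%+q\n", name)
}
```

Rejecting every such name outright may be too strict for legitimate right-to-left text, so a real application would likely need a more nuanced policy.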