unicode – Page 7 – Tarik Billa

Really Good, Bad UTF-8 example test data [closed]

January 7, 2023 by Tarik

Check out Markus Kuhn’s UTF-8 decoder stress test

What Unicode characters represent “time”?

January 6, 2023 by Tarik

The following code points relate to clocks, watches, and other devices that indicate time:

⌚ U+231A WATCH
⌛ U+231B HOURGLASS
⏰ U+23F0 ALARM CLOCK
⏱ U+23F1 STOPWATCH
⏲ U+23F2 TIMER CLOCK
⏳ U+23F3 HOURGLASS WITH FLOWING SAND
⧖ U+29D6 WHITE HOURGLASS
⧗ U+29D7 BLACK HOURGLASS
📅 U+1F4C5 CALENDAR
📆 U+1F4C6 TEAR-OFF CALENDAR
🕐 U+1F550 …

Using awk to remove the Byte-order mark

December 31, 2022 by Tarik

Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means “in place”, and will update files without the need for redirections or weird tricks. …
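Where sed is not available, the same idea can be sketched in Python. This is only an illustration, not part of the original answer: the helper name strip_utf8_bom is made up here, and it works on bytes already read into memory rather than editing files in place.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present.

    Hypothetical helper for illustration; it only touches the start of
    the data, just like the '1 s/^.../' address in the sed command.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = b"\xef\xbb\xbfhello"
print(strip_utf8_bom(with_bom))       # BOM removed
print(strip_utf8_bom(b"hello"))       # unchanged, there was no BOM

# When decoding to text anyway, the "utf-8-sig" codec skips a leading BOM.
print(with_bom.decode("utf-8-sig"))
```

To apply this to a file, read it in binary mode, pass the contents through the helper, and write the result back.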

What’s the complete range for Chinese characters in Unicode?

December 26, 2022 by Tarik

The definitive list can be found at Unicode Character Code Charts; search the page for “CJK”. The “East Asian Script” document does mention:

Blocks Containing Han Ideographs

Han ideographic characters are found in five main blocks of the Unicode Standard, as shown in Table 18-1.

Table 18-1. Blocks Containing Han Ideographs

Block Range Comment
CJK …

What is the proper way to URL encode Unicode characters?

December 24, 2022 by Tarik

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, …
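As a small illustration of that rule, here is a sketch using Python's standard urllib.parse module. The sample string is made up; quote() encodes to UTF-8 by default and then percent-encodes every byte that is not in the unreserved set.

```python
from urllib.parse import quote, unquote

text = "naïve café"  # sample input, chosen only for illustration

# safe="" means nothing outside the unreserved set is left as-is,
# so the space becomes %20 as well.
encoded = quote(text, safe="")
print(encoded)  # na%C3%AFve%20caf%C3%A9

# Unreserved characters (letters, digits, "-", ".", "_", "~") pass through.
print(quote("A-z_0.9~", safe=""))  # A-z_0.9~

# Decoding reverses both steps: percent-decoding, then UTF-8 decoding.
print(unquote(encoded) == text)  # True
```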

What are the most common non-BMP Unicode characters in actual use? [closed]

December 20, 2022 by Tarik

Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter’s public stream. It occurs more frequently than the tilde!
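To see what “non-BMP” means in practice, here is a short Python sketch (an illustration only, not part of the original answer). A code point above U+FFFF lies outside the Basic Multilingual Plane, so it needs four bytes in UTF-8 and a surrogate pair in UTF-16.

```python
ch = "\U0001F602"  # FACE WITH TEARS OF JOY

# Anything above U+FFFF is outside the Basic Multilingual Plane.
print(ord(ch) > 0xFFFF)  # True

# Four bytes in UTF-8.
print(ch.encode("utf-8").hex(" "))  # f0 9f 98 82

# Two 16-bit code units (a surrogate pair) in UTF-16.
utf16 = ch.encode("utf-16-be")
print(utf16.hex(" ", 2))  # d83d de02
print(len(utf16) // 2)    # 2
```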

How does UTF-8 “variable-width encoding” work?

December 19, 2022 by Tarik

Each byte starts with a few bits that tell you whether it’s a single-byte code point, the start of a multi-byte code point, or a continuation of a multi-byte code point. Like this:

0xxx xxxx A single-byte US-ASCII code (from the first 128 characters)

The multi-byte code points each start with a few bits that essentially say “hey, you …
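The excerpt is cut off before the multi-byte patterns, so here is a minimal sketch of the full scheme in Python. The lead-byte layouts used below (110xxxxx, 1110xxxx, 11110xxx, with 10xxxxxx continuation bytes) come from the UTF-8 definition in RFC 3629, not from the truncated text, and the function name utf8_encode is made up for illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: single byte, same as US-ASCII
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Show the bit patterns for one character of each length.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F602"]:
    encoded = utf8_encode(ord(ch))
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {bits}")
    # Cross-check against Python's built-in encoder.
    assert encoded == ch.encode("utf-8")
```

Reading the leading bits of each printed byte shows the three cases from the answer: 0 for a single byte, 110/1110/11110 for the start of a sequence, and 10 for a continuation.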

complete, monospaced Unicode font? [closed]

November 28, 2022 by Tarik

Unicode is big. Really big. You just won’t believe how vastly hugely mind-bogglingly big it is. I mean, you might think it’s a long way down the codepage to ü, but that’s just peanuts to Unicode. I really doubt there’s any font in the world (monospaced or not) that has “complete” Unicode. The best you …

Print Directory & File Structure with icons for representation in Markdown [closed]

November 25, 2022 by Tarik

I followed an example in another repository and wrapped the directory structure within a pair of triple backticks (```):

```
project
│   README.md
│   file001.txt
│
└───folder1
│   │   file011.txt
│   │   file012.txt
│   │
│   └───subfolder1
│       │   file111.txt
│       │   file112.txt
│       │   …
│
└───folder2
    │   file021.txt
    │   file022.txt
```

What would be the Unicode character for big bullet in the middle of the character?

November 11, 2022 by Tarik

http://www.unicode.org is the place to look for symbol names.

● BLACK CIRCLE 25CF
⚫ MEDIUM BLACK CIRCLE 26AB
⬤ BLACK LARGE CIRCLE 2B24

or even:

🌑 NEW MOON SYMBOL 1F311

Good luck finding a font that supports them all. Only one shows up in Windows 7 with Chrome.
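The names can also be checked locally. As a rough sketch (not part of the original answer), Python's standard unicodedata module carries the official character names, so the list above can be reproduced without relying on font support:

```python
import unicodedata

# Code points taken from the list above, written as escapes so the
# script does not depend on the editor's font.
candidates = ["\u25cf", "\u26ab", "\u2b24", "\U0001F311"]

for ch in candidates:
    print(f"{unicodedata.name(ch)} {ord(ch):04X}")

# Going the other way: find a character from its official name.
print(unicodedata.lookup("BLACK LARGE CIRCLE") == "\u2b24")  # True
```

Whether a given terminal or browser can actually draw each glyph still depends on the installed fonts.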