Really Good, Bad UTF-8 example test data [closed]
Check out Markus Kuhn’s UTF-8 decoder stress test
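As a rough idea of what "bad" UTF-8 test data looks like, here is a minimal sketch that feeds a few malformed byte sequences to Python's strict decoder. The samples and their labels are my own hand-picked illustrations, not taken from Kuhn's file, and `is_rejected` is a hypothetical helper name.

```python
# Hand-picked malformed sequences (illustrative, not copied from Kuhn's test file).
MALFORMED = {
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded UTF-16 surrogate": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_rejected(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for label, data in MALFORMED.items():
    print(f"{label}: rejected={is_rejected(data)}")
```

A decoder that accepts any of these is worth a closer look; the full stress test covers many more cases.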
The following code points relate to clocks, watches, and other devices that indicate time:
⌚ U+0231A WATCH
⌛ U+0231B HOURGLASS
⏰ U+023F0 ALARM CLOCK
⏱ U+023F1 STOPWATCH
⏲ U+023F2 TIMER CLOCK
⏳ U+023F3 HOURGLASS WITH FLOWING SAND
⧖ U+029D6 WHITE HOURGLASS
⧗ U+029D7 BLACK HOURGLASS
📅 U+1F4C5 CALENDAR
📆 U+1F4C6 TEAR-OFF CALENDAR
🕐 U+1F550 …
Using GNU sed (on Linux or Cygwin):
    # Removing BOM from all text files in current directory:
    sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
    sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks. …
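If sed is not available, the same idea can be sketched in Python: drop the three bytes EF BB BF only when they sit at the very start of the data. The function name `strip_utf8_bom` is hypothetical, and this is an illustration rather than a drop-in replacement for the sed commands.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # b'hello'
print(strip_utf8_bom(b"hello"))              # b'hello' (unchanged)
```

When reading text rather than raw bytes, opening the file with `encoding="utf-8-sig"` makes Python skip a leading BOM for you.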
The definitive list can be found at Unicode Character Code Charts; search the page for “CJK”. The “East Asian Script” document does mention:
Blocks Containing Han Ideographs
Han ideographic characters are found in five main blocks of the Unicode Standard, as shown in Table 18-1.
Table 18-1. Blocks Containing Han Ideographs
Block Range Comment
CJK …
I would always encode in UTF-8. From the Wikipedia page on percent encoding: The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, …
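To make that concrete, here is a small sketch using Python's standard `urllib.parse.quote`: unreserved characters pass through untouched, and everything else is converted to its UTF-8 bytes and percent-encoded. The wrapper name `percent_encode_utf8` and the sample string are my own.

```python
from urllib.parse import quote, unquote


def percent_encode_utf8(text: str) -> str:
    """Percent-encode everything outside the unreserved set, using UTF-8 bytes."""
    # safe="" means even "/" is encoded; pass safe="/" to keep path separators.
    return quote(text, safe="", encoding="utf-8")


encoded = percent_encode_utf8("café/ü")
print(encoded)           # caf%C3%A9%2F%C3%BC
print(unquote(encoded))  # café/ü
```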
Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter’s public stream. It occurs more frequently than the tilde!
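For anyone wondering what "non-BMP" means in practice, a quick sketch: the code point is above U+FFFF, so UTF-16 needs a surrogate pair and UTF-8 needs four bytes. The helper name `is_non_bmp` is hypothetical.

```python
def is_non_bmp(ch: str) -> bool:
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


joy = "\U0001F602"  # FACE WITH TEARS OF JOY
print(is_non_bmp(joy))                 # True
print(is_non_bmp("~"))                 # False
print(joy.encode("utf-16-be").hex())   # d83dde02 (surrogate pair D83D DE02)
print(joy.encode("utf-8").hex())       # f09f9882 (four bytes)
```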
Each byte starts with a few bits that tell you whether it’s a single-byte code point, the start of a multi-byte code point, or a continuation of a multi-byte code point. Like this:
0xxx xxxx A single-byte US-ASCII code (from the first 128 characters)
The multi-byte code points each start with a few bits that essentially say “hey, you …
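The leading-bit idea can be sketched as a small classifier. The excerpt above is cut off, so the multi-byte patterns below come from the standard UTF-8 bit layout rather than from the answer itself; `classify_byte` and its labels are hypothetical. It only looks at bit patterns, so it does not catch bytes such as 0xC0 or 0xF5 that match a lead pattern but never appear in valid UTF-8.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits, per the UTF-8 layout."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single-byte (ASCII)"        # 0xxx xxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"               # 10xx xxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead of 2-byte sequence"    # 110x xxxx
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead of 3-byte sequence"    # 1110 xxxx
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead of 4-byte sequence"    # 1111 0xxx
    return "invalid"                        # 1111 1xxx never occurs in UTF-8


# "€" is U+20AC, encoded as E2 82 AC.
for byte in "€".encode("utf-8"):
    print(f"{byte:08b} {classify_byte(byte)}")
```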
Unicode is big. Really big. You just won’t believe how vastly hugely mind-bogglingly big it is. I mean, you might think it’s a long way down the codepage to ü, but that’s just peanuts to Unicode. I really doubt there’s any font in the world (monospaced or not) that has “complete” Unicode. The best you …
I followed an example in another repository and wrapped the directory structure within a pair of triple backticks (```):
```
project
│   README.md
│   file001.txt
│
└───folder1
│   │   file011.txt
│   │   file012.txt
│   │
│   └───subfolder1
│       │   file111.txt
│       │   file112.txt
│       │   …
│
└───folder2
    │   file021.txt
    │   file022.txt
```
http://www.unicode.org is the place to look for symbol names.
● BLACK CIRCLE 25CF
⚫ MEDIUM BLACK CIRCLE 26AB
⬤ BLACK LARGE CIRCLE 2B24
or even:
🌑 NEW MOON SYMBOL 1F311
Good luck finding a font that supports them all. Only one shows up in Windows 7 with Chrome.
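The names can also be looked up locally. Here is a short sketch using Python's standard `unicodedata` module, which ships a copy of the Unicode character database; `describe` is a hypothetical helper, and the results depend on the Unicode version bundled with your Python.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Format a code point as 'U+XXXX NAME' using the bundled Unicode database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


for cp in (0x25CF, 0x26AB, 0x2B24, 0x1F311):
    print(describe(cp))
```

Whether the glyphs actually render is still up to the fonts installed on your machine.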