Remove diacritics using Go

You can use the libraries described in Text normalization in Go. Here’s an application of those libraries: // Example derived from: http://blog.golang.org/normalization package main import ( “fmt” “unicode” “golang.org/x/text/transform” “golang.org/x/text/unicode/norm” ) func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks } func main() { t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC) result, _, _ … Read more

Regex and unicode

Use a subrange of [\u0000-\uFFFF] for what you want. You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.

In what JS engines, specifically, are toLowerCase & toUpperCase locale-sensitive?

Note: Please, note that I couldn’t test it! As per ECMAScript specification: String.prototype.toLowerCase ( ) […] For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping. The result … Read more

how std::u8string will be different from std::string?

Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is what is the difference between using char8_t-based strings vs. char-based strings. It really comes down to this: type-based encoding. Any char-based string (char*, char[], string, etc) may be encoded in UTF-8. But … Read more

Fast way to filter illegal xml unicode chars in python?

Recently we (Trac XmlRpcPlugin maintainers) have been notified of the fact that the regular expression above strips surrogate pairs on Python narrow builds (see th:comment:13:ticket:11050) . An alternative approach consists in using the following regex (see th:changeset:13729) . _illegal_unichrs = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F), (0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)] if … Read more

Identifier normalization: Why is the micro sign converted into the Greek letter mu?

There are two different characters involved here. One is the MICRO SIGN, which is the one on the keyboard, and the other is GREEK SMALL LETTER MU. To understand what’s going on, we should take a look at how Python defines identifiers in the language reference: identifier ::= xid_start xid_continue* id_start ::= <all characters in … Read more

Truncating unicode so it fits a maximum size when encoded for wire transfer

def unicode_truncate(s, length, encoding=’utf-8′): encoded = s.encode(encoding)[:length] return encoded.decode(encoding, ‘ignore’) Here is an example for a Unicode string where each character is represented with 2 bytes in UTF-8 and that would’ve crashed if the split Unicode code point wasn’t ignored: >>> unicode_truncate(u’абвгд’, 5) u’\u0430\u0431′