PHP: Replace umlauts with closest 7-bit ASCII equivalent in a UTF-8 string
iconv("utf-8", "ascii//TRANSLIT", $input); Extended example
Yes! It's a shame that I didn't know about such a simple thing, but that is because I'm not a Mac maniac; I live on several OSes at once. Once I found out that quote + symbol gives me an accented character, I realized what was happening. The fix was very easy: launch System Preferences, open the …
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII. How browsers represent encoded characters when displaying a URI is an entirely different matter; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and …
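To make the ASCII-only rule concrete, here is a minimal Java sketch (assuming Java 10 or later) of the two encodings mentioned above: percent-encoding of UTF-8 bytes for a path segment or query value, and Punycode for a host name. The class and method names are made up for illustration, and `bücher.example` is only a sample host. Note that `URLEncoder` targets form encoding and emits `+` for a space, so the sketch swaps that for `%20`.

```java
import java.net.IDN;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class UriAsciiSketch {

    // Percent-encode one path segment or query value as UTF-8 bytes.
    // URLEncoder produces form encoding ('+' for space), so '+' is
    // rewritten to "%20"; a literal '+' in the input is already "%2B".
    static String percentEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Convert an internationalized host name to its ASCII (Punycode) form.
    static String hostToAscii(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        System.out.println(percentEncode("Bj\u00f6rn M\u00fcller")); // Bj%C3%B6rn%20M%C3%BCller
        System.out.println(hostToAscii("b\u00fccher.example"));      // xn--bcher-kva.example
    }
}
```

A browser may still show the decoded form in its address bar; what goes over the wire is the ASCII form.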
I found a simpler approach, which works for me:
    \usepackage{listings}
    \lstset{
      literate={ö}{{\"o}}1
               {ä}{{\"a}}1
               {ü}{{\"u}}1
    }
I have done this recently in Java:
    import java.text.Normalizer;
    import java.util.regex.Pattern;
    public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
    private static String stripDiacritics(String str) {
        // Decompose accented letters into base letter + combining marks, then drop the marks.
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
This will do as you specified: stripDiacritics("Björn") = Bjorn, but it will fail on, for example, Białystok, because ł is not a base letter plus a diacritic: it has no canonical decomposition, so NFD leaves it unchanged. If …
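One possible way to cope with letters such as ł is to map them explicitly before normalizing. The sketch below is only an illustration under stated assumptions: the class name, the method name, and the small replacement table are hypothetical and far from exhaustive, and it assumes Java 9 or later for `Map.of`.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class DiacriticsWithFallback {

    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Illustrative, non-exhaustive table of letters that have no canonical
    // decomposition and therefore survive NFD untouched.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
            '\u0142', "l",   // Polish l with stroke
            '\u0141', "L",
            '\u00f8', "o",   // o with stroke
            '\u00d8', "O",
            '\u0111', "d",   // d with stroke
            '\u0110', "D");

    static String toAscii(String input) {
        // Step 1: replace letters that NFD cannot split.
        StringBuilder replaced = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = NO_DECOMPOSITION.get(c);
            replaced.append(mapped != null ? mapped : String.valueOf(c));
        }
        // Step 2: decompose the rest and drop the combining marks.
        String decomposed = Normalizer.normalize(replaced, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Bia\u0142ystok")); // Bialystok
        System.out.println(toAscii("Bj\u00f6rn"));     // Bjorn
    }
}
```

Characters that are neither in the table nor decomposable pass through unchanged, so the result is not guaranteed to be pure ASCII.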
Finally, I've solved it by using the Normalizer class.
    import java.text.Normalizer;
    public static String stripAccents(String s) {
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
        return s;
    }
Reposting my answer from "How do I remove diacritics (accents) from a string in .NET?" This method works fine in Java (purely for the purpose of removing diacritical marks, i.e. accents). It basically converts all accented characters into their unaccented counterparts followed by their combining diacritics. Now you can use a regex to strip off …
A correctly formatted UTF-8 file can have a byte order mark (BOM) as its first three octets: the hex values 0xEF, 0xBB, 0xBF. These octets serve to mark the file as UTF-8 (since they are not relevant as "byte order" information). If this BOM does not exist, the consumer/reader is left to infer the …
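Checking for those three octets is straightforward. The following Java sketch (class and method names are hypothetical) tests a byte array for the UTF-8 BOM and strips it if present; it does not attempt to guess the encoding when the BOM is missing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class Utf8BomSketch {

    // True if the data starts with the UTF-8 BOM octets EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading BOM; input without a BOM is returned as-is.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(hasUtf8Bom(withBom)); // true
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8)); // hi
    }
}
```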
Use java.text.Normalizer to handle this for you.
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    // or Normalizer.Form.NFKD for a more "compatible" decomposition
This will separate all of the accent marks from the characters. Then you just need to throw out every character that is not ASCII.
    string = string.replaceAll("[^\\p{ASCII}]", "");
If your …
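Put together, the two steps might look like the sketch below (the class and method names are hypothetical). It also shows two consequences worth knowing: NFKD additionally splits compatibility characters such as the "fi" ligature, and any character with no decomposition at all, such as ł, is deleted outright by the ASCII filter.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class AsciiFoldSketch {

    // Decompose with the given normalization form, then delete every
    // character outside the ASCII range (including the combining marks).
    static String fold(String input, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(input, form);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        System.out.println(fold("Bj\u00f6rn", Normalizer.Form.NFD));     // Bjorn
        System.out.println(fold("\ufb01n", Normalizer.Form.NFD));        // n   (ligature dropped)
        System.out.println(fold("\ufb01n", Normalizer.Form.NFKD));       // fin (ligature split)
        System.out.println(fold("Bia\u0142ystok", Normalizer.Form.NFD)); // Biaystok
    }
}
```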
I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
    static string RemoveDiacritics(string text) …