diacritics – Page 2 – Tarik Billa

PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string

May 2, 2023 by Tarik

iconv("utf-8", "ascii//TRANSLIT", $input); Extended example

MacOSX: how to disable accented characters input

April 3, 2023 by Tarik

Yes! It’s a shame that I didn’t know about such a simple thing, but that is because I’m not a mac-maniac; I live on several OSes at once. When I found out that quote + symbol gives me an accented character, I realized what was happening. This was very easy: Launch System Preferences, open the … Read more

Should I use accented characters in URLs?

March 3, 2023 by Tarik

There’s no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII. An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of ‘%20’. This is how IDN works too: punycoded strings are encoded and … Read more
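As a rough illustration of the point above, here is a minimal Java sketch of how an accented character can be carried in an ASCII-only URI: the path segment is percent-encoded as UTF-8 bytes, and the host name is converted to its Punycode form. The class name, the helper names, and the sample strings are made up for this example; `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.IDN;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class UriEncodingSketch {

    // Percent-encode one path segment as UTF-8 bytes.
    // URLEncoder targets form encoding, so a space comes out as '+';
    // a literal '+' in the input is already emitted as %2B, which makes
    // the replacement below safe.
    static String encodePathSegment(String segment) {
        return URLEncoder.encode(segment, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Convert an internationalized host name to its ASCII (Punycode) form.
    static String encodeHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "\u00e9" is e with acute accent, "\u00fc" is u with diaeresis.
        System.out.println(encodePathSegment("caf\u00e9 au lait")); // caf%C3%A9%20au%20lait
        System.out.println(encodeHost("b\u00fccher.example"));      // xn--bcher-kva.example
    }
}
```

A browser may still show the decoded form in its address bar; the encoded form is what goes over the wire.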

Listings in Latex with UTF-8 (or at least german umlauts)

February 25, 2023 by Tarik

I found a simpler approach, which works for me: \usepackage{listings} \lstset{ literate={ö}{{\"o}}1 {ä}{{\"a}}1 {ü}{{\"u}}1 }

Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

January 24, 2023 by Tarik

I have done this recently in Java: public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+"); private static String stripDiacritics(String str) { str = Normalizer.normalize(str, Normalizer.Form.NFD); str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll(""); return str; } This will do as you specified: stripDiacritics("Björn") = Bjorn but it will fail on, for example, Białystok, because the ł character is not diacritic. If … Read more
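To make the Björn and Białystok behavior concrete, here is a self-contained version of the snippet from the excerpt with a small `main` that runs both inputs. The class name is invented for the example, and the strings use Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical wrapper class around the method quoted in the excerpt.
class StripDiacriticsDemo {

    // Combining marks, modifier letters (Lm) and modifier symbols (Sk).
    static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits a precomposed letter into base letter + combining mark(s).
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Remove the marks, keeping the base letters.
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00f6" (o with diaeresis) decomposes to 'o' + U+0308, so the mark is stripped.
        System.out.println(stripDiacritics("Bj\u00f6rn"));       // Bjorn

        // "\u0142" (l with stroke) has no canonical decomposition,
        // so NFD leaves it alone and the result still contains it.
        String city = stripDiacritics("Bia\u0142ystok");
        System.out.println(city.equals("Bia\u0142ystok"));       // true
    }
}
```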

Easy way to remove accents from a Unicode string? [duplicate]

January 23, 2023 by Tarik

Finally, I’ve solved it by using the Normalizer class. import java.text.Normalizer; public static String stripAccents(String s) { s = Normalizer.normalize(s, Normalizer.Form.NFD); s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", ""); return s; }

Converting Symbols, Accent Letters to English Alphabet

November 30, 2022 by Tarik

Reposting my post from How do I remove diacritics (accents) from a string in .NET? This method works fine in java (purely for the purpose of removing diacritical marks aka accents). It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off … Read more

Microsoft Excel mangles Diacritics in .csv files?

October 18, 2022 by Tarik

A correctly formatted UTF-8 file can have a Byte Order Mark as its first three octets. These are the hex values 0xEF, 0xBB, 0xBF. These octets serve to mark the file as UTF-8 (since they are not relevant as “byte order” information). If this BOM does not exist, the consumer/reader is left to infer the … Read more
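A minimal Java sketch of the idea, assuming the consumer honors the BOM: the three octets 0xEF, 0xBB, 0xBF are prepended to the UTF-8 bytes of the CSV text. The class, method, and sample data are hypothetical; writing the resulting array to disk (for example with `Files.write`) is left out to keep the example free of file I/O.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CsvBomSketch {

    // The UTF-8 encoding of U+FEFF, used here purely as an encoding marker.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Return the UTF-8 bytes of the text, preceded by the BOM.
    static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        // "\u00fc" is u with diaeresis; it takes two bytes in UTF-8.
        byte[] bytes = withUtf8Bom("name,city\nJ\u00fcrgen,K\u00f6ln\n");
        // Print the first three octets in hex.
        System.out.printf("%02X %02X %02X%n", bytes[0], bytes[1], bytes[2]); // EF BB BF
    }
}
```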

Is there a way to get rid of accents and convert a whole string to regular letters?

September 29, 2022 by Tarik

Use java.text.Normalizer to handle this for you. string = Normalizer.normalize(string, Normalizer.Form.NFD); // or Normalizer.Form.NFKD for a more "compatible" deconstruction This will separate all of the accent marks from the characters. Then, you just need to compare each character against being a letter and throw out the ones that aren’t. string = string.replaceAll("[^\\p{ASCII}]", ""); If your … Read more
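The difference between the two normalization forms mentioned above can be seen with a short sketch. The class and method names are made up for the example; the two calls it relies on, `Normalizer.normalize` and `String.replaceAll`, are the ones quoted in the excerpt. Note that this approach drops any character that has no ASCII decomposition at all.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class AsciiFoldSketch {

    // Decompose with the given form, then drop everything outside ASCII.
    static String asciiFold(String s, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(s, form);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "\uFB01" is the single-character "fi" ligature, "\u00e9" is e with acute.
        String word = "\uFB01anc\u00e9";

        // NFD has no canonical decomposition for the ligature, so it is removed.
        System.out.println(asciiFold(word, Normalizer.Form.NFD));  // ance

        // NFKD applies the compatibility mapping ligature -> "fi" first.
        System.out.println(asciiFold(word, Normalizer.Form.NFKD)); // fiance

        // "\u0142" (l with stroke) does not decompose under either form,
        // so the whole letter is lost rather than turned into 'l'.
        System.out.println(asciiFold("Bia\u0142ystok", Normalizer.Form.NFD)); // Biaystok
    }
}
```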

How do I remove diacritics (accents) from a string in .NET?

September 13, 2022 by Tarik

I’ve not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others) static string RemoveDiacritics(string text) … Read more