Is there a set of “Lorem ipsums” files for testing character encoding issues?

The Wikipedia article on diacritics is pretty comprehensive, unfortunately you have to extract these characters manually. Also there might exist some mnemonics for each language. For instance in Polish we use:

Zażółć gęślą jaźń

which contains all 9 Polish diacritics in one correct sentence. Another useful search hint are pangrams: sentences using every letter of the alphabet at least once:

  • in Spanish, “El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja.” (all 27 letters and diacritics).

  • in Russian, “Съешь же ещё этих мягких французских булок, да выпей чаю” (all 33 Russian Cyrillic alphabet letters).

List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:

public interface NationalCharacters {
  String spanish();
  String russian();
  //...
}

library?

Leave a Comment