How to protect against diacritics such as Zalgo text

is there even a limit?!

Not intrinsically in Unicode. UAX #15 defines a ‘Stream-Safe’ text format that sets a limit of 30 combiners in a row… Unicode strings in general are not guaranteed to be Stream-Safe, but the limit can reasonably be read as a sign that the Unicode Consortium does not intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.
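A minimal sketch of that check in Python, using only the standard `unicodedata` module. The function names and the `max_combiners` parameter are made up for illustration, and treating "combiner" as any character whose general category starts with `M` (Mn, Mc, Me) is my assumption; counting only characters with a non-zero canonical combining class would be closer to the Stream-Safe definition but would miss marks such as Tibetan subjoined consonants.

```python
import unicodedata


def max_combining_run(text: str) -> int:
    """Longest run of consecutive combining marks in the NFD form of text."""
    longest = 0
    current = 0
    for ch in unicodedata.normalize("NFD", text):
        # Assumption: a "combiner" is any mark (general category Mn, Mc or Me).
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_acceptable(text: str, max_combiners: int = 8) -> bool:
    """True if no base character carries more than max_combiners marks."""
    return max_combining_run(text) <= max_combiners
```

With this, `is_acceptable("é")` passes (one combiner after decomposition), while a base letter followed by ten stacked accents is rejected. Rejecting the input outright is only one option; stripping the excess marks instead is a separate design choice not covered here.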

If you only care about common Western European languages, you can probably bring that down to 2. A practical limit is likely a compromise somewhere between 2 and 8.
