How to iterate over Unicode grapheme clusters in Rust?

You want to use the unicode-segmentation crate:

use unicode_segmentation::UnicodeSegmentation; // 1.5.0

fn main() {
    for g in "नमस्ते्".graphemes(true) {
        println!("- {}", g);
    }
}

(Playground. Note: the playground editor can’t properly handle this string, so the cursor position is wrong on that one line.)

With the crate version noted above (1.5.0), this prints:

- न
- म
- स्
- ते्

The true argument means that we want to iterate over extended grapheme clusters; passing false would iterate over legacy grapheme clusters instead. See the graphemes documentation for more information.


Segmentation into Unicode grapheme clusters was supported by the standard library at some point, but it was deprecated and then removed because of the size of the required Unicode tables. The de facto solution is now the unicode-segmentation crate. I think it’s really unfortunate that the standard library’s default way of iterating over a string (chars) yields code points, which carry little meaning on their own: counting them or splitting a string between them generally doesn’t make sense.
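
To illustrate the point about code points, here is a minimal sketch that uses only the standard library. The helper name std_counts is made up for this example. It shows that a single user-perceived character, written as a base letter plus a combining accent, counts as two chars:

```rust
// Hypothetical helper for illustration: returns (byte length, number of chars).
// `chars()` iterates over Unicode scalar values (code points), not grapheme clusters.
fn std_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é",
    // but the standard library sees two separate code points.
    let decomposed = "e\u{301}";
    let (bytes, chars) = std_counts(decomposed);
    println!("decomposed:  bytes = {}, chars = {}", bytes, chars);

    // The precomposed form U+00E9 looks the same on screen but is one code point.
    let precomposed = "\u{e9}";
    let (bytes, chars) = std_counts(precomposed);
    println!("precomposed: bytes = {}, chars = {}", bytes, chars);
}
```

Both strings look identical when displayed, yet their chars counts differ (2 versus 1), which is why counting or splitting by code point rarely matches what a reader would call a character.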
