You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).
That’s because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
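As a rough illustration of that counting rule, here is a minimal sketch. The name utf8_strlen is made up for this example, and it assumes the input is valid, NUL-terminated UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
   string by skipping continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid UTF-8; nothing is validated here. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Mask off the top two bits; 0x80 means "10", a continuation byte. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Note that this counts code points, not what a user would perceive as characters: a base letter followed by a combining accent still counts as two.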
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
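One way to apply that rule when slicing is to translate a code point index into a byte offset first. The sketch below does that; utf8_is_start and utf8_offset are illustrative names, not part of any library, and valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate: nonzero if byte b starts a code point,
   i.e. it looks like 0xxxxxxx or 11xxxxxx rather than 10xxxxxx. */
static int utf8_is_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Hypothetical helper: byte offset of the n-th code point (0-based)
   in s. If s holds n or fewer code points, the offset of the
   terminating NUL is returned instead. Assumes valid UTF-8. */
size_t utf8_offset(const char *s, size_t n)
{
    assert(s != NULL);
    size_t i = 0;
    while (s[i] != '\0') {
        if (utf8_is_start((unsigned char)s[i])) {
            if (n == 0)
                break;      /* this start byte begins the wanted code point */
            --n;
        }
        ++i;
    }
    return i;
}
```

Cutting a string only at offsets returned by a function like this keeps every multi-byte sequence intact.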
Your best bet, if you don’t want to use a third-party library, is to simply provide functions along the lines of:
utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
- the left sz UTF-8 bytes of a string.
- the sz UTF-8 bytes of a string, starting at pos.
- the rest of the UTF-8 bytes of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
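For what it's worth, here is one possible sketch of those three functions. It makes several assumptions that the prototypes above don't spell out: pos and sz are counted in code points rather than raw bytes, pos is 0-based, the functions return void, srcbuff is taken as const, the source is valid UTF-8, and destbuff is large enough for the result plus the terminating NUL. The utf8_skip helper is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the n-th code point (0-based)
   in s, or of the terminating NUL if s is shorter than that. */
static size_t utf8_skip(const char *s, size_t n)
{
    size_t i = 0;
    while (s[i] != '\0') {
        /* Only bytes that are not 10xxxxxx start a code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (n == 0)
                break;
            --n;
        }
        ++i;
    }
    return i;
}

/* Copy up to sz code points, starting at code point pos (assumed 0-based).
   The caller must supply a destbuff big enough for the result. */
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz)
{
    assert(destbuff != NULL && srcbuff != NULL);
    const char *start = srcbuff + utf8_skip(srcbuff, pos);
    size_t nbytes = utf8_skip(start, sz);   /* bytes spanned by sz code points */
    memcpy(destbuff, start, nbytes);
    destbuff[nbytes] = '\0';
}

/* Copy the leftmost sz code points. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    utf8mid(destbuff, srcbuff, 0, sz);
}

/* Copy everything from code point pos to the end of the string. */
void utf8rest(char *destbuff, const char *srcbuff, size_t pos)
{
    assert(destbuff != NULL && srcbuff != NULL);
    const char *start = srcbuff + utf8_skip(srcbuff, pos);
    memcpy(destbuff, start, strlen(start) + 1);   /* include the NUL */
}
```

In real code you would probably also pass the size of destbuff and return an error or the number of bytes written, so that an oversized slice can't overflow the buffer.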