utf8_bin vs. utf_unicode_ci

Question

It depends on what you need.

The utf8_bin collation compares strings based purely on their Unicode code point values. If all of the code points have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs. decomposed) or characters that are canonically equivalent but don’t have the same code point value. In some cases, using utf8_bin will result in strings not matching when you expect them to. Theoretically, utf8_bin is the fastest because no Unicode normalization is applied to the strings, but it may not be what you want.

utf8_general_ci applies Unicode normalization using language-specific rules and compares strings case-insensitively. utf8_general_cs does the same, but compares strings case-sensitively.

Leave a Comment Cancel reply