Is there a possibility when calling .ToUpper() that the new string requires more memory?

Question

MemoryExtensions.ToUpper returns -1 if the destination is too small.

The source code for ToUpper has this gem:

// Assuming that changing case does not affect length
if (destination.Length < source.Length)
    return -1;

There is no other point where -1 is returned; the function finishes with return source.Length;
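A minimal sketch of that behaviour from the caller's side, assuming the MemoryExtensions.ToUpper(ReadOnlySpan<char>, Span<char>, CultureInfo) overload. The input string and buffer names are only illustrative:

```csharp
using System;
using System.Globalization;

// Illustrative input; any string works.
string text = "hello";

char[] shortBuffer = new char[text.Length - 1]; // one char too small
char[] exactBuffer = new char[text.Length];     // same length as the source

// The length check fails, so nothing is written and -1 comes back.
int shortResult = text.AsSpan().ToUpper(shortBuffer, CultureInfo.InvariantCulture);

// The length check passes, so the call returns source.Length.
int exactResult = text.AsSpan().ToUpper(exactBuffer, CultureInfo.InvariantCulture);

Console.WriteLine(shortResult);             // -1
Console.WriteLine(exactResult);             // 5
Console.WriteLine(new string(exactBuffer)); // HELLO
```

A destination that is at least as long as the source is therefore always enough with the current implementation.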

So they’ve assumed it can’t happen. Whether they’re right is another question: if you find a counter-example I suggest you file a bug report on GitHub.

The docs for TextInfo (used later on in the code) say:

The returned string might differ in length from the input string. For more information on casing, refer to the Unicode Technical Report #21 “Case Mappings,” published by the Unicode Consortium (https://www.unicode.org/). The current implementation preserves the length of the string. However, this behavior is not guaranteed and could change in future implementations.
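If you want to look for a counter-example yourself, a brute-force check is cheap. The sketch below is only one way to do it: it tries every non-surrogate char on its own, plus a few hand-picked strings whose full Unicode uppercase mapping is longer than the input (ß, the ﬁ ligature, ŉ). The variable names are mine, not from the .NET source.

```csharp
using System;

int counterExamples = 0;

// Try every BMP char that is not a surrogate, as a one-char string.
for (int i = 0; i <= char.MaxValue; i++)
{
    char c = (char)i;
    if (char.IsSurrogate(c)) continue;

    string s = c.ToString();
    if (s.ToUpperInvariant().Length != s.Length ||
        s.ToLowerInvariant().Length != s.Length)
    {
        counterExamples++;
    }
}

// Strings containing characters that expand under full Unicode case mapping.
string[] candidates = { "stra\u00DFe", "\uFB01sh", "\u0149" };
foreach (string candidate in candidates)
{
    if (candidate.ToUpperInvariant().Length != candidate.Length)
        counterExamples++;
}

// Expected to be 0 as long as the implementation preserves length.
Console.WriteLine(counterExamples);
```

This only covers the invariant culture; a fuller search would repeat the loop for each culture of interest.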

To clarify further, the lengths here are counted in UTF-16 code units, not code points. We have not considered UTF-8 or UTF-32, as char is strictly a UTF-16 code unit.
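To make the units concrete: string.Length and Span<char>.Length count char values (UTF-16 code units), which is not the same as the number of code points or the number of bytes. A small illustration, using an emoji outside the BMP as an arbitrary example:

```csharp
using System;
using System.Text;

string grin = "\U0001F600"; // one code point outside the BMP

// Length counts UTF-16 code units, so the surrogate pair counts as 2.
int codeUnits = grin.Length;

// EnumerateRunes walks Unicode scalar values, so this counts 1.
int codePoints = 0;
foreach (Rune rune in grin.EnumerateRunes())
    codePoints++;

// The same text takes 4 bytes when encoded as UTF-8.
int utf8Bytes = Encoding.UTF8.GetByteCount(grin);

Console.WriteLine($"{codeUnits} {codePoints} {utf8Bytes}"); // 2 1 4
```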

The ICU documentation describes case mapping as follows:

Simple (Single-Character) Case Mapping

The general case mapping in ICU is non-language based and a 1 to 1 generic character map.

A character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective “simple” case mapping specified for the character in the Unicode Character Database (UnicodeData.txt). If a character has no mapping equivalent, the result is the character itself.

The APIs provided for the general case mapping, located in uchar.h file, handles only single characters of type UChar32 and returns only single characters. To convert a string to a non-language based specific case, use the APIs in either the unistr.h or ustring.h files with a NULL argument locale.

Full (Language-Specific) Case Mapping

There are different case mappings for different locales. For instance, unlike English, the character Latin small letter ‘i’ in Turkish has an equivalent Latin capital letter ‘I’ with dot above ( \u0130 ‘İ’).

Similar to the simple case mapping API, a character is considered to have a lowercase, uppercase or title case equivalent if there is a respective mapping specified for the character in the Unicode Character database (UnicodeData.txt). In the case where a character has no mapping equivalent, the result is the character itself.

To convert a string to a language based specific case, use the APIs in ustring.h and unistr.h with an intended argument locale.

ICU implements full Unicode string case mappings.

In general:

case mapping can change the number of code points and/or code units of a string,

is language-sensitive (results may differ depending on language), and

is context-sensitive (a character in the input string may map differently depending on surrounding characters).
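To show what "can change the number of code units" means in practice, here is a rough sketch of full uppercase mapping next to what .NET does today. FullToUpper and its three-entry table are hypothetical: the entries are copied by hand from the unconditional mappings in Unicode's SpecialCasing.txt, and nothing like this helper is part of the .NET API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hand-picked entries from SpecialCasing.txt; illustrative and far from complete.
var fullUpper = new Dictionary<char, string>
{
    ['\u00DF'] = "SS",      // ß -> SS
    ['\uFB01'] = "FI",      // ﬁ -> FI
    ['\u0149'] = "\u02BCN", // ŉ -> ʼN
};

// Hypothetical helper: full mapping for the entries above, simple mapping otherwise.
string FullToUpper(string text)
{
    var sb = new StringBuilder(text.Length);
    foreach (char c in text)
    {
        if (fullUpper.TryGetValue(c, out var expansion))
            sb.Append(expansion);
        else
            sb.Append(char.ToUpperInvariant(c));
    }
    return sb.ToString();
}

string word = "stra\u00DFe"; // "straße"
string full = FullToUpper(word);
string simple = word.ToUpperInvariant();

Console.WriteLine($"{word.Length} -> {full.Length}");   // 6 -> 7
Console.WriteLine($"{word.Length} -> {simple.Length}"); // 6 -> 6
```

With a full mapping the result needs one more char than the source, which is exactly the case the length check in MemoryExtensions.ToUpper assumes away.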

TL;DR;

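In theory, case mapping can change the number of code points, and with it the number of UTF-16 code units (this is separate from the number of bytes). But .NET does not currently implement this: its case conversion preserves the length of the string. That could change without notice, but it is unlikely until there is a way to calculate the required output length up front, because the Span-based APIs depend on the destination being sized from the source.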
