Read Unicode UTF-8 file into wstring

With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary. To use a facet, you usually create a locale object that encapsulates culture-specific information as a set … Read more
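A minimal sketch of the approach the excerpt describes, assuming a file small enough to read in one go. The helper name `read_utf8_file` is hypothetical and not from the original post. Note that std::codecvt_utf8 is deprecated since C++17, so compilers may warn about it.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a whole UTF-8 encoded file into a std::wstring.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the conversion facet is in place for the first read.
    // The locale takes ownership of the facet and deletes it.
    in.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);

    std::wstringstream buffer;
    buffer << in.rdbuf();  // the stream buffer decodes UTF-8 to wchar_t as it reads
    return buffer.str();
}
```

Where wchar_t is 16 bits wide (as on Windows), this facet only covers UCS-2, so characters outside the Basic Multilingual Plane would likely need std::codecvt_utf8_utf16 instead.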

Case insensitive std::string.find()

You could use std::search with a custom predicate.

    #include <locale>
    #include <iostream>
    #include <algorithm>
    using namespace std;

    // templated version of my_equal so it could work with both char and wchar_t
    template<typename charT>
    struct my_equal {
        my_equal( const std::locale& loc ) : loc_(loc) {}
        bool operator()(charT ch1, charT ch2) {
            return std::toupper(ch1, loc_) == std::toupper(ch2, … Read more
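Since the excerpt is cut off, here is a self-contained sketch of the same idea: std::search with a predicate that compares characters after std::toupper. The name `ci_find` and its interface are assumptions made for illustration, not the original author's code, and a lambda stands in for the `my_equal` functor.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: case-insensitive counterpart of basic_string::find.
// Returns the index of the first match, or npos if there is none.
template <typename charT>
std::size_t ci_find(const std::basic_string<charT>& haystack,
                    const std::basic_string<charT>& needle,
                    const std::locale& loc = std::locale()) {
    if (needle.empty()) return 0;  // mirror find(): an empty needle matches at 0

    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(),
                          [&loc](charT a, charT b) {
                              // compare after upper-casing both sides in the given locale
                              return std::toupper(a, loc) == std::toupper(b, loc);
                          });
    if (it == haystack.end()) return std::basic_string<charT>::npos;
    return static_cast<std::size_t>(it - haystack.begin());
}
```

Called as `ci_find(std::string("Hello World"), std::string("WORLD"))`, this returns 6. The comparison is per code unit, so it does not handle case mappings that change length or multi-byte UTF-8 sequences.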

What’s “wrong” with C++ wchar_t and wstrings? What are some alternatives to wide characters?

What is wchar_t?

wchar_t is defined such that any locale’s char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint: "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1)." -- C++ … Read more

How to convert wstring into string?

As Cubbi pointed out in one of the comments, std::wstring_convert (C++11) provides a neat, simple solution (you need to #include <locale> and <codecvt>):

    std::wstring string_to_convert;

    // set up converter
    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;

    // use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
    std::string converted_str = converter.to_bytes( string_to_convert );

I was using a combination of wcstombs and tedious … Read more
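The converter can be wrapped for both directions, as in the sketch below. The function names `to_utf8` and `from_utf8` are hypothetical. Both std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so treat this as an illustration of the C++11 API rather than a recommendation for new code.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: wide string -> UTF-8 encoded narrow string.
std::string to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.to_bytes(wide);
}

// Hypothetical helper: UTF-8 encoded narrow string -> wide string.
// from_bytes throws std::range_error on malformed input when the
// converter is constructed without an error string, as here.
std::wstring from_utf8(const std::string& narrow) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.from_bytes(narrow);
}
```

A round trip such as `from_utf8(to_utf8(L"h\u00E9"))` should give back the original wide string.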

std::wstring VS std::string

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to hold a wide character, and then, things get tricky: On Linux, a wchar_t is 4 bytes, while on Windows, it’s 2 bytes. … Read more
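The size difference shows up as soon as a character falls outside the Basic Multilingual Plane. The sketch below, with hypothetical function names, counts the code units needed for U+1F600 in each string type. It assumes the compiler encodes wide literals as UTF-32 when wchar_t is 4 bytes and as UTF-16 when it is 2 bytes, which is the usual behavior on Linux and Windows respectively.

```cpp
#include <cstddef>
#include <string>

// Number of wchar_t code units used to store U+1F600.
// Typically 1 where wchar_t is 4 bytes, 2 (a surrogate pair) where it is 2 bytes.
std::size_t wide_units_for_emoji() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}

// Number of char code units used to store U+1F600 in UTF-8.
// The bytes are written out explicitly so the result does not depend
// on the compiler's execution character set.
std::size_t utf8_units_for_emoji() {
    const std::string s = "\xF0\x9F\x98\x80";
    return s.size();
}
```

In both string types, size() reports code units, not user-visible characters.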