codepages – Tarik Billa

What’s the difference between an “encoding,” a “character set,” and a “code page”?

July 24, 2023 by Tarik

A ‘character set’ is just what it says: a properly-specified list of distinct characters. An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters. UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*). The confusion … Read more
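A minimal Python sketch of the distinction, assuming a two-character sample string chosen only for illustration: the code points belong to the character set (Unicode), while the bytes depend on which encoding is applied.

```python
# Character-set view: every character has a Unicode code point (a number).
text = "\u00e9\u20ac"  # 'é' followed by '€' (sample text, chosen for illustration)
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Encoding view: the same code points turn into different byte sequences
# depending on the encoding that is applied.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(code_points)   # ['U+00E9', 'U+20AC']
print(utf8_bytes)    # b'\xc3\xa9\xe2\x82\xac'
print(utf16_bytes)   # b'\xe9\x00\xac '
```

The code points are identical in both cases; only the byte representation changes with the encoding.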

How do I correct the character encoding of a file?

March 27, 2023 by Tarik

Follow these steps with Notepad++:
1- Copy the original text.
2- In Notepad++, open a new file and, under Encoding, pick the encoding you think the original text uses. Also try “ANSI”, since some programs read Unicode files as ANSI.
3- Paste.
4- Then convert to Unicode by going … Read more
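The same idea can be sketched outside Notepad++: interpret the raw bytes under the encoding you guess the text was written in, then save the result as Unicode. The helper name `convert_to_utf8` and the sample bytes are hypothetical, and the sketch assumes the guess is correct; a wrong guess may raise an error or produce garbled text.

```python
def convert_to_utf8(raw: bytes, guessed_encoding: str) -> bytes:
    """Decode bytes under a guessed encoding and re-encode them as UTF-8."""
    text = raw.decode(guessed_encoding)  # raises UnicodeDecodeError on an impossible guess
    return text.encode("utf-8")

# Sample bytes: "café" as stored by a program that used Windows-1252 ("ANSI").
raw = b"caf\xe9"
converted = convert_to_utf8(raw, "cp1252")
print(converted)  # b'caf\xc3\xa9'
```

If the output still looks wrong, try another candidate encoding, just as in step 2.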

Are code page 65001 and UTF-8 the same thing?

March 20, 2023 by Tarik

Yes. On Windows, code page 65001 (CP65001) is UTF-8; the number is simply how UTF-8 is named within the legacy code page system. As far as I have read, ASP can handle UTF-8 when it is specified that way.
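One way to see the equivalence from outside Windows is Python's codec registry. This is a sketch that assumes Python 3.8 or later, where `cp65001` is registered as an alias of UTF-8; the sample string is arbitrary.

```python
import codecs

# Look up the codec registered under the Windows code page name.
info = codecs.lookup("cp65001")
print(info.name)

# Encoding through either name should yield identical bytes.
sample = "na\u00efve"  # arbitrary sample text containing a non-ASCII character
same_bytes = sample.encode("cp65001") == sample.encode("utf-8")
print(same_bytes)
```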

“Text was truncated or one or more characters had no match in the target code page” when importing from an Excel file

March 7, 2023 by Tarik

I assume you’re trying to import this using an Excel Source in the SSIS dialog? If so, the problem is probably that SSIS samples some number of rows at the beginning of your spreadsheet when it creates the Excel source. If on the [ShortDescription] column it doesn’t notice anything too large, it will default to … Read more

How do you properly use WideCharToMultiByte?

February 9, 2023 by Tarik

Here are a couple of functions (based on Brian Bondy’s example) that use WideCharToMultiByte and MultiByteToWideChar to convert between std::wstring and std::string, using UTF-8 so that no data is lost.

// Convert a wide Unicode string to a UTF-8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), … Read more

What is ANSI format?

October 5, 2022 by Tarik

ANSI encoding is a slightly generic term used to refer to the standard code page on a system, usually Windows. It is more properly referred to as Windows-1252 on Western/U.S. systems. (It can represent certain other Windows code pages on other systems.) This is essentially an extension of the ASCII character set in that it … Read more
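A short sketch of what “extension of ASCII” means in practice, assuming the Western code page Windows-1252 (Python's `cp1252` codec); other Windows code pages assign the upper byte range differently.

```python
# Plain ASCII text has the same bytes in ASCII and in Windows-1252.
ascii_text = "Hello"
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("ascii")

# Characters beyond ASCII use single bytes in the range 0x80-0xFF.
euro_byte = "\u20ac".encode("cp1252")      # the euro sign
e_acute_byte = "\u00e9".encode("cp1252")   # 'é'

print(same_for_ascii)  # True
print(euro_byte)       # b'\x80'
print(e_acute_byte)    # b'\xe9'
```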