Windows/CharacterEncoding

Character encoding on Windows

It is important that we can emit Unicode output.

The Windows API is complicated in the way it represents strings: it provides parallel functions working on different string representations. Such functions are usually identified by a suffix in their name: A for ANSI (8-bit characters in the active code page), W for wide characters.

We work under the (more or less valid) assumption that Windows supports UTF-16, and not just UCS-2.
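
The practical difference between the two is that UTF-16 encodes code points beyond U+FFFF as a pair of 16-bit units (a surrogate pair), which UCS-2 cannot express. The following is a minimal sketch of what that means for string lengths; count_code_points is a hypothetical helper written for illustration, not part of the Engine.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string.
// A high surrogate (0xD800..0xDBFF) followed by a low surrogate
// (0xDC00..0xDFFF) forms one code point; anything else, including an
// unpaired surrogate, is counted as one code point per code unit.
std::size_t count_code_points(const std::u16string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool next_is_low = i + 1 < s.size()
                                 && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_high && next_is_low)
            ++i; // skip the low half of the pair
        ++n;
    }
    return n;
}
```

A string holding "p" followed by U+1F600 occupies three 16-bit units but only two code points, so code that treats one unit as one character is really assuming UCS-2.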

The string types are:

  • LPWSTR of type wchar_t *;
  • LPSTR of type char *;
  • LPTSTR which is the same as either LPSTR or LPWSTR, depending on compile-time configuration (whether UNICODE is defined).

String literals may be:

  • "..." of type const char *;
  • L"..." of type const wchar_t *;
  • TEXT("...") (or _T("...") from tchar.h) which can be of either type, depending on compile-time configuration.
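
The configuration-dependent types and literals are plain preprocessor switches. The sketch below reproduces the mechanism with hypothetical stand-ins (demo_tchar and DEMO_TEXT are invented names, used so that the example compiles without the Windows headers); the real definitions live in the Windows SDK and key off the UNICODE macro.

```cpp
// Hypothetical stand-ins for the configuration-dependent character type
// and literal macro, shown only to illustrate the mechanism.
#ifdef UNICODE
typedef wchar_t demo_tchar;
#define DEMO_TEXT(s) L##s   // pastes the L prefix onto the literal
#else
typedef char demo_tchar;
#define DEMO_TEXT(s) s      // leaves the narrow literal unchanged
#endif

// The element type of this literal follows the compile-time configuration.
const demo_tchar *demo_greeting = DEMO_TEXT("hello");

// These two are always narrow and always wide, whatever the configuration.
const char *narrow_greeting = "hello";
const wchar_t *wide_greeting = L"hello";
```

Because the choice is made at compile time, code that must behave the same in every build is usually clearer when it names the narrow or wide type explicitly.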

Internal memory

The p≡p Engine works internally using UTF-8 strings of type char *, on every platform including Windows. The exception is platform_windows.cpp, which takes care of translating as needed.
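
To give an idea of what such a translation involves, here is a portable sketch of UTF-8 to UTF-16 conversion. It is an illustration only and not the code in platform_windows.cpp; utf8_to_utf16 is a hypothetical name, and validation is deliberately minimal. On Windows itself this job is normally delegated to the MultiByteToWideChar function with the CP_UTF8 code page.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 and re-encode it as UTF-16 code units.
// Rejects invalid lead bytes, truncated sequences and bad continuation
// bytes; does NOT reject overlong forms or encoded surrogates.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: a single 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the five UTF-8 bytes of "p≡p" (0x70, 0xE2 0x89 0xA1, 0x70) become three UTF-16 units, the middle one being 0x2261.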

Console

The p≡p Engine always prints to the console in UTF-8.

It is the user’s responsibility to configure the console, which by default uses some old 8-bit MS-DOS code page: 850 for Western Europe, or some other code page elsewhere. The default console configuration is inadequate for Unicode and unusable for us.
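
Since the configuration is left to the user, a program can at most check it. The sketch below is a hypothetical diagnostic, not something the Engine is known to do: it reads the active console output code page with GetConsoleOutputCP and warns when it is not 65001, the Windows identifier for UTF-8. The function names are invented for illustration, and on other platforms the sketch simply assumes a UTF-8 terminal.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// 65001 is the Windows code page identifier for UTF-8 (CP_UTF8).
const unsigned int utf8_code_page = 65001;

// Return the active console output code page.
// Assumption: outside Windows, pretend the terminal is UTF-8.
unsigned int console_output_code_page()
{
#ifdef _WIN32
    return GetConsoleOutputCP();
#else
    return utf8_code_page;
#endif
}

// Warn, rather than reconfigure, when the console is not set to UTF-8.
// Returns true when the console is already using UTF-8.
bool warn_if_console_not_utf8()
{
    const unsigned int cp = console_output_code_page();
    if (cp == utf8_code_page)
        return true;
    std::fprintf(stderr,
                 "console code page is %u, expected %u; try: chcp 65001\n",
                 cp, utf8_code_page);
    return false;
}
```

The user-side fix suggested in the message, chcp 65001, switches the current console session to UTF-8.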

Other output

In every other context on this platform, including system logs, the output is in wide characters. This is achieved using the W-suffixed functions.

Where to find documentation

Volker says that MSDN is the only reliable source of information about the Windows API. Many other people trying to document their experience work under the assumptions of some specific setup, and are not to be trusted.

On MSDN, ignore everything about the .NET platform: we are interested in the “unmanaged” alternative, and may even use deprecated functions.