SCI_GETCODEPAGE is NOT always either 0 or 65001

Coises

It has been stated that SCI_GETCODEPAGE will always return either 0 (CP_ACP, for ANSI) or 65001 (CP_UTF8, for Unicode). I have repeated that statement myself. It is not true.

While investigating a bug in my own code (one that has not yet been triggered in practice, but lies in wait for someone to apply it to a very large file containing non-ASCII characters), I re-examined some assumptions and found that SCI_GETCODEPAGE isn’t quite as simple as I (and apparently others) thought.

As far as I can determine:

When the file encoding as shown near the bottom right of the status bar is ANSI and the system default code page is a single-byte character set, SCI_GETCODEPAGE will return 0 (CP_ACP). The actual encoding within Scintilla will be the system default code page.
When the file encoding is any variant of Unicode, SCI_GETCODEPAGE will return 65001 (CP_UTF8) and the actual encoding within Scintilla will be UTF-8.
When the file encoding is ANSI and the system default code page is one of the supported CJK encodings — 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) — SCI_GETCODEPAGE will return the numeric identifier of the system default code page and the actual encoding within Scintilla will be the system default code page.
When the file encoding is anything other than ANSI or some variant of Unicode, SCI_GETCODEPAGE will return 65001 (CP_UTF8) and the actual encoding within Scintilla will be UTF-8.¹
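
The four cases above can be written down as a small decision function. This is only a sketch of the observed behavior, not anything taken from Notepad++ itself: the function name, the idea of passing the status bar text as a string, and the `system_code_page` parameter are all made up for illustration.

```python
# Code pages the post lists as CJK system defaults that show up in SCI_GETCODEPAGE.
CJK_ANSI_CODE_PAGES = {932, 936, 949, 950}
CP_ACP = 0
CP_UTF8 = 65001


def expected_sci_codepage(status_bar_encoding, system_code_page):
    """Predict SCI_GETCODEPAGE from the four observed cases (hypothetical helper)."""
    if status_bar_encoding != "ANSI":
        # Any Unicode variant, and any named character set, is edited as UTF-8.
        return CP_UTF8
    if system_code_page in CJK_ANSI_CODE_PAGES:
        # ANSI on a CJK system: the real code page number is reported.
        return system_code_page
    # ANSI on a single-byte system: 0 (CP_ACP) is reported.
    # Other multibyte system code pages were not tested in the post;
    # this sketch simply assumes they do not occur.
    return CP_ACP
```

For example, `expected_sci_codepage("ANSI", 1252)` gives 0, `expected_sci_codepage("ANSI", 932)` gives 932, and `expected_sci_codepage("Windows-1252", 1252)` gives 65001, which matches the footnote.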

I do not know what happens if the system default code page is a multibyte encoding other than the four explicitly supported ones. The documentation for SCI_SETCODEPAGE says that Scintilla also supports code page 1361 (Korean Johab). It appears that Notepad++ does not support this encoding as an ANSI encoding… but I could be missing something. EUC-KR is listed on the Character sets menu, but that is a different multibyte encoding (51949) which is apparently not supported by Scintilla.

When I did a test by changing my system default character set to Japanese, I started a new file, set it to ANSI, and pasted in some Japanese text. SCI_GETCODEPAGE was 932. I saved it that way. When I opened it again, the encoding was set to Shift-JIS — not ANSI — and SCI_GETCODEPAGE was 65001. (Note that the file was saved as Shift-JIS; it’s the internal encoding for editing that changed. Checking the position counts moving from one character to the next also verified that before I saved, Scintilla was using Shift-JIS, but when I opened it again, Scintilla was using UTF-8. Saving again still kept the file as Shift-JIS, as expected.)
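
To illustrate why the position counts differ, here is a small sketch that uses Python's built-in `cp932` and `utf-8` codecs as stand-ins for Scintilla's internal buffer. The helper name `bytes_per_character` is hypothetical; in a real plugin the positions would come from Scintilla, not from re-encoding the text.

```python
def bytes_per_character(text, codec):
    """Return how many bytes each character of text occupies in the given codec."""
    return [len(ch.encode(codec)) for ch in text]


sample = "日本語A"

# Under Shift-JIS (code page 932) each kanji takes two bytes.
sjis_widths = bytes_per_character(sample, "cp932")

# Under UTF-8 the same kanji take three bytes each.
utf8_widths = bytes_per_character(sample, "utf-8")

# Moving one character forward advances the position by a different
# number of bytes depending on which encoding the buffer actually holds.
sjis_total = sum(sjis_widths)
utf8_total = sum(utf8_widths)
```

Here `sjis_widths` is `[2, 2, 2, 1]` and `utf8_widths` is `[3, 3, 3, 1]`, so the same four characters span 7 byte positions in one case and 10 in the other.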

Bottom line:

SCI_GETCODEPAGE can return 0, 932, 936, 949, 950 or 65001.

When it returns 0, character strings to and from Scintilla are in a single-byte encoding which is also the system default code page.
When it returns 65001, character strings to and from Scintilla are in UTF-8.
When it returns 932, 936, 949 or 950, character strings to and from Scintilla use the indicated multi-byte encoding.
Since CP_ACP, which is 0, represents the system default code page, and CP_UTF8, which is 65001, represents UTF-8, you can safely use the value returned by SCI_GETCODEPAGE in Windows API calls that take a code page identifier. However, you cannot safely assume that non-zero means UTF-8; nor can you assume that not UTF-8 means one byte = one character.
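
A minimal sketch of handling all six values rather than just two follows. The function names are hypothetical, and the `system_code_page` parameter (defaulting to 1252) stands in for whatever the system default code page really is, since a return value of 0 does not say which single-byte code page is in use.

```python
CP_ACP = 0
CP_UTF8 = 65001
DBCS_CODE_PAGES = (932, 936, 949, 950)


def classify_sci_codepage(sci_codepage):
    """Sort a SCI_GETCODEPAGE value into the three cases from the post."""
    if sci_codepage == CP_UTF8:
        return "utf-8"
    if sci_codepage == CP_ACP:
        return "single-byte"
    if sci_codepage in DBCS_CODE_PAGES:
        return "multi-byte"
    raise ValueError("unexpected code page: %r" % (sci_codepage,))


def codec_for(sci_codepage, system_code_page=1252):
    """Map a SCI_GETCODEPAGE value to a Python codec name (illustrative only)."""
    kind = classify_sci_codepage(sci_codepage)
    if kind == "utf-8":
        return "utf-8"
    if kind == "single-byte":
        # 0 means "system default", so the caller has to supply the real number.
        return "cp%d" % system_code_page
    return "cp%d" % sci_codepage


def decode_from_scintilla(raw, sci_codepage, system_code_page=1252):
    """Decode bytes taken from Scintilla according to the reported code page."""
    return raw.decode(codec_for(sci_codepage, system_code_page))
```

With this, bytes produced under code page 932 decode correctly when 932 is passed, whereas treating any non-zero value as UTF-8 would fail on the same bytes.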

¹ This is true even when the encoding is the same as the system default code page. For example, on a typical American or Western European system, the system default code page is Windows-1252. If you open a new file and (if necessary) set the encoding to ANSI, the status bar will show ANSI, the encoding within Scintilla will be Windows-1252, and SCI_GETCODEPAGE will return 0. If you select Encoding | Character sets | Western European | Windows-1252, the status bar will show Windows-1252, the encoding within Scintilla will be UTF-8, and SCI_GETCODEPAGE will return 65001.

Vitalii Dovgan

@Coises
Yes.
This is why CNppExec::convertSciText uses Scintilla’s actual encoding, nSciCodePage, to convert Scintilla’s text to the desired encoding:
https://github.com/d0vgan/nppexec/blob/develop/NppExec/src/NppExec.cpp#L2516

Coises

I wrote in SCI_GETCODEPAGE is NOT always either 0 or 65001:

When I did a test by changing my system default character set to Japanese, I started a new file, set it to ANSI, and pasted in some Japanese text. SCI_GETCODEPAGE was 932. I saved it that way. When I opened it again, the encoding was set to Shift-JIS — not ANSI — and SCI_GETCODEPAGE was 65001.

For future reference:

This only happens if Settings | Preferences | MISC | Autodetect character encoding is checked. When it is not checked, the file opens, as expected, as ANSI (SCI_GETCODEPAGE returns 932).