    SCI_GETCODEPAGE is NOT always either 0 or 65001

    Notepad++ & Plugin Development
    • Coises

      It has been stated that SCI_GETCODEPAGE will always return either 0 (CP_ACP, for ANSI) or 65001 (CP_UTF8, for Unicode). I have repeated that statement myself. It is not true.

      While investigating an error in my code (one that has not yet been triggered in practice, but lies in wait for someone to apply it to a very large file containing non-ASCII characters), I re-examined some assumptions and found that SCI_GETCODEPAGE isn’t quite as simple as I (and apparently others) thought.

      As far as I can determine:

      • When the file encoding as shown near the bottom right of the status bar is ANSI and the system default code page is a single-byte character set, SCI_GETCODEPAGE will return 0 (CP_ACP). The actual encoding within Scintilla will be the system default code page.

      • When the file encoding is any variant of Unicode, SCI_GETCODEPAGE will return 65001 (CP_UTF8) and the actual encoding within Scintilla will be UTF-8.

      • When the file encoding is ANSI and the system default code page is one of the supported CJK encodings — 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) — SCI_GETCODEPAGE will return the numeric identifier of the system default code page and the actual encoding within Scintilla will be the system default code page.

      • When the file encoding is anything other than ANSI or some variant of Unicode, SCI_GETCODEPAGE will return 65001 and the actual encoding within Scintilla will be UTF-8.¹

      I do not know what happens if the system default code page is a multibyte encoding other than the four explicitly supported ones. The documentation for SCI_SETCODEPAGE says that Scintilla also supports code page 1361 (Korean Johab). It appears that Notepad++ does not support this encoding as an ANSI encoding… but I could be missing something. EUC-KR is listed on the Character sets menu, but that is a different multibyte encoding (51949) which is apparently not supported by Scintilla.

      As a test, I changed my system default code page to Japanese, started a new file, set it to ANSI, and pasted in some Japanese text. SCI_GETCODEPAGE was 932. I saved it that way. When I opened it again, the encoding was set to Shift-JIS (not ANSI) and SCI_GETCODEPAGE was 65001. (Note that the file was saved as Shift-JIS; it’s the internal encoding for editing that changed. Checking how far the position advances from one character to the next also verified that before I saved, Scintilla was using Shift-JIS, but when I opened it again, Scintilla was using UTF-8. Saving again still kept the file as Shift-JIS, as expected.)
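
The position-count check in that test can be illustrated outside Notepad++. This is only a minimal sketch, not code from the post: charLengthAt and countCharacters are hypothetical helpers, only code pages 932 and 65001 are handled (everything else is treated as one byte per character, which would be wrong for 936, 949 and 950), and the lead-byte ranges used for 932 are an assumption about Shift-JIS. Inside a plugin, Scintilla's SCI_POSITIONAFTER message does this stepping for you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the character starting at text[pos],
// given the value returned by SCI_GETCODEPAGE.
// Sketch only: handles 65001 (UTF-8) and 932 (Shift-JIS); all other values
// are treated as one byte per character.
inline std::size_t charLengthAt(const std::string& text, std::size_t pos, int codePage) {
    assert(pos < text.size());
    const unsigned char b = static_cast<unsigned char>(text[pos]);
    std::size_t len = 1;
    if (codePage == 65001) {
        // UTF-8: the lead byte determines the sequence length.
        if (b >= 0xF0)      len = 4;
        else if (b >= 0xE0) len = 3;
        else if (b >= 0xC0) len = 2;
    } else if (codePage == 932) {
        // Assumed Shift-JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
        if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) len = 2;
    }
    // Never step past the end of a truncated buffer.
    return std::min(len, text.size() - pos);
}

// Hypothetical helper: number of characters in text under the given code page.
inline std::size_t countCharacters(const std::string& text, int codePage) {
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < text.size(); pos += charLengthAt(text, pos, codePage)) {
        ++count;
    }
    return count;
}
```

For the two-character string 日本, the Shift-JIS bytes are 93 FA 96 7B and the UTF-8 bytes are E6 97 A5 E6 9C AC. countCharacters gives 2 in both cases, but the byte positions advance by 2 in one buffer and by 3 in the other, which is the difference the test above relied on.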


      Bottom line:

      SCI_GETCODEPAGE can return 0, 932, 936, 949, 950 or 65001.

      • When it returns 0, character strings to and from Scintilla are in a single-byte encoding which is also the system default code page.

      • When it returns 65001, character strings to and from Scintilla are in UTF-8.

      • When it returns 932, 936, 949 or 950, character strings to and from Scintilla use the indicated multi-byte encoding.

      • Since CP_ACP, which is 0, represents the system default code page, and CP_UTF8, which is 65001, represents UTF-8, you can safely use the value returned by SCI_GETCODEPAGE in Windows API calls that take a code page identifier. However, you cannot safely assume that non-zero means UTF-8; nor can you assume that not UTF-8 means one byte = one character.
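
The bottom line above could be encoded as a small classifier that a plugin consults instead of testing "zero or non-zero". This is a sketch under my own naming: SciEncoding, classifySciCodePage and oneByteIsOneCharacter are hypothetical, the accepted values are exactly the six listed above, and any other value is reported as Unknown rather than guessed at.

```cpp
#include <cassert>

// Hypothetical classification of the values SCI_GETCODEPAGE can return
// in Notepad++, following the list in the post.
enum class SciEncoding { SingleByteAnsi, DoubleByteAnsi, Utf8, Unknown };

// Hypothetical helper: map the SCI_GETCODEPAGE result to an encoding kind.
inline SciEncoding classifySciCodePage(int codePage) {
    switch (codePage) {
        case 0:     return SciEncoding::SingleByteAnsi;  // CP_ACP, single-byte system code page
        case 65001: return SciEncoding::Utf8;            // CP_UTF8
        case 932:                                        // Japanese, Shift-JIS
        case 936:                                        // Chinese Simplified
        case 949:                                        // Korean
        case 950:   return SciEncoding::DoubleByteAnsi;  // Chinese Traditional
        default:    return SciEncoding::Unknown;         // not covered by the post
    }
}

// True only when a byte offset can be treated as a character offset.
inline bool oneByteIsOneCharacter(int codePage) {
    return classifySciCodePage(codePage) == SciEncoding::SingleByteAnsi;
}

// Usage sketch: the two assumptions the post warns against both fail for 932.
inline void checkClassification() {
    assert(classifySciCodePage(932) != SciEncoding::Utf8);  // non-zero is not always UTF-8
    assert(!oneByteIsOneCharacter(932));                    // not UTF-8 is not always single-byte
    assert(oneByteIsOneCharacter(0));
}
```

Code that only needs to convert text can skip the classification and pass the returned value straight to a Windows API call that takes a code page identifier, as noted above; the classifier matters when the code does its own byte or position arithmetic.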


      ¹ This is true even when the encoding is the same as the system default code page. For example, on a typical American or Western European system, the system default code page is Windows-1252. If you open a new file and (if necessary) set the encoding to ANSI, the status bar will show ANSI, the encoding within Scintilla will be Windows-1252, and SCI_GETCODEPAGE will return 0. If you select Encoding | Character sets | Western European | Windows-1252, the status bar will show Windows-1252, the encoding within Scintilla will be UTF-8, and SCI_GETCODEPAGE will return 65001.

      • Vitalii Dovgan

        @Coises
        Yes.
        This is why CNppExec::convertSciText uses Scintilla’s actual encoding, nSciCodePage, to convert Scintilla’s text to the desired encoding:
        https://github.com/d0vgan/nppexec/blob/develop/NppExec/src/NppExec.cpp#L2516

        The Community of users of the Notepad++ text editor.