
    Code page weird function; CP/Windows-1250

    General Discussion
    conversion, encoding
    5 Posts 2 Posters 4.5k Views
    • Uzivatel919
      last edited by

      I’ve run into code-page behavior that I don’t understand. Try this:

      1. Create a new file.
      2. Type some characters.
      3. Save the file under any name, with any extension.
      4. Change the code page to Windows-1250 (Central European).
      5. Delete the text and paste ◘○•🔶 ěščřžýáíéúůťďňÓ.
      6. Save, close and re-open the file.
      7. You’ll get •0•?? ěščřžýáíéúůťďňÓ.
      8. Specify the code page again.
      9. Still •0•?? ěščřžýáíéúůťďňÓ.

      Why are the glyphs displayed correctly before saving, but not after re-opening?

      • guy038
        last edited by

        Hello, @uzivatel919 and All

        Before trying to explain this N++ behavior, we need some additional information:

        • At step 1, when you create your new file ( File > New or the Ctrl + N shortcut ), what is the current encoding of this new file, before doing anything else? I guess it should be Windows-1250. Am I right about that?

        • At step 2, before saving the file, did you type pure ASCII characters only ( with codes in the range [\x{0020}-\x{007f}] ), or did you also add some accented characters ( for instance č, ř or ť )?

        • At step 8, just to confirm: you meant “Specify the Windows-1250 code page, again”, didn’t you?

        Best Regards

        guy038

        • Uzivatel919
          last edited by

          1. I use Ctrl + N as well as File > New. For me the encoding is set to UTF-8 by default.
          2. I used arbitrary characters; it just isn’t possible to save an empty file.
          3. Yes, exactly: I chose the Windows-1250 code page again.
          • guy038
            last edited by guy038

            Hi, @uzivatel919 and All

            First, note that the tests below behave the same whether or not the option Autodetect character encoding is checked, in the dialog Settings > Preferences... > MISC.

            Your method can be simplified to this first scenario, below:

            1. Create a new file ( Ctrl + N )

            Note that the present encoding is UTF-8

            2. Select the option Encoding > Character Sets > Central European > Windows-1250

            3. Paste the text ◘○•🔶 ěščřžýáíéúůťďňÓ.

            => This text is encoded with the Windows-1250 encoding, using 1 byte to describe each character. Note that the graphical characters which do not belong to the Windows-1250 encoding are, of course, replaced with a question mark ?!

            https://en.wikipedia.org/wiki/Windows-1250

            4. Save the file with, for instance, the name Test.txt

            5. Close Test.txt ( Ctrl + W )

            6. Re-open Test.txt ( Ctrl + Shift + T )

            => The letters of the text are correct but, as expected, some graphical characters are replaced with a ?. Moreover, Notepad++ detects the ANSI encoding, which is, indeed, equivalent to the Windows-1250 encoding used by your system for all non-Unicode files

            7. Select, again, the option Encoding > Character Sets > Central European > Windows-1250

            => As this operation just re-interprets all the 1-byte encoded characters, nothing changes, because here the Windows-1250 encoding ≡ the ANSI encoding. Note that, as I’m French, on my system the equivalence is Windows-1252 encoding ≡ ANSI encoding
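            To make the byte-level effect of this first scenario concrete, here is a minimal Python sketch ( an illustration only, not Notepad++ code; the exact '?' substitution is an assumption about the editor’s behavior ):

            ```python
            # Illustration only: simulate saving the pasted text as Windows-1250 (cp1250).
            text = "◘○•🔶 ěščřžýáíéúůťďňÓ"

            # Characters with no Windows-1250 code point become a literal '?', one byte per character.
            data = text.encode("cp1250", errors="replace")
            print(data)                   # b'??\x95? \xec\x9a\xe8\xf8...' -- note the real b'?' bytes

            # "Re-selecting" Windows-1250 just decodes the very same bytes again, so nothing changes:
            print(data.decode("cp1250"))  # '??•? ěščřžýáíéúůťďňÓ' -- the lost glyphs stay lost
            ```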


            Now, let’s imagine the second scenario, below:

            1. Create a new file ( Ctrl + N )

            Note that the present encoding is UTF-8

            2. Paste the text ◘○•🔶 ěščřžýáíéúůťďňÓ.

            => This time, as we haven’t changed the current encoding yet, the text is encoded with the UTF-8 encoding, using between 1 and 4 bytes to describe each character

            3. Select the option Encoding > Character Sets > Central European > Windows-1250

            4. Click on the Yes button of the small dialog Save Current Modification

            5. Choose, again, the name Test.txt and save the file

            => So, the encoding is changed to Windows-1250. But this encoding operation does NOT change the present contents of the file. Notepad++ just re-interprets all bytes of the file as if they were a sequence of 1-byte encoded characters of the Windows-1250 encoding => So, it’s no surprise that the text looks rather incomprehensible!

            Thus, internally, the Test.txt file is still a sequence of characters, each described according to the UTF-8 encoding

            6. Close Test.txt ( Ctrl + W )

            7. Re-open Test.txt ( Ctrl + Shift + T )

            => The text and most of the graphical characters are correct, according to your current font, and the UTF-8 encoding is automatically chosen ;-))
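            Again purely as an illustration ( not how Notepad++ works internally ), the second scenario boils down to this Python sketch: the bytes on disk are UTF-8, re-reading them as Windows-1250 only changes how they are displayed, and re-reading them as UTF-8 restores the original text:

            ```python
            # Illustration only: the buffer was still UTF-8 encoded when Test.txt was saved.
            text = "◘○•🔶 ěščřžýáíéúůťďňÓ"
            utf8_bytes = text.encode("utf-8")                      # 1 to 4 bytes per character

            # Selecting the Windows-1250 character set merely re-interprets the same bytes:
            print(utf8_bytes.decode("cp1250", errors="replace"))   # mojibake, but the bytes are untouched

            # Re-opening the file as UTF-8 recovers everything:
            print(utf8_bytes.decode("utf-8"))                      # ◘○•🔶 ěščřžýáíéúůťďňÓ
            ```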

            Remarks:

            • As this text is UTF-8 encoded, you may “test” any other character set, using the Encoding > Character Sets > ... menu options

            • You’ll notice that, during this test phase, the file contents are NOT modified at all and the icon of the file remains blue!

            • At the end, after that test phase, just select the option Encoding > Encode in UTF-8 to get the original text back ;-))


            Remember:

            • During an encoding operation, the present contents of the current file are just re-interpreted as if they were encoded with the new encoding; the contents themselves are never modified

            • During a conversion operation, the present contents of the current file are modified, so that the new file contents represent the same characters in the new encoding

            In other words:

            • The “Encode in ...” and “Character Sets > ...” options just re-read the present file contents according to the newly chosen encoding, generally giving a different representation of the same bytes

            • The “Convert to ...” options do modify the present file contents, so that they read identically under the newly chosen encoding.
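            A short Python sketch may help to visualize the difference ( again just a byte-level analogy, not the editor’s actual code ): re-interpretation decodes the existing bytes with another code page and leaves them untouched, whereas conversion re-encodes the characters and therefore produces new bytes:

            ```python
            # Illustration of "Encode in ..." (re-interpretation) vs "Convert to ..." (conversion).
            original = "ěščř".encode("cp1250")           # bytes as stored on disk in Windows-1250

            # Re-interpretation: same bytes, merely read with another code page (here cp1252).
            reinterpreted = original.decode("cp1252")    # misread, but nothing was rewritten

            # Conversion: decode with the old code page, re-encode with the new one -> new bytes.
            converted = original.decode("cp1250").encode("utf-8")

            print(original)       # b'\xec\x9a\xe8\xf8'
            print(reinterpreted)  # 'ìšèø'
            print(converted)      # b'\xc4\x9b\xc5\xa1\xc4\x8d\xc5\x99'  (UTF-8 bytes for 'ěščř')
            ```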


            To end with:

            • As you see, the encoding and conversion concepts are not easy to assimilate. So, I advise everyone to always use the UTF-8 encoding or, better, the UTF-8-BOM encoding, which is able to encode absolutely all the Unicode characters!

            • Of course, to fully exploit UTF-8 files, your system must contain some fonts which cover most of the Unicode characters and/or symbols!

            • For the record, as of today, 92.6% of Web pages are encoded in UTF-8 ;-)) Refer to the link below:

            https://w3techs.com/technologies/history_overview/character_encoding/ms/y

            Best Regards,

            guy038

            • Uzivatel919
              last edited by Uzivatel919

              Yes, yes, I know about these things. I was just surprised by the result, since I was used to Notepad++’s code-page handling working perfectly. At the time I simply didn’t realize what CP-1250 actually includes.

              Btw, thanks.
