BUG: N++ does not keep in UTF8 unsaved open files

dz15mlru

BUG,
I’m using N++ with a lot of unsaved open files, and in the settings I have the option for all new documents to be in UTF-8, but in recent times I’ve discovered that a number of my unsaved open documents with Cyrillic content are not kept in UTF-8 by N++ and are converted to “Cyrillic -> Macintosh”, and some of the Cyrillic text is malformed, unintelligible and lost. After converting back to UTF-8 the text remains malformed.
https://i.imgur.com/jt05fe5.png

Coises

@dz15mlru said in BUG: N++ does not keep in UTF8 unsaved open files:

I’m using N++ with a lot of unsaved open files, and in the settings I have the option for all new documents to be in UTF-8, but in recent times I’ve discovered that a number of my unsaved open documents with Cyrillic content are not kept in UTF-8 by N++ and are converted to “Cyrillic -> Macintosh”, and some of the Cyrillic text is malformed, unintelligible and lost.

First, would you open the ? menu, select Debug Info… and paste the information here? That helps make sure we know some details that can be important when analyzing bugs.

Second, at Settings | Preferences… | MISC. is the box labeled Autodetect character encoding checked? If it is, try unchecking it. That option sometimes does more harm than good.

in recent times

If there is any way you can remember or otherwise figure out what change(s) might have happened around the time this changed, it will help with working out what is happening.

After converting back to UTF-8 the text remains malformed.

Once the text in the edit window is garbled, using one of the Encoding | Convert to options will never help. Those just convert what you’re already seeing to a different encoding.

I never use persistent unsaved files, so hopefully someone else will come along with experience about how to manage an unsaved file carried over from a previous session (I assume that’s the condition you’re describing) that opens in the wrong encoding. If it were a saved file that you were opening anew, the right thing to do would be to select the correct encoding from the top of the Encoding menu (not the Convert to options at the bottom) before making any changes. I don’t know if that works with persistent unsaved files, though.
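
To illustrate the difference between reinterpreting and converting, here is a minimal Python sketch. It assumes Python's built-in `mac_cyrillic` codec as a stand-in for “Cyrillic -> Macintosh”; the sample string is made up, and this does not reproduce what Notepad++ does internally.

```python
# Hypothetical sample text; any Cyrillic string behaves the same way.
original = "Привет, мир"
raw = original.encode("utf-8")          # the bytes as stored on disk

# Wrong guess: the UTF-8 bytes are read as Mac Cyrillic, giving mojibake.
garbled = raw.decode("mac_cyrillic")

# A "Convert to UTF-8" style step re-encodes the characters already shown,
# so the result is still the garbled text, now stored as UTF-8.
converted = garbled.encode("utf-8").decode("utf-8")

# Reinterpreting the untouched bytes as UTF-8 restores the text.
reinterpreted = raw.decode("utf-8")

print(garbled == original)        # False
print(converted == garbled)       # True
print(reinterpreted == original)  # True
```

The last step only works while the original bytes are still intact, which is why the encoding has to be corrected before any edit or save.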

PeterJones

@dz15mlru said in BUG: N++ does not keep in UTF8 unsaved open files:

BUG,
I’m using N++ with a lot of unsaved open files, and in the settings I have the option for all new documents to be in UTF-8, but in recent times I’ve discovered that a number of my unsaved open documents with Cyrillic content are not kept in UTF-8 by N++ and are converted to “Cyrillic -> Macintosh”

If you have a new UTF-8 file, the session file stores its encoding as “-1”, which I believe means it will use auto-detection the next time around. And the auto-detection that Notepad++ uses is imperfect (as any encoding autodetection will be; this is explained in the new “Encoding” section in the User Manual, but I just discovered that the manual stopped publishing updates a few days ago, so until it publishes, you can read the encoding description in the repo instead).

As @coises suggested while I was writing this up, try turning off the auto-detection, and it should prevent that in the future.

And the “Convert to…” won’t work to fix what you are seeing on your existing files… but maybe Encoding > UTF-8 will cause it to re-interpret the bytes correctly (assuming the bytes haven’t been re-written at this point to something else).

PeterJones

@Coises said in BUG: N++ does not keep in UTF8 unsaved open files:

If there is any way you can remember or otherwise figure out what change(s) might have happened around the time this changed, it will help with working out what is happening

Assuming autodetection is on (and that’s the best assumption, given the data), it depends on what other characters are also in the file, so if you get a combination of bytes that looks to the algorithm like “Cyrillic -> Macintosh” instead of “UTF-8”, then it will pick that. So “in recent times” may mean that additional text was added to those files, which makes the algorithm think they look the way “Cyrillic -> Macintosh” should look.
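
One reason detection has to guess: in single-byte Cyrillic code pages nearly every byte sequence is valid, so the bytes alone cannot prove which encoding was intended. A small Python sketch, using Python's own codec names (`cp1251`, `mac_cyrillic`, `koi8_r`) rather than anything taken from Notepad++:

```python
text = "Привет"                     # hypothetical sample
raw = text.encode("cp1251")         # bytes as a Windows-1251 file would hold them

# Every candidate decodes without error; only one yields the real text.
candidates = ["cp1251", "mac_cyrillic", "koi8_r"]
readings = {name: raw.decode(name) for name in candidates}

for name, reading in readings.items():
    print(f"{name:13} {reading!r}  matches original: {reading == text}")
```

A detector therefore has to rank the candidates by how plausible the resulting text looks, and that ranking can change as the content changes.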

Coises

@PeterJones said in BUG: N++ does not keep in UTF8 unsaved open files:

Assuming autodetection is on (and that’s the best assumption, given the data), it depends on what other characters are also in the file, so if you get a combination of bytes that looks to the algorithm like “Cyrillic -> Macintosh” instead of “UTF-8”, then it will pick that. So “in recent times” may mean that additional text was added to those files, which makes the algorithm think they look the way “Cyrillic -> Macintosh” should look.

The thing is… it is very unusual for a UTF-8 file of any size that contains non-ASCII characters to “look like” anything but UTF-8. (Unless Cyrillic/Macintosh is some strange exception.) I suspect something is going on here that we haven’t heard about yet.
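
The reason is that UTF-8 multi-byte sequences follow a strict pattern that text in legacy code pages almost never satisfies by accident. A minimal Python sketch of that check (the helper name `is_valid_utf8` is made up for illustration; this is not Notepad++'s detector):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")   # strict mode rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False

sample = "Привет, мир"   # hypothetical sample text
print(is_valid_utf8(sample.encode("utf-8")))         # True
print(is_valid_utf8(sample.encode("cp1251")))        # False
print(is_valid_utf8(sample.encode("mac_cyrillic")))  # False
```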

One possibility might be if new files are set to open as UTF-8 but “Apply to opened ANSI files” is not checked, then the user exits when a file has only ASCII characters; on re-opening, perhaps (as I said, I don’t use Remember current session) Notepad++ opens it as ANSI, the user doesn’t notice and adds non-ASCII characters. Now it really would be in something other than UTF-8 — but why it would be mis-identified as the wrong Cyrillic code page, I don’t know.
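
The ambiguity in that scenario can be sketched in Python: while a file holds only ASCII, its bytes are identical under UTF-8 and an ANSI code page, so nothing in the bytes records which was intended. `cp1251` stands in for the ANSI code page here, which is an assumption about the system.

```python
ascii_only = "todo: buy milk"    # hypothetical file content
print(ascii_only.encode("utf-8") == ascii_only.encode("cp1251"))  # True

# Once non-ASCII text is added, the two encodings diverge.
mixed = ascii_only + " и хлеб"
utf8_bytes = mixed.encode("utf-8")
ansi_bytes = mixed.encode("cp1251")
print(utf8_bytes == ansi_bytes)          # False
print(len(utf8_bytes), len(ansi_bytes))  # UTF-8 needs 2 bytes per Cyrillic letter
```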

dz15mlru

@Coises said in BUG: N++ does not keep in UTF8 unsaved open files:

at Settings | Preferences… | MISC. is the box labeled Autodetect character encoding checked? If it is, try unchecking it. That option sometimes does more harm than good.

Yes, it is checked. I’ll try disabling it, but I’m not sure when I’ll see the result.

@Coises said in BUG: N++ does not keep in UTF8 unsaved open files:

If there is any way you can remember or otherwise figure out what change(s) might have happened around the time this changed, it will help with working out what is happening.

Well, a few weeks ago I encountered a few BSODs caused by a faulty RAM unit or slot. Fixed by removing one. N++ was open at the time in at least one of those incidents. Afterwards I was happy that N++ did not lose my huge session of unsaved files, and apparently everything was OK, at least in the most recently opened files, but those mainly had standard Latin-alphabet content. A few days later, though, I discovered the issue of files with malformed text in the wrong encoding. However, I can’t say for sure whether the BSODs caused this issue or whether it already existed for a short time before that. I have a lot of unsaved files and I don’t open all of them daily, so I can’t be sure when the changes occurred.

@PeterJones said in BUG: N++ does not keep in UTF8 unsaved open files:

Assuming autodetection is on (and that’s the best assumption, given the data), it depends on what other characters are also in the file, so if you get a combination of bytes that look to the algorithm like “Cyrillic -> Macintosh” instead of “UTF-8”, then it will pick that

Yes, it is ON. And most often I have mixed content in my documents, both Cyrillic and Latin. But I expected that UTF-8 would preserve all the file content intact…
I’ll disable autodetection.

@Coises said in BUG: N++ does not keep in UTF8 unsaved open files:

One possibility might be if new files are set to open as UTF-8 but “Apply to opened ANSI files” is not checked

Indeed, that is so. In “Settings -> New Document -> Encoding -> UTF-8”, should I tick the option “Apply to opened ANSI files”?
I’m not sure whether I had it checked over the years or whether the setting was changed recently.
About the file content, I’m pretty sure that the file where I noticed the problem had been unchanged for a very long time; it already contained both Cyrillic and Latin, and it had been in UTF-8 all along.
I always keep all my files in UTF-8 and never change the encoding to another one. In this case something happened and a number of files were changed from UTF-8 to Cyrillic -> Macintosh without a valid reason. Perhaps it was due to those BSODs.

dz15mlru

Thank you guys for your support.

So, I’ll try to 1) disable the “autodetection of character encoding”, and 2) check the option “Apply to opened ANSI files”.

Also, a fresh update for N++ arrived just now.
I will perform this update as well, and afterwards I’ll monitor whether the problem appears again.

dz15mlru

So, I’ve checked session.xml to see how many files there are and in which encodings.
https://i.imgur.com/H10X5Sh.png

What I’ve discovered here is inconsistent.
While I expected all files to be in UTF-8 due to my settings, here I found:

Some (most) UTF-8 files with encoding “-1” in session.xml
Some UTF-8 files with encoding “10007” in session.xml
Some “Cyrillic -> Macintosh” files with encoding “10007” in session.xml
Some “Cyrillic -> Windows 1251” files with encoding “1251” in session.xml
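
A tally like the one above can be produced with a few lines of Python. This is only a sketch: it assumes the session file is XML whose per-file elements carry `encoding` and `filename` attributes, and the inline sample is invented rather than copied from a real session.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented stand-in for session.xml; element and attribute names are assumptions.
SAMPLE = """
<NotepadPlus>
  <Session>
    <mainView>
      <File filename="new 1" encoding="-1" />
      <File filename="new 2" encoding="10007" />
      <File filename="new 3" encoding="1251" />
      <File filename="new 4" encoding="-1" />
    </mainView>
  </Session>
</NotepadPlus>
"""

def tally_encodings(xml_text: str) -> Counter:
    """Count how often each value of the 'encoding' attribute occurs."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.get("encoding") for el in root.iter() if el.get("encoding") is not None
    )

counts = tally_encodings(SAMPLE)
for value, n in counts.most_common():
    print(f"encoding={value}: {n} file(s)")
```

To try it on a real session file, read a copy of the file into a string and pass that to `tally_encodings` instead of `SAMPLE`.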

Somehow, N++ has decided by itself and selected all these different encodings. What I need is to have all my files always in UTF-8, and now I’m thinking of mass-converting all the files to UTF-8 somehow…
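
For files that exist on disk, a mass conversion could be sketched in Python as below. It is deliberately conservative: bytes that are already valid UTF-8 are left alone, and everything else is assumed to be Windows-1251, an assumption that has to be checked per file. The function names and the `*.txt` pattern are made up, it should only be run on copies, and it cannot repair text that was already garbled and then re-saved.

```python
from pathlib import Path

def to_utf8(data: bytes, fallback: str = "cp1251") -> bytes:
    """Return data unchanged if it is valid UTF-8, else transcode from fallback.

    Raises UnicodeDecodeError if the bytes are not valid in the fallback either.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(fallback).encode("utf-8")

def convert_folder(folder: Path, pattern: str = "*.txt") -> int:
    """Rewrite matching files in place as UTF-8; return how many changed."""
    changed = 0
    for path in folder.glob(pattern):
        old = path.read_bytes()
        new = to_utf8(old)
        if new != old:
            path.write_bytes(new)
            changed += 1
    return changed
```

Checking for valid UTF-8 first matters because decoding UTF-8 bytes as a single-byte code page is exactly what produces the garbled text in the first place.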

AZJIO

@dz15mlru

Disable automatic encoding detection. For Russian text in Windows-1251, auto-detection will always open it as Macintosh. If you then start editing the files, you will end up with two encodings, or rather garbage from two encodings, which is hard to fix by hand because you have to re-read all the text (it is a module for ruining files). With auto-detection disabled, you will have only ANSI, UTF-8 and UTF-16. Windows XP/7/8/10/11 will always open an ANSI file correctly, as 1251, since that is the default encoding. UTF-8 and the others will also open correctly automatically. You will get rid of the problem for good. The auto-detection module is needed if you open Arabic files in ANSI, but in reality you will never do that, since a Russian-speaking person has only Russian-language files on their computer. People who want to make a file available to everyone on earth save it in UTF-8, and it will always open correctly for you. You don’t need automatic detection, since it only matters for local files, which will never reach you from someone else’s computer abroad.
