Community
    • Login

    C# plugin and reading text content, characters missing when too quickly after a user edit

    Scheduled Pinned Locked Moved Notepad++ & Plugin Development
    12 Posts 4 Posters 250 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • EkopalypseE
      Ekopalypse @Mark Olson
      last edited by Ekopalypse

      @Mark-Olson

              public unsafe void SetText(string text)
              {
                  if (text.Length == 0)
                  {
                      ClearAll();
                      return;
                  }
                  fixed (byte* textPtr = Encoding.UTF8.GetBytes(text))
                  {
                      Win32.SendMessage(scintilla, SciMsg.SCI_SETTEXT, (IntPtr) Unused, (IntPtr) textPtr);
                  }
              }
      

      Are you sure that textPtr is a null terminated “string” pointer? I assume it is not.

      EkopalypseE 1 Reply Last reply Reply Quote 1
      • EkopalypseE
        Ekopalypse @Ekopalypse
        last edited by Ekopalypse

        This does not seem correct either

                private unsafe string GetNullStrippedStringFromMessageThatReturnsLength(SciMsg msg, int length = 0)
                {
                    if (length < 1)
                        length = Win32.SendMessage(scintilla, msg, (IntPtr)Unused, (IntPtr)Unused).ToInt32();
                    byte[] textBuffer = new byte[length];
                    fixed (byte* textPtr = textBuffer)
                    {
                        Win32.SendMessage(scintilla, msg, (IntPtr)length, (IntPtr)textPtr);
                        int lastNullCharPos = length - 1;
                        // this only bypasses NULL chars because no char
                        // other than NULL can have any 0-valued bytes in UTF-8.
                        // See https://en.wikipedia.org/wiki/UTF-8#Encoding
                        for (; lastNullCharPos >= 0 && textBuffer[lastNullCharPos] == '\x00'; lastNullCharPos--) { }
                        return Encoding.UTF8.GetString(textBuffer, 0, lastNullCharPos + 1);
                    }
                }
        

        If length is 0 or less, the required null-terminating byte is missing, right? Means, the stack/heap might get corrupted.

        EkopalypseE 1 Reply Last reply Reply Quote 2
        • EkopalypseE
          Ekopalypse @Ekopalypse
          last edited by

          Hmmm … as far as I can tell, I see a few problems in the original scintilla wrapper code.
          It often casts to int where intptr actually belongs. This may be due to internal C# issues, but is that, in 2025, really still the case?

          Mark OlsonM 1 Reply Last reply Reply Quote 1
          • Mark OlsonM
            Mark Olson @Ekopalypse
            last edited by

            @Ekopalypse said in C# plugin and reading text content, characters missing when too quickly after a user edit:

            It often casts to int where intptr actually belongs.

            Agreed that this is done a lot, but never in cases where the IntPtr returned is a a location in memory. It is more convenient for the developer to use an int.

            Obviously some methods like SCI_GETLENGTH could potentially return a number greater than int.MaxValue, but that is exactly why I implemented Npp.TryGetLengthAsInt and Npp.TryGetText: to make it easy for developers to bail out and stop executing a plugin command on files that are too big. It is literally impossible for a C# string to have length greater than int.MaxValue, after all.

            @Ekopalypse said in C# plugin and reading text content, characters missing when too quickly after a user edit:

            Are you sure that textPtr is a null terminated “string” pointer? I assume it is not.

            Yeah, that’s an issue that I need to fix. Thank you for pointing this out! I’ll find all the places where a null terminated string is required and adjust the code where appropriate.

            Mark OlsonM 1 Reply Last reply Reply Quote 4
            • Mark OlsonM
              Mark Olson @Mark Olson
              last edited by Mark Olson

              After carefully consulting the Scintilla documentation, I believe that this NppCSharpPluginPack commit ought to fix the issue @Ekopalypse mentioned where in some cases I was supposed to be sending a null-terminated string to Notepad++ but wasn’t.

              As far as I can tell, in every other situation where I send a string to Notepad++ I am also sending the length and thus the string can contain NUL characters.

              rdipardoR 1 Reply Last reply Reply Quote 2
              • Bas de ReuverB
                Bas de Reuver
                last edited by Bas de Reuver

                @Ekopalypse @Mark-Olson Thanks for looking into this and your insights. I’ve been using the ScintillaGateway source files as provided and tbh I don’t fully understand what it all does. Looking at the NppCSharpPluginPack commit and my ScintillaGateway.cs I think there have been some fixed along the way of developing the CSV plugin and I’ve got an outdated or different source files.

                I suspect there’s not some easy fix for this issue and it’s better to move the whole CSV Lint project from using the VS2019 NotepadPlusPlusPluginPack.Net to using the newer NppCSharpPluginPack. That will take some reworking but then we’re at least all working with the same baseline source files.

                1 Reply Last reply Reply Quote 1
                • rdipardoR
                  rdipardo @Mark Olson
                  last edited by rdipardo

                  @Mark-Olson said in C# plugin and reading text content, characters missing when too quickly after a user edit:

                  After carefully consulting the Scintilla documentation, I believe that this NppCSharpPluginPack commit ought to fix the issue @Ekopalypse mentioned where in some cases I was supposed to be sending a null-terminated string to Notepad++ but wasn’t.

                  If I’m following correctly, the return value of this method is the “null-terminated string” of the commit message (where + 1 accounts for the NULL, I’m guessing):

                  private byte[] GetNullTerminatedUTF8Bytes(string text)
                  {
                      int length = Encoding.UTF8.GetByteCount(text);
                      byte[] bytes = new byte[length + 1];
                      int lengthWritten = Encoding.UTF8.GetBytes(text, 0, text.Length, bytes, 0);
                      // ...
                      return bytes;
                  }
                  

                  Looks good. But what if the given document is not UTF8, but ANSI or UTF-16?

                  Notepad++ calls “ANSI” whatever the system’s default encoding happens to be, which is the same code page mapped to System.Text.Encoding.Default, and can be looked up with reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v ACP.

                  Here’s the string "café\0", with the NULL character already included:

                  Microsoft (R) Visual C# Interactive Compiler version 4.14.0-3.25326.11 ()
                  Copyright (C) Microsoft Corporation. All rights reserved.
                  
                  Type "#help" for more information.
                  > using static System.Text.Encoding;
                  > UTF8.GetByteCount("café\0")
                  6
                  > Default.GetByteCount("café\0")
                  5
                  

                  Now, the type of string returned by most N++ API methods is wchar_t*, which is UTF-16 on Windows. The .NET equivalent of UTF-16 is System.Text.Encoding.Unicode; here’s the length in bytes of a single NULL character in UTF-16:

                  > Unicode.GetByteCount("\0")
                  2
                  > // in fact...
                  > System.Runtime.InteropServices.Marshal.SystemDefaultCharSize == Unicode.GetByteCount("\0")
                  true
                  

                  Which means the string "café\0" has 2 possible byte lengths depending on the document’s encoding: 5 for ANSI and 6 for UTF8. If we were getting text from an N++ API and inserting that into the document, the length would be something else again:

                  > Unicode.GetByteCount("café\0")
                  10
                  

                  In other words, a NULL character is not guaranteed to always be the same length as a NULL byte.

                  A more comprehensive fix would involve querying the current document’s encoding (e.g., using SCI_GETCODEPAGE), then creating an instance of System.Text.Encoding using, for example, System.Text.Encoding.GetEncoding(Int32), with a fallback in case of exceptions, e.g.:

                  -    int length = Encoding.UTF8.GetByteCount(text);
                  -    byte[] bytes = new byte[length + 1];
                  +    // assign the .NET equivalent of the document's encoding to "cp"
                  +    int length = cp.GetByteCount(text);
                  +    byte[] bytes = new byte[length + cp.GetByteCount("\0")];
                  

                  For a fully worked-out example, see this implementation of encoding-aware Scintilla wrappers for .NET. The code is multi-targeted, so it should be possible to copy-paste any part of it into a .NET Framework project.

                  Mark OlsonM 1 Reply Last reply Reply Quote 2
                  • Bas de ReuverB
                    Bas de Reuver
                    last edited by

                    Just to make clear what the issue is, see screenshots below, so

                    A) Open the file and notice the text “test123@my.”
                    B) make one edit, like change “test123@my.” to “test12@my.”
                    C) quickly open Reformat dialog and press OK, like within 2 seconds

                    Now one extra character after the edited part is missing, so “test12@my.” becomes “test12my.”

                    csv_lint_reformat_too_quickly.png

                    It always hapens at the point of the edit, and it only happens when you do it very quickly, like within about 2 seconds. In other words, when you make the edit, open the Reformat dialog, wait 5 seconds, then press OK, then it doesn’t happen.

                    Can something like that really be explained by a string encoding issue? I mean, why does it not happen when you wait a few more seconds?

                    Mark OlsonM 1 Reply Last reply Reply Quote 1
                    • Mark OlsonM
                      Mark Olson @Bas de Reuver
                      last edited by

                      @Bas-de-Reuver

                      Can something like that really be explained by a string encoding issue?

                      I agree with you that this is pretty clearly a race condition, and not an encoding issue.

                      If you’re having a problem with ScintillaStreams while the document is being edited, it seems like the simplest option would be to do something like I do in JsonTools with automatic linting after edit where I track the last time the document was edited and wait to do some operations until the user hasn’t edited for a while. But the only way I can see to do a non-blocking wait from when the user presses a button until some condition is met is to spawn a new thread that keeps sleeping and checking the condition until that condition is met.

                      I guess my feeling is that ScintillaGateway.GetText is a superior option to ScintillaStreams for reformatting, because if the user is using their plugin to reformat their file, the only sane thing for them to do is wait until the reformatting is done, and a blocking function forces them to do this. I can see why you prefer ScintillaStreams for just getting a CSV schema or any other read-only operation, since not blocking the editor is worth being off by a character or two.

                      1 Reply Last reply Reply Quote 1
                      • Mark OlsonM
                        Mark Olson @rdipardo
                        last edited by

                        @rdipardo

                        But what if the given document is not UTF8, but ANSI or UTF-16?

                        I’ve finally gotten around to implementing a fix for this issue in JsonTools. I did this first in JsonTools because it has a lot of features that read the text of the file. It appears to work fine for ANSI files (you can test it on this example file I added).

                        As I noted in the commit message, there are some features that are broken because my JSON parser’s code for determining node position assumes that the document is UTF8-encoded.

                        …or UTF-16?

                        JsonTools has always worked correctly on files with any encoding except ANSI, so at least on my machine, I have just confirmed that Notepad++ internally represents documents as UTF-8 if the file itself is in UTF-16 or some exotic encoding like OEM 865.

                        1 Reply Last reply Reply Quote 1
                        • First post
                          Last post
                        The Community of users of the Notepad++ text editor.
                        Powered by NodeBB | Contributors