
    Wanted function: Remove Duplicated Lines

    • Dmitry Bond

      Hi.

      When preparing master data, I often need to remove duplicated lines.
      Currently I have to use the command line for this operation. So: paste into NPP, sort the lines, remove empty lines, save to a txt file, run a command-line tool on the text file to remove duplicates, reload back into NPP, continue editing.
      It would be nice to have such a function in NPP, somewhere under menu -> Edit -> Line Operations -> Remove Duplicated Lines (maybe with an “ignore case” option).

      Thank you.

      • chcg

        see https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad
        or for just words https://notepad-plus-plus.org/community/topic/15247/replacing-duped-words-across-a-block-block-of-text-respecting

        • guy038

          Hello, @dmitry-Bond and All,

          First of all, we must agree on what Remove Duplicated Lines means !

          So, assuming the initial example text :

              bbbbbbb
              hhhhhhhhhhh
              fffffffffffffff
              bbbbbbb
              aaaaa
              bbbbbbb
              jj
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              AAAaa
              ccccccccccccccccccccccccccccccccccccccccccccccc
              AAAAA
              ddd
              iiiiiiiiiiiiiiiii
              aaaaa
              hhhhhhhhhhh
              gggggggggggggggggggggggggggggggggggg
              AAAAA
              bbbbbbb
          
          • IF the search is case-insensitive, 3 lines are duplicated :

            • The line aaaaa ( 5 items )

            • The line bbbbbbb ( 4 items )

            • The line hhhhhhhhhhh ( 2 items )

          • IF the search is case-sensitive, 4 lines are duplicated :

            • The line aaaaa ( 2 items )

            • The line AAAAA ( 2 items )

            • The line bbbbbbb ( 4 items )

            • The line hhhhhhhhhhh ( 2 items )
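As a rough cross-check of these counts outside the editor, here is a minimal Python sketch. It is not part of the original post, and the helper name `duplicated_lines` is made up for illustration; it simply counts how often each line of the example text occurs, with or without case folding.

```python
from collections import Counter

# The example text from the post, one entry per line.
SAMPLE = """\
bbbbbbb
hhhhhhhhhhh
fffffffffffffff
bbbbbbb
aaaaa
bbbbbbb
jj
eeeeeeeeeeeeeeeeeeeeeeeeeee
AAAaa
ccccccccccccccccccccccccccccccccccccccccccccccc
AAAAA
ddd
iiiiiiiiiiiiiiiii
aaaaa
hhhhhhhhhhh
gggggggggggggggggggggggggggggggggggg
AAAAA
bbbbbbb
"""


def duplicated_lines(lines, ignore_case=False):
    """Return {line: count} for every line that occurs more than once.

    Hypothetical helper: with ignore_case=True the keys are lower-cased,
    so "aaaaa", "AAAaa" and "AAAAA" are counted as the same line.
    """
    keys = (line.lower() if ignore_case else line for line in lines)
    return {key: n for key, n in Counter(keys).items() if n > 1}


lines = SAMPLE.splitlines()
print(duplicated_lines(lines))                    # 4 duplicated lines
print(duplicated_lines(lines, ignore_case=True))  # 3 duplicated lines
```

Lower-casing is only one possible definition of "insensitive"; the editor's own case folding may differ for non-ASCII text.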

          I suppose that you probably want to get a final text with, at least, a single item of all these duplicated lines :-)) If you really want to delete ALL the lines, which are duplicated, just tell me !

          This task can be easily done with a search/replacement, using Regular expressions ! So :


          • Open your file in N++

          • Open the Replace dialog ( Ctrl + H )

          • IF your file is already SORTED :

            • Type in (?-is)^(.+\R)\1+ ( search sensitive to case ) OR (?i-s)^(.+\R)\1+ ( search insensitive to case )

            • Type in \1 in the Replace with: zone

          • IF your file is an UNSORTED file :

            • Type in (?s-i)^(.+?\R)(?=(?:.+\R)?\1) ( search sensitive to case ) OR (?si)^(.+?\R)(?=(?:.+\R)?\1) ( search insensitive to case )

            • Leave the Replace with: zone EMPTY

          • Tick the Wrap around option and the Regular expression search mode

          • Click the Replace button several times, or click the Replace All button once

          Et voilà !

          Remark: When processing an unsorted file, every occurrence of a duplicated line, except the last one, is deleted !


          So, from the initial example text ( see above ), let’s perform a pre-sort operation ( Edit > Line operations > Sort lines lexicographically Ascending ), we get the sorted text, below :

              AAAAA
              AAAAA
              AAAaa
              aaaaa
              aaaaa
              bbbbbbb
              bbbbbbb
              bbbbbbb
              bbbbbbb
              ccccccccccccccccccccccccccccccccccccccccccccccc
              ddd
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              fffffffffffffff
              gggggggggggggggggggggggggggggggggggg
              hhhhhhhhhhh
              hhhhhhhhhhh
              iiiiiiiiiiiiiiiii
              jj
          

          Then, the search = (?-is)^(.+\R)\1+ and replacement = \1 would change text into :

              AAAAA
              AAAaa
              aaaaa
              bbbbbbb
              ccccccccccccccccccccccccccccccccccccccccccccccc
              ddd
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              fffffffffffffff
              gggggggggggggggggggggggggggggggggggg
              hhhhhhhhhhh
              iiiiiiiiiiiiiiiii
              jj
          

          Whereas the search = (?i-s)^(.+\R)\1+ and replacement = \1 would change text into :

              AAAAA
              bbbbbbb
              ccccccccccccccccccccccccccccccccccccccccccccccc
              ddd
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              fffffffffffffff
              gggggggggggggggggggggggggggggggggggg
              hhhhhhhhhhh
              iiiiiiiiiiiiiiiii
              jj
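For readers who want to try the sorted-file regex outside Notepad++, here is a minimal Python sketch. It rests on some assumptions: Python's `re` module has no `\R`, so `\n` is used as the only line ending, and the function name `collapse_sorted` is invented for this example. Like the editor regex, it only collapses adjacent duplicates, and a final line without a trailing newline is not matched.

```python
import re

# Adjacent-duplicate pattern, adapted from the post: a whole line including
# its newline, followed by one or more exact repeats of itself.
# Assumption: "\n" stands in for \R, which Python's re does not support.
ADJACENT_DUPES = r"^(.+\n)\1+"


def collapse_sorted(text, ignore_case=False):
    """Keep the first line of each run of adjacent identical lines."""
    flags = re.MULTILINE
    if ignore_case:
        flags |= re.IGNORECASE  # the backreference \1 then ignores case too
    return re.sub(ADJACENT_DUPES, r"\1", text, flags=flags)


sorted_text = "AAAAA\nAAAAA\nAAAaa\naaaaa\naaaaa\nbbbbbbb\nbbbbbbb\nddd\n"
print(collapse_sorted(sorted_text))                    # AAAAA, AAAaa, aaaaa, bbbbbbb, ddd
print(collapse_sorted(sorted_text, ignore_case=True))  # AAAAA, bbbbbbb, ddd
```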
          

          Now, If we use the initial text, without any sort operation :

          The search = (?s-i)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

              fffffffffffffff
              jj
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              AAAaa
              ccccccccccccccccccccccccccccccccccccccccccccccc
              ddd
              iiiiiiiiiiiiiiiii
              aaaaa
              hhhhhhhhhhh
              gggggggggggggggggggggggggggggggggggg
              AAAAA
              bbbbbbb
          

          Whereas the search = (?si)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

              fffffffffffffff
              jj
              eeeeeeeeeeeeeeeeeeeeeeeeeee
              ccccccccccccccccccccccccccccccccccccccccccccccc
              ddd
              iiiiiiiiiiiiiiiii
              hhhhhhhhhhh
              gggggggggggggggggggggggggggggggggggg
              AAAAA
              bbbbbbb
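The "keep the last occurrence" behaviour of the unsorted-file regex can also be reproduced without a regex. The sketch below is an illustration only, not the regex itself: it walks the lines backwards and remembers what it has already seen, which takes one pass over the text. The name `keep_last_occurrence` is hypothetical.

```python
SAMPLE = """\
bbbbbbb
hhhhhhhhhhh
fffffffffffffff
bbbbbbb
aaaaa
bbbbbbb
jj
eeeeeeeeeeeeeeeeeeeeeeeeeee
AAAaa
ccccccccccccccccccccccccccccccccccccccccccccccc
AAAAA
ddd
iiiiiiiiiiiiiiiii
aaaaa
hhhhhhhhhhh
gggggggggggggggggggggggggggggggggggg
AAAAA
bbbbbbb
"""


def keep_last_occurrence(lines, ignore_case=False):
    """Drop every occurrence of a line except its last one, keeping order."""
    seen = set()
    kept = []
    # Walking backwards means the first time we meet a line is its LAST
    # occurrence in the original order.
    for line in reversed(lines):
        key = line.lower() if ignore_case else line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    kept.reverse()
    return kept


lines = SAMPLE.splitlines()
print("\n".join(keep_last_occurrence(lines)))                    # 12 lines remain
print("\n".join(keep_last_occurrence(lines, ignore_case=True)))  # 10 lines remain
```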
          

          Remarks :

          • In a previously sorted file, the regexes keep the first duplicate found

          • In an unsorted file, the regexes keep the last duplicate found

          • If you want to delete the pure blank lines, as well :

            • Add |^\R at the end of any search regex

            • Change the non-empty replacement \1 with the syntax ?1\1
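To illustrate what the conditional replacement ?1\1 does (insert group 1 only if it took part in the match, otherwise insert nothing), here is a small Python sketch. Python's `re` has neither `\R` nor the `?1` replacement syntax, so `\n` and a replacement function are used instead; both substitutions, and the name `collapse_and_strip_blank`, are assumptions of this example rather than anything from the post.

```python
import re

# Sorted-file pattern with the extra "|^\n" alternative that matches a
# completely empty line. Assumption: "\n" replaces \R.
PATTERN = re.compile(r"^(.+\n)\1+|^\n", re.MULTILINE)


def collapse_and_strip_blank(text):
    """Collapse runs of identical lines and delete empty lines in one pass."""
    # Group 1 is None when the empty-line alternative matched, so the
    # replacement is "" in that case: the same effect as ?1\1 in the editor.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


print(repr(collapse_and_strip_blank("x\nx\n\ny\n")))  # 'x\ny\n'
```

Because this is a single pass, two identical lines that only become neighbours after a blank line between them is removed are both kept, so a second run may be needed in that case.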

          I could give you some explanations about these regexes, or other topics, next time, if you want to :-))

          Best Regards,

          guy038

          • Juan Miguel Martínez

            I want to add my grain of salt about this: the regex solution works, but it’s extremely slow for anything beyond a hundred lines or so. I added one of the regex solutions (not sure if it was this one) as a macro, and I regularly use it on a ~6k-line text; it takes long enough to be annoying, even on a 4 GHz CPU. Right now I had to run it on a 120k-line text and it’s going on its tenth minute. TextFX did stuff like this almost instantly (or in a reasonable amount of time for large texts, anyway), but there’s no 64-bit build of it.

            I have also read that the devs aren’t too keen to add functions when a regex can do the work, but in this case ‘can do the work’ is very open to interpretation in the context of a text editor that is capable of working with very large files. The running time of the regex solution (or, better put, workaround) doesn’t increase linearly with the number of lines; it seems exponential.

            Right now my options are:

            • Reinstall 32bit Notepad++, or
            • Keep using the really slow regex “solution”.

            Which IMHO is a really bad pair of choices when this function, which shouldn’t be that bad to code, could be in a simple plugin, or as added functionality in the Edit-> Operations with lines section.

            So, will someone attend to our pleas, please?

            Oh, my regex hasn’t ended yet…

            • Jim Dailey

              @Juan-Miguel-Martínez You have other options:

              • Learn a scripting language to perform tasks like this.
              • Use a different editor that is more to your liking.
              • Find the TextFX source code and build a 64-bit version.

              IMHO, people far too often expect an editor to do things that are done much more easily by tools that have been around for many years. For example, this simple AWK program will print the unique lines in a file (no sorting needed):

              { if (L[$0]++ == 0) print }
              

              And, I expect it would be very efficient, even on your 120K line input file.

              Now, consider a couple of variations. Suppose you wanted only the duplicated lines, this would do the trick:

              { if (L[$0]++ == 1) print }
              

              And, finally, if you wanted to see all of the duplicated lines (including duplicates of the duplicates), this would work:

              { if (L[$0]++ >= 1) print }
              

              I’m sure the Python and PERL experts can demonstrate similar capabilities.
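As one possible Python counterpart to the three AWK one-liners above, here is a minimal sketch. The function names are invented for this example; each one keeps a per-line counter, just as `L[$0]++` does, and differs only in the condition tested against the counter's value before the increment.

```python
def _filter_by_count(lines, keep):
    """Yield each line for which keep(times_seen_before) is true."""
    counts = {}
    for line in lines:
        seen_before = counts.get(line, 0)  # value of L[$0] before the ++
        counts[line] = seen_before + 1
        if keep(seen_before):
            yield line


def first_occurrences(lines):
    """Like `L[$0]++ == 0`: every distinct line once, in input order."""
    return list(_filter_by_count(lines, lambda n: n == 0))


def duplicated_once(lines):
    """Like `L[$0]++ == 1`: each duplicated line once (its 2nd occurrence)."""
    return list(_filter_by_count(lines, lambda n: n == 1))


def all_repeats(lines):
    """Like `L[$0]++ >= 1`: every occurrence after the first."""
    return list(_filter_by_count(lines, lambda n: n >= 1))


demo = ["a", "b", "a", "c", "a", "b"]
print(first_occurrences(demo))  # ['a', 'b', 'c']
print(duplicated_once(demo))    # ['a', 'b']
print(all_repeats(demo))        # ['a', 'a', 'b']
```

Each function does one dictionary lookup per line, so the work should grow roughly linearly with the number of lines.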

              Please consider learning a bit of AWK, Python, PERL, or some other such scripting language if you need to manipulate files in various ways that seem a bit difficult or contorted using the editor’s search and replace capabilities.

              • PeterJones

                perl -ne "print unless $seen{$_}++" dups.txt

                (I recommend easy-to-install strawberry perl for windows perl usage)

                • Juan Miguel Martínez

                  You forgot to mention I can also write my own text editor with the functions I want cough
                  I’m not demanding anything; I’m requesting something that seems useful, that other people seem to want as well, that I don’t think is hard to implement (especially if the lines are already sorted), and that is certainly not out of place, given the other line operations Notepad++ already offers. No need to go all smartass about it.

                  • PeterJones

                    If you are using the 32bit Notepad++ and have PythonScript installed (or you grab and run the installer), you can run this PythonScript to delete adjacent duplicate rows:

                    # remove duplicate lines (assumes lines already sorted, so only compares to previous line)
                    
                    console.clear()
                    console.show()
                    
                    prev = "should not match previous"
                    lineNumber = 0
                    
                    while lineNumber < editor.getLineCount():
                        editor.gotoLine(lineNumber)
                        contents = editor.getLine(lineNumber)
                    
                        console.write( "#" + str(lineNumber) + "/" + str(editor.getLineCount()) )
                        console.write( "#[" + str(len(contents)) + "]\t" + contents)
                    
                        if contents == prev:
                            console.write( "\tdeleting\n" )
                            editor.deleteLine(lineNumber)
                        else:
                            console.write( "\tno match\n" )
                            lineNumber = lineNumber + 1
                    
                        prev = contents
                    

                    running it on the following

                    # line 1
                    # this matches
                    # this matches
                    # ends up as line 3
                    # this matches
                    # ends up as line 5
                    

                    will result in

                    # line 1
                    # this matches
                    # ends up as line 3
                    # this matches
                    # ends up as line 5
                    

                    So even though the 5th input line matches the 2nd/3rd input lines, it won’t be deleted. This matches your requirement/specification of “it’s already sorted”
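The same adjacent-only comparison can be sketched in plain Python, outside the editor, with `itertools.groupby`. This is an illustration, not a replacement for the PythonScript above: it works on a list of strings rather than on the `editor` object, and the name `drop_adjacent_duplicates` is made up for the example.

```python
from itertools import groupby


def drop_adjacent_duplicates(lines):
    """Keep one line from each run of consecutive identical lines."""
    # groupby() starts a new group whenever the value changes, so
    # non-adjacent repeats end up in separate groups and are all kept.
    return [line for line, _run in groupby(lines)]


before = [
    "# line 1",
    "# this matches",
    "# this matches",
    "# ends up as line 3",
    "# this matches",
    "# ends up as line 5",
]
for line in drop_adjacent_duplicates(before):
    print(line)
```

As in the PythonScript version, the later "# this matches" line survives because it is not adjacent to the earlier pair.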

                    Oh, sorry. I just noticed while re-reading that it’s because you’ve switched to NPP 64bit that you asked for this to begin with. Unfortunately, PythonScript isn’t there (yet) for 64bit. I know Claudia is working on converting the PythonScript plugin to 64bit, but she’s not there yet.

                    Someone (not me) might be able to convert my PythonScript solution into something that works with LuaScript, which runs under either 32bit or 64bit NPP. (I don’t know Lua at all, sorry.)

                    An alternative solution: since you know the functionality you need exists in the TextFX plugin for 32bit, grab a portable installation of 32bit NPP and load that version when you want to use TextFX, but use 64bit by default in most situations.

                    • Scott Sumner @Juan Miguel Martínez

                      I took a look at TextFX and I don’t see a “remove duplicate lines” functionality. There is an ability to keep only unique lines upon doing a sort, but this is not the same thing…not everyone wants a sort along with duplicate-line removal. So recompiling TextFX for 64-bit isn’t going to do anything for providing what the OP is asking for. Maybe I’m wrong and I just missed seeing it in TextFX?

                      @Juan-Miguel-Martínez said:

                      I can…write my own text editor with the functions I want cough

                      I must admit that caused me to laugh out loud–guess it was the cough part on the end. :^)

                      …No need to go all smartass about it.

                      This comment caused me to stop laughing. This forum is all about alternative solutions to problems, and Jim/Guy/Peter were simply trying to illustrate some of those. They weren’t going “all smartass” as I interpret what they said. There is usually value in all contributions here, and I for one will be checking out Strawberry Perl (used Perl before, but not the fruit-flavored variety).

                      That all being said, if you already know alternatives exist and are just stating that you’d like to see a duplicate lines removal feature built right into Notepad++, then point taken, request noted. As to whether or not you will ever see that happen, I have no idea.

                      The Community of users of the Notepad++ text editor.