    Remove duplicate lines from unsorted, keeping first

    10 Posts 5 Posters 9.1k Views
    • Alan Kilborn

      I’ve seen some techniques here for using regular expressions to remove duplicate lines from an unsorted file, but these all seem to show how to keep the LAST occurrence of the duplicated line. I need to do this but keep the FIRST occurrence. Does anyone know how that can be done?

      • Scott Sumner @Alan Kilborn

        @Alan-Kilborn

        So the Regex Cookbook (buy it!) gives a couple of regular-expression replacement scenarios for this:

        Find-what zone: (?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
        OR
        Find-what zone: (?-s)^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
        For either choice the Replace-with zone is set to \1\2

        Some changes may be made to these expressions for more typical use in Notepad++:

        Find-what zone: (?-s)^(.*)$(?s)(.*?)(?:\R\1$)+
        Replace-with zone: \1\2
        Wrap around checkbox: ticked
        Action: Press Replace All button REPEATEDLY until status bar indicates: “Replace All: 0 occurrences were replaced.”

        But…there are some interesting things to note in the Cookbook regexes:

        • [^\r\n] may be used in place of “(?-s) with a later occurring .” to mean “any character but not including (or across) line-ending characters”
        • [\s\S] may be used in place of “(?s) with a later occurring .” to mean “any character at all (including line-ending characters)”

        I like these as they put all the functionality in one place in the regex.
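
The repeated Replace All procedure can also be tried outside the editor. What follows is a minimal Python sketch and not part of the original post: it uses the first Cookbook expression unchanged, with Python's `re` module standing in for the Notepad++ regex engine. It assumes LF-only line endings, because Python's `$` only treats `\n` as a line break. The helper name `dedupe_keep_first` is hypothetical.

```python
import re

# First Cookbook expression from the post. The leading (?s) is expressed as
# re.DOTALL, and re.MULTILINE makes ^ and $ match at line boundaries.
DUPLICATE_LINE = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+",
    re.MULTILINE | re.DOTALL,
)


def dedupe_keep_first(text: str) -> str:
    """Repeat the replacement until a pass changes nothing (hypothetical helper)."""
    while True:
        # One pass corresponds to one "Replace All" with \1\2 as replacement.
        text, replaced = DUPLICATE_LINE.subn(r"\1\2", text)
        if replaced == 0:
            return text


if __name__ == "__main__":
    sample = "a\nb\na\nc\nb\n"  # assumed LF-only input
    print(dedupe_keep_first(sample), end="")
```

A single pass is not enough for this sample: the first pass removes the second `a` but skips over the lines it consumed, so the second `b` is only removed on the next pass. That is why the loop, like the Replace All button, runs until nothing is replaced.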

        • Alan Kilborn

          Excellent. :)

          Is it important that the last line have a line-ending on it if it is otherwise a duplicate of exactly one line that comes before it? I guess I will find out.

          Thanks.

          • guy038

            Hi, @alan-kilborn, @scott-sumner and All,

            Another formulation of this regex could be:

            SEARCH (?-s)^(.+\R)(?s).*?\K\1+

            REPLACE EMPTY

            OPTIONS Wrap around and Regular expression set

            ACTION: Press the ALT + A shortcut (Replace All) repeatedly

            (Easy to memorize: \K as in Kilborn!)


            So, for instance, from the initial text below, with a line break after the last item 123:

            123
            456
            123
            789
            789
            000
            789
            456
            abc
            123
            123
            456
            456
            456
            789
            999
            123
            
            

            We get :

            123
            456
            789
            000
            abc
            999
            

            Scott, the regex which represents a single standard character (anything except a line-ending character) is (?-s). and it can also be replaced by the exact negative class [^\r\n\f\x85\x{2028}\x{2029}]. So, the regex [^\r\n] is just an easy approximation :-)

            And the regex which represents any character at all, line-ending characters included, is (?s). and it can also be replaced by any of the regexes [\s\S], [\d\D], [\l\L], [\u\U], [\w\W], [\h\H] or [\v\V]
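
The difference between the dot and these character classes can be checked quickly in Python. This is only a sketch with Python's `re` module standing in for Notepad++, so it is an assumption that both engines agree here; only `[\s\S]` and `[^\r\n]` are exercised, and Python's dot treats just `\n` as a line break, unlike the longer exact class above. The helper `spans_newline` is hypothetical.

```python
import re

text = "one\ntwo"


def spans_newline(pattern: str, flags: int = 0) -> bool:
    """True if the pattern matches the whole two-line text (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None


# Without DOTALL the dot stops at "\n", so the two lines cannot be bridged.
print(spans_newline(r"one.two"))                  # False
# "(?s)." and "[\s\S]" both accept the newline.
print(spans_newline(r"(?s)one.two"))              # True
print(spans_newline(r"one[\s\S]two"))             # True
# "[^\r\n]" never matches a line-ending character, whatever the flags.
print(spans_newline(r"one[^\r\n]two", re.DOTALL))  # False
```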


            Scott, you also said:

            I like these as they put all the functionality in one place in the regex

            But if you place the (?s) syntax inside round parentheses, along with a regular expression, (?s) acts ONLY on the part within those parentheses :-))

            For instance, let’s consider the text :

            blablah
            A simple
            abc
            123
            456
            789
            xyz
            Test
            blablah
            

            And imagine the regex (?-s).+\R((?s)abc.+xyz\R).+

            • In the first part and, above all, in the last part of the regex, the dot . matches only a standard character (no line-ending characters)

            • In the middle part, surrounded by parentheses, the dot matches any character at all!
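
The same scoping can be reproduced in Python as a rough check. This sketch is an adaptation, not the Notepad++ regex itself: Python's `re` (3.6 or later) spells a scoped flag as `(?s:...)` rather than placing `(?s)` inside a group, and `\n` stands in for `\R`, which Python does not support.

```python
import re

text = "blablah\nA simple\nabc\n123\n456\n789\nxyz\nTest\nblablah\n"

# (?s:...) turns DOTALL on for the inner group only; the outer dots still
# refuse to cross a line break. The extra parentheses capture the block.
pattern = re.compile(r".+\n((?s:abc.+xyz\n)).+")

match = pattern.search(text)
print(repr(match.group(0)))  # 'A simple\nabc\n123\n456\n789\nxyz\nTest'
print(repr(match.group(1)))  # 'abc\n123\n456\n789\nxyz\n'
```

The first and last `.+` each stay on a single line ("A simple" and "Test"), while the `.+` inside the scoped group runs across several lines.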

            Best Regards,

            guy038

            • Scott Sumner @guy038

              @guy038 said:

              (?-s)^(.+\R)(?s).*?\K\1+

              One of the things that I’m faced with is the need to move regular expressions between Notepad++ and Python code. As Python doesn’t support \K I lean toward the regexes for this that don’t contain it. Note that Python doesn’t support the \R syntax either, but I used it earlier. But as \R is just a simple abbreviation and \K can have bigger logic implications, I’m allowed a little license with the \R. :-D Of course, the longer “Cookbook” regexes have wider applicability than even N++ and Python.

              exact negative class [^\r\n\f\x85\x{2028}\x{2029}]

              Well, I don’t care about any of those beyond the \n…but maybe somebody does! :-D

              …ONLY, inside the part, within parentheses

              Now THAT I didn’t know. Nice. But I still tend to like [\s\S] and [^\r\n] (and their close relatives that you pointed out).

              • AZJIO AZJIO

                http://rgho.st/6GD58rS8H
                See the sections AutoIt and Scripting.Dictionary.
                If satisfied, I will do it in English
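
The linked material is not reproduced in this thread, so the following is not AZJIO's script. It is only a rough Python illustration of the dictionary idea the post names, assuming that idea means remembering each line the first time it is seen. The function name `unique_lines_keep_first` is hypothetical, and the sample is the list from guy038's post above.

```python
def unique_lines_keep_first(lines):
    """Drop later duplicates while preserving first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+), so the
    # first occurrence of each line decides its position in the result.
    return list(dict.fromkeys(lines))


sample = [
    "123", "456", "123", "789", "789", "000", "789", "456", "abc",
    "123", "123", "456", "456", "456", "789", "999", "123",
]
print(unique_lines_keep_first(sample))
# ['123', '456', '789', '000', 'abc', '999']
```

Unlike the regex approach, this needs a single pass and no repeated Replace All, at the cost of running outside the editor's Replace dialog.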

                • AZJIO AZJIO

                  English
                  http://rgho.st/8C8jFTw4b

                  • Scott Sumner @Alan Kilborn

                    @Alan-Kilborn said:

                    Is it important that the last line have a line-ending on it if it is otherwise a duplicate…?

                    Mostly No…but maybe ?

                    Say we start with this data:

                    value1[CR][LF]
                    value2[CR][LF]
                    value2[CR][LF]
                    value4[CR][LF]
                    value3[CR][LF]
                    value3[CR][LF]
                    value2[CR][LF]
                    value4
                    

                    Notice that the last value4 does NOT have a line-ending on it.

                    Then, using the technique above, and after multiple Replace All actions, we are left with this:

                    value1[CR][LF]
                    value2[CR][LF]
                    value4[CR][LF]
                    value3
                    

                    Thus, the final value4 was detected as a duplicate and removed, even though the line itself wasn’t an exact duplicate in the original data.

                    Note, however, that the new last line (containing value3) now does not have a line-ending…when all the value3 lines in the original file did.
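
This outcome can be reproduced in Python on the same CRLF data, as a sketch under stated assumptions. Python's `$` only recognises `\n` as a line end, so the Cookbook expression is adapted here: each `$` is replaced by a lookahead that accepts `\r`, `\n` or the end of the text. That adaptation and the names `LINE_END` and `dedupe_keep_first` are hypothetical and not from the thread.

```python
import re

# Assumed adaptation of the Cookbook regex: "$" becomes a lookahead, so a
# line may end at "\r", at "\n", or at the very end of the text.
LINE_END = r"(?=[\r\n]|\Z)"
DUPLICATE_LINE = re.compile(
    r"^([^\r\n]*)" + LINE_END + r"(.*?)(?:(?:\r?\n|\r)\1" + LINE_END + r")+",
    re.MULTILINE | re.DOTALL,
)


def dedupe_keep_first(text: str) -> str:
    """Repeat the replacement until a pass changes nothing (hypothetical helper)."""
    while True:
        text, replaced = DUPLICATE_LINE.subn(r"\1\2", text)
        if replaced == 0:
            return text


# Same data as above; the final value4 has no line ending.
data = "\r\n".join(
    ["value1", "value2", "value2", "value4", "value3", "value3", "value2", "value4"]
)
result = dedupe_keep_first(data)
print(repr(result))  # 'value1\r\nvalue2\r\nvalue4\r\nvalue3'
```

The line ending in front of each removed duplicate is deleted together with it, which is why the surviving last line (value3) ends up without one.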

                    • Sepehr E

                      @guy038,
                      Why does the regex (?-s)^(.+\R)(?s).*?\K\1+ sometimes make Notepad++ think the whole text is duplicated and needs to be deleted?

                      • guy038

                        Hello, @sepehr-e,

                        I must admit that regexes which involve a great amount of text may sometimes, wrongly, produce a single match which covers all the file contents :-(( This may also happen with regexes that contain recursive patterns!

                        I can’t clearly explain this behaviour. Maybe it’s related to a matched range of characters that exceeds some limit. It could also depend on the amount of RAM or on specific N++ features, like the periodic backup!


                        In practice, you could use the regex below, which adds another condition: the \1+ block of lines to be deleted must, most of the time, follow some end-of-line characters!

                        (?-s)^(.+\R)(?s).*?\R?\K\1+

                        You didn’t say anything about your working file, but it might help!

                        Note that the \R syntax must be optional (=> the form \R?) to handle a block of consecutive identical lines, as for instance:

                        456
                        789
                        123
                        123
                        123
                        123
                        000
                        

                        Best Regards,

                        guy038
