    Delete number strings in the middle of lines of data

    • Alan Kilborn @guy038

      @guy038 said in Delete number strings in the middle of lines of data:

      (?(DEFINE)…)

      It’s a nice construct. It is documented here for those that don’t know:

      https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

      as

      (?(DEFINE)never-executed-pattern) Defines a block of code that is never executed and matches no characters: this is usually used to define one or more named sub-expressions which are referred to from elsewhere in the pattern.
      

      I don’t think it has a mention in the official Notepad++ docs, though.

      It doesn’t mean a lot if you simply read it, but a lot of value is added with a concrete example such as that provided by @guy038

      One thing that I don’t like about it is that it consumes a capture group number. Wouldn’t it be better to work with named and not numbered groups? Indeed the docs say “…define one or more named sub-expressions…” so this would be equivalent for “my” regex (regex A) above:

      (?x-si)    (    ?(DEFINE)    (?<ALAN>    (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w])    )    )        (^(?P>ALAN)\h|(?P>ALAN)$)
      

      But alas, even though I’ve used a group named ALAN above, it is equivalent to group #1, thus a possible equivalency use case could look like this:

      (?x-si)    (    ?(DEFINE)    (?<ALAN>    (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w])    )    )        (^(?1)\h|(?1)$)
      

      Note that the difference is, even though I’ve named the group ALAN at “define” time, I refer to it as 1 when actually used.

      So why is this a downside? Well, because it couples the left side (definition) with the right side (use). Maybe I have a library of definitions, that I want to largely ignore (except their names), and I’m wanting to write a regex I’m going to use to match some data–maybe in the regex I want to backrefer to my own capture group #1. Well, because of the coupling, group #1 would already be in use.
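
For readers who want to experiment outside Notepad++, the decoupling wished for above can be approximated in a language that has no (?(DEFINE)…) at all. The sketch below is only an illustration under stated assumptions, not Boost behavior: it uses Python's standard `re` module, keeps the definition as a plain string (the name `ALAN` is reused from the post, `strip_edge_numbers` is a hypothetical helper), replaces `\h` with `[ \t]`, and splices the definition into the final pattern. Since the definition contains no capturing group, no group number is consumed.

```python
import re

# "Definition" kept as an ordinary string instead of a (?(DEFINE)...) block.
# Same number pattern as in the post, written without the (?x) free-spacing flag.
ALAN = r"(?<![.\w+-])[+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?(?![.\w])"

# "Use" side: a number at the start of a line followed by a blank,
# or a number at the end of a line. [ \t] stands in for Boost's \h.
EDGE_NUMBER = re.compile("^" + ALAN + r"[ \t]|" + ALAN + "$", re.MULTILINE)


def strip_edge_numbers(text):
    """Delete numbers that start or end a line (hypothetical helper)."""
    return EDGE_NUMBER.sub("", text)


print(EDGE_NUMBER.groups)                         # 0: no capture group is used up
print(repr(strip_edge_numbers("12.5 abc 7 99")))  # 'abc 7 '
```

With this layout, a caller who adds a capturing group of their own gets group #1 for it, which is the property the named-only approach was hoping for.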

      Ok, so maybe it is a slight downside that wouldn’t come up often, but, I just happened to encounter that scenario recently… :-)

      Did this turn into a Boost regex forum accidentally, or what?!? So sorry…

      • guy038

        Hi, @alan-kilborn and All,

        Yes, Alan, I agree with you that named groups should not be numbered by the regex engine, so that the user would refer to them only by name, as backreferences, in the search and/or replacement!

        However, the .NET regex engine has an intelligent way to get the best of both worlds! It first scans all the unnamed groups, numbering them from 1, then re-scans the regex and numbers the named groups, continuing after the greatest number used by the unnamed groups ;-))

        In the old version, linked below, of the Regular-Expressions manual by Jan Goyvaerts (creator of the Regular-Expressions.info site),

        https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

        it says, on pages 36-37:

        Names and Numbers for Capturing Groups :

        Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.

        Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.

        The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.

        To make things simple, when using .NET’s regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively.

        But, with the Boost regex engine of Notepad++, we have to make do with the usual numbering of the groups, which just does one regex scan and numbers any group, named or not, one after the other !
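
The left-to-right numbering quoted above for Python can be checked directly. This is a small sketch using Python's standard `re` module, so it shows Python's behavior only and is not a test of Boost inside Notepad++; the regex and replacement are the ones from the quoted manual.

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

# Named groups also receive a number, counted left to right with the unnamed ones.
print(dict(pattern.groupindex))          # {'x': 2, 'y': 4}

# Numbered backreferences therefore reach the named groups as well.
print(pattern.sub(r"\1\2\3\4", "abcd"))  # abcd
```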

        Best Regards,

        guy038

        • Alan Kilborn

          @guy038

          Maybe getting really off-topic now, but the “DEFINE” stuff got me thinking about a similar “problem” I have. I say “problem” because it is nothing I can’t work around, but I’m wondering if there is a better solution.

          Consider:

          search: (?-i)(Xxx)|(XXX)|(Yyy)
          replace: (?1Zzz)(?2ZZZ)(?3Www)

          This would convert this text: The quick Xxx Yyy jumped over the lazy XXX into The quick Zzz Www jumped over the lazy ZZZ

          So please don’t consider the wrong problem. What I have is a simplified example of something more complicated, and the above is just for illustration.

          What I’d like to do is to NOT have to specify the capitalized version of ZZZ in the replace, but rather use the Zzz text without respecifying it (important!) in combination with a \U option.

          So in pseudo-regex, because I know this won’t work, without even trying it:

          replace: (?1Zzz)(?2\U${1}\E)(?3Www)

          So I was just wondering if you had any thoughts on this. TIA. :-)
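
To make the intent concrete, here is a minimal sketch of the wished-for behavior written as a script rather than as a Notepad++ replace expression. It assumes Python's standard `re` module and a replacement callback; the names `ZZZ`, `pick_replacement` and `convert` are hypothetical. The point is only that the text Zzz is specified once and its upper-case form is derived from it.

```python
import re

SEARCH = re.compile(r"(Xxx)|(XXX)|(Yyy)")   # case-sensitive, like (?-i)

ZZZ = "Zzz"   # specified exactly once


def pick_replacement(match):
    """Choose the replacement from whichever alternative matched."""
    if match.group(1) is not None:      # Xxx
        return ZZZ
    if match.group(2) is not None:      # XXX
        return ZZZ.upper()              # plays the role of \U...\E
    return "Www"                        # Yyy


def convert(text):
    return SEARCH.sub(pick_replacement, text)


print(convert("The quick Xxx Yyy jumped over the lazy XXX"))
# The quick Zzz Www jumped over the lazy ZZZ
```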

          • guy038

            Hi, @alan-kilborn,

            Your replacement cannot work because, when the search regex matches the string XXX, group 2 is the only group defined: the alternatives are mutually exclusive, so group 1 did not take part in that match :-((

            In addition, it seems that you’re not interested in group 1 itself, but only in the replacement string associated with this group, so you would like something like (?2\UREPLACEMENT of (\1)\E) !!


            Let’s imagine the text sample, below, which is used in all subsequent tests :

            Xxx
            XXX
            XXX---Xxx
            

            Then with the regex S/R :

            SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | (\2---\1)
            Groups :            1         2        3
            
            REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
            

            We get :

            Group 1 >Xxx<
            Group 2 ><
            Group 3 ><
            
            
            Group 1 ><
            Group 2 >XXX<
            Group 3 ><
            
            XXX---Xxx
            

            As explained above, the search regex does match the Xxx and XXX strings but fails to find XXX---Xxx because, when the 3rd alternative is tried, the groups \1 and \2 are not defined.
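
The same effect can be reproduced outside Notepad++. The sketch below assumes Python's standard `re` module, where a backreference to a group that did not take part in the match also fails; it illustrates the principle and is not a Boost test.

```python
import re

SAMPLE = "Xxx\nXXX\nXXX---Xxx"
PATTERN = re.compile(r"^(Xxx)$|^(XXX)$|(\2---\1)", re.MULTILINE)

# Only the first two lines match: in the 3rd alternative, \2 and \1 refer to
# groups that are unset in that match attempt, so the alternative cannot succeed.
matches = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
print(matches)          # ['Xxx', 'XXX']

first = PATTERN.search(SAMPLE)
print(first.groups())   # ('Xxx', None, None)
```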


            OK, let’s try another syntax, using sub-routine calls (?#), where # stands for a group number :

            SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2)---(?1))
            Groups :            1         2        3
            
            REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
            

            Text turns into :

            Group 1 >Xxx<
            Group 2 ><
            Group 3 ><
            
            
            Group 1 ><
            Group 2 >XXX<
            Group 3 ><
            
            
            Group 1 ><
            Group 2 ><
            Group 3 >XXX---Xxx<
            

            This time, the result is better: when matching the string XXX---Xxx with the alternative ((?2)---(?1)), the sub-routine calls re-run the patterns of groups 2 and 1, which are defined outside the alternative matched, as with the (?(DEFINE)…) syntax!

            However, we don’t get the groups 1 and 2 individually.


            Let’s use yet another syntax, where each sub-routine call (?#) is itself embedded in parentheses, so ((?#)) :

            SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
            Groups :            1         2        3        4 
            
            REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\nGroup 4 >\4<\r\n
            

            Just note that the 3rd alternative is not, itself, embedded between parentheses. After execution, we’re left with :

            Group 1 >Xxx<
            Group 2 ><
            Group 3 ><
            Group 4 ><
            
            
            Group 1 ><
            Group 2 >XXX<
            Group 3 ><
            Group 4 ><
            
            
            Group 1 ><
            Group 2 ><
            Group 3 >XXX<
            Group 4 >Xxx<
            

            Ah! Now, when the regex engine tries the 3rd alternative, it does match the string XXX---Xxx and, in the replacement, we note that groups 3 and 4 (which hold the same text as groups 2 and 1, respectively, although those two are not part of the present match) are both defined :-))

            So, using a more natural example, below :

            SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
            Groups :            1         2        3        4 
            
            REPLACE   (?1ABC)(?2DEF)(?3Group 1 = \4 and Group 2 = \3)
            

            The sample text :

            Xxx
            XXX
            XXX---Xxx
            

            is changed into :

            ABC
            DEF
            Group 1 = Xxx and Group 2 = XXX
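
For comparison, the same result can be sketched in Python. Note the swapped-in technique: Python's standard `re` module has no (?1)/(?2) sub-routine calls and no conditional replacement, so this sketch splices the group bodies in again as strings and uses a callback for the replacement. The names `G1`, `G2` and `replacement` are hypothetical.

```python
import re

G1 = "Xxx"   # body of group 1
G2 = "XXX"   # body of group 2

# Each re-used body sits in its own capturing group (groups 3 and 4),
# which stands in for ((?2)) and ((?1)).
PATTERN = re.compile(
    "^(" + G1 + ")$|^(" + G2 + ")$|(" + G2 + ")---(" + G1 + ")",
    re.MULTILINE,
)


def replacement(m):
    # Mirrors (?1ABC)(?2DEF)(?3Group 1 = \4 and Group 2 = \3)
    if m.group(1) is not None:
        return "ABC"
    if m.group(2) is not None:
        return "DEF"
    return "Group 1 = " + m.group(4) + " and Group 2 = " + m.group(3)


result = PATTERN.sub(replacement, "Xxx\nXXX\nXXX---Xxx")
print(result)
```

Run on the sample text, this prints the same three lines as shown above.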
            

            However, there’s still a problem: in your example, you would like to refer to the replacement part of a group which does not participate in the overall match, anyway! More complicated…

            We must find a way :

            • To match and capture the string XXX

            • To capture the string ZZZ, in the same alternative, although the string ZZZ would not be part of the overall match

            Still searching !

            Best Regards,

            guy038

            • cracksoft @guy038

              @guy038 I was working on a paper when I noticed I had to replace everything after a (space),
              for example: 2020-04-10 21,25,25

              I found the PDF, but I’m just dumb.
              How do I remove, with a regular expression, the 21,25,25 part,
              so everything after the year-month-day?
              And sorry if I broke a few rules. Editor: Notepad++

                • cracksoft @cracksoft

                  @cracksoft Edit: I may be on the right track. I just found, in the PDF you provided in this post, that space = \s, if I’m not wrong?

                  • guy038

                    Hello, @cracksoft, and All,

                    If I fully understood your needs, you would like to delete the part after a date, which, I suppose, is the time part?

                    If so :

                    • SEARCH (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}

                    • REPLACE Leave EMPTY

                    • Select the Regular expression search mode

                    Best Regards,

                    guy038

                    P.S. :

                    For regex documentation, follow this link :

                    https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

                    • cracksoft @guy038

                      @guy038 said in Delete number strings in the middle of lines of data:

                      (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}

                      So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
                      Still, thanks, it works! You saved me a few hours of selecting and deleting ^^

                      • guy038

                        Hi, @cracksoft, and All,

                        You said :

                        So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?

                        I don’t know what you mean, exactly!?

                        The regex (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} matches the blank characters and the next 8 characters, when they are preceded by a date in the YYYY-MM-DD format. With an empty replacement, that text is deleted. No more, no less :-)
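
For anyone who wants to try the same idea in a script, here is a minimal Python sketch. It is a translation under assumptions, not the Notepad++ syntax: Python's standard `re` module has no `\h`, so `[ \t]` is used instead, and `(?-s)` is dropped because `.` already stops at line breaks unless `re.DOTALL` is set. The helper name `strip_time` is hypothetical.

```python
import re

# Blanks plus the next 8 characters, only when preceded by a YYYY-MM-DD date.
TIME_AFTER_DATE = re.compile(r"(?<=\d{4}-\d\d-\d\d)[ \t]+.{8}")


def strip_time(text):
    """Remove the 8-character field that follows a date (hypothetical helper)."""
    return TIME_AFTER_DATE.sub("", text)


print(strip_time("2020-04-10 21,25,25"))   # 2020-04-10
```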

                        BR

                        guy038
