Erase content from duplicate lines, but keep the first unchanged
-
Hello. So, what I want to do is to turn this:
31 31 31 31 32 32 32 33 33 33 33 34 35 35 35
into this:
31 32 33 34 35
So I want to erase the content of every duplicate line after the first, while keeping the (now empty) lines themselves. Only the first line of each group keeps its value; the content of all the others is erased.
Thanks in advance!
-
@Luís-Gonçalves
So it turns out that it’s actually much easier to erase all but the last occurrence of each number. Hopefully that is sufficient for your needs.
- Select the Mark tab on the find/replace form.
- Enter the regex (^\d+$)(?=.+?^\1$) into the Find what: box.
- How this regex works:
  - Find a line containing only digits ((^\d+$)).
  - This line is matched only if at least one line containing exactly the same digits appears later in the document ((?=.+?^\1$)).
- Make sure that Bookmark line, Purge for each search, Regular expression, and . matches newline are all checked.
- Hit the Mark all button. Now every line that has an identical line after it will be marked.
- Copy a single tab or space character to the clipboard.
- Select Search->Bookmark->Paste to (Replace) bookmarked lines from the main menu.
- If you want to clear all whitespace from the emptied lines, just use Edit->Blank Operations->Trim Trailing Space or use the find/replace form to replace ^[ \t]+$ with nothing.
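For anyone who wants to check the regex outside Notepad++, here is a minimal Python sketch of what the mark, paste, and trim steps amount to. It is only an approximation: it assumes LF line endings, uses Python's re module rather than Notepad++'s regex engine, and the function name blank_all_but_last is made up for illustration.

```python
import re

# Same pattern as in the Find what box. re.MULTILINE makes ^ and $ work per
# line; re.DOTALL plays the role of the ". matches newline" checkbox.
PATTERN = re.compile(r"(^\d+$)(?=.+?^\1$)", re.MULTILINE | re.DOTALL)

def blank_all_but_last(text: str) -> str:
    """Empty every digits-only line that has an identical line later on."""
    return PATTERN.sub("", text)

sample = "31\n31\n32\n31\n32\n"
print(repr(blank_all_but_last(sample)))
```

The sketch empties the matched lines directly, whereas the Notepad++ route pastes a space over them and trims it afterwards; the end result should be the same.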
I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute (it takes time proportional to N^2 log(N), where N is the max number of repeats) and requires you to hit the replace button several times.
To do that:
- Use ^(\d+)$(.+?)^\1$ in the Find what box and \1\2 in the Replace with box within the Replace tab of the find/replace form. Make sure Regular expression is checked, and leave . matches newline checked so that .+? can span multiple lines.
- As noted above, you will have to keep hitting the replace button until the little indicator at the bottom says 0 things were replaced.
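The "keep replacing until nothing changes" loop can be sketched in Python as follows. This is an illustration under assumptions, not the Notepad++ behaviour itself: it assumes LF line endings, uses Python's re module, and the function name blank_all_but_first is hypothetical.

```python
import re

# Same pattern and replacement as in the Replace tab. re.DOTALL lets .+? span
# the lines between an earlier copy and a later one.
PATTERN = re.compile(r"^(\d+)$(.+?)^\1$", re.MULTILINE | re.DOTALL)

def blank_all_but_first(text: str) -> str:
    """Repeat the replacement until a pass changes nothing."""
    while True:
        text, count = PATTERN.subn(r"\1\2", text)
        if count == 0:
            return text

sample = "31\n31\n31\n32\n31\n"
print(repr(blank_all_but_first(sample)))
```

Each pass can empty only one later copy per match, which is why several passes are needed when a number repeats many times.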
-
@Mark-Olson
If you want to find and replace identical lines (not just numbers like in this toy example), just replace ^\d+$ with ^regex-that-matches-an-entire-line$ wherever you saw me write ^\d+.
For example:
- ^[abc]{3,5}$ would match a line containing any combination of the letters a, b, and c with total length 3 to 5.
- ^[^\r\n]*$ would match any line (even an empty line).
-
@Mark-Olson said:
I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute
I’m glad you provided this, even if it is slower, because if you had provided only the first part of your solution, you wouldn’t have solved the problem: it doesn’t give the OP what they wanted.
I presume they have good reason for wanting the replace output the way they specified!
-
@Mark-Olson’s second method could get tedious if there are 50 duplicate lines in a row instead of just 3-5 in a row.
I’d do it in multiple steps:
- FIND WHAT: (?-s)(^\d+$)(\R\1)*
  REPLACE WITH: ☺$0
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - i.e., look for a line (in this case, all digits) that has 0 or more copies immediately following, and prefix it with a smiley
- FIND WHAT: (?-s)(^\d+$)
  REPLACE WITH: <nothing/empty field>
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - any line that didn’t get transformed, but matches the “all digits” requirement, must’ve been a duplicate, so it should be cleared
- FIND WHAT: ^☺(?=\d+$)
  REPLACE WITH: <nothing/empty field>
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - any line that did get transformed should have the smiley removed
(Like Mark’s attempts, mine assumes the lines you want to transform are just one or more digits each, with no spaces or non-digit characters either before or after.)
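The three passes can be tried out with a short Python sketch. It is a rough translation, not the Notepad++ search itself: \n stands in for \R (LF line endings assumed), (?-s) is dropped because Python's dot does not match newlines by default, and the function name blank_consecutive_duplicates is made up. One small deviation, flagged in the code: a trailing $ is added inside the repeated group so that a line such as 312 directly after 31 is not partially consumed.

```python
import re

MARKER = "☺"  # any character that never occurs in the data would do

def blank_consecutive_duplicates(text: str) -> str:
    # Step 1: prefix the first line of each run of identical digit lines.
    # Deviation from the posted regex: the extra $ keeps "312" after "31"
    # from being treated as a copy of "31".
    text = re.sub(r"(^\d+$)(\n\1$)*", MARKER + r"\g<0>", text, flags=re.MULTILINE)
    # Step 2: any digits-only line still unmarked is a duplicate; clear it.
    text = re.sub(r"^\d+$", "", text, flags=re.MULTILINE)
    # Step 3: strip the marker from the lines that were kept.
    return re.sub(r"^" + MARKER + r"(?=\d+$)", "", text, flags=re.MULTILINE)

sample = "31\n" * 4 + "32\n" * 3 + "33\n" * 4 + "34\n" + "35\n" * 3
print(repr(blank_consecutive_duplicates(sample)))
```

Because every run is handled in a single pass, the number of steps stays at three no matter how many copies appear in a row.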
-
@PeterJones
This approach is much better than mine in the case where all the duplicate lines are consecutive (that is, there are no numbers other than 11 between the first occurrence of 11 and the last occurrence of 11).
While my approach is far worse for this specific use case, it does not have this limitation.
-
You are right. When I looked at the OP data, it only had consecutive duplicates. If it has to handle duplicates with other lines in between, then mine is not sufficient. The OP doesn’t state whether or not all the duplicates are consecutive, so we’re both working from a reasonable but different assumption/interpretation of the example data.
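To make the two readings concrete, here is a small Python sketch that works on a list of lines instead of using regexes. Both function names are hypothetical, and neither is what any of the Notepad++ recipes literally does; they only show how the results differ once another value sits between two copies.

```python
from itertools import groupby

def blank_consecutive(lines):
    """Keep the first line of each consecutive run, empty the rest."""
    out = []
    for _, run in groupby(lines):
        run = list(run)
        out.append(run[0])
        out.extend([""] * (len(run) - 1))
    return out

def blank_anywhere(lines):
    """Keep the first occurrence in the whole list, empty every later copy."""
    seen = set()
    out = []
    for line in lines:
        out.append("" if line in seen else line)
        seen.add(line)
    return out

lines = ["11", "11", "12", "11"]
print(blank_consecutive(lines))
print(blank_anywhere(lines))
```

On the OP's sample, where every duplicate is consecutive, both functions give the same result; they differ only on input like the one above.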
-
@Mark-Olson your solution worked perfectly, and it did exactly what I wanted. Thank you very much! =)
Thanks to all the other people who gave their help as well.
You’re the best!
-
Hello, @luís-gonçalves, @mark-olson, @alan-kilborn, @peterjones and All,
Here is a quick way to mark all consecutive equal lines but the first!
- First, add a final line-break at the end of your list of numbers! (IMPORTANT)
- MARK (?x) ^ ( \d+ \R ) \K ( \1 )+
- Bookmark line, Purge for each search and Regular expression checked
Then, you can follow @mark-olson’s instructions! So:
- Put a single space char in the clipboard with Ctrl + C
- Run the Search > Bookmark > Paste to (Replace) Bookmarked Lines option
- Finally, run the simple S/R:
  - SEARCH ^\x20$
  - REPLACE Leave EMPTY
- Or use the Edit > Blank Operations > Trim Trailing Space option
Best Regards
guy038
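As a rough cross-check of the marking regex, here is a Python sketch of the same idea. Python's re module supports neither \K nor \R, so this is a swapped-in equivalent rather than the regex above: the first line is captured and written back instead of being skipped, and \n stands in for \R (LF line endings assumed). The function name blank_repeats is made up for illustration.

```python
import re

# Group 1 is the first line including its line break; group 2 is the run of
# exact copies that follows. Including the line break in group 1 is what
# makes the final line-break necessary for the last line to match.
PATTERN = re.compile(r"^(\d+\n)((?:\1)+)", re.MULTILINE)

def blank_repeats(text: str) -> str:
    if not text.endswith("\n"):
        text += "\n"  # the final line-break that the post marks as IMPORTANT

    def keep_first(match):
        repeats = match.group(2).count("\n")  # number of duplicate lines
        return match.group(1) + "\n" * repeats

    return PATTERN.sub(keep_first, text)

sample = "31\n31\n31\n32\n"
print(repr(blank_repeats(sample)))
```

Like the regex it imitates, this only handles copies that follow each other directly.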
-