Erase content from duplicate lines, but keep the first unchanged

Luís Gonçalves

Hello. So, what I want to do is to turn this:

into this:

So I want to eliminate the content from all duplicate lines (while keeping them) after the first line. Only the first line keeps its value: the content of all the others is erased.

Thanks in advance!

Mark Olson

@Luís-Gonçalves
So it turns out that it’s actually much easier to remove all but the last occurrence of each number you found. Hopefully that is sufficient for your needs.

Select the Mark tab on the find/replace form.
Enter the regex (^\d+$)(?=.+?^\1$) into the Find what: box.
- How this regex works:
- Find a line containing only digits ((^\d+$)).
- This line is matched only if at least one line containing exactly the same digits appears later in the document ((?=.+?^\1$)).
Make sure that Bookmark line, Purge for each search, Regular expression, and . matches newline are all checked.
Hit the Mark all button. Now every line that has an identical line after it will be marked, so only the last occurrence of each number stays unmarked.
Copy a single tab or space character to the clipboard.
Select Search->Bookmark->Paste to (Replace) bookmarked lines from the main menu.
If you want to clear all space from the empty lines, just use Edit->Blank Operations->Trim Trailing Space or use the find/replace form to replace ^[ \t]+$ with nothing.
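
For anyone who wants to check the behaviour outside Notepad++, here is a minimal Python sketch of the same lookahead idea. It is only an approximation of the steps above: the function name and sample data are made up for illustration, and it empties the matched lines directly instead of bookmarking them and pasting a space.

```python
import re

# Same pattern as in the Find what box. re.MULTILINE makes ^ and $ work per
# line; re.DOTALL plays the role of the ". matches newline" checkbox.
ALL_BUT_LAST = re.compile(r"(^\d+$)(?=.+?^\1$)", re.MULTILINE | re.DOTALL)


def blank_all_but_last(text: str) -> str:
    """Empty every digits-only line that has an identical line later on."""
    return ALL_BUT_LAST.sub("", text)


# Hypothetical sample data, not the OP's file.
sample = "11\n11\n22\n11\n33\n22"
print(blank_all_but_last(sample).split("\n"))
# → ['', '', '', '11', '33', '22']
```

Note that the last occurrence of each number survives here, not the first.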

I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute (takes time proportional to N^2 log(N), where N is the max number of repeats) and requires you to hit the replace button several times.
To do that:

Use ^(\d+)$(.+?)^\1$ in the Find what box and \1\2 in the Replace with box on the Replace tab of the find/replace form. Make sure Regular expression and . matches newline are both checked, since (.+?) has to span the lines between the two occurrences.
As noted above, you will have to keep hitting the replace button until the little indicator at the bottom says 0 things were replaced.
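
The "keep replacing until nothing changes" loop can be sketched in Python as follows. This is just an illustration under the assumption that Python's re engine treats this pattern the same way Notepad++ does; the function name is invented.

```python
import re

# First occurrence, then anything (lazily), then an identical line.
FIRST_THEN_REPEAT = re.compile(r"^(\d+)$(.+?)^\1$", re.MULTILINE | re.DOTALL)


def blank_all_but_first(text: str) -> str:
    """Repeat the replacement until a pass reports 0 replacements."""
    while True:
        # \1\2 keeps the first occurrence and the text in between,
        # and drops the digits of the repeated line.
        text, replaced = FIRST_THEN_REPEAT.subn(r"\1\2", text)
        if replaced == 0:
            return text


print(blank_all_but_first("11\n11\n22\n11\n33\n22").split("\n"))
# → ['11', '', '22', '', '33', '']
```

Each pass removes at most one repeat per number, which is why several passes are needed when a number occurs many times.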

Mark Olson

@Mark-Olson
If you want to find and replace identical lines (not just numbers like in this toy example), just replace ^\d+$ with ^regex-that-matches-an-entire-line$ wherever you saw me write ^\d+$.
For example:

^[abc]{3,5}$ would match a line containing any combination of the letters a, b, and c with total length 3 to 5.
^[^\r\n]*$ would match any line (even an empty line)
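
As a rough illustration of swapping in a different line pattern, here is a Python sketch of the lookahead (all-but-last) regex with the line pattern passed in as a parameter. The function and its line_pattern parameter are hypothetical, and the pattern is given without the ^ and $ anchors because the function adds them.

```python
import re


def blank_earlier_copies(text: str, line_pattern: str = r"\d+") -> str:
    """Empty each line matching line_pattern that has an identical line later.

    line_pattern is a hypothetical parameter: a regex for one whole line,
    without ^ and $. Avoid a bare "." in it, because re.DOTALL lets "."
    cross line breaks.
    """
    regex = re.compile(
        rf"^({line_pattern})$(?=.+?^\1$)", re.MULTILINE | re.DOTALL
    )
    return regex.sub("", text)


print(blank_earlier_copies("abc\nxyz\nabc\ncab", r"[abc]{3,5}").split("\n"))
# → ['', 'xyz', 'abc', 'cab']
print(blank_earlier_copies("x y\nfoo\nx y", r"[^\r\n]*").split("\n"))
# → ['', 'foo', 'x y']
```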

Alan Kilborn

@Mark-Olson said:

I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute

I’m glad you provided this, even if it is slower, because the first part of your solution alone wouldn’t have solved the problem: it doesn’t give the OP what they wanted.

I presume they have good reason for wanting the replace output the way they specified!

PeterJones

@Mark-Olson’s second method could get tedious if there are 50 duplicate lines in a row instead of just 3-5 in a row.

I’d do it in multiple steps:

FIND WHAT: (?-s)(^\d+$)(\R\1$)*
REPLACE WITH: ☺$0
SEARCH MODE = Regular Expression
REPLACE ALL
- ie, look for a line (in this case, all digits) that has 0 or more copies immediately following, and prefix with a smiley
FIND WHAT: (?-s)(^\d+$)
REPLACE WITH: <nothing/empty field>
SEARCH MODE = Regular Expression
REPLACE ALL
- any line that didn’t get transformed, but matches the “all digits” requirement, must’ve been a duplicate, so it should be cleared
FIND WHAT: ^☺(?=\d+$)
REPLACE WITH: <nothing/empty field>
SEARCH MODE = Regular Expression
REPLACE ALL
- any line that did get transformed should have the smiley removed

(Like Mark’s attempts, mine assumes the lines you want to transform are just one or more digits each, with no spaces or non-digit characters either before or after.)
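
The three steps above can be sketched in Python like this. It is an adaptation, not a copy: Python's re module has no \R, so the sketch assumes plain \n line endings, it assumes the text does not already contain the smiley, and each repeated copy is anchored with $ so that a longer number such as 112 is not mistaken for a copy of 11. The function name is invented.

```python
import re

MARK = "\u263a"  # the smiley, used as a temporary marker


def blank_consecutive_repeats(text: str) -> str:
    """Keep the first line of each run of identical digit lines, empty the rest."""
    # Step 1: prefix the first line of every run (a run may be a single line).
    text = re.sub(
        r"^(\d+)$(?:\n\1$)*", MARK + r"\g<0>", text, flags=re.MULTILINE
    )
    # Step 2: any digit line still unmarked is a repeat, so empty it.
    text = re.sub(r"^\d+$", "", text, flags=re.MULTILINE)
    # Step 3: remove the marker again.
    return re.sub(rf"^{MARK}(?=\d+$)", "", text, flags=re.MULTILINE)


print(blank_consecutive_repeats("11\n11\n11\n22\n22\n33").split("\n"))
# → ['11', '', '', '22', '', '33']
```

A repeat that is separated from its first occurrence by other lines is left alone, for example "11\n22\n11" comes back unchanged.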

Mark Olson

@PeterJones
This approach is much better than mine in the case where all the duplicate lines are consecutive (that is, there are no numbers other than 11 between the first occurrence of 11 and the last occurrence of 11).
While my approach is far worse for this specific use case, it does not have this limitation.

PeterJones

@Mark-Olson ,

You are right. When I looked at the OP data, it only had consecutive duplicates. If it has to handle duplicates with other lines in between, then mine is not sufficient. The OP doesn’t state whether or not all the duplicates are consecutive, so we’re both working from a reasonable but different assumption/interpretation of the example data.

Luís Gonçalves

@Mark-Olson your solution worked perfectly, and it did exactly what I wanted. Thank you very much! =)

Thanks to all the other people who gave their help as well.
You’re the best!

guy038

Hello, @luís-gonçalves, @mark-olson, @alan-kilborn, @peterjones and All,

Here is a quick way to mark all consecutive equal lines but the first !

First, add a final line-break at the end of your list of numbers ! ( IMPORTANT )
MARK (?x) ^ ( \d+ \R ) \K ( \1 )+
- Bookmark line, Purge for each search and Regular expression checked

Then, you can follow the @mark-olson’s instructions ! So :

Put a single space char in the clipboard with Ctrl + C
Run the Search > Bookmark > Paste to (Replace) Bookmarked Lines option
Finally, run the simple S/R :
- SEARCH ^\x20$
- REPLACE Leave EMPTY

Or use the Edit > Blank Operations > Trim Trailing Space option
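
A possible Python counterpart of this approach is sketched below. Python's re module supports neither \K nor \R, so this is a substitute technique: the first line is captured and written back, the line ending is assumed to be \n, and the function name is made up. It also shows why the final line-break matters, since each copy is matched together with its line break.

```python
import re

# Group 1: a digit line including its line break.
# Group 2: one or more immediately following copies of that line.
RUN = re.compile(r"^(\d+\n)((?:\1)+)", re.MULTILINE)


def blank_run_repeats(text: str) -> str:
    """Keep the first line of each run, turn the copies into empty lines."""
    def keep_first(m):
        copies = len(m.group(2)) // len(m.group(1))
        return m.group(1) + "\n" * copies

    return RUN.sub(keep_first, text)


print(repr(blank_run_repeats("11\n11\n11\n22\n33\n33\n")))
# → '11\n\n\n22\n33\n\n'
print(repr(blank_run_repeats("11\n11")))  # no final line break: no match
# → '11\n11'
```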

Best Regards

guy038