About single and duplicate lines...
Hello, All,
Reading that post made me realize that searching for single or duplicate lines is a very common task. Some time ago, for my personal workflow, I wrote a method that handles the main cases. So, in this post, I’m going to show you how to keep, from an original file :
- All single lines, ONLY
- All duplicate lines, ONLY
- All single lines and the first copy of all duplicate lines
- All single lines and the last copy of all duplicate lines
- The first copy of all duplicate lines, ONLY
- The last copy of all duplicate lines, ONLY
I’ll use a file named Test_File.txt, which contains both single lines and duplicate lines that appear 2, 3, 4 or more times. It contains 48 color palettes, found on various sites and added one after another, giving a total of 78,117 records, of which 39,532 are single lines and 38,585 are duplicate lines. On the other hand, if we count one copy of each duplicate, this file contains 11,290 different duplicate lines.

To test my solutions, simply download this UTF-8 file ( 5,937,560 bytes ) from my Google Drive account : https://drive.google.com/file/d/1aYOpKon4KYw_NXSdj4Tm4Ti_FrygC2ky/view?usp=sharing
Remarks :

Note the definition of single lines : these are lines that differ in characters and/or case from all the other lines of the current file. For example, in this small file of 14 lines, below :

    ABC
    xyz
    123
    789
    HIJ
    HIJ
    123
    AbC
    123
    HIJ
    abc
    HIJ
    456
    xyz

- The 5 lines ABC, AbC, abc, 789 and 456 are considered to be single lines, as they differ in characters and/or case from all the other lines
- The three 123 lines are considered to be a duplicate line with 3 copies ( multiple occurrences )
- The two xyz lines are considered to be a duplicate line with 2 copies ( multiple occurrences )
- The four HIJ lines are considered to be a duplicate line with 4 copies ( multiple occurrences )
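
To make this definition concrete, here is a minimal Python sketch (not part of the Notepad++ method itself) that classifies the 14 sample lines with an exact, case-sensitive count. The variable names are only illustrative.

```python
from collections import Counter

# The 14-line sample from the remarks above
lines = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()

# Exact comparison: a difference in characters OR in case makes a new line
counts = Counter(lines)

# Single lines occur exactly once; duplicate lines occur two or more times
singles = [ln for ln in lines if counts[ln] == 1]
duplicates = {ln: n for ln, n in counts.items() if n > 1}

print(singles)     # ['ABC', '789', 'AbC', 'abc', '456']
print(duplicates)  # {'xyz': 2, '123': 3, 'HIJ': 4}
```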
IMPORTANT :

I’ve done some of the work for you, by adding a final column that numbers all lines in this file. Thus, it will be easy to restore the original order of the remaining records after each processing step is complete. So, in case you need this initial order :

- Put the caret right before the number 00001, at the end of the first line
- Run the Edit > Begin/End Select in Column Mode option ( or use the Alt + Shift + B shortcut )
- Move to the last line of the file
- Put the caret right before the number 78117
- Run again the Edit > Begin/End Select in Column Mode option ( or use the Alt + Shift + B shortcut )

=> A ZERO-WIDTH column mode selection should appear throughout all the lines

- Then, run the Edit > Line Operations > Sort Lines Lexicographically Ascending option

=> The original order of the remaining records should be back !
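
The same idea, sorting on the numbering column only, can be sketched in Python. This is an illustration under one assumption : each record ends with a zero-padded five-digit number, as in 00001 to 78117. The function name restore_order and the sample records are hypothetical.

```python
def restore_order(lines):
    """Sort records by the zero-padded number that ends each line."""
    # Assumption: the last 5 characters of every line are the line number
    return sorted(lines, key=lambda ln: int(ln[-5:]))

# Hypothetical records left over after a processing step, in sorted-text order
remaining = ["789  00004", "ABC  00001", "AbC  00008"]

print(restore_order(remaining))
# ['ABC  00001', '789  00004', 'AbC  00008']
```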
In each procedure, below, 1 or 2 S/R ( search/replace operations ) are used. To process them :

- First, cancel any existing selection to ensure that every line-end character will be taken into account during the S/R phase
- Open the Replace dialog ( Ctrl + H )
- Uncheck all box options
- Check the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
(1) To keep all the SINGLE lines ONLY ( 39,532 records ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )+
- REPLACE Leave EMPTY
- Perform the IMPORTANT section, above
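
For readers who want to try the logic of this regex outside Notepad++, here is a hedged Python sketch on the 14-line sample. It is an adaptation, not the regex above verbatim : Python's re module has no \R, so \n stands in for it, and I assume the 7-character numbering column is two spaces followed by a five-digit number.

```python
import re

# Assumed sample: the 14 lines from the remarks, each followed by a
# 7-character numbering column (two spaces + five digits), then sorted
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()
numbered = sorted(f"{ln}  {i:05d}" for i, ln in enumerate(raw, 1))
text = "\n".join(numbered) + "\n"

# A line, minus its numbering, repeated on one or more following lines
runs = re.compile(r"^(.+).{7}\n(?:\1.{7}\n)+", re.MULTILINE)

# Deleting every run of duplicates leaves the single lines only
singles = runs.sub("", text)
print(singles, end="")
```

On this sample, the five records ABC, AbC, abc, 789 and 456 remain, each with its original number.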
(2) To keep all the DUPLICATE lines ONLY ( 38,585 records = 78,117 - 39,532 ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )+ (*SKIP) (*F) | ^ .+ \R
- REPLACE Leave EMPTY
- Perform the IMPORTANT section, above
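
The (*SKIP) (*F) part means : when a run of duplicates is matched, skip over it untouched, so that only the remaining single lines are matched and deleted. Python's re module does not support these verbs, so the sketch below emulates them with a replacement callback. The sample data and the 7-character column layout ( two spaces + five digits ) are assumptions.

```python
import re

# Assumed sample: 14 lines + 7-character numbering column, then sorted
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()
numbered = sorted(f"{ln}  {i:05d}" for i, ln in enumerate(raw, 1))
text = "\n".join(numbered) + "\n"

# First alternative: a run of duplicates. Second alternative: any other line.
runs_or_line = re.compile(r"^(.+).{7}\n(?:\1.{7}\n)+|^.+\n", re.MULTILINE)

def keep_runs(match):
    # Group 1 is set only when the first alternative (a run) matched
    return match.group(0) if match.group(1) is not None else ""

duplicates = runs_or_line.sub(keep_runs, text)
print(duplicates, end="")
```

On this sample, 9 records remain : the three 123 lines, the four HIJ lines and the two xyz lines.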
(3) To keep all the SINGLE lines and the FIRST copy of all the DUPLICATE lines, found AFTER the sort ( 50,822 records ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( ( .+ ) .{7} \R ) (?: \2 .{7} \R )+
- REPLACE \1
- Perform the IMPORTANT section, above
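
As a rough Python illustration of this replacement : group 1 holds the whole first record of a run, and the run is replaced by that group. Again, \n replaces \R, and the sample data with its two-spaces-plus-five-digits column is an assumption.

```python
import re

# Assumed sample: 14 lines + 7-character numbering column, then sorted
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()
numbered = sorted(f"{ln}  {i:05d}" for i, ln in enumerate(raw, 1))
text = "\n".join(numbered) + "\n"

# Group 1 = first record of the run, group 2 = its text without the numbering
first_of_run = re.compile(r"^((.+).{7}\n)(?:\2.{7}\n)+", re.MULTILINE)

# Each run of duplicates collapses to its first record; singles are untouched
result = first_of_run.sub(r"\1", text)
print(result, end="")
```

On this sample, 8 records remain : the 5 single lines plus 123 ( 00003 ), HIJ ( 00005 ) and xyz ( 00002 ).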
(4) To keep all the SINGLE lines and the LAST copy of all the DUPLICATE lines, found AFTER the sort ( 50,822 records ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )* ( \1 .{7} \R )
- REPLACE \2
- Perform the IMPORTANT section, above
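
Here the greedy (?: ... )* part swallows as many copies as it can, so group 2 ends up holding the last record of the run. A minimal Python sketch of that behavior, under the same assumptions as before ( \n instead of \R, numbering column of two spaces + five digits ) :

```python
import re

# Assumed sample: 14 lines + 7-character numbering column, then sorted
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()
numbered = sorted(f"{ln}  {i:05d}" for i, ln in enumerate(raw, 1))
text = "\n".join(numbered) + "\n"

# Group 1 = text of the first copy, group 2 = the last record of the run
last_of_run = re.compile(r"^(.+).{7}\n(?:\1.{7}\n)*(\1.{7}\n)", re.MULTILINE)

# Each run of duplicates collapses to its last record; singles are untouched
result = last_of_run.sub(r"\2", text)
print(result, end="")
```

On this sample, the duplicates that survive are 123 ( 00009 ), HIJ ( 00012 ) and xyz ( 00014 ).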
(5) To keep the FIRST copy of all the DUPLICATE lines ONLY, found AFTER the sort ( 11,290 records = 50,822 - 39,532 ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )+ (*SKIP) (*F) | ^ .+ \R
- REPLACE Leave EMPTY

Then :

- SEARCH (?x-is) ^ ( ( .+ ) .{7} \R ) (?: \2 .{7} \R )+
- REPLACE \1
- Perform the IMPORTANT section, above
(6) To keep the LAST copy of all the DUPLICATE lines ONLY, found AFTER the sort ( 11,290 records = 50,822 - 39,532 ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab and select all text ( Ctrl + A )
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Click anywhere, in the new tab, to cancel the entire selection
- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )+ (*SKIP) (*F) | ^ .+ \R
- REPLACE Leave EMPTY

Then :

- SEARCH (?x-is) ^ ( .+ ) .{7} \R (?: \1 .{7} \R )* ( \1 .{7} \R )
- REPLACE \2
- Perform the IMPORTANT section, above
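
Procedures (5) and (6) both chain two S/R : first drop the single lines, then collapse each run of duplicates. The two-pass idea can be sketched in Python as follows. This is an adaptation under assumptions : \n replaces \R, a callback emulates (*SKIP) (*F), and the numbering column is two spaces + five digits.

```python
import re

# Assumed sample: 14 lines + 7-character numbering column, then sorted
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()
numbered = sorted(f"{ln}  {i:05d}" for i, ln in enumerate(raw, 1))
text = "\n".join(numbered) + "\n"

runs_or_line = re.compile(r"^(.+).{7}\n(?:\1.{7}\n)+|^.+\n", re.MULTILINE)
first_of_run = re.compile(r"^((.+).{7}\n)(?:\2.{7}\n)+", re.MULTILINE)
last_of_run = re.compile(r"^(.+).{7}\n(?:\1.{7}\n)*(\1.{7}\n)", re.MULTILINE)

# Pass 1: keep the runs of duplicates, delete every single line
dups_only = runs_or_line.sub(
    lambda m: m.group(0) if m.group(1) is not None else "", text
)

# Pass 2: collapse each run to its first copy (5) or to its last copy (6)
first_copies = first_of_run.sub(r"\1", dups_only)
last_copies = last_of_run.sub(r"\2", dups_only)

print(first_copies, end="")
print(last_copies, end="")
```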
At the very end of any of these choices, you may delete the extra numbering :

- SEARCH (?x-s) .{7} $
- REPLACE Leave EMPTY
- Then run the Edit > Blank Operations > Trim Trailing Space option
Best Regards,
guy038
P.S. :

Note that there is also a native way to get all the single lines and the first copy of all the duplicate lines, in their present order ( 50,822 records ) :

- Paste the Test_File.txt contents in a new tab
- Switch to that new tab
- Delete the numbering, at the end of each line :
    - SEARCH (?x-s) .{7} $
    - REPLACE Leave EMPTY
- Then, use the Edit > Line Operations > Remove Duplicate Lines option
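
For comparison, an order-preserving "keep the first copy" pass is a one-liner in Python, because dict keys keep their insertion order. This is only a sketch on the 14-line sample, not a description of how Notepad++ implements its menu option.

```python
# The 14-line sample from the remarks, without any numbering column
raw = "ABC xyz 123 789 HIJ HIJ 123 AbC 123 HIJ abc HIJ 456 xyz".split()

# dict.fromkeys keeps the first occurrence of each key, in original order
unique = list(dict.fromkeys(raw))

print(unique)
# ['ABC', 'xyz', '123', '789', 'HIJ', 'AbC', 'abc', '456']
```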