@Sarah-Duong said in Remove duplicate lines in separate files:
This is really difficult for me.
Well, you have presented quite a significant problem, mainly due to the size. The actual process (as I’ve outlined previously) is not difficult, but given the number and size of the files, the solution will take some effort on your part to complete.
So do you want a solution in the Windows environment, or were you considering the Linux solution, in which case this thread (collection of posts) can be closed?
You say the average file size is 200,000 lines. If we assume an average of 18 characters per line (you said 5 to 30 characters per line), that makes an average file size of about 3.6 MB. I haven’t personally worked on a file of this size, but I’m sure NPP can handle files much larger than that. It can depend on the environment, such as whether the NPP in use is 32-bit or 64-bit, and whether you have additional plugins loaded.
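For anyone wanting to check that estimate, here is a rough sketch of the arithmetic in Python. It assumes one byte per character (plain ASCII or a single-byte encoding), and the function name `estimated_size_mb` and the `eol_bytes` parameter are just mine for illustration. Windows line endings add 2 bytes per line, so the real figure is likely a little higher than 3.6 MB.

```python
def estimated_size_mb(line_count, avg_chars, eol_bytes=0):
    """Rough file size in MB, assuming one byte per character (an assumption)."""
    return line_count * (avg_chars + eol_bytes) / 1_000_000

# 200,000 lines at an average of 18 characters, ignoring line endings
no_eol = estimated_size_mb(200_000, 18)

# Same file with Windows CRLF line endings (2 extra bytes per line)
with_crlf = estimated_size_mb(200_000, 18, eol_bytes=2)

print(f"{no_eol:.1f} MB without line endings, {with_crlf:.1f} MB with CRLF")
```

Either way the files are only a few MB each, so the individual file size should not be the limiting factor.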
I still think it will only be possible to use NPP if the files are broken down into groups, which will mean sorting each file first and splitting it apart by the first 1 or 2 characters, then processing each group separately.
Please advise whether you still want to consider this approach. We (on the forum) can help you, but be aware it will be a lengthy job.
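To make the grouping idea concrete, here is a minimal Python sketch of just the "sort and split by the first 1 or 2 characters" step. It is not something NPP does for you and it is not the full duplicate-removal process. The function name `split_by_prefix` and the sample lines are made up for illustration, and I am assuming blank lines can be skipped.

```python
from collections import defaultdict

def split_by_prefix(lines, prefix_len=2):
    """Group lines by their first `prefix_len` characters; each group is sorted."""
    groups = defaultdict(list)
    for line in lines:
        text = line.rstrip("\r\n")  # drop the line ending, keep everything else
        if text:  # assumption: blank lines are not of interest
            groups[text[:prefix_len]].append(text)
    # Return the groups in prefix order, each with its lines sorted
    return {prefix: sorted(group) for prefix, group in sorted(groups.items())}

# Hypothetical sample data standing in for the lines of one file
sample = ["banana", "apple", "bandit", "apricot", "apple"]
groups = split_by_prefix(sample, prefix_len=2)

for prefix, members in groups.items():
    print(prefix, members)
```

Each group could then be saved to its own small file, so that lines from different source files starting with the same characters end up together and can be compared in manageable pieces.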
Terry