    Is it possible...?

    • guy038G
      guy038
      last edited by guy038

      Hello, @dipsi7772, @peterjones, @terry-r and All,

      First, Peter said :

      @guy038 is going to have to jump in to do it in one fell swoop. Barring that (he may be busy, or away, or not interested in this one)

      Indeed, I was away for a long week-end !

      Secondly, @dipsi7772 said :

      This is an original line of the file:

      /><br />password<br />target</div>

      So I assume that the different targets are the characters between the nearest > and < symbols, after the word password, with this exact case, followed by a few characters. In this case, here is a destructive method, which must be run only on a copy of your bunch of .html files

      To sum up :

      • Copy the directory, containing all your .html files, to another location

      • Start Notepad++

      • Open the Find in Files dialog ( Ctrl + Shift + F )

      • SEARCH (?s-i).*?password.+?>(.+?)(?=<)|.+

      • REPLACE \1\r\n ( or \1\n if your files use Unix EOL syntax )

      • FILTERS *.html

      • DIRECTORY The absolute location of the COPY of all your .html files ( Do NOT use your original files )

      • Tick the Wrap around button

      • Select the Regular expression search mode

      • Click on the Replace in Files button

      • Click on the Yes button of the small dialog Are you sure?

      => Each copy of an original .html file should be drastically reduced in size and simply contain a list of all the passwords, one per line, found in the original file ;-))

      Notes :

      • As usual, the (?s-i) in-line modifiers mean that :

        • The dot . character represents any single character, even EOL chars ( (?s) )

        • The search is performed in a case-sensitive way ( (?-i) )

      • Then, the part .*?password.+?> looks, from the beginning of each file, for the shortest range of any characters, till the string password, with this exact case, followed by some characters till the nearest > symbol

      • Now, the (.+?)(?=<) part stores, as group 1, all the subsequent characters till the condition contained in the look-around structure is true, i.e. till the nearest < symbol if found

      • In replacement, the “value” of each password \1 is simply rewritten, followed with new-line characters ( \r\n or \n )

      • At the end, when no further occurrence of the word password, with this exact case, can be found, the second alternative .+, after the alternation symbol |, selects all the remaining characters of the currently scanned file and replaces them with a bare line-break, because group 1 is not defined

      Remarks :

      • Notice that all quantifiers of the first alternative of the search regex are lazy quantifiers, i.e. they grab as few chars as possible, while still satisfying the subsequent parts of the overall regex

      • When an .html file does not contain the word password at all, its whole contents are just replaced with a single line-break !
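The search and replace described above can be tried outside Notepad++ as a rough sketch. The Python snippet below is only an emulation under stated assumptions: it uses Python's re module instead of the regex engine built into Notepad++, the SAMPLE text and the function name extract_targets are invented for illustration, and the (?-i) modifier is left out because Python's re is case-sensitive by default.

```python
import re

# First regex of this post; (?-i) is omitted because Python's re is
# case-sensitive unless told otherwise.
SEARCH = re.compile(r"(?s).*?password.+?>(.+?)(?=<)|.+")

# Invented sample standing in for one of the copied .html files.
SAMPLE = (
    "<div><br />password<br />hunter2</div>\n"
    "<p>other</p>\n"
    "<br />password<br />s3cret</div>\n"
    "</html>\n"
)


def extract_targets(text: str) -> str:
    # \1 is the target. When the second alternative (.+) matches, group 1 is
    # unset and only the line-break is written.
    return SEARCH.sub(r"\1\n", text)


print(repr(extract_targets(SAMPLE)))
```

The last, empty line of the result comes from the second alternative, which swallows whatever remains after the final password.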

      Best Regards,

      guy038

      • dipsi7772D
        dipsi7772
        last edited by

        Dear @Terry-R, dear @guy038, dear @PeterJones,
        thank you so much for your support. All of you seem to be very smart people; reading through your thoughts, I realize how difficult it is to get the target out of the big text.

        As I don't want to give you guys a headache, I have now checked several different files of the big bunch to find a “rule” or a repeated “sign” located around the target.

        I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.

        So in the end, the only fixed mark is the word “pass” or the word “password” before the target. Unfortunately I have to report a further issue:
        The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or blank spaces between them.
        Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.

        So from my unprofessional point of view:

        it's only possible to extract the lines which contain the strings “pass” or “password” and “</div>”.
        As I have almost 3000 html files to search, just extracting the lines would also help me a lot.

        Should I try your proposals now or have you any remarks?
        I googled but still not sure about the meaning of Unix EOL

        Search field should look like this?
        96f34881-14da-4237-8ef5-1a24708f2526-image.png

        Not sure about the search options you mentioned.
        Thanks for all guys!!!

        • andrecool-68A
          andrecool-68 @dipsi7772
          last edited by

          @dipsi7772 You have the simple Find tab open; you need one of the other tabs of the search window: Replace or Find in Files

          • guy038G
            guy038
            last edited by guy038

            Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,

            @dipsi7772, many thanks for trying to find out some general rules in order to isolate your different target strings more easily !

            You said :

            1 I found that there is always a </div> directly next to the target. But unfortunately there are also other lines with </div>, but without the target.

            2 So in the end, the only fixed mark is the word “pass” or the word “password” before the target. Unfortunately I have to report a further issue:

            3 The word “password” is not always directly before the target … sometimes there are also letters, numbers, signs or blank spaces between them.

            4 Regarding your question: The target can contain every sign, letter, special sign … everything you can imagine.


            If so, I think that the new regex S/R , below, should meet these 4 criteria !

            SEARCH (?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+

            REPLACE \2\r\n ( or \2\n if your files use Unix EOL syntax )

            • The look-around (?=</div>), as well as the (?-si:pass(word)?... syntax, should satisfy your first criterion

            • The part (?-si:pass(word)?...), which matches the word pass or password, with this exact case, satisfies your second criterion

            • Then, the part .+?>, which matches the shortest range of standard chars, till a > symbol, satisfies your third criterion

            • Finally, the part (.+?), which stores, as group 2 ( group 1 being the optional (word) ), any range of standard characters, due to the in-line modifiers (?-si:....), till the nearest string </div> excluded ( located right after all the target characters ), satisfies your fourth criterion
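As a rough check of the four points above, here is a minimal Python sketch of the same S/R. It assumes Python's re module behaves like the Notepad++ engine for this pattern, which is not guaranteed; the SAMPLE text and the helper name extract_targets are invented for illustration.

```python
import re

# Regex of this post: (word) is group 1, the target is group 2.
SEARCH = re.compile(r"(?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+")

# Invented sample: one "password" line whose target contains a "<",
# one line without any marker, and one "pass" line with extra characters.
SAMPLE = (
    "<!doctype html>\n"
    "<div><br />password<br />tar<get</div>\n"
    "<div>no marker here</div>\n"
    "<div><br />pass 123 >abc</div>\n"
    "</html>\n"
)


def extract_targets(text: str) -> list[str]:
    # Replace each match with group 2 plus a line-break, then drop the
    # empty line left behind by the second alternative.
    replaced = SEARCH.sub(r"\2\n", text)
    return [line for line in replaced.splitlines() if line]


print(extract_targets(SAMPLE))
```

Because the target is terminated by </div> rather than by the first <, a target such as tar<get is kept whole.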


            As @andrecool-68 said, use the Find in Files dialog ( Ctrl + Shift + F )

            Cheers,

            guy038

            P.S. :

            So, except for the SEARCH and REPLACE updated zones, just follow the instructions given in my first post !

            • dipsi7772D
              dipsi7772
              last edited by

              Im soooo happy =) this works … Thank you all!! This made my day :)
              You are great !! Thanks so much=)

              • guy038G
                guy038
                last edited by

                Hi, @dipsi7772, @peterjones, @terry-r, @andrecool-68 and All,

                @dipsi7772, you said :

                I googled but still not sure about the meaning of Unix EOL

                I missed that sentence. Refer to the link, below, for general information about new-line definition :

                https://en.wikipedia.org/wiki/Newline#Representation
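To decide between \r\n and \n in the REPLACE field, it can help to look at the raw bytes of one file. The Python sketch below is only an illustration: detect_eol is a hypothetical helper name, and the returned labels are descriptive strings of my own choosing. It reports the first line-ending convention it meets.

```python
def detect_eol(data: bytes) -> str:
    """Return a label for the first line-ending convention found in data."""
    for index, byte in enumerate(data):
        if byte == 0x0D:  # carriage return
            # CR immediately followed by LF is the Windows convention.
            if data[index + 1:index + 2] == b"\n":
                return "Windows (CR LF)"
            return "Macintosh (CR)"
        if byte == 0x0A:  # line feed on its own
            return "Unix (LF)"
    return "none"


print(detect_eol(b"first line\r\nsecond line\r\n"))
```

For a real file, the bytes would come from something like open(path, "rb").read(), where path is whichever copied .html file you want to inspect.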


                To be rigorous, I’ve made a slight error, in my last proposed search regex ! In order to extract target from lines of that form :

                /><br />password>target</div>  OR   /><br />pass>target</div>   ( when NO character exists between the string pass(word) and >target< )

                I should have used the following search regex, with a *, instead of a +, at the indicated place !

                SEARCH (?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+
                                                ▲
                                                │
                
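The effect of that single character can be reproduced with a small Python sketch. It assumes that Python's re module backtracks like the Notepad++ engine on this pattern; the names PLUS, STAR, LINE and targets are invented for illustration.

```python
import re

# Previous version, with .+? between pass(word)? and >
PLUS = re.compile(r"(?s).*?(?-si:pass(word)?.+?>(.+?)(?=</div>))|.+")
# Corrected version, with .*? at the indicated place
STAR = re.compile(r"(?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+")

# A line where NO character exists between "pass" and ">target<".
LINE = "/><br />pass>target</div>\n"


def targets(regex: re.Pattern, text: str) -> list[str]:
    # Same replacement as in the post, with the empty leftovers removed.
    return [line for line in regex.sub(r"\2\n", text).splitlines() if line]


print("with +:", targets(PLUS, LINE))
print("with *:", targets(STAR, LINE))
```

With the + quantifier at least one character is required before the > symbol, so on this line the first alternative fails and the whole text falls to the second alternative.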

                Best Regards,

                guy038

                • Alan KilbornA
                  Alan Kilborn @guy038
                  last edited by

                  @guy038 said in Is it possible...?:

                  https://en.wikipedia.org/wiki/Newline#Representation

                  OT to the main thread here, but it is interesting from that article that the Mac line-endings (carriage-return only) that Notepad++ still supports are for an OS whose last release was before the first release of N++ (if I have my dates straight).

                  • dipsi7772D
                    dipsi7772
                    last edited by

                    Dear guy038,

                    thanks for the clarification. Your proposal works well. Anyway, it's still a job to manually copy the expressions out of the sentences :) but it's way better than without the code.
                    The strange thing is, on some files, just <!doctype html> is the result.
                    I would say all files have the same format, so it's mysterious.
                    Anyway, I wish all community members a nice Xmas, and thanks again =)

                    • dipsi7772D
                      dipsi7772
                      last edited by

                      Is it maybe possible to implement a rule that avoids duplicate results?

                      Another thing is, even if I choose “Automatischer Zeilenumbruch” ( Word Wrap ), each line is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.

                      Best Regards Friends

                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @dipsi7772 and All,

                        You said :

                        The strange thing is, on some files, just <!doctype html> is the result.
                        I would say all files have the same format, so it's mysterious.

                        It’s not strange and it’s not related to file format at all ! The reason is that, for large files, the regex may fail to work properly and delete all characters but the first line of your HTML files. Indeed, as the regex is :

                        (?s).*?(?-si:pass(word)?.*?>(.+?)(?=</div>))|.+

                        The beginning (?s).*?(?-si:pass(word)? means that the regex engine selects all characters, even displayed on several lines, from the current position of the caret till the first word pass or password. In some files, this range of characters can be very large and this fact could explain the unexpected results !

                        If your HTML files are neither important nor confidential, simply e-mail me one of the files which produces errors. I’ll try to find another regex which works correctly in all cases ;-))
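One possible way to avoid the long multi-line scan, sketched here in Python and outside Notepad++, is to test each line on its own. This is only an assumption about what could help, not the regex promised above: LINE_REGEX and targets_per_line are invented names, and the sketch relies on the marker, the target and </div> all sitting on the same line, as in the examples of this thread.

```python
import re

# Same marker and terminator as in the thread's regex, but applied to one
# line at a time, so no range ever spans the whole file.
LINE_REGEX = re.compile(r"pass(?:word)?.*?>(.+?)(?=</div>)")


def targets_per_line(text: str) -> list[str]:
    found: list[str] = []
    for line in text.splitlines():
        # A line may hold several targets; collect each group 1 in order.
        found.extend(match.group(1) for match in LINE_REGEX.finditer(line))
    return found


# Invented sample text.
SAMPLE = (
    "<!doctype html>\n"
    "<div><br />password<br />abc</div>\n"
    "<div>x</div>\n"
    "<br />pass>def</div>\n"
)
print(targets_per_line(SAMPLE))
```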


                        Next, you said :

                        Is it maybe possible to implement a rule that avoids duplicate results?

                        My question is : In the copied HTML files that contain the passwords ( 1 per line ), what is the maximum length of these files ?

                        Depending on this length, a regex solution may be possible… However, if you don’t mind changing the initial order of these passwords, just use, for each copied HTML file, the two menu options, below :

                        • Edit > Line Operations > Sort Lines Lexicographically Ascending

                        • Edit > Line Operations > Remove Consecutive Duplicate Lines
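The two menu steps above amount to the following logic, shown as a minimal Python sketch. The function names are invented, and Python's sorted may order some characters differently from the Notepad++ sort. The second function is an alternative, not a Notepad++ feature described in this thread, for the case where the initial order must be kept.

```python
def sort_and_dedupe(lines: list[str]) -> list[str]:
    # Mirrors the two menu steps: sort first, then drop consecutive duplicates.
    result: list[str] = []
    for line in sorted(lines):
        if not result or result[-1] != line:
            result.append(line)
    return result


def dedupe_keep_order(lines: list[str]) -> list[str]:
    # Alternative that keeps the first occurrence of each line in place.
    return list(dict.fromkeys(lines))


passwords = ["b", "a", "b", "c", "a"]
print(sort_and_dedupe(passwords))
print(dedupe_keep_order(passwords))
```

Sorting first matters because the second menu option only removes duplicates that are adjacent.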


                        Finally, you said :

                        Another thing is, even if I choose “Automatischer Zeilenumbruch” ( Word Wrap ), each line is written on ONE line, not continued on a second one, which would avoid the vertical scrolling.

                        I’m sorry because I cannot guess what you’re speaking of :-(( Depending on your file ending characters, discussed previously, and using the appropriate Replace regex :

                        \2\r\n for Windows files

                        OR

                        \2\n for Unix files

                        The View > Word Wrap option should work correctly !?

                        Best Regards,

                        guy038
