    Unicode BLANK characters and the regexes \h , \v and \s

    • guy038G
      guy038
      last edited by guy038

      Hi, All,

      As I’m about to reply to someone about a regex involving the \s syntax, I have compiled, below, the list of blank characters in Unicode 10.0, and how to match them with the \h, \v and \s regexes in the Boost regex engine currently used by Notepad++


      The regex \h matches any of the 19 HORIZONTAL blank characters below :

      - TABULATION                   ( \x{0009} - \t )
      
      - SPACE                        ( \x{0020} )
      
      - NO BREAK SPACE               ( \x{00A0} )
      
      - OGHAM SPACE MARK             ( \x{1680} )
      
      - MONGOLIAN VOWEL SEPARATOR    ( \x{180E} )
      
      - EN QUAD                      ( \x{2000} )
      
      - EM QUAD                      ( \x{2001} )
      
      - EN SPACE                     ( \x{2002} )
      
      - EM SPACE                     ( \x{2003} )
      
      - THREE-PER-EM SPACE           ( \x{2004} )
      
      - FOUR-PER-EM SPACE            ( \x{2005} )
      
      - SIX-PER-EM SPACE             ( \x{2006} )
      
      - FIGURE SPACE                 ( \x{2007} )
      
      - PUNCTUATION SPACE            ( \x{2008} )
      
      - THIN SPACE                   ( \x{2009} )
      
      - HAIR SPACE                   ( \x{200A} )
      
      - NARROW NO-BREAK SPACE        ( \x{202F} )
      
      - MEDIUM MATHEMATICAL SPACE    ( \x{205F} )
      
      - IDEOGRAPHIC SPACE            ( \x{3000} )
      

      The regex \v matches any of the 7 VERTICAL blank characters below :

      - NEW LINE                     ( \x{000A} - \n )
      
      - VERTICAL TABULATION          ( \x{000B} )
      
      - FORM FEED                    ( \x{000C} - \f )
      
      - CARRIAGE RETURN              ( \x{000D} - \r )
      
      - NEXT LINE                    ( \x{0085} )
      
      - LINE SEPARATOR               ( \x{2028} )
      
      - PARAGRAPH SEPARATOR          ( \x{2029} )
      

      Finally, the regex \s matches any of the 26 blank characters listed above ( the 19 horizontal ones plus the 7 vertical ones )

      REMARK : The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !
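
      As a rough illustration only, the lists above can be spelled out as explicit character classes. The sketch below uses Python's re module as a stand-in, not the Boost engine of Notepad++ : Python's re has no \h, and its \v means VERTICAL TABULATION only, so the code points are written out one by one. The names HORIZONTAL, VERTICAL and char_class are hypothetical helpers, and the classes mirror the lists in this post, not Python's own definition of \s.

```python
import re

# Code points copied from the two lists in this post.
HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
              0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points):
    """Build a character class matching exactly the given code points."""
    return "[" + "".join("\\u{:04X}".format(cp) for cp in code_points) + "]"


# Stand-ins for \h, \v and \s as described in this post (assumption:
# these mimic the lists above, not any engine's built-in behaviour).
H_CLASS = re.compile(char_class(HORIZONTAL))
V_CLASS = re.compile(char_class(VERTICAL))
S_CLASS = re.compile(char_class(HORIZONTAL + VERTICAL))

if __name__ == "__main__":
    print(len(HORIZONTAL), len(VERTICAL), len(HORIZONTAL + VERTICAL))
    print(bool(S_CLASS.fullmatch("\u3000")), bool(H_CLASS.fullmatch("\n")))
```

      Building the class from a list keeps the counts ( 19, 7 and 26 ) easy to check against the post.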


      In practice, the regex \s mostly matches a single blank character from the short list below :

      - TABULATION                   ( \x{0009} - \t )
      
      - SPACE                        ( \x{0020} )
      
      - NEW LINE                     ( \x{000A} - \n )
      
      - CARRIAGE RETURN              ( \x{000D} - \r )
      

      Best Regards,

      guy038

      • MAPJe71M
        MAPJe71
        last edited by

        The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !

        Please explain.

        • guy038G
          guy038
          last edited by guy038

          Hello, @mapje71, and All,

          Ah yes, I apologize : I should have given more detail in my last post !

          I simply noticed that the regex [\v] doesn’t match the same characters as the plain \v regex does :-((

          • The \v regex, as said in my first post, matches any of the 7 vertical blank characters

          • The [\v] regex matches ONLY the vertical tabulation control character ( VT ), of Unicode code point \x{000B} ( or \x{0B} or \x0B )


          I don’t know whether it’s a bug in the Boost regex engine used by N++, or a normal restriction of \v when it is used inside a character class ! I should investigate on the http://www.regular-expressions.info/ site ;-))

          Cheers,

          guy038

          • guy038G
            guy038
            last edited by

            Hi, @mapje71, and All,

            On the web page below :

            http://www.regular-expressions.info/refcharclass.html

            it is stated that the regex [\v] adds the “vertical tab” control character (ASCII 0x0B) to the character class, without adding any other vertical whitespace, which is confirmed by the example given there !

            So, seemingly, this is a current restriction of the \v regex when it is used inside a character class !
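
            One possible way around this restriction, when vertical blanks are needed inside a character class, is to write their code points explicitly instead of relying on \v. The sketch below is in Python's re module, used only as a stand-in for the Boost engine of Notepad++ ; the names VERTICAL_CLASS, BLANK_RUN and collapse_blanks are hypothetical, and the same idea written in Boost syntax has not been tested here.

```python
import re

# The 7 vertical blank characters from the first post, written out
# explicitly so the class does not depend on how the engine reads \v
# inside brackets.
VERTICAL_CLASS = r"[\n\x0B\f\r\x85\u2028\u2029]"

# The same code points combined with space and tab in a single class,
# which is the kind of use that [\h\v] was presumably meant for.
BLANK_RUN = re.compile(r"[ \t\n\x0B\f\r\x85\u2028\u2029]+")


def collapse_blanks(text):
    """Replace each run of the listed blank characters with one space."""
    return BLANK_RUN.sub(" ", text)


if __name__ == "__main__":
    print(collapse_blanks("a\r\n\tb\u2028c"))
```

            Spelling the code points out is more verbose, but the class then means the same thing whatever the engine does with \v between brackets.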

            Best Regards,

            guy038
