Unicode BLANK characters and the regexes \h, \v and \s
-
Hi, All,
As I’m about to reply to someone regarding some regex explanations involving the \s syntax, I created, below, the definitive list of blank characters in Unicode 10.0, and the way to match these blank characters with the \h, \v and \s regexes in the Boost regex engine presently used in Notepad++.
The regex \h matches any of these 19 HORIZONTAL blank characters:
- TABULATION ( \x{0009} - \t )
- SPACE ( \x{0020} )
- NO-BREAK SPACE ( \x{00A0} )
- OGHAM SPACE MARK ( \x{1680} )
- MONGOLIAN VOWEL SEPARATOR ( \x{180E} )
- EN QUAD ( \x{2000} )
- EM QUAD ( \x{2001} )
- EN SPACE ( \x{2002} )
- EM SPACE ( \x{2003} )
- THREE-PER-EM SPACE ( \x{2004} )
- FOUR-PER-EM SPACE ( \x{2005} )
- SIX-PER-EM SPACE ( \x{2006} )
- FIGURE SPACE ( \x{2007} )
- PUNCTUATION SPACE ( \x{2008} )
- THIN SPACE ( \x{2009} )
- HAIR SPACE ( \x{200A} )
- NARROW NO-BREAK SPACE ( \x{202F} )
- MEDIUM MATHEMATICAL SPACE ( \x{205F} )
- IDEOGRAPHIC SPACE ( \x{3000} )
The regex \v matches any of these 7 VERTICAL blank characters:
- NEW LINE ( \x{000A} - \n )
- VERTICAL TABULATION ( \x{000B} )
- FORM FEED ( \x{000C} - \f )
- CARRIAGE RETURN ( \x{000D} - \r )
- NEXT LINE ( \x{0085} )
- LINE SEPARATOR ( \x{2028} )
- PARAGRAPH SEPARATOR ( \x{2029} )
Finally, the regex \s matches any of the 26 blank characters listed above.
REMARK : The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !
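For anyone who wants to check the counts above outside Notepad++, here is a minimal Python sketch. It only copies the code points from the two lists in this post into explicit character classes. Python's `re` module has no `\h`, so this emulates the lists rather than the Boost engine itself, and the names `HORIZONTAL`, `VERTICAL` and `char_class` are made up for the illustration.

```python
import re

# Code points copied from the two lists in this post
# (assumed to reflect what Boost's \h and \v match).
HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              *range(0x2000, 0x200B),  # EN QUAD .. HAIR SPACE (11 code points)
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points):
    """Build an explicit regex character class such as [\\u0009\\u0020...]."""
    return "[" + "".join(f"\\u{cp:04X}" for cp in code_points) + "]"


H_CLASS = re.compile(char_class(HORIZONTAL))             # stand-in for \h
V_CLASS = re.compile(char_class(VERTICAL))               # stand-in for \v
S_CLASS = re.compile(char_class(HORIZONTAL + VERTICAL))  # stand-in for (\h|\v)

print(len(HORIZONTAL), len(VERTICAL), len(HORIZONTAL + VERTICAL))
```

The three numbers printed are the sizes of the horizontal list, the vertical list and their union, which can be compared with the 19, 7 and 26 quoted above.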
In practice, the regex \s mainly matches a single blank character from the list below:
- TABULATION ( \x{0009} - \t )
- SPACE ( \x{0020} )
- NEW LINE ( \x{000A} - \n )
- CARRIAGE RETURN ( \x{000D} - \r )
Best Regards,
guy038
-
The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !
Please explain.
-
Hello, @mapje71, and All,
Ah yes, I apologize, because I should have given more information in my last post !
I simply noticed that the regex [\v] doesn’t match the same characters as the simple \v regex does :-((
- The \v regex, as said in my first post, matches any of the 7 vertical blank characters
- The [\v] regex matches ONLY the vertical tabulation control character ( VT ), of Unicode code point \x{000B} ( or \x{0B} or \x0B )
I don’t know if it’s a bug in the Boost regex engine used by N++, or if it’s a normal restriction of \v when used in a character class ! I should investigate this on the http://www.regular-expressions.info/ site ;-))
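Whatever the cause, one way around it is to list the 7 vertical blanks one by one inside the character class. Here is a small Python sketch of that idea. Python is only a stand-in here: in Python's `re`, `\v` means the vertical tab alone both inside and outside a class, so the sketch shows the workaround and not the Boost difference itself. The semicolon is just an arbitrary extra member of the class, and the names `VERTICAL_BLANKS`, `explicit` and `vt_only` are made up for the example.

```python
import re

# The 7 vertical blanks from the first post, written out one by one.
VERTICAL_BLANKS = "\n\x0b\x0c\r\x85\u2028\u2029"

# Explicit class: any vertical blank OR a semicolon (arbitrary example member).
explicit = re.compile("[" + VERTICAL_BLANKS + ";]")

# For comparison: in Python, a class holding \v contains the vertical tab only.
vt_only = re.compile(r"[\v]")

sample = "a;b\nc\u2028d\x0be"
print(explicit.findall(sample))  # every vertical blank and the semicolon
print(vt_only.findall(sample))   # the vertical tab only
```

In Notepad++ the same idea would presumably be written as [\x{000A}\x{000B}\x{000C}\x{000D}\x{0085}\x{2028}\x{2029};], using the \x{....} notation from the lists above. I haven’t tested that exact class here.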
Cheers,
guy038
-
-
Hi, @mapje71, and All,
On the web page below :
http://www.regular-expressions.info/refcharclass.html
it is said that the regex [\v] adds the “vertical tab” control character (ASCII 0x0B) to the character class, without adding any other vertical whitespace, which is confirmed by the given example !
So, seemingly, it’s a current restriction of the \v regex in a character class !
Best Regards,
guy038