First, even if the resulting regex will not keep the same meaning, let’s reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.
Now let’s consider the following code:
$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line
$x =~ /\s+0/; # The fast line
Digging a bit with use re debugcolor; we get the following output:
Guessing start of match in sv for REx "\s*0" against " _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against " _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes)
0 < _0>| 1:STAR(3)
POSIXD[\s] can match 4 times out of 2147483647...
failed...
1 < _0>| 1:STAR(3)
POSIXD[\s] can match 3 times out of 2147483647...
failed...
2 < _0>| 1:STAR(3)
POSIXD[\s] can match 2 times out of 2147483647...
failed...
3 < _0>| 1:STAR(3)
POSIXD[\s] can match 1 times out of 2147483647...
failed...
5 < _0>| 1:STAR(3)
POSIXD[\s] can match 0 times out of 2147483647...
5 < _0>| 3: EXACT <0>(5)
6 < _0>| 5: END(0)
Match successful!
-----------------------
Guessing start of match in sv for REx "\s+0" against " _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against " _0"
Matching stclass POSIXD[\s] against " _" (5 bytes)
0 < _0>| 1:PLUS(3)
POSIXD[\s] can match 4 times out of 2147483647...
failed...
Contradicts stclass... [regexec_flags]
Match failed
Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it’ll look for 0 : Found floating substr "0" at offset 5...
Then it’ll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:
Matching REx "\s*0" against " _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against " _0" (6 bytes)
Matching REx "\s+0" against " _0"
Matching stclass POSIXD[\s] against " _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char
After that it’ll look for the first position meeting the stclass requirement, here at position 0.
\s*0:- starts at 0, find 4 spaces then fail;
- starts at 1, find 3 spaces then fail;
- starts at 2, find 2 spaces then fail;
- starts at 3, find 1 spaces then fail;
- starts at 4, find 0 spaces then doesn’t fail;
- Find an exact
0
\s+0:- starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.
If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning… But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.
In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn’t fail with Perl because it have been optimized to find floating substr "0%n"
You can see another example on Jeff Atwood’s blog.
Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers
With such shorter example the behavior is “findable”, but if you start to play with complicated patterns, it’s far from easy to spot, for example: Regular expression hangs program (100% CPU usage)