I have reduced your problem to this:
my $text="M Y H A P P Y T E X T";
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
This happens because, under the /i (case-insensitive) modifier, certain character combinations such as "ss" or "st" can also match a single character (a typographic ligature), which makes the lookbehind variable length. For instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06).
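A minimal sketch to check the length difference yourself. The helper name matches_august_i is made up for illustration, and the ligature is written as an escape so the script does not depend on the file's encoding:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether /August/i matches the given string.
sub matches_august_i {
    my ($s) = @_;
    return $s =~ /August/i ? 1 : 0;
}

my $plain    = "AUGUST";         # 6 characters, plain ASCII
my $ligature = "augu\x{FB06}";   # 5 characters, ends in U+FB06 (LATIN SMALL LIGATURE ST)

for my $s ($plain, $ligature) {
    # Print only the length and the result, to avoid wide-character output warnings.
    printf "%d characters: %s\n", length($s),
        matches_august_i($s) ? "match" : "no match";
}
```

Both strings should be reported as a match, even though their lengths differ.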
However, if we remove the /i (case-insensitive) modifier, it works, because typographic ligatures are no longer matched.
Solution: use the /aa modifier, i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text="M Y H A P P Y T E X T";
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like “k” with “\N{KELVIN SIGN}”), specify the “a” twice, for example /aai or /aia. (The first occurrence of “a” restricts the \d, etc., and the second occurrence adds the “/i” restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn’t really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
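A rough way to see the quoted rule in action is to run the same pattern with /i and with /iaa. This is only a sketch: compare_modifiers is a hypothetical helper, and it assumes Perl 5.14 or later, where /aa is available.

```perl
use strict;
use warnings;

# Hypothetical helper: try one pattern against one string with /i and with /iaa.
# Returns a pair of 1/0 flags: (matched under /i, matched under /iaa).
sub compare_modifiers {
    my ($pattern, $string) = @_;
    my $i   = $string =~ /$pattern/i   ? 1 : 0;
    my $iaa = $string =~ /$pattern/iaa ? 1 : 0;
    return ($i, $iaa);
}

my @cases = (
    [ 'k',      "\x{212A}"     ],   # U+212A KELVIN SIGN
    [ 'August', "augu\x{FB06}" ],   # U+FB06 LATIN SMALL LIGATURE ST
    [ 'August', "AUGUST"       ],   # plain ASCII, differs only in case
);

for my $case (@cases) {
    my ($pattern, $string) = @$case;
    my ($i, $iaa) = compare_modifiers($pattern, $string);
    printf "pattern %-6s on a %d-character string: /i=%d /iaa=%d\n",
        $pattern, length($string), $i, $iaa;
}
```

The first two cases mix ASCII in the pattern with a non-ASCII character in the string, so they should match only under /i. The last case is pure ASCII and should match under both.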
See a closely related discussion here