Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

Here's some fun valid XML for you: <!DOCTYPE x [ <!ENTITY y "a]>b"> ]> <x> <a b="&y;>" /> <![CDATA[[a>b <a>b <a]]> <?x <a> <!-- <b> ?> c --> d </x> And this little bundle of joy is valid HTML: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [ <!ENTITY % e "href='https://stackoverflow.com/questions/701166/hello'"> <!ENTITY e "<a …
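
To make the point concrete, here is a minimal Python sketch that compares a naive regex with a real parser. The document string is a simplified stand-in invented for illustration (it is not the XML quoted above), and the pattern `<a\b` is just one plausible "find the `a` tags" attempt, not anything from the original answer.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one real <a> element, plus text that only looks like tags.
doc = '<x><a b="1>2"/><![CDATA[<a>b <a]]><!-- <a> --></x>'

# Naive attempt: count everything that looks like the start of an <a> tag.
regex_hits = re.findall(r"<a\b", doc)

# A real parser knows that CDATA sections and comments contain no elements.
root = ET.fromstring(doc)
parsed = list(root.iter("a"))

print(len(regex_hits), "regex hits vs", len(parsed), "real element(s)")
```

The regex also has no idea that the `>` inside the attribute value does not close the tag, whereas the parser returns the attribute intact via `parsed[0].get("b")`.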

How do I grep for all non-ASCII characters?

You can use the command: grep --color='auto' -P -n "[\x80-\xFF]" file.xml This prints the line number of each match and highlights non-ASCII characters in red. On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead: grep --color='auto' -P -n "[^\x00-\x7F]" file.xml Note also that the important …
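
If grep's -P option is not available, the same inverse character class can be applied from Python. This is only a rough sketch: `find_non_ascii` is a hypothetical helper name, and it works on already-decoded text (Unicode code points) rather than on raw bytes as grep does.

```python
import re

# Same idea as the inverse grep pattern: anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def find_non_ascii(lines):
    """Return (line number, column, character) for each non-ASCII character.

    Line numbers and columns are 1-based, similar to what grep -n reports.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for m in NON_ASCII.finditer(line):
            hits.append((lineno, m.start() + 1, m.group()))
    return hits

# Made-up sample input; only the second line contains a non-ASCII character.
sample = ["plain ascii", "caf\u00e9 au lait"]
hits = find_non_ascii(sample)
print(hits)
```

To scan a real file, one could pass an open file object (for example `open("file.xml", encoding="utf-8")`) in place of `sample`, assuming the file really is UTF-8.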

Regular Expression to find a string included between two characters while EXCLUDING the delimiters

Easily done: (?<=\[)(.*?)(?=\]) Technically, that uses lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of: a [ that must precede the match but is not captured (lookbehind); a non-greedy captured group, non-greedy so that it stops at the first ]; and a ] that must follow the match but is not captured (lookahead). Alternatively you can …
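
As a quick check of the lookaround pattern, here is a small sketch using Python's re module. The sample string is made up for illustration; other regex flavours may treat lookbehind differently.

```python
import re

# Lookbehind for "[", lazy capture, lookahead for "]".
BETWEEN_BRACKETS = re.compile(r"(?<=\[)(.*?)(?=\])")

text = "a [b] c [d e] f"  # hypothetical input

# The delimiters are asserted but never consumed, so they are not part of the match.
inner = BETWEEN_BRACKETS.findall(text)
whole = [m.group(0) for m in BETWEEN_BRACKETS.finditer(text)]

print(inner)
print(whole)
```

Because the brackets sit in zero-width assertions, the whole match and the captured group are the same text here.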

How can I find all matches to a regular expression in Python?

Use re.findall or re.finditer instead. re.findall(pattern, string) returns a list of matching strings. re.finditer(pattern, string) returns an iterator over match objects. Example: re.findall(r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats') # Output: ['cats', 'dogs'] [x.group() for x in re.finditer(r'all (.*?) are', 'all cats are smarter than …
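
For reference, here is a self-contained sketch that runs both calls on the sentence from the example. The variable names are my own, and collecting the start offsets is just one illustration of what the match objects from re.finditer give you beyond the matched text.

```python
import re

text = "all cats are smarter than dogs, all dogs are dumber than cats"
pattern = re.compile(r"all (.*?) are")

# With one capture group, findall returns the group contents only.
captured = pattern.findall(text)

# finditer yields match objects, so the full match and its position are available.
full_matches = []
starts = []
for m in pattern.finditer(text):
    full_matches.append(m.group(0))
    starts.append(m.start())

print(captured, full_matches, starts)
```

Prefer finditer when you need positions or several groups per match, and findall when a plain list of strings is enough.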
