regex – Page 221 – Tarik Billa

Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

September 20, 2022 by Tarik

Here’s some fun valid XML for you: <!DOCTYPE x [ <!ENTITY y "a]>b"> ]> <x> <a b="&y;>" /> <![CDATA[[a>b <a>b <a]]> <?x <a> <!-- <b> ?> c --> d </x> And this little bundle of joy is valid HTML: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [ <!ENTITY % e "href='hello'"> <!ENTITY e "<a …
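
To make the point concrete, here is a minimal sketch comparing a naive regex with a real XML parser. The document string is a made-up example (it is not the XML quoted above), and the pattern is deliberately simplistic.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: only the last <a> is a real element.
doc = (
    "<x>"
    "<!-- <a>in a comment</a> -->"
    "<![CDATA[<a>in CDATA</a>]]>"
    "<a>real</a>"
    "</x>"
)

# Naive regex: treats every "<a>...</a>" run of characters as an element,
# including the ones inside the comment and the CDATA section.
regex_hits = re.findall(r"<a>(.*?)</a>", doc)

# Real parser: the comment is dropped and the CDATA content is plain text.
root = ET.fromstring(doc)
parser_hits = [el.text for el in root.iter("a")]

print(regex_hits)
print(parser_hits)
```

The regex reports three "elements" while the parser finds one. A pattern could be patched to skip comments and CDATA, but entities, processing instructions and DOCTYPE declarations each need yet another special case.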

A regular expression to exclude a word/string

September 20, 2022 by Tarik

Here’s yet another way (using a negative look-ahead): ^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$ Note: There’s only one capturing expression: ([a-z0-9]+).
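
As a rough illustration, the pattern can be tried in Python like this. The helper name `first_segment` and the sample paths are assumptions made for the example; `ignoreme`, `ignoreme2` and `ignoremeN` are the placeholder words from the pattern itself.

```python
import re

# Pattern from the excerpt above, unchanged.
PATTERN = re.compile(r"^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$")


def first_segment(path):
    """Return the captured segment, or None if the path is excluded or malformed."""
    match = PATTERN.match(path)
    return match.group(1) if match else None


print(first_segment("/hello"))           # accepted
print(first_segment("/ignoreme"))        # excluded word
print(first_segment("/ignoremeplease"))  # also excluded: starts with "ignoreme"
```

Note that the look-ahead only checks what the segment starts with, so anything beginning with an excluded word is rejected too. If only exact words should be excluded, one option is to anchor the look-ahead, for example `(?!(?:ignoreme|ignoreme2)$)`.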

How do I grep for all non-ASCII characters?

September 20, 2022 by Tarik

You can use the command: grep --color="auto" -P -n "[\x80-\xFF]" file.xml This will give you the line number and highlight the non-ASCII characters in red. On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead: grep --color="auto" -P -n "[^\x00-\x7F]" file.xml Note also, that the important …
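
The same "anything outside \x00-\x7F" idea can be sketched in Python. This is only an illustration: the function name and sample text are made up, and Python matches against decoded code points, whereas grep's behaviour depends on the locale and file encoding.

```python
import re

# Any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def non_ascii_lines(text):
    """Return (line_number, [characters]) for each line containing non-ASCII characters."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are not swallowed.
    for lineno, line in enumerate(text.split("\n"), start=1):
        found = NON_ASCII.findall(line)
        if found:
            hits.append((lineno, found))
    return hits


sample = "plain ascii\ncaf\u00e9 au lait\nna\u00efve r\u00e9sum\u00e9"
for lineno, chars in non_ascii_lines(sample):
    print(lineno, chars)
```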

Regular Expression to find a string included between two characters while EXCLUDING the delimiters

September 19, 2022 by Tarik

Easily done: (?<=\[)(.*?)(?=\]) Technically, that’s using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of: a lookbehind, so the match must be preceded by a [ that is not captured; a non-greedy captured group, non-greedy so that it stops at the first ]; and a lookahead, so the match must be followed by a ] that is not captured. Alternatively you can …
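
A quick way to check the lookaround pattern is to run it in Python. The sample string below is invented for the example.

```python
import re

# Lookbehind for "[", lazy group, lookahead for "]": the brackets are
# required around the match but are not part of it.
BETWEEN_BRACKETS = re.compile(r"(?<=\[)(.*?)(?=\])")

text = "values: [alpha] and [beta gamma]"
matches = BETWEEN_BRACKETS.findall(text)
print(matches)

# With a greedy group the match would run to the last "]" instead.
greedy = re.findall(r"(?<=\[)(.*)(?=\])", text)
print(greedy)
```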

What regex will match every character except comma ‘,’ or semi-colon ‘;’?

September 18, 2022 by Tarik

[^,;]+ You haven’t specified the regex implementation you are using. Most of them have a Split method that takes delimiters and splits on them. You might want to use that with a “normal” (without ^) character class: [,;]+
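
Both approaches can be compared side by side. The sketch below uses Python's `re.findall` and `re.split` as one example of such an implementation; the input strings are made up.

```python
import re

line = "a,b;c,;d"

# Negated class: collect the runs of characters that are not delimiters.
fields = re.findall(r"[^,;]+", line)

# Plain class: split on runs of delimiters.
parts = re.split(r"[,;]+", line)

print(fields)
print(parts)

# The two differ when the input starts (or ends) with a delimiter:
# split keeps an empty string at that edge, findall does not.
print(re.findall(r"[^,;]+", ",a;b"))
print(re.split(r"[,;]+", ",a;b"))
```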

Removing empty lines in Notepad++

September 18, 2022 by Tarik

As of version 6.5.2 there is a built-in way to do this: Edit -> Line Operations -> Remove Empty Lines, or Remove Empty Lines (Containing Blank characters).

How to match “any character” in regular expression?

September 18, 2022 by Tarik

Yes, you can. That should work.
. = any char except newline
\. = the actual dot character
.? = .{0,1} = match any char except newline zero or one times
.* = .{0,} = match any char except newline zero or more times
.+ = .{1,} = match any char except newline one or …
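
The behaviour of the dot is easy to verify in Python. The strings below are invented, and the `re.DOTALL` flag is an addition that the excerpt does not mention: it is Python's way of letting `.` match newlines as well.

```python
import re

text = "ab\ncd"

# "." matches every character except the newline.
single_chars = re.findall(r".", text)

# "\." matches only a literal dot.
literal_dots = re.findall(r"\.", "a.b")

# ".+" stops at the newline, so each line is matched separately.
per_line = re.findall(r".+", text)

# With re.DOTALL the dot also matches "\n" (Python-specific flag).
whole_text = re.findall(r".+", text, flags=re.DOTALL)

# ".?" is the same as ".{0,1}": it accepts the empty string or one character.
accepts_empty = re.fullmatch(r".?", "") is not None
rejects_two = re.fullmatch(r".?", "ab") is None

print(single_chars, literal_dots, per_line, whole_text)
```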

Regex match one of two words

September 17, 2022 by Tarik

This will do: /^(apple|banana)$/ To keep the alternation out of the captured groups (e.g. $1, $2), use a non-capturing group: (?:apple|banana) Or, if you use it as a standalone pattern: apple|banana
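
The three variants behave differently, which a short Python sketch can show. The function name `is_fruit` and the `-42` suffix in the second pattern are assumptions added for illustration.

```python
import re

# Anchored alternation: the whole string must be one of the two words.
WHOLE = re.compile(r"^(apple|banana)$")


def is_fruit(word):
    return WHOLE.match(word) is not None


# Non-capturing group: the alternation does not take up a group number,
# so the digits after the (hypothetical) dash are group 1.
numbered = re.match(r"^(?:apple|banana)-(\d+)$", "banana-42")

# Standalone alternation: finds either word anywhere, even inside other words.
anywhere = re.findall(r"apple|banana", "pineapple and banana bread")

print(is_fruit("apple"), is_fruit("apple pie"))
print(numbered.groups())
print(anywhere)
```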

Split Java String by New Line

September 17, 2022 by Tarik

This should cover you: String[] lines = string.split("\\r?\\n"); There are really only two newline conventions (UNIX and Windows) that you need to worry about.

How can I find all matches to a regular expression in Python?

September 17, 2022 by Tarik

Use re.findall or re.finditer instead. re.findall(pattern, string) returns a list of matching strings. re.finditer(pattern, string) returns an iterator over MatchObject objects. Example: re.findall(r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats') # Output: ['cats', 'dogs'] [x.group() for x in re.finditer(r'all (.*?) are', 'all cats are smarter than …