Matching Unicode characters in Python regular expressions

You need to specify the re.UNICODE flag and pass the input as a Unicode string by using the u prefix:
    >>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
    {'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}
This is in Python 2. In Python 3 all strings are Unicode, so the u prefix is unnecessary (it is a syntax error in 3.0 to 3.2 and accepted but redundant from 3.3 on), and you can leave off the re.UNICODE flag because Unicode matching is the default for str patterns.
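For comparison, here is a minimal Python 3 sketch of the same match. It reuses the pattern and path from the answer; the variable names (PATTERN, path, groups, ascii_match) are only illustrative. The re.ASCII call is added to show what happens when Unicode matching is switched off.

```python
import re

# Same pattern as in the answer; in Python 3, \w is Unicode-aware for str patterns.
PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'
path = '/by_tag/påske/øyfjell.jpg'

# No u prefix and no re.UNICODE flag needed.
groups = re.match(PATTERN, path).groupdict()
print(groups)  # {'tag': 'påske', 'filename': 'øyfjell.jpg'}

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the match fails at "å".
ascii_match = re.match(PATTERN, path, re.ASCII)
print(ascii_match)  # None
```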

Why does wprintf transliterate Russian text in Unicode into Latin on Linux?

Because wide characters are converted according to the currently set locale. By default, a C program always starts in the "C" locale, which only supports ASCII characters. You have to switch to a Russian or UTF-8 locale first:
    setlocale(LC_ALL, "ru_RU.utf8"); // Russian Unicode
    setlocale(LC_ALL, "en_US.utf8"); // English US Unicode
Or to a current …

Removing unicode \u2026 like characters in a string in python2.7 [duplicate]

Python 2.x
    >>> s
    'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
    >>> print(s.decode('unicode_escape').encode('ascii','ignore'))
    This is some  text that has to be cleaned! it's annoying!
Python 3.x
    >>> s="This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!"
    >>> s.encode('ascii', 'ignore')
    b"This is some  text that has to be …
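In Python 3, str has no decode method, so the Python 2 one-liner does not carry over directly when the string holds literal backslash escapes (as in the Python 2 example). The following is a sketch of one possible approach, not necessarily what the truncated answer goes on to show. It assumes the input is pure ASCII apart from the escape sequences; the names raw and cleaned are illustrative.

```python
# Input with literal backslash-u sequences, as in the Python 2 example.
raw = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'

# 1. str -> bytes, so the unicode_escape codec can be applied
#    (assumes raw contains only ASCII characters).
# 2. unicode_escape turns "\u03c0" etc. into the real characters.
# 3. encode('ascii', 'ignore') drops everything outside ASCII.
# 4. decode('ascii') returns a str instead of bytes.
cleaned = (
    raw.encode('ascii')
       .decode('unicode_escape')
       .encode('ascii', 'ignore')
       .decode('ascii')
)

print(cleaned)
```

Note that dropping a character does not drop the spaces around it, so a double space is left where the pi character used to be.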

How to fetch a non-ascii url with urlopen?

Strictly speaking URIs can’t contain non-ASCII characters; what you have there is an IRI. To convert an IRI to a plain ASCII URI: non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm; non-ASCII characters in the path, and most of the other parts of the address …
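The conversion described above could be sketched in Python 3 roughly as follows. The excerpt is cut off before it says how the path is handled, so this sketch assumes the usual convention of percent-encoding the UTF-8 bytes. The function name iri_to_uri and the example address are hypothetical, and user:password parts of the address are ignored for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)

    # Hostname: Punycode-based IDNA encoding via the built-in 'idna' codec.
    host = (parts.hostname or '').encode('idna').decode('ascii')
    netloc = host if parts.port is None else '%s:%d' % (host, parts.port)

    # Assumption: other components are percent-encoded as UTF-8.
    # '%' is kept safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')
    fragment = quote(parts.fragment, safe='%')

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri('http://bücher.example/påske/øyfjell.jpg')
print(uri)  # http://xn--bcher-kva.example/p%C3%A5ske/%C3%B8yfjell.jpg
```

The resulting string is plain ASCII and can be passed to urllib.request.urlopen.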
