python-unicode – Tarik Billa

String.maketrans for English and Persian numbers

April 6, 2024 by Tarik

See the unidecode library, which transliterates Unicode text into plain ASCII. It is very useful when numbers are entered in different scripts.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> … Read more
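For digits alone, the translation-table approach named in the title also works without a third-party package. The following is a minimal Python 3 sketch, assuming only the ten Persian digits (U+06F0 to U+06F9) need mapping; the helper name `to_ascii_digits` is hypothetical.

```python
# Build the ten Persian digits (U+06F0..U+06F9) from their code points.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ASCII_DIGITS = "0123456789"

# str.maketrans pairs up characters position by position (Python 3).
PERSIAN_TO_ASCII = str.maketrans(PERSIAN_DIGITS, ASCII_DIGITS)


def to_ascii_digits(text):
    """Return text with Persian digits replaced by ASCII digits."""
    return text.translate(PERSIAN_TO_ASCII)


print(to_ascii_digits(PERSIAN_DIGITS))
```

Unlike unidecode, this leaves every character outside the table untouched, which may be preferable when the surrounding text should keep its original script.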

Unicode Encode Error when writing pandas df to csv

December 9, 2023 by Tarik

You have Unicode values in your DataFrame. Files store bytes, which means all Unicode text has to be encoded into bytes before it can be stored in a file. You have to specify an encoding, such as utf-8. For example,
df.to_csv('path', header=True, index=False, encoding='utf-8')
If you don't specify an encoding, then the encoding used by df.to_csv … Read more
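The point that files store bytes can be seen without pandas. This is a small Python 3 sketch with an illustrative sample string (not taken from the post), showing one encoding that succeeds and one that cannot represent the text:

```python
text = "café"  # illustrative sample value

# Encoding turns text (str) into the bytes a file actually stores.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# A codec that cannot represent a character raises UnicodeEncodeError.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("ascii failed:", exc.reason)
```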

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('

November 29, 2023 by Tarik

I got the same error, but in my case I was subtracting a dict key from a dict value. I fixed it by subtracting the dict value for the corresponding key from the other dict value.
cosine_sim = cosine_similarity(e_b-e_a, w-e_c)
Here I got the error because e_b, e_a and e_c are the embedding vectors for words a, b and c respectively. I didn't know that 'w' is a string, … Read more

Pipreqs: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 1206: character maps to <undefined>

August 29, 2023 by Tarik

You can pass an encoding argument to pipreqs to set the encoding it uses to open files. Python 3 source files are usually encoded as UTF-8, so execute
pipreqs --encoding=utf8 C:\Users\root\Desktop\resumes

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128) [duplicate]

August 22, 2023 by Tarik

You need to encode Unicode explicitly before writing to a file, otherwise Python does it for you with the default ASCII codec. Pick an encoding and stick with it:
f.write(printinfo.encode('utf8') + '\n')
or use io.open() to create a file object that'll encode for you as you write to the file:
import io
f = io.open(filename, … Read more
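A self-contained sketch of the second option follows. The sample text, the temporary path, and the `newline` argument are assumptions made for illustration, not details from the post.

```python
import io
import os
import tempfile

# Hypothetical sample data and output path, for illustration only.
printinfo = u"caf\xe9"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# The file object encodes every write as UTF-8, so no manual .encode() call
# is needed. newline="\n" keeps the line ending identical on every platform.
with io.open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(printinfo + u"\n")

# Reading the raw bytes back shows what was actually stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()
print(raw)
```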

UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

August 11, 2023 by Tarik

Known encoding
If you know the encoding of the file you want to read in, you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work. It is more of a guess.
import … Read more
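When a detector is not available, a different and cruder technique is to try a short list of candidate encodings in order. This sketch is not chardet; the helper name `decode_with_fallback` and the candidate list are assumptions, and a successful decode only shows the bytes are valid in that encoding, not that the guess is right.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# A lone 0xE9 byte is not valid UTF-8, so the second candidate is used.
text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

Once a plausible encoding is found this way, it can be passed to pd.read_csv through its encoding argument.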

Correctly reading text from Windows-1252(cp1252) file in python

July 27, 2023 by Tarik

CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:
>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
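The same behaviour can be checked in Python 3, where `ascii()` plays the role that `repr` plays above. A short sketch; the variable names are illustrative:

```python
raw = b"J\xe2nis"

# Byte 0xE2 is "â" (U+00E2) in cp1252.
name = raw.decode("cp1252")
print(name)
print(ascii(name))  # ASCII-only representation, like repr in Python 2

# "ā" (U+0101) has no cp1252 byte, so trying to encode it fails.
try:
    "J\u0101nis".encode("cp1252")
except UnicodeEncodeError:
    can_encode_macron = False
else:
    can_encode_macron = True
print(can_encode_macron)
```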

Removing unicode \u2026 like characters in a string in python2.7 [duplicate]

July 24, 2023 by Tarik

Python 2.x:
>>> s
'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
>>> print(s.decode('unicode_escape').encode('ascii','ignore'))
This is some  text that has to be cleaned! it's annoying!
Python 3.x:
>>> s = "This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!"
>>> s.encode('ascii', 'ignore')
b"This is some  text that has to be … Read more
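In Python 3 the encode step returns bytes, so a decode back to str is usually wanted afterwards. A minimal sketch, with a hypothetical helper name `strip_non_ascii`:

```python
def strip_non_ascii(text):
    """Drop every character that ASCII cannot represent."""
    return text.encode("ascii", "ignore").decode("ascii")


s = "This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!"
cleaned = strip_non_ascii(s)
print(cleaned)
```

The spaces on either side of the removed character are kept, so a double space remains where \u03c0 used to be.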

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

July 19, 2023 by Tarik

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. I was able to fix the problem by removing it.
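An alternative to deleting the file is to skip hidden files when listing the data folder. This is a sketch under the assumption that every file whose name starts with a dot can be ignored; the helper name `visible_files` and the demo directory are hypothetical.

```python
import os
import tempfile


def visible_files(folder):
    """Return sorted names of regular files in folder, skipping dotfiles."""
    return sorted(
        name
        for name in os.listdir(folder)
        if not name.startswith(".")
        and os.path.isfile(os.path.join(folder, name))
    )


# Demo in a throwaway directory with one data file and one hidden file.
demo_dir = tempfile.mkdtemp()
for name in ("data.csv", ".DS_Store"):
    with open(os.path.join(demo_dir, name), "wb") as f:
        f.write(b"\x00")

print(visible_files(demo_dir))
```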

Why does ENcoding a string result in a DEcoding error (UnicodeDecodeError)?

June 25, 2023 by Tarik

"你好".encode('utf-8')
encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u prefix). So Python has to convert the string to a unicode object first. So it does the equivalent of
"你好".decode().encode('utf-8')
But the decode fails because the string isn't valid ASCII. … Read more
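Python 3 separates the two directions, which makes the mechanism easier to see. The sketch below reproduces the failing step explicitly by decoding the UTF-8 bytes as ASCII; the variable names are illustrative.

```python
text = "你好"

# encode: str -> bytes
data = text.encode("utf-8")
print(data)

# decode: bytes -> str, a lossless round trip with the matching codec
round_trip = data.decode("utf-8")

# Decoding the same bytes as ASCII fails. This is the step that Python 2
# performed implicitly before encoding a byte string.
decode_error = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(type(exc).__name__)
```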