UnicodeDecodeError when reading CSV file in Pandas with Python

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8", for reading, and generally utf-8 for to_csv. You can also use an alias such as 'latin' instead of 'ISO-8859-1', or the closely related Windows encoding 'cp1252' (see the Python docs, also for numerous other …
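As a rough illustration of the encoding option described above, the sketch below writes a small Latin-1 file and reads it back both ways. The sample data and temporary file are assumptions made up for the example; they are not from the original answer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a small CSV saved as ISO-8859-1 (Latin-1), not UTF-8.
raw = "name,city\nJosé,Málaga\n".encode("ISO-8859-1")

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Reading Latin-1 bytes as UTF-8 fails on the accented characters.
    try:
        pd.read_csv(path, encoding="utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Naming the encoding the file was actually written in succeeds.
    df = pd.read_csv(path, encoding="ISO-8859-1")
finally:
    os.remove(path)

print(utf8_failed)
print(df)
```

Which encoding is right depends on how the file was produced, so it is worth checking that accented characters look correct after loading.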

The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe

The R Language Definition is handy for answering these types of questions: http://cran.r-project.org/doc/manuals/R-lang.html#Indexing

R has three basic indexing operators, with syntax displayed by the following examples:

    x[i]
    x[i, j]
    x[[i]]
    x[[i, j]]
    x$a
    x$"a"

For vectors and matrices the [[ forms are rarely used, although they have some slight semantic differences from the [ form …

Converting a Pandas GroupBy output from Series to DataFrame

g1 here is a DataFrame. It has a hierarchical index, though:

    In [19]: type(g1)
    Out[19]: pandas.core.frame.DataFrame

    In [20]: g1.index
    Out[20]: MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'), ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

    In [21]: g1.add_suffix('_Count').reset_index()
    Out[21]:
          Name      City  City_Count  Name_Count
    0    Alice   Seattle           1           1
    1      Bob   Seattle           2           2
    2  Mallory  …
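The session above comes from an older pandas release. A minimal sketch of the same idea for current versions follows, using size() plus reset_index() rather than the add_suffix call from the answer. The input rows are made up for illustration and only reuse the names visible in the excerpt.

```python
import pandas as pd

# Hypothetical input: one row per (Name, City) observation.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# size() returns a Series indexed by a (Name, City) MultiIndex.
counts = df.groupby(["Name", "City"]).size()

# reset_index(name=...) moves the index levels back into columns,
# producing a flat DataFrame with a named count column.
flat = counts.reset_index(name="Count")

print(type(counts).__name__)
print(flat)
```

Passing as_index=False to groupby is another way to avoid the hierarchical index; which one reads better is a matter of taste.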

How to check if any value is NaN in a Pandas DataFrame

jwilner’s response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

    df.isnull().values.any()

    import numpy as np
    import pandas as pd
    import perfplot

    def setup(n):
        df = pd.DataFrame(np.random.randn(n))
        df[df > 0.9] = np.nan
        return df

    def …
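For a quick look at what the expression above returns, here is a small sketch on a hand-made frame. The column names and values are assumptions for the example, and it makes no claim about speed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value in column "a".
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# True if any cell anywhere in the frame is NaN.
has_nan = bool(df.isnull().values.any())

# Per-column view: which columns contain at least one NaN.
per_column = df.isnull().any()

# After dropping incomplete rows, the same check comes back False.
clean_has_nan = bool(df.dropna().isnull().values.any())

print(has_nan, clean_has_nan)
print(per_column)
```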

How to apply a function to two columns of Pandas dataframe

There is a clean, one-line way of doing this in Pandas:

    df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns. Example with data (based on original question):

    import pandas as pd …
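Since the excerpt's own data example is cut off, here is a minimal stand-in. The function f and the column values are placeholders chosen for illustration, not the ones from the original question.

```python
import pandas as pd


def f(a, b):
    """Placeholder two-argument function; any user-defined logic could go here."""
    return a + b


# Hypothetical data with the column names used in the one-liner above.
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [10, 20, 30]})

# axis=1 hands each row to the lambda, so columns are reachable by name.
df["col_3"] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

print(df)
```

For simple arithmetic like this, df["col_1"] + df["col_2"] would be faster; row-wise apply is mainly useful when f cannot be vectorized.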

Difference between map, applymap and apply methods in Pandas

Straight from Wes McKinney’s Python for Data Analysis book, pg. 132 (I highly recommend this book):

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

    In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

    In [117]: frame
    Out[117]:
    b d e …
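To make the three methods in the title concrete, the sketch below runs each on a small frame shaped like the one in the quoted session. It uses fixed values instead of random ones so the results are reproducible, which is a change made for this example. It also assumes that recent pandas releases (2.1 and later) provide DataFrame.map as the replacement for the deprecated applymap, and falls back to applymap on older versions.

```python
import numpy as np
import pandas as pd

# Same shape and labels as the book's frame, but with fixed values (assumption).
frame = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# DataFrame.apply: the function receives each column as a Series.
spread = frame.apply(lambda col: col.max() - col.min())

# Series.map: the function receives one element of a single column at a time.
labels = frame["e"].map(lambda v: f"{v:.1f}")

# Element-wise over the whole frame: DataFrame.map where it exists,
# otherwise the older DataFrame.applymap.
elementwise = getattr(frame, "map", None) or frame.applymap
doubled = elementwise(lambda v: v * 2)

print(spread)
print(labels)
print(doubled)
```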

Constructing pandas DataFrame from values in variables gives “ValueError: If using all scalar values, you must pass an index”

The error message says that if you’re passing scalar values, you have to pass an index. So you can either not use scalar values for the columns, e.g. use a list:

    >>> df = pd.DataFrame({'A': [a], 'B': [b]})
    >>> df
       A  B
    0  2  3

or use scalar values and pass an index:

    >>> …
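Both fixes described above can be tried side by side. This is a small sketch that assumes a = 2 and b = 3, matching the output shown in the excerpt.

```python
import pandas as pd

a, b = 2, 3  # values assumed from the excerpt's output

# All-scalar values with no index: pandas cannot infer the number of rows.
try:
    pd.DataFrame({"A": a, "B": b})
    raised = False
except ValueError:
    raised = True

# Fix 1: wrap each scalar in a list so every column has length one.
df_list = pd.DataFrame({"A": [a], "B": [b]})

# Fix 2: keep the scalars and state the index explicitly.
df_index = pd.DataFrame({"A": a, "B": b}, index=[0])

print(raised)
print(df_list)
print(df_index)
```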

Convert pandas dataframe to NumPy array

Use df.to_numpy(). It’s better than df.values; here’s why.* It’s time to deprecate your usage of values and as_matrix(). pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects: to_numpy(), which is defined on Index, Series, and DataFrame objects, and array, which is defined on Index and Series objects only. If you visit …
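A short sketch of the two accessors mentioned above, on a made-up frame. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing an integer column and a float column.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# DataFrame.to_numpy(): a 2D ndarray; mixed int/float columns share one dtype.
arr = df.to_numpy()

# Series.to_numpy(): a 1D ndarray for a single column.
col = df["A"].to_numpy()

# Series.array: the pandas array object backing the column (not an ndarray).
backing = df["A"].array

print(arr.dtype, arr.shape)
print(col)
print(type(backing).__name__)
```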
