How can a pandas merge preserve order?

Hopefully someone will provide a better answer, but in case no one does, this will definitely work. Zeroth, I’m assuming you don’t want to just end up sorted on loan, but to preserve whatever original order was in x, which may or may not have anything to do with the order of the loan …
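
One way to keep the original order of x can be sketched as follows. This is a minimal sketch, not necessarily the approach the full answer takes: the frame y, the columns amount and weight, and the helper column _order are hypothetical names chosen for illustration. The idea is to record each row's position before the merge and sort on it afterwards.

```python
import pandas as pd

# Hypothetical data: `x` is the frame whose row order must survive the merge.
x = pd.DataFrame({"loan": ["c", "a", "b", "a"], "amount": [30, 10, 20, 15]})
y = pd.DataFrame({"loan": ["a", "b", "c"], "weight": [1, 2, 3]})

# Record each row's original position in a helper column (hypothetical name).
x_ordered = x.assign(_order=range(len(x)))

# Merge, restore the recorded order, then drop the helper column.
merged = (
    x_ordered.merge(y, on="loan", how="left")
    .sort_values("_order", kind="stable")
    .drop(columns="_order")
    .reset_index(drop=True)
)
```

Sorting on the helper column makes the result independent of whatever ordering the chosen merge type produces.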

Pandas dataframe to Spark dataframe “Can not merge type error”

Long story short: don’t depend on schema inference. It is expensive and tricky in general. In particular, some columns (for example event_dt_num) in your data have missing values, which pushes Pandas to represent them as mixed types (string for non-missing values, NaN for missing ones). If you’re in doubt it is better to read all …
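
The mixed-type problem can be reproduced in pandas alone. The sketch below uses hypothetical data (only the column name event_dt_num comes from the question) and shows one possible cleanup, replacing NaN with None so that each value is either a string or null. The Spark call appears only in a comment because it needs a running SparkSession.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column with a missing value.
pdf = pd.DataFrame({
    "event_dt_num": pd.Series(["20160101", np.nan, "20160103"], dtype=object),
    "value": [1.0, 2.0, 3.0],
})

# The object column mixes str values with a float NaN.
types_before = sorted({type(v).__name__ for v in pdf["event_dt_num"]})

# Replace NaN with None so every value is either str or None.
clean = pdf.copy()
clean["event_dt_num"] = pd.Series(
    [v if isinstance(v, str) else None for v in pdf["event_dt_num"]],
    index=pdf.index,
    dtype=object,
)
types_after = sorted({type(v).__name__ for v in clean["event_dt_num"]})

# With a SparkSession available, an explicit schema could then be passed,
# e.g. spark.createDataFrame(clean, schema=...), instead of relying on inference.
```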

Pandas populate new dataframe column based on matching columns in another dataframe

Consider the following dataframes df and df2:

    import pandas as pd

    df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=list('zyxwvutsrqponml')
    ))
    df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title=list('zwvtrpml'),
        CATEGORY=list('11223344')
    ))

Option 1: merge

    df.merge(df2, how='left')

Option 2: join

    cols = ['AUTHOR_NAME', 'title']
    df.join(df2.set_index(cols), on=cols)

both options yield

How to get rid of multilevel index after using pivot table pandas?

You only need to remove the index name; use rename_axis (new in pandas 0.18.0):

    print(reshaped_df)
    sale_product_id  1  8  52  312  315
    sale_user_id
    1                1  1   1    5    1

    print(reshaped_df.index.name)
    sale_user_id

    print(reshaped_df.rename_axis(None))
    sale_product_id  1  8  52  312  315
    1                1  1   1    5    1

Another solution that works in pandas below 0.18.0:

    reshaped_df.index.name = None
    print …
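
A self-contained sketch of the same idea follows. The sales frame and its quantity column are hypothetical, since the question's input data is not shown here. It builds a pivot table, then clears the index name with rename_axis; passing axis=1 clears the columns name in the same way.

```python
import pandas as pd

# Hypothetical input; only the two id column names come from the question.
sales = pd.DataFrame({
    "sale_user_id": [1, 1, 2],
    "sale_product_id": [8, 52, 8],
    "quantity": [1, 5, 2],
})

reshaped_df = sales.pivot_table(
    index="sale_user_id",
    columns="sale_product_id",
    values="quantity",
    aggfunc="sum",
)

# Remove the index name only; the columns name is left in place.
flat = reshaped_df.rename_axis(None)

# Optionally remove the columns name as well.
bare = flat.rename_axis(None, axis=1)
```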

Why do I get a KeyError when using pandas apply?

As EdChum answered in the comments, the issue is that apply works column-wise by default (see the docs). Each call therefore receives a single column, whose index holds the row labels, so the column names cannot be accessed. To apply the function to each row instead, pass axis=1:

    test.apply(lambda x: find_max(x, test, 'document_id', 'confidence_level', 'category_id'), axis=1)
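
The difference between the two axes can be seen in a minimal sketch. The small test frame below is hypothetical, and a plain column lookup stands in for find_max, whose body is not shown in the answer.

```python
import pandas as pd

# Hypothetical stand-in for the question's `test` frame.
test = pd.DataFrame({"document_id": [1, 2], "confidence_level": [0.9, 0.4]})

# Default axis=0: the function receives each column as a Series indexed by
# row label, so looking up a column name raises KeyError.
try:
    test.apply(lambda x: x["confidence_level"])
    raised = False
except KeyError:
    raised = True

# axis=1: the function receives each row as a Series indexed by column name.
levels = test.apply(lambda x: x["confidence_level"], axis=1)
```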

How can I subclass a Pandas DataFrame?

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series. The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py As in HYRY’s answer, it seems there are two things you’re trying to accomplish: …
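
As a rough illustration of the pattern the guide describes, the sketch below overrides _constructor so that operations return the subclass, and lists a custom attribute in _metadata. The class name LabeledFrame and the attribute label are hypothetical, and this is not the GeoPandas implementation.

```python
import pandas as pd


class LabeledFrame(pd.DataFrame):
    # Hypothetical subclass: `label` is an extra attribute declared in
    # _metadata so pandas treats it as an attribute rather than a column.
    _metadata = ["label"]

    @property
    def _constructor(self):
        # Tell pandas which class to use when an operation builds a new frame.
        return LabeledFrame


lf = LabeledFrame({"a": [1, 2, 3]})
lf.label = "demo"

# Boolean indexing builds a new frame through _constructor.
subset = lf[lf["a"] > 1]
```

Without the _constructor override, many operations would hand back a plain DataFrame rather than the subclass.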

Pandas groupby with categories with redundant nan

Since Pandas 0.23.0, the groupby method takes a parameter observed that fixes this issue when set to True (False by default). Below is the exact same code as in the question with just observed=True added:

    import pandas as pd

    group_cols = ['Group1', 'Group2', 'Group3']
    df = pd.DataFrame([['A', 'B', 'C', 54.34], …
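
The effect of observed can be seen on a small example. The data below is hypothetical and uses two categorical grouping columns rather than the question's three; observed is passed explicitly in both calls so the comparison does not depend on the default.

```python
import pandas as pd

# Hypothetical data with two categorical grouping columns.
df = pd.DataFrame({
    "Group1": pd.Categorical(["A", "A", "B"]),
    "Group2": pd.Categorical(["X", "Y", "X"]),
    "Value": [1.0, 2.0, 3.0],
})

# observed=False keeps every category combination, including (B, Y),
# which never occurs in the data.
all_combos = df.groupby(["Group1", "Group2"], observed=False)["Value"].sum()

# observed=True keeps only the combinations that actually appear.
seen_only = df.groupby(["Group1", "Group2"], observed=True)["Value"].sum()
```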

Create a pandas DataFrame from generator?

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don’t use .from_records(); just use the constructor, for example:

    import pandas as pd
    someGenerator = ((x, chr(x)) for x in range(48, 127))
    someDf = pd.DataFrame(someGenerator)

Produces:

    type(someDf)
    # pandas.core.frame.DataFrame
    someDf.dtypes
    # 0     int64
    # 1    object
    # dtype: object
    someDf.tail(10) …