Pandas DataFrame to Spark DataFrame: “Can not merge type” error

Long story short, don’t depend on schema inference. It is expensive and tricky in general. In particular, some columns (for example event_dt_num) in your data have missing values, which pushes Pandas to represent them as mixed types (string for non-missing entries, NaN for missing values). If you’re in doubt it is better to read all … Read more
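A hedged sketch of the explicit-schema approach (the column names, types, and sample frame below are illustrative assumptions, not the asker’s actual data):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: event_dt_num mixes strings and NaN,
# so pandas stores it with object dtype.
pdf = pd.DataFrame({"id": ["a", "b"],
                    "event_dt_num": ["20200101", None]})

# Passing an explicit schema sidesteps inference over mixed-type columns;
# reading everything as strings and casting later is the conservative choice.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_dt_num", StringType(), True),
])
sdf = spark.createDataFrame(pdf, schema=schema)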

Cleanest, most efficient syntax to perform DataFrame self-join in Spark

There are at least two different ways you can approach this: either by aliasing:

df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")

or by using name-based equality joins:

// Note that it will result in ambiguous column names,
// so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))
df.join(df, Seq("foo"))

In general column renaming, while … Read more
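For readers coming from PySpark, a rough Python equivalent of both approaches (the toy DataFrame here is an assumption for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x"), (1, "y"), (2, "z")], ["foo", "bar"])

# Alias both sides so the join condition can tell the columns apart.
aliased = df.alias("df1").join(df.alias("df2"),
                               col("df1.foo") == col("df2.foo"))

# Name-based equality join; "foo" appears only once in the result.
by_name = df.join(df, ["foo"])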

How to get rid of a multilevel index after using pivot_table in pandas?

You need to remove only the index name; use rename_axis (new in pandas 0.18.0):

print(reshaped_df)
sale_product_id  1  8  52  312  315
sale_user_id
1                1  1   1    5    1

print(reshaped_df.index.name)
sale_user_id

print(reshaped_df.rename_axis(None))
sale_product_id  1  8  52  312  315
1                1  1   1    5    1

Another solution, working in pandas below 0.18.0:

reshaped_df.index.name = None
print … Read more
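A compact end-to-end sketch of the rename_axis fix (the input frame is a made-up example in the spirit of the question):

import pandas as pd

df = pd.DataFrame({"sale_user_id": [1, 1, 1, 1, 1],
                   "sale_product_id": [1, 8, 52, 312, 315],
                   "kwh": [1, 1, 1, 5, 1]})

reshaped_df = df.pivot_table(index="sale_user_id",
                             columns="sale_product_id",
                             values="kwh")

# Drop the leftover index name so the extra header row disappears.
print(reshaped_df.rename_axis(None))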

How can I subclass a Pandas DataFrame?

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series. The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py As in HYRY’s answer, it seems there are two things you’re trying to accomplish: … Read more
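A minimal sketch of the pattern the official guide describes, assuming you mainly need operations to preserve your subclass and a custom attribute:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    # Custom attributes that pandas should try to propagate.
    _metadata = ["source"]

    @property
    def _constructor(self):
        # Makes slicing, arithmetic, etc. return MyDataFrame
        # instead of a plain DataFrame.
        return MyDataFrame

mdf = MyDataFrame({"a": [1, 2, 3]})
mdf.source = "sensor-42"
subset = mdf[mdf["a"] > 1]
print(type(subset).__name__, getattr(subset, "source", None))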

How to properly add hours to a pandas.tseries.index.DatetimeIndex?

You can use pd.DateOffset:

test[1].index + pd.DateOffset(hours=16)

pd.DateOffset accepts the same keyword arguments as dateutil.relativedelta. The problem you encountered was due to this bug, which has been fixed in Pandas version 0.14.1 (note the 16 is interpreted as nanoseconds rather than hours):

In [242]: pd.to_timedelta(16, unit="h")
Out[242]: numpy.timedelta64(16, 'ns')

If you upgrade, your original code should work.
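For a self-contained illustration (the index below is a stand-in for test[1].index):

import pandas as pd

idx = pd.date_range("2014-01-01", periods=3, freq="D")

# Shift every timestamp in the index forward by 16 hours.
print(idx + pd.DateOffset(hours=16))
# 2014-01-01 16:00:00, 2014-01-02 16:00:00, 2014-01-03 16:00:00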

What is the fastest way to load a big CSV file in a notebook to work with Python pandas?

Here are the results of my read and write comparison for the DF (shape: 4000000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB), comparing the following storage formats (CSV, CSV.gzip, Pickle, HDF5 with various compression settings):

storage    read_s  write_s  size_ratio_to_CSV
CSV        17.900    69.00              1.000
CSV.gzip   18.900   186.00              0.047
Pickle      0.173     1.77  … Read more
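A sketch of how such a comparison might be timed (the file paths and the synthetic frame are placeholders; to_hdf additionally requires the PyTables package):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 6), columns=list("abcdef"))

def timed(fn):
    # Return the wall-clock seconds taken by fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

results = {
    "CSV write": timed(lambda: df.to_csv("df.csv", index=False)),
    "CSV read": timed(lambda: pd.read_csv("df.csv")),
    "Pickle write": timed(lambda: df.to_pickle("df.pkl")),
    "Pickle read": timed(lambda: pd.read_pickle("df.pkl")),
    "HDF5 write": timed(lambda: df.to_hdf("df.h5", key="df", mode="w",
                                          complevel=9, complib="blosc")),
    "HDF5 read": timed(lambda: pd.read_hdf("df.h5", key="df")),
}
for name, seconds in results.items():
    print(f"{name:>12}: {seconds:.3f}s")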