Defining a UDF that accepts an Array of objects in a Spark DataFrame?

What you’re looking for is Seq[o.a.s.sql.Row]:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf
    val my_size = udf { subjects: Seq[Row] => subjects.size }

Explanation: The current representation of ArrayType is, as you already know, WrappedArray, so Array won’t work and it is better to stay on the safe side. According to the official specification, the local (external) type for StructType is …

How To Solve KeyError: u"None of [Index([..], dtype='object')] are in the [columns]"

The problem is that there are spaces in your column names; here is what I get when I save your data and load the dataframe as you have done:

    df.columns
    # result:
    Index(['LABEL', ' F1', ' F2', ' F3', ' F4', ' F5', ' X', ' Y', ' Z', ' C1', ' C2'], dtype='object')

so, …
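The excerpt is cut off before it shows the fix. As a minimal sketch, assuming the spaces come from a CSV header written with `, ` separators, two common ways to get clean names are shown below. The sample data is made up for illustration; only the style of the column names mirrors the output above.

```python
import io

import pandas as pd

# Hypothetical CSV text; the header has a space after each comma,
# which is what produces column names such as ' F1'.
raw = "LABEL, F1, F2\na, 1, 2\nb, 3, 4\n"

df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))  # the names keep their leading spaces

# Option 1: strip whitespace from the names after loading.
df.columns = df.columns.str.strip()

# Option 2: let the parser drop whitespace that follows the delimiter.
df2 = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Selecting by the clean names no longer raises a KeyError.
subset = df[["F1", "F2"]]
```

Stripping after loading leaves the file untouched, while `skipinitialspace=True` also trims leading spaces from string values in the data rows.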

Attaching a calculated column to an existing dataframe raises TypeError: incompatible index of inserted column with frame index

The problem is, as the error message says, that the index of the calculated column you want to insert is incompatible with the index of df. The index of df is a simple index:

    In [8]: df.index
    Out[8]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

while the index of the calculated column …
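The excerpt stops before describing the calculated column's index. One common case, assumed here purely for illustration, is a Series carrying a MultiIndex (for example a group key plus the original row label). The sketch below uses hypothetical data and names, and shows how dropping the extra level lets the Series align with the frame again.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})

# Hypothetical calculated column indexed by (group, original row label),
# i.e. a MultiIndex rather than the frame's simple integer index.
calc = pd.Series(
    [1, 3, 3, 7],
    index=pd.MultiIndex.from_arrays([df["group"], df.index]),
)

# Dropping the extra level leaves the original row labels, so the
# Series aligns with df and can be assigned as a new column.
df["running_total"] = calc.droplevel(0)
```

If the calculated column's index has a different shape in your case, the same idea applies: bring it back to the frame's own labels before assigning.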

Rename a single pandas DataFrame column without knowing column name

Should work:

    drugInfo.rename(columns={list(drugInfo)[1]: 'col_1_new_name'}, inplace=True)

Example:

    In [18]: df = pd.DataFrame({'a': randn(5), 'b': randn(5), 'c': randn(5)})
             df
    Out[18]:
              a         b         c
    0 -1.429509 -0.652116  0.515545
    1  0.563148 -0.536554 -1.316155
    2  1.310768 -3.041681 -0.704776
    3 -1.403204  1.083727 -0.117787
    4 -0.040952  0.108155 -0.092292

    In [19]: df.rename(columns={list(df)[1]: 'col1_new_name'}, inplace=True)
             df
    Out[19]:
              a  col1_new_name         c
    0 -1.429509      -0.652116  0.515545
    …

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance

Aware that this might be a reply that some will find highly controversial, I’m still posting my opinion here… Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it’s the user’s responsibility to fix it, not the module’s responsibility to propose code refactoring steps. Rationale for this harsh …
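The excerpt only says to ignore the warning; it does not show how, or whether it means silencing it at all. If you do want it out of your logs, one possible sketch (an assumption, not necessarily what the answer goes on to recommend) uses the standard `warnings` module with `pd.errors.PerformanceWarning`. The frame and column names are hypothetical.

```python
import warnings

import pandas as pd

# Silence only this warning category, and only inside the block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
    frame = pd.DataFrame(index=range(3))
    for i in range(150):
        # Column-by-column insertion is the pattern the warning is about.
        frame[f"col_{i}"] = i
```

Scoping the filter with `catch_warnings()` keeps other performance warnings visible elsewhere in the program.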

Create multiindex from existing dataframe

You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.

    df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()

You should also be able to simply do this:

    df.set_index(['user_id', 'account_num', 'dates'])

Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num …
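To make the difference between the two approaches concrete, here is a small sketch on made-up data. The column names follow the excerpt; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data; the first two rows share the same key combination.
df = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "account_num": [10, 10, 20],
        "dates": ["2020-01-01", "2020-01-01", "2020-01-02"],
        "sales": [5.0, 7.0, 3.0],
    }
)
keys = ["user_id", "account_num", "dates"]

# groupby collapses duplicate key combinations while building the MultiIndex.
summed = df.groupby(keys)["sales"].sum().to_frame()

# set_index keeps every row, so duplicate keys survive in the index.
indexed = df.set_index(keys)
print(indexed.index.is_unique)
```

With this data, `summed` has two rows and `indexed` has three, which is the duplicate issue the excerpt warns about.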

How to surface plot/3d plot from dataframe

.plot_surface() takes 2D arrays as inputs, not 1D DataFrame columns. This has been explained quite well here, along with the below code that illustrates how one could arrive at the required format using DataFrame input. Reproduced below with minor modifications like additional comments. Alternatively, however, there is .plot_trisurf() which uses 1D inputs. I’ve added an …
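The code the excerpt refers to is not included here. As a rough sketch of the reshaping step only, assuming long-format data with one row per grid point and hypothetical columns `x`, `y` and `z`, the 2D arrays could be built like this:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (x, y) grid point.
df = pd.DataFrame(
    {
        "x": [0, 1, 2, 0, 1, 2],
        "y": [0, 0, 0, 1, 1, 1],
        "z": [0.0, 1.0, 4.0, 1.0, 2.0, 5.0],
    }
)

# Pivot so rows are y values and columns are x values, giving a 2D grid of z.
grid = df.pivot(index="y", columns="x", values="z")
X, Y = np.meshgrid(grid.columns.to_numpy(), grid.index.to_numpy())
Z = grid.to_numpy()

# X, Y and Z now share the shape (2, 3) and could be passed to
# plot_surface(X, Y, Z) on a 3D axes; plot_trisurf would instead take
# the original 1D columns df["x"], df["y"], df["z"].
```

The pivot only works cleanly when every (x, y) pair appears exactly once; for scattered points, the 1D route via `plot_trisurf` avoids the reshaping entirely.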