DataFrame / Dataset groupBy behaviour/optimization

Yes, it is "smart enough". groupBy performed on a DataFrame is not the same operation as groupBy performed on a plain RDD. In the scenario you've described there is no need to move raw data at all. Let's create a small example to illustrate that:

    val df = sc.parallelize(Seq( ("a", "foo", 1), ("a", "foo", 3), …

Adding a new column in pandas dataframe from another dataframe with differing indices

Assuming your dataframes are the same size, you can assign RESULT_df['RESULT'].values to your original dataframe. This way, you don't have to worry about indexing issues.

    # pre 0.24
    feature_file_df['RESULT'] = RESULT_df['RESULT'].values
    # >= 0.24
    feature_file_df['RESULT'] = RESULT_df['RESULT'].to_numpy()

Minimal Code Sample

    df
              A         B
    0 -1.202564  2.786483
    1  0.180380  0.259736
    2 -0.295206  1.175316
    …
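To illustrate why the positional assignment above sidesteps index alignment, here is a minimal sketch. The data and index labels are made up for illustration, and the names `aligned` and `positional` are hypothetical; only `feature_file_df` and `RESULT_df` come from the excerpt.

```python
import pandas as pd

# Hypothetical data: two frames of equal length but with different index labels.
feature_file_df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
RESULT_df = pd.DataFrame({"RESULT": [10, 20, 30]}, index=[5, 6, 7])

# Assigning the Series directly aligns on index labels.
# No labels match here, so every value in the new column is NaN.
aligned = feature_file_df.copy()
aligned["RESULT"] = RESULT_df["RESULT"]

# Assigning the underlying NumPy array is positional and ignores the index.
positional = feature_file_df.copy()
positional["RESULT"] = RESULT_df["RESULT"].to_numpy()
```

The positional form only makes sense when both frames have the same number of rows in the same order; otherwise the assignment raises a length error or silently pairs up unrelated rows.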

Filling in date gaps in MultiIndex Pandas Dataframe

You can make a new MultiIndex based on the Cartesian product of the levels of the existing MultiIndex. Then, reindex your data frame using the new index.

    new_index = pd.MultiIndex.from_product(df.index.levels)
    new_df = df.reindex(new_index)

    # Optional: convert missing values to zero, and convert the data back
    # to integers. See explanation below.
    new_df = …
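A small self-contained sketch of the reindexing step may help. The sample data, the level names, and the variable `filled` are assumptions for illustration. The `fillna(0).astype(int)` line is one plausible way to do the optional conversion the comment describes, not necessarily what the truncated original used.

```python
import pandas as pd

# Hypothetical data: the pair ("2021-01-02", "b") is missing.
index = pd.MultiIndex.from_tuples(
    [("2021-01-01", "a"), ("2021-01-01", "b"), ("2021-01-02", "a")],
    names=["date", "category"],
)
df = pd.DataFrame({"value": [1, 2, 3]}, index=index)

# Build every combination of the existing level values, then reindex.
new_index = pd.MultiIndex.from_product(df.index.levels)
new_df = df.reindex(new_index)  # the missing pair appears as a NaN row

# Assumed conversion: replace NaN with zero and cast back to integers.
filled = new_df.fillna(0).astype(int)
```

Note that this only fills combinations of values that already occur in some level. A date that is absent from the data entirely will not be created by this step.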

pandas reset_index after groupby.value_counts()

You need the name parameter in reset_index, because the Series name is the same as the name of one of the levels of the MultiIndex:

    df_grouped.reset_index(name="count")

Another solution is to rename the Series:

    print (df_grouped.rename('count').reset_index())
       A  Amt  count
    0  1   30      4
    1  1   20      3
    2  1   40      2
    3  2   40      3
    4  2   10      2

More common solution instead …
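For a runnable version of the above, the following sketch rebuilds input data that yields the same counts as the printed output. The construction of `df` and `df_grouped` is an assumption, since the excerpt does not show how they were created.

```python
import pandas as pd

# Hypothetical input chosen to reproduce the counts shown in the output above.
df = pd.DataFrame({
    "A":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Amt": [30, 30, 30, 30, 20, 20, 20, 40, 40, 40, 40, 40, 10, 10],
})

# Assumed origin of df_grouped: value counts of Amt within each group of A.
df_grouped = df.groupby("A")["Amt"].value_counts()

# Name the counts column explicitly so it cannot clash with the "Amt" level.
result = df_grouped.reset_index(name="count")
```

Passing `name` explicitly keeps the call independent of whatever name the counts Series happens to carry.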

Spark: How to translate count(distinct(value)) in Dataframe API’s

What you need is the DataFrame aggregation function countDistinct:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    case class Log(page: String, visitor: String)

    // Convert the (page, visitor) pairs into a DataFrame
    val logs = data.map(p => Log(p._1, p._2)).toDF()

    // Count the distinct visitors for each page
    val result = logs.select("page", "visitor")
      .groupBy('page)
      .agg('page, countDistinct('visitor))

    result.foreach(println)

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

Let's assume your csv looks something like:

    c1,c2
    0.000000,0.968012
    1.000000,2.712641
    2.000000,11.958873
    3.000000,10.889784
    …

I generated the data as such:

    import numpy as np
    from sklearn import datasets, linear_model
    import matplotlib.pyplot as plt

    length = 10
    x = np.arange(length, dtype=float).reshape((length, 1))
    y = x + (np.random.rand(length)*10).reshape((length, 1))

This data is saved to test.csv (just so you …

pandas python how to count the number of records or rows in a dataframe

To get the number of rows in a dataframe use df.shape[0] (and df.shape[1] to get the number of columns). As an alternative you can use len(df) or len(df.index) (and len(df.columns) for the columns). shape is more versatile and more convenient than len(), especially for interactive work (it just needs to be added at the end), but …
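As a quick check that these expressions agree, here is a minimal sketch on a made-up frame. The data and the names `n_rows` and `n_cols` are illustrative only.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows = df.shape[0]  # same value as len(df) and len(df.index)
n_cols = df.shape[1]  # same value as len(df.columns)
```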
