dataframe – Page 22 – Tarik Billa

DataFrame / Dataset groupBy behaviour/optimization

September 1, 2023 by Tarik

Yes, it is “smart enough”. groupBy performed on a DataFrame is not the same operation as groupBy performed on a plain RDD. In the scenario you’ve described, there is no need to move raw data at all. Let’s create a small example to illustrate that: val df = sc.parallelize(Seq( ("a", "foo", 1), ("a", "foo", 3), … Read more

Adding a new column in pandas dataframe from another dataframe with differing indices

August 31, 2023 by Tarik

Assuming your dataframes are the same size, you can assign RESULT_df['RESULT'].values to your original dataframe. This way, you don’t have to worry about indexing issues.

    # pre 0.24
    feature_file_df['RESULT'] = RESULT_df['RESULT'].values
    # >= 0.24
    feature_file_df['RESULT'] = RESULT_df['RESULT'].to_numpy()

Minimal Code Sample

    df
              A         B
    0 -1.202564  2.786483
    1  0.180380  0.259736
    2 -0.295206  1.175316
    … Read more
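As a rough illustration of why assigning the raw values sidesteps the index problem, here is a minimal sketch. The frame names follow the excerpt, but the data and index labels are invented for the example.

```python
import pandas as pd

# Two frames of equal length whose index labels do not overlap (made-up data).
feature_file_df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
RESULT_df = pd.DataFrame({"RESULT": [10, 20, 30]}, index=[7, 8, 9])

# Assigning the Series itself aligns on index labels; none match here,
# so every value in the new column ends up as NaN.
aligned = feature_file_df.copy()
aligned["RESULT"] = RESULT_df["RESULT"]

# Assigning the underlying NumPy array ignores the index and pairs rows by position.
positional = feature_file_df.copy()
positional["RESULT"] = RESULT_df["RESULT"].to_numpy()  # pandas >= 0.24

print(aligned)
print(positional)
```

Positional assignment only makes sense when both frames have the same number of rows in the same order; otherwise pandas raises a length-mismatch error or silently pairs the wrong rows.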

R self reference

August 31, 2023 by Tarik

Filling in date gaps in MultiIndex Pandas Dataframe

August 30, 2023 by Tarik

You can make a new multi index based on the Cartesian product of the levels of the existing multi index. Then, re-index your data frame using the new index.

    new_index = pd.MultiIndex.from_product(df.index.levels)
    new_df = df.reindex(new_index)

    # Optional: convert missing values to zero, and convert the data back
    # to integers. See explanation below.
    new_df = … Read more
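The reindexing idea can be tried on a tiny frame. This is only a sketch: the column name `value`, the level names, and the data are assumptions made for illustration, and the zero-fill step is one possible way to do the optional conversion (the excerpt's own line is truncated).

```python
import pandas as pd

# Illustrative frame indexed by (date, name); the pair (2023-01-02, "b") is missing.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2023-01-01"), "a"),
        (pd.Timestamp("2023-01-01"), "b"),
        (pd.Timestamp("2023-01-02"), "a"),
    ],
    names=["date", "name"],
)
df = pd.DataFrame({"value": [1, 2, 3]}, index=index)

# Build every combination of the existing level values, then reindex onto it.
new_index = pd.MultiIndex.from_product(df.index.levels)
new_df = df.reindex(new_index)  # the missing combination shows up as NaN

# One possible follow-up (an assumption, not the excerpt's code):
# replace NaN with zero and go back to integers.
filled = new_df.fillna(0).astype(int)
print(filled)
```

Note that this only fills combinations of values that already occur in some level; a date that is absent from the whole index would not be created by this step.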

pandas reset_index after groupby.value_counts()

August 30, 2023 by Tarik

You need the name parameter in reset_index, because the Series name is the same as the name of one of the levels of the MultiIndex:

    df_grouped.reset_index(name="count")

Another solution is to rename the Series:

    print (df_grouped.rename('count').reset_index())
       A  Amt  count
    0  1   30      4
    1  1   20      3
    2  1   40      2
    3  2   40      3
    4  2   10      2

More common solution instead … Read more
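A small, self-contained sketch of the `name=` approach follows. The column names `A` and `Amt` match the excerpt's output, while the data itself is invented. Passing `name="count"` should work whether or not the installed pandas version names the counted Series after the `Amt` column.

```python
import pandas as pd

# Made-up data: group 1 has Amt 30 twice and 20 once; group 2 has Amt 40 three times.
df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2], "Amt": [30, 30, 20, 40, 40, 40]})

# The result is a Series indexed by a MultiIndex with levels (A, Amt).
df_grouped = df.groupby("A")["Amt"].value_counts()

# Give the values column an explicit name so it cannot clash with the "Amt" level.
result = df_grouped.reset_index(name="count")
print(result)

# Equivalent alternative from the excerpt: rename the Series first.
renamed = df_grouped.rename("count").reset_index()
```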

Spark: How to translate count(distinct(value)) in Dataframe API’s

August 29, 2023 by Tarik

What you need is the DataFrame aggregation function countDistinct:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    case class Log(page: String, visitor: String)

    val logs = data.map(p => Log(p._1, p._2)).toDF()

    val result = logs.select("page", "visitor")
      .groupBy('page)
      .agg('page, countDistinct('visitor))

    result.foreach(println)

Spark DataFrame – Select n random rows

August 29, 2023 by Tarik

In Scala, you can shuffle the rows and then take the top ones:

    import org.apache.spark.sql.functions.rand

    dataset.orderBy(rand()).limit(n)

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

August 28, 2023 by Tarik

Let’s assume your csv looks something like:

    c1,c2
    0.000000,0.968012
    1.000000,2.712641
    2.000000,11.958873
    3.000000,10.889784
    …

I generated the data as such:

    import numpy as np
    from sklearn import datasets, linear_model
    import matplotlib.pyplot as plt

    length = 10
    x = np.arange(length, dtype=float).reshape((length, 1))
    y = x + (np.random.rand(length)*10).reshape((length, 1))

This data is saved to test.csv (just so you … Read more

pandas python how to count the number of records or rows in a dataframe

August 28, 2023 by Tarik

To get the number of rows in a dataframe use: df.shape[0] (and df.shape[1] to get the number of columns). As an alternative you can use len(df) or len(df.index) (and len(df.columns) for the columns). shape is more versatile and more convenient than len(), especially for interactive work (just needs to be added at the end), but … Read more
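The options mentioned above can be compared side by side on a toy frame (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": list("wxyz")})

# shape returns a (rows, columns) tuple.
n_rows, n_cols = df.shape

# len() on the frame or on its index gives the row count;
# len() on the columns gives the column count.
print(n_rows, len(df), len(df.index))
print(n_cols, len(df.columns))
```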
