Spark add new column to dataframe with value from previous row
You can use the `lag` window function:

```python
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])

w = Window().partitionBy().orderBy(col("id"))
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+
```

Note that `lag` yields `null` for the first row of each window (dropped here by `na.drop()`), and a `Window` with an empty `partitionBy()` moves all rows to a single partition, which does not scale to large data.