dataframe – Page 16 – Tarik Billa

Dropping a nested column from Spark DataFrame

September 29, 2023 by Tarik

It is just a programming exercise, but you can try something like this:

    import org.apache.spark.sql.{DataFrame, Column}
    import org.apache.spark.sql.types.{StructType, StructField}
    import org.apache.spark.sql.{functions => f}
    import scala.util.Try

    case class DFWithDropFrom(df: DataFrame) {
      def getSourceField(source: String): Try[StructField] = {
        Try(df.schema.fields.filter(_.name == source).head)
      }

      def getType(sourceField: StructField): Try[StructType] = {
        Try(sourceField.dataType.asInstanceOf[StructType])
      }

      def genOutputCol(names: Array[String], source: String): Column = …

Loop over rows of dataframe applying function with if-statement

September 29, 2023 by Tarik

Sort Pandas dataframe and print highest n values

September 29, 2023 by Tarik

I think you can use nlargest (new in pandas version 0.17.0):

    print df
       0  Bytes  Client             Ip
    0  1      1    1000   192.168.10.2
    1  0      0    2000  192.168.10.12
    2  2      2     500   192.168.10.4
    3  3      3     159  192.168.10.56

    print df.nlargest(3, 'Client')
       0  Bytes  Client             Ip
    1  0      0    2000  192.168.10.12
    0  1      1    1000   192.168.10.2
    2 …
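As a minimal sketch of the nlargest call in Python 3 syntax: the sample frame below is an assumption modeled on the excerpt (the unnamed `0` column is left out), and the `sort_values(...).head(3)` line is only an illustrative equivalent, not part of the original answer.

```python
import pandas as pd

# Sample data modeled on the excerpt; in that frame the large values
# sit in the "Client" column.
df = pd.DataFrame({
    "Bytes": [1, 0, 2, 3],
    "Client": [1000, 2000, 500, 159],
    "Ip": ["192.168.10.2", "192.168.10.12", "192.168.10.4", "192.168.10.56"],
})

# Keep the three rows with the largest "Client" values, largest first.
top3 = df.nlargest(3, "Client")
print(top3)

# Sorting in descending order and taking the head selects the same rows here.
top3_sorted = df.sort_values("Client", ascending=False).head(3)
```

The original row labels are preserved, so the result is indexed 1, 0, 2 rather than 0, 1, 2.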

Spark – Group by HAVING with dataframe syntax?

September 29, 2023 by Tarik

Yes, a dedicated HAVING method doesn’t exist in the DataFrame API. You express the same logic with agg followed by where: df.groupBy(someExpr).agg(someAgg).where(somePredicate)

How can I use the row.names attribute to order the rows of my dataframe in R?

September 28, 2023 by Tarik

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

September 25, 2023 by Tarik

You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:

    newdf = df.notnull().astype('int')

If you really want to write into your original DataFrame, this will work:

    df.loc[~df.isnull()] = 1 # …
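A small runnable sketch of the notnull-then-cast approach; the sample frame and the name `flags` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with NaN in both a numeric and a mixed column.
df = pd.DataFrame({
    "a": [1.5, np.nan, 3.0],
    "b": [np.nan, "x", "y"],
})

# notnull() yields a boolean frame; casting to int maps True -> 1, False -> 0.
flags = df.notnull().astype(int)
print(flags)
```

This builds a new frame and leaves `df` unchanged.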

How to get maximum length of each column in the data frame using pandas python

September 24, 2023 by Tarik

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions. You can use pd.DataFrame.select_dtypes to select object columns.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                       'B': ['a', 'abcde', 'abc'],
                       'C': [1, 2.5, 1.5]})

    measurer = np.vectorize(len)

    # Max length for all columns
    res1 = measurer(df.values.astype(str)).max(axis=0)

…
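The all-columns variant can be run end to end as below. This is a sketch using the excerpt's sample frame; the names `max_lengths` and `by_column` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["abc", "de", "abcd"],
    "B": ["a", "abcde", "abc"],
    "C": [1, 2.5, 1.5],
})

# np.vectorize wraps len so it can be applied to every array element.
measurer = np.vectorize(len)

# Convert every cell to its string form, measure it, then take the
# maximum down each column (axis=0).
max_lengths = measurer(df.values.astype(str)).max(axis=0)

# Pair each column name with its maximum string length.
by_column = {col: int(n) for col, n in zip(df.columns, max_lengths)}
print(by_column)
```

Numeric cells are measured by their string form, so 2.5 counts as three characters.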

Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

September 23, 2023 by Tarik

There’s actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects. A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, and not as individual Python objects. Conversely, a Python list, …
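To make the comparison concrete, here is a minimal sketch that counts (A, B) pairs both ways and checks that the results agree. The sample data is invented, and the sketch does not measure timing; it only shows that zipping the arrays hands Python one NumPy scalar object per element.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical sample data with two integer columns.
df = pd.DataFrame({"A": [1, 1, 2, 2, 2], "B": [7, 7, 8, 8, 9]})

# Count (A, B) pairs with pandas.
by_groupby = df.groupby(["A", "B"]).size().to_dict()

# Count the same pairs with Counter; iterating the arrays through zip
# yields an individual scalar object for every element of each tuple.
pairs = list(zip(df.A.values, df.B.values))
by_counter = dict(Counter(pairs))

first_a = pairs[0][0]
print(type(first_a))             # a NumPy integer scalar, not a plain int
print(by_groupby == by_counter)  # both approaches give the same counts
```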

pandas: merged (inner join) data frame has more rows than the original ones

September 23, 2023 by Tarik

Because you have duplicates of the merge column in both data sets, you’ll get k * m rows with that merge column value, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.

Try drop_duplicates:

    dfa = …
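The k * m behaviour is easy to reproduce on toy data. In this sketch the frames `left` and `right` and their columns are hypothetical, and deduplicating on the key alone is just one possible way to apply drop_duplicates.

```python
import pandas as pd

# "a" appears k = 2 times on the left and m = 3 times on the right.
left = pd.DataFrame({"key": ["a", "a", "b"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "r": [10, 20, 30, 40]})

# Inner join: every left "a" row pairs with every right "a" row.
merged = left.merge(right, on="key", how="inner")
rows_for_a = int((merged["key"] == "a").sum())
print(len(merged), rows_for_a)

# Dropping duplicate keys on both sides first gives one row per key.
deduped = left.drop_duplicates(subset="key").merge(
    right.drop_duplicates(subset="key"), on="key", how="inner"
)
print(len(deduped))
```

Here the merged frame has 2 * 3 rows for "a" plus 1 * 1 for "b", which is more than either input.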

How can I set index while converting dictionary to dataframe?

September 22, 2023 by Tarik

Use set_index:

    df = pd.DataFrame(dictionary, columns=['Date', 'Open', 'Close'])
    df = df.set_index('Date')
    print (df)
                          Open  Close
    Date
    2016/11/22 07:00:00  47.47  47.48
    2016/11/22 06:59:00  47.46  47.45
    2016/11/22 06:58:00  47.38  47.40

Or use inplace:

    df = pd.DataFrame(dictionary, columns=['Date', 'Open', 'Close'])
    df.set_index('Date', inplace=True)
    print (df)
                          Open  Close
    Date
    2016/11/22 07:00:00  47.47  47.48
    2016/11/22 06:59:00  47.46  47.45
    2016/11/22 06:58:00  47.38 …
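A self-contained version of the set_index approach follows. The excerpt does not show what `dictionary` looks like, so the column-oriented dict below is an assumption built from the printed values.

```python
import pandas as pd

# Assumed shape of the input: one list per column (not shown in the excerpt).
dictionary = {
    "Date": ["2016/11/22 07:00:00", "2016/11/22 06:59:00", "2016/11/22 06:58:00"],
    "Open": [47.47, 47.46, 47.38],
    "Close": [47.48, 47.45, 47.40],
}

# Build the frame with an explicit column order, then move "Date" into the index.
df = pd.DataFrame(dictionary, columns=["Date", "Open", "Close"]).set_index("Date")
print(df)

# Rows can now be looked up by their date label.
open_at_0659 = df.loc["2016/11/22 06:59:00", "Open"]
```

The dates stay plain strings in this sketch; converting them with pd.to_datetime before set_index would give a DatetimeIndex instead.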