Dropping a nested column from Spark DataFrame

It is just a programming exercise, but you can try something like this:

    import org.apache.spark.sql.{DataFrame, Column}
    import org.apache.spark.sql.types.{StructType, StructField}
    import org.apache.spark.sql.{functions => f}
    import scala.util.Try

    case class DFWithDropFrom(df: DataFrame) {
      def getSourceField(source: String): Try[StructField] = {
        Try(df.schema.fields.filter(_.name == source).head)
      }

      def getType(sourceField: StructField): Try[StructType] = {
        Try(sourceField.dataType.asInstanceOf[StructType])
      }

      def genOutputCol(names: Array[String], source: String): Column = …

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:

    newdf = df.notnull().astype('int')

If you really want to write into your original DataFrame, this will work:

    df.loc[~df.isnull()] = 1 # …
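As a minimal sketch of the notnull() cast described above, the following builds a small DataFrame and converts it to a 0/1 mask. The frame contents and the column names x and y are made up for illustration and are not from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numeric column and one mixed column, each with a NaN.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, "b", "c"]})

# notnull() yields a boolean frame (True where a value is present);
# casting to int turns True into 1 and False into 0.
newdf = df.notnull().astype("int")

print(newdf)
```

This returns a new DataFrame and leaves df itself unchanged.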

How to get maximum length of each column in the data frame using pandas python

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions. You can use pd.DataFrame.select_dtypes to select object columns.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                       'B': ['a', 'abcde', 'abc'],
                       'C': [1, 2.5, 1.5]})

    measurer = np.vectorize(len)

Max length for all columns

    res1 = measurer(df.values.astype(str)).max(axis=0)

…
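The same idea can be restricted to the text columns only. The sketch below reuses the sample frame from the excerpt and names the string columns explicitly (an assumption made here to keep the example independent of dtype details; select_dtypes is the route the answer itself mentions). The names text_cols and max_len are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["abc", "de", "abcd"],
                   "B": ["a", "abcde", "abc"],
                   "C": [1, 2.5, 1.5]})

# Apply Python's len element-wise over a NumPy array.
measurer = np.vectorize(len)

# Assumption: the text columns are listed by hand for this sketch.
text_cols = ["A", "B"]

# Convert the selected columns to a 2-D array of strings, measure each cell,
# then take the maximum down each column (axis=0).
lengths = measurer(df[text_cols].to_numpy().astype(str))
max_len = dict(zip(text_cols, lengths.max(axis=0)))

print(max_len)
```

Keeping the result as a dict keyed by column name makes it easy to look up a single column's width afterwards.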

Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

There’s actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to NumPy arrays being stored in memory in a fundamentally different way than Python objects. A NumPy array, such as np.arange(10), is essentially stored as a contiguous block of memory, and not as individual Python objects. Conversely, a Python list, …
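To make the point about element types concrete, here is a small sketch, assuming a toy frame with integer columns A and B (the data is invented for illustration). It shows that iterating a NumPy array produces NumPy scalar objects, that tolist() produces plain Python ints, and that counting pairs with collections.Counter agrees with groupby(...).size(). It does not measure timing.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical data, not from the original question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 2], "B": [3, 3, 4, 4, 5]})

# Iterating the array wraps each stored value in a NumPy scalar object.
first_from_array = next(iter(df.A.values))

# tolist() converts the whole array to plain Python ints in one pass.
first_from_list = df.A.values.tolist()[0]

# Count (A, B) pairs two ways.
counts_counter = Counter(zip(df.A.values.tolist(), df.B.values.tolist()))
counts_groupby = df.groupby(["A", "B"]).size().to_dict()

print(type(first_from_array), type(first_from_list))
print(dict(counts_counter) == counts_groupby)
```

Converting with tolist() before zipping is one way to avoid creating a NumPy scalar per element inside the loop; whether it is faster for a given dataset is worth timing rather than assuming.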

How can I set index while converting dictionary to dataframe?

Use set_index:

    df = pd.DataFrame(dictionary, columns=['Date', 'Open', 'Close'])
    df = df.set_index('Date')
    print (df)
                          Open  Close
    Date
    2016/11/22 07:00:00  47.47  47.48
    2016/11/22 06:59:00  47.46  47.45
    2016/11/22 06:58:00  47.38  47.40

Or use inplace:

    df = pd.DataFrame(dictionary, columns=['Date', 'Open', 'Close'])
    df.set_index('Date', inplace=True)
    print (df)
                          Open  Close
    Date
    2016/11/22 07:00:00  47.47  47.48
    2016/11/22 06:59:00  47.46  47.45
    2016/11/22 06:58:00  47.38  …
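A self-contained sketch of the set_index approach follows. The excerpt does not show how dictionary is built, so the dict-of-lists layout here is an assumption; the values are the ones visible in the printed output.

```python
import pandas as pd

# Assumed shape of the input: one list per column.
dictionary = {
    "Date": ["2016/11/22 07:00:00", "2016/11/22 06:59:00", "2016/11/22 06:58:00"],
    "Open": [47.47, 47.46, 47.38],
    "Close": [47.48, 47.45, 47.40],
}

# Build the frame with an explicit column order, then move 'Date' into the index.
df = pd.DataFrame(dictionary, columns=["Date", "Open", "Close"]).set_index("Date")

print(df)
```

Chaining set_index onto the constructor avoids the intermediate variable and gives the same result as the two-step version.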
