Create Spark DataFrame. Can not infer schema for type

Question

SparkSession.createDataFrame, which toDF uses under the hood, requires an RDD or list of Row/tuple/list/~~dict~~* objects, or a pandas.DataFrame, unless a schema with a DataType is provided. A bare float is none of these, so wrap each float in a one-element tuple like this:

myFloatRdd.map(lambda x: (x, )).toDF()

or even better:

from pyspark.sql import Row

row = Row("val") # Or some other column name
myFloatRdd.map(row).toDF()
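
If the floats start out as a local Python list rather than an RDD, the same tuple-wrapping idea should carry over. The following is a minimal sketch, not part of the original answer: the helper names `as_single_column_rows` and `floats_to_dataframe` and the default column name `val` are made up for illustration, and `spark` is assumed to be an already running SparkSession.

```python
def as_single_column_rows(values):
    """Wrap each scalar in a one-element tuple so every item looks like a row."""
    return [(v,) for v in values]


def floats_to_dataframe(spark, values, column="val"):
    """Build a one-column DataFrame from plain Python floats.

    Assumptions: `spark` is an existing SparkSession, and `column` is an
    illustrative column name chosen for this sketch.
    """
    rows = as_single_column_rows(values)
    # A list of column names is accepted as the schema argument; Spark then
    # infers the column type from the data (Python floats become doubles).
    return spark.createDataFrame(rows, [column])
```

With a live session, `floats_to_dataframe(spark, [1.0, 2.0, 3.0]).show()` should display a single `val` column. Passing the bare list `[1.0, 2.0, 3.0]` without a schema is what triggers the "Can not infer schema" error.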

To create a DataFrame from a list of scalars you’ll have to use SparkSession.createDataFrame directly and provide a schema***:

from pyspark.sql.types import FloatType

df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())

df.show()

## +-----+
## |value|
## +-----+
## |  1.0|
## |  2.0|
## |  3.0|
## +-----+

but for a simple range it would be better to use SparkSession.range:

from pyspark.sql.functions import col

spark.range(1, 4).select(col("id").cast("double"))

* No longer supported.

** Spark SQL also provides limited support for schema inference on Python objects exposing __dict__.

*** Supported only in Spark 2.0 or later.
