PySpark dataframe convert unusual string format to Timestamp

Question

Spark >= 2.2

from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt="2016_08_21 11_31_08")])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt="2016_08_21 11_31_08")])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
    # For Spark <= 1.5
    # See issues.apache.org/jira/browse/SPARK-11724 
    .cast("double")
    .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+

In both cases the format string should be compatible with Java SimpleDateFormat.

Leave a Comment Cancel reply