spark-csv – Tarik Billa

Provide schema while reading csv file as a dataframe in Scala Spark

February 4, 2023 by Tarik

Try the code below; you do not need to specify the schema. When inferSchema is set to true, Spark infers the column types from your CSV file.

    val pagecount = sqlContext.read.format("csv")
      .option("delimiter", " ").option("quote", "")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to specify the schema manually, you can do it as below:

    import org.apache.spark.sql.types._
    val customSchema = StructType(Array( … Read more
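To make the manual-schema option concrete, here is a minimal sketch of reading a CSV file with an explicit schema. It is not the schema from the post: the column names (name, visits), the sample data, and the helper readWithSchema are hypothetical and chosen only for illustration. It also assumes Spark 2.x or later, where SparkSession replaces sqlContext.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Hypothetical columns, for illustration only.
  val sampleSchema: StructType = StructType(Array(
    StructField("name", StringType, nullable = true),
    StructField("visits", IntegerType, nullable = true)
  ))

  // Read a CSV file with the schema supplied up front instead of inferSchema.
  def readWithSchema(spark: SparkSession, path: String): DataFrame =
    spark.read.format("csv")
      .option("header", "true")
      .schema(sampleSchema)
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explicit-schema-sketch")
      .getOrCreate()

    // Write a tiny sample file so the example does not depend on external data.
    val file = Files.createTempFile("sample", ".csv")
    Files.write(file, "name,visits\nalpha,3\nbeta,5\n".getBytes("UTF-8"))

    val df = readWithSchema(spark, file.toUri.toString)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

Supplying the schema avoids the extra pass over the data that inferSchema needs, and it pins the column types instead of leaving them to whatever the file happens to contain.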

Write single CSV file using spark-csv

November 13, 2022 by Tarik

Spark creates a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition (preferred if the upstream data is large, but it requires a shuffle):

    df
      .repartition(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("mydata.csv")

or coalesce:

    df
      .coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("mydata.csv")

data frame before … Read more
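The following is a minimal sketch of the coalesce variant that also counts the part files in the output folder. It assumes Spark 2.x or later and uses the built-in "csv" format in place of com.databricks.spark.csv. The sample data, the temporary output location, and the helper writeSingleCsv are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object SingleCsvSketch {
  // Collapse the DataFrame to one partition so only one part file is written.
  def writeSingleCsv(df: DataFrame, outputDir: String): Unit =
    df.coalesce(1)
      .write.format("csv")
      .option("header", "true")
      .mode("overwrite")
      .save(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-csv-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data, spread over several partitions on purpose.
    val df = Seq(("alpha", 3), ("beta", 5), ("gamma", 8))
      .toDF("name", "visits")
      .repartition(3)

    // "mydata.csv" is still a folder; the data lives in a part-* file inside it.
    val outputDir = Files.createTempDirectory("spark-out").resolve("mydata.csv")
    writeSingleCsv(df, outputDir.toUri.toString)

    val partFiles = outputDir.toFile.listFiles()
      .map(_.getName)
      .filter(n => n.startsWith("part-") && n.endsWith(".csv"))
    println(s"part files written: ${partFiles.length}")

    spark.stop()
  }
}
```

With a single partition, all rows pass through one task, so this is only practical when the result fits comfortably on one executor.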

How to show full column content in a Spark Dataframe?

September 30, 2022 by Tarik

results.show(20, false) will not truncate the column content. Check the source: 20 is the default number of rows displayed when show() is called without any arguments.
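As a small illustration of the difference, the sketch below prints the same DataFrame with and without truncation. The DataFrame contents and the names longText and results are made up for the example, and it assumes Spark 2.x or later with a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession

object ShowFullColumnSketch {
  // Hypothetical value, longer than the 20 characters show() displays by default.
  val longText: String = "a value that is considerably longer than twenty characters"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("show-full-column-sketch")
      .getOrCreate()
    import spark.implicits._

    val results = Seq((1, longText)).toDF("id", "text")

    // Default: up to 20 rows, long strings are cut off.
    results.show()

    // Up to 20 rows, full column content.
    results.show(20, false)

    // Same default row count, written with the single-argument overload.
    results.show(false)

    spark.stop()
  }
}
```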