Compute size of Spark dataframe – SizeEstimator gives unexpected results

Unfortunately, I was not able to get reliable estimates from SizeEstimator, but I found another strategy: if the dataframe is cached, we can extract its size from queryExecution as follows:

// Materialize the cache by running an action over every row
df.cache.foreach(_ => ())

// Re-optimize the logical plan: with the dataframe cached, the plan
// statistics reflect the actual in-memory size instead of a rough estimate
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes = spark.sessionState.executePlan(
    catalyst_plan).optimizedPlan.stats.sizeInBytes
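
The resulting stats.sizeInBytes is a BigInt counting bytes. A minimal sketch for printing it in gigabytes (the binary 2^30 divisor and the label are my own choices):

// Convert the BigInt byte count to a human-readable figure
println(f"Estimated size: ${df_size_in_bytes.toDouble / (1L << 30)}%.1f GB")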

For the example dataframe, this gives exactly 4.8 GB (which also corresponds to the file size when writing to an uncompressed Parquet table).
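
To reproduce that cross-check, a sketch that writes the dataframe as uncompressed Parquet so the on-disk size can be compared (the output path is an assumption):

// Write uncompressed Parquet; /tmp/df_size_check is a hypothetical path
df.write
  .mode("overwrite")
  .option("compression", "none")
  .parquet("/tmp/df_size_check")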

This approach has the disadvantage that the dataframe needs to be cached, but that is not a problem in my case.
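
If caching is only needed for the measurement itself, the cached data can be released afterwards via the standard Dataset API:

// Drop the cached blocks once the size has been read
df.unpersist()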

EDIT: Replaced df.cache.foreach(_=>_) with df.cache.foreach(_ => ()), thanks to @DavidBenedeki for pointing it out in the comments (_ => () ignores each row and returns Unit, which is what foreach expects).
