Spark union column order

Spark's union follows standard SQL and therefore resolves columns by position, not by name. The API documentation says the same:

Return a new DataFrame containing union of rows in this and another frame.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).
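To make the positional behaviour concrete, here is a minimal plain-Python sketch that models it. It is not Spark code: the Frame type and the union_by_position helper are hypothetical names invented for this illustration. In Spark the corresponding call would be df1.union(df2).

```python
from typing import NamedTuple


class Frame(NamedTuple):
    """Hypothetical stand-in for a DataFrame: column names plus rows."""
    columns: tuple
    rows: list


def union_by_position(left: Frame, right: Frame) -> Frame:
    """Model of UNION ALL: keep left's column names, append right's rows unchanged."""
    if len(left.columns) != len(right.columns):
        raise ValueError("both frames need the same number of columns")
    # Column names of the right frame are ignored; only position matters.
    return Frame(left.columns, left.rows + right.rows)


people = Frame(("name", "city"), [("Ann", "Oslo")])
swapped = Frame(("city", "name"), [("Rome", "Bob")])

result = union_by_position(people, swapped)
# The second row keeps its original value order, so "Rome" ends up under "name".
print(result.columns)
print(result.rows)
```

The point of the sketch is that nothing fails: the result looks valid, but the values of the second frame sit under the wrong column names.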

Since Spark 2.3 you can use unionByName to union two DataFrames with the columns resolved by name instead of by position.
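The by-name resolution can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Spark's implementation: union_by_name here is a hypothetical helper that works on lists of column names and row tuples, and it raises ValueError when the two sets of column names differ. In Spark the corresponding call would be df1.unionByName(df2).

```python
def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Model of a by-name union: reorder right's values to match left's columns."""
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("both frames need the same set of column names")
    # For each left column, find where that column sits in the right frame.
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return list(left_cols), left_rows + reordered


cols, rows = union_by_name(
    ["name", "city"], [("Ann", "Oslo")],
    ["city", "name"], [("Rome", "Bob")],
)
print(cols)  # ['name', 'city']
print(rows)  # [('Ann', 'Oslo'), ('Bob', 'Rome')]
```

Here the second frame's values are rearranged before the rows are appended, so each value stays under the column name it was created with.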
