PySpark and broadcast join example

Question

Spark 1.3 doesn’t support broadcast joins using DataFrame. In Spark >= 1.5.0 you can use broadcast function to apply broadcast joins:

from pyspark.sql.functions import broadcast

data1.join(broadcast(data2), data1.id == data2.id)

For older versions the only option is to convert to RDD and apply the same logic as in other languages. Roughly something like this:

from pyspark.sql import Row
from pyspark.sql.types import StructType

# Create a dictionary where keys are join keys
# and values are lists of rows
data2_bd = sc.broadcast(
    data2.map(lambda r: (r.id, r)).groupByKey().collectAsMap())


# Define a new row with fields from both DFs
output_row = Row(*data1.columns + data2.columns)

# And an output schema
output_schema = StructType(data1.schema.fields + data2.schema.fields)

# Given row x, extract a list of corresponding rows from broadcast
# and output a list of merged rows
def gen_rows(x):
    return [output_row(*x + y) for y in data2_bd.value.get(x.id, [])]

# flatMap and create a new data frame
joined = data1.rdd.flatMap(lambda row: gen_rows(row)).toDF(output_schema)

Leave a Comment Cancel reply