PySpark: join with multiple conditions

Quoting from the Spark docs:

(https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join)

join(other, on=None, how=None)
    Joins with another DataFrame, using the given join expression.

Parameters:
    other – Right side of the join.
    on – a string for the join column name, a list of column names, a join
    expression (Column), or a list of Columns. If on is a string or a list
    of strings indicating the name of the join column(s), the column(s)
    must exist on both sides, and this performs an inner equi-join.
    how – str, default ‘inner’. One of inner, outer, left_outer,
    right_outer, leftsemi.

The following performs a full outer join between df and df2.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
 [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]


>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]

So to join on multiple conditions, pass them as a list of Column expressions (the “condition as a list” option), as in the docs example with cond above.
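Here is a runnable sketch of that option. The session setup and the
DataFrame contents are illustrative assumptions, not from the original
answer; only the join API itself comes from the docs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ["name", "age"])
    df3 = spark.createDataFrame([("Alice", 2), ("Carol", 7)], ["name", "age"])

    # Multiple join conditions passed as a list of Column expressions;
    # Spark combines the elements with AND.
    cond = [df.name == df3.name, df.age == df3.age]
    df.join(df3, cond, "outer").select(df.name, df3.age).show()

    # Equivalent single expression: combine the conditions with &.
    # Parentheses are required because & binds tighter than ==.
    df.join(df3, (df.name == df3.name) & (df.age == df3.age), "outer")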
