You are on the right track, except that DataFrame.align
doesn’t combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let’s try an example:
Initialising two dataframes with some descriptive column names and toy data:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
Now, let’s view these data frames by themselves:
print(df1)
D B E A 1 1 2 3 4 2 6 7 8 9
print(df2)
A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Let’s align these two dataframes, aligning by columns (axis=1
), and performing an outer join on column labels (join='outer'
):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
A B C D E 1 4 2 NaN 1 3 2 9 7 NaN 6 8 A B C D E 2 10 20 30 40 NaN 3 60 70 80 90 NaN 4 600 700 800 900 NaN
A few things to notice here:
- The columns in
df1
have been rearranged so they align with the columns indf2
. - There is a column labelled
'C'
that has been added todf1
, and a column labelled'E'
that has been added todf2
. These columns have been filled withNaN
. This is because we performed an outer join on the column labels. - None of the values inside the DataFrames have been altered.
- Note that the row labels are not aligned;
df2
has rows3
and4
, whereasdf1
does not. This is because we requested alignment on columns (axis=1
).
What happens if we align on both rows and columns, but change the join
parameter to 'right'
?
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
A B C D 2 9.0 7.0 NaN 6.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Note that:
- Only the columns and rows that are found in the “right” dataframe (
df2
) are retained. Column'E'
is no longer present. This is because we made a right join on both the column and row labels. - Rows with labels
3
and4
have been added todf1
, filled withNan
. This is because we requested alignment on both rows and columns (axis=None
). - Row labels are now aligned as well as column labels.
- Again, note that none of the actual values within the dataframes have been altered.
Finally, let’s have a look at the code in the question, with join='inner'
and axis=1
:
a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
D B A 1 1 2 4 2 6 7 9 D B A 2 40 20 10 3 90 70 60 4 900 700 600
- Only column labels are aligned (
axis=1
). - Only column labels that are present in both
df1
anddf2
are retained (join='inner'
).
In summary, use DataFrame.align()
when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.