You are on the right track, except that DataFrame.align doesn’t combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let’s try an example:
Initialising two dataframes with some descriptive column names and toy data:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
Now, let’s view these data frames by themselves:
print(df1)
D B E A 1 1 2 3 4 2 6 7 8 9
print(df2)
A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Let’s align these two dataframes, aligning by columns (axis=1), and performing an outer join on column labels (join='outer'):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
A B C D E 1 4 2 NaN 1 3 2 9 7 NaN 6 8 A B C D E 2 10 20 30 40 NaN 3 60 70 80 90 NaN 4 600 700 800 900 NaN
A few things to notice here:
- The columns in
df1have been rearranged so they align with the columns indf2. - There is a column labelled
'C'that has been added todf1, and a column labelled'E'that has been added todf2. These columns have been filled withNaN. This is because we performed an outer join on the column labels. - None of the values inside the DataFrames have been altered.
- Note that the row labels are not aligned;
df2has rows3and4, whereasdf1does not. This is because we requested alignment on columns (axis=1).
What happens if we align on both rows and columns, but change the join parameter to 'right'?
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
A B C D 2 9.0 7.0 NaN 6.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Note that:
- Only the columns and rows that are found in the “right” dataframe (
df2) are retained. Column'E'is no longer present. This is because we made a right join on both the column and row labels. - Rows with labels
3and4have been added todf1, filled withNan. This is because we requested alignment on both rows and columns (axis=None). - Row labels are now aligned as well as column labels.
- Again, note that none of the actual values within the dataframes have been altered.
Finally, let’s have a look at the code in the question, with join='inner' and axis=1:
a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
D B A 1 1 2 4 2 6 7 9 D B A 2 40 20 10 3 90 70 60 4 900 700 600
- Only column labels are aligned (
axis=1). - Only column labels that are present in both
df1anddf2are retained (join='inner').
In summary, use DataFrame.align() when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.