Pandas - Joining DataFrames with Concat and Append

It is frequently required to join dataframes together, such as when data is loaded from multiple files or even multiple sources. pandas.concat() is used to add the rows of multiple dataframes together and produce a new dataframe with the the combined data. It can also be used to concatenate dataframes by columns as shown below.

Concatenation is the grouping together of sequences of data ans is more generally used when referring to strings. When one string is appended to another or several strings are concatenated together. pandas.concat takes a list of DataFrames or a list of Series and concatenates the data. The DataFrame class also has an append function that is a shortcut to the concat function.



Summary of the pandas.concat and dataframe.append:

Summary of pandas.concat and dataframe.append

pd.concat([df1, df2]) concatenate two dataframes along index
pd.concat([df1, df2], ignore_index = True) concatenate two dataframes along index, ignoring source index values
pd.concat([df1, df2], axis = 1) concatenate two dataframes along columns
pd.concat([df1, df2], axis = 1, join = 'inner') concatenate two dataframes along columns with inner join
pd.concat(rainbows, ignore_index = True) concatenate two dataframes along index and re-index
pd.concat([A, B, C], keys = ['A', 'B', 'C']) concatenate multiple dataframes creating a hierarchical index with keys parameter
df1.append([df2]) append one dataframe to another
df1.append([df2, df3], ignore_index = True) append multiple dataframes to another and re-index


concat

The Pandas.concat function joins a number of dataframes on one of the axis. The default is to join along the index.

 1# Create Dataframe
 2random.seed(42) # use seed to make random data reproducible
 3d1 = pd.DataFrame({
 4    'red': [random.randint(1,20) for _ in range(4)],
 5    'orange': [random.randint(1,20) for _ in range(4)],
 6    'yellow': [random.randint(1,20) for _ in range(4)],
 7    'green': [random.randint(1,20) for _ in range(4)]})
 8"""
 9   red  orange  yellow  green
100    4       8       3      1
111    1       5      19      3
122    9       4      14      7
133    8      18       2      8
14"""
15
16d2 = pd.DataFrame({
17    'red': [random.randint(1,20) for _ in range(4)],
18    'orange': [random.randint(1,20) for _ in range(4)],
19    'yellow': [random.randint(1,20) for _ in range(4)],
20    'green': [random.randint(1,20) for _ in range(4)]})
21"""
22   red  orange  yellow  green
230   17       7      15      6
241   20      18      19     14
252    1      14       9     11
263   18       8       1      9
27"""

Concatenate the two dataframes together to join along the index.

 1print(pd.concat([d1, d2]))
 2
 3"""
 4   red  orange  yellow  green
 50    4       8       3      1
 61    1       5      19      3
 72    9       4      14      7
 83    8      18       2      8
 90   17       7      15      6
101   20      18      19     14
112    1      14       9     11
123   18       8       1      9
13"""

Notice that the original indices from the origina dataframes are preserved by default, which results in duplicate index values in the resultant dataframe.

 1d3 = pd.concat([d1, d2])
 2print(d3.index)
 3"""
 4Int64Index([0, 1, 2, 3, 0, 1, 2, 3], dtype='int64')
 5"""
 6
 7print(d3.columns)
 8"""
 9Index(['red', 'orange', 'yellow', 'green'], dtype='object')
10"""
11
12print(d3.shape)
13"""
14(8, 4)
15"""
16
17print(d3.loc[3])
18"""
19   red  orange  yellow  green
203    8      18       2      8
213   18       8       1      9
22"""

Pandas - concat() defaults to join two dataframes along the index
Pandas - concat by default joins two dataframes along the index



concat parameters

The concat function has a number of parameters, which have defaults. These are explained in the Pandas.concat documentation. The following shows that executing the same concat function with all the parameters set to their defaults returns the same results not specifying any optional parameters.

Pandas.concat Parameters:

Parameter

Note

Default

objs list of DataFrame or Series objects  
axis the axis to concatenate along, (0 = ’index’, 1 = ’columns’) 0
join how to handle indexes on other axis, (options are ‘inner’ or ‘outer’) 'outer'
ignore_index boolean value on preserving source index False
keys sequence used to create hierarchical index using the passed keys None
levels list of sequences used to create a MultiIndex None
names list of names for the levels in hierarchical index None
verify_integrity boolean value to specify whether the new concatenated axis contains duplicates False
sort boolean value to specify sorting non-concatenation axis if it is not already aligned when join is ‘outer’ False
copy boolean value to specify whether data is copied unnecessarily True
 1# These are the default parameters to concat
 2pd.concat([d1, d2],
 3          axis = 0,
 4          join = 'outer',
 5          ignore_index = False,
 6          keys = None,
 7          levels = None,
 8          names = None,
 9          verify_integrity = False,
10          sort = False,
11          copy = True)
12"""
13   red  orange  yellow  green
140    4       8       3      1
151    1       5      19      3
162    9       4      14      7
173    8      18       2      8
180   17       7      15      6
191   20      18      19     14
202    1      14       9     11
213   18       8       1      9
22"""

The ignore_index can be set to True to allow the index values from the source dataframes to be ignored and a new index generated for the resultant dataframe.

 1pd.concat([d1, d2], ignore_index = True)
 2"""
 3   red  orange  yellow  green
 40    4       8       3      1
 51    1       5      19      3
 62    9       4      14      7
 73    8      18       2      8
 84   17       7      15      6
 95   20      18      19     14
106    1      14       9     11
117   18       8       1      9
12"""

Pandas - concat join two dataframes along the index with ignore_index
Pandas - concat join two dataframes along the index with ignore_index



concat with different column names

What happens if the column names of the dataframes to be concatenated are not the same? Create two dataframes and then concatenate them together.

 1# Create Dataframe
 2random.seed(1) # use seed to make random data reproducible
 3df1 = pd.DataFrame({
 4    'red': [random.randint(1,20) for _ in range(4)],
 5    'orange': [random.randint(1,20) for _ in range(4)]})
 6"""
 7   red  orange
 80    5       4
 91   19      16
102    3      15
113    9      16
12"""
13
14
15df2 = pd.DataFrame({
16    'green': [random.randint(1,20) for _ in range(5)],
17    'blue': [random.randint(1,20) for _ in range(5)],
18    'indigo': [random.randint(1,20) for _ in range(5)]})
19"""
20   green  blue  indigo
210     13    13       9
221      7    14       8
232      4    20      19
243     16     1       4
254      1    15      11
26"""

Concatenate the two dataframes together. These are joined along the index and missing data is filled in with NaN.

 1df3 =  pd.concat([df1, df2])
 2"""
 3    red  orange  green  blue  indigo
 40   5.0     4.0    NaN   NaN     NaN
 51  19.0    16.0    NaN   NaN     NaN
 62   3.0    15.0    NaN   NaN     NaN
 73   9.0    16.0    NaN   NaN     NaN
 80   NaN     NaN   13.0  13.0     9.0
 91   NaN     NaN    7.0  14.0     8.0
102   NaN     NaN    4.0  20.0    19.0
113   NaN     NaN   16.0   1.0     4.0
124   NaN     NaN    1.0  15.0    11.0
13"""

Pandas - concat() defaults to join two dataframes along the index
concat() defaults to join two dataframes along the index



concat with axis = 1

The concat function has a number of parameters, which have defaults. the axis parameter specifies along which to join the dataframes, o for index (default) and 1 for columns. The following code is equivalent code to pd.concat([df1, df2])

 1# These are the default parameters to concat
 2pd.concat([df1, df2],
 3          axis = 0,
 4          join = 'outer',
 5          ignore_index = False,
 6          keys = None,
 7          levels = None,
 8          names = None,
 9          verify_integrity = False,
10          sort = False,
11          copy = True)
12"""
13    red  orange  green  blue  indigo
140   5.0     4.0    NaN   NaN     NaN
151  19.0    16.0    NaN   NaN     NaN
162   3.0    15.0    NaN   NaN     NaN
173   9.0    16.0    NaN   NaN     NaN
180   NaN     NaN   13.0  13.0     9.0
191   NaN     NaN    7.0  14.0     8.0
202   NaN     NaN    4.0  20.0    19.0
213   NaN     NaN   16.0   1.0     4.0
224   NaN     NaN    1.0  15.0    11.0
23"""

The following shows the result when the axis is specified as 1 so that the dataframes are concatenated along the columns.

 1pd.concat([df1, df2], axis = 1)
 2
 3"""
 4    red  orange  green  blue  indigo
 50   5.0     4.0     13    13       9
 61  19.0    16.0      7    14       8
 72   3.0    15.0      4    20      19
 83   9.0    16.0     16     1       4
 94   NaN     NaN      1    15      11
10"""

Pandas - concat() with index = 1 joins two dataframes along the columns
Pandas - concat() with index = 1 joins two dataframes along the columns



concat with inner join

concat with inner join is similar to SQL inner join. When matching along columns, then only the rows with matching indexes are kept. Note that row at index 4 is dropped from df2 as there is no matching index in df1.

1pd.concat([df1, df2], axis = 1, join = 'inner')
2"""
3   red  orange  green  blue  indigo
40    5       4     13    13       9
51   19      16      7    14       8
62    3      15      4    20      19
73    9      16     16     1       4
8"""

Pandas - concatenate two dataframes with inner join only keeps matching indexes
Pandas - concatenate two dataframes with inner join only keeps matching indexes



Concatenating multiple dataframes

More than two dataframes can be concatenated together. The default is to concatenate along the index.

 1# Create an array of dataframes
 2rainbows = []
 3for x in range(5):
 4    df = pd.DataFrame({
 5        'red': [random.randint(1,20) for _ in range(1)],
 6        'orange': [random.randint(1,20) for _ in range(1)],
 7        'yellow': [random.randint(1,20) for _ in range(1)],
 8        'green': [random.randint(1,20) for _ in range(1)],
 9        'blue': [random.randint(1,20) for _ in range(1)],
10        'indigo': [random.randint(1,20) for _ in range(1)],
11        'violet': [random.randint(1,20) for _ in range(1)]})
12    rainbows.append(df)
13
14"""
15   red  orange  yellow  green  blue  indigo  violet
160    9       4       9     17     7      20      14
17
18   red  orange  yellow  green  blue  indigo  violet
190    1       8       1     13     5       2       6
20
21   red  orange  yellow  green  blue  indigo  violet
220   15      17      14     18     8      17      15
23
24   red  orange  yellow  green  blue  indigo  violet
250    8      17       1     13    19      11      14
26
27   red  orange  yellow  green  blue  indigo  violet
280    2      10       5      7     2      10       3
29"""
30
31pd.concat(rainbows, axis = 0)
32"""
33   red  orange  yellow  green  blue  indigo  violet
340    9       4       9     17     7      20      14
350    1       8       1     13     5       2       6
360   15      17      14     18     8      17      15
370    8      17       1     13    19      11      14
380    2      10       5      7     2      10       3
39"""
40
41# The index of the individual dataframes can be ignored
42pd.concat(rainbows, axis = 0, ignore_index = True)
43"""
44   red  orange  yellow  green  blue  indigo  violet
450    9       4       9     17     7      20      14
461    1       8       1     13     5       2       6
472   15      17      14     18     8      17      15
483    8      17       1     13    19      11      14
494    2      10       5      7     2      10       3
50"""

Pandas - concatenate multiple dataframes along index
Pandas - concatenate multiple dataframes along index

Multiple dataframes can also be concatenated along the columns with axis=1.

 1# Create an array of dataframes
 2dfs = []
 3for x in ['R', 'O', 'Y', 'G', 'B', 'I', 'V']:
 4    df = pd.DataFrame({x: [random.randint(1,20) for _ in range(4)]})
 5    dfs.append(df)
 6
 7pd.concat(dfs, axis = 1)
 8"""
 9    R   O   Y   G   B   I   V
100   8  18  12   9  13  18  16
111   1   8  19  18  17   7  12
122   7  13  12  20   5  14  19
133  18  17  15   1  17   2  18
14"""

Pandas - concatenate multiple dataframes along columns
Pandas - concatenate multiple dataframes along columns



Hierarchical data

A hierarchical index can be created using the keys parameter. This can be used to allow reference to the original source of the data in the combined dataframe. Three dataframes are created with the same columns containing random data.

 1# Create Dataframe
 2A = pd.DataFrame({
 3    'red': [random.randint(1,20) for _ in range(4)],
 4    'green': [random.randint(1,20) for _ in range(4)],
 5    'blue': [random.randint(1,20) for _ in range(4)]})
 6"""
 7   red  green  blue
 80   15      5    17
 91   20      6    15
102   12      1    20
113    9     11     3
12"""
13
14B = pd.DataFrame({
15    'red': [random.randint(1,20) for _ in range(4)],
16    'green': [random.randint(1,20) for _ in range(4)],
17    'blue': [random.randint(1,20) for _ in range(4)]})
18"""
19   red  green  blue
200   11     13     6
211   18      6     6
222   20     15     8
233    2     14     2
24"""
25
26C = pd.DataFrame({
27    'red': [random.randint(1,20) for _ in range(4)],
28    'green': [random.randint(1,20) for _ in range(4)],
29    'blue': [random.randint(1,20) for _ in range(4)]})
30"""
31   red  green  blue
320    4      3     7
331    5     13     8
342   17      4    14
353   19     10     3
36"""

Concatenate the dataframes with key values.

 1pd.concat([A, B, C], keys = ['A', 'B', 'C'])
 2
 3"""
 4     red  green  blue
 5A 0   15      5    17
 6  1   20      6    15
 7  2   12      1    20
 8  3    9     11     3
 9B 0   11     13     6
10  1   18      6     6
11  2   20     15     8
12  3    2     14     2
13C 0    4      3     7
14  1    5     13     8
15  2   17      4    14
16  3   19     10     3
17"""

Pandas - concatenate multiple dataframes with keys
Pandas - concatenate multiple dataframes, creating a hierarchical with keys parameter

This hierarchical allows easy access to original data based on the key.

 1h_df = pd.concat([A, B, C], keys = ['A', 'B', 'C'])
 2
 3h_df.loc['B']
 4"""
 5   red  green  blue
 60   11     13     6
 71   18      6     6
 82   20     15     8
 93    2     14     2
10"""
11
12h_df.loc[[('A', 1),('B', 3)]]
13"""
14     red  green  blue
15A 1   20      6    15
16B 3    2     14     2
17"""


Dataframe.Append

instance method performs the same function as concat by appending a Series or Dataframe onto the end of the calling dataframe and returning a new dataframe. As noted in the documentation, successive calls to append rows to a DataFrame can be more computationally expensive than a single concat function.

 1random.seed(5)
 2P = pd.DataFrame({
 3    'red': [random.randint(1,20) for _ in range(4)],
 4    'green': [random.randint(1,20) for _ in range(4)],
 5    'blue': [random.randint(1,20) for _ in range(4)]})
 6"""
 7   red  green  blue
 80   20      1     6
 91    9     15     4
102   12      8    12
113   17      2    16
12"""
13
14Q = pd.DataFrame({
15    'red': [random.randint(1,20) for _ in range(4)],
16    'green': [random.randint(1,20) for _ in range(4)],
17    'blue': [random.randint(1,20) for _ in range(4)]})
18"""
19   red  green  blue
200    8     19    14
211   13      8     9
222   18      1     6
233    4      7    13
24"""
25
26R = pd.DataFrame({
27    'red': [random.randint(1,20) for _ in range(4)],
28    'green': [random.randint(1,20) for _ in range(4)],
29    'blue': [random.randint(1,20) for _ in range(4)]})
30"""
31   red  green  blue
320    6     20     1
331    3     15     1
342    5      5     7
353   20      5     7
36"""

Append two dataframes to a dataframe.

 1P.append([Q, R], ignore_index = True)
 2
 3"""
 4    red  green  blue
 50    20      1     6
 61     9     15     4
 72    12      8    12
 83    17      2    16
 94     8     19    14
105    13      8     9
116    18      1     6
127     4      7    13
138     6     20     1
149     3     15     1
1510    5      5     7
1611   20      5     7
17"""

DataFrame - append multiple dataframes with ignore_index
DataFrame - append multiple dataframes with ignore_index



Conclusion

Joining dataframes is easily achieved with pandas.concat function. The default is to concatenate the rows of the second dataframe after the last row of the first dataframe and return a new dataframe. The same functionality can be achieved using the dataframe.append function. concat has more options such as concatenating along the columns or specifying an inner join. A hierarchical index can be generated on the combined dataframe with the use of the keys parameter.


Posts in this Series