Pandas - Joining DataFrames with Concat and Append
It is frequently required to join dataframes together, such as when data
is loaded from multiple files or even multiple sources. pandas.concat()
is used
to add the rows of multiple dataframes together and produce a new dataframe with the
the combined data. It can also be used to concatenate dataframes by columns as
shown below.
Concatenation is the grouping together of sequences of data ans is more generally used
when referring to strings. When one string is appended to another or several strings are
concatenated together. pandas.concat takes a list of DataFrames or a list of Series and
concatenates the data. The DataFrame class also has an append
function that is a
shortcut to the concat function.
Summary of the pandas.concat and dataframe.append:
Summary of pandas.concat and dataframe.append |
|
---|---|
pd.concat([df1, df2]) | concatenate two dataframes along index |
pd.concat([df1, df2], ignore_index = True) | concatenate two dataframes along index, ignoring source index values |
pd.concat([df1, df2], axis = 1) | concatenate two dataframes along columns |
pd.concat([df1, df2], axis = 1, join = 'inner') | concatenate two dataframes along columns with inner join |
pd.concat(rainbows, ignore_index = True) | concatenate two dataframes along index and re-index |
pd.concat([A, B, C], keys = ['A', 'B', 'C']) | concatenate multiple dataframes creating a hierarchical index with keys parameter |
df1.append([df2]) | append one dataframe to another |
df1.append([df2, df3], ignore_index = True) | append multiple dataframes to another and re-index |
concat
The Pandas.concat function joins a number of dataframes on one of the axis. The default is to join along the index.
1# Create Dataframe
2random.seed(42) # use seed to make random data reproducible
3d1 = pd.DataFrame({
4 'red': [random.randint(1,20) for _ in range(4)],
5 'orange': [random.randint(1,20) for _ in range(4)],
6 'yellow': [random.randint(1,20) for _ in range(4)],
7 'green': [random.randint(1,20) for _ in range(4)]})
8"""
9 red orange yellow green
100 4 8 3 1
111 1 5 19 3
122 9 4 14 7
133 8 18 2 8
14"""
15
16d2 = pd.DataFrame({
17 'red': [random.randint(1,20) for _ in range(4)],
18 'orange': [random.randint(1,20) for _ in range(4)],
19 'yellow': [random.randint(1,20) for _ in range(4)],
20 'green': [random.randint(1,20) for _ in range(4)]})
21"""
22 red orange yellow green
230 17 7 15 6
241 20 18 19 14
252 1 14 9 11
263 18 8 1 9
27"""
Concatenate the two dataframes together to join along the index.
1print(pd.concat([d1, d2]))
2
3"""
4 red orange yellow green
50 4 8 3 1
61 1 5 19 3
72 9 4 14 7
83 8 18 2 8
90 17 7 15 6
101 20 18 19 14
112 1 14 9 11
123 18 8 1 9
13"""
Notice that the original indices from the origina dataframes are preserved by default, which results in duplicate index values in the resultant dataframe.
1d3 = pd.concat([d1, d2])
2print(d3.index)
3"""
4Int64Index([0, 1, 2, 3, 0, 1, 2, 3], dtype='int64')
5"""
6
7print(d3.columns)
8"""
9Index(['red', 'orange', 'yellow', 'green'], dtype='object')
10"""
11
12print(d3.shape)
13"""
14(8, 4)
15"""
16
17print(d3.loc[3])
18"""
19 red orange yellow green
203 8 18 2 8
213 18 8 1 9
22"""
Pandas - concat by default joins two dataframes along the index
concat parameters
The concat function has a number of parameters, which have defaults. These are explained in the Pandas.concat documentation. The following shows that executing the same concat function with all the parameters set to their defaults returns the same results not specifying any optional parameters.
Pandas.concat Parameters:
Parameter |
Note |
Default |
---|---|---|
objs | list of DataFrame or Series objects | |
axis | the axis to concatenate along, (0 = ’index’, 1 = ’columns’) | 0 |
join | how to handle indexes on other axis, (options are ‘inner’ or ‘outer’) | 'outer' |
ignore_index | boolean value on preserving source index | False |
keys | sequence used to create hierarchical index using the passed keys | None |
levels | list of sequences used to create a MultiIndex | None |
names | list of names for the levels in hierarchical index | None |
verify_integrity | boolean value to specify whether the new concatenated axis contains duplicates | False |
sort | boolean value to specify sorting non-concatenation axis if it is not already aligned when join is ‘outer’ | False |
copy | boolean value to specify whether data is copied unnecessarily | True |
1# These are the default parameters to concat
2pd.concat([d1, d2],
3 axis = 0,
4 join = 'outer',
5 ignore_index = False,
6 keys = None,
7 levels = None,
8 names = None,
9 verify_integrity = False,
10 sort = False,
11 copy = True)
12"""
13 red orange yellow green
140 4 8 3 1
151 1 5 19 3
162 9 4 14 7
173 8 18 2 8
180 17 7 15 6
191 20 18 19 14
202 1 14 9 11
213 18 8 1 9
22"""
The ignore_index
can be set to True to allow the index values from the source
dataframes to be ignored and a new index generated for the resultant dataframe.
1pd.concat([d1, d2], ignore_index = True)
2"""
3 red orange yellow green
40 4 8 3 1
51 1 5 19 3
62 9 4 14 7
73 8 18 2 8
84 17 7 15 6
95 20 18 19 14
106 1 14 9 11
117 18 8 1 9
12"""
Pandas - concat join two dataframes along the index with ignore_index
concat with different column names
What happens if the column names of the dataframes to be concatenated are not the same? Create two dataframes and then concatenate them together.
1# Create Dataframe
2random.seed(1) # use seed to make random data reproducible
3df1 = pd.DataFrame({
4 'red': [random.randint(1,20) for _ in range(4)],
5 'orange': [random.randint(1,20) for _ in range(4)]})
6"""
7 red orange
80 5 4
91 19 16
102 3 15
113 9 16
12"""
13
14
15df2 = pd.DataFrame({
16 'green': [random.randint(1,20) for _ in range(5)],
17 'blue': [random.randint(1,20) for _ in range(5)],
18 'indigo': [random.randint(1,20) for _ in range(5)]})
19"""
20 green blue indigo
210 13 13 9
221 7 14 8
232 4 20 19
243 16 1 4
254 1 15 11
26"""
Concatenate the two dataframes together. These are joined along the index and missing
data is filled in with NaN
.
1df3 = pd.concat([df1, df2])
2"""
3 red orange green blue indigo
40 5.0 4.0 NaN NaN NaN
51 19.0 16.0 NaN NaN NaN
62 3.0 15.0 NaN NaN NaN
73 9.0 16.0 NaN NaN NaN
80 NaN NaN 13.0 13.0 9.0
91 NaN NaN 7.0 14.0 8.0
102 NaN NaN 4.0 20.0 19.0
113 NaN NaN 16.0 1.0 4.0
124 NaN NaN 1.0 15.0 11.0
13"""
concat() defaults to join two dataframes along the index
concat with axis = 1
The concat function has a number of parameters, which have defaults. the axis
parameter specifies along which to join the dataframes, o for index (default) and
1 for columns. The following code is equivalent code to pd.concat([df1, df2])
1# These are the default parameters to concat
2pd.concat([df1, df2],
3 axis = 0,
4 join = 'outer',
5 ignore_index = False,
6 keys = None,
7 levels = None,
8 names = None,
9 verify_integrity = False,
10 sort = False,
11 copy = True)
12"""
13 red orange green blue indigo
140 5.0 4.0 NaN NaN NaN
151 19.0 16.0 NaN NaN NaN
162 3.0 15.0 NaN NaN NaN
173 9.0 16.0 NaN NaN NaN
180 NaN NaN 13.0 13.0 9.0
191 NaN NaN 7.0 14.0 8.0
202 NaN NaN 4.0 20.0 19.0
213 NaN NaN 16.0 1.0 4.0
224 NaN NaN 1.0 15.0 11.0
23"""
The following shows the result when the axis is specified as 1 so that the dataframes are concatenated along the columns.
1pd.concat([df1, df2], axis = 1)
2
3"""
4 red orange green blue indigo
50 5.0 4.0 13 13 9
61 19.0 16.0 7 14 8
72 3.0 15.0 4 20 19
83 9.0 16.0 16 1 4
94 NaN NaN 1 15 11
10"""
Pandas - concat() with index = 1 joins two dataframes along the columns
concat with inner join
concat with inner join is similar to SQL inner join. When matching along columns, then only the rows with matching indexes are kept. Note that row at index 4 is dropped from df2 as there is no matching index in df1.
1pd.concat([df1, df2], axis = 1, join = 'inner')
2"""
3 red orange green blue indigo
40 5 4 13 13 9
51 19 16 7 14 8
62 3 15 4 20 19
73 9 16 16 1 4
8"""
Pandas - concatenate two dataframes with inner join only keeps matching indexes
Concatenating multiple dataframes
More than two dataframes can be concatenated together. The default is to concatenate along the index.
1# Create an array of dataframes
2rainbows = []
3for x in range(5):
4 df = pd.DataFrame({
5 'red': [random.randint(1,20) for _ in range(1)],
6 'orange': [random.randint(1,20) for _ in range(1)],
7 'yellow': [random.randint(1,20) for _ in range(1)],
8 'green': [random.randint(1,20) for _ in range(1)],
9 'blue': [random.randint(1,20) for _ in range(1)],
10 'indigo': [random.randint(1,20) for _ in range(1)],
11 'violet': [random.randint(1,20) for _ in range(1)]})
12 rainbows.append(df)
13
14"""
15 red orange yellow green blue indigo violet
160 9 4 9 17 7 20 14
17
18 red orange yellow green blue indigo violet
190 1 8 1 13 5 2 6
20
21 red orange yellow green blue indigo violet
220 15 17 14 18 8 17 15
23
24 red orange yellow green blue indigo violet
250 8 17 1 13 19 11 14
26
27 red orange yellow green blue indigo violet
280 2 10 5 7 2 10 3
29"""
30
31pd.concat(rainbows, axis = 0)
32"""
33 red orange yellow green blue indigo violet
340 9 4 9 17 7 20 14
350 1 8 1 13 5 2 6
360 15 17 14 18 8 17 15
370 8 17 1 13 19 11 14
380 2 10 5 7 2 10 3
39"""
40
41# The index of the individual dataframes can be ignored
42pd.concat(rainbows, axis = 0, ignore_index = True)
43"""
44 red orange yellow green blue indigo violet
450 9 4 9 17 7 20 14
461 1 8 1 13 5 2 6
472 15 17 14 18 8 17 15
483 8 17 1 13 19 11 14
494 2 10 5 7 2 10 3
50"""
Pandas - concatenate multiple dataframes along index
Multiple dataframes can also be concatenated along the columns with axis=1
.
1# Create an array of dataframes
2dfs = []
3for x in ['R', 'O', 'Y', 'G', 'B', 'I', 'V']:
4 df = pd.DataFrame({x: [random.randint(1,20) for _ in range(4)]})
5 dfs.append(df)
6
7pd.concat(dfs, axis = 1)
8"""
9 R O Y G B I V
100 8 18 12 9 13 18 16
111 1 8 19 18 17 7 12
122 7 13 12 20 5 14 19
133 18 17 15 1 17 2 18
14"""
Pandas - concatenate multiple dataframes along columns
Hierarchical data
A hierarchical index can be created using the keys
parameter. This can be used to
allow reference to the original source of the data in the combined dataframe.
Three dataframes are created with the same columns containing random data.
1# Create Dataframe
2A = pd.DataFrame({
3 'red': [random.randint(1,20) for _ in range(4)],
4 'green': [random.randint(1,20) for _ in range(4)],
5 'blue': [random.randint(1,20) for _ in range(4)]})
6"""
7 red green blue
80 15 5 17
91 20 6 15
102 12 1 20
113 9 11 3
12"""
13
14B = pd.DataFrame({
15 'red': [random.randint(1,20) for _ in range(4)],
16 'green': [random.randint(1,20) for _ in range(4)],
17 'blue': [random.randint(1,20) for _ in range(4)]})
18"""
19 red green blue
200 11 13 6
211 18 6 6
222 20 15 8
233 2 14 2
24"""
25
26C = pd.DataFrame({
27 'red': [random.randint(1,20) for _ in range(4)],
28 'green': [random.randint(1,20) for _ in range(4)],
29 'blue': [random.randint(1,20) for _ in range(4)]})
30"""
31 red green blue
320 4 3 7
331 5 13 8
342 17 4 14
353 19 10 3
36"""
Concatenate the dataframes with key values.
1pd.concat([A, B, C], keys = ['A', 'B', 'C'])
2
3"""
4 red green blue
5A 0 15 5 17
6 1 20 6 15
7 2 12 1 20
8 3 9 11 3
9B 0 11 13 6
10 1 18 6 6
11 2 20 15 8
12 3 2 14 2
13C 0 4 3 7
14 1 5 13 8
15 2 17 4 14
16 3 19 10 3
17"""
Pandas - concatenate multiple dataframes, creating a hierarchical with keys parameter
This hierarchical allows easy access to original data based on the key.
1h_df = pd.concat([A, B, C], keys = ['A', 'B', 'C'])
2
3h_df.loc['B']
4"""
5 red green blue
60 11 13 6
71 18 6 6
82 20 15 8
93 2 14 2
10"""
11
12h_df.loc[[('A', 1),('B', 3)]]
13"""
14 red green blue
15A 1 20 6 15
16B 3 2 14 2
17"""
Dataframe.Append
instance method performs the same function as concat by appending a Series or Dataframe onto the end of the calling dataframe and returning a new dataframe. As noted in the documentation, successive calls to append rows to a DataFrame can be more computationally expensive than a single concat function.
1random.seed(5)
2P = pd.DataFrame({
3 'red': [random.randint(1,20) for _ in range(4)],
4 'green': [random.randint(1,20) for _ in range(4)],
5 'blue': [random.randint(1,20) for _ in range(4)]})
6"""
7 red green blue
80 20 1 6
91 9 15 4
102 12 8 12
113 17 2 16
12"""
13
14Q = pd.DataFrame({
15 'red': [random.randint(1,20) for _ in range(4)],
16 'green': [random.randint(1,20) for _ in range(4)],
17 'blue': [random.randint(1,20) for _ in range(4)]})
18"""
19 red green blue
200 8 19 14
211 13 8 9
222 18 1 6
233 4 7 13
24"""
25
26R = pd.DataFrame({
27 'red': [random.randint(1,20) for _ in range(4)],
28 'green': [random.randint(1,20) for _ in range(4)],
29 'blue': [random.randint(1,20) for _ in range(4)]})
30"""
31 red green blue
320 6 20 1
331 3 15 1
342 5 5 7
353 20 5 7
36"""
Append two dataframes to a dataframe.
1P.append([Q, R], ignore_index = True)
2
3"""
4 red green blue
50 20 1 6
61 9 15 4
72 12 8 12
83 17 2 16
94 8 19 14
105 13 8 9
116 18 1 6
127 4 7 13
138 6 20 1
149 3 15 1
1510 5 5 7
1611 20 5 7
17"""
DataFrame - append multiple dataframes with ignore_index
Conclusion
Joining dataframes is easily achieved with pandas.concat
function. The default is to
concatenate the rows of the second dataframe after the last row of the first dataframe
and return a new dataframe. The same functionality can be achieved using the
dataframe.append
function. concat has more options such as concatenating along
the columns or specifying an inner join. A hierarchical index can be generated on the
combined dataframe with the use of the keys parameter.