Create a Bar Chart Race with Matplotlib - part 2
A Bar Chart Race is an animation where the bars representing the data change in place on the chart to represent change over time. This second article improves on the bar chart race by handling categories that change over time while keeping the color associated with each category constant.
Creating a basic Bar Chart Race is described in "How to create a Bar Chart race - part 1". This article builds on it in the following areas:
- Handle a larger range of data where the countries change.
- Keep the color for each country constant.
- Improve the bar chart race by displaying the current value on the bar.
- Wrap up in functions to create other bar chart races.
Load the data
Details on downloading and loading the data into a dataframe are described in "Pandas - Load data from Excel file and Display Chart". The following loads the data from a local file and filters it to the median values.
import pandas as pd

# Load the excel worksheet into a dataframe
u5mr_df = pd.read_excel(
    "/tmp/data/Under-five-mortality-rate_2020.xlsx",
    sheet_name = 'Country estimates (both sexes)',
    header = 14)

# Drop the last two rows
u5mr_df.drop(u5mr_df.tail(2).index, inplace = True)

# Rename the columns to Years (strip the trailing '.5', e.g. '1950.5' -> '1950')
u5mr_df.columns = [x[:-2] if x.endswith('.5') else x for x in u5mr_df.columns]

# Rename 'Uncertainty.Bounds*' column to 'Uncertainty.Bounds'
u5mr_df = u5mr_df.rename(columns={'Uncertainty.Bounds*': 'Uncertainty.Bounds'})

# Filter to the Median values
u5mr_med_df = u5mr_df[u5mr_df['Uncertainty.Bounds'] == 'Median']

# Review the data
u5mr_med_df.iloc[[0,1,2,3,-4,-3,-2,-1], [0,1,2,3,4,5,6,-4,-3,-2,-1]]
"""
    ISO.Code Country.Name Uncertainty.Bounds  1950  1951  1952        1953       2016       2017       2018       2019
1        AFG  Afghanistan             Median   NaN   NaN   NaN         NaN  67.572190  64.940759  62.541196  60.269399
4        ALB      Albania             Median   NaN   NaN   NaN         NaN   9.419110   9.418052   9.525133   9.682407
7        DZA      Algeria             Median   NaN   NaN   NaN         NaN  24.792098  24.319482  23.805926  23.256168
10       AND      Andorra             Median   NaN   NaN   NaN         NaN   3.369056   3.218925   3.085839   2.966929
574      VNM     Viet Nam             Median   NaN   NaN   NaN         NaN  21.220796  20.843125  20.405423  19.935167
577      YEM        Yemen             Median   NaN   NaN   NaN         NaN  56.823614  56.966430  58.460003  58.356138
580      ZMB       Zambia             Median   NaN   NaN   NaN  234.418232  66.510929  64.337901  63.294182  61.663465
583      ZWE     Zimbabwe             Median   NaN   NaN   NaN         NaN  59.538505  58.234924  55.856832  54.612967
"""
Identify the countries of interest
First get the countries that appear in the top 10 in at least one year of the dataset. Use a lambda expression to sort each year column and take its top 10, then combine the results across all years. This gives the following 41 countries.
Afghanistan      Angola                             Bangladesh
Benin            Burkina Faso                       Cambodia
Cameroon         Central African Republic           Chad
Cote d'Ivoire    Democratic Republic of the Congo   Egypt
Ethiopia         Gambia                             Ghana
Guinea           Guinea-Bissau                      Haiti
Iraq             Jordan                             Liberia
Libya            Malawi                             Mali
Mauritania       Mozambique                         Nepal
Niger            Nigeria                            Oman
Pakistan         Peru                               Republic of Korea
Rwanda           Senegal                            Sierra Leone
Somalia          South Sudan                        Togo
Turkey           Yemen
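fields = ['Country.Name'] + [str(x) for x in range(1950, 2020)]
sel_df = u5mr_med_df[fields].set_index('Country.Name')
# For each year column keep the 10 highest values; the index of the combined
# result is the union of the countries that were in any yearly top 10
countries = list(sel_df.apply(
    lambda x: x.sort_values(ascending = False).head(10),
    axis = 0).index)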
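To make the union-of-top-N step concrete, here is a minimal sketch on a made-up table of four countries and two years. The names and values are hypothetical and are not taken from the U5MR dataset; the layout is assumed to match sel_df, with one row per country and one column per year.

```python
import pandas as pd

# Hypothetical toy data: rows are countries, columns are years
toy_df = pd.DataFrame(
    {"2000": [10.0, 8.0, 3.0, 1.0],
     "2001": [2.0, 9.0, 11.0, 1.0]},
    index=["A", "B", "C", "D"])

# Apply to each year column (axis=0) and keep the 2 largest values.
# The per-year results have different indexes, so pandas combines them
# on the union of those indexes.
top_df = toy_df.apply(
    lambda col: col.sort_values(ascending=False).head(2),
    axis=0)

top_countries = list(top_df.index)
print(sorted(top_countries))
```

Country D never reaches a yearly top 2, so it is the only one left out; a country only needs to qualify in a single year to be kept.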
Create a color map for these countries.
The original bar chart animation showed changes in the same 5 countries from 2015 to 2019. The bar chart race is an animation of a sequence of bar charts in which the color of the bars is associated with the country for the value on the bar. Each bar chart is generated independently of the previous bar charts, so the color of the countries will change as new countries are added to the chart and other countries are removed. This is confusing in the final animated result.
To address this, a color map is created that assigns a fixed color to each of the 41 countries. Matplotlib's built-in qualitative palettes have at most 20 colors, so the palette is created by joining the colors from several palettes.
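import matplotlib.pyplot as plt

# Colours: join four qualitative palettes (20 + 8 + 12 + 20 = 60 colours)
cols = plt.cm.tab20.colors + plt.cm.Dark2.colors + plt.cm.Set3.colors + plt.cm.tab20b.colors

# Map each country to its own colour
m = zip(countries, cols)
color_dict = {x[0]:x[1] for x in m}
color_dict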
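One thing to watch with this approach is that zip stops at the shorter of its inputs, so a category beyond the end of the palette would silently get no color and fail later at lookup. The following is a small sketch of a guard against that; build_color_dict and the placeholder category names are hypothetical helpers for illustration, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def build_color_dict(names, palette):
    """Map each name to its own colour; fail loudly if the palette is too short."""
    names = list(names)
    if len(names) > len(palette):
        raise ValueError(
            f"{len(names)} categories but only {len(palette)} colours")
    return dict(zip(names, palette))

# The same four palettes as in the article, joined into one list of RGB tuples
palette = (list(plt.cm.tab20.colors) + list(plt.cm.Dark2.colors)
           + list(plt.cm.Set3.colors) + list(plt.cm.tab20b.colors))

# Hypothetical category names standing in for the 41 countries
names = [f"country_{n}" for n in range(41)]
color_dict_checked = build_color_dict(names, palette)
print(len(palette), len(color_dict_checked))
```

With 60 colors available, the 41 countries fit comfortably; the check only matters if the list of categories grows.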
Unique colors for top countries
Get the data for these 41 countries of interest
Extract the data for the top 41 countries and transpose the data into wide format. Change the index to the last day of the year for each year and convert to datetime format.
# Get the data for the countries of interest
data_df = (u5mr_med_df.drop(['ISO.Code', 'Uncertainty.Bounds'], axis=1)
           [u5mr_med_df['Country.Name'].isin(countries)]).copy()

# Set index to "Country.Name" and Transpose the dataframe
wide_df = data_df.set_index('Country.Name').T

# Remove the column name
wide_df.rename_axis(None, axis=1, inplace=True)

# Set index to datetime for end of year
wide_df.index = pd.to_datetime([f"{x}-12-31" for x in wide_df.index])

wide_df.iloc[[0,1,2,3,-4,-3,-2,-1], [0,1,2,3,-3,-2,-1]]
"""
            Afghanistan     Angola  Bangladesh       Benin        Togo      Turkey      Yemen
1950-12-31          NaN        NaN  345.207382  349.108392  317.200927         NaN        NaN
1951-12-31          NaN        NaN  335.272222  345.669180  311.993315         NaN        NaN
1952-12-31          NaN        NaN  325.617440  342.068496  307.047255         NaN        NaN
1953-12-31          NaN        NaN  316.000124  338.513583  302.123318  299.282646        NaN
2016-12-31    67.572190  84.211894   35.680942   97.416395   73.515650   12.148032  56.823614
2017-12-31    64.940759  80.622302   33.921226   95.133079   71.317235   11.396633  56.966430
2018-12-31    62.541196  77.672320   32.266390   92.773521   69.115785   10.696923  58.460003
2019-12-31    60.269399  74.686710   30.753860   90.286429   66.904696   10.046388  58.356138
"""
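The reshaping steps are easier to follow on a tiny table. This sketch uses two hypothetical countries and two years (made-up values) and applies the same sequence: set the index, transpose, drop the axis name and convert the year labels to end-of-year timestamps.

```python
import pandas as pd

# Hypothetical table: one row per country, one column per year
toy_df = pd.DataFrame({
    "Country.Name": ["A", "B"],
    "2000": [10.0, 6.0],
    "2001": [4.0, 8.0],
})

# Countries become columns, years become the index
toy_wide = toy_df.set_index("Country.Name").T
toy_wide = toy_wide.rename_axis(None, axis=1)

# Year labels -> timestamps at the end of each year
toy_wide.index = pd.to_datetime([f"{year}-12-31" for year in toy_wide.index])
print(toy_wide)
```

A datetime index is what later allows the data to be resampled to a finer time step.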
Expand the data set for smooth animation
Expand the dataset to one row every two months and create a ranking dataframe that ranks the countries within each row. Fill in the newly created rows in both dataframes using the interpolate function. Where a row ends up with duplicate rankings, increment the duplicates slightly to avoid bars disappearing from the bar chart animation.
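# Expand the dataset to one row every two months (month end).
# pandas 2.2 and later spell this frequency alias '2ME'.
expanded_df = wide_df.asfreq('2M')

# Create a ranking dataframe (rank 1 = highest value in the row)
rank_df = expanded_df.rank(axis = 1, method = 'first', ascending = False)

# Fill the new rows by linear interpolation
expanded_df = expanded_df.interpolate()
rank_df = rank_df.interpolate()

# Remove duplicate ranking: nudge any repeated rank in a row by 1%
rank_df = rank_df.where(~rank_df.apply(pd.Series.duplicated, axis=1), rank_df*1.01)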
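The effect of expanding, ranking and interpolating can be seen on a toy example. The sketch below uses two hypothetical categories whose order swaps between two year-end values, so their interpolated ranks cross at about 1.5 halfway through. It passes a pd.offsets.MonthEnd(2) object instead of a string alias, which is an assumption made here to sidestep the alias spelling difference between pandas versions.

```python
import pandas as pd

# Hypothetical yearly data for two categories whose order swaps
toy_wide = pd.DataFrame(
    {"A": [10.0, 4.0], "B": [6.0, 8.0]},
    index=pd.to_datetime(["2000-12-31", "2001-12-31"]))

# One row every second month end; the new rows start out as NaN
toy_expanded = toy_wide.asfreq(pd.offsets.MonthEnd(2))

# Rank the known rows (largest = 1), then fill the gaps linearly
toy_rank = toy_expanded.rank(axis=1, method="first", ascending=False)
toy_expanded = toy_expanded.interpolate()
toy_rank = toy_rank.interpolate()

# Nudge any exactly tied rank within a row by 1% so bars do not overlap
is_dup = toy_rank.apply(pd.Series.duplicated, axis=1)
toy_rank = toy_rank.where(~is_dup, toy_rank * 1.01)

print(toy_expanded.round(2))
print(toy_rank.round(3))
```

Because the ranks are interpolated rather than recomputed for each new row, the bars glide past each other over several frames instead of jumping when two countries swap places.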
Display bar chart for a single year
Display a bar chart for a single point in time, showing the ten countries with the highest Under Five Mortality Rate at that time.
fig, ax = plt.subplots(nrows = 1,
                       ncols = 1,
                       figsize = (10, 7),
                       facecolor = plt.cm.Blues(.2),
                       tight_layout = True)
bar_num = 10
i = 138  # row 138 of the expanded data is the end of 1973
sel_df = expanded_df.iloc[:, list(rank_df.iloc[i] <= bar_num)]
ax.barh(y = rank_df.iloc[:, list(rank_df.iloc[i] <= bar_num)].iloc[i],
        tick_label = sel_df.columns,
        width = sel_df.iloc[i],
        color = [color_dict[col] for col in sel_df.columns],
        alpha = 0.8)
cur_year = expanded_df.index[i].strftime('%Y-%m')
ax.set_title(f'Under Five Mortality Rate - {cur_year}',
             fontsize = 'xx-large',
             fontweight = 'bold')
ax.set_ylim(10.8, 0.2)  # inverted axis: rank 1 at the top
ax.set_facecolor(plt.cm.Blues(.2))
ax.tick_params(labelsize = 'medium')
ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
ax.set_axisbelow(True)
for spine in ax.spines.values():
    spine.set_visible(False)
plt.show()
Top ten countries with highest Under Five Mortality Rate in 1973
Display bar chart with mortality rate on the bars
It can be helpful to see the under five mortality rate on the bar chart as the data changes. This is done by adding an annotation to each of the bars. Here the mortality rate is displayed just inside the right edge of the bar. The label could be placed anywhere, and it also works quite well just outside the right end of the bar.
fig, ax = plt.subplots(nrows = 1,
                       ncols = 1,
                       figsize = (10, 7),
                       facecolor = plt.cm.Blues(.2),
                       tight_layout = True)
bar_num = 10
i = 138
sel_df = expanded_df.iloc[:, list(rank_df.iloc[i] <= bar_num)]
bars = ax.barh(y = rank_df.iloc[:, list(rank_df.iloc[i] <= bar_num)].iloc[i],
               tick_label = sel_df.columns,
               width = sel_df.iloc[i],
               color = [color_dict[col] for col in sel_df.columns],
               alpha = 0.8)

cur_year = expanded_df.index[i].strftime('%Y-%m')
ax.set_title(f'Under Five Mortality Rate - {cur_year}',
             fontsize = 'xx-large',
             fontweight = 'bold')
ax.set_ylim(10.8, 0.2)
ax.set_facecolor(plt.cm.Blues(.2))
ax.tick_params(labelsize = 'medium')
ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
ax.set_axisbelow(True)
for spine in ax.spines.values():
    spine.set_visible(False)

# Annotate each bar with its value, 25 points inside the right edge
for bar in bars:
    width = bar.get_width()
    ax.annotate(f'{width:.0f}',
                xy = (width, bar.get_y() + bar.get_height() / 2),
                xytext = (-25, 0),
                textcoords = "offset points",
                fontsize = 'xx-large',
                fontweight = 'bold',
                ha = 'right',
                va = 'center')
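For labels placed just outside the bar, a shorter alternative to the annotate loop is Matplotlib's Axes.bar_label, which labels every bar in a container in one call. This is a sketch under the assumption that Matplotlib 3.4 or later is installed (bar_label was added in that release); the names and values are made up, and it is a substitute technique, not the method used in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt

# Hypothetical values standing in for one row of the expanded data
names = ["A", "B", "C"]
values = [10.2, 6.8, 3.0]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(y=[1, 2, 3], width=values, tick_label=names)
ax.set_ylim(3.8, 0.2)  # rank 1 at the top, as in the charts above

# bar_label annotates every bar in the container in one call;
# padding is in points, measured from the end of each bar
labels = ax.bar_label(bars, fmt="%.0f", padding=3, fontweight="bold")
print([t.get_text() for t in labels])
```

The annotate loop remains the more flexible choice when the text needs to sit inside the bar with its own alignment, as in the chart above.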
Top ten countries with highest Under Five Mortality Rates displaying rate on bars
Display a random sample of bar charts for different years
The expanded dataframe now has 415 rows, so the final animation is created by generating 415 bar charts and playing them in sequence. It is helpful to plot a random sample of rows to check that the bar charts are displayed as expected. The random rows are selected using the dataframe sample function.
sample_num = 5
d_df = expanded_df.sample(n=sample_num, random_state=1)
r_df = rank_df.loc[d_df.index]

fig, axs = plt.subplots(nrows = 1,
                        ncols = sample_num,
                        figsize = (15, 5),
                        facecolor = plt.cm.Blues(.2),
                        tight_layout = True)

for i, ax in enumerate(axs.flatten()):
    sel_df = d_df.iloc[:, list(r_df.iloc[i] <= bar_num)]
    bars = ax.barh(y = r_df.iloc[:, list(r_df.iloc[i] <= bar_num)].iloc[i],
                   tick_label = sel_df.columns,
                   width = sel_df.iloc[i],
                   color = [color_dict[col] for col in sel_df.columns],
                   alpha = 0.8)

    cur_year = d_df.index[i].strftime('%Y-%m')
    ax.set_title(f'{cur_year}',
                 fontsize = 'large',
                 fontweight = 'bold')
    ax.set_ylim(10.8, 0.2)
    ax.set_facecolor(plt.cm.Blues(.2))
    ax.tick_params(labelsize = 'medium')
    ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)

    for bar in bars:
        width = bar.get_width()
        ax.annotate(f'{width:.0f}',
                    xy = (width, bar.get_y() + bar.get_height() / 2),
                    xytext = (-5, 0),
                    textcoords = "offset points",
                    fontsize = 'medium',
                    fontweight = 'bold',
                    ha = 'right',
                    va = 'center')

plt.show()
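The sampling step relies on two details: random_state makes the draw repeatable, and indexing the ranking dataframe with the sampled index keeps values and ranks aligned. A minimal sketch with hypothetical stand-ins for expanded_df and rank_df follows; the sort_index call is an optional extra, assumed here so the sampled charts appear in chronological order.

```python
import pandas as pd

# Hypothetical stand-ins for expanded_df and rank_df (they share an index)
idx = pd.date_range("2000-01-31", periods=12, freq=pd.offsets.MonthEnd(1))
values_df = pd.DataFrame({"A": range(12), "B": range(12, 0, -1)}, index=idx)
ranks_df = values_df.rank(axis=1, ascending=False)

# random_state makes the draw repeatable; sort_index restores time order
picked = values_df.sample(n=4, random_state=1).sort_index()

# Select the matching rank rows by label so both frames stay aligned
picked_ranks = ranks_df.loc[picked.index]
print(picked.index.strftime("%Y-%m").tolist())
```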
Sample bar charts of U5MR from expanded dataset
Create bar chart race of highest U5MR over the years
The animation is created using the FuncAnimation class in Matplotlib. As there are 415 rows in the expanded dataset, the animation can take some time to generate. Note that defining the functions and creating the animation object is quick; the time is spent when saving the animation to gif, html or mp4.
import matplotlib.animation as anim

def update(i):
    # Redraw the whole chart for row i of the expanded data
    ax.clear()

    # Longest country name, used to pad the tick labels to a common width
    p = expanded_df.columns.map(len).max()
    bar_num = 10
    sel_df = expanded_df.iloc[:, list(rank_df.iloc[i] <= bar_num)]
    bars = ax.barh(y = rank_df.iloc[:, list(rank_df.iloc[i] <= bar_num)].iloc[i],
                   tick_label = [x.rjust(p, ' ') for x in sel_df.columns],
                   width = sel_df.iloc[i],
                   color = [color_dict[col] for col in sel_df.columns],
                   alpha = 0.8)
    plt.setp(ax.get_xticklabels(), fontsize='x-small')
    plt.setp(ax.get_yticklabels(), fontsize='small', fontfamily = 'monospace')

    cur_year = expanded_df.index[i].strftime('%Y')
    ax.set_title(f'Under Five Mortality Rate - {cur_year}',
                 fontsize = 'larger',
                 fontweight = 'bold',
                 loc = 'center')
    ax.set_ylim(10.8, 0.2)
    ax.set_facecolor(plt.cm.Blues(.2))
    ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)

    for bar in bars:
        width = bar.get_width()
        ax.annotate(f'{width:.0f}',
                    xy = (width, bar.get_y() + bar.get_height() / 2),
                    xytext = (-20, 0),
                    textcoords = "offset points",
                    fontsize = 'small',
                    ha = 'right',
                    va = 'center')

fig, ax = plt.subplots(figsize = (8, 5),
                       facecolor = plt.cm.Blues(.2),
                       dpi = 150,
                       tight_layout = True)

# One frame per row of the expanded data, 300 ms between frames
u5mr_anim = anim.FuncAnimation(
    fig = fig,
    func = update,
    frames = len(expanded_df),
    interval = 300)
Generate the gif file.
u5mr_anim.save('U5MR_bar_chart_race_all_years.gif')
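Which writer Matplotlib uses when none is named depends on the installation and its settings. The following minimal sketch, built on made-up data with only three frames, shows the same redraw-per-frame pattern and names the writer explicitly. It assumes the Pillow package is available; draw_frame, frames_data and the output path are hypothetical names used only for this illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.animation as anim

# Hypothetical frames: bar widths for three categories at three time steps
frames_data = [[3, 2, 1], [2.5, 2.5, 1.5], [2, 3, 2]]

fig, ax = plt.subplots(figsize=(4, 2), dpi=50)

def draw_frame(i):
    # Clear the axes and redraw the whole chart for frame i
    ax.clear()
    ax.barh(y=[1, 2, 3], width=frames_data[i], tick_label=["A", "B", "C"])
    ax.set_xlim(0, 3.5)

toy_anim = anim.FuncAnimation(fig=fig, func=draw_frame,
                              frames=len(frames_data), interval=300)

# Name the writer explicitly; PillowWriter writes gif files using Pillow
out_path = os.path.join(tempfile.mkdtemp(), "toy_race.gif")
toy_anim.save(out_path, writer=anim.PillowWriter(fps=3))
print(os.path.getsize(out_path) > 0)
```

Trying the pipeline on a handful of small frames like this is a quick way to check the output before committing to the full 415-frame render.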
Bar chart race showing top ten countries with highest under five mortality rate from 1950 to 2019
MP4 file of bar chart race from 1950 to 2019
Wrap up bar chart race creation in functions
There are a number of steps in preparing the data, expanding it, ranking it and finally generating the bar chart race. The following wraps these steps in a series of functions so they can be reused to create a bar chart race from a similar dataset. Three functions are created: one to prepare the data in wide format, one to expand the data, and one to create the animation.
- prepare_data - Function to get the appropriate data and convert to wide format.
def prepare_data(df, highest = True):
    # Get all the countries in the top 10 for the years
    fields = ['Country.Name'] + [str(x) for x in range(1950, 2020)]
    sel_df = df[fields].set_index('Country.Name')
    countries = list(sel_df.apply(
        lambda x: x.sort_values(ascending = not highest).head(10),
        axis = 0).index)

    # Create color map for countries
    cols = plt.cm.tab20.colors + plt.cm.Dark2.colors + plt.cm.Set3.colors + plt.cm.tab20b.colors
    color_dict = {x[0]:x[1] for x in zip(countries, cols)}

    # Get the data for the countries of interest
    data_df = (df.drop(['ISO.Code', 'Uncertainty.Bounds'], axis=1)
               [df['Country.Name'].isin(countries)]).copy()

    # Set index to "Country.Name" and Transpose the dataframe
    wide_df = data_df.set_index('Country.Name').T

    # Remove the column name
    wide_df.rename_axis(None, axis=1, inplace=True)

    # Convert index to datetime
    wide_df.index = pd.to_datetime([f"{x}-12-31" for x in wide_df.index])

    return wide_df, color_dict
- expand_data - Function to expand the data and create a ranking dataframe.
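def expand_data(df, highest = True):
    # One row every two months (month end)
    e_df = df.asfreq('2M')

    # Create ranking dataset.
    # Note: as written, highest = True ranks in ascending order, so rank 1
    # is the smallest value in each row. The calls in this article rely on
    # the default and chart the lowest rates. The standalone code earlier in
    # the article uses ascending = False to make rank 1 the largest value.
    r_df = e_df.rank(axis = 1, method = 'first', ascending = highest)

    # Interpolate
    e_df = e_df.interpolate()
    r_df = r_df.interpolate()

    # Remove duplicate ranks from the same row
    r_df = r_df.where(~r_df.apply(pd.Series.duplicated, axis=1), r_df*1.01)

    return e_df, r_df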
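The direction of the ranking deserves a closer look, because expand_data ranks with ascending = highest. The sketch below shows a variant under the assumption that rank 1 is meant to be the largest value when highest is True, and the smallest value otherwise. rank_rows is a hypothetical helper, not part of the original functions, and the values are made up.

```python
import pandas as pd

def rank_rows(df, highest=True):
    """Rank each row so that rank 1 is the largest value when highest
    is True, and the smallest value otherwise (hypothetical helper)."""
    return df.rank(axis=1, method="first", ascending=not highest)

# One hypothetical row with three categories
toy = pd.DataFrame({"A": [10.0], "B": [6.0], "C": [8.0]})

ranks_high = rank_rows(toy, highest=True).iloc[0].to_dict()
ranks_low = rank_rows(toy, highest=False).iloc[0].to_dict()
print(ranks_high)
print(ranks_low)
```

If this variant were used inside expand_data, the highest flag would also need to be passed at the call site, since the calls in this article use the default.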
- create_animation - Function to create the animation with the dataframes.
def create_animation(expanded_df, rank_df, color_dict, highest = True):
    def update2(i):
        ax.clear()

        # Longest country name, used to pad the tick labels to a common width
        p = expanded_df.columns.map(len).max()
        bar_num = 10
        sel_df = expanded_df.iloc[:, list(rank_df.iloc[i] <= bar_num)]
        bars = ax.barh(y = rank_df.iloc[:, list(rank_df.iloc[i] <= bar_num)].iloc[i],
                       tick_label = [x.rjust(p, ' ') for x in sel_df.columns],
                       width = sel_df.iloc[i],
                       color = [color_dict[col] for col in sel_df.columns],
                       alpha = 0.8)

        plt.setp(ax.get_xticklabels(), fontsize='small')
        plt.setp(ax.get_yticklabels(), fontsize='medium', fontfamily = 'monospace')

        cur_year = expanded_df.index[i].strftime('%Y')
        ax.set_title(f'Under Five Mortality Rate - {cur_year}',
                     fontsize = 'larger',
                     fontweight = 'bold',
                     loc = 'right')
        # Rank 1 at the top for the highest values, at the bottom for the lowest
        if highest:
            ax.set_ylim(10.8, 0.2)
        else:
            ax.set_ylim(0.2, 10.8)
        ax.set_facecolor(plt.cm.Blues(.2))
        ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
        ax.set_axisbelow(True)
        for spine in ax.spines.values():
            spine.set_visible(False)

        # Value label inside the bar for high values, outside for low values
        h_offset = -20 if highest else 2
        h_align = 'right' if highest else 'left'
        for bar in bars:
            width = bar.get_width()
            display_value = f'{width:.0f}' if highest else f'{width:.1f}'
            ax.annotate(display_value,
                        xy = (width, bar.get_y() + bar.get_height() / 2),
                        xytext = (h_offset, 0),
                        textcoords = "offset points",
                        fontsize = 'small',
                        ha = h_align,
                        va = 'center')

    fig, ax = plt.subplots(figsize = (8, 3),
                           facecolor = plt.cm.Blues(.2),
                           dpi = 150,
                           tight_layout = True)

    data_anim = anim.FuncAnimation(
        fig = fig,
        func = update2,
        frames = len(expanded_df),
        interval = 200)

    return data_anim
Finally, call these functions to create a bar chart race for countries with either the highest or lowest Under Five Mortality Rates over time. Setting highest = False creates a bar chart race for the countries with the lowest Under Five Mortality Rates from 1950 to 2019.
highest = False
# 1. Prepare the data
df, col_dict = prepare_data(u5mr_med_df, highest)

# 2. Expand the data
e_df, r_df = expand_data(df)

# 3. Create animation
data_anim = create_animation(e_df, r_df, col_dict, highest)

# 4. Save animation as gif
data_anim.save('Bar_chart_race_U5MR_lowest_countries.gif')
Bar chart race showing the countries with lowest under five mortality rate from 1950 to 2019
MP4 version available here - MP4 file for bar chart race showing the countries with lowest under five mortality rate from 1950 to 2019
Display U5MR bar chart race for selected countries
The same functions can also be used to compare changes to specific countries over the years.
highest = False
# selected_df is not defined in this article; it is assumed to be a subset
# of u5mr_med_df filtered to the countries to compare
# 1. Prepare the data
df, col_dict = prepare_data(selected_df, highest)

# 2. Expand the data
e_df, r_df = expand_data(df)

# 3. Create animation
data_anim = create_animation(e_df, r_df, col_dict, highest)

# 4. Save animation as gif
data_anim.save('Bar_chart_race_U5MR_selected_countries.gif')
Bar chart race showing under five mortality rate changes for selected countries
MP4 version available here - MP4 file for Bar chart race showing under five mortality rate changes for selected countries
Conclusion
The creation of a Bar Chart Race using Matplotlib was improved to handle the countries changing over time. The main challenge is to keep consistent bar colors for each country, and this was achieved by creating a color map for all of the countries that appear in the bar chart race. A bar chart race is an animated sequence of bar charts, so the size and number of the charts have an impact on the final size of the gif or mp4 file. The current value for each country is displayed on the bar, which makes it easier to see changes over time.
It was noted that the position of the y-axis changes to accommodate the labels for the countries. This was resolved by padding each country name on the left with white space to a common width, so that it is right-justified, and setting the tick label font to a constant-width font.
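The label padding can be illustrated in isolation. This minimal sketch uses three country names of different lengths as sample labels: str.rjust pads each string on the left with spaces, so that all labels have the same length and, in a monospace font, the same rendered width.

```python
# Sample labels of different lengths
labels = ["Chad", "Sierra Leone", "Democratic Republic of the Congo"]

# Width of the longest label
pad_to = max(len(name) for name in labels)

# rjust pads on the left with spaces, so the text is right-aligned
padded = [name.rjust(pad_to, " ") for name in labels]
for name in padded:
    print(repr(name))
```

Computing the width from all the columns of the dataset, rather than from the ten currently on the chart, is what keeps the axis fixed from frame to frame.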