Bar chart race showing changes in Covid-19 deaths
COVID-19 is the disease caused by a new coronavirus called SARS-CoV-2. The World Health Organization (WHO) first learned of the virus on 31 December 2019 and declared the coronavirus outbreak a pandemic in March 2020. This article shows how to create a bar chart race depicting the countries with the highest number of deaths from coronavirus as they change from day to day.
The data used in this article comes from Johns Hopkins University, which has made it available on GitHub. More information about COVID-19 and the coronavirus is available from Coronavirus disease (COVID-19) advice for the public.
Retrieve the data and load into dataframe
The data is available on the Johns Hopkins GitHub page. The daily deaths from coronavirus are in the time_series_covid19_deaths_global.csv file. This file can either be downloaded and loaded into a dataframe, or it can be loaded directly from GitHub as in the code below. Load and review the data.
import pandas as pd

# Raw CSV file on GitHub
deaths_path = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
deaths_df = pd.read_csv(deaths_path)

deaths_df.shape
"""
(312, 171)
"""

# First and last four rows; first seven and last three columns
deaths_df.iloc[[0,1,2,3,-4,-3,-2,-1], [0,1,2,3,4,5,6,-3,-2,-1]]
"""
    Province/State  Country/Region        Lat       Long  1/22/20  1/23/20  1/24/20  11/23/20  11/24/20  11/25/20
0              NaN     Afghanistan  33.939110  67.709953        0        0        0      1695      1712      1725
1              NaN         Albania  41.153300  20.168300        0        0        0       716       735       743
2              NaN         Algeria  28.033900   1.659600        0        0        0      2294      2309      2329
3              NaN         Andorra  42.506300   1.521800        0        0        0        76        76        76
267            NaN  Western Sahara  24.215500 -12.885800        0        0        0         1         1         1
268            NaN           Yemen  15.552727  48.516388        0        0        0       609       609       611
269            NaN          Zambia -13.133897  27.849332        0        0        0       357       357       357
270            NaN        Zimbabwe -19.015438  29.154857        0        0        0       273       274       274
"""
Group the data by Country/Region
The data in the Country/Region column is further broken down by Province/State for some countries such as China or United Kingdom. The data is grouped by Country/Region, summing the numbers to provide a single total for each country.
deaths_df[deaths_df['Country/Region'] == 'China'].head()

"""
   Province/State Country/Region      Lat      Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  11/16/20  11/17/20  11/18/20  11/19/20  11/20/20  11/21/20  11/22/20  11/23/20  11/24/20  11/25/20
58          Anhui          China  31.8257  117.2264        0        0        0        0        0        0  ...         6         6         6         6         6         6         6         6         6         6
59        Beijing          China  40.1824  116.4142        0        0        0        0        0        1  ...         9         9         9         9         9         9         9         9         9         9
60      Chongqing          China  30.0572  107.8740        0        0        0        0        0        0  ...         6         6         6         6         6         6         6         6         6         6
61         Fujian          China  26.0789  117.9874        0        0        0        0        0        0  ...         1         1         1         1         1         1         1         1         1         1
62          Gansu          China  35.7518  104.2861        0        0        0        0        0        0  ...         2         2         2         2         2         2         2         2         2         2
"""

# Group by country. The text column Province/State is dropped first so that
# only numeric columns are summed (recent pandas versions no longer drop it silently).
country_deaths_df = (deaths_df.drop(columns = ['Province/State'])
                     .groupby(by = ['Country/Region'], as_index = False)
                     .sum())
country_deaths_df.shape
"""
(191, 312)
"""
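To see what the grouping does without downloading the full file, here is a minimal sketch on a made-up four-row table laid out like the Johns Hopkins file. The names toy_df and toy_totals and all of the numbers are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Toy data (made-up numbers) in the same layout as the Johns Hopkins file
toy_df = pd.DataFrame({
    'Province/State': ['Anhui', 'Beijing', None, None],
    'Country/Region': ['China', 'China', 'Italy', 'Peru'],
    'Lat': [31.8, 40.2, 41.9, -9.2],
    'Long': [117.2, 116.4, 12.6, -75.0],
    '1/22/20': [0, 1, 0, 0],
    '1/23/20': [2, 3, 5, 1],
})

# Drop the text column, then sum every numeric column within each country
toy_totals = (toy_df.drop(columns=['Province/State'])
                    .groupby('Country/Region', as_index=False)
                    .sum())

print(toy_totals[['Country/Region', '1/22/20', '1/23/20']])
```

The two Chinese provinces collapse into a single China row. Lat and Long are summed as well, which is meaningless, so they are best ignored or dropped after this step.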
Map countries to specific colors
The chart needs a list of every country that appears in the top ten on any date in the dataset, since the top ten changes over time.
df = country_deaths_df.drop(['Lat', 'Long'], axis = 1)
countries = []
# For each date column, collect the ten countries with the most deaths
for c in df.columns[1:]:
    df_num = (df[['Country/Region', c]][df[c] > 0]
              .sort_values(by = c, ascending = False)
              .head(10))
    countries.extend(list(df_num['Country/Region']))
# Remove duplicates
countries = list(set(countries))

len(countries)
"""
24
"""
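The same idea can be sketched on a small made-up table, here keeping the top two instead of the top ten. This sketch swaps in DataFrame.nlargest for the sort_values and head combination used above, and sorts the result; toy, top_n and toy_countries are hypothetical names.

```python
import pandas as pd

# Made-up cumulative counts for five countries over three days
toy = pd.DataFrame({
    'Country/Region': ['A', 'B', 'C', 'D', 'E'],
    'day1': [5, 3, 0, 1, 0],
    'day2': [6, 9, 8, 1, 1],
    'day3': [6, 9, 12, 20, 2],
})

top_n = 2
seen = set()
for col in toy.columns[1:]:
    # Countries with at least one death, largest first, top_n only
    leaders = toy.loc[toy[col] > 0].nlargest(top_n, col)['Country/Region']
    seen.update(leaders)

# Sorting gives the same order on every run
toy_countries = sorted(seen)
print(toy_countries)
```

Country E never reaches the top two, so it is left out. Sorting is a design choice rather than a requirement: the order of list(set(...)) can differ between Python sessions, which would change which color each country receives from one run to the next.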
A fixed color is mapped to each of these 24 countries so that a country keeps the same color throughout the race. This is necessary because each bar chart in the race is drawn independently of the previous one, and the default color cycle would simply reuse the same ten colors by bar position. Display the 24 countries with their assigned color.
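import matplotlib.pyplot as plt

# Combine three qualitative palettes and remove the grey from Set3
cols = list(plt.cm.Dark2.colors) + list(plt.cm.Set3.colors) + list(plt.cm.tab10.colors)
cols.remove(plt.cm.Set3.colors[8])
# Pair each country with one color
color_dict = dict(zip(countries, cols))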
Colors for countries with highest deaths from Covid-19
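The code that draws the color key is not shown in the article. One possible way to do it, as a rough sketch, is a horizontal bar chart with one equal-length bar per country. The helper plot_color_key and the three-country mapping below are hypothetical names introduced only for illustration.

```python
import matplotlib.pyplot as plt

def plot_color_key(color_map):
    """Draw one labelled swatch per entry of color_map and return the figure."""
    names = list(color_map)
    fig, ax = plt.subplots(figsize=(4, 0.3 * len(names) + 0.5))
    ax.barh(y=range(len(names)),
            width=[1] * len(names),
            color=[color_map[n] for n in names],
            tick_label=names)
    ax.invert_yaxis()      # first entry at the top
    ax.set_xticks([])      # bar length carries no meaning here
    for spine in ax.spines.values():
        spine.set_visible(False)
    return fig

# Hypothetical three-country mapping, built the same way as color_dict
palette = list(plt.cm.Dark2.colors) + list(plt.cm.Set3.colors)
demo_names = ['Brazil', 'Italy', 'US']
demo_colors = dict(zip(demo_names, palette))
fig = plot_color_key(demo_colors)
```

Passing the article's color_dict instead of demo_colors, followed by plt.show(), would display all 24 countries.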
Create bar chart for top 10 countries for latest day
Extract the data for the country list on the latest date. Sort the data by the number of deaths and create a bar chart of the top ten countries with the highest number of deaths. Add labels to the end of each bar showing the number of deaths for each country and add a label for the date.
# Assumed definition (not shown in the original): the selected countries,
# indexed by country name, with one column per date
sel_df = df[df['Country/Region'].isin(countries)].set_index('Country/Region')

# Top ten countries on the latest date
top_df = (sel_df.iloc[:, [-1]]
          .sort_values(by = sel_df.columns[-1], ascending = False)
          .head(10))
fig, ax = plt.subplots(nrows = 1,
                       ncols = 1,
                       figsize = (10, 7),
                       facecolor = plt.cm.Blues(.2),
                       tight_layout = True)
bars = ax.barh(y = range(1, len(top_df.index) + 1),
               tick_label = top_df.index,
               width = top_df.iloc[: , 0],
               color = [color_dict[col] for col in top_df.index])
day = pd.to_datetime(top_df.columns[0]).strftime('%b %d %Y')
ax.set_title(f'Top ten countries with highest deaths from Covid-19 - {day}',
             fontsize = 'xx-large',
             fontweight = 'bold')
# Reversed limits put rank 1 at the top
ax.set_ylim(10.8, 0.2)
ax.set_facecolor(plt.cm.Blues(.2))
ax.tick_params(labelsize = 'medium')
ax.grid(True, axis = 'x', color=plt.cm.Blues(0.05))
for spine in ax.spines.values():
    spine.set_visible(False)

# Label each bar with its value
for bar in bars:
    width = bar.get_width()
    ax.annotate(f'{width:,.0f}',
                xy = (width , bar.get_y() + bar.get_height() / 2),
                xytext = (25, 0),
                textcoords = "offset points",
                fontsize = 'x-large',
                fontweight = 'bold',
                ha = 'left',
                va = 'center')

# Add large Date in bottom right on chart
ax.annotate(pd.to_datetime(top_df.columns[0]).strftime('%b %d'),
            xy = (1.05, 0.1),
            xycoords='axes fraction',
            fontsize = 40,
            fontweight = 'bold',
            ha = 'right',
            va = 'bottom')

plt.show()
Top ten countries with highest deaths from Covid-19 on November 23
Transpose the data to wide format and expand the data
The data is transposed and the date is set as the index so that each row represents the data for a particular date and can be shown in a bar chart.
# Transpose the data for the countries of interest
wide_df = sel_df.T[countries].copy()

# Remove the column name
wide_df.rename_axis(None, axis=1, inplace=True)

# Set index to datetime
wide_df.index = pd.to_datetime([f"{x}" for x in wide_df.index])

wide_df.iloc[[0,1,2,3,-4,-3,-2,-1], [0,1,2,3,-3,-2,-1]]
"""
            United Kingdom  Netherlands   Iran  Italy      US  Brazil   Peru
2020-01-22               0            0      0      0       0       0      0
2020-01-23               0            0      0      0       0       0      0
2020-01-24               0            0      0      0       0       0      0
2020-01-25               0            0      0      0       0       0      0
2020-11-21           54721         8946  44327  49261  255946  168989  35549
2020-11-22           55120         8967  44802  49823  256866  169183  35549
2020-11-23           55327         9021  45255  50453  257779  169485  35595
2020-11-24           55935         9111  45738  51306  259925  170115  35641
"""
Expand the data to plot a smooth transition so the bars on the chart do not jump around, but slide smoothly.
# Insert a row every 4 hours; the new rows are empty (NaN) for now
expanded_df = wide_df.asfreq('4h')

expanded_df.shape
"""
(1843, 24)
"""

expanded_df.iloc[-8:, [0,1,2,3,-3,-2,-1]]
"""
                     United Kingdom  Netherlands     Iran    Italy        US    Brazil     Peru
2020-11-22 20:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-23 00:00:00         55327.0       9021.0  45255.0  50453.0  257779.0  169485.0  35595.0
2020-11-23 04:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-23 08:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-23 12:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-23 16:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-23 20:00:00             NaN          NaN      NaN      NaN       NaN       NaN      NaN
2020-11-24 00:00:00         55935.0       9111.0  45738.0  51306.0  259925.0  170115.0  35641.0
"""
Create a ranking dataset that ranks the countries in each row. The rank is used to position each bar on the chart, while the color mapping keeps the color consistent for each country as its position changes.
# Rank the countries within each row; the highest count gets rank 1
rank_df = expanded_df.rank(axis = 1, method = 'first', ascending = False)

rank_df.shape
"""
(1843, 24)
"""

rank_df.iloc[-8:, [0,1,2,3,-3,-2,-1]]
"""
                     United Kingdom  Netherlands  Iran  Italy   US  Brazil  Peru
2020-11-22 20:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-23 00:00:00             5.0         16.0   8.0    6.0  1.0     2.0  11.0
2020-11-23 04:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-23 08:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-23 12:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-23 16:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-23 20:00:00             NaN          NaN   NaN    NaN  NaN     NaN   NaN
2020-11-24 00:00:00             5.0         16.0   8.0    6.0  1.0     2.0  11.0
"""
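A small sketch on made-up numbers shows how the row-wise ranking behaves, including a tie. The names toy and toy_rank are illustrative only.

```python
import pandas as pd

# Made-up counts for three countries on two days; Italy and Peru tie on day one
toy = pd.DataFrame({'US': [10, 50], 'Italy': [30, 40], 'Peru': [30, 5]},
                   index=pd.to_datetime(['2020-03-01', '2020-03-02']))

# axis=1 ranks across the columns of each row; ascending=False gives the
# largest value rank 1; method='first' gives tied values distinct ranks
toy_rank = toy.rank(axis=1, method='first', ascending=False)
print(toy_rank)
```

With method='first' the tied countries still receive different ranks, so the two bars do not land on the same position on the chart.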
Interpolate the results to create a smooth transition from one day to the next. Add incremental values every 4 hours between the given values.
# Linear interpolation fills the 4-hourly gaps in both values and ranks
expanded_df = expanded_df.interpolate()
rank_df = rank_df.interpolate()

expanded_df.iloc[-8:, [0,1,2,3,-3,-2,-1]]
"""
                     United Kingdom  Netherlands     Iran         Italy             US         Brazil          Peru
2020-11-22 20:00:00    55292.500000       9012.0  45179.5  50348.000000  257626.833333  169434.666667  35587.333333
2020-11-23 00:00:00    55327.000000       9021.0  45255.0  50453.000000  257779.000000  169485.000000  35595.000000
2020-11-23 04:00:00    55428.333333       9036.0  45335.5  50595.166667  258136.666667  169590.000000  35602.666667
2020-11-23 08:00:00    55529.666667       9051.0  45416.0  50737.333333  258494.333333  169695.000000  35610.333333
2020-11-23 12:00:00    55631.000000       9066.0  45496.5  50879.500000  258852.000000  169800.000000  35618.000000
2020-11-23 16:00:00    55732.333333       9081.0  45577.0  51021.666667  259209.666667  169905.000000  35625.666667
2020-11-23 20:00:00    55833.666667       9096.0  45657.5  51163.833333  259567.333333  170010.000000  35633.333333
2020-11-24 00:00:00    55935.000000       9111.0  45738.0  51306.000000  259925.000000  170115.000000  35641.000000
"""
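The expand-then-interpolate step can be followed by hand on two made-up days. In this sketch the names daily, expanded and smooth are hypothetical, and the numbers are chosen so the steps are round.

```python
import pandas as pd

# Two made-up daily totals
daily = pd.DataFrame({'US': [100.0, 160.0], 'Italy': [40.0, 46.0]},
                     index=pd.to_datetime(['2020-04-01', '2020-04-02']))

# Step 1: one row every 4 hours; the five new rows hold NaN
expanded = daily.asfreq('4h')

# Step 2: linear interpolation fills the NaN rows in equal steps
smooth = expanded.interpolate()
print(smooth)
```

One day becomes six 4-hour steps, so the US column climbs by 10 per row and the Italy column by 1 per row. Each of those rows becomes one frame of the animation.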
Remove duplicate ranks so that one country does not hide another when they have the exact same rank.
# Remove any duplicate ranks from the same row: nudge every repeated rank
# by 1% and re-check until each row holds only distinct ranks
dups = rank_df.apply(pd.Series.duplicated, axis=1)
while dups.any(axis = None):
    rank_df = rank_df.where(~dups, rank_df*1.01)
    dups = rank_df.apply(pd.Series.duplicated, axis=1)
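Duplicate ranks typically appear when two countries swap places: halfway through the interpolation both sit on the same value. A minimal sketch with made-up ranks shows the effect of the nudge; the dataframe name ranks is an assumption for illustration.

```python
import pandas as pd

# A moves from rank 1 to 2 while B moves from 2 to 1; they collide at 1.5
ranks = pd.DataFrame({'A': [1.0, 1.5, 2.0],
                      'B': [2.0, 1.5, 1.0],
                      'C': [3.0, 3.0, 3.0]})

# duplicated() flags the second and later occurrences within each row
dup = ranks.apply(pd.Series.duplicated, axis=1)
while dup.any(axis=None):
    # Keep unflagged ranks, multiply flagged ones by 1.01
    ranks = ranks.where(~dup, ranks * 1.01)
    dup = ranks.apply(pd.Series.duplicated, axis=1)

print(ranks)
```

In the middle row A stays at 1.5 and B moves to 1.515, so the two bars are drawn slightly apart instead of exactly on top of each other.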
Display a sample bar chart from the expanded dataframe using the ranking to order the bars. Add annotations to each bar to display the number of deaths. A large display of the current day is added in the animation itself, which makes the animation easier to follow.
fig, ax = plt.subplots(nrows = 1,
                       ncols = 1,
                       figsize = (10, 7),
                       facecolor = plt.cm.Blues(.2),
                       tight_layout = True)
# Longest country name, used to pad the tick labels to equal width
p = expanded_df.columns.map(len).max()
bar_num = 10
# Row of the expanded dataframe to display
i = 1117
# Columns (countries) ranked in the top bar_num on row i
top_mask = list(rank_df.iloc[i] <= bar_num)
sel_df = expanded_df.iloc[:, top_mask]
bars = ax.barh(y = rank_df.iloc[:, top_mask].iloc[i],
               tick_label = [x.rjust(p, ' ') for x in sel_df.columns],
               width = sel_df.iloc[i],
               color = [color_dict[col] for col in sel_df.columns],
               alpha = 0.8)

plt.setp(ax.get_xticklabels(), fontsize='small')
plt.setp(ax.get_yticklabels(), fontsize='medium', fontfamily = 'monospace')
cur_day = expanded_df.index[i].strftime('%Y-%m-%d')
ax.set_title(f'Countries with highest deaths from Covid-19 on {cur_day}',
             fontsize = 'xx-large',
             fontweight = 'bold')
ax.set_ylim(10.8, 0.2)
ax.set_facecolor(plt.cm.Blues(.2))
ax.grid(True, axis = 'x', color=plt.cm.Blues(.05))
ax.set_axisbelow(True)
for spine in ax.spines.values():
    spine.set_visible(False)

# Label each bar with its value
for bar in bars:
    width = bar.get_width()
    ax.annotate(f'{width:,.0f}',
                xy = (width , bar.get_y() + bar.get_height() / 2),
                xytext = (25, 0),
                textcoords = "offset points",
                fontsize = 'large',
                fontweight = 'bold',
                ha = 'left',
                va = 'center')

plt.show()
Countries with highest deaths from Covid-19 on sample date
Display bar charts for a sample of random dates to check that the charts are drawn with the correct ranking and that countries maintain their color.
sample_num = 5
# Random rows of the expanded data, in date order, with their ranks
d_df = expanded_df.sample(n = sample_num, random_state = 12).sort_index()
r_df = rank_df.loc[d_df.index]

fig, axs = plt.subplots(nrows = 1,
                        ncols = sample_num,
                        figsize = (15, 5),
                        facecolor = plt.cm.Blues(.2),
                        tight_layout = True)

for i, ax in enumerate(axs.flatten()):
    # Countries ranked in the top bar_num on the sampled row
    top_mask = list(r_df.iloc[i] <= bar_num)
    sel_df = d_df.iloc[:, top_mask]
    bars = ax.barh(y = r_df.iloc[:, top_mask].iloc[i],
                   tick_label = sel_df.columns,
                   width = sel_df.iloc[i],
                   color = [color_dict[col] for col in sel_df.columns],
                   alpha = 0.8)

    cur_day = d_df.index[i].strftime('%Y-%m-%d')
    ax.set_title(f'{cur_day}',
                 fontsize = 'large',
                 fontweight = 'bold')
    ax.set_ylim(10.8, 0.2)
    ax.set_facecolor(plt.cm.Blues(.2))
    ax.tick_params(labelsize = 'xx-small')
    ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)

    for bar in bars:
        width = bar.get_width()
        ax.annotate(f'{width:,.0f}',
                    xy = (width , bar.get_y() + bar.get_height() / 2),
                    xytext = (5, 0),
                    textcoords = "offset points",
                    fontsize = 'small',
                    fontweight = 'bold',
                    ha = 'left',
                    va = 'center')
plt.show()
Sample of bar charts for countries with highest deaths from Covid-19
Create the animation
The animation is created by generating a bar chart for each row of the expanded dataframe, using the rankings dataframe for the bar positions. These frames are then combined in sequence using FuncAnimation in Matplotlib, which calls the update function once per frame.
import matplotlib.animation as anim

def update(i):
    """Redraw the bar chart for row i of the expanded dataframe."""
    ax.clear()

    # Longest country name, used to pad the tick labels to equal width
    p = expanded_df.columns.map(len).max()
    bar_num = 10
    # Countries ranked in the top bar_num on row i
    top_mask = list(rank_df.iloc[i] <= bar_num)
    sel_df = expanded_df.iloc[:, top_mask]
    bars = ax.barh(y = rank_df.iloc[:, top_mask].iloc[i],
                   tick_label = [x.rjust(p, ' ') for x in sel_df.columns],
                   width = sel_df.iloc[i],
                   color = [color_dict[col] for col in sel_df.columns],
                   alpha = 0.8)
    plt.setp(ax.get_xticklabels(), fontsize='x-small')
    plt.setp(ax.get_yticklabels(), fontsize='small', fontfamily = 'monospace')

    cur_day = expanded_df.index[i].strftime('%Y-%b-%d')
    ax.set_title(f'Deaths from Covid-19 - {cur_day}',
                 fontsize = 'x-large',
                 fontweight = 'bold',
                 loc = 'center')
    ax.set_ylim(10.8, 0.2)
    ax.set_facecolor(plt.cm.Blues(.2))
    ax.grid(True, axis = 'x', color=plt.cm.Blues(.1))
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)

    for bar in bars:
        width = bar.get_width()
        ax.annotate(f'{width:,.0f}',
                    xy = (width , bar.get_y() + bar.get_height() / 2),
                    xytext = (10, 0),
                    textcoords = "offset points",
                    fontsize = 'small',
                    ha = 'left',
                    va = 'center')

    # Add large Date in bottom right on chart
    ax.annotate(expanded_df.index[i].strftime('%b %d'),
                xy = (1.25, 0.1),
                xycoords='axes fraction',
                fontsize = 40,
                fontweight = 'bold',
                ha = 'right',
                va = 'bottom')

fig, ax = plt.subplots(figsize = (8, 4),
                       facecolor = plt.cm.Blues(.2),
                       dpi = 50,
                       tight_layout = True)

# One frame per row of the expanded dataframe, 100 ms between frames
covid_anim = anim.FuncAnimation(
    fig = fig,
    func = update,
    frames = len(expanded_df),
    interval = 100)

covid_anim.save('COVID_bar_chart_race_2020.gif')
Countries with highest deaths from Covid-19
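The mechanics of FuncAnimation are easier to see on a stripped-down version. The following is a minimal sketch with two made-up series and three frames; values, ranks, colors, draw_frame and toy_anim are hypothetical names, and none of the styling from the full chart is included.

```python
import matplotlib.pyplot as plt
import matplotlib.animation as anim
import pandas as pd

# Made-up values: A overtakes B on the last frame
values = pd.DataFrame({'A': [1.0, 2.0, 4.0], 'B': [3.0, 3.0, 3.0]})
ranks = values.rank(axis=1, method='first', ascending=False)
colors = {'A': 'tab:blue', 'B': 'tab:orange'}

fig, ax = plt.subplots(figsize=(4, 2))

def draw_frame(i):
    """Clear the axes and draw the bars for row i."""
    ax.clear()
    ax.barh(y=ranks.iloc[i],
            width=values.iloc[i],
            color=[colors[c] for c in values.columns],
            tick_label=list(values.columns))
    ax.set_ylim(2.8, 0.2)   # reversed limits put rank 1 at the top
    ax.set_title(f'Frame {i}')

# FuncAnimation calls draw_frame(0), draw_frame(1), ... once per frame
toy_anim = anim.FuncAnimation(fig=fig, func=draw_frame,
                              frames=len(values), interval=200)
```

The animation object should stay assigned to a variable, as here, because Matplotlib stops the animation if the object is garbage collected. Calling toy_anim.save(...) or plt.show() then renders the frames.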
The animation can also be converted to HTML5 video or saved as MP4.
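# Convert to an HTML5 video tag, e.g. for embedding in a notebook (requires ffmpeg)
html = covid_anim.to_html5_video()

# Save as MP4 (requires ffmpeg)
covid_anim.save('COVID_bar_chart_race_2020.mp4')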
MP4 version available here - MP4 bar chart race for Countries with highest deaths from Covid-19
Conclusion
A bar chart race can be an effective way to visualise the increase in deaths from COVID-19 in different countries. There is a bit of work in creating the animated chart, which has been laid out in this article. The concept is straightforward: step through the data and display a bar chart for each day, then let Matplotlib's FuncAnimation combine the charts into an animation. Two things to keep in mind are to add intermediate data points so that the bars transition smoothly, and to maintain a consistent color for each country.