# Display COVID-19 numbers relative to population

The total number of cases of COVID-19 and deaths from COVID-19 in a country will vary based on a number of factors. It does not seem right to compare absolute numbers in different countries, where the populations of those countries varies greatly. This article will show how to take the absolute numbers and express the confirmed cases and deaths per 100,000 of the population and thus allow comparison between countries.

Data sources:

Population data is obtained from the World Bank data

## Load COVID-19 data into dataframes

There are two sets of data to load the global confirmed cases and the global deaths.

• time_series_covid19_deaths_global.csv
• time_series_covid19_confirmed_global.csv

The data can be loaded directly from the GitHub page as in the code below. The data is cleaned to remove unwanted fields and group the data by country with the following function.

 1def load_clean_covid_data(csv_path):
3
4    # 1. Drop unwanted columns
5    df.drop(['Province/State', 'Lat', 'Long'], axis = 1, inplace = True)
6
7    # 2. Group by Country
8    df = df.groupby('Country/Region').sum()
9
10    # 3. Convert column headings to datetime
11    df.columns = pd.to_datetime(df.columns)
12
13    # 4. Remove the column name
14    df.rename_axis(None, axis = 0, inplace = True)
15
16    return df


Load the data for Confirmed cases and the data for Deaths from COVID-19 into dataframes.

 1
3confirmed_df.shape
4"""
5(191, 326)
6"""
7
8confirmed_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-3,-2,-1]]
9"""
10             2020-01-22  2020-01-23  2020-01-24  2020-12-10  2020-12-11  2020-12-12
11Afghanistan           0           0           0       48053       48116       48229
12Albania               0           0           0       46061       46863       47742
13Algeria               0           0           0       90579       91121       91638
14Yemen                 0           0           0        2081        2082        2083
15Zambia                0           0           0       18091       18161       18217
16Zimbabwe              0           0           0       11081       11162       11219
17"""
18
19
21deaths_df.shape
22"""
23(191, 326)
24"""
25
26deaths_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-3,-2,-1]]
27"""
28             2020-01-22  2020-01-23  2020-01-24  2020-12-10  2020-12-11  2020-12-12
29Afghanistan           0           0           0        1935        1945        1956
30Albania               0           0           0         965         977         989
31Algeria               0           0           0        2564        2575        2584
32Yemen                 0           0           0         606         606         606
33Zambia                0           0           0         364         365         366
34Zimbabwe              0           0           0         305         306         307
35"""


## Load global population data into dataframe

World population data is available from the World Bank data. This is available to download in csv, xml or Excel format. The Excel file contains three worksheets with the population data in the first sheet Data. This is loaded into a dataframe.

1pop_xl = pd.ExcelFile("/tmp/API_SP.POP.TOTL_DS2_en_excel_v2_1740169.xls", engine = "xlrd")
2pop_xl.sheet_names
3"""
5"""

 1# Load the excel worksheet into a dataframe
3    "/tmp/API_SP.POP.TOTL_DS2_en_excel_v2_1740169.xls",
4    engine = "xlrd",
5    sheet_name = 'Data',
7
8pop_df.shape
9"""
10(264, 65)
11"""
12
13pop_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-3,-2,-1]]
14"""
15     Country Name Country Code     Indicator Name        2018        2019  2020
160           Aruba          ABW  Population, total    105845.0    106314.0   NaN
171     Afghanistan          AFG  Population, total  37172386.0  38041754.0   NaN
182          Angola          AGO  Population, total  30809762.0  31825295.0   NaN
19261  South Africa          ZAF  Population, total  57779622.0  58558270.0   NaN
20262        Zambia          ZMB  Population, total  17351822.0  17861030.0   NaN
21263      Zimbabwe          ZWE  Population, total  14439018.0  14645468.0   NaN
22"""


Extract population data for 2019 into a dataframe.

 1# Get population for countries for 2019
2pop_2019_df = pop_df[['Country Name', '2019']].copy()
3
4pop_2019_df.shape
5"""
6(264, 2)
7"""
8
9pop_2019_df.iloc[[0,1,2,3,4,-5,-4-3,-2,-1], :]
10"""
11    Country Name          2019
120          Aruba  1.063140e+05
131    Afghanistan  3.804175e+07
142         Angola  3.182530e+07
153        Albania  2.854191e+06
164        Andorra  7.714200e+04
17259       Kosovo  1.794248e+06
18257        World  7.673534e+09
19262       Zambia  1.786103e+07
20263     Zimbabwe  1.464547e+07
21"""


### Load population data for a single year

After exploring the population dataset, the code above can be simplified to just load data for a single year. The population data for the year 2019 can be loaded directly from the Excel file. The Country Code cannot be used, as it is not present in the COVID-19 data.

 1# Load country and data for 2019 from the excel worksheet into a dataframe
3    "/tmp/API_SP.POP.TOTL_DS2_en_excel_v2_1740169.xls",
4    engine = "xlrd",
5    sheet_name = 'Data',
6    usecols = ['Country Name', '2019'],
8
9pop_2019_df.shape
10"""
11(264, 2)
12"""
13
14pop_2019_df.iloc[[0,1,2,3,4,-5,-4-3,-2,-1], :]
15"""
16    Country Name          2019
170          Aruba  1.063140e+05
181    Afghanistan  3.804175e+07
192         Angola  3.182530e+07
203        Albania  2.854191e+06
214        Andorra  7.714200e+04
22259       Kosovo  1.794248e+06
23257        World  7.673534e+09
24262       Zambia  1.786103e+07
25263     Zimbabwe  1.464547e+07
26"""


## Compare countries in COVID-19 datasets with population dataset

Examine the matching of country names between population data and COVID-19 data as country codes are not used. This shows that there are 25 "Country/Region" that are not present in the population data.

 1# Countries in confirmed_df dataframe that are not in pop_2019_df
2unmatched_covid_countries = sorted(
3    list(
4        confirmed_df[
5            ~confirmed_df.index
6            .isin(list(pop_2019_df['Country Name']))
7        ].index))
8
9len(unmatched_covid_countries)
10"""
1125
12"""
13
14unmatched_covid_countries
15"""
16Bahamas                               Brunei                                Burma
17Congo (Brazzaville)                   Congo (Kinshasa)                      Czechia
18Diamond Princess                      Egypt                                 Gambia
19Holy See                              Iran                                  Korea, South
20Kyrgyzstan                            Laos                                  MS Zaandam
21Russia                                Saint Kitts and Nevis                 Saint Lucia
22Saint Vincent and the Grenadines      Slovakia                              Syria
23Taiwan*                               US                                    Venezuela
24Yemen
25"""


Four of these "Country/Region" cannot be mapped to a country in the population data. Two of these are cruise ships, Taiwan* is a disputed territory and the Holy See may be included in smal states. These are removed from the list of unmatched countries.

 1unmatched_covid_countries.remove('Diamond Princess')
2unmatched_covid_countries.remove('MS Zaandam')
3unmatched_covid_countries.remove('Holy See')
4unmatched_covid_countries.remove('Taiwan*')
5
6len(unmatched_covid_countries)
7"""
821
9"""
10
11unmatched_covid_countries
12"""
13Bahamas                               Brunei                                Burma
14Congo (Brazzaville)                   Congo (Kinshasa)                      Czechia
15Egypt                                 Gambia                                Iran
16Korea, South                          Kyrgyzstan                            Laos
17Russia                                Saint Kitts and Nevis                 Saint Lucia
18Saint Vincent and the Grenadines      Slovakia                              Syria
19US                                    Venezuela                             Yemen
20"""


The remaining "Country/Region" are manually mapped to Countries in the population dataset.

COVID-19 Country Name Population Country Name
Bahamas Bahamas, The
Brunei Brunei Darussalam
Burma Myanmar
Congo (Brazzaville) Congo, Rep.
Congo (Kinshasa) Congo, Dem. Rep.
Czechia Czech Republic
Egypt Egypt, Arab Rep.
Gambia Gambia, The
Iran Iran, Islamic Rep.
Korea, South Korea, Rep.
Kyrgyzstan Kyrgyz Republic
Laos Lao PDR
Russia Russian Federation
Saint Kitts and Nevis St. Kitts and Nevis
Saint Lucia St. Lucia
Slovakia Slovak Republic
Syria Syrian Arab Republic
US United States
Venezuela Venezuela, RB
Yemen Yemen, Rep.

## Update country names in population dataset

A new column is added to the population dataset to contain the country name from the COVID-19 dataset if it is different or the original country if there is no difference. This could update the country name in place, but it is better to have the two names for verification.

 1map_pop = ['Bahamas, The', 'Brunei Darussalam', 'Myanmar', 'Congo, Rep.', 'Congo, Dem. Rep.',
2           'Czech Republic', 'Egypt, Arab Rep.', 'Gambia, The', 'Iran, Islamic Rep.',
3           'Korea, Rep.', 'Kyrgyz Republic', 'Lao PDR', 'Russian Federation', 'St. Kitts and Nevis',
4           'St. Lucia', 'St. Vincent and the Grenadines', 'Slovak Republic', 'Syrian Arab Republic',
5           'United States', 'Venezuela, RB', 'Yemen, Rep.'
6          ]
7
8country_map = {}
9for c,p in zip(unmatched_covid_countries, map_pop):
10    country_map[p] = c
11
12pop_2019_df['Covid_Country'] = pop_2019_df.apply(
13    lambda x: country_map[x['Country Name']]
14        if x['Country Name'] in country_map
15        else x['Country Name'],
16    axis = 1)
17
18pop_2019_df.iloc[[1,2,3,21,29,249], :]
19"""
20          Country Name         2019 Covid_Country
211          Afghanistan   38041754.0   Afghanistan
222               Angola   31825295.0        Angola
233              Albania    2854191.0       Albania
2421        Bahamas, The     389482.0       Bahamas
2529   Brunei Darussalam     433285.0        Brunei
26249      United States  328239523.0            US
27"""


Confirm that there are no unexpected countries in covid data that are not in population data.

 1list(confirmed_df[
2    ~confirmed_df.index
3    .isin(list(pop_2019_df['Covid_Country']))
4].index)
5"""
6['Diamond Princess', 'Holy See', 'MS Zaandam', 'Taiwan*']
7"""
8
9list(deaths_df[
10    ~deaths_df.index
11    .isin(list(pop_2019_df['Covid_Country']))
12].index)
13"""
14['Diamond Princess', 'Holy See', 'MS Zaandam', 'Taiwan*']
15"""


## Calculate cases and deaths per 100,000 of population

Merge the population numbers for 2019 into the COVID datasets. Rename the column name for 2019 to Population. This is done for the Confirmed cases and the Deaths datasets.

 1# Merge Population data
2confirmed_pop_df = confirmed_df.merge(
3    pop_2019_df,
4    left_index = True,
5    left_on = confirmed_df.index,
6    right_on = 'Covid_Country',
7)
8
9# Rename '2019' column name to 'population'
10confirmed_pop_df = confirmed_pop_df.rename(columns={'2019': 'Population'})
11
12confirmed_pop_df.shape
13"""(187, 329)
14"""
15
16confirmed_pop_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-4,-3,-2,-1]]
17"""
18(187, 329)
19     2020-01-22 00:00:00  2020-01-23 00:00:00  2020-01-24 00:00:00  2020-12-12 00:00:00 Country Name  Population Covid_Country
201                      0                    0                    0                48229  Afghanistan  38041754.0   Afghanistan
213                      0                    0                    0                47742      Albania   2854191.0       Albania
2258                     0                    0                    0                91638      Algeria  43053054.0       Algeria
23260                    0                    0                    0                 2083  Yemen, Rep.  29161922.0         Yemen
24262                    0                    0                    0                18217       Zambia  17861030.0        Zambia
25263                    0                    0                    0                11219     Zimbabwe  14645468.0      Zimbabwe
26"""


 1# Merge Population data
2deaths_pop_df = deaths_df.merge(
3    pop_2019_df,
4    left_index = True,
5    left_on = deaths_df.index,
6    right_on = 'Covid_Country',
7)
8
9# Rename '2019' column name to 'population'
10deaths_pop_df = deaths_pop_df.rename(columns={'2019': 'Population'})
11
12deaths_pop_df.shape
13"""(187, 329)
14"""
15
16deaths_pop_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-4,-3,-2,-1]]
17"""
18     2020-01-22 00:00:00  2020-01-23 00:00:00  2020-01-24 00:00:00  2020-12-12 00:00:00 Country Name  Population Covid_Country
191                      0                    0                    0                 1956  Afghanistan  38041754.0   Afghanistan
203                      0                    0                    0                  989      Albania   2854191.0       Albania
2158                     0                    0                    0                 2584      Algeria  43053054.0       Algeria
22260                    0                    0                    0                  606  Yemen, Rep.  29161922.0         Yemen
23262                    0                    0                    0                  366       Zambia  17861030.0        Zambia
24263                    0                    0                    0                  307     Zimbabwe  14645468.0      Zimbabwe
25"""


To adjust the numbers of confirmed cases and deaths to 100,000 of the population, divide the number by the population and multiply by 100,000.

 1# Confirmed cases per 100,000 of population
2confirmed_per_100k_df = confirmed_pop_df.copy()
3col_names = list(confirmed_per_100k_df.columns)
4[(col_names.remove(x)) for x in ['Country Name', 'Population', 'Covid_Country']]
5confirmed_per_100k_df[col_names] = (
6    confirmed_per_100k_df[col_names]
7    .div(confirmed_per_100k_df['Population'], axis = 0)
8    .mul(100000, axis = 0)
9)
10
11confirmed_per_100k_df.shape
12"""
13(187, 329)
14"""
15
16confirmed_per_100k_df.iloc[[0,1,2,-3,-2,-1], [0,50,100,150,-4,-3,-2,-1]]
17"""
18     2020-01-22 00:00:00  2020-03-12 00:00:00  2020-05-01 00:00:00  2020-06-20 00:00:00  2020-12-12 00:00:00 Country Name  Population  Covid_Country
191                    0.0             0.031544             6.022330            74.728416           126.779117  Afghanistan  38041754.0    Afghanistan
203                    0.0             0.805833            27.398307            66.253450          1672.698148      Albania   2854191.0        Albania
2158                   0.0             0.055745             9.648561            27.015505           212.849012      Algeria  43053054.0        Algeria
22260                  0.0             0.000000             0.024004             3.161657             7.142876  Yemen, Rep.  29161922.0          Yemen
23262                  0.0             0.000000             0.610267             8.006257           101.992998       Zambia  17861030.0         Zambia
24263                  0.0             0.000000             0.273122             3.270636            76.603902     Zimbabwe  14645468.0       Zimbabwe
25"""


1# Deaths per 100,000 of population
2deaths_per_100k_df = deaths_pop_df.copy()
3col_names = list(deaths_per_100k_df.columns)
4[(col_names.remove(x)) for x in ['Country Name', 'Population', 'Covid_Country']]
5deaths_per_100k_df[col_names] = (
6    deaths_per_100k_df[col_names]
7    .div(deaths_per_100k_df['Population'], axis = 0)
8    .mul(100000, axis = 0)
9)


## Countries with highest confirmed cases per 100,000 of population

Countries with populations of less than one million were removed from the dataset. The data is sorted by the numbers of confirmed cases per 100,000 on the latest date and the data for the top 20 countries displayed in a table and bar chart. The changes over time for the top 10 countries is displayed in a line chart.

 1confirmed_per_100k_df = confirmed_per_100k_df[confirmed_per_100k_df['Population'] > 1000000]
2
3top_20_confirmed_df = (confirmed_per_100k_df.sort_values(
4    by = confirmed_per_100k_df.columns[-4], ascending = False)
6               'Population',
7               pd.to_datetime('2020-12-12 00:00:00') ]])
8
9# Merge Confirmed data
10top_20_confirmed_df = top_20_confirmed_df.merge(
11    confirmed_df.iloc[:, [-1]],
12    right_index = True,
13    left_on = top_20_confirmed_df['Covid_Country'],
14    right_on = confirmed_df.index,
15)
16
17# Reset the index
18top_20_confirmed_df.reset_index(drop = True, inplace = True)
19
20# Drop the merged key
21top_20_confirmed_df = top_20_confirmed_df.drop(['key_0'], axis = 1)
22
23# Rename the column names
24top_20_confirmed_df = top_20_confirmed_df.rename(
25    columns = {'Covid_Country': 'Country',
26               '2020-12-12 00:00:00_x': 'Cases per 100k',
27               '2020-12-12 00:00:00_y': 'Total cases'})


Table: The top 20 countries with highest cases of COVID-19 on November 12 2020

Rank Country Population Cases per 100k Total cases
1 Bahrain 1,641,172 5,421 88,965
2 Czechia 10,669,709 5,393 575,422
3 Belgium 11,484,055 5,252 603,159
4 Georgia 3,720,382 5,027 187,006
5 Armenia 2,957,731 4,981 147,312
6 Qatar 2,832,067 4,973 140,827
7 US 328,239,523 4,893 16,062,299
8 Moldova 2,657,637 4,731 125,723
9 Slovenia 2,087,946 4,573 95,481
10 Panama 4,246,439 4,488 190,585
11 Switzerland 8,574,832 4,360 373,831
12 Croatia 4,067,500 4,241 172,523
13 Israel 9,053,300 3,930 355,786
14 Serbia 6,944,975 3,764 261,437
15 Spain 47,076,781 3,676 1,730,575
16 Austria 8,877,067 3,603 319,822
17 France 67,059,887 3,587 2,405,255
18 Netherlands 17,332,850 3,540 613,630
19 North Macedonia 2,083,459 3,505 73,025
20 Kuwait 4,207,083 3,471 146,044
 1top_10 = top_20_confirmed_df.head(10)
2bg_color = 'rgba(208, 225, 242, 1.0)'
3
4fig = go.Figure(
5    data = [go.Bar(x = list(top_10['Cases per 100k']),
6                   y = list(top_10['Country']),
7                   hovertemplate = [f"""<b>{r['Country']}</b>: Confirmed Cases<br>
8<br>Per 100k: {r['Cases per 100k']:,.0F}
9<br>Total: {r['Total cases']:,.0f}
10<br>Population: {r['Population']:,.0f}
11<extra></extra>
12""" for i,r in top_10.iterrows()],
13                   orientation = 'h')]
14)
15
16fig.update_layout(
17    # Set default font
18        font = dict(
19        family = "Droid Sans",
20        size = 16
21    ),
22    # Set figure title
23    title = dict(
24        text = f"""<b>Countries with highest confirmed cases of COVID-19 per 100,000
25<BR>{pd.to_datetime('2020-12-12 00:00:00').strftime('%b %d %Y')}</b>""",
26        xref = 'container',
27        yref = 'container',
28        x = 0.5,
29        y = 0.91,
30        xanchor = 'center',
31        yanchor = 'middle',
32        font = dict(family = 'Droid Sans', size = 24)
33    ),
34    # set x-axis
35    xaxis = dict(
36        title = dict(
37            text = 'Number of confirmed cases per 100,000 of population',
38            font = dict(family = 'Droid Sans', size = 18)
39        ),
40        showgrid = False,
41        linecolor = bg_color,
42        linewidth = 2,
43        showticklabels = True,
44    ),
45    # set y-axis
46    yaxis = dict(
47        showgrid = False,
48        linecolor = bg_color,
49        linewidth = 4,
50    ),
51    # set the plot bacground color
52    plot_bgcolor = bg_color,
53    # set the hover background color
54    hoverlabel = dict(
55        bgcolor = 'rgba(75, 152, 201, 0.2)',
56        font_size = 16
57    ),
58    paper_bgcolor = bg_color,
59)
60
61fig.update_traces(marker_color = 'rgb(75, 152, 201)')
62fig['layout']['yaxis']['autorange'] = "reversed"
63
64fig.show()


Highest confirmed cases of COVID-19 per 100,000 of population

 1top_10_confirmed_df = (confirmed_per_100k_df.sort_values(
2    by = confirmed_per_100k_df.columns[-4], ascending = False)
4
5top_10 = (top_10_confirmed_df
6          .drop(['Country Name', 'Population'], axis = 1)
7          .set_index('Covid_Country')
8          .T)
9
10bg_color = 'rgba(208, 225, 242, 1.0)'
11line_color = 'rgba(75, 152, 201, 1.0)'
12grid_color = 'rgba(75, 152, 201, 0.3)'
13
14fig = go.Figure()
15for c in top_10.columns:
17        x = top_10.index,
18        y = top_10[c],
19        name = c,
20        hovertemplate = [f"""<b>{c}</b><br>
21<br>Cases per 100k:<br>{r[c]:,.0F}
22<extra></extra>
23""" for i,r in top_10.iterrows()]
24    ))
25
26fig.update_traces(
27     hoverinfo = 'text+name',
28    mode = 'lines'
29)
30
31fig.update_layout(
32    # Set figure title
33    title = dict(
34        text = f"""<b>Highest confirmed cases of COVID-19 per 100,000</b>\
35<br>{top_10.index[-1].strftime('%b %d %Y')}""",
36        xref = 'container',
37        yref = 'container',
38        x = 0.5,
39        y = 0.9,
40        xanchor = 'center',
41        yanchor = 'middle',
42        font = dict(family = 'Droid Sans', size = 28)
43    ),
44    # set legend
45    legend = dict(
46        orientation = 'v',
47        traceorder = 'normal',
48        font_size = 12,
49        x = 0.05,
50        y = 1.0,
51        xanchor = "left",
52        yanchor = "top"
53    ),
54    # set x-axis
55    xaxis = dict(
56        title = 'Date',
57        linecolor = line_color,
58        linewidth = 2,
59        gridcolor = grid_color,
60        showticklabels = True,
61        ticks = 'outside',
62    ),
63    # set y-axis
64    yaxis = dict(
65        title = 'Cases of COVID-19 per 100.000 of population',
66        rangemode = 'tozero',
67        linecolor = line_color,
68        linewidth = 2,
69        gridcolor = grid_color,
70        showticklabels = True,
71        ticks = 'outside',
72    ),
73    # set the plot background color
74    plot_bgcolor = bg_color,
75    paper_bgcolor = bg_color,
76)
77
78fig.show()


Changes in countries with highest confirmed cases of COVID-19 per 100,000 of population

## Countries with highest deaths per 100,000 of population

Similar table and charts are created for the Deaths from COVID-19 data.

Table: The top 20 countries with highest deaths per 100,000 from COVID-19 on November 12 2020

Rank Country Population Deaths per 100k Total deaths
1 Belgium 11,484,055 154.9 17,792
2 Peru 32,510,453 112.4 36,544
3 Italy 60,297,396 106.2 64,036
4 Spain 47,076,781 101.2 47,624
5 North Macedonia 2,083,459 100.6 2,096
6 Bosnia and Herzegovina 3,301,000 99.9 3,298
7 Slovenia 2,087,946 97.8 2,041
8 United Kingdom 66,834,405 95.9 64,123
9 Moldova 2,657,637 95.8 2,547
10 US 328,239,523 90.7 297,818
11 Argentina 44,938,712 90.5 40,668
12 Mexico 127,575,529 89.1 113,704
13 Czechia 10,669,709 88.6 9,450
14 France 67,059,887 86.0 57,671
15 Brazil 211,049,527 85.8 181,123
16 Chile 18,952,038 83.6 15,846
17 Armenia 2,957,731 83.2 2,462
18 Bulgaria 6,975,761 80.7 5,626
20 Panama 4,246,439 78.4 3,331

Highest deaths from COVID-19 per 100,000 of population

Changes in countries with highest deaths from COVID-19 per 100,000 of population

## Conclusion

Calculations can be applied to a whole dataset easily with Pandas, such as converting the total numbers of COVID-19 cases to numbers relative to a standard number. The challenge here was getting the population, which is available from the World Data Bank, but the names for countries are inconsistent. Once these are mapped to the countries, the population is merged in for each country. The div and mul dataframe functions are used to adjust the numbers relative to 100,000 of the population. The countries with highest relative numbers are not necessarily the same countries with the highest absolute numbers.

Plotly is an open-source graphing library for Python that produces interactive charts.