Daily counts and rolling average of COVID-19 numbers

The daily counts of confirmed cases or deaths from COVID-19 can fluctuate greatly from day to day. This can be down to time factors such as week day vs weekend or delays in testing or gathering and collating results. A rolling average is a calculated average of different subsets of the full data set. It is used to smooth out fluctuations in the data.

Data sources:
The data used in this article is retrieved from the Johns Hopkins University who have made the data available on GitHub. More information about COVID-19 and the coronavirus is available from Coronavirus disease (COVID-19) advice for the public.



Load COVID-19 data into dataframes

There are two sets of data to load the global confirmed cases and the global deaths.

  • time_series_covid19_deaths_global.csv
  • time_series_covid19_confirmed_global.csv

The data can be loaded directly from the GitHub page as in the code below. The data is cleaned to remove unwanted fields and group the data by country with the following function.

 1def load_clean_covid_data(csv_path):
 2    df = pd.read_csv(csv_path)
 3
 4    # 1. Drop unwanted columns
 5    df.drop(['Province/State', 'Lat', 'Long'], axis = 1, inplace = True)
 6
 7    # 2. Group by Country
 8    df = df.groupby('Country/Region').sum()
 9
10    # 3. Convert column headings to datetime
11    df.columns = pd.to_datetime(df.columns)
12
13    # 4. Remove the column name
14    df.rename_axis(None, axis = 0, inplace = True)
15
16    return df

Load the data for Confirmed cases and the data for Deaths from COVID-19 into dataframes.

 1
 2confirmed_df = load_clean_data('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
 3confirmed_df.shape
 4"""
 5(191, 333)
 6"""
 7
 8confirmed_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-3,-2,-1]]
 9"""
10             2020-01-22  2020-01-23  2020-01-24  2020-12-17  2020-12-18  2020-12-19  
11Afghanistan           0           0           0       49378       49621       49681  
12Albania               0           0           0       51424       52004       52542  
13Algeria               0           0           0       93933       94371       94781  
14Yemen                 0           0           0        2087        2087        2087  
15Zambia                0           0           0       18504       18575       18620  
16Zimbabwe              0           0           0       11866       12047       12151  
17"""
18
19
20deaths_df = load_clean_data('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
21deaths_df.shape
22"""
23(191, 333)
24"""
25
26deaths_df.iloc[[0,1,2,-3,-2,-1], [0,1,2,-3,-2,-1]]
27"""
28             2020-01-22  2020-01-23  2020-01-24  2020-12-17  2020-12-18  2020-12-19  
29Afghanistan           0           0           0        2025        2030        2047  
30Albania               0           0           0        1055        1066        1074  
31Algeria               0           0           0        2640        2647        2659  
32Yemen                 0           0           0         606         606         606  
33Zambia                0           0           0         369         373         373  
34Zimbabwe              0           0           0         314         316         318  
35"""


Get the top 10 countries

Get the names of the top ten countries based on the highest deaths for the latest date.

 1top_countries = list(deaths_df
 2                 .iloc[:, -1]
 3                 .sort_values(ascending = False)
 4                 .head(10)
 5                 .index)
 6
 7top_countries
 8"""
 9['US',
10 'Brazil',
11 'India',
12 'Mexico',
13 'Italy',
14 'United Kingdom',
15 'France',
16 'Iran',
17 'Russia',
18 'Spain']
19"""


Display the deaths for a country over time

The data in time_series_covid19_deaths_global.csv is the cumulative number of deaths from COVID-19 recorded in each country every day. This is displayed in a line chart.

 1c = 'US'
 2data = deaths_df.loc[c]
 3date = deaths_df.columns[-1].strftime('%b %d %Y')
 4
 5bg_color = 'rgba(208, 225, 242, 1.0)'
 6line_color = 'rgba(75, 152, 201, 1.0)'
 7grid_color = 'rgba(75, 152, 201, 0.3)'
 8
 9fig = go.Figure()
10fig.add_trace(go.Scatter(
11    x = data.index,
12    y = data,
13    name = 'US',
14    hovertemplate = [f"""<b>{c}</b><br>
15<br>Deaths: {x:,.0F}
16<extra></extra>
17""" for x in data] ))
18
19fig.update_layout(
20    # Set figure title
21    title = dict(
22        text = f'Deaths from COVID-19 in <b>"{c}"</b> <br> {date}',
23        xref = 'container',
24        yref = 'container',
25        x = 0.5,
26        y = 0.9,
27        xanchor = 'center',
28        yanchor = 'middle',
29        font = dict(family = 'Droid Sans', size = 28)
30    ),
31    # set x-axis
32    xaxis = dict(
33        title = 'Date',
34        linecolor = line_color,
35        linewidth = 2,
36        gridcolor = grid_color,
37        showticklabels = True,
38        ticks = 'outside',
39    ),
40    # set y-axis
41    yaxis = dict(
42        title = 'Number of deaths',
43        rangemode = 'tozero',
44        linecolor = line_color,
45        linewidth = 2,
46        gridcolor = grid_color,
47        showticklabels = True,
48        ticks = 'outside',
49    ),
50    # set the plot bacground color
51    plot_bgcolor = bg_color,
52    paper_bgcolor = bg_color,
53)
54
55fig.show()

Total deaths from COVID-19 in US - December 19 2020



Display daily counts for a single country

Reduce the dataset by selecting the data for the top 10 countries. The daily count is easily got using pandas.DataFrame.diff function. This calculates the diference of data elements from another element in the dataframe. The default is to calculate the difference between the element in the previous row, but an axis parameter of 1 or columns can be used to calculate the difference between the element in the previous column.

 1top_df = deaths_df.loc[top_countries]
 2top_daily_df = top_df.diff(axis=1)
 3
 4top_daily_df.iloc[:, -10:]
 5"""
 6                2020-12-10  2020-12-11  2020-12-12  2020-12-13  2020-12-14  2020-12-15  2020-12-16  2020-12-17  2020-12-18  2020-12-19  
 7US                  2920.0      3283.0      2354.0      1389.0      1484.0      2984.0      3668.0      3345.0      2814.0      2571.0  
 8Brazil               770.0       672.0       686.0       279.0       433.0       964.0       936.0      1092.0       823.0       706.0  
 9India                413.0       443.0       391.0       336.0       354.0       387.0       355.0       338.0       347.0       341.0  
10Mexico               671.0       693.0       685.0       249.0       345.0       801.0       670.0       718.0       762.0       627.0  
11Italy                887.0       761.0       649.0       484.0       491.0       846.0       680.0       683.0       674.0       553.0  
12United Kingdom       516.0       424.0       520.0       144.0       233.0       506.0       612.0       532.0       490.0       537.0  
13France               296.0       628.0        -1.0       344.0       376.0       791.0       290.0       261.0       612.0       189.0  
14Iran                 284.0       231.0       222.0       247.0       251.0       223.0       213.0       212.0       178.0       175.0  
15Russia               549.0       601.0       553.0       481.0       442.0       564.0       584.0       574.0       602.0       574.0  
16Spain                325.0       280.0         0.0         0.0       389.0       388.0       195.0       181.0       149.0         0.0  
17"""

Table: Deaths from COVID-19 in previous 7 days in countries with highest deaths on December 19 2020

Country Dec 13 Dec 14 Dec 15 Dec 16 Dec 17 Dec 18 Dec 19
US 1,389 1,484 2,984 3,668 3,345 2,814 2,571
Brazil 279 433 964 936 1,092 823 706
India 336 354 387 355 338 347 341
Mexico 249 345 801 670 718 762 627
Italy 484 491 846 680 683 674 553
United Kingdom 144 233 506 612 532 490 537
France 344 376 791 290 261 612 189
Iran 247 251 223 213 212 178 175
Russia 481 442 564 584 574 602 574
Spain 0 389 388 195 181 149 0

The daily counts is plotted for a single country.

 1c = 'US'
 2data = top_daily_df.loc[c]
 3date = top_daily_df.columns[-1].strftime('%b %d %Y')
 4
 5bg_color = 'rgba(208, 225, 242, 1.0)'
 6line_color = 'rgba(75, 152, 201, 1.0)'
 7grid_color = 'rgba(75, 152, 201, 0.3)'
 8
 9fig = go.Figure()
10fig.add_trace(go.Scatter(
11    x = data.index,
12    y = data,
13    name = 'US',
14    hovertemplate = [f"""<b>{c}</b><br>
15<br>Deaths: {x:,.0F}
16<extra></extra>
17""" for x in data] ))
18
19fig.update_layout(
20    # Set figure title
21    title = dict(
22        text = f'Daily deaths from COVID-19 in <b>"{c}"</b> <br> {date}',
23        xref = 'container',
24        yref = 'container',
25        x = 0.5,
26        y = 0.9,
27        xanchor = 'center',
28        yanchor = 'middle',
29        font = dict(family = 'Droid Sans', size = 28)
30    ),
31    # set x-axis
32    xaxis = dict(
33        title = 'Date',
34        linecolor = line_color,
35        linewidth = 2,
36        gridcolor = grid_color,
37        showticklabels = True,
38        ticks = 'outside',
39    ),
40    # set y-axis
41    yaxis = dict(
42        title = 'Number of deaths',
43        rangemode = 'tozero',
44        linecolor = line_color,
45        linewidth = 2,
46        gridcolor = grid_color,
47        showticklabels = True,
48        ticks = 'outside',
49    ),
50    # set the plot bacground color
51    plot_bgcolor = bg_color,
52    paper_bgcolor = bg_color,
53)
54
55fig.show()

Daily deaths from COVID-19 in US - December 19 2020



Display rolling average on daily counts

The daily counts fluctuate greatly from day to day and it can be difficult to see the overall trend. A rolling average calculates an average of different subsets of the full data set. It is used to smooth out fluctuations in the data. In this case the pandas.Series.rolling function is used to create a number of rolling average to determine the best value to use.

Here is an example of calculating the rolling average of 3 days for the latest 20 days on the number of deaths from COVID-19 in the US. The chart is used to determine the best number of days to set for the rolling window. Rolling window sizes of 3, 5, 7 and 10 were compared. The different results can be switched on and off in the Plotly chart and areas of the chart can be zoomed in on. These results show that 7 is the best number to select for the rolling window size. This is likely due to daily fluctuations based on weekday that are smoothed out when taking the week on average.

 1c = 'US'
 2data = top_daily_df.loc[c].tail(20)
 3rolling = data.rolling(3).mean()
 4
 5data
 6"""
 7['1153.0', '2588.0', '2775.0', '2843.0', '2594.0', '2257.0', '1122.0',
 8'1419.0', '2520.0', '3064.0', '2920.0', '3283.0', '2354.0', '1389.0',
 9'1484.0', '2984.0', '3668.0', '3345.0', '2814.0', '2571.0']
10"""
11
12rolling
13"""
14['NAN', 'NAN', '2172.0', '2735.3', '2737.3', '2564.7', '1991.0',
15'1599.3', '1687.0', '2334.3', '2834.7', '3089.0', '2852.3', '2342.0',
16'1742.3', '1952.3', '2712.0', '3332.3', '3275.7', '2910.0']
17"""
 1c = 'US'
 2data = top_daily_df.loc[c]
 3date = top_daily_df.columns[-1].strftime('%b %d %Y')
 4r3 = data.rolling(3).mean()
 5r5 = data.rolling(5).mean()
 6r7 = data.rolling(7).mean()
 7r10 = data.rolling(10).mean()
 8
 9bg_color = 'rgba(208, 225, 242, 1.0)'
10line_color = 'rgba(75, 152, 201, 1.0)'
11grid_color = 'rgba(75, 152, 201, 0.3)'
12
13fig = go.Figure()
14for d, n in zip([data, r3, r5, r7, r10], [1, 3, 5, 7, 10]):    
15    fig.add_trace(go.Scatter(
16        x = d.index,
17        y = d,
18        name = f'rolling window = "{n}"',
19        hovertemplate = [f"""<b>{c}</b>
20    <br>Rolling window = {n}<br>
21    <br>Deaths: {x:,.0F}
22    <extra></extra>
23    """ for x in d] ))
24
25fig.update_layout(
26    # Set figure title
27    title = dict(
28        text = f'Daily deaths from COVID-19 in <b>"{c}"</b> <br> {date}',
29        xref = 'container',
30        yref = 'container',
31        x = 0.5,
32        y = 0.9,
33        xanchor = 'center',
34        yanchor = 'middle',
35        font = dict(family = 'Droid Sans', size = 28)
36    ),
37    # set x-axis
38    xaxis = dict(
39        title = 'Date',
40        linecolor = line_color,
41        linewidth = 2,
42        gridcolor = grid_color,
43        showticklabels = True,
44        ticks = 'outside',
45    ),
46    # set y-axis
47    yaxis = dict(
48        title = 'Number of deaths',
49        rangemode = 'tozero',
50        linecolor = line_color,
51        linewidth = 2,
52        gridcolor = grid_color,
53        showticklabels = True,
54        ticks = 'outside',
55    ),
56    # set the plot bacground color
57    plot_bgcolor = bg_color,
58    paper_bgcolor = bg_color,
59)
60
61fig.show()

Daily deaths from COVID-19 in US showing means for different rolling window sizes - December 19 2020



Identify the best rolling average window size

It can be easier to identify the better rolling window size by plotting the different options on separate plots. This confirms that the best rolling window setting is for 7 days. The 7-day rolling average smooths out the daily fluctuations and makes it easier to see trends in the data. This confirmed by creating a number of charts for different countries.

 1c = 'US'
 2data = top_daily_df.loc[c]
 3date = top_daily_df.columns[-1].strftime('%b %d %Y')
 4r3 = data.rolling(3).mean()
 5r5 = data.rolling(5).mean()
 6r7 = data.rolling(7).mean()
 7r10 = data.rolling(10).mean()
 8r20 = data.rolling(10).mean()
 9
10bg_color = 'rgba(208, 225, 242, 1.0)'
11line_color = 'rgba(75, 152, 201, 1.0)'
12grid_color = 'rgba(75, 152, 201, 0.3)'
13
14fig = make_subplots(
15    rows = 3,
16    cols = 2,
17    subplot_titles = (
18        "Daily Data",
19        "Rolling Window 3",
20        "Rolling Window 5",
21        "Rolling Window 7",
22        "Rolling Window 10",
23        "Rolling Window 20"))
24
25fig.add_trace(go.Scatter(x = data.index, y = data), row = 1, col = 1)
26fig.add_trace(go.Scatter(x = r3.index, y = r3), row = 1, col = 2)
27fig.add_trace(go.Scatter(x = r5.index, y = r5), row = 2, col = 1)
28fig.add_trace(go.Scatter(x = r7.index, y = r7), row = 2, col = 2)
29fig.add_trace(go.Scatter(x = r10.index, y = r10), row = 3, col = 1)
30fig.add_trace(go.Scatter(x = r20.index, y = r20), row = 3, col = 2)
31
32fig.update_layout(
33    showlegend=False
34)
35
36fig.show()

Deaths from COVID-19 in US showing different rolling averages in multiple subplots


Deaths from COVID-19 in UK showing different rolling averages in multiple subplots



Display rolling averages for top ten countries

Display the 7-day rolling average of Deaths from COVID-19 for the top ten countries with the highest deaths on December 19 2020.

 1date = top_daily_df.columns[-1].strftime('%b %d %Y')
 2
 3bg_color = 'rgba(208, 225, 242, 1.0)'
 4line_color = 'rgba(75, 152, 201, 1.0)'
 5grid_color = 'rgba(75, 152, 201, 0.3)'
 6
 7fig = go.Figure()
 8for i, row in top_daily_df.iterrows():
 9    r7 = row.rolling(7).mean()
10    fig.add_trace(go.Scatter(
11        x = r7.index,
12        y = r7,
13        name = i,
14        hovertemplate = [f"""<b>{r7.name}</b>
15    <br>Deaths: {x:,.0F}
16    <extra></extra>
17    """ for x in r7]))
18
19fig.update_layout(
20    # Set figure title
21    title = dict(
22        text = f'7-day daily average of deaths from COVID-19 in top ten countries<br> {date}',
23        xref = 'container',
24        yref = 'container',
25        x = 0.5,
26        y = 0.9,
27        xanchor = 'center',
28        yanchor = 'middle',
29        font = dict(family = 'Droid Sans', size = 24)
30    ),
31    # set x-axis
32    xaxis = dict(
33        title = 'Date',
34        linecolor = line_color,
35        linewidth = 2,
36        gridcolor = grid_color,
37        showticklabels = True,
38        ticks = 'outside',
39    ),
40    # set y-axis
41    yaxis = dict(
42        title = '7-day daily average of deaths',
43        rangemode = 'tozero',
44        linecolor = line_color,
45        linewidth = 2,
46        gridcolor = grid_color,
47        showticklabels = True,
48        ticks = 'outside',
49    ),
50    # set the plot bacground color
51    plot_bgcolor = bg_color,
52    paper_bgcolor = bg_color,
53)
54
55fig.show()

7-day average of Deaths per day from COVID-19 in countries with highest deaths


A similar plot can be created for confirmed cases for the same ten countries with highest deaths.

7-day average of Confirmed Cases per day of COVID-19 in countries with highest deaths



Display rolling averages and the daily counts

Display the 7-day rolling average of Deaths from COVID-19 as well as the actual daily counts in the same chart. The daily counts are shown as bars and the rolling average is a line. The results for the countries with the highest deaths is displayed. The plots are separated per country. There is a lot of data collected to date and sections of the charts can be chosen to zoom in on a particular time period. Another approach is to just plot the data for the most recent 30 days to give a picture of what the current trend is.

 1date = top_daily_df.columns[-1].strftime('%b %d %Y')
 2
 3bg_color = 'rgba(208, 225, 242, 1.0)'
 4line_color = 'rgba(75, 152, 201, 1.0)'
 5grid_color = 'rgba(75, 152, 201, 0.3)'
 6
 7fig = make_subplots(
 8    rows = 5,
 9    cols = 2,
10    shared_xaxes = True,
11    subplot_titles = (top_daily_df.index)
12    )
13
14j = 0
15for i, row in top_daily_df.iterrows():
16    r7 = row.rolling(7).mean()
17    fig.add_trace(go.Bar(x = row.index, y = row, name = i), row = j%5+1, col = j%2+1)
18    fig.add_trace(go.Scatter(x = r7.index, y = r7, name = i), row = j%5+1, col = j%2+1)
19
20    j += 1
21
22
23fig.update_xaxes(dict(
24        linecolor = line_color,
25        gridcolor = grid_color,
26    ))
27
28fig.update_yaxes(dict(
29        rangemode = 'tozero',
30        linecolor = line_color,
31        gridcolor = grid_color,
32    ))
33
34fig.update_layout(
35    # Set figure title
36    title = dict(
37        text = f'Daily deaths from COVID-19 in <b>"{c}"</b> <br> {date}',
38        xref = 'container',
39        yref = 'container',
40        x = 0.5,
41        y = 0.95,
42        xanchor = 'center',
43        yanchor = 'middle',
44        font = dict(family = 'Droid Sans', size = 22)
45    ),
46    showlegend=False,
47    # set the plot bacground color
48    plot_bgcolor = bg_color,
49    paper_bgcolor = bg_color,
50)
51
52fig.show()

Daily counts and 7-day average of Deaths from COVID-19 in countries with highest deaths

Daily counts and 7-day average of Deaths from COVID-19 for latest 30 days



Similar charts can be created for the daily counts and rolling average of confirmed cases.

Daily counts and 7-day average of Confirmed Cases of COVID-19 in countries with highest deaths

Daily counts and 7-day average of Confirmed Cases of COVID-19 for latest 30 days



Conclusion

Once again calculations can be applied to a whole dataset easily with Pandas. The diff function is used to easily convert the running totals to daily differences and the rolling function is used to group a subset of data together and calculate a running average. The running average helps smooth the day to day fluctuations in the data and makes it easier to spot a trend.




Plotly is an open-source graphing library for Python that produces interactive charts.