Nowcasting concepts

Introduction

So far we’ve explored the delays and biases of real-time outbreak data, started to correct for these, and considered the underlying process that drives the evolution of an outbreak (looking at the reproduction number and renewal equation). Next, we’ll focus on predicting new information about how an outbreak is evolving in the present and future.

We know that we have incomplete information in the present because of delays in the observation process (reporting delays). The aim of nowcasting is to predict what an epidemiological time series will look like after all delayed reports are in, for which we need to account for the delays and biases we’ve already considered.

Slides

Objectives

This session aims to introduce the concept of nowcasting, and see how we can perform a nowcast if we know the underlying delay distribution.

Source file

The source file of this session is located at sessions/nowcasting.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the dplyr and tidyr packages for data wrangling, ggplot2 library for plotting, and the tidybayes package for extracting results of the inference.

library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("tidybayes")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.

set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)

Simulating delayed reporting

Epidemiological data is not usually available immediately for analysis. Instead, data usually gets collated at different levels of a healthcare or health surveillance system, cleaned, checked before being aggregated and/or anonymised and ultimately shared with an analyst. We call the reporting time the time a data point (e.g. a time or day of symptom onset or a time or day of hospitalisation) has entered the data set used for some analysis. Similar to the data discussed in the preceding session, this time is often only available as a date, i.e. censored at the scale of a day.

We can simulate this reporting process. Let us assume that the symptom onsets are reported with a delay and that this delay is characterised by a lognormal distribution with meanlog 1 and sdlog 0.5: To do so, we perform a very similar simulation to what we did in the session on delay distributions, except now we don’t simulate hospitalisations but reports of symptom onsets:

data(infection_times)
df <- infection_times |>
  mutate(
    onset_time = infection_time + rgamma(n(), shape = 5, rate = 1),
    report_time = onset_time + rlnorm(n(), meanlog = 1, sdlog = 0.5)
  )

We then assume that we’re 70 days into the outbreak, i.e. we only consider observations with a reporting time less than 71 - other symptom onset may have already happened, but we have not observed them yet.

cutoff <- 71
df_co <- df |>
  filter(report_time < cutoff)

We can now convert this to a time series of symptom onsets and reports:

## create time series of infections, onsets, and reports
df_co <- df_co |>
  transmute(
    infection_day = floor(infection_time),
    onset_day = floor(onset_time),
    report_day = floor(report_time)
  )

infection_ts <- df_co |>
  count(day = infection_day, name = "infections")
onset_ts <- df_co |>
  count(day = onset_day, name = "onsets")
reports_ts <- df_co |>
  count(day = report_day, name = "reports")

all_days <- expand_grid(day = seq(0, cutoff - 1)) |>
  full_join(infection_ts, by = "day") |>
  full_join(onset_ts, by = "day") |>
  full_join(reports_ts, by = "day") |>
  replace_na(list(onsets = 0, reports = 0))

Plotting these, we get

combined <- all_days |>
  pivot_longer(c(onsets, reports, infections), names_to = "variable")
  ggplot(combined, aes(x = day, y = value)) +
  facet_grid(variable ~ .) +
  geom_col()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).

Looking at the three plots in isolation we would conclude very different things about the epidemic: symptom onsets seem to have flattened off and perhaps are going down, whereas reports are increasing rapidly.

This apparent contradiction appears because onsets are reported with a delay. By cutting off at a certain reporting date, we do not observe many of the recent symptom onsets that are yet to be be reported. We can see that if we plot the final data set alongside the cut-off one:

# Use full outbreak dataset
final <- df |>
  transmute(onset_day = floor(onset_time))
final_onset_ts <- final |>
  count(day = onset_day, name = "onsets")
final_all_days <- expand_grid(day = seq(0, max(final_onset_ts$day))) |>
  full_join(final_onset_ts, by = "day") |>
  replace_na(list(onsets = 0)) |>
  mutate(cutoff = "final")
intermediate <- combined |>
  filter(variable == "onsets") |>
  select(-variable) |>
  rename(onsets = value) |>
  mutate(cutoff = "70 days")
combined_cutoffs <- rbind(
  intermediate,
  final_all_days
)
ggplot(combined_cutoffs, aes(x = day, y = onsets, colour = cutoff)) +
  geom_line() +
  scale_colour_brewer(palette = "Dark2") +
  geom_vline(xintercept = cutoff, linetype = "dashed")

As we can see, even though it may much seem like the epidemic curve is going down, in fact in the final data set one can see that at the time symptom onsets were still increasing. The apparent decline towards the present (indicated by a dashed vertical line) was caused by the delay in reporting.

Plotting and analysing data by report date

Looking at the plots one might suggest plotting and analysing the data by date of reporting which correctly showed the data to be still increasing and which should, by definition, not be subject to future changes.

This can sometimes be a sensible way to visualise the data. However, reporting might itself be subject to biases such as breaks during the weekend, holidays etc or reporting delays may vary over time. At the same time, when it comes to capacity or intervention planning we may need to know how many people e.g. become sick on any given day and will thus present to the healthcare system rather than how many will be reported.

Tip

Estimating the “true curve” (i.e. what we expect to see once the data are complete at a future date) of the time series of epidemiologically relevant events from a potentially truncated epidemiological curve and information about the delays is what is usually called “nowcasting”.

Nowcasting with a known delay

The simplest possible nowcasting model

Here we assume that the delay distribution is known and that we can use it to nowcast the most recent data. In practice, the delay distribution is often not known and needs to be estimated from the data. We could do this using methods from the session on biases in delay distributions.

In the session on convolutions we used delay distributions convolved with the infection times to estimate the time series of symptom onsets. A simple way to nowcast is to use the same approach but using the cumulative distribution function of the delay distribution rather than the probability density function and only apply it to the most recent data as this is the only data that can be subject to change (due to delays in reporting).

We will build intuition for this as usual using simulation. First we define the proportion reported using a delay distribution up to 15 days, again using a lognormal distribution with meanlog 1 and sdlog 0.5:

proportion_reported <- plnorm(1:15, 1, 0.5)
plot(proportion_reported)

The plnorm function

The plnorm() function is related to the rlnorm() function we used earlier to simulate the individual level reporting delay, but instead it gives the cumulative distribution function rather than random samples. That is, it gives us the probability that a report is made on day 1 or earlier, day 2 or earlier, etc.

We can now use this delay distribution to nowcast the most recent data. Here we use the same simulation approach as in the renewal session (here we have created helper functions make_gen_time_pmf() and make_ip_pmf() to make it easier to re-use; we recommend to have a look at what these functions do), and apply the reporting_delay to the last 15 days of data.

gen_time_pmf <- make_gen_time_pmf()
ip_pmf <- make_ip_pmf()
onset_df <- simulate_onsets(
  make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)
reported_onset_df <- onset_df |>
  filter(day < cutoff) |>
  mutate(proportion_reported = c(rep(1, n() - 15), rev(proportion_reported)),
         reported_onsets = rpois(n(), onsets * proportion_reported)
  )
tail(reported_onset_df)
# A tibble: 6 × 5
    day onsets infections proportion_reported reported_onsets
  <dbl>  <int>      <int>               <dbl>           <int>
1    65     63         83              0.943               51
2    66     64         75              0.889               58
3    67     75         92              0.780               57
4    68     78        113              0.578               34
5    69     69        120              0.270               20
6    70     84        115              0.0228               4
Take 5 minutes

Spend a few minutes trying to understand the code above. What is the proportion_reported? What is the reported_onsets?

  • The proportion_reported is the cumulative distribution function of the delay distribution. It gives the probability that a report is made on day 1 or earlier, day 2 or earlier, etc. Note that for days more that 15 days into the past
  • The reported_onsets are the number of onsets that are reported on each day. This is calculated by multiplying the number of onsets by the proportion of onsets that are reported on each day. It has Poisson noise added to it to simulate the stochasticity in the reporting process.
reported_onset_df |>
  ggplot(aes(x = day, y = reported_onsets)) +
  geom_col()

We can now fit our first nowcasting model. Here we assume exactly the same generative process as we used for simulation and model the number of onsets as independent draws from a normal distribution.

mod <- nfidd_cmdstan_model("simple-nowcast")
mod
 1: functions {
 2:   #include "functions/condition_onsets_by_report.stan"
 3: }
 4: 
 5: data {
 6:   int n;                // number of days
 7:   array[n] int obs;     // observed symptom onsets
 8:   int report_max;       // max reporting delay
 9:   array[report_max + 1] real report_cdf;
10: }
11: 
12: parameters {
13:   array[n] real<lower = 0> onsets;
14: }
15: 
16: transformed parameters {
17:   array[n] real reported_onsets = condition_onsets_by_report(onsets, report_cdf);
18: }
19: 
20: model {
21:   onsets ~ normal(5, 20) T[0,];
22:   // Likelihood
23:   obs ~ poisson(reported_onsets);
24: }
Take 5 minutes

Familiarise yourself with the model above. What does it do?

  • On line 2 we define a new function condition_onsets_by_report.stan which takes the number of onsets and reports and the delay distribution as input and returns the nowcasted number of onsets.
  • On line 17, this function is used to calculate the nowcasted number of onsets and this is then used in the likelihood.
  • On line 21, we define the generative process for the number of onsets. Here we assume that onsets are independent with each drawn from a normal distribution.

Once again we can generate estimates from this model:

data <- list(
  n = nrow(reported_onset_df) - 1,
  obs = reported_onset_df$reported_onsets[-1],
  report_max = length(proportion_reported) - 1,
  report_cdf = proportion_reported 
)
simple_nowcast_fit <- mod$sample(data = data, parallel_chains = 4)
simple_nowcast_fit
  variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__      2334.43 2334.86 5.97 5.72 2324.05 2343.72 1.00     1343     2019
 onsets[1]    1.00    0.71 1.01 0.73    0.05    3.01 1.00     4967     1899
 onsets[2]    1.01    0.70 1.02 0.73    0.05    3.02 1.00     5071     2155
 onsets[3]    1.02    0.70 1.04 0.71    0.05    3.10 1.00     5820     2549
 onsets[4]    1.01    0.70 1.01 0.72    0.06    3.08 1.00     5649     2433
 onsets[5]    1.02    0.70 1.04 0.74    0.05    3.18 1.00     3613     1858
 onsets[6]    1.02    0.70 1.04 0.73    0.05    3.11 1.00     5147     2011
 onsets[7]    1.02    0.71 0.99 0.73    0.05    3.04 1.00     5145     2280
 onsets[8]    0.99    0.68 0.97 0.71    0.05    2.96 1.00     4879     2356
 onsets[9]    2.01    1.67 1.44 1.19    0.36    4.71 1.00     7954     2901

 # showing 10 of 139 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

We can now plot onsets alongside those nowcasted by the model:

nowcast_onsets <- simple_nowcast_fit |>
  gather_draws(onsets[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100)) |>
  mutate(day = day + 1)
ggplot(nowcast_onsets, aes(x = day)) +
  geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
  geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
  geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))

Tip

The points in this plot represent the data available when the nowcast was made (and so are truncated) whilst the bars represent the finally reported data (a perfect nowcast would exactly reproduce these).

Tip

As we found in the using delay distributions to model the data generating process of an epidemic session, this simple model struggles to recreate the true number of onsets. This is because it does not capture the generative process of the data (i.e. the transmission process and delays from infection to onset). In the next section we will see how we can use a model that does capture this generative process to improve our nowcasts.

Adding in a geometric random walk to the nowcasting model

As we saw in the session on the renewal equation, a geometric random walk is a simple way to model multiplicative growth. Adding this into our simple nowcasting model may help us to better capture the generative process of the data and so produce a better nowcast.

We first load the model

rw_mod <- nfidd_cmdstan_model("simple-nowcast-rw")
rw_mod
 1: functions {
 2:   #include "functions/geometric_random_walk.stan"
 3:   #include "functions/condition_onsets_by_report.stan"
 4: }
 5: 
 6: data {
 7:   int n;                // number of days
 8:   array[n] int obs;     // observed symptom onsets
 9:   int report_max;       // max reporting delay
10:   array[report_max + 1] real report_cdf;
11: }
12: 
13: parameters {
14:   real<lower=0> init_onsets;
15:   array[n-1] real rw_noise;
16:   real<lower=0> rw_sd;
17: }
18: 
19: transformed parameters {
20:   array[n] real onsets = geometric_random_walk(init_onsets, rw_noise, rw_sd);
21:   array[n] real reported_onsets = condition_onsets_by_report(onsets, report_cdf);
22: }
23: 
24: model {
25:   init_onsets ~ lognormal(0, 1) T[0,];
26:   rw_noise ~ std_normal();
27:   rw_sd ~ normal(0, 5) T[0,];
28:   //Likelihood
29:   obs ~ poisson(reported_onsets);
30: }

and then fit it

rw_nowcast_fit <- rw_mod$sample(data = data, parallel_chains = 4)
rw_nowcast_fit
    variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__        2177.29 2177.56 8.68 8.48 2162.83 2191.10 1.00      757     1605
 init_onsets    0.29    0.25 0.17 0.16    0.08    0.63 1.00     4728     2602
 rw_noise[1]   -1.17   -1.17 0.88 0.89   -2.65    0.26 1.00     4884     3190
 rw_noise[2]   -0.82   -0.84 0.87 0.88   -2.23    0.62 1.00     4663     2931
 rw_noise[3]   -0.55   -0.57 0.92 0.91   -2.07    0.99 1.00     6003     3190
 rw_noise[4]   -0.37   -0.38 0.94 0.94   -1.88    1.18 1.00     6046     3134
 rw_noise[5]   -0.15   -0.17 0.89 0.87   -1.62    1.35 1.00     5186     3412
 rw_noise[6]   -0.01   -0.03 0.91 0.91   -1.51    1.48 1.00     5251     3117
 rw_noise[7]    0.20    0.20 0.91 0.93   -1.29    1.70 1.00     5194     2443
 rw_noise[8]    0.36    0.37 0.90 0.89   -1.13    1.79 1.00     5443     3042

 # showing 10 of 209 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

Again we can extract the nowcasted onsets and plot them alongside the observed data:

rw_nowcast_onsets <- rw_nowcast_fit |>
  gather_draws(onsets[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100)) |> ## sample 100 iterations randomly
  mutate(day = day + 1)
ggplot(rw_nowcast_onsets, aes(x = day)) +
  geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
  geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))

Take 2 minutes

What do you think of the nowcast now? Does it look better than the previous one?

  • The nowcast better matches the ultimately observed data. The geometric random walk allows the model to capture the multiplicative growth in the data and so better capture that current indidence is related to past incidence.
  • This should be particularly true when the data is more truncated (i.e nearer to the date of the nowcast) as the geometric random walk allows the model to extrapolate incidence based on previous incidence rather than relying on the prior distribution as the simpler model did.
  • However, the model is still quite simple and so may struggle to capture more complex patterns in the data. In particular, the prior model for the geometric random walk assumes that onsets are the same as the previous day with statistical noise. This may not be a good assumption in a rapidly changing epidemic (where the reproduction number is not near 1).

What happens if we get the delay distribution wrong?

In practice, we often do not know the delay distribution and so need to estimate it using the data at hand. In the session on biases in delay distributions we saw how we could do this using individual-level records. We will now look at what happens if we get the delay distribution wrong.

We use the same data as before but now assume that the delay distribution is a gamma distribution with shape 2 and rate 3. This is a very different distribution to the lognormal distribution we used to simulate the data.

wrong_proportion_reported <- pgamma(1:15, 2, 3)
plot(wrong_proportion_reported)

We first need to update the data to use this new delay distribution:

wrong_delay_data <- data
wrong_delay_data$report_cdf <- wrong_proportion_reported

We now fit the nowcasting model with the wrong delay distribution:

gamma_nowcast_fit <- rw_mod$sample(data = wrong_delay_data, parallel_chains = 4)
gamma_nowcast_fit
    variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__        2180.37 2180.69 8.16 8.18 2166.51 2193.09 1.00      893     1566
 init_onsets    0.31    0.27 0.19 0.16    0.08    0.68 1.00     3943     2758
 rw_noise[1]   -1.20   -1.17 0.86 0.83   -2.66    0.19 1.00     4572     2930
 rw_noise[2]   -0.82   -0.81 0.90 0.88   -2.30    0.68 1.00     4263     2770
 rw_noise[3]   -0.56   -0.55 0.92 0.92   -2.08    0.90 1.00     3929     3121
 rw_noise[4]   -0.32   -0.33 0.92 0.93   -1.86    1.20 1.00     3796     2970
 rw_noise[5]   -0.15   -0.15 0.89 0.90   -1.60    1.28 1.00     4180     3263
 rw_noise[6]    0.02    0.04 0.88 0.87   -1.47    1.45 1.00     4398     2963
 rw_noise[7]    0.22    0.24 0.94 0.95   -1.34    1.76 1.00     4030     2590
 rw_noise[8]    0.41    0.42 0.94 0.95   -1.16    1.92 1.00     4857     2976

 # showing 10 of 209 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

Again we can extract the nowcast of symptom onsets and plot it alongside the observed data:

gamma_nowcast_onsets <- gamma_nowcast_fit |>
  gather_draws(onsets[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100)) |>
  mutate(day = day + 1)
ggplot(gamma_nowcast_onsets, aes(x = day)) +
  geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
  geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))

Take 2 minutes

What do you think of the nowcast now? How would you know you had the wrong delay if you didn’t have the true delay distribution?

  • The nowcast now looks very different to the observed data. This is because the model is using the wrong delay distribution to nowcast the data.
  • If you didn’t have the true delay distribution you would not know that you had the wrong delay distribution. This is why it is important to estimate the delay distribution from the data.
  • In practice, you would likely compare the nowcast to the observed data and if they did not match you would consider whether the delay distribution was the cause.

Wrap up