library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("tidybayes")
Nowcasting concepts
Introduction
So far we’ve explored the delays and biases of real-time outbreak data, started to correct for these, and considered the underlying process that drives the evolution of an outbreak (looking at the reproduction number and renewal equation). Next, we’ll focus on predicting new information about how an outbreak is evolving in the present and future.
We know that we have incomplete information in the present because of delays in the observation process (reporting delays). The aim of nowcasting is to predict what an epidemiological time series will look like after all delayed reports are in, for which we need to account for the delays and biases we’ve already considered.
Slides
Objectives
This session aims to introduce the concept of nowcasting, and see how we can perform a nowcast if we know the underlying delay distribution.
Source file
The source file of this session is located at sessions/nowcasting.qmd
.
Libraries used
In this session we will use the nfidd
package to load a data set of infection times and access stan models and helper functions, the dplyr
and tidyr
packages for data wrangling, ggplot2
library for plotting, and the tidybayes
package for extracting results of the inference.
The best way to interact with the material is via the Visual Editor of RStudio.
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr
show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.
set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)
Simulating delayed reporting
Epidemiological data is not usually available immediately for analysis. Instead, data usually gets collated at different levels of a healthcare or health surveillance system, cleaned, checked before being aggregated and/or anonymised and ultimately shared with an analyst. We call the reporting time the time a data point (e.g. a time or day of symptom onset or a time or day of hospitalisation) has entered the data set used for some analysis. Similar to the data discussed in the preceding session, this time is often only available as a date, i.e. censored at the scale of a day.
We can simulate this reporting process. Let us assume that the symptom onsets are reported with a delay and that this delay is characterised by a lognormal distribution with meanlog 1 and sdlog 0.5: To do so, we perform a very similar simulation to what we did in the session on delay distributions, except now we don’t simulate hospitalisations but reports of symptom onsets:
data(infection_times)
<- infection_times |>
df mutate(
onset_time = infection_time + rgamma(n(), shape = 5, rate = 1),
report_time = onset_time + rlnorm(n(), meanlog = 1, sdlog = 0.5)
)
We then assume that we’re 70 days into the outbreak, i.e. we only consider observations with a reporting time less than 71 - other symptom onset may have already happened, but we have not observed them yet.
<- 71
cutoff <- df |>
df_co filter(report_time < cutoff)
We can now convert this to a time series of symptom onsets and reports:
## create time series of infections, onsets, and reports
<- df_co |>
df_co transmute(
infection_day = floor(infection_time),
onset_day = floor(onset_time),
report_day = floor(report_time)
)
<- df_co |>
infection_ts count(day = infection_day, name = "infections")
<- df_co |>
onset_ts count(day = onset_day, name = "onsets")
<- df_co |>
reports_ts count(day = report_day, name = "reports")
<- expand_grid(day = seq(0, cutoff - 1)) |>
all_days full_join(infection_ts, by = "day") |>
full_join(onset_ts, by = "day") |>
full_join(reports_ts, by = "day") |>
replace_na(list(onsets = 0, reports = 0))
Plotting these, we get
<- all_days |>
combined pivot_longer(c(onsets, reports, infections), names_to = "variable")
ggplot(combined, aes(x = day, y = value)) +
facet_grid(variable ~ .) +
geom_col()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).
Looking at the three plots in isolation we would conclude very different things about the epidemic: symptom onsets seem to have flattened off and perhaps are going down, whereas reports are increasing rapidly.
This apparent contradiction appears because onsets are reported with a delay. By cutting off at a certain reporting date, we do not observe many of the recent symptom onsets that are yet to be be reported. We can see that if we plot the final data set alongside the cut-off one:
# Use full outbreak dataset
<- df |>
final transmute(onset_day = floor(onset_time))
<- final |>
final_onset_ts count(day = onset_day, name = "onsets")
<- expand_grid(day = seq(0, max(final_onset_ts$day))) |>
final_all_days full_join(final_onset_ts, by = "day") |>
replace_na(list(onsets = 0)) |>
mutate(cutoff = "final")
<- combined |>
intermediate filter(variable == "onsets") |>
select(-variable) |>
rename(onsets = value) |>
mutate(cutoff = "70 days")
<- rbind(
combined_cutoffs
intermediate,
final_all_days
)ggplot(combined_cutoffs, aes(x = day, y = onsets, colour = cutoff)) +
geom_line() +
scale_colour_brewer(palette = "Dark2") +
geom_vline(xintercept = cutoff, linetype = "dashed")
As we can see, even though it may much seem like the epidemic curve is going down, in fact in the final data set one can see that at the time symptom onsets were still increasing. The apparent decline towards the present (indicated by a dashed vertical line) was caused by the delay in reporting.
Looking at the plots one might suggest plotting and analysing the data by date of reporting which correctly showed the data to be still increasing and which should, by definition, not be subject to future changes.
This can sometimes be a sensible way to visualise the data. However, reporting might itself be subject to biases such as breaks during the weekend, holidays etc or reporting delays may vary over time. At the same time, when it comes to capacity or intervention planning we may need to know how many people e.g. become sick on any given day and will thus present to the healthcare system rather than how many will be reported.
Estimating the “true curve” (i.e. what we expect to see once the data are complete at a future date) of the time series of epidemiologically relevant events from a potentially truncated epidemiological curve and information about the delays is what is usually called “nowcasting”.
Nowcasting with a known delay
The simplest possible nowcasting model
Here we assume that the delay distribution is known and that we can use it to nowcast the most recent data. In practice, the delay distribution is often not known and needs to be estimated from the data. We could do this using methods from the session on biases in delay distributions.
In the session on convolutions we used delay distributions convolved with the infection times to estimate the time series of symptom onsets. A simple way to nowcast is to use the same approach but using the cumulative distribution function of the delay distribution rather than the probability density function and only apply it to the most recent data as this is the only data that can be subject to change (due to delays in reporting).
We will build intuition for this as usual using simulation. First we define the proportion reported using a delay distribution up to 15 days, again using a lognormal distribution with meanlog 1 and sdlog 0.5:
<- plnorm(1:15, 1, 0.5)
proportion_reported plot(proportion_reported)
plnorm
function
The plnorm()
function is related to the rlnorm()
function we used earlier to simulate the individual level reporting delay, but instead it gives the cumulative distribution function rather than random samples. That is, it gives us the probability that a report is made on day 1 or earlier, day 2 or earlier, etc.
We can now use this delay distribution to nowcast the most recent data. Here we use the same simulation approach as in the renewal session (here we have created helper functions make_gen_time_pmf()
and make_ip_pmf()
to make it easier to re-use; we recommend to have a look at what these functions do), and apply the reporting_delay
to the last 15 days of data.
<- make_gen_time_pmf()
gen_time_pmf <- make_ip_pmf()
ip_pmf <- simulate_onsets(
onset_df make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)<- onset_df |>
reported_onset_df filter(day < cutoff) |>
mutate(proportion_reported = c(rep(1, n() - 15), rev(proportion_reported)),
reported_onsets = rpois(n(), onsets * proportion_reported)
)tail(reported_onset_df)
# A tibble: 6 × 5
day onsets infections proportion_reported reported_onsets
<dbl> <int> <int> <dbl> <int>
1 65 63 83 0.943 51
2 66 64 75 0.889 58
3 67 75 92 0.780 57
4 68 78 113 0.578 34
5 69 69 120 0.270 20
6 70 84 115 0.0228 4
Spend a few minutes trying to understand the code above. What is the proportion_reported
? What is the reported_onsets
?
- The
proportion_reported
is the cumulative distribution function of the delay distribution. It gives the probability that a report is made on day 1 or earlier, day 2 or earlier, etc. Note that for days more that 15 days into the past - The
reported_onsets
are the number of onsets that are reported on each day. This is calculated by multiplying the number of onsets by the proportion of onsets that are reported on each day. It has Poisson noise added to it to simulate the stochasticity in the reporting process.
|>
reported_onset_df ggplot(aes(x = day, y = reported_onsets)) +
geom_col()
We can now fit our first nowcasting model. Here we assume exactly the same generative process as we used for simulation and model the number of onsets as independent draws from a normal distribution.
<- nfidd_cmdstan_model("simple-nowcast")
mod mod
1: functions {
2: #include "functions/condition_onsets_by_report.stan"
3: }
4:
5: data {
6: int n; // number of days
7: array[n] int obs; // observed symptom onsets
8: int report_max; // max reporting delay
9: array[report_max + 1] real report_cdf;
10: }
11:
12: parameters {
13: array[n] real<lower = 0> onsets;
14: }
15:
16: transformed parameters {
17: array[n] real reported_onsets = condition_onsets_by_report(onsets, report_cdf);
18: }
19:
20: model {
21: onsets ~ normal(5, 20) T[0,];
22: // Likelihood
23: obs ~ poisson(reported_onsets);
24: }
Familiarise yourself with the model above. What does it do?
- On line 2 we define a new function
condition_onsets_by_report.stan
which takes the number of onsets and reports and the delay distribution as input and returns the nowcasted number of onsets. - On line 17, this function is used to calculate the nowcasted number of onsets and this is then used in the likelihood.
- On line 21, we define the generative process for the number of onsets. Here we assume that onsets are independent with each drawn from a normal distribution.
Once again we can generate estimates from this model:
<- list(
data n = nrow(reported_onset_df) - 1,
obs = reported_onset_df$reported_onsets[-1],
report_max = length(proportion_reported) - 1,
report_cdf = proportion_reported
)<- mod$sample(data = data, parallel_chains = 4) simple_nowcast_fit
simple_nowcast_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ 2334.43 2334.86 5.97 5.72 2324.05 2343.72 1.00 1343 2019
onsets[1] 1.00 0.71 1.01 0.73 0.05 3.01 1.00 4967 1899
onsets[2] 1.01 0.70 1.02 0.73 0.05 3.02 1.00 5071 2155
onsets[3] 1.02 0.70 1.04 0.71 0.05 3.10 1.00 5820 2549
onsets[4] 1.01 0.70 1.01 0.72 0.06 3.08 1.00 5649 2433
onsets[5] 1.02 0.70 1.04 0.74 0.05 3.18 1.00 3613 1858
onsets[6] 1.02 0.70 1.04 0.73 0.05 3.11 1.00 5147 2011
onsets[7] 1.02 0.71 0.99 0.73 0.05 3.04 1.00 5145 2280
onsets[8] 0.99 0.68 0.97 0.71 0.05 2.96 1.00 4879 2356
onsets[9] 2.01 1.67 1.44 1.19 0.36 4.71 1.00 7954 2901
# showing 10 of 139 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
We can now plot onsets alongside those nowcasted by the model:
<- simple_nowcast_fit |>
nowcast_onsets gather_draws(onsets[day]) |>
ungroup() |>
filter(.draw %in% sample(.draw, 100)) |>
mutate(day = day + 1)
ggplot(nowcast_onsets, aes(x = day)) +
geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))
The points in this plot represent the data available when the nowcast was made (and so are truncated) whilst the bars represent the finally reported data (a perfect nowcast would exactly reproduce these).
As we found in the using delay distributions to model the data generating process of an epidemic session, this simple model struggles to recreate the true number of onsets. This is because it does not capture the generative process of the data (i.e. the transmission process and delays from infection to onset). In the next section we will see how we can use a model that does capture this generative process to improve our nowcasts.
Adding in a geometric random walk to the nowcasting model
As we saw in the session on the renewal equation, a geometric random walk is a simple way to model multiplicative growth. Adding this into our simple nowcasting model may help us to better capture the generative process of the data and so produce a better nowcast.
We first load the model
<- nfidd_cmdstan_model("simple-nowcast-rw")
rw_mod rw_mod
1: functions {
2: #include "functions/geometric_random_walk.stan"
3: #include "functions/condition_onsets_by_report.stan"
4: }
5:
6: data {
7: int n; // number of days
8: array[n] int obs; // observed symptom onsets
9: int report_max; // max reporting delay
10: array[report_max + 1] real report_cdf;
11: }
12:
13: parameters {
14: real<lower=0> init_onsets;
15: array[n-1] real rw_noise;
16: real<lower=0> rw_sd;
17: }
18:
19: transformed parameters {
20: array[n] real onsets = geometric_random_walk(init_onsets, rw_noise, rw_sd);
21: array[n] real reported_onsets = condition_onsets_by_report(onsets, report_cdf);
22: }
23:
24: model {
25: init_onsets ~ lognormal(0, 1) T[0,];
26: rw_noise ~ std_normal();
27: rw_sd ~ normal(0, 5) T[0,];
28: //Likelihood
29: obs ~ poisson(reported_onsets);
30: }
and then fit it
<- rw_mod$sample(data = data, parallel_chains = 4) rw_nowcast_fit
rw_nowcast_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ 2177.29 2177.56 8.68 8.48 2162.83 2191.10 1.00 757 1605
init_onsets 0.29 0.25 0.17 0.16 0.08 0.63 1.00 4728 2602
rw_noise[1] -1.17 -1.17 0.88 0.89 -2.65 0.26 1.00 4884 3190
rw_noise[2] -0.82 -0.84 0.87 0.88 -2.23 0.62 1.00 4663 2931
rw_noise[3] -0.55 -0.57 0.92 0.91 -2.07 0.99 1.00 6003 3190
rw_noise[4] -0.37 -0.38 0.94 0.94 -1.88 1.18 1.00 6046 3134
rw_noise[5] -0.15 -0.17 0.89 0.87 -1.62 1.35 1.00 5186 3412
rw_noise[6] -0.01 -0.03 0.91 0.91 -1.51 1.48 1.00 5251 3117
rw_noise[7] 0.20 0.20 0.91 0.93 -1.29 1.70 1.00 5194 2443
rw_noise[8] 0.36 0.37 0.90 0.89 -1.13 1.79 1.00 5443 3042
# showing 10 of 209 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
Again we can extract the nowcasted onsets and plot them alongside the observed data:
<- rw_nowcast_fit |>
rw_nowcast_onsets gather_draws(onsets[day]) |>
ungroup() |>
filter(.draw %in% sample(.draw, 100)) |> ## sample 100 iterations randomly
mutate(day = day + 1)
ggplot(rw_nowcast_onsets, aes(x = day)) +
geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))
What do you think of the nowcast now? Does it look better than the previous one?
- The nowcast better matches the ultimately observed data. The geometric random walk allows the model to capture the multiplicative growth in the data and so better capture that current indidence is related to past incidence.
- This should be particularly true when the data is more truncated (i.e nearer to the date of the nowcast) as the geometric random walk allows the model to extrapolate incidence based on previous incidence rather than relying on the prior distribution as the simpler model did.
- However, the model is still quite simple and so may struggle to capture more complex patterns in the data. In particular, the prior model for the geometric random walk assumes that onsets are the same as the previous day with statistical noise. This may not be a good assumption in a rapidly changing epidemic (where the reproduction number is not near 1).
What happens if we get the delay distribution wrong?
In practice, we often do not know the delay distribution and so need to estimate it using the data at hand. In the session on biases in delay distributions we saw how we could do this using individual-level records. We will now look at what happens if we get the delay distribution wrong.
We use the same data as before but now assume that the delay distribution is a gamma distribution with shape 2 and rate 3. This is a very different distribution to the lognormal distribution we used to simulate the data.
<- pgamma(1:15, 2, 3)
wrong_proportion_reported plot(wrong_proportion_reported)
We first need to update the data to use this new delay distribution:
<- data
wrong_delay_data $report_cdf <- wrong_proportion_reported wrong_delay_data
We now fit the nowcasting model with the wrong delay distribution:
<- rw_mod$sample(data = wrong_delay_data, parallel_chains = 4) gamma_nowcast_fit
gamma_nowcast_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ 2180.37 2180.69 8.16 8.18 2166.51 2193.09 1.00 893 1566
init_onsets 0.31 0.27 0.19 0.16 0.08 0.68 1.00 3943 2758
rw_noise[1] -1.20 -1.17 0.86 0.83 -2.66 0.19 1.00 4572 2930
rw_noise[2] -0.82 -0.81 0.90 0.88 -2.30 0.68 1.00 4263 2770
rw_noise[3] -0.56 -0.55 0.92 0.92 -2.08 0.90 1.00 3929 3121
rw_noise[4] -0.32 -0.33 0.92 0.93 -1.86 1.20 1.00 3796 2970
rw_noise[5] -0.15 -0.15 0.89 0.90 -1.60 1.28 1.00 4180 3263
rw_noise[6] 0.02 0.04 0.88 0.87 -1.47 1.45 1.00 4398 2963
rw_noise[7] 0.22 0.24 0.94 0.95 -1.34 1.76 1.00 4030 2590
rw_noise[8] 0.41 0.42 0.94 0.95 -1.16 1.92 1.00 4857 2976
# showing 10 of 209 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
Again we can extract the nowcast of symptom onsets and plot it alongside the observed data:
<- gamma_nowcast_fit |>
gamma_nowcast_onsets gather_draws(onsets[day]) |>
ungroup() |>
filter(.draw %in% sample(.draw, 100)) |>
mutate(day = day + 1)
ggplot(gamma_nowcast_onsets, aes(x = day)) +
geom_col(data = reported_onset_df, mapping = aes(y = onsets), alpha = 0.6) +
geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
geom_point(data = reported_onset_df, mapping = aes(y = reported_onsets))
What do you think of the nowcast now? How would you know you had the wrong delay if you didn’t have the true delay distribution?
- The nowcast now looks very different to the observed data. This is because the model is using the wrong delay distribution to nowcast the data.
- If you didn’t have the true delay distribution you would not know that you had the wrong delay distribution. This is why it is important to estimate the delay distribution from the data.
- In practice, you would likely compare the nowcast to the observed data and if they did not match you would consider whether the delay distribution was the cause.