library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("tidybayes")
Nowcasting with an unknown reporting delay
Introduction
In the last session we introduced the idea of nowcasting using a simple model. However, this approach had problems: we didn't fully account for uncertainty or, for example, for observation error in the primary events, and it wasn't a fully generative model of the data reporting process. And as we saw, if we get the delay distribution wrong, we can get the nowcast very wrong.
A better approach is to jointly estimate the delay distribution together with the nowcast. We can do this by using information from multiple snapshots of the data as it changes over time (using a data structure called the "reporting triangle"). In this session, we'll introduce this approach to joint estimation in nowcasting. At the end we'll then demonstrate a way to combine this with our previous work estimating the reproduction number, steadily improving our real-time outbreak model.
Slides
Objectives
This session aims to introduce how to do nowcasting if the reporting delay distribution is unknown.
Source file
The source file of this session is located at sessions/joint-nowcasting.qmd.
Libraries used
In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the dplyr and tidyr packages for data wrangling, the ggplot2 package for plotting, and the tidybayes package for extracting results of the inference.
The best way to interact with the material is via the Visual Editor of RStudio.
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.
set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)
Joint estimation of delay distributions and nowcasting
Motivation
So far we have assumed that the delay distribution is known. In practice, this is often not the case and we need to estimate it from the data. As we discussed in the session on biases in delay distributions, this can be done using individual data and then passing this estimate to a simple nowcasting model like those above. However, this has the disadvantage that the nowcasting model does not take into account the uncertainty in the delay distribution or observation error of the primary events. We can instead estimate the delay distribution and nowcast the data jointly.
The reporting triangle
To jointly estimate we need to decompose observations into what is known as the reporting triangle. This is a matrix where the rows are the days of onset and the columns are the days of report. The entries are the number of onsets on day \(i\) that are reported on day \(j\). We can then use this matrix to estimate the delay distribution and nowcast the data. It is referred to as a triangle because the data for the more recent onset days are incomplete, which gives the matrix a triangular shape.
We can construct the reporting triangle from onsets (\(N_{t}\)) as follows: \[ N_{t} = \sum_{d=0}^{D} n_{t,d} \]
Where \(n_{t,d}\) is the number of onsets on day \(t\) that are reported with a delay of \(d\) days (i.e. on day \(t+d\)), and \(D\) represents the maximum delay between date of reference and date of report, which in theory could be infinite but in practice we set to a finite value to make the model identifiable and computationally feasible. We can now construct a model to estimate \(n_{t,d}\),
\[ n_{t,d} \mid \lambda_{t},p_{t,d} \sim \text{Poisson} \left(\lambda_{t} \times p_{t,d} \right),\ t=1,...,T. \]
where \(\lambda_{t}\) is the expected number of onsets on day \(t\) and \(p_{t,d}\) is the probability that an onset on day \(t\) is reported with a delay of \(d\) days (i.e. on day \(t+d\)). Here \(\lambda_{t}\) is the same as the expected number of onsets on day \(t\) in the simple nowcasting model above, so we again model it using a geometric random walk for now. We give the reporting probabilities \(p_{t,d}\) a Dirichlet prior, as this is a distribution over probabilities. \(p_{t,d}\) is equivalent to the reporting delays we have been using as fixed quantities so far, but now estimated within the model. In most real-world settings we would want to use our domain expertise to inform the prior distribution of \(p_{t,d}\).
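To make the structure concrete, here is a small made-up reporting triangle (the numbers are arbitrary and only illustrate the shape; columns are indexed by delay \(d\), which is equivalent to indexing by report day \(t+d\)):
# A toy reporting triangle: rows are onset days t, columns are delays d = 0, ..., 3.
# The bottom-right cells are still unobserved at the time of nowcasting,
# which is what gives the matrix its triangular shape.
toy_triangle <- matrix(
  c(4, 2, 1, 0,
    6, 3, 1, 1,
    5, 4, 2, NA,
    7, 2, NA, NA,
    3, NA, NA, NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(onset_day = 1:5, delay = 0:3)
)
toy_triangle
rowSums(toy_triangle) # complete rows recover N_t; recent rows are what we nowcast
Summing a complete row over \(d\) recovers \(N_{t}\); for the most recent days the sum is NA because some reports are still missing, and filling in those missing cells is exactly the nowcasting task.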
Simulating the reporting triangle
Now that we are aiming to jointly estimate the delay distribution we need additional data. We can simulate this data by using the same generative process as above but now also simulating the reporting delays.
Once again we generate our simulated onset dataset:
gen_time_pmf <- make_gen_time_pmf()
ip_pmf <- make_ip_pmf()
onset_df <- simulate_onsets(
  make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)
head(onset_df)
# A tibble: 6 × 3
day onsets infections
<dbl> <int> <int>
1 1 0 0
2 2 0 1
3 3 0 0
4 4 0 2
5 5 0 1
6 6 0 1
cutoff <- 71
We also need to simulate the reporting delays:
reporting_delay_pmf <- censored_delay_pmf(
  rlnorm, max = 15, meanlog = 1, sdlog = 0.5
)
plot(reporting_delay_pmf)
We can then simulate the reporting triangle:
reporting_triangle <- onset_df |>
  filter(day < cutoff) |>
  mutate(
    reporting_delay = list(
      tibble(d = 0:15, reporting_delay = reporting_delay_pmf)
    )
  ) |>
  unnest(reporting_delay) |>
  mutate(
    reported_onsets = rpois(n(), onsets * reporting_delay)
  ) |>
  mutate(reported_day = day + d)
We also need to update our simulated truth data to include the Poisson observation error we are assuming is part of the observation process.
noisy_onsets_df <- reporting_triangle |>
  summarise(noisy_onsets = sum(reported_onsets), .by = day)
As we only partially observe the reporting triangle we need to filter it to only include the data we have observed:
filtered_reporting_triangle <- reporting_triangle |>
  filter(reported_day <= max(day))
Finally, we sum the filtered reporting triangle to get the counts we actually observe.
available_onsets <- filtered_reporting_triangle |>
  summarise(available_onsets = sum(reported_onsets), .by = day)
Fitting the joint model
As usual we start by loading the model:
joint_mod <- nfidd_cmdstan_model("joint-nowcast")
joint_mod
1: functions {
2: #include "functions/geometric_random_walk.stan"
3: #include "functions/observe_onsets_with_delay.stan"
4: #include "functions/combine_obs_with_predicted_obs_rng.stan"
5: }
6:
7: data {
8: int n; // number of days
9: int m; // number of reports
10: array[n] int p; // number of observations per day
11: array[m] int obs; // observed symptom onsets
12: int d; // number of reporting delays
13: }
14:
15: transformed data{
16: array[n] int P = to_int(cumulative_sum(p));
17: array[n] int D = to_int(cumulative_sum(rep_array(d, n)));
18: }
19:
20: parameters {
21: real<lower=0> init_onsets;
22: array[n-1] real rw_noise;
23: real<lower=0> rw_sd;
24: simplex[d] reporting_delay; // reporting delay distribution
25: }
26:
27: transformed parameters {
28: array[n] real onsets = geometric_random_walk(init_onsets, rw_noise, rw_sd);
29: array[m] real onsets_by_report = observe_onsets_with_delay(onsets, reporting_delay, P, p);
30: }
31:
32: model {
33: // Prior
34: init_onsets ~ normal(1, 1) T[0,];
35: rw_noise ~ std_normal();
36: rw_sd ~ normal(0, 0.1) T[0,];
37: reporting_delay ~ dirichlet(rep_vector(1, d));
38: // Likelihood
39: obs ~ poisson(onsets_by_report);
40: }
41:
42: generated quantities {
43: array[d*n] real complete_onsets_by_report = observe_onsets_with_delay(onsets, reporting_delay, D, rep_array(d, n));
44: array[n] int nowcast = combine_obs_with_predicted_obs_rng(obs, complete_onsets_by_report, P, p, d, D);
45: }
46:
This time we won't go into the details of the model. For now, it is important that you understand the concept; as the models get more complex, we hope you trust us that the model does what we describe above.
One thing to note is that we are now fitting the initial number of symptom onsets (init_onsets). This is different from earlier, when we had to pass the initial number of infections (I0) as data. In most situations this number would be unknown, so what we do here is closer to what one would do in the real world.
We then fit it to the data:
joint_data <- list(
  n = length(unique(filtered_reporting_triangle$day)), # number of days
  m = nrow(filtered_reporting_triangle), # number of reports
  p = filtered_reporting_triangle |>
    group_by(day) |>
    filter(d == max(d)) |>
    mutate(d = d + 1) |>
    pull(d), # number of observations per day
  obs = filtered_reporting_triangle$reported_onsets, # observed symptom onsets
  d = 16 # number of reporting delays
)
joint_nowcast_fit <- joint_mod$sample(data = joint_data, parallel_chains = 4)
joint_nowcast_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ 373.49 374.05 8.11 7.97 359.21 385.56 1.00 1114 1742
init_onsets 0.20 0.15 0.17 0.15 0.01 0.54 1.00 4228 2067
rw_noise[1] -0.95 -0.95 0.92 0.94 -2.43 0.53 1.00 9303 2943
rw_noise[2] -0.74 -0.76 0.92 0.90 -2.28 0.79 1.00 7938 2402
rw_noise[3] -0.50 -0.49 0.93 0.97 -2.06 1.05 1.00 8804 2774
rw_noise[4] -0.29 -0.29 0.97 0.97 -1.91 1.30 1.00 8994 2767
rw_noise[5] -0.13 -0.12 0.94 0.93 -1.63 1.44 1.00 9706 2848
rw_noise[6] 0.04 0.03 0.93 0.92 -1.47 1.56 1.00 8998 2875
rw_noise[7] -0.06 -0.07 0.94 0.94 -1.66 1.46 1.00 7956 2580
rw_noise[8] -0.16 -0.16 0.92 0.94 -1.66 1.35 1.00 8134 2842
# showing 10 of 2348 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
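Before looking at the nowcast itself, a quick optional check (not part of the original analysis) is to compare the posterior mean of the estimated reporting delay distribution with the pmf we simulated from:
# Posterior mean of the estimated reporting delay pmf vs. the simulated pmf
joint_nowcast_fit |>
  gather_draws(reporting_delay[d]) |>
  summarise(posterior_mean = mean(.value), .by = d) |>
  arrange(d) |>
  mutate(simulated = reporting_delay_pmf)
If the model is doing its job, the two columns should broadly agree, with more uncertainty at the longest delays where few reports are observed.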
One benefit of this model is that, because we have decomposed the data into the reporting triangle, we can make a nowcast that uses the data we have available augmented with predictions from the model. This should give us far more accurate uncertainty estimates than the simple nowcasting models above (see stan/functions/combine_obs_with_predicted_obs_rng.stan, but note the code is fairly involved).
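To get an intuition for what that function does, here is a rough R sketch of the underlying idea (an illustration only, not the actual Stan code, which works on the flattened reporting triangle using the index vectors P, p, d, and D): for each onset day we keep the reports already observed and add Poisson draws for the cells of the reporting triangle that are still missing.
# Illustrative sketch of the nowcasting logic for a single onset day:
# observed reports are kept as they are; missing cells are simulated
# from their expected values under the fitted model.
nowcast_one_day <- function(observed_reports, expected_missing_reports) {
  sum(observed_reports) +
    sum(rpois(length(expected_missing_reports), expected_missing_reports))
}

# e.g. a recent day with three observed report columns and 13 still missing
nowcast_one_day(
  observed_reports = c(5, 3, 1),
  expected_missing_reports = rep(0.2, 13)
)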
We now extract this nowcast:
joint_nowcast_onsets <- joint_nowcast_fit |>
  gather_draws(nowcast[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100))
Finally, we can plot the nowcast alongside the observed data:
ggplot(joint_nowcast_onsets, aes(x = day)) +
  geom_col(
    data = noisy_onsets_df, mapping = aes(y = noisy_onsets), alpha = 0.6
  ) +
  geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = available_onsets, mapping = aes(y = available_onsets))
Reminder: The points in this plot represent the data available when the nowcast was made (and so are truncated) whilst the bars represent the finally reported data (a perfect nowcast would exactly reproduce these).
Look back at the last three nowcasts. How do they compare? What are the advantages and disadvantages of each? Could we improve the nowcasts further?
- The simple nowcast struggled to capture the generative process of the data and so produced poor nowcasts. The nowcast with the geometric random walk was better but still struggled to capture the generative process of the data. The joint nowcast was the best of the three as it properly handled the uncertainty and allowed us to fit the delay distribution rather than relying on known delays.
- However, the joint nowcast is still quite simple (in the sense that no detailed mechanism or reporting process is being modelled) and so may struggle to capture more complex patterns in the data. In particular the prior model for the geometric random walk assumes that onsets are the same as the previous day with some statistical noise. This may not be a good assumption in a rapidly changing epidemic (where the reproduction number is not near 1).
- In addition, whilst we call it "quite simple" in that sense, as should be clear from the code it is fairly involved and computationally intensive. This is because we are fitting a model to the reporting triangle, which is a much larger data set, and so the model is relatively slow to fit.
Putting it all together: Estimating the reproduction number, nowcasting, and joint estimation of delay distributions
This section contains a lot of code and is quite complex. It is not necessary to understand all of it to get the main points of the session. We recommend reading through it to get a sense of how all the pieces fit together but don’t worry if you don’t understand all of it.
In the previous sessions, we have seen how to estimate the reproduction number and how to nowcast the data. We can now put these two pieces together to estimate the reproduction number and nowcast the data jointly. This should allow us to produce more accurate nowcasts as we can use the information from the reproduction number to inform the nowcast and vice versa.
As in the renewal session, we need to define the generation time distribution and an incubation period distribution. We will use the same distributions as in the renewal session for simplicity. These are:
plot(gen_time_pmf)
and
plot(ip_pmf)
We now load in the model:
joint_rt_mod <- nfidd_cmdstan_model("joint-nowcast-with-r")
joint_rt_mod
1: functions {
2: #include "functions/geometric_random_walk.stan"
3: #include "functions/renewal.stan"
4: #include "functions/convolve_with_delay.stan"
5: #include "functions/observe_onsets_with_delay.stan"
6: #include "functions/combine_obs_with_predicted_obs_rng.stan"
7: }
8:
9: data {
10: int n; // number of days
11: int m; // number of reports
12: array[n] int p; // number of observations per day
13: array[m] int obs; // observed symptom onsets
14: int d; // number of reporting delays
15: int gen_time_max; // maximum generation time
16: array[gen_time_max] real gen_time_pmf; // pmf of generation time distribution
17: int<lower = 1> ip_max; // max incubation period
18: array[ip_max + 1] real ip_pmf;
19: }
20:
21: transformed data{
22: array[n] int P = to_int(cumulative_sum(p));
23: array[n] int D = to_int(cumulative_sum(rep_array(d, n)));
24: }
25:
26: parameters {
27: real<lower = 0> init_I; // initial number of infected
28: real init_R; // initial reproduction number
29: array[n-1] real rw_noise; // random walk noise
30: real<lower = 0> rw_sd; // random walk standard deviation
31: simplex[d] reporting_delay; // reporting delay distribution
32: }
33:
34: transformed parameters {
35: array[n] real R = geometric_random_walk(init_R, rw_noise, rw_sd);
36: array[n] real infections = renewal(init_I, R, gen_time_pmf);
37: array[n] real onsets = convolve_with_delay(infections, ip_pmf);
38: array[m] real onsets_by_report = observe_onsets_with_delay(onsets, reporting_delay, P, p);
39: }
40:
41: model {
42: // Prior
43: init_I ~ lognormal(-1, 1);
44: init_R ~ normal(-.1, 0.5); // Approximately Normal(1, 0.5)
45: rw_noise ~ std_normal();
46: rw_sd ~ normal(0, 0.05) T[0,];
47: reporting_delay ~ dirichlet(rep_vector(1, d));
48: // Likelihood
49: obs ~ poisson(onsets_by_report);
50: }
51:
52: generated quantities {
53: array[d*n] real complete_onsets_by_report = observe_onsets_with_delay(onsets, reporting_delay, D, rep_array(d, n));
54: array[n] int nowcast = combine_obs_with_predicted_obs_rng(obs, complete_onsets_by_report, P, p, d, D);
55: }
56:
Familiarise yourself with the model above. Can you see how it combines the nowcasting and the estimation of the reproduction number? Can you suggest how you would swap in the simple nowcasting model whilst keeping the estimation of the reproduction number?
Essentially, rather than using observe_onsets_with_delay.stan we would use condition_onsets_by_report.stan and pass in the proportion reported as data. This would allow us to use the simple nowcasting model whilst still estimating the reproduction number. We would also remove the generated quantities block, as we are not nowcasting the data, and simplify the observations to just the number of onsets.
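As a pointer, here is a minimal sketch of the "proportion reported" such a model would condition on, assuming a known delay pmf as in the earlier simple nowcast (this is not code from the session):
# The probability that an onset has been reported within a given delay is
# the cumulative sum of the (assumed known) reporting delay pmf.
proportion_reported <- cumsum(reporting_delay_pmf)
proportion_reported[1:4] # reported within 0, 1, 2 or 3 days of onset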
Now let’s fit the final model for this session!
joint_rt_data <- c(joint_data,
  list(
    gen_time_max = length(gen_time_pmf),
    gen_time_pmf = gen_time_pmf,
    ip_max = length(ip_pmf) - 1,
    ip_pmf = ip_pmf,
    h = 0 # this is a small easter egg for the attentive reader
  )
)
joint_rt_fit <- joint_rt_mod$sample(
  data = joint_rt_data, parallel_chains = 4,
  adapt_delta = 0.95,
  init = \() list(init_R = 0, rw_sd = 0.01)
)
We use adapt_delta = 0.95 here as this is a relatively complex model with a difficult-to-explore posterior. Increasing this setting decreases the step size of the sampler and so makes it easier for it to explore the posterior. The downside is that fitting then takes longer.
joint_rt_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ 359.81 360.22 6.76 6.64 348.24 370.18 1.00 1499 2367
init_I 1.42 1.32 0.65 0.57 0.60 2.54 1.00 1960 2355
init_R 0.35 0.34 0.12 0.10 0.15 0.55 1.00 1760 1547
rw_noise[1] 0.06 0.06 0.99 0.99 -1.55 1.72 1.00 5047 3067
rw_noise[2] 0.04 0.04 0.99 0.96 -1.58 1.65 1.00 4643 2787
rw_noise[3] 0.04 0.06 0.98 1.00 -1.54 1.61 1.00 4968 2731
rw_noise[4] 0.02 0.02 1.01 0.97 -1.67 1.71 1.00 5042 2627
rw_noise[5] 0.02 0.02 1.00 1.00 -1.61 1.66 1.00 5519 2907
rw_noise[6] 0.02 0.02 1.01 1.02 -1.65 1.63 1.00 4420 2785
rw_noise[7] -0.02 0.00 0.99 0.97 -1.66 1.61 1.00 5537 2731
# showing 10 of 2489 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
First we can extract the nowcast and plot it alongside the observed data:
joint_nowcast_with_r_onsets <- joint_rt_fit |>
  gather_draws(nowcast[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100))
ggplot(joint_nowcast_with_r_onsets, aes(x = day)) +
  geom_col(
    data = noisy_onsets_df, mapping = aes(y = noisy_onsets), alpha = 0.6
  ) +
  geom_line(mapping = aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = available_onsets, mapping = aes(y = available_onsets))
We can also extract the reproduction number and plot it:
joint_rt <- joint_rt_fit |>
  gather_draws(R[day]) |>
  ungroup() |>
  filter(.draw %in% sample(.draw, 100))
ggplot(joint_rt, aes(x = day, y = .value, group = .draw)) +
geom_line(alpha = 0.1)
What do you think of the nowcast now? Does it look better than the previous one? What about the reproduction number?
- Whilst the majority of the nowcast is similar, we see that the nowcast for days nearer to the present is more accurate, as this model can capture the trend in infections and account for the delays from infection to onset and from onset to report.
- The key takeaway from the reproduction number plot is that it looks similar to the one we estimated in the renewal session. This is because we have accounted for the truncation (otherwise it would be spuriously decreasing towards the end of the timeseries).
Going further
- The simple nowcast models we showed here assumed perfect knowledge of the delay distribution. What happens when you instead use an estimate of the delay distribution from the data? Try to do this using methods from the session on biases in delay distributions and see how it affects the simple nowcast models.
- Despite being fairly involved the joint nowcast model we used here is still quite simple and may struggle to capture more complex patterns in the data. In practice more complex methods are often needed to account for structure in the reporting process, time-varying delay distributions, or delays that vary by other factors (such as the age of cases).
- The epinowcast package implements a more complex version of the model we have used here. It is designed to be highly flexible and so can be used to model a wide range of different data sets. You will have the opportunity to try it in a later session.
- This session focussed on the role of the generative process in nowcasting. This is an area of active research, but this paper gives a good overview of the current state of the art.