Using delay distributions to model the data generating process

Introduction

We’ve now developed a good idea of how epidemiological time delays affect our understanding of an evolving outbreak. So far, we’ve been working with individual line-list data. However, we usually want to model how an outbreak evolves through a whole population. The aim of this session is to introduce how delay distributions can be used to model population-level data generating process during an epidemic.

This means working with aggregated count data. This creates some new issues in correctly accounting for uncertainty. We handle population-level data using convolutions as a way of combining count data with a distribution of individual probabilities. We will have to adjust our continuous probability distributions to work with this using discretisation. We’ll then need to re-introduce additional uncertainty to account for the observation process at a population level.

Slides

Objectives

In this session, we’ll focus on the delay from infection to symptom onset at a population level. First, we will introduce the techniques of convolution and discretisation. We’ll apply these to an aggregated time series of infections in order to simulate observed symptom onsets. Then, we’ll use those simulated symptom onsets to try and reconstruct a time series of infections.

Source file

The source file of this session is located at sessions/using-delay-distributions-to-model-the-data-generating-process-of-an-epidemic.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the dplyr and tidyr packages for data wrangling, the ggplot2 library for plotting and the tidybayes package for extracting results of the inference.

library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("tidybayes")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.

set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)

Simulating observations from a time series of infections

As before we first simulate the process that generates the data we typically observe. In this session, we’ll focus on the process from infection to symptom onset. Then we’ll use the same model to conduct inference.

Delay distributions and convolutions

In the last session we simulated individual outcomes from a delay distribution, and then re-estimated the corresponding parameters. However, sometimes we do not have data on these individual-level outcomes, either because they are not recorded or because they cannot be shared, for example due to privacy concerns. At the population level, individual-level delays translate into convolutions.

If we have a time series of infections \(I_t\) (\(t=1, 2, 3, \ldots, t_\mathrm{max}\)), where \(t\) denotes the day on which the infections occur, and observable outcomes occur with a delay given by a delay distribution \(p_i\) (\(i=0, 1, 2, \dots, p_\mathrm{max}\)), where \(i\) is the number of days after infection that the observation happens, then the number of observable outcomes \(C_t\) on day \(t\) is given by

\[ C_t = \sum_{i=0}^{i=p_\mathrm{max}} I_{t-i} p_i \]

In other words, the number of observable outcomes on day \(t\) is given by the sum of infections on all previous days multiplied by the probability that those infections are observed on day \(t\). For example, the observable outcomes \(C_t\) could be the number of symptom onsets on day \(t\) and \(p_i\) is the incubation period.

We can use the same data as in the session on biases in delay distributions, but this time we first aggregate this into a daily time series of infections. You can do this using:

inf_ts <- make_daily_infections(infection_times)
Note

As before you can look at the code of the function by typing make_daily_infections. The second part of this function is used to add days without infections with a zero count. This will make our calculations easier later (as otherwise we would have to try and detect these in any models that used this data which could be complicated).

And look at the first few rows of the daily aggregated data:

head(inf_ts)
# A tibble: 6 × 2
  infection_day infections
          <dbl>      <int>
1             0          1
2             1          0
3             2          1
4             3          0
5             4          2
6             5          1

Now we can convolve the time series with a delay distribution to get a time series of outcomes as suggested above.

Discretising a delay distribution

In our first session, we decided to assume the delay from infection to symptom onset had a gamma distribution. However, if we want to use the gamma distribution with shape 5 and rate 1 as before, we face a familiar issue. The gamma distribution is a continuous distribution, but now our delay data are in days which are discrete entities. We will assume that both events that make up the delay (here infection and symptom onset) are observed as daily counts (e.g. as number of infections/symptom onsets by calendar date). Therefore, both observations are censored (as events are rounded to the nearest date). This means that our distribution is double interval censored which we encountered in the the biases in delay distribution session, so we need to use the same ideas introduced in that session.

The cumulative distribution function (CDF) (\(F(t)\)) of a distribution that has a daily censored primary event can be expressed as,

\[ F^*(t) = \int_0^1 F(t - u) du \]

In effect, this is saying that the daily censored CDF is the average of the continuous distributions CDF over all possible event times (here between 0 and 1).

The probability mass function (PMF) of this distribution when observed as a daily process (i.e. the secondary event is also daily censored) is then

\[ f_t \propto F^*(t + 1) - F^*(t - 1) \]

The important point is that the ultimately observed PMF is a combination of a primary event daily censoring process and a secondary event daily censoring process.

We can think about this via simulation. We do so by generating many replicates of the corresponding random delay, taking into account that we have already rounded down our infection times to infection days. This means that discretising a delay in this context is double censoring as we discussed in the the biases in delay distribution session. In the absence of any other information or model, we assume for our simulation that infection occurred at some random time during the day, with each time equally likely. We can then apply the incubation period using a continuous probability distribution, before once again rounding down to get the day of symptom onset (mimicking daily reporting). We repeat this many times to get the probability mass function that allows us to go from infection days to symptom onset days.

You can view the function we have written do this using:

censored_delay_pmf
function (rgen, max, n = 1e+06, ...) 
{
    first <- runif(n, min = 0, max = 1)
    second <- first + rgen(n, ...)
    delay <- floor(second)
    counts <- table(factor(delay, levels = seq(0, max)))
    pmf <- counts/sum(counts)
    return(pmf)
}
<bytecode: 0x561565a6df50>
<environment: namespace:nfidd>
Take 5 minutes

Try to understand the censored_delay_pmf() function above. Try it with a few different probability distributions and parameters, e.g. for the parameters given above and a maximum delay of 2 weeks (14 days) it would be:

gamma_pmf <- censored_delay_pmf(rgamma, max = 14, shape = 5, rate = 1)
gamma_pmf

           0            1            2            3            4            5 
0.0006738544 0.0213000084 0.0904146458 0.1637886841 0.1914657786 0.1741648410 
           6            7            8            9           10           11 
0.1339778840 0.0919285654 0.0582628773 0.0344727114 0.0194606761 0.0103971836 
          12           13           14 
0.0055410260 0.0027965460 0.0013547178 
plot(gamma_pmf)

Applying a convolution

Next we apply a convolution with the discretised incubation period distribution to the time series of infections, to generate a time series of symptom onsets. The function we have written to do this is called convolve_with_delay

convolve_with_delay
function (ts, delay_pmf) 
{
    max_delay <- length(delay_pmf) - 1
    convolved <- vapply(seq_along(ts), function(i) {
        first_index <- max(1, i - max_delay)
        ts_segment <- ts[seq(first_index, i)]
        pmf <- rev(delay_pmf)[seq_len(i - first_index + 1)]
        ret <- sum(ts_segment * pmf)
        return(ret)
    }, numeric(1))
    return(convolved)
}
<bytecode: 0x5615665dadf8>
<environment: namespace:nfidd>
Take 5 minutes

Try to understand the convolve_with_delay() function above. Try it with a few different time series and delay distributions. How would you create the time series of symptom onsets from infections, using the discretised gamma distribution created above (saved in gamma_pmf)?

onsets <- convolve_with_delay(inf_ts$infections, gamma_pmf)

We can plot these symptom onsets:

combined <- inf_ts |>
  rename(time = infection_day) |>
  mutate(onsets = onsets)
ggplot(combined, aes(x = time, y = onsets)) +
  geom_bar(stat = "identity")

Do they look similar to the plot of symptom onsets in the session on delay distributions?

Observation uncertainty

Usually not all data are perfectly observed. Also, the convolution we applied is a deterministic operation that brushes over the fact that individual delays are random. We should therefore find another way to model the variation these processes introduce.

Given that we are now dealing with count data a natural choice is the Poisson distribution. We can use this to generate uncertainty around our convolved data.

combined <- combined |>
  mutate(observed = rpois(n(), onsets))
Take 5 minutes

Does a plot of these observations look more like the plots from the session on delay distributions than the convolution plotted above?

ggplot(combined, aes(x = time, y = observed)) +
  geom_bar(stat = "identity")

Estimating a time series of infections

As in previous sessions, we don’t usually have data on infections. We now create a model that uses symptom onsets to estimate the number of infections over time. We’ll base the model on an uninformed prior and work forward from what we know about the observation process.

mod <- nfidd_cmdstan_model("estimate-infections")
mod
 1: functions {
 2:   #include "functions/convolve_with_delay.stan"
 3: }
 4: 
 5: data {
 6:   int n;            // number of time days
 7:   array[n] int obs; // observed onsets
 8:   int<lower = 1> ip_max; // max incubation period
 9:   // probability mass function of incubation period distribution (first index zero)
10:   array[ip_max + 1] real ip_pmf;
11: }
12: 
13: parameters {
14:   array[n] real<lower = 0> infections;
15: }
16: 
17: transformed parameters {
18:   array[n] real onsets = convolve_with_delay(infections, ip_pmf);
19: }
20: 
21: model {
22:   // priors
23:   infections ~ normal(0, 10) T[0, ];
24:   obs ~ poisson(onsets);
25: }
Take 10 minutes

Familiarise yourself with the model above. Unlike before there is now a functions block at the beginning of the model (lines 1-3), where we load a function called convolve_with_delay() (line 2) from a file of the same name which can be found in the subdirectory functions of the stan directory or viewed on the github repo. The functions correspond exactly to our earlier R function of the same name. Later, this functions is called in the model block, to generate the time series of symptom onsets (line 18).

What is the prior assumption on the daily number of infections? Which line defines the likelihood, and how does it relate to the section about observation uncertainty above?

The model assumes that infections every day are independent from infections on any other day (line 23) and determined only by the number of symptom onsets that they result in (line 18). Line 24 defines the likelihood, and it does so using the Poisson observation uncertainty we used above.

We can now use this model to conduct inference, i.e. to try to reconstruct the time series of infections from the time series of onsets that we generated earlier.

data <- list(
  n = nrow(combined),
  obs = combined$observed,
  ip_max = length(gamma_pmf) - 1,
  ip_pmf = gamma_pmf
)
inf_fit <- mod$sample(data = data, parallel_chains = 4)
Caution

Note that this code might take a few minutes to run. We have added the parallel_chains option to make use of any paralle hardware you may ave. If you don’t have 4 computing cores and/or the code runs very slowly, you could try only running 2 chains (chains = 2) and/or fewer samples (iter_warmup = 500, iter_sampling = 500).

Tip

In this model, we have estimated many more parameters than in the previous models: instead of e.g. 2 parameters of a probability distribution, we now have a total of time points for which we estimate the number of infections. This is because we don’t have a model for the process that generates infections. How would this be different if we e.g. used an SIR model here?

We can see the first few estimates of the number of infections using:

inf_fit
      variable     mean   median   sd  mad       q5      q95 rhat ess_bulk
 lp__          20152.04 20152.40 9.17 9.04 20136.60 20166.91 1.00     1503
 infections[1]     7.53     6.33 5.79 5.56     0.57    18.99 1.00     4346
 infections[2]     7.33     6.07 5.71 5.53     0.52    18.22 1.00     3514
 infections[3]     6.92     5.82 5.33 5.11     0.53    17.07 1.00     3881
 infections[4]     6.28     5.17 5.03 4.78     0.40    16.07 1.00     3331
 infections[5]     5.74     4.94 4.33 4.20     0.48    14.27 1.00     4356
 infections[6]     3.27     2.47 2.87 2.38     0.23     9.19 1.00     5183
 infections[7]     2.43     1.81 2.23 1.80     0.15     7.00 1.00     4327
 infections[8]     1.90     1.36 1.80 1.36     0.10     5.58 1.00     4503
 infections[9]     1.64     1.20 1.49 1.15     0.10     4.68 1.00     5000
 ess_tail
     2334
     2288
     1909
     2153
     1741
     2211
     2735
     2187
     2147
     2271

 # showing 10 of 285 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)

Again, we can do a posterior predictive check by plotting the modelled estimates of the time series of infections (with uncertainty) against our original data. Does it look like a good fit?

# Extract posterior draws
inf_posterior <- inf_fit |>
  gather_draws(infections[infection_day]) |>
  ungroup() |> 
  mutate(infection_day = infection_day - 1) |>
  filter(.draw %in% sample(.draw, 100))

ggplot(mapping = aes(x = infection_day)) +
  geom_line(
    data = inf_posterior, mapping = aes(y = .value, group = .draw), alpha = 0.1
  ) +
  geom_line(data = inf_ts, mapping = aes(y = infections), colour = "red") +
  labs(title = "Infections per day",
       subtitle = "True data (red), and sample of estimated trajectories (black)")

Tip

This time we used the gather_draws() function included in tidybayes to extract the inference results. This is particularly useful when dealing with arrays such as inference because it allows to extract them with a given index (here: [day]).

Going further

  • Above, we used a Poisson distribution to characterise uncertainty. In the Poisson distribution, the variance is the same as the mean. Another common choice is the negative binomial distribution, which has a more flexible relationship between variance and mean. If you re-did the analysis above with the negative binomial distribution, what would be the difference?

  • We could have used the individual-level model of the previous section to try to estimate the number of infections with a known delay distribution by estimating each individual infection time. How would this look in stan code? Would you expect it to yield a different result?

Wrap up