Delay distributions

Introduction

Every outbreak starts with an infection, but we rarely observe these directly. Instead, there are many types of epidemiological events that we might observe that occur after the time of infection: symptom onsets, hospitalisations and discharge, etc. The length of time between any of these events might be highly variable and uncertain. However, we often need to have some estimate for these delays in order to understand what’s happening now and predict what might happen in the future. We capture this variability using probability distributions, which is the focus of this session.

Slides

Objectives

The aim of this session is for you to familiarise yourself with the concept of delay distributions used to describe reporting in infectious disease epidemiology. First we’ll look at this working forwards from infections in an outbreak. We’ll use R to probabilistically simulate delays from infection to reporting symptoms and hospitalisations. Then we’ll work only from this set of outcome data and use a stan model to estimate the parameters of one specific delay (in this example, symptom onset to hospitalisation).

Source file

The source file of this session is located at sessions/delay-distributions.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the ggplot2 package for plotting, the dplyr and tidyr packages to wrangle data, the lubridate package to deal with dates, and the posterior packages for investigating the results of the inference conducted with stan.

library("nfidd")
library("ggplot2")
library("dplyr")
library("tidyr")
library("lubridate")
library("posterior")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.

set.seed(1234)
options(cmdstanr_print_line_numbers = TRUE)

Simulating delayed epidemiological data

We will start this session by working with a simulated data set of infections from a disease that has caused an outbreak which subsequently ended. In our example outbreak, there were 6383 infections. The outbreak lasted 140 days after the first infection, with new infections peaking roughly around day 80.

Note

For now we will not concern ourselves with the model used to generate the epidemic. This represents a typical situation in the real world, where we may have a model of how an infection has spread, but we don’t necessarily know how well this corresponds to what really happened.

We will later deal with modelling the infectious process. For now, we will focus on modelling how infections get reported in data - the observation process. Using infectious disease data for analysis comes with three common challenges:

  1. We don’t normally observe infections directly, but their outcomes as symptomatic cases, hospitalisations or other realisations.
  2. These observations are incomplete (e.g. not every infection leads to hospitalisation, and so focusing on hospitalisations will leave infections unobserved).
  3. These observations happen with a delay after the infection occurs (e.g. from infection to symptoms).

Load the outbreak data

We will work with a data set that is included in the nfidd R package that you installed initially. The column infection_time is a linelist of infections from our example outbreak, given as a decimal number of days that have passed since the initial infection of the outbreak. It can be loaded with the data command.

data(infection_times)
head(infection_times)
  infection_time
1       0.000000
2       2.236708
3       4.091861
4       7.347199
5       8.990060
6       4.635069
### visualise the infection curve
ggplot(infection_times, aes(x = infection_time)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(n.breaks = 10) +
  labs(x = "Infection time (in days)", y = "Number of infections",
       title = "Infections during an outbreak")

Note

In reality, data from an outbreak will usually be given as dates, not decimals; and those will usually not represent infection, but an observed outcome such as symptom onset or hospital admission. For now we don’t want to spend too much time manipulating dates in R, but we will get back to working with more realistic outbreak data later.

Working forwards: simulating a sequence of delays after infection

We would now like to simulate hospitalisations arising from this outbreak. We will start with our data on infection times, and work forwards to symptom onsets and then hospitalisations. We’ll make the following assumptions about the process from infection to hospital admission:

  • Infection to symptoms:
    • We’ll assume all infections cause symptoms.
    • Time from infection to symptom onset (incubation period): We assume that the incubation period is gamma-distributed with shape 5 and rate 1, i.e. a mean of 5 days.
  • Symptoms to hospital admission:
    • We’ll assume that 30% of symptomatic cases become hospitalised.
    • Time from symptom onset to hospital admission: We assume that the onset-to-hospitalisation period is lognormally distributed, with meanlog 1.75 and sdlog 0.5, corresponding to a mean delay of about a week.

Let’s put these assumptions into practice by simulating some data, adding onset and hospitalisation times (in decimal number of days after the first infection) to the infection times. We’ll use random values for each infection from the probability distributions we’ve assumed above.

Note

Throughout the course, we repeatedly use some small chunks of code. Instead of copying these between sessions, we’ve put them into functions in the nfidd R package that we can use when needed. To see exactly what the function does, just type the function name, e.g. add_delays for the function below.

df <- add_delays(infection_times)
Take 2 minutes

Have a look at the add_delays() function and try to understand its inner workings. You can also access a help file for this function, just like any other R function, using ?add_delays.

Now we can plot infections, hospitalisations and onsets.

# convert our data frame to long format
dfl <- df |>
  pivot_longer(
    cols = c(infection_time, onset_time, hosp_time),
    names_to = "type", values_to = "time"
  )
# plot
ggplot(dfl, aes(x = time)) +
  geom_histogram(position = "dodge", binwidth = 1) +
  facet_wrap(~ type, ncol = 1) +
  xlab("Time (in days)") +
  ylab("Count")
Warning: Removed 4524 rows containing non-finite outside the scale range
(`stat_bin()`).

Estimating delay distributions from outcome data

As mentioned above, our data set of infection, symptom onset and hospitalisation times is not the typical data set we encounter in outbreaks. In reality, we don’t have infection dates, and we also have to deal with missing data, incomplete observations, data entry errors etc. For now, let us just assume we have a data set only including symptom onset times and some hospitalisation times.

Let’s look at a specific problem: we would like to estimate how long it takes for people to become hospitalised after becoming symptomatic. This might be an important delay to know about, for example when modelling and forecasting hospitalisations, or more generally for estimating required hospital capacity.

To do this we can use our data with stan to model the delay from symptom onset to hospitalisation, with the same assumption that it follows a lognormal distribution.

mod <- nfidd_cmdstan_model("lognormal")
mod
 1: // lognormal_model.stan
 2: data {
 3:   int<lower=0> n; // number of data points
 4:   array[n] real y; // data
 5: }
 6: 
 7: parameters {
 8:   real meanlog;
 9:   real<lower=0> sdlog;
10: }
11: 
12: model {
13:   meanlog ~ normal(0, 10);  // prior distribution
14:   sdlog ~ normal(0, 10) T[0, ]; // prior distribution
15: 
16:   y ~ lognormal(meanlog, sdlog);
17: }

Let’s estimate the onset-to-hospitalisation delay using the simulated data set we created above. Do we recover the parameters used for simulation?

We can do this by sampling from the model’s posterior distribution by feeding it our simulated data set.

## Specify the time from onset to hospitalisation
df_onset_to_hosp <- df |>
  mutate(onset_to_hosp = hosp_time - onset_time) |> 
  # exclude infections that didn't result in hospitalisation
  drop_na(onset_to_hosp)
## Use the data to sample from the model posterior
res <- mod$sample(
  data = list(
    n = nrow(df_onset_to_hosp),
    y = df_onset_to_hosp$onset_to_hosp
  )
)

To find out more about the sample function we used here, you can use ?cmdstanr::sample. To see the estimates, we can use:

res$summary() ## could also simply type `res`
# A tibble: 3 × 10
  variable      mean    median      sd     mad        q5      q95  rhat ess_bulk
  <chr>        <dbl>     <dbl>   <dbl>   <dbl>     <dbl>    <dbl> <dbl>    <dbl>
1 lp__     -1322.    -1322.    0.990   0.712   -1324.    -1.32e+3  1.00    1819.
2 meanlog      1.73      1.73  0.0111  0.0110      1.71   1.75e+0  1.00    3728.
3 sdlog        0.493     0.493 0.00816 0.00805     0.479  5.06e-1  1.00    3127.
# ℹ 1 more variable: ess_tail <dbl>

These estimates should look similar to what we used in the simulations. We can plot the resulting probability density functions.

Note

Every time we sample from the model posterior, we create 4000 iterations. That is a lot of data to plot, so instead, when we produce plots we’ll use a random sample of 100 iterations. You’ll see this throughout the course.

## get shape and rate samples from the posterior
res_df <- res |>
  as_draws_df() |>
  filter(.draw %in% sample(.draw, 100)) # sample 100 draws

## find the value (x) that includes 99% of the cumulative density
max_x <- max(qlnorm(0.99, meanlog = res_df$meanlog, sdlog = res_df$sdlog))

## calculate density on grid of x values
x <- seq(0, max_x, length.out = 100)
res_df <- res_df |>
  crossing(x = x) |> ## add grid to data frame
  mutate(density = dlnorm(x, meanlog, sdlog))

## plot
ggplot(res_df, aes(x = x, y = density, group = .draw)) +
  geom_line(alpha = 0.3)

Going further

  • In this session we were in the enviable situation of knowing which distribution was used to generate the data. With real data, of course, we don’t have this information available. Try using a different distribution for inference (e.g. normal, or gamma). Do you get a good fit?

Wrap up