Biases in delay distributions

Introduction

So far, we’ve looked at the uncertainty of the time delays between epidemiological events. The next challenge is that our information on these delays is usually biased, especially when we’re analysing data in real time. We’ll consider two types of biases that commonly occur in reported infectious disease data:

  • Censoring: when we know an event occurred at some time, but not exactly when.
  • Truncation: when not enough time has passed for all the relevant epidemiological events to occur or be observed.

We can again handle these by including them as uncertain parameters in the modelling process.

Slides

Introduction to biases in epidemiological delays

Objectives

In this session, we’ll introduce censoring and right truncation as typical properties of infectious disease data sets, using the delay from symptom onset to hospitalisation as an example.

Source file

The source file of this session is located at sessions/biases-in-delay-distributions.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the ggplot2 package for plotting, the dplyr and tidyr packages to wrangle data, the lubridate package to deal with dates.

library("nfidd")
library("ggplot2")
library("dplyr")
library("tidyr")
library("lubridate")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.

set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)

Load data

We will use the same simulated data set as in the session on delay distributions.

Note

Remember, in this outbreak we are assuming:

  • the incubation period is gamma-distributed with shape 5 and rate 1, i.e. a mean of 5 days
  • the time from onset to hospital admission is lognormally-distributed, with meanlog 1.75 and sdlog 0.5, i.e. a mean delay of about a week

We use the same function we used in that session to simulate symptom onset and hospitalisation data.

df <- add_delays(infection_times)

This creates the df data frame that we can inspect e.g. using

head(df)
  infection_time onset_time hosp_time
1       0.000000   3.389585        NA
2       2.236708   9.615665        NA
3       4.091861   5.721422  8.351006
4       7.347199  12.125639        NA
5       8.990060  17.863624        NA
6       4.635069  10.165931 19.976118

Dates, not days: censoring

Data on health outcomes are usually not recorded in the way that we have used so far in this session: as a numeric time since a given start date. Instead, we usually deal with dates.

We can make our simulated dataset a bit more realistic by rounding down the infection times to an integer number.

# Use the floor() function to round down to integers
df_dates <- df |>
  mutate(
    infection_time = floor(infection_time),
    onset_time = floor(onset_time),
    hosp_time = floor(hosp_time)
  )
head(df_dates)
  infection_time onset_time hosp_time
1              0          3        NA
2              2          9        NA
3              4          5         8
4              7         12        NA
5              8         17        NA
6              4         10        19
Note

As before we are still not working with dates but numbers. This makes handling the data easier - we don’t have to make any conversions before using the data in stan.

Each of the numbers now represent the number of days that have passed since the start of the outbreak. That is, each of the numbers correspond to a day. In that sense, the data is more like typical data we get from infectious disease outbreaks, where we would usually have a line list with key events such as symptom onset or death reported by a date. In statistical terms, we call the delay double interval censored: “double” because the delays represent the time between two events that are both censored; and “interval” because all we know about the timings of the events is that they happened in a certain time interval (between 0:00 and 23:59 on the recorded day).

Estimating delay distributions accounting for censoring

Let’s estimate the time from symptom onset to hospitalisation with the censored data.

A naïve approach to estimating the delay would be to ignore the fact that the data are censored. To estimate the delay from onset to hospitalisation, we could just use the difference between the censored times, which is an integer (the number of days).

df_dates <- df_dates |>
  mutate(
    incubation_period = onset_time - infection_time,
    onset_to_hosp = hosp_time - onset_time
  )
Take 5 minutes

Fit the lognormal model used in the session on delay distributions to the estimates from the rounded data, i.e. using the df_dates data set. Do you still recover the parameters that we put in?

mod <- nfidd_cmdstan_model("lognormal")
res <- mod$sample(
  data = list(
    n = nrow(na.omit(df_dates)),
    y = na.omit(df_dates)$onset_to_hosp
  )
)
res
 variable     mean   median   sd  mad       q5      q95 rhat ess_bulk ess_tail
  lp__    -1320.91 -1320.63 0.95 0.71 -1322.78 -1319.99 1.00     1841     2380
  meanlog     1.73     1.73 0.01 0.01     1.71     1.75 1.00     3955     2880
  sdlog       0.51     0.51 0.01 0.01     0.49     0.52 1.00     3422     2651

Usually the estimates will be further from the “true” parameters than before when we worked with the unrounded data.

To account for double interval censoring, we need to modify the model to include the fact that we don’t know when exactly on any given day the event happened. For example, if we know that symptom onset of an individual occurred on 20 June, 2024, and they were admitted to hospital on 22 June, 2024, this could mean an onset-to-hospitalisation delay from 1 day (onset at 23:59 on the 20th, admitted at 0:01 on the 22nd) to 3 days (onset at 0:01 on the 20th, admitted at 23:59 on the 22nd).

We can use this in our delay estimation by making the exact time of the events based on the dates given part of the estimation procedure:

cmod <- nfidd_cmdstan_model("censored-delay-model")
cmod
 1: data {
 2:   int<lower = 0> n;
 3:   array[n] int<lower = 1> onset_to_hosp;
 4: }
 5: 
 6: parameters {
 7:   real meanlog;
 8:   real<lower = 0> sdlog;
 9:   array[n] real<lower = 0, upper = 1> onset_day_time;
10:   array[n] real<lower = 0, upper = 1> hosp_day_time;
11: }
12: 
13: transformed parameters {
14:   array[n] real<lower = 0> true_onset_to_hosp;
15:   for (i in 1:n) {
16:     true_onset_to_hosp[i] =
17:       onset_to_hosp[i] + hosp_day_time[i] - onset_day_time[i];
18:   }
19: }
20: 
21: model {
22:   meanlog ~ normal(0, 10);
23:   sdlog ~ normal(0, 10) T[0, ];
24:   onset_day_time ~ uniform(0, 1);
25:   hosp_day_time ~ uniform(0, 1);
26: 
27:   true_onset_to_hosp ~ lognormal(meanlog, sdlog);
28: }
Take 5 minutes

Familiarise yourself with the model above. Do you understand all the lines? Which line(s) define the parameter prior distribution(s), which one(s) the likelihood, and which one(s) reflect that we have now provided the delay as the difference in integer days?

Lines 21-24 define the parametric prior distributions (for parameters meanlog and sdlog, and the estimates of exact times of events). Line 27 defines the likelihood. Lines 15-17 reflect the integer delays, adjusted by the estimated times of day.

Now we can use this model to re-estimate the parameters of the delay distribution:

cres <- cmod$sample(
  data = list(
    n = nrow(na.omit(df_dates)),
    onset_to_hosp = na.omit(df_dates)$onset_to_hosp
  )
)
cres
          variable      mean    median    sd   mad        q5       q95 rhat
 lp__              -11573.85 -11573.60 50.68 51.59 -11659.61 -11492.79 1.00
 meanlog                1.73      1.73  0.01  0.01      1.71      1.75 1.00
 sdlog                  0.49      0.49  0.01  0.01      0.48      0.51 1.00
 onset_day_time[1]      0.46      0.43  0.28  0.35      0.04      0.93 1.00
 onset_day_time[2]      0.52      0.53  0.29  0.37      0.06      0.96 1.00
 onset_day_time[3]      0.52      0.53  0.29  0.36      0.05      0.95 1.00
 onset_day_time[4]      0.52      0.53  0.28  0.36      0.06      0.95 1.00
 onset_day_time[5]      0.46      0.44  0.28  0.34      0.04      0.94 1.00
 onset_day_time[6]      0.51      0.53  0.29  0.36      0.06      0.95 1.00
 onset_day_time[7]      0.46      0.44  0.28  0.36      0.04      0.93 1.00
 ess_bulk ess_tail
     1027     1873
     9848     2492
     8352     2993
     7682     2618
     7866     2268
     7520     2593
     7623     2482
     9064     2095
     7348     2152
    10657     2444

 # showing 10 of 5394 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
Take 10 minutes

Try re-simulating the delays using different parameters of the delay distribution. Can you establish under which conditions the bias in estimation gets worse?

Real-time estimation: truncation

The data set we have looked at so far in this session is a “final” data set representing an outbreak that has come and gone. However, information on delay distribution is often important during ongoing outbreaks as they can inform nowcasts and forecasts and help with broader interpretation of data.

Estimating delays in real time comes with particular challenges, as the timing of the cut-off might introduce a bias. If, for example, infections are exponentially increasing then there will be disproportionately more people with recent symptom onset. Without adjustment, this would artificially decrease the estimate of the mean delay compared to its true value for all infections. This happens because most infections are recent (due to the exponential increase), but later symptom onsets amongst these have not had a chance to happen yet.

Once again, we can simulate this effect, for example by imagining we would like to make an estimate on day 70 of our outbreak. Let us work with the original, un-censored data for the time from onset to hospitalisation so as to look at the issue of truncation in isolation:

df_realtime <- df |>
  mutate(onset_to_hosp = hosp_time - onset_time) |>
  filter(hosp_time <= 70)

Estimating delay distributions accounting for truncation

If we take the naïve mean of delays we get an underestimate as expected:

# truncated mean delay
mean(df_realtime$onset_to_hosp)
[1] 5.952562
# compare with the mean delay over the full outbreak
mean(df$hosp_time - df$onset_time, na.rm=TRUE)
[1] 6.382549
Take 5 minutes

Fit the lognormal model used above to the estimates from the truncated data, i.e. using the df_realtime data set. How far away from the “true” parameters do you end up?

res <- mod$sample(
  data = list(
    n = nrow(na.omit(df_realtime)),
    y = na.omit(df_realtime)$onset_to_hosp
  )
)
res
 variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
  lp__    -141.74 -141.45 0.97 0.73 -143.67 -140.79 1.00     1632     2449
  meanlog    1.67    1.67 0.03 0.03    1.62    1.73 1.00     3836     2561
  sdlog      0.47    0.47 0.02 0.02    0.43    0.51 1.00     3730     2855

Once again, we can write a model that adjusts for truncation, by re-creating the simulated truncation effect in the stan model:

tmod <- nfidd_cmdstan_model("truncated-delay-model")
tmod
 1: data {
 2:   int<lower = 0> n;
 3:   array[n] real<lower = 0> onset_to_hosp;
 4:   array[n] real<lower = 0> time_since_onset;
 5: }
 6: 
 7: parameters {
 8:   real meanlog;
 9:   real<lower = 0> sdlog;
10: }
11: 
12: model {
13:   meanlog ~ normal(0, 10);
14:   sdlog ~ normal(0, 10) T[0, ];
15: 
16:   for (i in 1:n) {
17:     onset_to_hosp[i] ~ lognormal(meanlog, sdlog) T[0, time_since_onset[i]];
18:   }
19: }
Take 5 minutes

Familiarise yourself with the model above. Which line introduces the truncation, i.e. the fact that we have not been able to observe hospitalisation times beyond the cutoff of (here) 70 days?

Line 17 defines the upper limit of onset_to_hosp as time_since_onset.

Now we can use this model to re-estimate the parameters of the delay distribution:

tres <- tmod$sample(
  data = list(
    n = nrow(df_realtime),
    onset_to_hosp = df_realtime$onset_to_hosp, 
    time_since_onset = 70 - df_realtime$onset_time
  )
)
tres
 variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
  lp__    -115.58 -115.29 0.98 0.69 -117.53 -114.66 1.00     1882     2267
  meanlog    1.77    1.77 0.04 0.04    1.71    1.84 1.00     2220     1907
  sdlog      0.50    0.50 0.03 0.03    0.46    0.55 1.00     2325     2549
Take 10 minutes

Try re-simulating the delays using different parameters of the delay distribution. Can you establish under which conditions the bias in estimation gets worse?

Going further

  • We have looked at censoring and truncation separately, but in reality often both are present. Can you combine the two in a model?
  • The solutions we introduced for addressing censoring and truncation are only some possible ones for the censoring problem. There are other solutions that reduce the biases from estimation even further. For a full overview, the review by Park et al. might be worth a read. If you are feeling adventurous, try to implement one or more of them in the stan model above - with a warning that this can get quite involved very quickly.

Wrap up