library("nfidd")
library("ggplot2")
library("dplyr")
library("tidyr")
library("lubridate")
library("tidybayes")
Biases in delay distributions
Introduction
So far, we’ve looked at the uncertainty of the time delays between epidemiological events. The next challenge is that our information on these delays is usually biased, especially when we’re analysing data in real time. We’ll consider two types of biases that commonly occur in reported infectious disease data:
- Censoring: when we know an event occurred at some time, but not exactly when.
- Truncation: when not enough time has passed for all the relevant epidemiological events to occur or be observed.
We can again handle these by including them as uncertain parameters in the modelling process.
Slides
Objectives
In this session, we’ll introduce censoring and right truncation as typical properties of the process that generates infectious disease data sets, using the delay from symptom onset to hospitalisation as an example.
Source file
The source file of this session is located at sessions/biases-in-delay-distributions.qmd.
Libraries used
In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the ggplot2 package for plotting, the dplyr and tidyr packages to wrangle data, the lubridate package to deal with dates, and the tidybayes package for extracting results from model fits.
The best way to interact with the material is via the Visual Editor of RStudio.
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.
set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)
Load data
We will use the same simulated data set as in the session on delay distributions.
Remember, in this outbreak we are assuming:
- the incubation period is gamma-distributed with shape 5 and rate 1, i.e. a mean of 5 days
- the time from onset to hospital admission is lognormally-distributed, with meanlog 1.75 and sdlog 0.5, i.e. a mean delay of about a week
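As a quick check, the means stated above follow from the standard formulas for the gamma and lognormal distributions:
# Gamma mean: shape / rate
5 / 1
# Lognormal mean: exp(meanlog + sdlog^2 / 2), roughly 6.5 days ("about a week")
exp(1.75 + 0.5^2 / 2)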
We use the same function we used previously to simulate symptom onset and hospitalisation data.
df <- add_delays(infection_times)
This creates the df data frame that we can inspect e.g. using
head(df)
infection_time onset_time hosp_time
1 0.000000 3.389585 NA
2 2.236708 9.615665 NA
3 4.091861 5.721422 8.351006
4 7.347199 12.125639 NA
5 8.990060 17.863624 NA
6 4.635069 10.165931 19.976118
Dates, not days: censoring
So far, the data we’ve been using has represented time as a continuous variable: a numeric time since a given starting point. Data on health outcomes are usually not recorded in this way. Instead, we usually deal with dates: a discrete interval. We refer to this restriction of the data, where we only know that an event has occurred within a given time period (e.g. a day) but not when exactly within that period, as interval censoring.
We can make our simulated dataset a bit more realistic by rounding down the event times to an integer number of days.
# Use the floor() function to round down to integers
df_dates <- df |>
  mutate(
    infection_time = floor(infection_time),
    onset_time = floor(onset_time),
    hosp_time = floor(hosp_time)
  )
head(df_dates)
infection_time onset_time hosp_time
1 0 3 NA
2 2 9 NA
3 4 5 8
4 7 12 NA
5 8 17 NA
6 4 10 19
The pattern floor(vector / interval) * interval allows you to simulate censoring at different time intervals:
# Daily censoring (interval = 1) - what we used above
floor(times / 1) * 1
# Weekly censoring (interval = 7)
floor(times / 7) * 7
# Hourly censoring (interval = 1/24)
floor(times / (1/24)) * (1/24)
This rounds values down to the nearest interval boundary, which is useful for exploring how different reporting frequencies affect delay estimation.
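For example, a weekly-censored version of our simulated data could be created like this (a quick sketch; df_weeks is only for illustration and not used later in this session):
# Weekly censoring of the event times (interval = 7 days)
df_weeks <- df |>
  mutate(
    infection_time = floor(infection_time / 7) * 7,
    onset_time = floor(onset_time / 7) * 7,
    hosp_time = floor(hosp_time / 7) * 7
  )
head(df_weeks)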
As before, we are still not working with dates but with numbers. This makes handling the data easier - we don’t have to make any conversions before using the data in stan.
Each of the numbers now represent the number of days that have passed since the start of the outbreak. That is, each of the numbers correspond to a day. In that sense, the data is more like typical data we get from infectious disease outbreaks, where we would usually have a line list with key events such as symptom onset or death reported by a date. In statistical terms, we call the data double interval censored:
- “double” because the delays represent the time between two events that are both censored
- “interval” because all we know about the timings of the events is that they happened in a certain time interval (between 0:00 and 23:59 on the recorded day).
We’ll be spending a lot of time thinking about a potentially “short” 24 hour time interval. In this session, we’ll start to get a sense for how variation, uncertainty, and bias within this interval might combine to impact daily outbreak estimates. The concepts and techniques that we use here also apply to longer time intervals that we might find in surveillance data, for example, reported case counts aggregated by week.
Estimating delay distributions accounting for censoring
Let’s estimate the time from symptom onset to hospitalisation with the censored data.
A naïve approach to estimating the delay would be to ignore the fact that the data are censored. To estimate the delay from onset to hospitalisation, we could just use the difference between the censored times, which is an integer (the number of days).
df_dates <- df_dates |>
  mutate(
    incubation_period = onset_time - infection_time,
    onset_to_hosp = hosp_time - onset_time
  )
To help us think about this, let’s summarise the original time-based incubation period
summary(df$onset_time - df$infection_time)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3569 3.3499 4.6329 4.9559 6.2027 18.5544
and then the date-based incubation period as we observe it.
summary(df_dates$incubation_period)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 3.00 5.00 4.96 6.00 19.00
You should see that they have nearly the same means but different medians (by about half a day).
Fit the lognormal model used in the session on delay distributions to the estimates from the rounded data, i.e. using the df_dates
data set. Do you still recover the parameters that we put in?
mod <- nfidd_cmdstan_model("lognormal")
res <- nfidd_sample(mod,
  data = list(
    n = nrow(na.omit(df_dates)),
    y = na.omit(df_dates)$onset_to_hosp
  )
)
res
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -1320.90 -1320.57 0.98 0.68 -1322.92 -1319.98 1.00 921 1240
meanlog 1.73 1.73 0.01 0.01 1.71 1.75 1.00 1748 1431
sdlog 0.50 0.50 0.01 0.01 0.49 0.52 1.00 1888 1517
We can calculate the mean and standard deviation to see the bias more clearly:
res |>
  summarise_lognormal()
mean sd
Min. :6.175 Min. :3.186
1st Qu.:6.351 1st Qu.:3.395
Median :6.405 Median :3.452
Mean :6.406 Mean :3.453
3rd Qu.:6.461 3rd Qu.:3.508
Max. :6.666 Max. :3.761
Usually the estimates will be further from the “true” parameters than before when we worked with the unrounded data.
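For comparison, the true mean and standard deviation implied by the simulation parameters (meanlog 1.75, sdlog 0.5) follow from the standard lognormal formulas:
# True mean and standard deviation of the simulated onset-to-hospitalisation
# delay: roughly 6.5 and 3.5 days respectively
true_mean <- exp(1.75 + 0.5^2 / 2)
true_sd <- true_mean * sqrt(exp(0.5^2) - 1)
c(mean = true_mean, sd = true_sd)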
There are many ad-hoc solutions to this problem that, for example, introduce a shift in the data by half a day to centre it on mid-day, or discretise the distribution and use the difference between two cumulative distribution functions evaluated a day or two days apart. Of these, all but the two-day interval approach introduce more bias than doing nothing as above. See Park et al. (2024) for a detailed discussion of approximate approaches.
To properly account for double interval censoring, we need to modify the model to include the fact that we don’t know when exactly on any given day the event happened. For example, if we know that symptom onset of an individual occurred on 20 June, 2024, and they were admitted to hospital on 22 June, 2024, this could mean an onset-to-hospitalisation delay from 1 day (onset at 23:59 on the 20th, admitted at 0:01 on the 22nd) to 3 days (onset at 0:01 on the 20th, admitted at 23:59 on the 22nd).
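To make this concrete, here is a small simulation sketch (not part of the course code) of the range of true delays consistent with an onset date of day 20 and an admission date of day 22:
# With dates recorded as whole days, the true delay is
# (22 + hosp_day_time) - (20 + onset_day_time), with both day times in [0, 1)
onset_day_time <- runif(1000, 0, 1)
hosp_day_time <- runif(1000, 0, 1)
true_delay <- (22 + hosp_day_time) - (20 + onset_day_time)
range(true_delay)  # anywhere between just over 1 and just under 3 days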
We can use this in our delay estimation by making the exact time of the events based on the dates given part of the estimation procedure:
cmod <- nfidd_cmdstan_model("censored-delay-model")
cmod
1: data {
2: int<lower = 0> n;
3: array[n] int<lower = 1> onset_to_hosp;
4: }
5:
6: parameters {
7: real meanlog;
8: real<lower = 0> sdlog;
9: array[n] real<lower = 0, upper = 1> onset_day_time;
10: array[n] real<lower = 0, upper = 1> hosp_day_time;
11: }
12:
13: transformed parameters {
14: array[n] real<lower = 0> true_onset_to_hosp;
15: for (i in 1:n) {
16: true_onset_to_hosp[i] =
17: onset_to_hosp[i] + hosp_day_time[i] - onset_day_time[i];
18: }
19: }
20:
21: model {
22: meanlog ~ normal(0, 10);
23: sdlog ~ normal(0, 10) T[0, ];
24: onset_day_time ~ uniform(0, 1);
25: hosp_day_time ~ uniform(0, 1);
26:
27: true_onset_to_hosp ~ lognormal(meanlog, sdlog);
28: }
Familiarise yourself with the model above. Do you understand all the lines? Which line(s) define the parameter prior distribution(s), which one(s) the likelihood, and which one(s) reflect that we have now provided the delay as the difference in integer days?
Lines 22-25 define the parametric prior distributions (for the parameters meanlog and sdlog, and for the estimated exact times of day of the events). Line 27 defines the likelihood. Lines 16-17 reflect that we have provided the delay as a difference in integer days, adjusted by the estimated times of day.
Now we can use this model to re-estimate the parameters of the delay distribution:
cres <- nfidd_sample(cmod,
  data = list(
    n = nrow(na.omit(df_dates)),
    onset_to_hosp = na.omit(df_dates)$onset_to_hosp
  )
)
cres
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -11577.76 -11577.85 51.79 50.78 -11664.91 -11490.40 1.00 532 910
meanlog 1.73 1.73 0.01 0.01 1.72 1.75 1.00 5172 1507
sdlog 0.49 0.49 0.01 0.01 0.48 0.51 1.00 4672 1396
onset_day_time[1] 0.45 0.43 0.29 0.36 0.03 0.95 1.00 4885 1060
onset_day_time[2] 0.53 0.54 0.29 0.37 0.05 0.97 1.00 5298 941
onset_day_time[3] 0.52 0.54 0.29 0.37 0.06 0.95 1.01 4139 1223
onset_day_time[4] 0.51 0.51 0.29 0.38 0.05 0.95 1.01 5155 1019
onset_day_time[5] 0.45 0.44 0.28 0.35 0.04 0.93 1.01 2991 1151
onset_day_time[6] 0.51 0.52 0.29 0.36 0.05 0.95 1.01 3818 1006
onset_day_time[7] 0.45 0.43 0.27 0.34 0.05 0.91 1.00 4090 1201
# showing 10 of 5394 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
We can also examine the mean and standard deviation of the estimated delay distribution:
cres |>
  summarise_lognormal()
mean sd
Min. :6.143 Min. :3.022
1st Qu.:6.336 1st Qu.:3.276
Median :6.389 Median :3.336
Mean :6.391 Mean :3.339
3rd Qu.:6.444 3rd Qu.:3.397
Max. :6.641 Max. :3.647
Try re-simulating the delays using different parameters of the delay distribution. Can you establish under which conditions the bias in estimation gets worse?
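One possible starting point is sketched below; the lognormal parameters are arbitrary and, for simplicity, every onset is given a hospitalisation time (unlike in the original data):
# Re-simulate onset-to-hospitalisation delays with hypothetical parameters
# (meanlog 2, sdlog 0.3), apply daily censoring, and then repeat the fits above
df_resim <- df |>
  mutate(hosp_time = onset_time + rlnorm(n(), meanlog = 2, sdlog = 0.3)) |>
  mutate(
    onset_time = floor(onset_time),
    hosp_time = floor(hosp_time)
  ) |>
  mutate(onset_to_hosp = hosp_time - onset_time)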
Real-time estimation: truncation
The data set we have looked at so far in this session is a “final” data set representing an outbreak that has come and gone. However, information on delay distributions is often important during ongoing outbreaks as they can inform nowcasts and forecasts and help with broader interpretation of data.
Estimating delays in real time comes with particular challenges, as we know that our data are usually incomplete: not enough time has passed to observe all of the relevant events that could have happened. We call this incompleteness right truncation. This is a problem if we try to use an incomplete set of data about one type of event to infer the pattern of a previous epidemiological event. For example, our target might be to infer the “primary” event of infections. We want to do this using some observed data about a “secondary” event, such as recent symptom onsets. However, we are still in the middle of an outbreak. Some infected cases haven’t yet presented with symptoms. So we only have a partial, incomplete count to work with. When we condition on an observed event with a delay in this way, we get right truncation of the target event. Our estimates might be biased, because we are missing some of the data to get the full picture.
This bias also changes over the course of an outbreak. If, for example, infections are exponentially increasing, then there will be disproportionately more people with recent symptom onset. Without adjustment, this would artificially decrease the estimate of the mean delay compared to its true value for all infections that have in fact already occurred. This happens because most infections are recent (due to the exponential increase), but later symptom onsets amongst these have not had a chance to happen yet. So among a cohort of people with symptom onset at the same time, those with shorter delays to onset would be over-represented.
Once again, we can simulate this effect, for example by imagining we would like to make an estimate on day 70 of our outbreak. Let us work with the original, un-censored data for the time from onset to hospitalisation so as to look at the issue of truncation in isolation:
df_realtime <- df |>
  mutate(onset_to_hosp = hosp_time - onset_time) |>
  filter(hosp_time <= 70)
Where we cut off observation (i.e. at what time in our outbreak) affects the amount of truncation in our data. Ending observation close to periods of exponential growth will lead to more truncation than cutting off observation towards the end of an outbreak, when it has been stable or infections have been declining for some time.
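A quick sketch to see this effect: compute the naive mean delay for a few different observation cut-offs.
# Naive mean onset-to-hospitalisation delay at different observation cut-offs
for (cutoff in c(30, 50, 70, 90)) {
  observed <- df |> filter(hosp_time <= cutoff)
  cat("cut-off day", cutoff, ": mean delay",
      round(mean(observed$hosp_time - observed$onset_time), 2), "days\n")
}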
Estimating delay distributions accounting for truncation
If we take the naïve mean of delays we get an underestimate as expected:
# truncated mean delay
mean(df_realtime$onset_to_hosp)
[1] 5.952562
# compare with the mean delay over the full outbreak
mean(df$hosp_time - df$onset_time, na.rm=TRUE)
[1] 6.382549
Fit the lognormal model used above to the estimates from the truncated data, i.e. using the df_realtime
data set. How far away from the “true” parameters do you end up?
res <- nfidd_sample(mod,
  data = list(
    n = nrow(na.omit(df_realtime)),
    y = na.omit(df_realtime)$onset_to_hosp
  )
)
res
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -141.75 -141.42 1.05 0.70 -143.92 -140.79 1.00 882 1140
meanlog 1.67 1.67 0.03 0.03 1.62 1.73 1.00 1559 1311
sdlog 0.47 0.47 0.02 0.02 0.43 0.51 1.00 1752 1161
The mean and standard deviation show the underestimation due to truncation:
res |>
  summarise_lognormal()
mean sd
Min. :5.316 Min. :2.323
1st Qu.:5.813 1st Qu.:2.807
Median :5.942 Median :2.939
Mean :5.951 Mean :2.957
3rd Qu.:6.073 3rd Qu.:3.089
Max. :6.719 Max. :4.043
Once again, we can write a model that adjusts for truncation, by re-creating the simulated truncation effect in the stan model:
tmod <- nfidd_cmdstan_model("truncated-delay-model")
tmod
1: data {
2: int<lower = 0> n;
3: array[n] real<lower = 0> onset_to_hosp;
4: array[n] real<lower = 0> time_since_onset;
5: }
6:
7: parameters {
8: real meanlog;
9: real<lower = 0> sdlog;
10: }
11:
12: model {
13: meanlog ~ normal(0, 10);
14: sdlog ~ normal(0, 10) T[0, ];
15:
16: for (i in 1:n) {
17: onset_to_hosp[i] ~ lognormal(meanlog, sdlog) T[0, time_since_onset[i]];
18: }
19: }
Familiarise yourself with the model above. Which line introduces the truncation, i.e. the fact that we have not been able to observe hospitalisation times beyond the cutoff of (here) 70 days?
Line 17 defines the upper limit of onset_to_hosp as time_since_onset.
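The upper truncation on line 17 corresponds to dividing the lognormal density by the probability of the delay being at most the time since onset. A rough sketch of the same adjustment in R, with hypothetical values:
# Truncated lognormal density: density divided by the probability that the
# delay falls within the observation window (values here are hypothetical)
meanlog <- 1.75
sdlog <- 0.5
delay <- 5
time_since_onset <- 8
dlnorm(delay, meanlog, sdlog) / plnorm(time_since_onset, meanlog, sdlog)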
Now we can use this model to re-estimate the parameters of the delay distribution:
tres <- nfidd_sample(tmod,
  data = list(
    n = nrow(df_realtime),
    onset_to_hosp = df_realtime$onset_to_hosp,
    time_since_onset = 70 - df_realtime$onset_time
  )
)
tres
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -115.63 -115.34 1.00 0.74 -117.74 -114.66 1.00 1018 1080
meanlog 1.77 1.77 0.04 0.04 1.71 1.84 1.00 1056 1264
sdlog 0.50 0.50 0.03 0.03 0.46 0.56 1.00 1152 1113
Let’s also check the mean and standard deviation of the truncation-adjusted delay distribution:
tres |>
  summarise_lognormal()
mean sd
Min. :5.623 Min. :2.685
1st Qu.:6.452 1st Qu.:3.328
Median :6.661 Median :3.568
Mean :6.689 Mean :3.605
3rd Qu.:6.907 3rd Qu.:3.829
Max. :8.232 Max. :5.216
Try estimating the delays at different times (i.e. varying the observation cut-off) and also try re-simulating the delays using different parameters of the delay distribution. Can you establish under which conditions the bias in estimation gets worse?
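As a starting point (a sketch), the data preparation and fit above can be repeated with a different observation cut-off, for example 50 days:
# Refit the truncation model with an earlier observation cut-off (here 50 days)
cutoff <- 50
df_cut <- df |>
  mutate(onset_to_hosp = hosp_time - onset_time) |>
  filter(hosp_time <= cutoff)
tres_cut <- nfidd_sample(tmod,
  data = list(
    n = nrow(df_cut),
    onset_to_hosp = df_cut$onset_to_hosp,
    time_since_onset = cutoff - df_cut$onset_time
  )
)
tres_cut |>
  summarise_lognormal()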
Going further
Challenge
- We have looked at censoring and truncation separately, but in reality often both are present. Can you combine the two in a model?
- The solutions we introduced for addressing censoring and truncation are only some of the possible approaches. There are other solutions that reduce the biases from estimation even further. For a full overview, see the review by Park et al. (2024).
Methods in practice
- Charniga et al. (2024) summarises challenges in estimating delay distributions into a set of best practices, with a practical flowchart and reporting checklist.
- The primarycensored R package provides a comprehensive framework for dealing with censored and truncated delay distributions. It implements methods for handling primary event censoring, secondary event censoring, and right truncation in a unified and efficient manner.
- The epidist package extends primarycensored to work with brms for estimating epidemiological delay distributions. It enables flexible modelling including time-varying components, spatial effects, and partially pooled estimates of demographic characteristics.
Wrap up
- Review what you’ve learned in this session with the learning objectives
- Share your questions and thoughts