library("nfidd")
library("ggplot2")
library("bayesplot")
library("posterior")
Probability distributions and parameter estimation
Introduction
Many important characteristics, or parameters, of epidemiological processes are not fully observed - and are therefore uncertain. For example, in this course this might include time delays, reproduction numbers, or case numbers now and in the future. We can specify the shape of uncertainty around a specific parameter using a probability distribution.
We’ll want to correctly specify this uncertainty. We can do this best by combining our own prior understanding with the data that we have available. In this course, we’ll use this Bayesian approach to modelling. A useful tool for creating these models is the stan
probabilistic programming language.
Slides
Objectives
The aim of this session is to introduce the concept of probability distributions and how to estimate their parameters using Bayesian inference with stan
.
Source file
The source file of this session is located at sessions/R-Stan-and-statistical-concepts.qmd
.
Libraries used
In this session we will use the ggplot2
library for plotting and the nfidd
package to load the stan models. We will also use the bayesplot
and posterior
packages to investigate the results of the inference conducted with stan.
The best way to interact with the material is via the Visual Editor of RStudio.
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr
show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.
set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)
Simulating data from a probability distribution
First let us simulate some data from a probability distribution. In R, this is usually done using a family of random number generation functions that start with r
. For example, to simulate random numbers from a normal distribution you would use the rnorm()
function. All these functions have a first argument n
, the number of random replicates to generate, and then some further arguments that parameterise the probability distribution.
rnorm(n = 10, mean = 0, sd = 1)
[1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499
[7] 0.46091621 -1.26506123 -0.68685285 -0.44566197
We will use probability distributions to characterise epidemiological delays. These are usually strictly positive because e.g. one cannot develop symptoms before becoming infected, and therefore the incubation period cannot be less than zero. Note that this does not necessarily apply to all distributions, e.g. serial intervals can be negative if person X infects Y but Y develops symptoms first.
Probability distribution that are commonly used in this situation are the gamma or lognormal distributions. These are fairly similar, with one difference being that the lognormal commonly has a “heavier tail” on the right, i.e., gives more probability to occasional very large values, whereas the gamma distribution has more of a tail on the left, i.e., gives more probability to values lower than the mean.
The gamma distribution is characterised by the shape
(\(\alpha\)) and rate
(\(\beta\)) parameters, with a mean of \(\alpha/\beta\) and variance \(\alpha/\beta^2\). A gamma distribution with mean 5 and variance 2, for example, has \(\alpha = 12.5\) and \(\beta = 2.5\). To simulate from such a distribution, we can use the following R code.
### simulate gamma with mean 5, variance 2
<- rgamma(100, shape = 12.5, rate = 2.5)
gammas head(gammas)
[1] 6.645975 5.311519 5.371387 4.954592 4.060700 6.761718
mean(gammas)
[1] 4.981108
var(gammas)
[1] 1.361297
The lognormal distribution is characterised by the meanlog
(\(\mu\)) and sdlog
(\(\sigma\)) parameters, with a mean of \(e^{\mu + 0.5\sigma^2}\) and variance \((e^{\sigma^2} - 1) e^{2\mu + \sigma^2}\). A lognormal distribution with mean 5 and variance 2, for example, has (after a bit of calculation) \(\mu = 1.57\) and \(\sigma = 0.28\).
### simulate lognormals with mean 5, variance 2
<- rlnorm(100, meanlog = 1.57, sdlog = 0.28)
lognormals head(lognormals)
[1] 4.306326 4.816148 2.390365 3.655954 5.731626 4.694730
mean(lognormals)
[1] 4.882255
var(lognormals)
[1] 2.362177
We can now plot the two distributions.
<- rbind(
df data.frame(dist = "lognormal", randoms = lognormals),
data.frame(dist = "gamma", randoms = gammas)
)ggplot(df, aes(x = randoms, fill = dist)) +
geom_density(alpha = 0.5)
We have used geom_density()
to get smooth lines. Alternatively we could have used, e.g., geom_histogram()
to plot the raw data.
Estimating the parameters of probability distributions
We will now use stan to estimate the parameters of the probability distribution. To do so, we first load in the stan model. Normally, you would do this by saving the model code in a text file called gamma_model.stan
and then loading this using cmdstanr::cmdstan_model()
. For this course, we included all the models in the nfidd
package, and they can be accessed using nfidd_cmdstan_model()
which uses cmdstanr::cmdstan_model()
to load the specified model included with the package. Afterwards, we can use this model exactly how we would if we had loaded it using cmdstanr
directly.
### load gamma model from the nfidd package
<- nfidd_cmdstan_model("gamma")
mod ### show model code
mod
1: // gamma_model.stan
2: data {
3: int<lower=0> N;
4: array[N] real y;
5: }
6:
7: parameters {
8: real<lower=0> alpha;
9: real<lower=0> beta;
10: }
11:
12: model {
13: alpha ~ normal(0, 10) T[0,];
14: beta ~ normal(0, 10) T[0,];
15: y ~ gamma(alpha, beta);
16: }
On line 4 there is a data declaration starting with array[n]
. This declares an array of size n
of the type given afterwards (here: real
). Arrays work in a similar way as arrays or vectors in R and its elements be accessed with the bracket operator [
. For example, to get the third element of the array y
you would write y[3]
.
Familarise yourself with the model above. Do you understand all the lines? Which line(s) define the parameter prior distribution(s), which one(s) the likelihood, and which one(s) the data that has to be supplied to the model?
Lines 13 and 14 define the parametric prior distributions (for parameters alpha and beta). Line 15 defines the likelihood. Lines 3 and 4 define the data that has to be supplied to the model.
We use the model we have defined in conjunction with the gamma distributed random numbers generated earlier to see if we can recover the parameters of the gamma distribution used. Once you have familiarised yourself with the model, use the sample()
function to fit the model.
<- list(
stan_data N = length(gammas),
y = gammas
)<- mod$sample(data = stan_data) gamma_fit
The stan_data
object is a list with elements that will is passed to the stan model as the data
argument to sample
. The names and types of the elements need to correspond to the data block in the model (see lines 3 and 4 in the model). Here, we pass the length of gammas
as N
and the vector gammas
itself as y
.
Themod$sample
command will produce a lot of messages which we have suppressed above. This is fine and intended to keep the user informed about any issues as well as general progress with the inference. This will come in handy later in the course when we fit more complicated models that can take a little while to run.
In order to view a summary of the posterior samples generated, use
gamma_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -151.25 -150.92 1.09 0.79 -153.39 -150.20 1.01 913 993
alpha 17.98 17.88 2.57 2.73 13.94 22.38 1.01 519 430
beta 3.61 3.59 0.53 0.55 2.78 4.51 1.01 514 466
You can see that the estimates are broadly consistent with the parameters we specified. To investigate this further, we will conduct a so-called posterior predictive check by comparing random numbers simulated using the estimated parameters to the ones we simulated earlier.
## Extract posterior draws
<- as_draws_df(gamma_fit$draws())
gamma_posterior head(gamma_posterior)
# A draws_df: 6 iterations, 1 chains, and 3 variables
lp__ alpha beta
1 -150 18 3.5
2 -150 17 3.4
3 -151 21 4.2
4 -151 20 4.0
5 -150 20 4.0
6 -151 20 3.9
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
## Generate posterior predictive samples
<- sapply(seq_along(gammas), function(i) {
gamma_ppc rgamma(n = length(gammas),
shape = gamma_posterior$alpha[i],
rate = gamma_posterior$beta[i])
})
## Plot posterior predictive check
ppc_dens_overlay(y = gammas, yrep = gamma_ppc)
We can see that the random numbers generated from the posterior samples are distributed relatively evenly around the data (in black), i.e., the samples generated earlier that we fitted to.
Going further
- For the model above we chose truncated normal priors with a mode at 0 and standard deviation 10. If you change the parameters of the prior distributions, does it affect the results?
- You could try the model included in
lognormal.stan
to estimate parameters of the lognormal distribution.