Forecasting concepts

Introduction

So far we’ve looked at epidemiological processes and events up until the present. Now we’ll start to forecast into the future, while considering what we might want from a “good” forecast. After enough time has passed to observe events, we can then retrospectively evaluate how well forecasts performed. We’ll introduce evaluating forecasts qualitatively by visual inspection, and quantitatively using scoring metrics.

Slides

Objectives

The aim of this session is to introduce the concept of forecasting using a simple model, and to introduce forecast evaluation.

Source file

The source file of this session is located at sessions/forecasting-concepts.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the dplyr and tidyr packages for data wrangling, the ggplot2 package for plotting, the tidybayes package for extracting results of the inference, and the scoringutils package for evaluating forecasts.

library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("tidybayes")
library("scoringutils")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do. We also set an option that makes cmdstanr show line numbers when printing model code. This is not strictly necessary but will help us talk about the models.

set.seed(123)
options(cmdstanr_print_line_numbers = TRUE)

What is forecasting?

Forecasting is the process of making predictions about the future based on past and present data. In the context of infectious disease epidemiology, forecasting is usually the process of predicting the future course of some metric of infectious disease incidence or prevalence based on past and present data. Here we focus on forecasting observed data (the number of individuals with new symptom onset) but forecasts can also be made for other quantities of interest such as the number of infections, the reproduction number, or the number of deaths. Epidemiological forecasting is closely related to nowcasting and, when using mechanistic approaches, estimation of the reproduction number. In fact, the model we will use for forecasting is the same as the model we used for nowcasting and estimation of the reproduction number. The only difference is that we will extend the model into the future. In order to make things simpler we will remove the nowcasting part of the model, but in principle all these approaches could be combined in a single model.

What do we look for in a good forecast? Some food for thought:
  1. Calibration: The forecast should be well calibrated. This means that the forecasted probabilities should match the observed frequencies. For example, if the model predicts a 50% probability of an event occurring, then the event should occur approximately 50% of the time (see the toy sketch after this list).
  2. Unbiasedness: The forecast should be unbiased. This means that the average forecasted value should be equal to the average observed value. It shouldn’t consistently over- or underpredict.
  3. Accuracy: The forecast should be accurate. This means that the forecasted values should be close to the observed values.
  4. Sharpness: As long as the other conditions are fulfilled we want prediction intervals to be as narrow as possible. Predicting that “anything can happen” might be correct but not very useful.
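
To make the idea of calibration concrete, here is a toy sketch (not part of the session code; the distributions are made up for illustration). A forecaster that issues the true Poisson(10) distribution should see its 90% prediction intervals cover roughly 90% of observations, whereas an overconfident (too sharp) interval covers far fewer.

# Toy illustration of calibration (not session code): coverage of 90% intervals
set.seed(42)
observed <- rpois(1000, lambda = 10)                   # "true" data
well_calibrated <- qpois(c(0.05, 0.95), lambda = 10)   # interval from the true model
too_sharp <- c(9, 11)                                  # an overconfident interval
mean(observed >= well_calibrated[1] & observed <= well_calibrated[2])  # roughly 0.9
mean(observed >= too_sharp[1] & observed <= too_sharp[2])              # well below 0.9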

Extending a model into the future

The model we introduced in the renewal equation session modelled the reproduction number using a random walk, then used a discrete renewal process to model the number of infections, and convolved these with a delay distribution to model the number of onsets with Poisson observation error. Based on what we found in the nowcasting session this seems like a reasonable model for the data and so we might want to use it to forecast into the future.

We can do this by simulating the generative model forward in time. To do this we use the same model as in the renewal equation session but add a generated quantities block to create the forecasts. In a stan model this block is used for any calculations that depend only on other parameters and are not themselves constrained by the data. So what does this look like in code? Let’s load in the model again and take a look.

mod <- nfidd_cmdstan_model("estimate-inf-and-r-rw-forecast")
mod
 1: functions {
 2:   #include "functions/convolve_with_delay.stan"
 3:   #include "functions/renewal.stan"
 4:   #include "functions/geometric_random_walk.stan"
 5: }
 6: 
 7: data {
 8:   int n;                // number of days
 9:   int I0;              // number initially infected
10:   array[n] int obs;     // observed symptom onsets
11:   int gen_time_max;     // maximum generation time
12:   array[gen_time_max] real gen_time_pmf;  // pmf of generation time distribution
13:   int<lower = 1> ip_max; // max incubation period
14:   array[ip_max + 1] real ip_pmf;
15:   int h;                // number of days to forecast
16: }
17: 
18: transformed data {
19:    int m = n + h;
20: }
21: 
22: parameters {
23:   real<lower = -1, upper = 1> init_R;         // initial reproduction number
24:   array[m-1] real rw_noise; // random walk noise
25:   real<lower = 0, upper = 1> rw_sd; // random walk standard deviation
26: }
27: 
28: transformed parameters {
29:   array[m] real R = geometric_random_walk(init_R, rw_noise, rw_sd);
30:   array[m] real infections = renewal(I0, R, gen_time_pmf);
31:   array[m] real onsets = convolve_with_delay(infections, ip_pmf);
32: }
33: 
34: model {
35:   // priors
36:   init_R ~ normal(-.1, 0.5); // Approximately Normal(1, 0.5)
37:   rw_noise ~ std_normal();
38:   rw_sd ~ normal(0, 0.05) T[0,];
39:   obs ~ poisson(onsets[1:n]);
40: }
41: 
42: generated quantities {
43:   array[h] real forecast;
44:   if (h > 0) {
45:     for (i in 1:h) {
46:       forecast[i] = poisson_rng(onsets[n + i]);
47:     }
48:   }
49: }

Take 5 minutes

What have we changed in the model to make it a forecasting model? Do you see any limitations of this approach?

  • What have we changed in the model to make it a forecasting model?
    • Added the h parameter to the data list to specify the number of days to forecast into the future.
    • Added the m parameter as a piece of transformed data (i.e. a calculation that only uses data) that is the total number of days to include in the model (i.e. the number of days in the data plus the number of days to forecast).
    • m is then used in place of n for all arrays in the model. This means that rw_noise is now m - 1 long, and R, infections and onsets are m long.
    • As there are only n observations in the data, the likelihood only uses the first n elements of onsets.
    • To include observation error in the forecast a generated quantities block has been added which takes the last h onsets as the mean of a Poisson distribution and samples from this distribution to get the forecasted onsets.
  • Do you see any limitations of this approach?
    • Extending the arrays in the parameters and transformed parameters blocks to cover the forecast horizon increases the number of parameters and the amount of work we have to do when fitting the model. It would be more computationally efficient to have a separate model for forecasting.

Before we can forecast we need some data to fit the model to. In order to assess the quality of the forecasts we will also need some future (or hold-out) data that we can compare the forecasts to. We will use the same simulated data as in the renewal and nowcasting sessions. We will try to make a forecast on day 71 (assuming we don’t know what the data looks like after that day) as in the nowcasting session.

gen_time_pmf <- make_gen_time_pmf()
ip_pmf <- make_ip_pmf()
onset_df <- simulate_onsets(
  make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)
cutoff <- 71
filtered_onset_df <- onset_df |>
  filter(day <= cutoff)
tail(onset_df)
# A tibble: 6 × 3
    day onsets infections
  <dbl>  <int>      <int>
1   137      6          5
2   138      2          1
3   139      2          4
4   140      7          1
5   141      6          1
6   142      4          0

Fitting the model and forecast for 28 days into the future

We can now fit the model to the data and then make a forecast. This should look very similar to the code we used in the renewal session but with the addition of a non-zero h in the data list.

horizon <- 28

data <- list(
  n = nrow(filtered_onset_df),
  I0 = 1,
  obs = filtered_onset_df$onsets,
  gen_time_max = length(gen_time_pmf),
  gen_time_pmf = gen_time_pmf,
  ip_max = length(ip_pmf) - 1,
  ip_pmf = ip_pmf,
  h = horizon # Here we set the number of days to forecast into the future
)
rw_forecast <- mod$sample(
  data = data, parallel_chains = 4, adapt_delta = 0.95,
  init = \() list(init_R = 0, rw_sd = 0.01)
)
rw_forecast
    variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
 lp__        3328.23 3328.66 7.32 7.31 3315.60 3339.48 1.00     1732     2490
 init_R         0.41    0.41 0.07 0.06    0.30    0.53 1.00     2168     1663
 rw_noise[1]    0.06    0.05 0.98 0.95   -1.59    1.70 1.00     4512     2297
 rw_noise[2]    0.05    0.04 0.97 0.95   -1.54    1.65 1.00     4757     2844
 rw_noise[3]    0.02    0.00 0.99 0.98   -1.58    1.67 1.00     2340      519
 rw_noise[4]    0.04    0.06 0.96 0.96   -1.54    1.64 1.00     4233     2825
 rw_noise[5]    0.01    0.01 0.99 0.98   -1.67    1.62 1.00     4748     2670
 rw_noise[6]   -0.05   -0.06 1.02 1.04   -1.68    1.61 1.00     4685     2834
 rw_noise[7]   -0.04   -0.03 0.98 0.98   -1.67    1.56 1.00     5234     2854
 rw_noise[8]   -0.09   -0.08 1.00 1.05   -1.72    1.53 1.00     3682     2775

 # showing 10 of 426 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
Note

Because this model can struggle to fit to the data, we have increased the value of adapt_delta from its default value of 0.8. This is a tuning parameter that affects the step size of the sampler in exploring the posterior (higher adapt_delta leading to smaller step sizes meaning posterior exploration is slower but more careful).
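
As an optional check (sketch only, output not shown here), cmdstanr fits can report a summary of sampler diagnostics such as divergent transitions, which is useful when deciding whether adapt_delta needs further tuning.

# Optional check: summarise sampler diagnostics (divergences, treedepth, E-BFMI)
rw_forecast$diagnostic_summary()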

Visualising the forecast

We can now visualise the forecast. We will first extract the forecast and then plot the forecasted number of symptom onsets alongside the observed number of symptom onsets.

forecast <- rw_forecast |>
  gather_draws(forecast[day]) |>
  ungroup() |> 
  mutate(day = day + cutoff)

target_onsets <- onset_df |>
  filter(day > cutoff) |>
  filter(day <= cutoff + horizon)
forecast |>
  filter(.draw %in% sample(.draw, 100)) |>
  ggplot(aes(x = day)) +
  geom_line(alpha = 0.1, aes(y = .value, group = .draw)) +
  geom_point(data = target_onsets, aes(x = day, y = onsets), color = "black") +
  labs(
    title = "Symptom onsets",
    subtitle = "Forecast (trajectories) and observed (points)"
  )

Take 5 minutes

What do you think of this forecast? Did the model do a good job? Is there another way you could visualise the forecast that might be more informative?

  • On the face of it the forecast looks very poor with some very high predictions compared to the data.
  • Based on this visualisation it is hard to tell if the model is doing a good job but it seems like it is not.
  • As outbreaks are generally considered to be exponential processes it might be more informative to plot the forecast on the log scale.
forecast |>
  filter(.draw %in% sample(.draw, 100)) |>
  ggplot(aes(x = day)) +
  geom_line(alpha = 0.1, aes(y = .value, group = .draw)) +
  geom_point(data = target_onsets, aes(x = day, y = onsets), color = "black") +
  scale_y_log10() +
  labs(title = "Symptom onsets, log scale", subtitle = "Forecast and observed")

This should be a lot more informative. We see that for longer forecast horizons the model is not doing a great job of capturing the reduction in symptom onsets. However, we can now see that the model seems to be producing very reasonable forecasts for the first week or so of the forecast. This is a common pattern in forecasting where a model is good at capturing the short term dynamics but struggles with the longer term dynamics.

As our forecasting model is based on the reproduction number, we can also visualise the forecast of the reproduction number. This can be helpful for understanding why our forecasts of symptom onsets look the way they do and for understanding the uncertainty in the forecasts. We can also compare this to the “true” reproduction number, estimated once all relevant data is available. To do this, we will fit the model again but with a later cutoff. Then we can compare the reproduction numbers produced as forecasts at the earlier time, with estimates at the later time that used more of the data.

long_onset_df <- onset_df |>
  filter(day <= cutoff + horizon)

long_data <- list(
  n = nrow(long_onset_df),
  I0 = 1,
  obs = long_onset_df$onsets,
  gen_time_max = length(gen_time_pmf),
  gen_time_pmf = gen_time_pmf,
  ip_max = length(ip_pmf) - 1,
  ip_pmf = ip_pmf,
  h = 0
)


rw_long <- mod$sample(
  data = long_data, parallel_chains = 4, adapt_delta = 0.95,
  init = \() list(init_R = 0, rw_sd = 0.01)
)

We first need to extract the forecast and estimated reproduction numbers.

forecast_r <- rw_forecast |>
  gather_draws(R[day]) |>
  ungroup() |> 
  mutate(type = "forecast")

long_r <- rw_long |>
  gather_draws(R[day]) |>
  ungroup() |>
  mutate(type = "estimate")

We can now plot the forecast and estimated reproduction numbers.

forecast_r |>
  bind_rows(long_r) |>
  filter(.draw %in% sample(.draw, 100)) |>
  ggplot(aes(x = day)) +
  geom_vline(xintercept = cutoff, linetype = "dashed") +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_line(
    aes(y = .value, group = interaction(.draw, type), color = type),
    alpha = 0.1
  ) +
  labs(
    title = "Estimated R",
    subtitle = "Estimated over whole time series (red), and forecast (blue)"
  ) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Note

The horizontal dashed line at 1 represents the threshold for epidemic growth. If the reproduction number is above 1 then the epidemic is growing, if it is below 1 then the epidemic is shrinking. The vertical dashed line represents the point at which we started forecasting.

Take 5 minutes

Can you use this plot to explain why the forecast of onsets looks the way it does?

  • When both models are being fit to data (i.e. before the vertical dashed line) the forecast and estimated reproduction numbers are very similar.
  • For short-term forecasts \(R_t\) estimates continue to be fairly similar.
  • However, the estimates have a consistent downwards trend which is not captured by the forecast (which looks like it has a constant mean value with increasing uncertainty).
  • This explains the divergence between the forecast and the data as the horizon increases.
  • It looks like only a relatively small number of forecast \(R_t\) trajectories grow to be very large but these are enough to visually dominate the forecast of onsets on the natural scale.
  • The performance we are seeing here makes sense given that random walks are defined to have a constant mean and increasing variance, as the short simulation below illustrates.
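
A quick simulation (illustration only, not part of the session model) shows this behaviour for a geometric random walk like the one used for \(R_t\): the mean of the simulated trajectories stays roughly flat while their spread grows with the horizon.

# Illustration only: simulate geometric random walks and summarise by horizon
set.seed(42)
h <- 28          # forecast horizon
n_sims <- 1000   # number of simulated trajectories
log_walks <- apply(matrix(rnorm(h * n_sims, sd = 0.05), nrow = h), 2, cumsum)
walks <- exp(log_walks)   # multiplicative (geometric) random walks starting at 1
data.frame(
  horizon = seq_len(h),
  mean = rowMeans(walks),       # stays close to 1
  sd = apply(walks, 1, sd)      # increases with horizon
)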

We managed to learn quite a lot about our model’s forecasting limitations just by visualising a single forecast. However, what if we wanted to quantify how well the model is doing? This is where forecast evaluation comes in, which we will cover in the next section.

Introduction to forecast evaluation

An important aspect of making forecasts is that we can later confront the forecasts with what really happened and use this to assess whether our forecast model makes good predictions, or which of multiple models work best in which situation. In this section you will get to know several ways of assessing different aspects of forecast performance. You will then use them to evaluate your forecasts.

Evaluate your forecast

In order to properly evaluate forecasts from this model we really need to forecast over a period of time, ideally capturing different epidemic dynamics. This will also give us more to work with when using scoring metrics. We will now load in some forecasts we made earlier and evaluate them.

data(rw_forecasts)
rw_forecasts |>
  ungroup()
# A tibble: 224,000 × 7
     day .draw .variable .value horizon target_day model      
   <dbl> <int> <chr>      <dbl>   <int>      <dbl> <chr>      
 1    23     1 forecast       4       1         22 Random walk
 2    23     2 forecast       2       1         22 Random walk
 3    23     3 forecast       2       1         22 Random walk
 4    23     4 forecast       6       1         22 Random walk
 5    23     5 forecast       2       1         22 Random walk
 6    23     6 forecast       3       1         22 Random walk
 7    23     7 forecast       5       1         22 Random walk
 8    23     8 forecast       2       1         22 Random walk
 9    23     9 forecast       3       1         22 Random walk
10    23    10 forecast       5       1         22 Random walk
# ℹ 223,990 more rows
Tip

We generated these forecasts using the code in data-raw/generate-example-forecasts.r which uses the same approach we just took for a single forecast date but generalises it to many forecast dates; a rough sketch of this approach is shown below.

Some important things to note about these forecasts:

  • We used a 14 day forecast horizon.
  • Each forecast used all the data up to the forecast date.
  • We generated 1000 posterior samples for each forecast.
  • We started forecasting 3 weeks into the outbreak, and then forecast once a week (every 7 days), i.e., we created forecasts on day 22, day 29, … up to day 127. We excluded the last 14 days of data so that every forecast horizon could be fully observed.
  • We used the outbreak data simulated in the same way as the one we’ve been using through the course:
head(onset_df)
# A tibble: 6 × 3
    day onsets infections
  <dbl>  <int>      <int>
1     1      0          0
2     2      0          1
3     3      0          0
4     4      0          2
5     5      0          1
6     6      0          1
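
A rough sketch of how such a set of weekly forecasts could be produced is below. This is an illustration of the approach rather than the exact contents of data-raw/generate-example-forecasts.r, and it reuses the mod, onset_df, gen_time_pmf and ip_pmf objects from above.

# Sketch only: loop over forecast dates, refit the model each time, and collect
# the forecasts (the real code lives in data-raw/generate-example-forecasts.r)
forecast_days <- seq(22, 127, by = 7)
rw_forecasts_sketch <- forecast_days |>
  lapply(\(forecast_day) {
    train <- onset_df |> filter(day <= forecast_day)
    fit <- mod$sample(
      data = list(
        n = nrow(train), I0 = 1, obs = train$onsets,
        gen_time_max = length(gen_time_pmf), gen_time_pmf = gen_time_pmf,
        ip_max = length(ip_pmf) - 1, ip_pmf = ip_pmf,
        h = 14
      ),
      parallel_chains = 4, adapt_delta = 0.95,
      init = \() list(init_R = 0, rw_sd = 0.01)
    )
    fit |>
      gather_draws(forecast[horizon]) |>
      ungroup() |>
      mutate(
        day = forecast_day + horizon,
        target_day = forecast_day,
        model = "Random walk"
      )
  }) |>
  bind_rows()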

Visualising your forecast

As for a single forecast, our first step is to visualise the forecasts as this can give us a good idea of how well the model is doing without having to calculate any metrics.

rw_forecasts |>
  filter(.draw %in% sample(.draw, 100)) |>
  ggplot(aes(x = day)) +
  geom_line(
    aes(y = .value, group = interaction(.draw, target_day), col = target_day),
    alpha = 0.1
  ) +
  geom_point(
    data = onset_df |>
      filter(day >= 21),
    aes(x = day, y = onsets), color = "black") +
  scale_color_binned(type = "viridis") +
  labs(title = "Weekly forecasts of symptom onsets over an outbreak",
       col = "Forecast start day")

As for the single forecast it may be helpful to also plot the forecast on the log scale.

rw_forecasts |>
  filter(.draw %in% sample(.draw, 100)) |> 
  ggplot(aes(x = day)) +
  geom_line(
    aes(y = .value, group = interaction(.draw, target_day), col = target_day),
    alpha = 0.1
  ) +
  geom_point(data = onset_df, aes(x = day, y = onsets), color = "black") +
  scale_y_log10() +
  scale_color_binned(type = "viridis") +
  labs(title = "Weekly symptom onset forecasts: log scale",
       col = "Forecast start day")
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

Take 5 minutes

What do you think of these forecasts? Are they any good? How well do they capture changes in trend? Does the uncertainty seem reasonable? Do they seem to under or over predict consistently? Would you visualise the forecast in a different way?

  • What do you think of these forecasts? We think these forecasts are a reasonable place to start but there is definitely room for improvement.
  • Are they any good? They seem to do a reasonable job of capturing the short term dynamics but struggle with the longer term dynamics.
  • How well do they capture changes in trend? There is little evidence of the model capturing the reduction in onsets before it begins to show in the data.
  • Does the uncertainty seem reasonable? On the natural scale it looks like the model often over predicts. Things seem more balanced on the log scale but the model still seems to be overly uncertain.
  • Do they seem to under or over predict consistently? It looks like the model is consistently over predicting on the natural scale but this is less clear on the log scale.

Scoring your forecast

On top of visualising the forecasts, we can also summarise performance quantitatively by transforming them using scoring metrics. Whilst some of these metrics are more useful for comparing models, many can also be useful for understanding the performance of a single model. We will look at some of these metrics in the next section.

Tip

In this session, we’ll use “proper” scoring rules: these are scoring rules that ensure no model can achieve better scores, in expectation, than the true model, i.e. the model used to generate the data. Of course we usually don’t know this (as we don’t know the “true model” for real-world data) but proper scoring rules incentivise forecasters to make their best attempt at reproducing its behaviour.

We will use the {scoringutils} package to calculate these metrics. Our first step is to convert our forecasts into a format that the {scoringutils} package can use. We will use as_forecast_sample() to do this:

sc_forecasts <- rw_forecasts |>
  left_join(onset_df, by = "day") |>
  filter(!is.na(.value)) |>
  as_forecast_sample(
    forecast_unit = c(
      "target_day", "horizon", "model"
    ),
    observed = "onsets",
    predicted = ".value",
    sample_id = ".draw"
  )
sc_forecasts
Forecast type: sample
Forecast unit:
target_day, horizon, and model

        sample_id predicted observed target_day horizon       model
            <int>     <num>    <int>      <num>   <int>      <char>
     1:         1         4        3         22       1 Random walk
     2:         2         2        3         22       1 Random walk
     3:         3         2        3         22       1 Random walk
     4:         4         6        3         22       1 Random walk
     5:         5         2        3         22       1 Random walk
    ---                                                            
223996:       996         1        6        127      14 Random walk
223997:       997         0        6        127      14 Random walk
223998:       998         0        6        127      14 Random walk
223999:       999         1        6        127      14 Random walk
224000:      1000         0        6        127      14 Random walk

As you can see this has created a forecast object which has a print method that summarises the forecasts.

Take 2 minutes

What important information is in the forecast object?

  • The forecast unit which is the target day, horizon, and model
  • The type of forecast which is a sample forecast

Everything seems to be in order. We can now use the scoringutils package to calculate some metrics. We will use the default sample metrics (as our forecasts are in sample format) and score our forecasts.

sc_scores <- sc_forecasts |>
  score()

sc_scores
     target_day horizon       model   bias        dss     crps overprediction
          <num>   <int>      <char>  <num>      <num>    <num>          <num>
  1:         22       1 Random walk  0.043  1.5408811 0.469424          0.000
  2:         22       2 Random walk  0.828  3.2154909 1.664522          1.118
  3:         22       3 Random walk  0.858  3.5962828 2.010624          1.436
  4:         22       4 Random walk  0.425  2.7289591 1.002469          0.294
  5:         22       5 Random walk  0.909  4.4509533 3.038579          2.252
 ---                                                                         
220:        127      10 Random walk -0.963  8.3509289 3.573842          0.000
221:        127      11 Random walk -0.434  0.9339838 0.538866          0.000
222:        127      12 Random walk -0.510  0.9362237 0.586738          0.000
223:        127      13 Random walk -0.992 18.8680434 5.240689          0.000
224:        127      14 Random walk -0.979 13.6802184 4.374974          0.000
     underprediction dispersion log_score    mad ae_median   se_mean
               <num>      <num>     <num>  <num>     <num>     <num>
  1:           0.000   0.469424  1.655610 1.4826         0  0.155236
  2:           0.000   0.546522  2.231108 2.9652         3  8.479744
  3:           0.000   0.574624  2.499141 2.9652         3 11.410884
  4:           0.000   0.708469  2.010027 2.9652         2  4.190209
  5:           0.000   0.786579  2.943145 2.9652         4 24.157225
 ---                                                                
220:           3.228   0.345842  3.985123 1.4826         5 19.061956
221:           0.250   0.288866  1.594287 1.4826         1  0.343396
222:           0.326   0.260738  1.563904 1.4826         1  0.527076
223:           4.984   0.256689  6.071192 1.4826         6 34.515625
224:           4.096   0.278974  4.600167 1.4826         5 24.740676

See the documentation for ?metrics_sample for information on the default metrics for forecasts that are represented as samples (in our case the samples generated by the stan model).

At a glance

Before we look in detail at the scores, we can use summarise_scores to get a quick overview. Don’t worry if you don’t understand all the scores yet, we will go through some of them in more detail in the next section and you can find more information in the {scoringutils} documentation.

sc_scores |>
  summarise_scores(by = "model")
         model      bias      dss    crps overprediction underprediction
        <char>     <num>    <num>   <num>          <num>           <num>
1: Random walk 0.1661116 6.480467 13.4868       8.379857       0.7544018
   dispersion log_score     mad ae_median  se_mean
        <num>     <num>   <num>     <num>    <num>
1:   4.352542  4.030831 18.2512  18.00446 1526.176
Take 2 minutes

Before we look in detail at the scores, what do you think the scores are telling you?

Continuous ranked probability score

What is the Continuous Ranked Probability Score (CRPS)?

The Continuous Ranked Probability Score (CRPS) is a proper scoring rule used to evaluate the accuracy of probabilistic forecasts. It is a generalisation of the mean absolute error (MAE) to probabilistic forecasts, where the forecast is a distribution rather than a single point estimate (as is the case for our forecasts).

The CRPS can be thought of as the combination of two key aspects of forecasting:
  1. The accuracy of the forecast in terms of how close the predicted values are to the observed value.
  2. The confidence of the forecast in terms of the spread of the predicted values.

By balancing these two aspects, the CRPS provides a comprehensive measure of the quality of probabilistic forecasts.

Key things to note about the CRPS
  • Small values are better
  • As it is an absolute scoring rule it can be difficult to use to compare forecasts across scales.

For distributions with a finite first moment (a mean exists and it is finite), the CRPS can be expressed as:

\[ CRPS(D, y) = \mathbb{E}_{X \sim D}[|X - y|] - \frac{1}{2} \mathbb{E}_{X, X' \sim D}[|X - X'|] \]

where \(X\) and \(X'\) are independent random variables sampled from the distribution \(D\). To calculate this we simply replace \(X\) and \(X'\) by samples from our posterior distribution and average over all possible pairs (a short numerical sketch is given after the breakdown below).

This equation can be broken down into two components:

Breakdown of the Components
  1. Expected Absolute Error Between Forecast and Observation: \(\mathbb{E}_{X \sim D}[|X - y|]\) This term represents the average absolute difference between the values predicted by the forecasted distribution \(D\) and the actual observed value \(y\). It measures how far, on average, the forecasted values are from the observed value. A smaller value indicates that the forecasted distribution is closer to the observed value.

  2. Expected Absolute Error Between Two Forecasted Values: \(\frac{1}{2} \mathbb{E}_{X, X' \sim D}[|X - X'|]\) This term represents the average absolute difference between two independent samples from the forecasted distribution \(D\). It measures the internal variability or spread of the forecasted distribution. A larger value indicates a wider spread of the forecasted values.

Interpretation
  • First Term (\(\mathbb{E}_{X \sim D}[|X - y|]\)): This term penalizes the forecast based on how far the predicted values are from the observed value. It ensures that the forecast is accurate in terms of proximity to the actual observation.

  • Second Term (\(\frac{1}{2} \mathbb{E}_{X, X' \sim D}[|X - X'|]\)): This term accounts for the spread of the forecasted distribution. It penalizes forecasts that are too uncertain or have a wide spread. By subtracting this term, the CRPS rewards forecasts that are not only accurate but also confident (i.e., have a narrow spread).
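
As a concrete illustration (not part of the session code), here is the sample-based version of this calculation for a single observation. This simple estimator includes the X = X' pairs, so its value can differ slightly from the estimator used by {scoringutils}, but it follows the formula above directly.

# Illustration only: sample-based CRPS for one observation
crps_from_samples <- function(samples, y) {
  accuracy <- mean(abs(samples - y))                       # E|X - y|
  spread <- 0.5 * mean(abs(outer(samples, samples, "-")))  # 0.5 * E|X - X'|
  accuracy - spread
}
# Example: forecast samples from a Poisson(10), observed value 12
set.seed(42)
crps_from_samples(rpois(1000, lambda = 10), y = 12)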

Whilst the CRPS is a very useful metric, it can be difficult to interpret in isolation. It is often useful to compare the CRPS of different models or to compare the CRPS of the same model under different conditions. For example, let’s compare the CRPS across different forecast horizons.

sc_scores |>
  summarise_scores(by = "horizon") |>
  ggplot(aes(x = horizon, y = crps)) +
  geom_point() +
  labs(title = "CRPS by daily forecast horizon", 
       subtitle = "Summarised across all forecasts")

and at different time points.

sc_scores |>
  summarise_scores(by = "target_day") |>
  ggplot(aes(x = target_day, y = crps)) +
  geom_point() +
  labs(title = "CRPS by forecast start date", 
       subtitle = "Summarised across all forecasts", x = "forecast date")

Take 5 minutes

How do the CRPS scores change based on forecast date? How do the CRPS scores change with forecast horizon? What does this tell you about the model?

  • The CRPS scores increase for forecast dates where incidence is higher.
  • The CRPS scores increase with forecast horizon.
  • As the CRPS is an absolute measure it is hard to immediately know if the CRPS increasing with forecast date indicates that the model is performing worse.
  • However, the CRPS increasing with forecast horizon is a sign that the model is struggling to capture the longer term dynamics of the epidemic.

PIT histograms

As well as the CRPS we can also look at the calibration and bias of the model. Calibration is the agreement between the forecast probabilities and the observed frequencies. Bias is a measure of how likely the model is to over or under predict the observed values.

There are many ways to assess calibration and bias but one common way is to use a probability integral transform (PIT) histogram. This is a histogram of the cumulative distribution function of the forecast evaluated at the observed value.

Interpreting the PIT histogram
  • Ideally PIT histograms should be uniform.
  • If it is U-shaped then the model is overconfident and if it is an inverted U (hump) shape then the model is underconfident.
  • If it is skewed then the model is biased towards the direction of the skew.

Continuous Case

For a continuous random variable \(X\) with cumulative distribution function (CDF) \(F_X\), the PIT is defined as:

\[ Y = F_X(X) \]

where \(Y\) is uniformly distributed on \([0, 1]\).

Integer Case

When dealing with integer forecasts, the standard PIT does not yield a uniform distribution even if the forecasts are perfectly calibrated. To address this, a randomized version of the PIT is used. For an integer-valued random variable \(X\) with CDF \(F_X\), the randomized PIT is defined as:

\[ U = F_X(k-1) + v \cdot (F_X(k) - F_X(k-1)) \]

where:

  • \(k\) is the observed integer value.
  • \(F_X(k)\) is the CDF evaluated at \(k\).
  • \(v\) is a random variable uniformly distributed on \([0, 1]\).

This transformation ensures that \(U\) is uniformly distributed on \([0, 1]\) if the forecasted distribution \(F_X\) is correctly specified.
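
To see why the randomised PIT works, here is a small simulation (illustration only, not part of the session code): for count data generated from a Poisson(10) distribution and forecast with that same distribution, the randomised PIT values come out approximately uniform.

# Illustration only: randomised PIT for a perfectly calibrated Poisson forecast
set.seed(42)
k <- rpois(1000, lambda = 10)   # observed integer values
v <- runif(1000)                # uniform randomisation
pit <- ppois(k - 1, lambda = 10) + v * dpois(k, lambda = 10)  # F(k-1) + v * P(X = k)
hist(pit, breaks = 10)          # approximately flat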

Let’s first look at the overall PIT histogram.

sc_forecasts |>
  get_pit_histogram() |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() +
  labs(title = "PIT histogram", x = "Quantile", y = "Density")

As before let’s look at the PIT histogram by forecast horizon. To save space we will group horizons into a few days each:

sc_forecasts |>
  mutate(group_horizon = case_when(
    horizon <= 3 ~ "1-3",
    horizon <= 7 ~ "4-7",
    horizon <= 14 ~ "8-14"
  )) |>
  get_pit_histogram(by = "group_horizon") |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() + 
  facet_wrap(~group_horizon) +
  labs(title = "PIT by forecast horizon (days)")

and then for different forecast dates.

sc_forecasts |>
  get_pit_histogram(by = "target_day") |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() + 
  facet_wrap(~target_day) +
  labs(title = "PIT by forecast date")

Take 5 minutes

What do you think of the PIT histograms? Do they look well calibrated? Do they look biased?

  • It looks like the model is biased towards overpredicting and that this bias gets worse at longer forecast horizons.
  • Looking across forecast dates, much of this bias appears to come from around the outbreak peak, where the model consistently overpredicts, though it also overpredicts at other times.

Scoring on the log scale

We can also score on the logarithmic scale. This can be useful if we are interested in the relative performance of the model at different scales of the data, for example if we are interested in the model’s performance at capturing the exponential growth phase of the epidemic. In some sense scoring in this way can be an approximation of scoring the effective reproduction number estimates. Doing this directly can be difficult as the effective reproduction number is a latent variable and so we cannot directly score it.

We again use scoringutils but first transform both the forecasts and observations to the log scale.

log_sc_forecasts <- sc_forecasts |>
  transform_forecasts(
    fun = log_shift,
    offset = 1,
    append = FALSE
  )

log_scores <- log_sc_forecasts |>
  score()

For more on scoring on the log scale see this paper on scoring forecasts on transformed scales.

At a glance

log_scores |>
  summarise_scores(by = "model")
         model      bias        dss     crps overprediction underprediction
        <char>     <num>      <num>    <num>          <num>           <num>
1: Random walk 0.1292321 -0.4000756 0.247976      0.1217433      0.03912202
   dispersion log_score       mad ae_median   se_mean
        <num>     <num>     <num>     <num>     <num>
1: 0.08711065 0.6487742 0.3761217 0.3263046 0.2241744
Take 2 minutes

Before we look in detail at the scores, what do you think the scores are telling you? How do you think they will differ from the scores on the natural scale?

CRPS

log_scores |>
  summarise_scores(by = "horizon") |>
  ggplot(aes(x = horizon, y = crps)) +
  geom_point() +
  labs(title = "CRPS by daily forecast horizon, scored on the log scale")

and across different forecast dates

log_scores |>
  summarise_scores(by = "target_day") |>
  ggplot(aes(x = target_day, y = crps)) +
  geom_point() +
  labs(title = "CRPS by forecast date, scored on the log scale")

Take 5 minutes

How do the CRPS scores change based on forecast date? How do the CRPS scores change with forecast horizon? What does this tell you about the model?

  • As on the natural scale, CRPS scores increase with forecast horizon, but now the increase appears to be roughly linear rather than exponential.
  • Compared with scoring on the natural scale, the CRPS scores for forecast dates near the outbreak peak are now less dominant relative to other forecast dates, but this is still the period where the model performs worst.

PIT histograms

Let’s first look at the overall PIT histogram.

log_sc_forecasts |>
  get_pit_histogram(by = "model") |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() +
  labs(title = "PIT histogram, scored on the log scale")

As before let’s look at the PIT histogram by forecast horizon

log_sc_forecasts |>
  mutate(group_horizon = case_when(
    horizon <= 3 ~ "1-3",
    horizon <= 7 ~ "4-7",
    horizon <= 14 ~ "8-14"
  )) |>
  get_pit_histogram(by = "group_horizon") |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() +
  facet_wrap(~group_horizon) +
  labs(title = "PIT by forecast horizon, scored on the log scale")

and then for different forecast dates.

log_sc_forecasts |>
  get_pit_histogram(by = "target_day") |>
  ggplot(aes(x = mid, y = density)) +
  geom_col() +
  facet_wrap(~target_day) +
  labs(title = "PIT by forecast date, scored on the log scale")

Take 5 minutes

What do you think of the PIT histograms? Do they look well calibrated? Do they look biased?

  • The overall PIT histograms suggest that the model is less biased towards overprediction when scored on the log scale than on the natural scale, but it is still biased. This makes sense when we think back to the comparison of reproduction number estimates and forecasts we made earlier, where the model was consistently overpredicting the reproduction number.
  • By forecast horizon the model is still biased towards over predicting but this bias is less pronounced than on the natural scale.
  • Towards the end and beginning of the forecast period the model appears to be well calibrated on the log scale but is biased towards over predicting in the middle of the forecast period.
  • This matches our knowledge of the underlying reproduction number, which was initially constant, then began to decrease, and stabilised towards the end of the outbreak.

Going further

  • In which other ways could we summarise the performance of the forecasts?
  • What other metrics could we use?
  • There is no one-size-fits-all approach to forecast evaluation; often you will need to use a combination of metrics to understand the performance of your model, and typically the metrics you use will depend on the context of the forecast. What attributes of the forecast are most important to you?
  • There are many other metrics that can be used to evaluate forecasts. The documentation for the {scoringutils} package has a good overview of these metrics and how to use them.
  • One useful way to think about evaluating forecasts is to treat exploring the scores as a data analysis in its own right. For example, you could look at how the scores change over time, how they change with different forecast horizons, or how they change with different models. This can be a useful way to understand the strengths and weaknesses of your model. Explore some of these aspects using the scores from this session; a sketch to get you started is shown below.
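
As one possible starting point for treating the scores as a data analysis (sketch only, reusing sc_scores from above), we can summarise the CRPS jointly by forecast date and horizon and plot the result as a heatmap.

# Sketch only: explore how CRPS varies jointly by forecast date and horizon
sc_scores |>
  summarise_scores(by = c("target_day", "horizon")) |>
  ggplot(aes(x = horizon, y = factor(target_day), fill = crps)) +
  geom_tile() +
  labs(
    x = "Forecast horizon (days)", y = "Forecast date",
    fill = "CRPS"
  )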

Wrap up