Forecast ensembles

Introduction

As we saw in the forecast evaluation session, different modelling approaches have different strength and weaknesses, and it is not clear a priori which one produces the best forecast in any given situation. One way to attempt to draw strength from a diversity of approaches is the creation of so-called forecast ensembles from the forecasts produced by different models.

In this session, we’ll build ensembles using forecasts from models of different levels of mechanism vs. statistical complexity. We will then compare the performance of these ensembles to the individual models and to each other. Rather than using the forecast samples we have been using we will instead now use quantile-based forecasts.

Probabilistic predictions can be described as coming from a probabilistic probability distributions. In general and when using complex models such as the one we discuss in this course, these distributions can not be expressed in a simple analytical formal as we can do if, e.g. talking about common probability distributions such as the normal or gamma distributions. Instead, we typically use a limited number of samples generated from Monte-Carlo methods to represent the predictive distribution. However, this is not the only way to characterise distributions.

A quantile is the value that corresponds to a given quantile level of a distribution. For example, the median is the 50th quantile of a distribution, meaning that 50% of the values in the distribution are less than the median and 50% are greater. Similarly, the 90th quantile is the value that corresponds to 90% of the distribution being less than this value. If we characterise a predictive distribution by its quantiles, we specify these values at a range of specific quantile levels, e.g. from 5% to 95% in 5% steps.

Deciding how to represent forecasts depends on many things, for example the method used (and whether it produces samples by default) but also logistic considerations. Many collaborative forecasting projects and so-called forecasting hubs use quantile-based representations of forecasts in the hope to be able to characterise both the centre and tails of the distributions more reliably and with less demand on storage space than a sample-based representation.

Slides

Objectives

The aim of this session is to introduce the concept of ensembles of forecasts and to evaluate the performance of ensembles of the multiple models.

Source file

The source file of this session is located at sessions/forecast-ensembles.qmd.

Libraries used

In this session we will use the nfidd package to load a data set of infection times and access stan models and helper functions, the dplyr and tidyr packages for data wrangling, ggplot2 library for plotting, the tidybayes package for extracting results of the inference and the scoringutils package for evaluating forecasts. We will also use qrensemble for quantile regression averaging in the weighted ensemble section.

library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("scoringutils")
library("qrensemble")
Tip

The best way to interact with the material is via the Visual Editor of RStudio.

Initialisation

We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do.

set.seed(123)

Individual forecast models

In this session we will use the forecasts from different models we introduced in the forecasting evaluation of multiple models session. There all shared the same basic renewal with delays structure but used different models for the evolution of the effective reproduction number over time. These were:

  • A random walk model
  • A simple model of susceptible depletion referred to as “More mechanistic”
  • A differenced autoregressive model referred to as “More statistical”

As previously, we have fitted these models to a range of forecast dates so you don’t have to wait for the models to fit.

data(rw_forecasts, stat_forecasts, mech_forecasts)
forecasts <- bind_rows(
  rw_forecasts,
  mutate(stat_forecasts, model = "More statistical"),
  mutate(mech_forecasts, model = "More mechanistic")
) |>
  ungroup()

forecasts
# A tibble: 672,000 × 7
     day .draw .variable .value horizon target_day model      
   <dbl> <int> <chr>      <dbl>   <int>      <dbl> <chr>      
 1    23     1 forecast       4       1         22 Random walk
 2    23     2 forecast       2       1         22 Random walk
 3    23     3 forecast       2       1         22 Random walk
 4    23     4 forecast       6       1         22 Random walk
 5    23     5 forecast       2       1         22 Random walk
 6    23     6 forecast       3       1         22 Random walk
 7    23     7 forecast       5       1         22 Random walk
 8    23     8 forecast       2       1         22 Random walk
 9    23     9 forecast       3       1         22 Random walk
10    23    10 forecast       5       1         22 Random walk
# ℹ 671,990 more rows

Some important things to note about these forecasts:

  • We used a 14 day forecast horizon.
  • Each forecast used all the data up to the forecast date.
  • We generated 1000 predictive posterior samples for each forecast.
  • We started forecasting 3 weeks into the outbreak and then forecast every 7 days until the end of the data (excluding the last 14 days to allow a full forecast).
  • We use the same simulated outbreak data as before:
gen_time_pmf <- make_gen_time_pmf()
ip_pmf <- make_ip_pmf()
onset_df <- simulate_onsets(
  make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)
head(onset_df)
# A tibble: 6 × 3
    day onsets infections
  <dbl>  <int>      <int>
1     1      0          0
2     2      0          1
3     3      0          0
4     4      0          2
5     5      0          1
6     6      0          1

Converting sample-based forecasts to quantile-based forecasts

As in this session we will be thinking about forecasts in terms quantiles of the predictive distributions, we will need to convert our sample based forecasts to quantile-based forecasts. We will do this by focusing on the marginal distribution at each predicted time point, that is we treat each time point as independent of all others and calculate quantiles based on the sample predictive trajectories at that time point. An easy way to do this is to use the {scoringutils} package. The steps to do this are to first declare the forecasts as sample forecasts.

sample_forecasts <- forecasts |>
  left_join(onset_df, by = "day") |>
  filter(!is.na(.value)) |>
  as_forecast_sample(
    forecast_unit = c("target_day", "horizon", "model", "day"),
    observed = "onsets",
    predicted = ".value",
    sample_id = ".draw"
  )
sample_forecasts
Forecast type: sample
Forecast unit:
target_day, horizon, model, and day

        sample_id predicted observed target_day horizon            model   day
            <int>     <num>    <int>      <num>   <int>           <char> <num>
     1:         1         4        4         22       1      Random walk    23
     2:         2         2        4         22       1      Random walk    23
     3:         3         2        4         22       1      Random walk    23
     4:         4         6        4         22       1      Random walk    23
     5:         5         2        4         22       1      Random walk    23
    ---                                                                       
671996:       996         1        2        127      14 More mechanistic   141
671997:       997         7        2        127      14 More mechanistic   141
671998:       998         2        2        127      14 More mechanistic   141
671999:       999         1        2        127      14 More mechanistic   141
672000:      1000         5        2        127      14 More mechanistic   141

and then convert to quantile forecasts.

quantile_forecasts <- sample_forecasts |>
  as_forecast_quantile()
quantile_forecasts
Forecast type: quantile
Forecast unit:
target_day, horizon, model, and day

      observed target_day horizon            model   day quantile_level
         <int>      <num>   <int>           <char> <num>          <num>
   1:        4         22       1      Random walk    23           0.05
   2:        4         22       1      Random walk    23           0.25
   3:        4         22       1      Random walk    23           0.50
   4:        4         22       1      Random walk    23           0.75
   5:        4         22       1      Random walk    23           0.95
  ---                                                                  
3356:        2        127      14 More mechanistic   141           0.05
3357:        2        127      14 More mechanistic   141           0.25
3358:        2        127      14 More mechanistic   141           0.50
3359:        2        127      14 More mechanistic   141           0.75
3360:        2        127      14 More mechanistic   141           0.95
      predicted
          <num>
   1:         0
   2:         2
   3:         3
   4:         5
   5:         7
  ---          
3356:         1
3357:         2
3358:         3
3359:         4
3360:         6
  • Internally scoringutils is calculating the quantiles of the sample-based forecasts.
  • It does this by using a set of default quantiles but different ones can be specified by the user to override the default.
  • It then calls the quantile() function from base R to calculate the quantiles.
  • This is estimating the value that corresponds to each given quantile level by ordering the samples and then taking the value at the appropriate position.

Simple unweighted ensembles

A good place to start when building ensembles is to take the mean or median of the unweighted forecast at each quantile level, and treat these as quantiles of the ensemble predictive distribution. Typically, the median is preferred when outlier forecasts are likely to be present as it is less sensitive to these. However, the mean is preferred when forecasters have more faith in models that diverge from the median performance and want to represent this in the ensemble.

The procedure of calculating quantiles of a new distribution as a weighted average of quantiles of constituent distributions (e.g., different measurements) is called a Vincent average, after the biologist Stella Vincent who described this as early as 1912 when studying the function of whiskers in the behaviour of white rats.

Construction

We can calculate the mean ensemble by taking the mean of the forecasts at each quantile level.

mean_ensemble <- quantile_forecasts |>
  as_tibble() |>
  summarise(
    predicted = mean(predicted),
    observed = unique(observed),
    model = "Mean ensemble",
    .by = c(target_day, horizon, quantile_level, day)
  )

Similarly, we can calculate the median ensemble by taking the median of the forecasts at each quantile level.

median_ensemble <- quantile_forecasts |>
  as_tibble() |>
  summarise(
    predicted = median(predicted),
    observed = unique(observed),
    model = "Median ensemble",
    .by = c(target_day, horizon, quantile_level, day)
  )

We combine the ensembles into a single data frame along with the individual forecasts in order to make visualisation easier.

simple_ensembles <- bind_rows(
  mean_ensemble,
  median_ensemble,
  quantile_forecasts
)

Visualisation

How do these ensembles visually differ from the individual models? Lets start by plotting a single forecast from each model and comparing them.

plot_ensembles <- function(data, obs_data) {
  data |>
    pivot_wider(names_from = quantile_level, values_from = predicted) |>
    ggplot(aes(x = day)) +
    geom_ribbon(
    aes(
      ymin = .data[["0.05"]], ymax = .data[["0.95"]], fill = model,
        group = target_day
      ),
      alpha = 0.2
    ) +
    geom_ribbon(
      aes(
        ymin = .data[["0.25"]], ymax = .data[["0.75"]], fill = model,
        group = target_day
      ),
      alpha = 0.5
    ) +
    geom_point(
      data = obs_data,
      aes(x = day, y = onsets), color = "black"
    ) +
    scale_color_binned(type = "viridis") +
    facet_wrap(~model) +
    theme(legend.position = "none")
}

plot_single_forecasts <- simple_ensembles |>
  filter(target_day == 57) |>
  plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))

plot_single_forecasts

Again we can get a different perspective by plotting the forecasts on the log scale.

plot_single_forecasts +
  scale_y_log10()

Take 2 minutes

How do these ensembles compare to the individual models? How do they differ from each other?

How do these ensembles compare to the individual models?

  • Both of the simple ensembles appear to be less variable than the statistical models but are more variable than the mechanistic model.
  • Both ensembles are more like the statistical models than the mechanistic model.

How do they differ from each other?

  • The mean ensemble has slightly tighter uncertainty bounds than the median ensemble.

Now lets plot a range of forecasts from each model and ensemble.

plot_multiple_forecasts <- simple_ensembles |>
  plot_ensembles(onset_df |> filter(day >= 21)) +
  lims(y = c(0, 400))

plot_multiple_forecasts

Again we can get a different perspective by plotting the forecasts on the log scale.

plot_multiple_forecasts +
  scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

Take 2 minutes

How do these ensembles compare to the individual models?

How do they differ from each other?

Are there any differences across forecast dates?

How do these ensembles compare to the individual models?

  • As before, the ensembles appear to be less variable than the statistical models but more variable than the mechanistic model.

How do they differ from each other?

  • The mean ensemble has marginally tighter uncertainty bounds than the median ensemble as for the single forecast.

Are there any differences across forecast dates?

  • The mean ensemble appears to be more variable across forecast dates than the median ensemble with this being more pronounced after the peak of the outbreak.

Evaluation

As in the forecast evaluation session, we can evaluate the accuracy of the ensembles using the {scoringutils} package and in particular the score() function.

ensemble_scores <- simple_ensembles |>
  as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
  score()
Note

The weighted interval score (WIS) is a proper scoring rule for quantile forecasts that approximates the Continuous Ranked Probability Score (CRPS) by considering a weighted sum of multiple prediction intervals. As the number of intervals increases, the WIS converges to the CRPS, combining sharpness and penalties for over- and under-prediction.

We see it here as we are scoring quantiles and not samples hence we cannot use CRPS as we did before.

Again we start with a high level overview of the scores by model.

ensemble_scores |>
  summarise_scores(by = c("model"))
              model       wis overprediction underprediction dispersion
             <char>     <num>          <num>           <num>      <num>
1:    Mean ensemble  8.250097       4.271101       0.9264881   3.052507
2:  Median ensemble 10.197812       5.954107       0.9897321   3.253973
3:      Random walk 12.086955       7.568929       0.7236607   3.794366
4: More statistical 10.717272       5.710446       1.3504464   3.656379
5: More mechanistic  5.453563       1.614821       2.1319643   1.706777
           bias interval_coverage_50 interval_coverage_90 ae_median
          <num>                <num>                <num>     <num>
1:  0.200000000            0.5133929            0.8750000 12.498512
2:  0.158928571            0.5267857            0.8794643 15.484375
3:  0.227678571            0.4955357            0.8616071 18.209821
4: -0.004464286            0.5357143            0.8616071 15.959821
5:  0.214285714            0.4732143            0.7857143  8.473214
Take 5 minutes

What do you think the scores are telling you? Which model do you think is best? What other scoring breakdowns might you want to look at?

What do you think the scores are telling you? Which model do you think is best?

  • The mean ensemble appears to be the best performing ensemble model overall.
  • However, the more mechanistic model appears to be the best performing model overall.

What other scoring breakdowns might you want to look at?

  • There might be variation over forecast dates or horizons between the different ensemble methods

Unweighted ensembles of filtered models

A simple method that is often used to improve ensemble performance is to prune out models that perform very poorly. Balancing this can be tricky however as it can be hard to know how much to prune. The key tradeoff to consider is how much to optimise for which models have performed well in the past (and what your definition of the past is, for example all time or only the last few weeks) versus how much you want to allow for the possibility that these models may not perform well in the future.

Construction

As we just saw, the random walk model (our original baseline model) is performing poorly in comparison to the other models. We can remove this model from the ensemble and see if this improves the performance of the ensemble.

Warning

Here we are technically cheating a little as we are using the test data to help select the models to include in the ensemble. In the real world you would not do this as you would not have access to the test data and so this is an idealised scenario.

filtered_models <- quantile_forecasts |>
  filter(model != "Random walk")

We then need to recalculate the ensembles. First the mean ensemble,

filtered_mean_ensembles <- filtered_models |>
  as_tibble() |>
  summarise(
    predicted = mean(predicted),
    observed = unique(observed),
    model = "Mean filtered ensemble",
    .by = c(target_day, horizon, quantile_level, day)
  )

and then the median ensemble.

filtered_median_ensembles <- filtered_models |>
  as_tibble() |>
  summarise(
    predicted = median(predicted),
    observed = unique(observed),
    model = "Median filtered ensemble",
    .by = c(target_day, horizon, quantile_level, day)
  )

We combine these new ensembles with our previous ensembles in order to make visualisation easier.

filtered_ensembles <- bind_rows(
  filtered_mean_ensembles,
  filtered_median_ensembles,
  simple_ensembles
)

Visualisation

As for the simple ensembles, we can plot a single forecast from each model and ensemble.

filtered_ensembles |>
  filter(target_day == 57) |>
  plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))

and on the log scale.

filtered_ensembles |>
  filter(target_day == 57) |>
  plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14)) +
  scale_y_log10()

To get an overview we also plot a range of forecasts from each model and ensemble.

plot_multiple_filtered_forecasts <- filtered_ensembles |>
  plot_ensembles(onset_df |> filter(day >= 21)) +
  lims(y = c(0, 400))
plot_multiple_filtered_forecasts

and on the log scale.

plot_multiple_filtered_forecasts +
  scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

Take 2 minutes

How do the filtered ensembles compare to the simple ensembles? Which do you think is best?

How do the filtered ensembles compare to the simple ensembles?

  • The filtered ensembles appear to be less variable than the simple ensembles.
  • The filtered ensembles appear to be more like the mechanistic model than the simple ensembles.

Which do you think is best?

  • Visually, the filtered ensembles appear very similar. This makes sense given we know there are only two models left in the ensemble.

Evaluation

Let us score the filtered ensembles.

filtered_ensemble_scores <- filtered_ensembles |>
  as_forecast_quantile(
    forecast_unit = c(
      "target_day", "horizon", "model"
    )
  ) |>
  score()

Again we can get a high level overview of the scores by model.

filtered_ensemble_scores |>
  summarise_scores(by = c("model"))
                      model       wis overprediction underprediction dispersion
                     <char>     <num>          <num>           <num>      <num>
1:   Mean filtered ensemble  6.754926       2.891027       1.1823214   2.681578
2: Median filtered ensemble  6.754926       2.891027       1.1823214   2.681578
3:            Mean ensemble  8.250097       4.271101       0.9264881   3.052507
4:          Median ensemble 10.197812       5.954107       0.9897321   3.253973
5:              Random walk 12.086955       7.568929       0.7236607   3.794366
6:         More statistical 10.717272       5.710446       1.3504464   3.656379
7:         More mechanistic  5.453563       1.614821       2.1319643   1.706777
           bias interval_coverage_50 interval_coverage_90 ae_median
          <num>                <num>                <num>     <num>
1:  0.152232143            0.5312500            0.8794643 10.337054
2:  0.152232143            0.5312500            0.8794643 10.337054
3:  0.200000000            0.5133929            0.8750000 12.498512
4:  0.158928571            0.5267857            0.8794643 15.484375
5:  0.227678571            0.4955357            0.8616071 18.209821
6: -0.004464286            0.5357143            0.8616071 15.959821
7:  0.214285714            0.4732143            0.7857143  8.473214
Take 2 minutes

How do the filtered ensembles compare to the simple ensembles?

How do the filtered ensembles compare to the simple ensembles?

  • The filtered ensembles appear to be more accurate than the simple ensembles.
  • As you would expect they are an average of the more mechanistic model and the more statistical model.
  • As there are only two models in the ensemble, the median and mean ensembles are identical.
  • For the first time there are features of the ensemble that outperform the more mechanistic model though it remains the best performing model overall.

Weighted quantile ensembles

The simple mean and median we used to average quantiles earlier treats every model as the same. We could try to improve performance by replacing this with a weighted mean (or weighted median), for example given greater weight to models that have proven to make better forecasts. Here we will explore two common weighting methods:

  • Inverse WIS weighting
  • Quantile regression averaging

Inverse WIS weighting is a simple method that weights the forecasts by the inverse of their WIS over some period (note that identifying what this period should be in order to produce the best forecasts is not straightforward as predictive performance may vary over time if models are good at different things). The main benefit of WIS weighting over other methods is that it is simple to understand and implement. However, it does not optimise the weights directly to produce the best forecasts. It relies on the hope that giving more weight to better performing models yields a better ensemble

Quantile regression averaging on the other hand does optimise the weights directly in order to yield the best scores on past data.

Construction

Inverse WIS weighting

In order to perform inverse WIS weighting we first need to calculate the WIS for each model. We already have this from the previous evaluation so we can reuse this.

weights_per_model <- ensemble_scores |>
  dplyr::filter(
    model %in% c("More mechanistic", "More statistical", "Random walk")
  ) |>
  summarise_scores(by = c("model", "target_day")) |>
  select(model, target_day, wis) |>
  mutate(inv_wis = 1 / wis) |>
  mutate(
    inv_wis_total_by_date = sum(inv_wis), .by = target_day
  ) |>
  mutate(weights = inv_wis / inv_wis_total_by_date)

weights_per_model |>
  select(model, target_day, weights) |>
  pivot_wider(names_from = model, values_from = weights)
# A tibble: 16 × 4
   target_day `Random walk` `More statistical` `More mechanistic`
        <dbl>         <dbl>              <dbl>              <dbl>
 1         22        0.323              0.379               0.298
 2         29        0.408              0.346               0.247
 3         36        0.378              0.287               0.335
 4         43        0.324              0.267               0.409
 5         50        0.297              0.226               0.477
 6         57        0.290              0.248               0.462
 7         64        0.243              0.264               0.494
 8         71        0.418              0.326               0.256
 9         78        0.233              0.334               0.432
10         85        0.0652             0.0748              0.860
11         92        0.215              0.300               0.484
12         99        0.205              0.304               0.490
13        106        0.151              0.177               0.672
14        113        0.324              0.470               0.206
15        120        0.440              0.387               0.173
16        127        0.319              0.260               0.421

Now lets apply the weights to the forecast models. As we can only use information that was available at the time of the forecast to perform the weighting, we use weights from two weeks prior to the forecast date to inform each ensemble.

inverse_wis_ensemble <- quantile_forecasts |>
  as_tibble() |>
  left_join(
    weights_per_model |>
      mutate(target_day = target_day + 14),
    by = c("model", "target_day")
  ) |>
  # assign equal weights if no weights are available
  mutate(weights = ifelse(is.na(weights), 1/3, weights)) |>
  summarise(
    predicted = sum(predicted * weights),
    observed = unique(observed),
    model = "Inverse WIS ensemble",
    .by = c(target_day, horizon, quantile_level, day)
  )

Quantile regression averaging

We futher to perform quantile regression averaging (QRA) for each forecast date. Again we need to consider how many previous forecasts we wish to use to inform each ensemble forecast. Here we decide to use up to 3 weeks of previous forecasts to inform each QRA ensemble.

forecast_dates <- quantile_forecasts |>
  as_tibble() |>
  pull(target_day) |>
  unique()

qra_by_forecast <- function(
  quantile_forecasts,
  forecast_dates,
  group = c("target_end_date"), 
  ...
) {
  lapply(forecast_dates, \(x) {
    quantile_forecasts |>
      mutate(target_end_date = x) |>
      dplyr::filter(target_day <= x) |>
      dplyr::filter(target_day >= x - (3 * 7 + 1)) |>
      dplyr::filter(target_day == x | day <= x) |>
      qra(
        group = group,
        target = c(target_day = x),
        ...
      )
  })
}

qra_ensembles_obj <- qra_by_forecast(
  quantile_forecasts,
  forecast_dates[-1],
  group = c("target_end_date")
)

qra_weights <- seq_along(qra_ensembles_obj) |>
  lapply(\(i) attr(qra_ensembles_obj[[i]], "weights") |>
    mutate(target_day = forecast_dates[i + 1])
  ) |>
  bind_rows() |>
  dplyr::filter(quantile == 0.5)

qra_ensembles <- qra_ensembles_obj |>
  bind_rows()

Instead of creating a single optimised ensemble and using this for all forecast horizons we might also want to consider a separate optimised QRA ensemble for each forecast horizon, reflecting that models might perform differently depending on how far ahead a forecast is produced. We can do this using qra() with the group argument.

qra_ensembles_by_horizon <- qra_by_forecast(
  quantile_forecasts,
  forecast_dates[-c(1:2)],
  group = c("horizon", "target_end_date"),
  model = "QRA by horizon"
)

qra_weights_by_horizon <- seq_along(qra_ensembles_by_horizon) |>
  lapply(\(i) attr(qra_ensembles_by_horizon[[i]], "weights") |>
    mutate(target_day = forecast_dates[i + 2])
  ) |>
  bind_rows()

Combine ensembles

weighted_ensembles <- bind_rows(
  inverse_wis_ensemble,
  qra_ensembles,
  qra_ensembles_by_horizon,
  filtered_ensembles
) |>
  # remove the repeated filtered ensemble
  filter(model != "Mean filtered ensemble")

Visualisation

Single forecasts

Again we start by plotting a single forecast from each model and ensemble.

weighted_ensembles |>
  filter(target_day == 57) |>
  plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))

and on the log scale.

weighted_ensembles |>
  filter(target_day == 57) |>
  plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14)) +
  scale_y_log10()

Multiple forecasts

As before we can plot a range of forecasts from each model and ensemble.

plot_multiple_weighted_forecasts <- weighted_ensembles |>
  plot_ensembles(onset_df |> filter(day >= 21)) +
  lims(y = c(0, 400))
plot_multiple_weighted_forecasts

and on the log scale.

plot_multiple_weighted_forecasts +
  scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

Take 2 minutes

How do the weighted ensembles compare to the simple ensembles? Which do you think is best? Are you surprised by the results? Can you think of any reasons that would explain them?

Models weights

Now that we have a weighted ensemble, we can also look at the weights of the individual models. Here we do this for the quantile regression averaging ensemble but we could also do this for the inverse WIS ensemble or any other weighted ensemble (for an unweighted ensemble the weights are all equal).

qra_weights |>
  ggplot(aes(x = target_day, y = weight, fill = model)) +
  geom_col(position = "stack") +
  theme(legend.position = "bottom")

Take 2 minutes

How do the weights change over time? Are you surprised by the results given what you know about the models performance?

How do the weights change over time?

  • Early on the more statistical models have higher weights
  • Gradually the random walk model gains weight and by the end of the forecast horizon it represents the entire ensemble.
  • Near the peak the mechanistic model also gains weight.

Are you surprised by the results given what you know about the models performance?

  • As the random walk model is performing poorly, you would expect it to have low weights but actually it often doesn’t. This implies that its poor performance is restricted to certain parts of the outbreak.
  • Even though the mechanistic model performs really well overall it is only included in the ensemble around the peak. This could be because the training data doesn’t include changes in the epidemic dynamics and so the mechanistic model is not given sufficient weight.

Evaluation

For a final evaluation we can look at the scores for each model and ensemble again. We remove the two weeks of forecasts as these do not have a quantile regression average forecasts as these require training data to estimate.

weighted_ensemble_scores <- weighted_ensembles |>
  filter(target_day >= 29) |>
  as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
  score()

Again we can get a high level overview of the scores by model.

weighted_ensemble_scores |>
  summarise_scores(by = c("model"))
                         model       wis overprediction underprediction
                        <char>     <num>          <num>           <num>
1:        Inverse WIS ensemble  8.473266       4.422306       1.1072491
2: Quantile Regression Average  7.362562       3.040759       1.9483875
3:              QRA by horizon  7.963665       3.409738       2.0659589
4:    Median filtered ensemble  7.022126       2.985429       1.2601905
5:               Mean ensemble  8.612259       4.452825       0.9873016
6:             Median ensemble 10.681557       6.238667       1.0547619
7:                 Random walk 12.695476       7.961143       0.7709524
8:            More statistical 11.263910       6.030190       1.4395238
9:            More mechanistic  5.603486       1.571524       2.2731429
   dispersion        bias interval_coverage_50 interval_coverage_90 ae_median
        <num>       <num>                <num>                <num>     <num>
1:   2.943712  0.16619048            0.5095238            0.8190476 12.681390
2:   2.373415  0.07190476            0.5047619            0.7952381 11.193606
3:   2.487968  0.13367347            0.4897959            0.7755102 12.005322
4:   2.776507  0.11476190            0.5523810            0.8761905 10.728571
5:   3.172132  0.16571429            0.5333333            0.8714286 13.020635
6:   3.388129  0.12238095            0.5476190            0.8714286 16.178571
7:   3.963381  0.19571429            0.5142857            0.8523810 19.085714
8:   3.794195 -0.04238095            0.5333333            0.8523810 16.780952
9:   1.758819  0.17809524            0.4952381            0.7857143  8.685714

Remembering the session on forecast evaluation, we should also check performance on the log scale.

log_ensemble_scores <- weighted_ensembles |>
  filter(target_day >= 29) |>
  as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
    transform_forecasts(
    fun = log_shift,
    offset = 1,
    append = FALSE
  ) |>
  score()

log_ensemble_scores |>
  summarise_scores(by = c("model"))
                         model       wis overprediction underprediction
                        <char>     <num>          <num>           <num>
1:        Inverse WIS ensemble 0.1738246     0.08278544      0.03090237
2: Quantile Regression Average 0.1770130     0.08663305      0.03016151
3:              QRA by horizon 0.1657445     0.09253210      0.02377453
4:    Median filtered ensemble 0.1549876     0.07138677      0.02566016
5:               Mean ensemble 0.1689778     0.07928707      0.02702292
6:             Median ensemble 0.1889663     0.08160408      0.03772979
7:                 Random walk 0.2046647     0.09644129      0.03538436
8:            More statistical 0.2160592     0.06803280      0.06550430
9:            More mechanistic 0.1598201     0.09859516      0.02048076
   dispersion        bias interval_coverage_50 interval_coverage_90 ae_median
        <num>       <num>                <num>                <num>     <num>
1: 0.06013681  0.16619048            0.5095238            0.8190476 0.2634627
2: 0.06021847  0.07190476            0.5047619            0.7952381 0.2597085
3: 0.04943783  0.13826531            0.4948980            0.7755102 0.2405630
4: 0.05794062  0.11476190            0.5523810            0.8761905 0.2364991
5: 0.06266781  0.16571429            0.5333333            0.8714286 0.2565306
6: 0.06963244  0.12238095            0.5476190            0.8714286 0.2876486
7: 0.07283902  0.19571429            0.5142857            0.8523810 0.3108524
8: 0.08252213 -0.04238095            0.5333333            0.8523810 0.3193459
9: 0.04074423  0.17809524            0.4952381            0.7857143 0.2396753
Take 2 minutes

How do the weighted ensembles compare to the simple ensembles on the natural and log scale?

The best ensembles slightly outperform some of the simple ensembles but there is no obvious benefit from using weighted ensembles. Why might this be the case?

Wrap up

References