library("nfidd")
library("dplyr")
library("tidyr")
library("ggplot2")
library("scoringutils")
library("qrensemble")
Forecast ensembles
Introduction
As we saw in the forecast evaluation session, different modelling approaches have different strength and weaknesses, and it is not clear a priori which one produces the best forecast in any given situation. One way to attempt to draw strength from a diversity of approaches is the creation of so-called forecast ensembles from the forecasts produced by different models.
In this session, we’ll build ensembles using forecasts from models of different levels of mechanism vs. statistical complexity. We will then compare the performance of these ensembles to the individual models and to each other. Rather than using the forecast samples we have been using we will instead now use quantile-based forecasts.
Probabilistic predictions can be described as coming from a probabilistic probability distributions. In general and when using complex models such as the one we discuss in this course, these distributions can not be expressed in a simple analytical formal as we can do if, e.g. talking about common probability distributions such as the normal or gamma distributions. Instead, we typically use a limited number of samples generated from Monte-Carlo methods to represent the predictive distribution. However, this is not the only way to characterise distributions.
A quantile is the value that corresponds to a given quantile level of a distribution. For example, the median is the 50th quantile of a distribution, meaning that 50% of the values in the distribution are less than the median and 50% are greater. Similarly, the 90th quantile is the value that corresponds to 90% of the distribution being less than this value. If we characterise a predictive distribution by its quantiles, we specify these values at a range of specific quantile levels, e.g. from 5% to 95% in 5% steps.
Deciding how to represent forecasts depends on many things, for example the method used (and whether it produces samples by default) but also logistic considerations. Many collaborative forecasting projects and so-called forecasting hubs use quantile-based representations of forecasts in the hope to be able to characterise both the centre and tails of the distributions more reliably and with less demand on storage space than a sample-based representation.
Slides
Objectives
The aim of this session is to introduce the concept of ensembles of forecasts and to evaluate the performance of ensembles of the multiple models.
Source file
The source file of this session is located at sessions/forecast-ensembles.qmd
.
Libraries used
In this session we will use the nfidd
package to load a data set of infection times and access stan models and helper functions, the dplyr
and tidyr
packages for data wrangling, ggplot2
library for plotting, the tidybayes
package for extracting results of the inference and the scoringutils
package for evaluating forecasts. We will also use qrensemble
for quantile regression averaging in the weighted ensemble section.
The best way to interact with the material is via the Visual Editor of RStudio.
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do.
set.seed(123)
Individual forecast models
In this session we will use the forecasts from different models we introduced in the forecasting evaluation of multiple models session. There all shared the same basic renewal with delays structure but used different models for the evolution of the effective reproduction number over time. These were:
- A random walk model
- A simple model of susceptible depletion referred to as “More mechanistic”
- A differenced autoregressive model referred to as “More statistical”
As previously, we have fitted these models to a range of forecast dates so you don’t have to wait for the models to fit.
data(rw_forecasts, stat_forecasts, mech_forecasts)
<- bind_rows(
forecasts
rw_forecasts,mutate(stat_forecasts, model = "More statistical"),
mutate(mech_forecasts, model = "More mechanistic")
|>
) ungroup()
forecasts
# A tibble: 672,000 × 7
day .draw .variable .value horizon target_day model
<dbl> <int> <chr> <dbl> <int> <dbl> <chr>
1 23 1 forecast 4 1 22 Random walk
2 23 2 forecast 2 1 22 Random walk
3 23 3 forecast 2 1 22 Random walk
4 23 4 forecast 6 1 22 Random walk
5 23 5 forecast 2 1 22 Random walk
6 23 6 forecast 3 1 22 Random walk
7 23 7 forecast 5 1 22 Random walk
8 23 8 forecast 2 1 22 Random walk
9 23 9 forecast 3 1 22 Random walk
10 23 10 forecast 5 1 22 Random walk
# ℹ 671,990 more rows
Some important things to note about these forecasts:
- We used a 14 day forecast horizon.
- Each forecast used all the data up to the forecast date.
- We generated 1000 predictive posterior samples for each forecast.
- We started forecasting 3 weeks into the outbreak and then forecast every 7 days until the end of the data (excluding the last 14 days to allow a full forecast).
- We use the same simulated outbreak data as before:
<- make_gen_time_pmf()
gen_time_pmf <- make_ip_pmf()
ip_pmf <- simulate_onsets(
onset_df make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)head(onset_df)
# A tibble: 6 × 3
day onsets infections
<dbl> <int> <int>
1 1 0 0
2 2 0 1
3 3 0 0
4 4 0 2
5 5 0 1
6 6 0 1
Converting sample-based forecasts to quantile-based forecasts
As in this session we will be thinking about forecasts in terms quantiles of the predictive distributions, we will need to convert our sample based forecasts to quantile-based forecasts. We will do this by focusing on the marginal distribution at each predicted time point, that is we treat each time point as independent of all others and calculate quantiles based on the sample predictive trajectories at that time point. An easy way to do this is to use the {scoringutils}
package. The steps to do this are to first declare the forecasts as sample
forecasts.
<- forecasts |>
sample_forecasts left_join(onset_df, by = "day") |>
filter(!is.na(.value)) |>
as_forecast_sample(
forecast_unit = c("target_day", "horizon", "model", "day"),
observed = "onsets",
predicted = ".value",
sample_id = ".draw"
) sample_forecasts
Forecast type: sample
Forecast unit:
target_day, horizon, model, and day
sample_id predicted observed target_day horizon model day
<int> <num> <int> <num> <int> <char> <num>
1: 1 4 4 22 1 Random walk 23
2: 2 2 4 22 1 Random walk 23
3: 3 2 4 22 1 Random walk 23
4: 4 6 4 22 1 Random walk 23
5: 5 2 4 22 1 Random walk 23
---
671996: 996 1 2 127 14 More mechanistic 141
671997: 997 7 2 127 14 More mechanistic 141
671998: 998 2 2 127 14 More mechanistic 141
671999: 999 1 2 127 14 More mechanistic 141
672000: 1000 5 2 127 14 More mechanistic 141
and then convert to quantile
forecasts.
<- sample_forecasts |>
quantile_forecasts as_forecast_quantile()
quantile_forecasts
Forecast type: quantile
Forecast unit:
target_day, horizon, model, and day
observed target_day horizon model day quantile_level
<int> <num> <int> <char> <num> <num>
1: 4 22 1 Random walk 23 0.05
2: 4 22 1 Random walk 23 0.25
3: 4 22 1 Random walk 23 0.50
4: 4 22 1 Random walk 23 0.75
5: 4 22 1 Random walk 23 0.95
---
3356: 2 127 14 More mechanistic 141 0.05
3357: 2 127 14 More mechanistic 141 0.25
3358: 2 127 14 More mechanistic 141 0.50
3359: 2 127 14 More mechanistic 141 0.75
3360: 2 127 14 More mechanistic 141 0.95
predicted
<num>
1: 0
2: 2
3: 3
4: 5
5: 7
---
3356: 1
3357: 2
3358: 3
3359: 4
3360: 6
- Internally
scoringutils
is calculating the quantiles of the sample-based forecasts. - It does this by using a set of default quantiles but different ones can be specified by the user to override the default.
- It then calls the
quantile()
function from base R to calculate the quantiles. - This is estimating the value that corresponds to each given quantile level by ordering the samples and then taking the value at the appropriate position.
Simple unweighted ensembles
A good place to start when building ensembles is to take the mean or median of the unweighted forecast at each quantile level, and treat these as quantiles of the ensemble predictive distribution. Typically, the median is preferred when outlier forecasts are likely to be present as it is less sensitive to these. However, the mean is preferred when forecasters have more faith in models that diverge from the median performance and want to represent this in the ensemble.
The procedure of calculating quantiles of a new distribution as a weighted average of quantiles of constituent distributions (e.g., different measurements) is called a Vincent average, after the biologist Stella Vincent who described this as early as 1912 when studying the function of whiskers in the behaviour of white rats.
Construction
We can calculate the mean ensemble by taking the mean of the forecasts at each quantile level.
<- quantile_forecasts |>
mean_ensemble as_tibble() |>
summarise(
predicted = mean(predicted),
observed = unique(observed),
model = "Mean ensemble",
.by = c(target_day, horizon, quantile_level, day)
)
Similarly, we can calculate the median ensemble by taking the median of the forecasts at each quantile level.
<- quantile_forecasts |>
median_ensemble as_tibble() |>
summarise(
predicted = median(predicted),
observed = unique(observed),
model = "Median ensemble",
.by = c(target_day, horizon, quantile_level, day)
)
We combine the ensembles into a single data frame along with the individual forecasts in order to make visualisation easier.
<- bind_rows(
simple_ensembles
mean_ensemble,
median_ensemble,
quantile_forecasts )
Visualisation
How do these ensembles visually differ from the individual models? Lets start by plotting a single forecast from each model and comparing them.
<- function(data, obs_data) {
plot_ensembles |>
data pivot_wider(names_from = quantile_level, values_from = predicted) |>
ggplot(aes(x = day)) +
geom_ribbon(
aes(
ymin = .data[["0.05"]], ymax = .data[["0.95"]], fill = model,
group = target_day
),alpha = 0.2
+
) geom_ribbon(
aes(
ymin = .data[["0.25"]], ymax = .data[["0.75"]], fill = model,
group = target_day
),alpha = 0.5
+
) geom_point(
data = obs_data,
aes(x = day, y = onsets), color = "black"
+
) scale_color_binned(type = "viridis") +
facet_wrap(~model) +
theme(legend.position = "none")
}
<- simple_ensembles |>
plot_single_forecasts filter(target_day == 57) |>
plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))
plot_single_forecasts
Again we can get a different perspective by plotting the forecasts on the log scale.
+
plot_single_forecasts scale_y_log10()
How do these ensembles compare to the individual models? How do they differ from each other?
How do these ensembles compare to the individual models?
- Both of the simple ensembles appear to be less variable than the statistical models but are more variable than the mechanistic model.
- Both ensembles are more like the statistical models than the mechanistic model.
How do they differ from each other?
- The mean ensemble has slightly tighter uncertainty bounds than the median ensemble.
Now lets plot a range of forecasts from each model and ensemble.
<- simple_ensembles |>
plot_multiple_forecasts plot_ensembles(onset_df |> filter(day >= 21)) +
lims(y = c(0, 400))
plot_multiple_forecasts
Again we can get a different perspective by plotting the forecasts on the log scale.
+
plot_multiple_forecasts scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
How do these ensembles compare to the individual models?
How do they differ from each other?
Are there any differences across forecast dates?
How do these ensembles compare to the individual models?
- As before, the ensembles appear to be less variable than the statistical models but more variable than the mechanistic model.
How do they differ from each other?
- The mean ensemble has marginally tighter uncertainty bounds than the median ensemble as for the single forecast.
Are there any differences across forecast dates?
- The mean ensemble appears to be more variable across forecast dates than the median ensemble with this being more pronounced after the peak of the outbreak.
Evaluation
As in the forecast evaluation session, we can evaluate the accuracy of the ensembles using the {scoringutils}
package and in particular the score()
function.
<- simple_ensembles |>
ensemble_scores as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
score()
The weighted interval score (WIS) is a proper scoring rule for quantile forecasts that approximates the Continuous Ranked Probability Score (CRPS) by considering a weighted sum of multiple prediction intervals. As the number of intervals increases, the WIS converges to the CRPS, combining sharpness and penalties for over- and under-prediction.
We see it here as we are scoring quantiles and not samples hence we cannot use CRPS as we did before.
Again we start with a high level overview of the scores by model.
|>
ensemble_scores summarise_scores(by = c("model"))
model wis overprediction underprediction dispersion
<char> <num> <num> <num> <num>
1: Mean ensemble 8.250097 4.271101 0.9264881 3.052507
2: Median ensemble 10.197812 5.954107 0.9897321 3.253973
3: Random walk 12.086955 7.568929 0.7236607 3.794366
4: More statistical 10.717272 5.710446 1.3504464 3.656379
5: More mechanistic 5.453563 1.614821 2.1319643 1.706777
bias interval_coverage_50 interval_coverage_90 ae_median
<num> <num> <num> <num>
1: 0.200000000 0.5133929 0.8750000 12.498512
2: 0.158928571 0.5267857 0.8794643 15.484375
3: 0.227678571 0.4955357 0.8616071 18.209821
4: -0.004464286 0.5357143 0.8616071 15.959821
5: 0.214285714 0.4732143 0.7857143 8.473214
What do you think the scores are telling you? Which model do you think is best? What other scoring breakdowns might you want to look at?
What do you think the scores are telling you? Which model do you think is best?
- The mean ensemble appears to be the best performing ensemble model overall.
- However, the more mechanistic model appears to be the best performing model overall.
What other scoring breakdowns might you want to look at?
- There might be variation over forecast dates or horizons between the different ensemble methods
Unweighted ensembles of filtered models
A simple method that is often used to improve ensemble performance is to prune out models that perform very poorly. Balancing this can be tricky however as it can be hard to know how much to prune. The key tradeoff to consider is how much to optimise for which models have performed well in the past (and what your definition of the past is, for example all time or only the last few weeks) versus how much you want to allow for the possibility that these models may not perform well in the future.
Construction
As we just saw, the random walk model (our original baseline model) is performing poorly in comparison to the other models. We can remove this model from the ensemble and see if this improves the performance of the ensemble.
Here we are technically cheating a little as we are using the test data to help select the models to include in the ensemble. In the real world you would not do this as you would not have access to the test data and so this is an idealised scenario.
<- quantile_forecasts |>
filtered_models filter(model != "Random walk")
We then need to recalculate the ensembles. First the mean ensemble,
<- filtered_models |>
filtered_mean_ensembles as_tibble() |>
summarise(
predicted = mean(predicted),
observed = unique(observed),
model = "Mean filtered ensemble",
.by = c(target_day, horizon, quantile_level, day)
)
and then the median ensemble.
<- filtered_models |>
filtered_median_ensembles as_tibble() |>
summarise(
predicted = median(predicted),
observed = unique(observed),
model = "Median filtered ensemble",
.by = c(target_day, horizon, quantile_level, day)
)
We combine these new ensembles with our previous ensembles in order to make visualisation easier.
<- bind_rows(
filtered_ensembles
filtered_mean_ensembles,
filtered_median_ensembles,
simple_ensembles )
Visualisation
As for the simple ensembles, we can plot a single forecast from each model and ensemble.
|>
filtered_ensembles filter(target_day == 57) |>
plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))
and on the log scale.
|>
filtered_ensembles filter(target_day == 57) |>
plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14)) +
scale_y_log10()
To get an overview we also plot a range of forecasts from each model and ensemble.
<- filtered_ensembles |>
plot_multiple_filtered_forecasts plot_ensembles(onset_df |> filter(day >= 21)) +
lims(y = c(0, 400))
plot_multiple_filtered_forecasts
and on the log scale.
+
plot_multiple_filtered_forecasts scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
How do the filtered ensembles compare to the simple ensembles? Which do you think is best?
How do the filtered ensembles compare to the simple ensembles?
- The filtered ensembles appear to be less variable than the simple ensembles.
- The filtered ensembles appear to be more like the mechanistic model than the simple ensembles.
Which do you think is best?
- Visually, the filtered ensembles appear very similar. This makes sense given we know there are only two models left in the ensemble.
Evaluation
Let us score the filtered ensembles.
<- filtered_ensembles |>
filtered_ensemble_scores as_forecast_quantile(
forecast_unit = c(
"target_day", "horizon", "model"
)|>
) score()
Again we can get a high level overview of the scores by model.
|>
filtered_ensemble_scores summarise_scores(by = c("model"))
model wis overprediction underprediction dispersion
<char> <num> <num> <num> <num>
1: Mean filtered ensemble 6.754926 2.891027 1.1823214 2.681578
2: Median filtered ensemble 6.754926 2.891027 1.1823214 2.681578
3: Mean ensemble 8.250097 4.271101 0.9264881 3.052507
4: Median ensemble 10.197812 5.954107 0.9897321 3.253973
5: Random walk 12.086955 7.568929 0.7236607 3.794366
6: More statistical 10.717272 5.710446 1.3504464 3.656379
7: More mechanistic 5.453563 1.614821 2.1319643 1.706777
bias interval_coverage_50 interval_coverage_90 ae_median
<num> <num> <num> <num>
1: 0.152232143 0.5312500 0.8794643 10.337054
2: 0.152232143 0.5312500 0.8794643 10.337054
3: 0.200000000 0.5133929 0.8750000 12.498512
4: 0.158928571 0.5267857 0.8794643 15.484375
5: 0.227678571 0.4955357 0.8616071 18.209821
6: -0.004464286 0.5357143 0.8616071 15.959821
7: 0.214285714 0.4732143 0.7857143 8.473214
How do the filtered ensembles compare to the simple ensembles?
How do the filtered ensembles compare to the simple ensembles?
- The filtered ensembles appear to be more accurate than the simple ensembles.
- As you would expect they are an average of the more mechanistic model and the more statistical model.
- As there are only two models in the ensemble, the median and mean ensembles are identical.
- For the first time there are features of the ensemble that outperform the more mechanistic model though it remains the best performing model overall.
Weighted quantile ensembles
The simple mean and median we used to average quantiles earlier treats every model as the same. We could try to improve performance by replacing this with a weighted mean (or weighted median), for example given greater weight to models that have proven to make better forecasts. Here we will explore two common weighting methods:
- Inverse WIS weighting
- Quantile regression averaging
Inverse WIS weighting is a simple method that weights the forecasts by the inverse of their WIS over some period (note that identifying what this period should be in order to produce the best forecasts is not straightforward as predictive performance may vary over time if models are good at different things). The main benefit of WIS weighting over other methods is that it is simple to understand and implement. However, it does not optimise the weights directly to produce the best forecasts. It relies on the hope that giving more weight to better performing models yields a better ensemble
Quantile regression averaging on the other hand does optimise the weights directly in order to yield the best scores on past data.
Construction
Inverse WIS weighting
In order to perform inverse WIS weighting we first need to calculate the WIS for each model. We already have this from the previous evaluation so we can reuse this.
<- ensemble_scores |>
weights_per_model ::filter(
dplyr%in% c("More mechanistic", "More statistical", "Random walk")
model |>
) summarise_scores(by = c("model", "target_day")) |>
select(model, target_day, wis) |>
mutate(inv_wis = 1 / wis) |>
mutate(
inv_wis_total_by_date = sum(inv_wis), .by = target_day
|>
) mutate(weights = inv_wis / inv_wis_total_by_date)
|>
weights_per_model select(model, target_day, weights) |>
pivot_wider(names_from = model, values_from = weights)
# A tibble: 16 × 4
target_day `Random walk` `More statistical` `More mechanistic`
<dbl> <dbl> <dbl> <dbl>
1 22 0.323 0.379 0.298
2 29 0.408 0.346 0.247
3 36 0.378 0.287 0.335
4 43 0.324 0.267 0.409
5 50 0.297 0.226 0.477
6 57 0.290 0.248 0.462
7 64 0.243 0.264 0.494
8 71 0.418 0.326 0.256
9 78 0.233 0.334 0.432
10 85 0.0652 0.0748 0.860
11 92 0.215 0.300 0.484
12 99 0.205 0.304 0.490
13 106 0.151 0.177 0.672
14 113 0.324 0.470 0.206
15 120 0.440 0.387 0.173
16 127 0.319 0.260 0.421
Now lets apply the weights to the forecast models. As we can only use information that was available at the time of the forecast to perform the weighting, we use weights from two weeks prior to the forecast date to inform each ensemble.
<- quantile_forecasts |>
inverse_wis_ensemble as_tibble() |>
left_join(
|>
weights_per_model mutate(target_day = target_day + 14),
by = c("model", "target_day")
|>
) # assign equal weights if no weights are available
mutate(weights = ifelse(is.na(weights), 1/3, weights)) |>
summarise(
predicted = sum(predicted * weights),
observed = unique(observed),
model = "Inverse WIS ensemble",
.by = c(target_day, horizon, quantile_level, day)
)
Quantile regression averaging
We futher to perform quantile regression averaging (QRA) for each forecast date. Again we need to consider how many previous forecasts we wish to use to inform each ensemble forecast. Here we decide to use up to 3 weeks of previous forecasts to inform each QRA ensemble.
<- quantile_forecasts |>
forecast_dates as_tibble() |>
pull(target_day) |>
unique()
<- function(
qra_by_forecast
quantile_forecasts,
forecast_dates,group = c("target_end_date"),
...
) {lapply(forecast_dates, \(x) {
|>
quantile_forecasts mutate(target_end_date = x) |>
::filter(target_day <= x) |>
dplyr::filter(target_day >= x - (3 * 7 + 1)) |>
dplyr::filter(target_day == x | day <= x) |>
dplyrqra(
group = group,
target = c(target_day = x),
...
)
})
}
<- qra_by_forecast(
qra_ensembles_obj
quantile_forecasts,-1],
forecast_dates[group = c("target_end_date")
)
<- seq_along(qra_ensembles_obj) |>
qra_weights lapply(\(i) attr(qra_ensembles_obj[[i]], "weights") |>
mutate(target_day = forecast_dates[i + 1])
|>
) bind_rows() |>
::filter(quantile == 0.5)
dplyr
<- qra_ensembles_obj |>
qra_ensembles bind_rows()
Instead of creating a single optimised ensemble and using this for all forecast horizons we might also want to consider a separate optimised QRA ensemble for each forecast horizon, reflecting that models might perform differently depending on how far ahead a forecast is produced. We can do this using qra()
with the group
argument.
<- qra_by_forecast(
qra_ensembles_by_horizon
quantile_forecasts,-c(1:2)],
forecast_dates[group = c("horizon", "target_end_date"),
model = "QRA by horizon"
)
<- seq_along(qra_ensembles_by_horizon) |>
qra_weights_by_horizon lapply(\(i) attr(qra_ensembles_by_horizon[[i]], "weights") |>
mutate(target_day = forecast_dates[i + 2])
|>
) bind_rows()
Combine ensembles
<- bind_rows(
weighted_ensembles
inverse_wis_ensemble,
qra_ensembles,
qra_ensembles_by_horizon,
filtered_ensembles|>
) # remove the repeated filtered ensemble
filter(model != "Mean filtered ensemble")
Visualisation
Single forecasts
Again we start by plotting a single forecast from each model and ensemble.
|>
weighted_ensembles filter(target_day == 57) |>
plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14))
and on the log scale.
|>
weighted_ensembles filter(target_day == 57) |>
plot_ensembles(onset_df |> filter(day >= 57, day <= 57 + 14)) +
scale_y_log10()
Multiple forecasts
As before we can plot a range of forecasts from each model and ensemble.
<- weighted_ensembles |>
plot_multiple_weighted_forecasts plot_ensembles(onset_df |> filter(day >= 21)) +
lims(y = c(0, 400))
plot_multiple_weighted_forecasts
and on the log scale.
+
plot_multiple_weighted_forecasts scale_y_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.
How do the weighted ensembles compare to the simple ensembles? Which do you think is best? Are you surprised by the results? Can you think of any reasons that would explain them?
Models weights
Now that we have a weighted ensemble, we can also look at the weights of the individual models. Here we do this for the quantile regression averaging ensemble but we could also do this for the inverse WIS ensemble or any other weighted ensemble (for an unweighted ensemble the weights are all equal).
|>
qra_weights ggplot(aes(x = target_day, y = weight, fill = model)) +
geom_col(position = "stack") +
theme(legend.position = "bottom")
How do the weights change over time? Are you surprised by the results given what you know about the models performance?
How do the weights change over time?
- Early on the more statistical models have higher weights
- Gradually the random walk model gains weight and by the end of the forecast horizon it represents the entire ensemble.
- Near the peak the mechanistic model also gains weight.
Are you surprised by the results given what you know about the models performance?
- As the random walk model is performing poorly, you would expect it to have low weights but actually it often doesn’t. This implies that its poor performance is restricted to certain parts of the outbreak.
- Even though the mechanistic model performs really well overall it is only included in the ensemble around the peak. This could be because the training data doesn’t include changes in the epidemic dynamics and so the mechanistic model is not given sufficient weight.
Evaluation
For a final evaluation we can look at the scores for each model and ensemble again. We remove the two weeks of forecasts as these do not have a quantile regression average forecasts as these require training data to estimate.
<- weighted_ensembles |>
weighted_ensemble_scores filter(target_day >= 29) |>
as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
score()
Again we can get a high level overview of the scores by model.
|>
weighted_ensemble_scores summarise_scores(by = c("model"))
model wis overprediction underprediction
<char> <num> <num> <num>
1: Inverse WIS ensemble 8.473266 4.422306 1.1072491
2: Quantile Regression Average 7.362562 3.040759 1.9483875
3: QRA by horizon 7.963665 3.409738 2.0659589
4: Median filtered ensemble 7.022126 2.985429 1.2601905
5: Mean ensemble 8.612259 4.452825 0.9873016
6: Median ensemble 10.681557 6.238667 1.0547619
7: Random walk 12.695476 7.961143 0.7709524
8: More statistical 11.263910 6.030190 1.4395238
9: More mechanistic 5.603486 1.571524 2.2731429
dispersion bias interval_coverage_50 interval_coverage_90 ae_median
<num> <num> <num> <num> <num>
1: 2.943712 0.16619048 0.5095238 0.8190476 12.681390
2: 2.373415 0.07190476 0.5047619 0.7952381 11.193606
3: 2.487968 0.13367347 0.4897959 0.7755102 12.005322
4: 2.776507 0.11476190 0.5523810 0.8761905 10.728571
5: 3.172132 0.16571429 0.5333333 0.8714286 13.020635
6: 3.388129 0.12238095 0.5476190 0.8714286 16.178571
7: 3.963381 0.19571429 0.5142857 0.8523810 19.085714
8: 3.794195 -0.04238095 0.5333333 0.8523810 16.780952
9: 1.758819 0.17809524 0.4952381 0.7857143 8.685714
Remembering the session on forecast evaluation, we should also check performance on the log scale.
<- weighted_ensembles |>
log_ensemble_scores filter(target_day >= 29) |>
as_forecast_quantile(forecast_unit = c("target_day", "horizon", "model")) |>
transform_forecasts(
fun = log_shift,
offset = 1,
append = FALSE
|>
) score()
|>
log_ensemble_scores summarise_scores(by = c("model"))
model wis overprediction underprediction
<char> <num> <num> <num>
1: Inverse WIS ensemble 0.1738246 0.08278544 0.03090237
2: Quantile Regression Average 0.1770130 0.08663305 0.03016151
3: QRA by horizon 0.1657445 0.09253210 0.02377453
4: Median filtered ensemble 0.1549876 0.07138677 0.02566016
5: Mean ensemble 0.1689778 0.07928707 0.02702292
6: Median ensemble 0.1889663 0.08160408 0.03772979
7: Random walk 0.2046647 0.09644129 0.03538436
8: More statistical 0.2160592 0.06803280 0.06550430
9: More mechanistic 0.1598201 0.09859516 0.02048076
dispersion bias interval_coverage_50 interval_coverage_90 ae_median
<num> <num> <num> <num> <num>
1: 0.06013681 0.16619048 0.5095238 0.8190476 0.2634627
2: 0.06021847 0.07190476 0.5047619 0.7952381 0.2597085
3: 0.04943783 0.13826531 0.4948980 0.7755102 0.2405630
4: 0.05794062 0.11476190 0.5523810 0.8761905 0.2364991
5: 0.06266781 0.16571429 0.5333333 0.8714286 0.2565306
6: 0.06963244 0.12238095 0.5476190 0.8714286 0.2876486
7: 0.07283902 0.19571429 0.5142857 0.8523810 0.3108524
8: 0.08252213 -0.04238095 0.5333333 0.8523810 0.3193459
9: 0.04074423 0.17809524 0.4952381 0.7857143 0.2396753
How do the weighted ensembles compare to the simple ensembles on the natural and log scale?
The best ensembles slightly outperform some of the simple ensembles but there is no obvious benefit from using weighted ensembles. Why might this be the case?