Confidence Intervals
This notebook overviews scalecast intervals, which are scored with Mean Scaled Interval Score (MSIS). Lower scores are better. This notebook requires scalecast>=0.18.0
.
Scalecast uses a naive conformal interval, created from holding out a test-set of data and setting interval ranges from the percentiles of the test-set residuals. This simple approach allows every estimator type to get the same kind of interval. As we will see by scoring the intervals and comparing them to ARIMA intervals from statsmodels, the results are good.
To evaluate the intervals, we leave out a section of each series to score out-of-sample. This is usually not necessary for scalecast, as all models are tested automatically, but the confidence intervals can overfit on any test set stored in the Forecaster
or MVForecaster
object due to leakage that occurs when constructing these intervals. Scalecast intervals are compared to ARIMA intervals on the same series in the last section of this notebook and the scalecast intervals, on the whole,
perform better. The series used in this example are ordered from easiest-to-hardest to forecast.
Easy Distribution-Free Conformal Intervals for Time Series
[1]:
import pandas as pd
import numpy as np
from scalecast.Forecaster import Forecaster
from scalecast import GridGenerator
from scalecast.util import metrics
from scalecast.notebook import tune_test_forecast
from scalecast.SeriesTransformer import SeriesTransformer
import matplotlib.pyplot as plt
import seaborn as sns
import time
from tqdm.notebook import tqdm
[2]:
import warnings
warnings.simplefilter(action='ignore', category=DeprecationWarning)
[3]:
models = (
'mlr',
'elasticnet',
'ridge',
'knn',
'xgboost',
'lightgbm',
'gbt',
) # these are all scikit-learn models or APIs
# this will be used later to fill in results
results_template = pd.DataFrame(index=models)
[4]:
def score_cis(results, fcsts, ci_name, actuals, obs, val_len, models=models, m_=1):
for m in models:
results.loc[m,ci_name] = metrics.msis(
a = actuals,
uf = fcsts[m+'_upperci'],
lf = fcsts[m+'_lowerci'],
obs = obs,
m = m_,
)
return results
[5]:
GridGenerator.get_example_grids()
Daily Website Visitors
Link to data: https://www.kaggle.com/datasets/bobnau/daily-website-visitors
We will use a length of 180 observations (about half a year) both to tune and test the models
We want to optimize the forecasts and confidence intervals for a 60-day forecast horizon
[6]:
val_len = 180
fcst_len = 60
[7]:
data = pd.read_csv('daily-website-visitors.csv',parse_dates=['Date']).set_index('Date')
data.head()
[7]:
Row | Day | Day.Of.Week | Page.Loads | Unique.Visits | First.Time.Visits | Returning.Visits | |
---|---|---|---|---|---|---|---|
Date | |||||||
2014-09-14 | 1 | Sunday | 1 | 2,146 | 1,582 | 1430 | 152 |
2014-09-15 | 2 | Monday | 2 | 3,621 | 2,528 | 2297 | 231 |
2014-09-16 | 3 | Tuesday | 3 | 3,698 | 2,630 | 2352 | 278 |
2014-09-17 | 4 | Wednesday | 4 | 3,667 | 2,614 | 2327 | 287 |
2014-09-18 | 5 | Thursday | 5 | 3,316 | 2,366 | 2130 | 236 |
[8]:
visits_sep = data['First.Time.Visits'].iloc[-fcst_len:]
visits = data['First.Time.Visits'].iloc[:-fcst_len]
[9]:
f=Forecaster(
y=visits,
current_dates=visits.index,
future_dates=fcst_len,
test_length = val_len,
validation_length = val_len, # for hyperparameter tuning
cis = True, # set to True at initialization to always evaluate cis
)
f.auto_Xvar_select(
estimator='elasticnet',
alpha=.2,
max_ar=100,
monitor='ValidationMetricValue',
)
f
[9]:
Forecaster(
DateStartActuals=2014-09-14T00:00:00.000000000
DateEndActuals=2020-06-20T00:00:00.000000000
Freq=D
N_actuals=2107
ForecastLength=60
Xvars=['t', 'AR1', 'AR2', 'AR3', 'AR4', 'AR5', 'AR6', 'AR7', 'AR8', 'AR9', 'AR10', 'AR11', 'AR12', 'AR13', 'AR14', 'AR15', 'AR16', 'AR17', 'AR18', 'AR19', 'AR20', 'AR21', 'AR22', 'AR23', 'AR24', 'AR25', 'AR26', 'AR27', 'AR28', 'AR29', 'AR30', 'AR31', 'AR32', 'AR33', 'AR34', 'AR35', 'AR36', 'AR37', 'AR38', 'AR39', 'AR40', 'AR41', 'AR42', 'AR43', 'AR44', 'AR45', 'AR46', 'AR47', 'AR48', 'AR49', 'AR50', 'AR51', 'AR52', 'AR53', 'AR54', 'AR55', 'AR56', 'AR57', 'AR58', 'AR59', 'AR60', 'AR61', 'AR62', 'AR63', 'AR64', 'AR65', 'AR66', 'AR67', 'AR68', 'AR69', 'AR70', 'AR71', 'AR72', 'AR73', 'AR74', 'AR75', 'AR76', 'AR77', 'AR78', 'AR79', 'AR80', 'AR81', 'AR82', 'AR83', 'AR84', 'AR85', 'AR86', 'AR87', 'AR88', 'AR89', 'AR90', 'AR91', 'AR92', 'AR93', 'AR94', 'AR95', 'AR96', 'AR97', 'AR98', 'AR99', 'AR100']
TestLength=180
ValidationMetric=rmse
ForecastsEvaluated=[]
CILevel=0.95
CurrentEstimator=mlr
GridsFile=Grids
)
[10]:
tune_test_forecast(
f,
models,
)
Finished loading model, total used 250 iterations
Finished loading model, total used 250 iterations
Finished loading model, total used 250 iterations
[11]:
ms = f.export('model_summaries',determine_best_by='TestSetRMSE')
ms[['ModelNickname','TestSetRMSE','InSampleRMSE']]
[11]:
ModelNickname | TestSetRMSE | InSampleRMSE | |
---|---|---|---|
0 | xgboost | 532.374846 | 14.496801 |
1 | lightgbm | 582.261645 | 112.543232 |
2 | knn | 788.704950 | 306.828988 |
3 | elasticnet | 794.649818 | 186.616620 |
4 | ridge | 798.711718 | 187.026325 |
5 | mlr | 803.931314 | 185.620629 |
6 | gbt | 1171.782785 | 147.478748 |
We will demonstrate how the confidence intervals change as they are re-evaluated using the best model according to the test RMSE.
Evaluate Interval
[12]:
fig, ax = plt.subplots(figsize=(12,6))
f.plot(ci=True,models='top_1',order_by='TestSetRMSE',ax=ax)
sns.lineplot(
y = 'First.Time.Visits',
x = 'Date',
data = visits_sep.reset_index(),
ax = ax,
label = 'held-out actuals',
color = 'green',
alpha = 0.7,
)
plt.show()
[13]:
# export test-set preds and confidence intervals
fcsts1 = f.export("lvl_fcsts",cis=True)
fcsts1.head()
[13]:
DATE | mlr | mlr_upperci | mlr_lowerci | elasticnet | elasticnet_upperci | elasticnet_lowerci | ridge | ridge_upperci | ridge_lowerci | ... | knn_lowerci | xgboost | xgboost_upperci | xgboost_lowerci | lightgbm | lightgbm_upperci | lightgbm_lowerci | gbt | gbt_upperci | gbt_lowerci | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-06-21 | 2190.555279 | 3464.907755 | 916.202804 | 2194.962019 | 3452.970789 | 936.953250 | 2191.868075 | 3464.239858 | 919.496293 | ... | 812.176316 | 2039.648438 | 3098.496887 | 980.799988 | 2198.858468 | 3232.664729 | 1165.052207 | 2125.968510 | 4021.476092 | 230.460929 |
1 | 2020-06-22 | 2835.674158 | 4110.026633 | 1561.321683 | 2823.011263 | 4081.020033 | 1565.002493 | 2831.886812 | 4104.258594 | 1559.515029 | ... | 1612.281579 | 2743.368896 | 3802.217346 | 1684.520447 | 2820.575637 | 3854.381898 | 1786.769376 | 2729.050006 | 4624.557587 | 833.542425 |
2 | 2020-06-23 | 2975.510723 | 4249.863198 | 1701.158248 | 2944.846465 | 4202.855235 | 1686.837696 | 2972.183687 | 4244.555469 | 1699.811904 | ... | 1804.176316 | 2851.803711 | 3910.652161 | 1792.955261 | 2819.654176 | 3853.460437 | 1785.847915 | 2880.262061 | 4775.769642 | 984.754480 |
3 | 2020-06-24 | 3038.258157 | 4312.610632 | 1763.905681 | 3006.932564 | 4264.941333 | 1748.923794 | 3033.992818 | 4306.364600 | 1761.621035 | ... | 1673.860526 | 2905.132568 | 3963.981018 | 1846.284119 | 2824.992206 | 3858.798467 | 1791.185945 | 2930.293895 | 4825.801476 | 1034.786314 |
4 | 2020-06-25 | 3032.580568 | 4306.933043 | 1758.228093 | 3011.122478 | 4269.131247 | 1753.113708 | 3021.274091 | 4293.645873 | 1748.902309 | ... | 1475.386842 | 2955.912598 | 4014.761047 | 1897.064148 | 2961.588543 | 3995.394805 | 1927.782282 | 3029.007375 | 4924.514956 | 1133.499794 |
5 rows × 22 columns
The values in the below table are mean scaled interval scores for confidence intervals. Lower scores are better.
[14]:
results = score_cis(
results_template.copy(),
fcsts1,
'Daily Visitors',
visits_sep,
visits,
val_len = val_len,
)
results
[14]:
Daily Visitors | |
---|---|
mlr | 5.632456 |
elasticnet | 5.548336 |
ridge | 5.622506 |
knn | 5.639715 |
xgboost | 5.597762 |
lightgbm | 5.196714 |
gbt | 8.319963 |
Housing Starts
Link to data: https://fred.stlouisfed.org/series/HOUSTNSA
We will use a length of 96 observations (8 years) both to tune and test the models
We want to optimize the forecasts and confidence intervals for a 24-month forecast horizon
[15]:
import pandas_datareader as pdr
[16]:
val_len = 96
fcst_len = 24
[17]:
housing = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-06-01')
housing.head()
[17]:
HOUSTNSA | |
---|---|
DATE | |
1959-01-01 | 96.2 |
1959-02-01 | 99.0 |
1959-03-01 | 127.7 |
1959-04-01 | 150.8 |
1959-05-01 | 152.5 |
[18]:
starts_sep = housing.iloc[-fcst_len:,0]
starts = housing.iloc[:-fcst_len,0]
[19]:
f = Forecaster(
y=starts,
current_dates=starts.index,
future_dates=fcst_len,
test_length=val_len,
validation_length=val_len,
cis=True,
)
# difference the data for stationary modeling
transformer = SeriesTransformer(f)
f = transformer.DiffTransform(1)
# find best xvars to forecast with
f.auto_Xvar_select(
estimator='elasticnet',
alpha=.2,
max_ar=100,
monitor='ValidationMetricValue', # not test set
)
f
[19]:
Forecaster(
DateStartActuals=1959-02-01T00:00:00.000000000
DateEndActuals=2019-06-01T00:00:00.000000000
Freq=MS
N_actuals=725
ForecastLength=24
Xvars=['monthsin', 'monthcos', 'quartersin', 'quartercos', 'AR1', 'AR2', 'AR3', 'AR4', 'AR5', 'AR6', 'AR7', 'AR8', 'AR9', 'AR10', 'AR11', 'AR12', 'AR13', 'AR14', 'AR15', 'AR16', 'AR17', 'AR18', 'AR19', 'AR20', 'AR21', 'AR22', 'AR23', 'AR24', 'AR25', 'AR26', 'AR27', 'AR28', 'AR29', 'AR30', 'AR31', 'AR32', 'AR33', 'AR34', 'AR35', 'AR36', 'AR37', 'AR38', 'AR39', 'AR40', 'AR41', 'AR42', 'AR43', 'AR44', 'AR45', 'AR46', 'AR47', 'AR48', 'AR49', 'AR50', 'AR51', 'AR52', 'AR53', 'AR54', 'AR55', 'AR56', 'AR57', 'AR58', 'AR59', 'AR60', 'AR61', 'AR62', 'AR63', 'AR64', 'AR65', 'AR66', 'AR67', 'AR68', 'AR69', 'AR70', 'AR71', 'AR72', 'AR73', 'AR74', 'AR75', 'AR76', 'AR77', 'AR78', 'AR79', 'AR80', 'AR81', 'AR82', 'AR83', 'AR84', 'AR85', 'AR86', 'AR87', 'AR88', 'AR89', 'AR90', 'AR91', 'AR92', 'AR93', 'AR94', 'AR95', 'AR96', 'AR97', 'AR98', 'AR99', 'AR100']
TestLength=96
ValidationMetric=rmse
ForecastsEvaluated=[]
CILevel=0.95
CurrentEstimator=mlr
GridsFile=Grids
)
[20]:
tune_test_forecast(
f,
models,
dynamic_testing = fcst_len,
)
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
[21]:
# rever the difftransform
f = transformer.DiffRevert(1)
[22]:
ms = f.export('model_summaries',determine_best_by='TestSetRMSE')
ms[['ModelNickname','TestSetRMSE','InSampleRMSE']]
[22]:
ModelNickname | TestSetRMSE | InSampleRMSE | |
---|---|---|---|
0 | xgboost | 21.402469 | 3.112793 |
1 | lightgbm | 23.160361 | 27.660879 |
2 | mlr | 30.059747 | 71.624715 |
3 | gbt | 38.185841 | 38.657789 |
4 | knn | 42.432505 | 146.502991 |
5 | ridge | 46.457215 | 55.969584 |
6 | elasticnet | 53.111686 | 38.237499 |
Evaluate Interval
[23]:
fig, ax = plt.subplots(figsize=(12,6))
f.plot(ci=True,models='top_1',order_by='TestSetRMSE',ax=ax)
sns.lineplot(
y = 'HOUSTNSA',
x = 'DATE',
data = starts_sep.reset_index(),
ax = ax,
label = 'held-out actuals',
color = 'green',
alpha = 0.7,
)
plt.show()
[24]:
housing_fcsts1 = f.export("lvl_fcsts",cis=True)
housing_results = score_cis(
results_template.copy(),
housing_fcsts1,
'Housing Starts',
starts_sep,
starts,
val_len = val_len,
m_ = 12, # monthly seasonality
)
housing_results
[24]:
Housing Starts | |
---|---|
mlr | 5.661675 |
elasticnet | 8.247183 |
ridge | 7.432564 |
knn | 6.552641 |
xgboost | 3.930183 |
lightgbm | 4.289493 |
gbt | 5.884677 |
[25]:
results['Housing Starts'] = housing_results['Housing Starts']
results
[25]:
Daily Visitors | Housing Starts | |
---|---|---|
mlr | 5.632456 | 5.661675 |
elasticnet | 5.548336 | 8.247183 |
ridge | 5.622506 | 7.432564 |
knn | 5.639715 | 6.552641 |
xgboost | 5.597762 | 3.930183 |
lightgbm | 5.196714 | 4.289493 |
gbt | 8.319963 | 5.884677 |
Avocado Sales
Link to data: https://www.kaggle.com/datasets/neuromusic/avocado-prices
We will use a length of 20 observations both to tune and test the models (95% CIs require at least 20 observations in the test set)
We want to optimize the forecasts and confidence intervals for a 20-week forecast horizon
[26]:
# change display settings
pd.options.display.float_format = '{:,.2f}'.format
[27]:
val_len = 20
fcst_len = 20
[28]:
avocados = pd.read_csv('avocado.csv',parse_dates = ['Date'])
volume = avocados.groupby('Date')['Total Volume'].sum()
[29]:
volume.reset_index().head()
[29]:
Date | Total Volume | |
---|---|---|
0 | 2015-01-04 | 84,674,337.20 |
1 | 2015-01-11 | 78,555,807.24 |
2 | 2015-01-18 | 78,388,784.08 |
3 | 2015-01-25 | 76,466,281.07 |
4 | 2015-02-01 | 119,453,235.25 |
[30]:
volume_sep = volume.iloc[-fcst_len:]
volume = volume.iloc[:-fcst_len]
[31]:
f = Forecaster(
y = volume,
current_dates = volume.index,
future_dates = fcst_len,
test_length = val_len,
validation_length = val_len,
cis = True,
)
# difference the data for stationary modeling
transformer = SeriesTransformer(f)
f = transformer.DiffTransform(1)
f = transformer.DiffTransform(52) # seasonal differencing
# find best xvars
f.auto_Xvar_select(
estimator='elasticnet',
alpha=.2,
max_ar=26,
monitor='ValidationMetricValue', # not test set
decomp_trend=False,
)
f
[31]:
Forecaster(
DateStartActuals=2016-01-10T00:00:00.000000000
DateEndActuals=2017-11-05T00:00:00.000000000
Freq=W-SUN
N_actuals=96
ForecastLength=20
Xvars=['weeksin', 'weekcos', 'AR1', 'AR2', 'AR3', 'AR4', 'AR5', 'AR6', 'AR7', 'AR8', 'AR9']
TestLength=20
ValidationMetric=rmse
ForecastsEvaluated=[]
CILevel=0.95
CurrentEstimator=mlr
GridsFile=Grids
)
[32]:
tune_test_forecast(
f,
models,
dynamic_testing = fcst_len,
)
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
[33]:
# revert differencing
f = transformer.DiffRevert(52)
f = transformer.DiffRevert(1)
[34]:
ms = f.export('model_summaries',determine_best_by='TestSetRMSE')
ms[['ModelNickname','TestSetRMSE','InSampleRMSE']]
[34]:
ModelNickname | TestSetRMSE | InSampleRMSE | |
---|---|---|---|
0 | knn | 12,792,100.70 | 14,090,078.03 |
1 | ridge | 13,457,205.75 | 23,224,298.55 |
2 | elasticnet | 13,532,378.91 | 23,130,014.21 |
3 | mlr | 13,532,379.31 | 23,130,012.59 |
4 | lightgbm | 15,265,692.24 | 21,365,608.04 |
5 | gbt | 20,539,987.16 | 18,023,127.21 |
6 | xgboost | 21,487,561.26 | 12,876.01 |
Evaluate Interval
[35]:
fig, ax = plt.subplots(figsize=(12,6))
f.plot(ci=True,models='top_1',order_by='TestSetRMSE',ax=ax)
sns.lineplot(
y = 'Total Volume',
x = 'Date',
data = volume_sep.reset_index(),
ax = ax,
label = 'held-out actuals',
color = 'green',
alpha = 0.7,
)
plt.show()
[36]:
avc_fcsts1 = f.export("lvl_fcsts",cis=True)
avc_results = score_cis(
results_template.copy(),
avc_fcsts1,
'Avocados',
volume_sep,
volume,
val_len = val_len,
models = models,
)
avc_results
[36]:
Avocados | |
---|---|
mlr | 8.55 |
elasticnet | 8.55 |
ridge | 8.48 |
knn | 19.70 |
xgboost | 13.74 |
lightgbm | 19.81 |
gbt | 10.65 |
[37]:
results['Avocados'] = avc_results['Avocados']
results
[37]:
Daily Visitors | Housing Starts | Avocados | |
---|---|---|---|
mlr | 5.63 | 5.66 | 8.55 |
elasticnet | 5.55 | 8.25 | 8.55 |
ridge | 5.62 | 7.43 | 8.48 |
knn | 5.64 | 6.55 | 19.70 |
xgboost | 5.60 | 3.93 | 13.74 |
lightgbm | 5.20 | 4.29 | 19.81 |
gbt | 8.32 | 5.88 | 10.65 |
All Aggregated Results
[38]:
_, ax = plt.subplots(figsize=(12,6))
sns.heatmap(
results,
annot=True,
fmt='.2f',
cmap="Spectral_r",
ax=ax
)
plt.show()
For the most part, the linear models set the best intervals and the boosted tree models were very good or very bad, with GBT more towards the bad side.
Benchmark Against StatsModels ARIMA
Confidence intervals come from StatsModels but the auto-ARIMA process is from PMDARIMA.
[39]:
from scalecast.auxmodels import auto_arima
[40]:
all_series = {
# series,out-of-sample series,seasonal step
'visitors':[visits,visits_sep,1],
'housing starts':[starts,starts_sep,12],
'avocados':[volume,volume_sep,1]
}
arima_conformal_results = pd.DataFrame()
arima_sm_results = pd.DataFrame()
[41]:
for k, v in all_series.items():
print(k)
f = Forecaster(
y=v[0],
current_dates=v[0].index,
future_dates=len(v[1]),
test_length = len(v[1]),
cis=True,
)
auto_arima(f,m=v[2])
arima_results = f.export("lvl_fcsts",cis=True)
# scalecast intervals
arima_conformal_results.loc[k,'MSIS'] = metrics.msis(
a = v[1].values,
uf = arima_results['auto_arima_upperci'].values,
lf = arima_results['auto_arima_lowerci'].values,
obs = v[0].values,
m = v[2],
)
# statsmodels intervals
cis = f.regr.get_forecast(len(v[1])).conf_int()
arima_sm_results.loc[k,'MSIS'] = metrics.msis(
a = v[1].values,
uf = cis.T[1],
lf = cis.T[0],
obs = v[0].values,
m = v[2],
)
visitors
housing starts
avocados
MSIS Results - ARIMA Scalecast
[42]:
# results from the scalecast intervals
arima_conformal_results
[42]:
MSIS | |
---|---|
visitors | 4.97 |
housing starts | 24.90 |
avocados | 23.15 |
[43]:
arima_conformal_results.mean()
[43]:
MSIS 17.67
dtype: float64
MSIS Results - ARIMA StatsModels
[44]:
# results from the statsmodels intervals
arima_sm_results
[44]:
MSIS | |
---|---|
visitors | 5.80 |
housing starts | 5.62 |
avocados | 19.94 |
[45]:
arima_sm_results.mean()
[45]:
MSIS 10.46
dtype: float64
[46]:
all_results = results.copy()
all_results.loc['arima (conformal)'] = arima_conformal_results.T.values[0]
all_results.loc['arima (statsmodels) - benchmark'] = arima_sm_results.T.values[0]
all_results = all_results.T
all_results
[46]:
mlr | elasticnet | ridge | knn | xgboost | lightgbm | gbt | arima (conformal) | arima (statsmodels) - benchmark | |
---|---|---|---|---|---|---|---|---|---|
Daily Visitors | 5.63 | 5.55 | 5.62 | 5.64 | 5.60 | 5.20 | 8.32 | 4.97 | 5.80 |
Housing Starts | 5.66 | 8.25 | 7.43 | 6.55 | 3.93 | 4.29 | 5.88 | 24.90 | 5.62 |
Avocados | 8.55 | 8.55 | 8.48 | 19.70 | 13.74 | 19.81 | 10.65 | 23.15 | 19.94 |
[47]:
def highlight_rows(row):
ret_row = ['']*all_results.shape[1]
for i, c in enumerate(all_results.iloc[:,:-1]):
if row[c] < row['arima (statsmodels) - benchmark']:
ret_row[i] = 'background-color: lightgreen;'
else:
ret_row[i] = 'background-color: lightcoral;'
return ret_row
all_results.style.apply(
highlight_rows,
axis=1,
)
[47]:
mlr | elasticnet | ridge | knn | xgboost | lightgbm | gbt | arima (conformal) | arima (statsmodels) - benchmark | |
---|---|---|---|---|---|---|---|---|---|
Daily Visitors | 5.632456 | 5.548336 | 5.622506 | 5.639715 | 5.597762 | 5.196714 | 8.319963 | 4.966161 | 5.796559 |
Housing Starts | 5.661675 | 8.247183 | 7.432564 | 6.552641 | 3.930183 | 4.289493 | 5.884677 | 24.897373 | 5.624803 |
Avocados | 8.551783 | 8.551783 | 8.478400 | 19.700378 | 13.743804 | 19.810258 | 10.654140 | 23.151315 | 19.944485 |
The above table shows which scalecast intervals performed better or worse than the ARIMA interval. Green scores are better, red are worse. The scalecast conformal intervals were on the whole better, but not always. When comparing ARIMA to ARIMA, the confidence intervals from statsmodels performed slightly worse on one series, slightly better on another, and significantly better on the remaining one. The rest of the models run through scalecast usually beat the ARIMA intervals on all datasets, except housing starts. XGBoost and Lightgbm were the only models to always beat the intervals from StatsModels. On the whole, the scalecast interval has a nice showing against the more traditional interval from ARIMA in the StatsModels package.
[ ]: