VECM

[1]:
import pandas as pd
import numpy as np
from pandas_datareader import data as pdr
import yfinance as yf
from scalecast.Forecaster import Forecaster
from scalecast.MVForecaster import MVForecaster
from scalecast.Pipeline import Transformer, Reverter, MVPipeline
from scalecast.util import (
    find_optimal_lag_order,
    find_optimal_coint_rank,
    Forecaster_with_missing_vals,
)
from scalecast.auxmodels import vecm
from scalecast.multiseries import export_model_summaries
from scalecast import GridGenerator
import matplotlib.pyplot as plt
[2]:
yf.pdr_override()

Download Data Using a Public API

[3]:
FANG = [
    'META',
    'AMZN',
    'NFLX',
    'GOOG',
]

fs = []
for sym in FANG:
    df = pdr.get_data_yahoo(sym)
    # the API doesn't return data at exactly business-day frequency,
    # so we can correct it with this utility function
    f = Forecaster_with_missing_vals(
        y=df['Close'],
        current_dates = df.index,
        future_dates = 65,
        end = '2022-09-30',
        desired_frequency = 'B',
        fill_strategy = 'linear_interp',
        add_noise = True,
        noise_lookback = 5,
    )
    fs.append(f)

mvf = MVForecaster(*fs,names=FANG,test_length=65)
mvf.set_validation_metric('rmse')
mvf.add_sklearn_estimator(vecm,'vecm')

mvf
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[3]:
MVForecaster(
    DateStartActuals=2012-05-18T00:00:00.000000000
    DateEndActuals=2023-08-03T00:00:00.000000000
    Freq=B
    N_actuals=2925
    N_series=4
    SeriesNames=['META', 'AMZN', 'NFLX', 'GOOG']
    ForecastLength=65
    Xvars=[]
    TestLength=65
    ValidationLength=1
    ValidationMetric=rmse
    ForecastsEvaluated=[]
    CILevel=None
    CurrentEstimator=mlr
    OptimizeOn=mean
    GridsFile=MVGrids
)
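
For context, the business-day correction above boils down to a reindex-and-interpolate in pandas. A minimal sketch, assuming df is one of the frames returned by the API and ignoring the noise-adding step:

# hypothetical illustration of desired_frequency='B' + fill_strategy='linear_interp':
# snap the index to business-day frequency, then fill the introduced gaps linearly
close = df['Close'].asfreq('B').interpolate(method='linear')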
[4]:
mvf.plot()
plt.show()
../_images/vecm_vecm_5_0.png

Augmented Dickey-Fuller Tests to Confirm the Series Are I(1)

[5]:
for stock, f in zip(FANG,fs):
    adf_result = f.adf_test(full_res=True)
    print('the stock {} is {}stationary at level'.format(
        stock,
        'not ' if adf_result[1] > 0.05 else ''
    ))
the stock META is not stationary at level
the stock AMZN is not stationary at level
the stock NFLX is not stationary at level
the stock GOOG is not stationary at level
[6]:
for stock, f in zip(FANG,fs):
    adf_result = f.adf_test(diffy=True,full_res=True)
    print('the stock {} is {}stationary at its first difference'.format(
        stock,
        'not ' if adf_result[1] > 0.05 else ''
    ))
the stock META is stationary at its first difference
the stock AMZN is stationary at its first difference
the stock NFLX is stationary at its first difference
the stock GOOG is stationary at its first difference
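
For reference, adf_test wraps the standard ADF test from statsmodels. A minimal standalone sketch of the same two checks, assuming df['Close'] is one of the downloaded series:

from statsmodels.tsa.stattools import adfuller

# p-value > 0.05 => fail to reject a unit root (non-stationary)
stat, pval, *_ = adfuller(df['Close'].dropna())
print(f'p-value at level: {pval:.4f}')
stat, pval, *_ = adfuller(df['Close'].diff().dropna())
print(f'p-value at first difference: {pval:.4f}')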

Measure IC to Find Optimal Lag Order

  • the lag order selected here is used to run the cointegration test below

[7]:
lag_test = find_optimal_lag_order(mvf,train_only=True)
pd.DataFrame(
    {
        'aic':lag_test.aic,
        'bic':lag_test.bic,
        'hqic':lag_test.hqic,
        'fpe':lag_test.fpe,
    },
    index = ['optimal lag order'],
).T
[7]:
      optimal lag order
aic                  27
bic                   1
hqic                  3
fpe                  27
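
The same lag-order search is exposed directly by statsmodels. A minimal sketch, assuming each Forecaster's observed values are available as f.y and stacking them into a hypothetical prices DataFrame restricted to the training rows:

import pandas as pd
from statsmodels.tsa.vector_ar.vecm import select_order

prices = pd.concat([pd.Series(f.y.values) for f in fs], axis=1)
prices.columns = FANG
prices = prices.iloc[:-65]  # hold out the 65-observation test set

lag_res = select_order(prices, maxlags=30)  # search lag orders up to 30 by IC
print(lag_res.aic, lag_res.bic, lag_res.hqic, lag_res.fpe)  # order chosen by each criterion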

Johansen Cointegration Test

[8]:
coint_res = find_optimal_coint_rank(
    mvf,
    det_order=1,
    k_ar_diff=10,
    train_only=True,
)
print(coint_res)
coint_res.rank
Johansen cointegration test using trace test statistic with 5% significance level
=====================================
r_0 r_1 test statistic critical value
-------------------------------------
  0   4          56.60          55.25
  1   4          33.50          35.01
-------------------------------------
[8]:
1

We found a cointegration rank of 1.
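
The trace test is also available directly in statsmodels; a minimal sketch, reusing the hypothetical prices DataFrame from the lag-order sketch above:

from statsmodels.tsa.vector_ar.vecm import select_coint_rank

rank_res = select_coint_rank(prices, det_order=1, k_ar_diff=10, method='trace', signif=0.05)
print(rank_res.rank)  # should agree with find_optimal_coint_rank above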

Run VECM

  • Now, we can specify a grid that tries lag orders 1 through 65, several deterministic terms, seasonal periods, and cointegration ranks of 0 and 1. The deterministic codes follow the statsmodels VECM convention: 'n' = none, 'ci'/'co' = a constant inside/outside the cointegration relation, 'li'/'lo' = a linear trend inside/outside it, plus combinations such as 'cili' and 'colo'.

[9]:
vecm_grid = dict(
    lags = [0], # required to set this to 0 for the vecm model in scalecast
    freq = ['B'], # only necessary to suppress a warning
    k_ar_diff = range(1,66), # lag orders to try
    coint_rank = [0,1],
    deterministic = ["n","co","lo","li","cili","colo"],
    seasons = [0,5,30,65,260],
)

mvf.set_estimator('vecm')
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast()

results = mvf.export('model_summaries')
results[[
    'ModelNickname',
    'Series',
    'TestSetRMSE',
    'TestSetMAE',
]]
Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen parameters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 28, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 5}.
[9]:
  ModelNickname Series  TestSetRMSE  TestSetMAE
0          vecm   META    50.913814   44.133838
1          vecm   AMZN    23.752750   22.348669
2          vecm   NFLX   113.705564  105.310047
3          vecm   GOOG    18.846431   18.041912
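
Under the hood, the vecm estimator delegates to the statsmodels VECM model, so the chosen specification can also be fit directly. A minimal sketch, again reusing the hypothetical prices DataFrame:

from statsmodels.tsa.vector_ar.vecm import VECM

res = VECM(
    prices,
    k_ar_diff=28,        # the hyperparameters chosen by cross validation above
    coint_rank=1,
    deterministic='li',
    seasons=5,
).fit()
print(res.predict(steps=65))  # 65-step-ahead point forecasts for all four series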

View VECM Results

[10]:
results['TestSetRMSE'].mean()
[10]:
51.80463965264752
[11]:
mvf.export_validation_grid('vecm').sample(15)
[11]:
    lags freq  k_ar_diff  coint_rank deterministic  seasons  Fold0Metric  Fold1Metric  Fold2Metric  AverageMetric MetricEvaluated Optimized On
71     0    B         58           0          colo        0   162.599874    44.268484    26.594518      77.820958            rmse         mean
40     0    B          7           0            li      260          NaN          NaN          NaN            NaN            rmse         mean
3      0    B         54           1            lo       30   110.553065    51.443275    19.154490      60.383610            rmse         mean
30     0    B         29           0            co       30   112.035140    44.427934    18.598329      58.353801            rmse         mean
76     0    B          1           1          cili       65    93.945660    51.090754    21.836356      55.624257            rmse         mean
87     0    B         50           0            lo      260   163.152289    46.616014    13.327631      74.365311            rmse         mean
10     0    B         44           0          cili       65          NaN          NaN          NaN            NaN            rmse         mean
9      0    B         47           0             n        0    86.212252    55.639330    36.182421      59.344668            rmse         mean
94     0    B         38           1          colo        5   139.580751    41.852200    24.055439      68.496130            rmse         mean
43     0    B         55           1          colo      260   112.237679    43.782141    63.595936      73.205252            rmse         mean
73     0    B         40           0            lo        0   159.422501    46.642590    18.090453      74.718515            rmse         mean
62     0    B         18           1             n        5   166.241648    54.618358    30.753820      83.871275            rmse         mean
63     0    B         12           0            co      260   108.180519    44.685347    23.286315      58.717394            rmse         mean
60     0    B         21           1            li      260    82.029398    50.186756    31.553748      54.589967            rmse         mean
64     0    B         43           1            co        0   100.291180    46.496187    16.839982      54.542450            rmse         mean

Rows with NaN metrics are hyperparameter combinations that failed to evaluate during cross validation.
[12]:
mvf.plot_test_set(
    series='AMZN',
    models='vecm',
    include_train=130,
    figsize=(16,8)
)
plt.show()
../_images/vecm_vecm_19_0.png

Re-weight Evaluation Metrics and Rerun VECM

  • weighting each series by its share of total test-set RMSE makes the optimizer focus on the series that were hardest to predict (NFLX receives the largest weight)

[13]:
weights = results['TestSetRMSE'] / results['TestSetRMSE'].sum()
weights
[13]:
0    0.245701
1    0.114627
2    0.548723
3    0.090950
Name: TestSetRMSE, dtype: float64
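
The lambda passed to set_optimize_on below is just a weighted average of the four per-series metrics. An equivalent, more compact form using the weights computed above:

import numpy as np

weighted_metric = lambda x: float(np.dot(x, weights))  # hypothetical drop-in for the explicit lambda

Either version can be handed to set_optimize_on; the explicit spelling below makes the weighting transparent.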
[14]:
mvf.set_optimize_on(
    lambda x: (
        x[0]*weights[0] +
        x[1]*weights[1] +
        x[2]*weights[2] +
        x[3]*weights[3]
    )
)
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast(call_me='vecm_weighted')

results = mvf.export('model_summaries')
results[[
    'ModelNickname',
    'Series',
    'TestSetRMSE',
    'TestSetMAE',
]]
Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen parameters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 62, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 0}.
[14]:
   ModelNickname Series  TestSetRMSE  TestSetMAE
0           vecm   META    50.913814   44.133838
1  vecm_weighted   META    33.772797   28.520031
2           vecm   AMZN    23.752750   22.348669
3  vecm_weighted   AMZN    15.835426   14.617043
4           vecm   NFLX   113.705564  105.310047
5  vecm_weighted   NFLX    59.887563   53.210215
6           vecm   GOOG    18.846431   18.041912
7  vecm_weighted   GOOG    16.757831   16.121594
[15]:
results.loc[results['ModelNickname'] == 'vecm_weighted','TestSetRMSE'].mean()
[15]:
31.56340410699967

Weighting the optimizer improved the average test-set RMSE from 51.8 to 31.6!

[16]:
mvf.plot_test_set(
    series='META',
    models='all',
    include_train=130,
    figsize=(16,8)
)
plt.show()
../_images/vecm_vecm_25_0.png
[17]:
mvf.plot(
    series='all',
    models='all',
    figsize=(16,8)
)
plt.show()
../_images/vecm_vecm_26_0.png

Try Other MV Models

[31]:
GridGenerator.get_mv_grids()
# open MVGrids.py and manually change all lags arguments to range(1,66)
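
If you would rather not edit the file by hand, a hypothetical snippet like the following applies the same change programmatically; it assumes every grid in MVGrids.py spells its lag options as a list literal under the key 'lags':

import re

# rewrite every 'lags' entry in the generated grids file
# so each model searches lag orders 1 through 65
with open('MVGrids.py') as fh:
    src = fh.read()
src = re.sub(r"'lags'\s*:\s*\[[^\]]*\]", "'lags': list(range(1,66))", src)
with open('MVGrids.py', 'w') as fh:
    fh.write(src)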
[32]:
transformers = []
reverters = []
# build one DiffTransform/DiffRevert pair for each of the four series
for _ in FANG:
    transformer = Transformer(
        transformers = [('DiffTransform',)]
    )
    reverter = Reverter(
        reverters = [('DiffRevert',)],
        base_transformer = transformer,
    )
    transformers.append(transformer)
    reverters.append(reverter)
[33]:
def Xvar_select(f):
    f.set_validation_length(65)
    f.auto_Xvar_select(
        estimator='gbt',
        max_depth=2,
        max_ar=0, # in mv modeling, lags are a hyperparameter, not a regressor in the MVForecaster object
    )

def mvforecaster(mvf):
    models = (
        'mlr',
        'elasticnet',
        'gbt',
        'xgboost',
        'lightgbm',
        'knn',
    )
    mvf.set_test_length(65)
    mvf.tune_test_forecast(
        models,
        limit_grid_size=10,
        cross_validate=True,
        k=3,
    )
[34]:
pipeline = MVPipeline(
    steps = [
        ('Transform',transformers),
        ('Xvar Select',[Xvar_select]*4),
        ('Forecast',mvforecaster),
        ('Revert',reverters),
    ],
    names = FANG,
)
[35]:
fs = pipeline.fit_predict(*fs)
Finished loading model, total used 150 iterations
(the message above repeats 20 times)
[36]:
results = export_model_summaries(dict(zip(FANG,fs)))

View Results

[37]:
model_rmses = results.groupby('ModelNickname')['TestSetRMSE'].mean().sort_values().reset_index()
model_rmses
[37]:
  ModelNickname  TestSetRMSE
0       xgboost    27.918698
1           knn    41.832711
2           mlr    42.519262
3           gbt    43.895070
4    elasticnet    44.320649
5      lightgbm    44.554279

The above table is the mean RMSE performance of each model over all series.

[38]:
series_rmses = results.groupby('Series')['TestSetRMSE'].min().reset_index()
# look up which model produced each series' minimum test-set RMSE
series_rmses['Model'] = [
    results.loc[
        results['TestSetRMSE'] == i,
        'ModelNickname'
    ].values[0] for i in series_rmses['TestSetRMSE']
]
series_rmses
[38]:
  Series  TestSetRMSE    Model
0   AMZN    17.433315  xgboost
1   GOOG    12.892690  xgboost
2   META    10.509525  xgboost
3   NFLX    70.839262  xgboost
[39]:
series_rmses['TestSetRMSE'].mean()
[39]:
27.918697955026953

The above table shows the best model for each series and its test-set RMSE. Averaging those per-series winners gives an RMSE of 27.9, but leaning this heavily on the test set to choose models probably overfits it; a validation-based selection sketch follows.
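
One less leaky alternative, assuming the exported summaries carry the ValidationMetricValue column that scalecast's model_summaries export typically includes, is to pick each series' model by its cross-validation error and only then consult the test set:

# hypothetical: select per-series models by validation error, not test error
best_by_val = results.loc[
    results.groupby('Series')['ValidationMetricValue'].idxmin(),
    ['Series', 'ModelNickname', 'TestSetRMSE'],
]
print(best_by_val)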

[41]:
fs[1].plot_test_set(
    models='xgboost',
    include_train=130,
)
plt.title('AMZN Best Model Test Predictions')
plt.show()
../_images/vecm_vecm_40_0.png
[42]:
fs[1].plot(
    models='xgboost',
)
plt.title('AMZN Best Model Forecast')
plt.show()
../_images/vecm_vecm_41_0.png