VECM
Using a VECM to predict FANG stocks
See the VECM documentation
[1]:
import pandas as pd
import numpy as np
from pandas_datareader import data as pdr
import yfinance as yf
from scalecast.Forecaster import Forecaster
from scalecast.MVForecaster import MVForecaster
from scalecast.Pipeline import Transformer, Reverter, MVPipeline
from scalecast.util import (
find_optimal_lag_order,
find_optimal_coint_rank,
Forecaster_with_missing_vals,
)
from scalecast.auxmodels import vecm
from scalecast.multiseries import export_model_summaries
from scalecast import GridGenerator
import matplotlib.pyplot as plt
[2]:
yf.pdr_override()
Download data using a public API
[3]:
FANG = [
'META',
'AMZN',
'NFLX',
'GOOG',
]
fs = []
for sym in FANG:
df = pdr.get_data_yahoo(sym)
# since the api doesn't send the data exactly in Business-day frequency
# we can correct it using this function
f = Forecaster_with_missing_vals(
y=df['Close'],
current_dates = df.index,
future_dates = 65,
end = '2022-09-30',
desired_frequency = 'B',
fill_strategy = 'linear_interp',
add_noise = True,
noise_lookback = 5,
)
fs.append(f)
mvf = MVForecaster(*fs,names=FANG,test_length=65)
mvf.set_validation_metric('rmse')
mvf.add_sklearn_estimator(vecm,'vecm')
mvf
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
[3]:
MVForecaster(
DateStartActuals=2012-05-18T00:00:00.000000000
DateEndActuals=2023-08-03T00:00:00.000000000
Freq=B
N_actuals=2925
N_series=4
SeriesNames=['META', 'AMZN', 'NFLX', 'GOOG']
ForecastLength=65
Xvars=[]
TestLength=65
ValidationLength=1
ValidationMetric=rmse
ForecastsEvaluated=[]
CILevel=None
CurrentEstimator=mlr
OptimizeOn=mean
GridsFile=MVGrids
)
[4]:
mvf.plot()
plt.show()
Augmented Dickey Fuller Tests to Confirm Unit-1 Roots
[5]:
for stock, f in zip(FANG,fs):
adf_result = f.adf_test(full_res=True)
print('the stock {} is {}stationary at level'.format(
stock,
'not ' if adf_result[1] > 0.05 else ''
)
)
the stock META is not stationary at level
the stock AMZN is not stationary at level
the stock NFLX is not stationary at level
the stock GOOG is not stationary at level
[6]:
for stock, f in zip(FANG,fs):
adf_result = f.adf_test(diffy=True,full_res=True)
print('the stock {} is {}stationary at its first difference'.format(
stock,
'not ' if adf_result[1] > 0.05 else ''
)
)
the stock META is stationary at its first difference
the stock AMZN is stationary at its first difference
the stock NFLX is stationary at its first difference
the stock GOOG is stationary at its first difference
Measure IC to Find Optimal Lag Order
this is used to run the cointegration test
[7]:
lag_test = find_optimal_lag_order(mvf,train_only=True)
pd.DataFrame(
{
'aic':lag_test.aic,
'bic':lag_test.bic,
'hqic':lag_test.hqic,
'fpe':lag_test.fpe,
},
index = ['optimal lag order'],
).T
[7]:
optimal lag order | |
---|---|
aic | 27 |
bic | 1 |
hqic | 3 |
fpe | 27 |
Johansen cointegration test
[8]:
coint_res = find_optimal_coint_rank(
mvf,
det_order=1,
k_ar_diff=10,
train_only=True,
)
print(coint_res)
coint_res.rank
Johansen cointegration test using trace test statistic with 5% significance level
=====================================
r_0 r_1 test statistic critical value
-------------------------------------
0 4 56.60 55.25
1 4 33.50 35.01
-------------------------------------
[8]:
1
We found a cointegration rank of 1.
Run VECM
Now, we can specify a grid that will try more lags, deterministic terms, seasonal fluctuations, and cointegration ranks of 0 and 1
[9]:
vecm_grid = dict(
lags = [0], # required to set this to 0 for the vecm model in scalecast
freq = ['B'], # only necessary to suppress a warning
k_ar_diff = range(1,66), # lag orders to try
coint_rank = [0,1],
deterministic = ["n","co","lo","li","cili","colo"],
seasons = [0,5,30,65,260],
)
mvf.set_estimator('vecm')
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast()
results = mvf.export('model_summaries')
results[[
'ModelNickname',
'Series',
'TestSetRMSE',
'TestSetMAE',
]]
Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 28, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 5}.
[9]:
ModelNickname | Series | TestSetRMSE | TestSetMAE | |
---|---|---|---|---|
0 | vecm | META | 50.913814 | 44.133838 |
1 | vecm | AMZN | 23.752750 | 22.348669 |
2 | vecm | NFLX | 113.705564 | 105.310047 |
3 | vecm | GOOG | 18.846431 | 18.041912 |
View VECM Results
[10]:
results['TestSetRMSE'].mean()
[10]:
51.80463965264752
[11]:
mvf.export_validation_grid('vecm').sample(15)
[11]:
lags | freq | k_ar_diff | coint_rank | deterministic | seasons | Fold0Metric | Fold1Metric | Fold2Metric | AverageMetric | MetricEvaluated | Optimized On | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
71 | 0 | B | 58 | 0 | colo | 0 | 162.599874 | 44.268484 | 26.594518 | 77.820958 | rmse | mean |
40 | 0 | B | 7 | 0 | li | 260 | NaN | NaN | NaN | NaN | rmse | mean |
3 | 0 | B | 54 | 1 | lo | 30 | 110.553065 | 51.443275 | 19.154490 | 60.383610 | rmse | mean |
30 | 0 | B | 29 | 0 | co | 30 | 112.035140 | 44.427934 | 18.598329 | 58.353801 | rmse | mean |
76 | 0 | B | 1 | 1 | cili | 65 | 93.945660 | 51.090754 | 21.836356 | 55.624257 | rmse | mean |
87 | 0 | B | 50 | 0 | lo | 260 | 163.152289 | 46.616014 | 13.327631 | 74.365311 | rmse | mean |
10 | 0 | B | 44 | 0 | cili | 65 | NaN | NaN | NaN | NaN | rmse | mean |
9 | 0 | B | 47 | 0 | n | 0 | 86.212252 | 55.639330 | 36.182421 | 59.344668 | rmse | mean |
94 | 0 | B | 38 | 1 | colo | 5 | 139.580751 | 41.852200 | 24.055439 | 68.496130 | rmse | mean |
43 | 0 | B | 55 | 1 | colo | 260 | 112.237679 | 43.782141 | 63.595936 | 73.205252 | rmse | mean |
73 | 0 | B | 40 | 0 | lo | 0 | 159.422501 | 46.642590 | 18.090453 | 74.718515 | rmse | mean |
62 | 0 | B | 18 | 1 | n | 5 | 166.241648 | 54.618358 | 30.753820 | 83.871275 | rmse | mean |
63 | 0 | B | 12 | 0 | co | 260 | 108.180519 | 44.685347 | 23.286315 | 58.717394 | rmse | mean |
60 | 0 | B | 21 | 1 | li | 260 | 82.029398 | 50.186756 | 31.553748 | 54.589967 | rmse | mean |
64 | 0 | B | 43 | 1 | co | 0 | 100.291180 | 46.496187 | 16.839982 | 54.542450 | rmse | mean |
[12]:
mvf.plot_test_set(
series='AMZN',
models='vecm',
include_train=130,
figsize=(16,8)
)
plt.show()
Re-weight Evaluation Metrics and Rerun VECM
[13]:
weights = results['TestSetRMSE'] / results['TestSetRMSE'].sum()
weights
[13]:
0 0.245701
1 0.114627
2 0.548723
3 0.090950
Name: TestSetRMSE, dtype: float64
[14]:
mvf.set_optimize_on(
lambda x: (
x[0]*weights[0] +
x[1]*weights[1] +
x[2]*weights[2] +
x[3]*weights[3]
)
)
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast(call_me='vecm_weighted')
results = mvf.export('model_summaries')
results[[
'ModelNickname',
'Series',
'TestSetRMSE',
'TestSetMAE',
]]
Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 62, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 0}.
[14]:
ModelNickname | Series | TestSetRMSE | TestSetMAE | |
---|---|---|---|---|
0 | vecm | META | 50.913814 | 44.133838 |
1 | vecm_weighted | META | 33.772797 | 28.520031 |
2 | vecm | AMZN | 23.752750 | 22.348669 |
3 | vecm_weighted | AMZN | 15.835426 | 14.617043 |
4 | vecm | NFLX | 113.705564 | 105.310047 |
5 | vecm_weighted | NFLX | 59.887563 | 53.210215 |
6 | vecm | GOOG | 18.846431 | 18.041912 |
7 | vecm_weighted | GOOG | 16.757831 | 16.121594 |
[15]:
results.loc[results['ModelNickname'] == 'vecm_weighted','TestSetRMSE'].mean()
[15]:
31.56340410699967
An improvement by weighting the optimizer!
[16]:
mvf.plot_test_set(
series='META',
models='all',
include_train=130,
figsize=(16,8)
)
plt.show()
[17]:
mvf.plot(
series='all',
models='all',
figsize=(16,8)
)
plt.show()
Try Other MV Models
[31]:
GridGenerator.get_mv_grids()
# open MVGrids.py and manually change all lags arguments to range(1,66)
[32]:
transformers = []
reverters = []
for stock, f in zip(FANG,fs):
transformer = Transformer(
transformers = [('DiffTransform',)]
)
reverter = Reverter(
reverters = [('DiffRevert',)],
base_transformer = transformer,
)
transformers.append(transformer)
reverters.append(reverter)
[33]:
def Xvar_select(f):
f.set_validation_length(65)
f.auto_Xvar_select(
estimator='gbt',
max_depth=2,
max_ar=0, # in mv modeling, lags are a hyperparameter, not a regressor in the MVForecaster object
)
def mvforecaster(mvf):
models = (
'mlr',
'elasticnet',
'gbt',
'xgboost',
'lightgbm',
'knn',
)
mvf.set_test_length(65)
mvf.tune_test_forecast(
models,
limit_grid_size=10,
cross_validate=True,
k=3,
)
[34]:
pipeline = MVPipeline(
steps = [
('Transform',transformers),
('Xvar Select',[Xvar_select]*4),
('Forecast',mvforecaster),
('Revert',reverters),
],
names = FANG,
)
[35]:
fs = pipeline.fit_predict(*fs)
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
Finished loading model, total used 150 iterations
[36]:
results = export_model_summaries(dict(zip(FANG,fs)))
View Results
[37]:
model_rmses = results.groupby('ModelNickname')['TestSetRMSE'].mean().sort_values().reset_index()
model_rmses
[37]:
ModelNickname | TestSetRMSE | |
---|---|---|
0 | xgboost | 27.918698 |
1 | knn | 41.832711 |
2 | mlr | 42.519262 |
3 | gbt | 43.895070 |
4 | elasticnet | 44.320649 |
5 | lightgbm | 44.554279 |
The above table is the mean mape performance from each model over all series.
[38]:
series_rmses = results.groupby('Series')['TestSetRMSE'].min().reset_index()
series_rmses['Model'] = [
results.loc[
results['TestSetRMSE'] == i,
'ModelNickname'
].values[0] for i in series_rmses['TestSetRMSE']
]
series_rmses
[38]:
Series | TestSetRMSE | Model | |
---|---|---|---|
0 | AMZN | 17.433315 | xgboost |
1 | GOOG | 12.892690 | xgboost |
2 | META | 10.509525 | xgboost |
3 | NFLX | 70.839262 | xgboost |
[39]:
series_rmses['TestSetRMSE'].mean()
[39]:
27.918697955026953
The above table shows the best model for each series and its derived RMSE. The average RMSE of all these models applied to the individual series is 27.9, but being so dependent on the test set to choose the model probably leads to overfitting.
[41]:
fs[1].plot_test_set(
models='xgboost',
include_train=130,
)
plt.title('AMZN Best Model Test Predictions')
plt.show()
[42]:
fs[1].plot(
models='xgboost',
)
plt.title('AMZN Best Model Forecast')
plt.show()
[ ]: