VECM

Using a VECM to predict FANG stocks
See the VECM documentation

[1]:

import pandas as pd
import numpy as np
from pandas_datareader import data as pdr
import yfinance as yf
from scalecast.Forecaster import Forecaster
from scalecast.MVForecaster import MVForecaster
from scalecast.Pipeline import Transformer, Reverter, MVPipeline
from scalecast.util import (
    find_optimal_lag_order,
    find_optimal_coint_rank,
    Forecaster_with_missing_vals,
)
from scalecast.auxmodels import vecm
from scalecast.multiseries import export_model_summaries
from scalecast import GridGenerator
import matplotlib.pyplot as plt

[2]:

yf.pdr_override()

Download data using a public API

[3]:

FANG = [
    'META',
    'AMZN',
    'NFLX',
    'GOOG',
]

fs = []
for sym in FANG:
    df = pdr.get_data_yahoo(sym)
    # since the api doesn't send the data exactly in Business-day frequency
    # we can correct it using this function
    f = Forecaster_with_missing_vals(
        y=df['Close'],
        current_dates = df.index,
        future_dates = 65,
        end = '2022-09-30',
        desired_frequency = 'B',
        fill_strategy = 'linear_interp',
        add_noise = True,
        noise_lookback = 5,
    )
    fs.append(f)

mvf = MVForecaster(*fs,names=FANG,test_length=65)
mvf.set_validation_metric('rmse')
mvf.add_sklearn_estimator(vecm,'vecm')

mvf

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed

[3]:

MVForecaster(
    DateStartActuals=2012-05-18T00:00:00.000000000
    DateEndActuals=2023-08-03T00:00:00.000000000
    Freq=B
    N_actuals=2925
    N_series=4
    SeriesNames=['META', 'AMZN', 'NFLX', 'GOOG']
    ForecastLength=65
    Xvars=[]
    TestLength=65
    ValidationLength=1
    ValidationMetric=rmse
    ForecastsEvaluated=[]
    CILevel=None
    CurrentEstimator=mlr
    OptimizeOn=mean
    GridsFile=MVGrids
)

[4]:

mvf.plot()
plt.show()

Augmented Dickey-Fuller Tests to Confirm Each Series Is Integrated of Order 1

[5]:

for stock, f in zip(FANG,fs):
    adf_result = f.adf_test(full_res=True)
    print('the stock {} is {}stationary at level'.format(
        stock,
        'not ' if adf_result[1] > 0.05 else ''
    )
    )

the stock META is not stationary at level
the stock AMZN is not stationary at level
the stock NFLX is not stationary at level
the stock GOOG is not stationary at level

[6]:

for stock, f in zip(FANG,fs):
    adf_result = f.adf_test(diffy=True,full_res=True)
    print('the stock {} is {}stationary at its first difference'.format(
        stock,
        'not ' if adf_result[1] > 0.05 else ''
    )
    )

the stock META is stationary at its first difference
the stock AMZN is stationary at its first difference
the stock NFLX is stationary at its first difference
the stock GOOG is stationary at its first difference
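As an informal illustration of why the two ADF runs above disagree, the sketch below simulates a random walk (an assumed stand-in for a price series, not the downloaded data) and compares the lag-1 autocorrelation of its levels with that of its first differences. This is not the ADF test itself, only a quick way to see the behavior the test detects: levels that depend almost entirely on their own past, and differences that do not.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Simulated random walk: each level is the previous level plus noise.
rng = np.random.default_rng(0)
steps = rng.normal(size=2000)
levels = 100 + np.cumsum(steps)

# First difference recovers the (stationary) noise terms.
diffs = np.diff(levels)

acf_levels = lag1_autocorr(levels)
acf_diffs = lag1_autocorr(diffs)

print(f"lag-1 autocorrelation of levels:      {acf_levels:.3f}")
print(f"lag-1 autocorrelation of differences: {acf_diffs:.3f}")
```

A series that is non-stationary in levels but stationary after one difference is integrated of order 1, which is the precondition for testing cointegration.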

Measure IC to Find Optimal Lag Order

The lag order is an input to the cointegration test that follows.

[7]:

lag_test = find_optimal_lag_order(mvf,train_only=True)
pd.DataFrame(
    {
        'aic':lag_test.aic,
        'bic':lag_test.bic,
        'hqic':lag_test.hqic,
        'fpe':lag_test.fpe,
    },
    index = ['optimal lag order'],
).T

[7]:

	optimal lag order
aic	27
bic	1
hqic	3
fpe	27
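The criteria above disagree widely (27 lags for AIC and FPE, 1 for BIC) because they penalize extra parameters differently. The sketch below shows the mechanism on a simulated univariate AR(2) series, fitting AR(p) models by least squares and scoring them with AIC and BIC. It is a simplified, assumed stand-in for the multivariate procedure that `find_optimal_lag_order` runs, and the function name `ar_information_criteria` is hypothetical.

```python
import numpy as np

def ar_information_criteria(y, max_lag):
    """Fit AR(p) by least squares for p = 1..max_lag and return AIC/BIC per p.

    Every model is scored on the same observations (the first max_lag
    points are dropped for all p) so the criteria are comparable.
    """
    y = np.asarray(y, dtype=float)
    target = y[max_lag:]
    n = target.size
    scores = {}
    for p in range(1, max_lag + 1):
        # Design matrix: intercept, y[t-1], ..., y[t-p]
        lags = [y[max_lag - j : len(y) - j] for j in range(1, p + 1)]
        X = np.column_stack([np.ones(n)] + lags)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        sigma2 = resid @ resid / n
        k = p + 1  # number of estimated coefficients
        scores[p] = {
            "aic": n * np.log(sigma2) + 2 * k,
            "bic": n * np.log(sigma2) + k * np.log(n),
        }
    return scores

# Simulate a stationary AR(2) process.
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

scores = ar_information_criteria(y, max_lag=10)
aic_order = min(scores, key=lambda p: scores[p]["aic"])
bic_order = min(scores, key=lambda p: scores[p]["bic"])
print("AIC order:", aic_order, "BIC order:", bic_order)
```

Because BIC's per-parameter penalty, ln(n), exceeds AIC's penalty of 2 for any sample larger than about 7 observations, BIC never selects a longer lag order than AIC when both are computed on the same sample.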

Johansen cointegration test

[8]:

coint_res = find_optimal_coint_rank(
    mvf,
    det_order=1,
    k_ar_diff=10,
    train_only=True,
)
print(coint_res)
coint_res.rank

Johansen cointegration test using trace test statistic with 5% significance level
=====================================
r_0 r_1 test statistic critical value
-------------------------------------
  0   4          56.60          55.25
  1   4          33.50          35.01
-------------------------------------

[8]:

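The trace test rejects a rank of 0 (56.60 > 55.25) but fails to reject a rank of at most 1 (33.50 < 35.01), so we found a cointegration rank of 1.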
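The sequential decision rule behind the trace-test table can be written as a short function. The sketch below uses the statistics and critical values printed above; the function name `select_coint_rank` is hypothetical and is not part of scalecast or statsmodels.

```python
def select_coint_rank(trace_stats, critical_values):
    """Return the first rank r whose null hypothesis (rank <= r) is not rejected.

    trace_stats[r] and critical_values[r] belong to the test of rank <= r.
    """
    for r, (stat, crit) in enumerate(zip(trace_stats, critical_values)):
        if stat < crit:
            # Cannot reject rank <= r, so stop here.
            return r
    # Every null was rejected: the rank is at least the number of tests run.
    return len(trace_stats)

# Values copied from the table above (5% significance level).
rank = select_coint_rank(
    trace_stats=[56.60, 33.50],
    critical_values=[55.25, 35.01],
)
print(rank)  # 1
```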

Run VECM

Now we can specify a grid that tries more lag orders, deterministic terms, seasonal settings, and cointegration ranks of 0 and 1.

[9]:

vecm_grid = dict(
    lags = [0], # required to set this to 0 for the vecm model in scalecast
    freq = ['B'], # only necessary to suppress a warning
    k_ar_diff = range(1,66), # lag orders to try
    coint_rank = [0,1],
    deterministic = ["n","co","lo","li","cili","colo"],
    seasons = [0,5,30,65,260],
)

mvf.set_estimator('vecm')
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast()

results = mvf.export('model_summaries')
results[[
    'ModelNickname',
    'Series',
    'TestSetRMSE',
    'TestSetMAE',
]]

Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 28, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 5}.

[9]:

	ModelNickname	Series	TestSetRMSE	TestSetMAE
0	vecm	META	50.913814	44.133838
1	vecm	AMZN	23.752750	22.348669
2	vecm	NFLX	113.705564	105.310047
3	vecm	GOOG	18.846431	18.041912
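To see why the grid is limited to 100 candidates, it helps to count the full search space. The sketch below expands the same grid with `itertools.product` and draws a random subset of 100. The sampling uses Python's `random` module purely for illustration; it is an assumption and will not reproduce the specific combinations that `limit_grid_size` selects.

```python
import itertools
import random

grid = dict(
    lags=[0],
    freq=["B"],
    k_ar_diff=range(1, 66),
    coint_rank=[0, 1],
    deterministic=["n", "co", "lo", "li", "cili", "colo"],
    seasons=[0, 5, 30, 65, 260],
)

# Every combination of hyperparameter values, one dict per candidate.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
total = len(combos)
print(total)  # 3900 = 1 * 1 * 65 * 2 * 6 * 5

# Keep a reproducible random subset of 100 candidates.
random.seed(20)
sampled = random.sample(combos, 100)
print(sampled[0])
```

Cross-validating all 3,900 candidates over three folds would mean 11,700 model fits, so evaluating a random subset of 100 cuts the work by a factor of 39.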

View VECM Results

[10]:

results['TestSetRMSE'].mean()

[10]:

51.80463965264752

[11]:

mvf.export_validation_grid('vecm').sample(15)

[11]:

	freq	k_ar_diff	coint_rank	deterministic	seasons	Fold0Metric	Fold1Metric	Fold2Metric	AverageMetric	MetricEvaluated	Optimized On
71	B	58	0	colo	0	162.599874	44.268484	26.594518	77.820958	rmse	mean
40	B	7	0	li	260	NaN	NaN	NaN	NaN	rmse	mean
3	B	54	1	lo	30	110.553065	51.443275	19.154490	60.383610	rmse	mean
30	B	29	0	co	30	112.035140	44.427934	18.598329	58.353801	rmse	mean
76	B	1	1	cili	65	93.945660	51.090754	21.836356	55.624257	rmse	mean
87	B	50	0	lo	260	163.152289	46.616014	13.327631	74.365311	rmse	mean
10	B	44	0	cili	65	NaN	NaN	NaN	NaN	rmse	mean
9	B	47	0	n	0	86.212252	55.639330	36.182421	59.344668	rmse	mean
94	B	38	1	colo	5	139.580751	41.852200	24.055439	68.496130	rmse	mean
43	B	55	1	colo	260	112.237679	43.782141	63.595936	73.205252	rmse	mean
73	B	40	0	lo	0	159.422501	46.642590	18.090453	74.718515	rmse	mean
62	B	18	1	n	5	166.241648	54.618358	30.753820	83.871275	rmse	mean
63	B	12	0	co	260	108.180519	44.685347	23.286315	58.717394	rmse	mean
60	B	21	1	li	260	82.029398	50.186756	31.553748	54.589967	rmse	mean
64	B	43	1	co	0	100.291180	46.496187	16.839982	54.542450	rmse	mean
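The `Fold0Metric` to `Fold2Metric` columns correspond to the three folds reported in the cross-validation log, and `AverageMetric` is their mean. The fold sizes in the log are consistent with splitting the 2,860 non-test observations (2,925 actuals minus the 65-observation test set) into four equal chunks. The sketch below reproduces those sizes; the splitting rule is inferred from the log output and is an assumption, not a copy of scalecast's implementation.

```python
def fold_sizes(n_obs, k):
    """Return (train_size, test_size) per fold for k rolling folds.

    The data is cut into k + 1 equal chunks; fold 0 uses the most recent
    chunk for testing and everything before it for training.
    """
    chunk = n_obs // (k + 1)
    return [(chunk * (k - i), chunk) for i in range(k)]

folds = fold_sizes(n_obs=2925 - 65, k=3)
for i, (train, test) in enumerate(folds):
    print(f"Fold {i}: train size {train}, test size {test}")

# AverageMetric for row 71 of the table above is the mean of its fold metrics.
row_71 = [162.599874, 44.268484, 26.594518]
average_metric = sum(row_71) / len(row_71)
print(round(average_metric, 6))  # 77.820959
```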

[12]:

mvf.plot_test_set(
    series='AMZN',
    models='vecm',
    include_train=130,
    figsize=(16,8)
)
plt.show()

Re-weight Evaluation Metrics and Rerun VECM

[13]:

weights = results['TestSetRMSE'] / results['TestSetRMSE'].sum()
weights

[13]:

0    0.245701
1    0.114627
2    0.548723
3    0.090950
Name: TestSetRMSE, dtype: float64

[14]:

mvf.set_optimize_on(
    lambda x: (
        x[0]*weights[0] +
        x[1]*weights[1] +
        x[2]*weights[2] +
        x[3]*weights[3]
    )
)
mvf.ingest_grid(vecm_grid)
mvf.limit_grid_size(100,random_seed=20)
mvf.cross_validate(k=3,verbose=True)
mvf.auto_forecast(call_me='vecm_weighted')

results = mvf.export('model_summaries')
results[[
    'ModelNickname',
    'Series',
    'TestSetRMSE',
    'TestSetMAE',
]]

Num hyperparams to try for the vecm model: 100.
Fold 0: Train size: 2145 (2012-05-18 00:00:00 - 2020-08-06 00:00:00). Test Size: 715 (2020-08-07 00:00:00 - 2023-05-04 00:00:00).
Fold 1: Train size: 1430 (2012-05-18 00:00:00 - 2017-11-09 00:00:00). Test Size: 715 (2017-11-10 00:00:00 - 2020-08-06 00:00:00).
Fold 2: Train size: 715 (2012-05-18 00:00:00 - 2015-02-12 00:00:00). Test Size: 715 (2015-02-13 00:00:00 - 2017-11-09 00:00:00).
Chosen paramaters: {'lags': 0, 'freq': 'B', 'k_ar_diff': 62, 'coint_rank': 1, 'deterministic': 'li', 'seasons': 0}.

[14]:

	ModelNickname	Series	TestSetRMSE	TestSetMAE
0	vecm	META	50.913814	44.133838
1	vecm_weighted	META	33.772797	28.520031
2	vecm	AMZN	23.752750	22.348669
3	vecm_weighted	AMZN	15.835426	14.617043
4	vecm	NFLX	113.705564	105.310047
5	vecm_weighted	NFLX	59.887563	53.210215
6	vecm	GOOG	18.846431	18.041912
7	vecm_weighted	GOOG	16.757831	16.121594

[15]:

results.loc[results['ModelNickname'] == 'vecm_weighted','TestSetRMSE'].mean()

[15]:

31.56340410699967

Weighting the optimizer improved the average test-set RMSE from 51.8 to 31.6.
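To make the effect of the weighted optimizer concrete, the sketch below rebuilds the weights from the first run's test-set RMSEs and compares a plain mean with the weighted score for one candidate. The per-series validation scores in `candidate` are hypothetical numbers chosen for illustration; only the RMSE values come from the results above.

```python
# Test-set RMSEs from the unweighted vecm run above.
rmse = {
    "META": 50.913814,
    "AMZN": 23.752750,
    "NFLX": 113.705564,
    "GOOG": 18.846431,
}

# Each weight is the series' share of the total error.
total = sum(rmse.values())
weights = {name: value / total for name, value in rmse.items()}

def weighted_score(scores):
    """Combine per-series metrics so high-error series count for more."""
    return sum(weights[name] * score for name, score in scores.items())

# Hypothetical per-series validation RMSEs for one hyperparameter candidate.
candidate = {"META": 40.0, "AMZN": 20.0, "NFLX": 90.0, "GOOG": 15.0}

plain_mean = sum(candidate.values()) / len(candidate)
weighted = weighted_score(candidate)
print(f"plain mean: {plain_mean:.2f}, weighted: {weighted:.2f}")
```

Because NFLX carries more than half of the total weight, a candidate that lowers the NFLX error is favored even if it is slightly worse on the other series.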

[16]:

mvf.plot_test_set(
    series='META',
    models='all',
    include_train=130,
    figsize=(16,8)
)
plt.show()

[17]:

mvf.plot(
    series='all',
    models='all',
    figsize=(16,8)
)
plt.show()

Try Other MV Models

[31]:

GridGenerator.get_mv_grids()
# open MVGrids.py and manually change all lags arguments to range(1,66)

[32]:

transformers = []
reverters = []
for stock, f in zip(FANG,fs):
    transformer = Transformer(
        transformers = [('DiffTransform',)]
    )
    reverter = Reverter(
        reverters = [('DiffRevert',)],
        base_transformer = transformer,
    )
    transformers.append(transformer)
    reverters.append(reverter)
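The transform and revert pair above differences each series before modeling and undoes the differencing afterwards. The arithmetic involved can be sketched with NumPy as follows; the prices and predicted differences are made-up values, and this is an illustration of the idea rather than scalecast's internal code.

```python
import numpy as np

# Made-up price levels.
prices = np.array([100.0, 102.5, 101.0, 105.0, 107.5])

# Transform: first difference (one observation shorter than the input).
diffs = np.diff(prices)

# Revert: cumulative sum of the differences, anchored at the first level.
restored = np.concatenate([[prices[0]], prices[0] + np.cumsum(diffs)])

# Forecasts made on the differenced scale are reverted the same way,
# anchored at the last observed level.
predicted_diffs = np.array([1.0, -0.5, 2.0])
forecast_levels = prices[-1] + np.cumsum(predicted_diffs)

print(restored)
print(forecast_levels)  # [108.5 108.  110. ]
```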

[33]:

def Xvar_select(f):
    f.set_validation_length(65)
    f.auto_Xvar_select(
        estimator='gbt',
        max_depth=2,
        max_ar=0, # in mv modeling, lags are a hyperparameter, not a regressor in the MVForecaster object
    )

def mvforecaster(mvf):
    models = (
        'mlr',
        'elasticnet',
        'gbt',
        'xgboost',
        'lightgbm',
        'knn',
    )
    mvf.set_test_length(65)
    mvf.tune_test_forecast(
        models,
        limit_grid_size=10,
        cross_validate=True,
        k=3,
    )

[34]:

pipeline = MVPipeline(
    steps = [
        ('Transform',transformers),
        ('Xvar Select',[Xvar_select]*4),
        ('Forecast',mvforecaster),
        ('Revert',reverters),
    ],
    names = FANG,
)

[35]:

fs = pipeline.fit_predict(*fs)

Finished loading model, total used 150 iterations

[36]:

results = export_model_summaries(dict(zip(FANG,fs)))

View Results

[37]:

model_rmses = results.groupby('ModelNickname')['TestSetRMSE'].mean().sort_values().reset_index()
model_rmses

[37]:

	ModelNickname	TestSetRMSE
0	xgboost	27.918698
1	knn	41.832711
2	mlr	42.519262
3	gbt	43.895070
4	elasticnet	44.320649
5	lightgbm	44.554279

The above table shows each model's mean test-set RMSE across all series.

[38]:

series_rmses = results.groupby('Series')['TestSetRMSE'].min().reset_index()
series_rmses['Model'] = [
    results.loc[
        results['TestSetRMSE'] == i,
        'ModelNickname'
    ].values[0] for i in series_rmses['TestSetRMSE']
]
series_rmses

[38]:

	Series	TestSetRMSE	Model
0	AMZN	17.433315	xgboost
1	GOOG	12.892690	xgboost
2	META	10.509525	xgboost
3	NFLX	70.839262	xgboost

[39]:

series_rmses['TestSetRMSE'].mean()

[39]:

27.918697955026953

The above table shows the best model for each series and its test-set RMSE. The average RMSE of these models across the individual series is 27.9, but relying so heavily on the test set to choose the model probably leads to overfitting.
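The best-model lookup above matches rows by their RMSE value, which could pick the wrong row if two series happened to share an identical score. A possible alternative, sketched below on a small made-up frame with the same column names, uses `groupby(...).idxmin()` to fetch the winning row for each series directly. The values in `toy` are hypothetical.

```python
import pandas as pd

# Made-up summary frame using the same column names as the results table.
toy = pd.DataFrame(
    {
        "Series": ["A", "A", "B", "B"],
        "ModelNickname": ["mlr", "knn", "mlr", "knn"],
        "TestSetRMSE": [3.0, 2.0, 1.0, 4.0],
    }
)

# Index label of the lowest-RMSE row within each series.
best_idx = toy.groupby("Series")["TestSetRMSE"].idxmin()

# Pull those rows and tidy the index.
best = toy.loc[best_idx, ["Series", "TestSetRMSE", "ModelNickname"]].reset_index(drop=True)
print(best)
```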

[41]:

fs[1].plot_test_set(
    models='xgboost',
    include_train=130,
)
plt.title('AMZN Best Model Test Predictions')
plt.show()

[42]:

fs[1].plot(
    models='xgboost',
)
plt.title('AMZN Best Model Forecast')
plt.show()
