import numpy as np
from statsmodels.tsa.vector_ar import var_model
import statsmodels.tsa.vector_ar.plotting as plotting
import matplotlib.pyplot as plt
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import pandas
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base.datetools import dates_from_str
Throughout this tutorial, we're going to be using the VAR class from the statsmodels python module.
In-depth documentation on how vector autoregressions are implemented in statsmodel can be found here:
http://www.statsmodels.org/dev/vector_ar.html
We're going to create a Vector Autoregression Model for some of the largest energy firms in the world. Specifically, we're going to look at their daily Google search index. The point of this is to see if one particular firm, or group of firms, leads or lags other firms in terms of search interest. First, I take a list of the top 10 largest energy firms from Wikipedia.
Next, I searched each firm's ticker in Google Trends and filtered the list based on the results that showed up in Related Topics. Specifically, I looked for whether the company name was in the top results or if Stock - Topic was in the top results. After applying these filters, I was left with the following firms:
#Load in the Google trends csv file and clean up the pandas dataframe
top_energy_firms = pd.read_csv("energy_firms.csv")
top_energy_firms.index = top_energy_firms['Date']
del top_energy_firms['Date']
del top_energy_firms.index.name
top_energy_firms.index = dates_from_str(top_energy_firms.index)
#Create a Vector Auto Regression model from the data
model = VAR(top_energy_firms)
Note
For direct analysis of non-stationary time series, a standard stable VAR(p) model is not appropriate. In using the VAR class, we're making the assumption that the time series is stationary. Non-stationary or trending data can often be transformed to be stationary by first-differencing or some other method. For tests of stationarity, please see the following links (these are all unit-root tests):
results = model.fit(1)
results.summary()
Please see these slides for a more in-depth discussion of VAR models.
In general, the VAR model is defined as:
$y_{t} = c + \Phi_{1}y_{t-1} + \cdots + \Phi_{p}y_{t-p} + \epsilon_{t}$
Note that $y_{t}$ is a vector. More specifically, in our example it is the following vector:
$y_{t} = [XOM_{t} \, CVX_{t} \, BP_{t} \, SLB_{t} \, EPD_{t} \, ENI_{t}]'$
Note that $y_{t}$ is a $6 x 1$ vector. Further, our ticker symbols represent the search index for each ticker at date t. The $\Phi$'s in the VAR equation are $6 x 6$ vectors of coefficients. We have one for each of our lags p. Finally, $\epsilon_{t}$ is a $6 x 1$ vector of residuals.
In order to use the VAR model we need the following assumptions to hold:
What we're essentially trying to do with our regression model is to estimate $y_{t}$ using the last $p$ observations. That is, given $(T+p)$ observations we condition on the first $p$ observations in order to estimate the last $T$ observations. The conditional likelihood is:
$f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)}$
Where $\theta = [c' \, \, \Phi'_{1} \, \cdots \, \Phi'_{p} \,\, \Omega']'$. That is, $\theta$ is an array of matrix coefficients, the constant term, and the auto-covariance of our model.
Note that assumptions 1, 2, 3, and 4 can be summarized by assuming $\epsilon_{t} \sim i.i.d. N(0, \Omega)$
Using the above assumptions, we have that:
$E[y_{t}|y_{t-1} \, \cdots \, y_{1-p}] = \Pi'x_{t}$
Where $\Pi' = [c \,\, \Phi_{1} \, \cdots \, \Phi_{p}]$ and $x_{t} = [1 \, y_{t-1} \, \cdots \, y_{t-p}]'$
Assuming the error terms are i.i.d normally distributed with mean 0 and covariance matrix $\Omega$ results in the following explicit functional form for the above likelihood function:
$ f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = (2\pi)^{-\frac{n}{2}}|\Omega^{-1}|^{\frac{1}{2}}exp(-\frac{1}{2}(y_{t} - \Pi'x_{t})'\Omega^{-1}(y_{t} - \Pi'x_{t})) $
Further, we can factorize the likelihood function as:
$ f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = f({y_{t}|y_{t-1}, \cdots, y_{1-p}};\theta) * f({y_{t-1}|y_{t-2}, \cdots, y_{1-p}};\theta) * \cdots f({y_{1}|y_{0}, \cdots, y_{1-p}};\theta) $
This can be summarized as:
$f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = \Pi_{t=1}^{T} f({y_{t}|y_{t-1}, \cdots, y_{1-p}};\theta) $
Combining our expression using the explicit form of the normal distribution and the joint density expansion and taking logs, we have the log likelihood function:
$ \mathcal{L}(\theta) = -\frac{TN}{2}\log(2\pi) + \frac{T}{2}\log|\Omega^{-1}|-\frac{1}{2}\sum_{t=1}^{T}[((y_{t} - \Pi'x_{t})'\Omega^{-1}(y_{t} - \Pi'x_{t}))] $
Taking the max of the loglikelihood function with respect to $\Pi$ and $\Omega$ yields our ML estimates of the constant term, each of the coefficient matrices, and the covariance matrix:
$\hat{\Pi}' = \bigg[ \sum_{t=1}^{T}y_{t}x'_{t} \bigg] \bigg[ \sum_{t=1}^{T}x_{t}x'_{t} \bigg]^{-1}$
$\hat{\Omega} = \frac{1}{T}\sum_{t=1}^{T}\hat{\epsilon}_{t}\hat{\epsilon}'_{t}$
By construction, the first row of $\hat{\Pi}'$ is the ML estimate for the constant term. The second row of $\hat{\Pi}'$ is the estimate for the coefficient matrix for the first lag, the third row is the coefficient matrix for the second lag and so on until the $p+1$ row.
Akaike Information Criterion = AIC = $-2L + 2(k + 2kp)$
Bayesian Information Criterion = BIC = $-2L + (k + 2kp)ln(T)$
Where L is the log-likelihood, k is the number of variables in the system (6), T is the sample size, and p is the lag. Note that in the above, we have the lowest BIC score with a lag set to 1. In general, we select the lag based on the lowest AIC (or BIC) score.
For a thorough description of log likelihood see here
#Plot the auto correlation function for each firm
results.plot_acorr()
On the diagonal is the auto-correlation for each of the tickers. For each column, we have the correlation between the lagged observations of each ticker against the ticker represented by that column. The columns are in the order listed above. For example, the graph in the first column and the second row represents the correlation between the current search volume for CVX against one-day lagged observations of the search index for XOM.
#Graph Impulse Response Functions
irf = results.irf(10)
#Orthogonalized impulse response http://economia.unipv.it/pagp/pagine_personali/erossi/dottorato_svar.pdf
orth=True
if orth:
title = 'Impulse responses (orthogonalized)'
irfs = irf.orth_irfs
impulse=None
response=None
signif=0.05
plot_params=None
subplot_params=None,
plot_stderr=True
stderr_type='asym'
if stderr_type == 'asym':
stderr = irf.cov(orth=orth)
repl=1000
seed=None
component=None
plotting.irf_grid_plot(irfs, stderr, impulse, response,
irf.model.names, title, signif=signif,
subplot_params={'fontsize': 14},
plot_params=plot_params, figsize=(20,20), stderr_type=stderr_type)
On the diagonal, we have the impulse response of each search term with itself. As we can see, across the board we have that a one unit shock in the weekly Google Search Index leads to a increase in the short term and a gradual decline in the search index over time. The graphs on the off-diagonal shows what happens in the 10 weeks following a one unit increase in the search index for the search term on the right of the arrow on the weekly search index on the left of the arrow.
Most notably, we have the following observations: