A short guide to Vector Auto Regressions using Google Trends

Download the data here.

Download the Jupyter Notebook here.

In [12]:
import numpy as np
from statsmodels.tsa.vector_ar import var_model
import statsmodels.tsa.vector_ar.plotting as plotting
import matplotlib.pyplot as plt
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import pandas
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base.datetools import dates_from_str

Throughout this tutorial, we're going to be using the VAR class from the statsmodels python module.

In-depth documentation on how vector autoregressions are implemented in statsmodel can be found here:

http://www.statsmodels.org/dev/vector_ar.html

We're going to create a Vector Autoregression Model for some of the largest energy firms in the world. Specifically, we're going to look at their daily Google search index. The point of this is to see if one particular firm, or group of firms, leads or lags other firms in terms of search interest. First, I take a list of the top 10 largest energy firms from Wikipedia.

Next, I searched each firm's ticker in Google Trends and filtered the list based on the results that showed up in Related Topics. Specifically, I looked for whether the company name was in the top results or if Stock - Topic was in the top results. After applying these filters, I was left with the following firms:

  • Exxon Mobil - XOM
  • Chevron - CVX
  • British Petroleum - BP
  • Schlumberger - SLB
  • Enterprise Products - EPD
  • Eni S.p.A. - ENI
In [2]:
#Load in the Google trends csv file and clean up the pandas dataframe
top_energy_firms = pd.read_csv("energy_firms.csv")
top_energy_firms.index = top_energy_firms['Date']
del top_energy_firms['Date']
del top_energy_firms.index.name
top_energy_firms.index = dates_from_str(top_energy_firms.index)
In [3]:
#Create a Vector Auto Regression model from the data
model = VAR(top_energy_firms)

Note

For direct analysis of non-stationary time series, a standard stable VAR(p) model is not appropriate. In using the VAR class, we're making the assumption that the time series is stationary. Non-stationary or trending data can often be transformed to be stationary by first-differencing or some other method. For tests of stationarity, please see the following links (these are all unit-root tests):

In [4]:
results = model.fit(1)
results.summary()
Out[4]:
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Mon, 17, Jul, 2017
Time:                     11:25:59
--------------------------------------------------------------------
No. of Equations:         7.00000    BIC:                    22.9071
Nobs:                     260.000    HQIC:                   22.4485
Log likelihood:          -5404.70    FPE:                4.12507e+09
AIC:                      22.1402    Det(Omega_mle):     3.33658e+09
--------------------------------------------------------------------
Results for equation XOM
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               11.119694         5.252845            2.117           0.035
L1.XOM               0.773444         0.043730           17.687           0.000
L1.CVX              -0.072683         0.049335           -1.473           0.142
L1.BP                0.019882         0.050025            0.397           0.691
L1.SLB               0.018998         0.033213            0.572           0.568
L1.EPD               0.019365         0.038628            0.501           0.617
L1.ENI              -0.000606         0.043204           -0.014           0.989
L1.Oil Price         0.126435         0.032497            3.891           0.000
===============================================================================

Results for equation CVX
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               17.355683         6.598770            2.630           0.009
L1.XOM               0.061330         0.054935            1.116           0.265
L1.CVX               0.613470         0.061976            9.898           0.000
L1.BP               -0.019251         0.062843           -0.306           0.760
L1.SLB               0.004277         0.041723            0.103           0.918
L1.EPD              -0.029299         0.048525           -0.604           0.547
L1.ENI              -0.078568         0.054274           -1.448           0.149
L1.Oil Price         0.149718         0.040823            3.667           0.000
===============================================================================

Results for equation BP
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const                8.511407         2.983063            2.853           0.005
L1.XOM               0.005130         0.024834            0.207           0.837
L1.CVX              -0.022192         0.028017           -0.792           0.429
L1.BP                0.932230         0.028409           32.814           0.000
L1.SLB              -0.014444         0.018862           -0.766           0.445
L1.EPD               0.001620         0.021937            0.074           0.941
L1.ENI              -0.035881         0.024535           -1.462           0.145
L1.Oil Price         0.019363         0.018455            1.049           0.295
===============================================================================

Results for equation SLB
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               32.063875         8.252089            3.886           0.000
L1.XOM               0.009862         0.068699            0.144           0.886
L1.CVX               0.056392         0.077504            0.728           0.468
L1.BP               -0.172366         0.078589           -2.193           0.029
L1.SLB               0.624671         0.052177           11.972           0.000
L1.EPD              -0.025960         0.060683           -0.428           0.669
L1.ENI               0.026520         0.067872            0.391           0.696
L1.Oil Price        -0.028808         0.051051           -0.564           0.573
===============================================================================

Results for equation EPD
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               28.473026         8.411571            3.385           0.001
L1.XOM               0.038763         0.070026            0.554           0.580
L1.CVX              -0.032752         0.079002           -0.415           0.679
L1.BP                0.110536         0.080108            1.380           0.169
L1.SLB               0.007557         0.053186            0.142           0.887
L1.EPD               0.468886         0.061856            7.580           0.000
L1.ENI              -0.006955         0.069184           -0.101           0.920
L1.Oil Price         0.109574         0.052038            2.106           0.036
===============================================================================

Results for equation ENI
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               29.844175         6.417529            4.650           0.000
L1.XOM              -0.057959         0.053426           -1.085           0.279
L1.CVX               0.060428         0.060274            1.003           0.317
L1.BP               -0.045145         0.061117           -0.739           0.461
L1.SLB               0.040903         0.040578            1.008           0.314
L1.EPD              -0.001517         0.047193           -0.032           0.974
L1.ENI               0.598849         0.052783           11.345           0.000
L1.Oil Price         0.054024         0.039702            1.361           0.175
===============================================================================

Results for equation Oil Price
===============================================================================
                  coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------
const               13.857461         6.491691            2.135           0.034
L1.XOM               0.043962         0.054043            0.813           0.417
L1.CVX               0.036611         0.060971            0.600           0.549
L1.BP               -0.003531         0.061824           -0.057           0.955
L1.SLB              -0.082394         0.041046           -2.007           0.046
L1.EPD              -0.082223         0.047738           -1.722           0.086
L1.ENI              -0.075408         0.053393           -1.412           0.159
L1.Oil Price         0.939670         0.040161           23.398           0.000
===============================================================================

Correlation matrix of residuals
                  XOM       CVX        BP       SLB       EPD       ENI  Oil Price
XOM          1.000000  0.436654 -0.035099  0.213779  0.102149  0.046439   0.354328
CVX          0.436654  1.000000  0.114422  0.248729  0.238418  0.165678   0.536930
BP          -0.035099  0.114422  1.000000  0.016327  0.302273  0.247222   0.147511
SLB          0.213779  0.248729  0.016327  1.000000  0.133523  0.208441   0.279100
EPD          0.102149  0.238418  0.302273  0.133523  1.000000  0.281742   0.289401
ENI          0.046439  0.165678  0.247222  0.208441  0.281742  1.000000   0.251650
Oil Price    0.354328  0.536930  0.147511  0.279100  0.289401  0.251650   1.000000

General Derivation of a VAR model

Please see these slides for a more in-depth discussion of VAR models.

In general, the VAR model is defined as:

$y_{t} = c + \Phi_{1}y_{t-1} + \cdots + \Phi_{p}y_{t-p} + \epsilon_{t}$

Note that $y_{t}$ is a vector. More specifically, in our example it is the following vector:

$y_{t} = [XOM_{t} \, CVX_{t} \, BP_{t} \, SLB_{t} \, EPD_{t} \, ENI_{t}]'$

Note that $y_{t}$ is a $6 x 1$ vector. Further, our ticker symbols represent the search index for each ticker at date t. The $\Phi$'s in the VAR equation are $6 x 6$ vectors of coefficients. We have one for each of our lags p. Finally, $\epsilon_{t}$ is a $6 x 1$ vector of residuals.

In order to use the VAR model we need the following assumptions to hold:

  1. $E[\epsilon_{t}] = 0$
  2. $E[\epsilon_{t}\epsilon_{\tau}'] = 0 \, \forall \, t \neq \tau$
  3. $E[\epsilon_{t}\epsilon_{\tau}'] = \Omega \, \forall \, t = t$ where $\Omega$ is a positive definite matrix
  4. $\epsilon_{t}$ is normally distributed $\forall$ $t$
  5. Stationarity:
    • $E[y_{t}]$ and $E[y_{t}y_{t-j}']$ are both independent of the date t

Maximum Likelihood Estimation

What we're essentially trying to do with our regression model is to estimate $y_{t}$ using the last $p$ observations. That is, given $(T+p)$ observations we condition on the first $p$ observations in order to estimate the last $T$ observations. The conditional likelihood is:

$f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)}$

Where $\theta = [c' \, \, \Phi'_{1} \, \cdots \, \Phi'_{p} \,\, \Omega']'$. That is, $\theta$ is an array of matrix coefficients, the constant term, and the auto-covariance of our model.

Note that assumptions 1, 2, 3, and 4 can be summarized by assuming $\epsilon_{t} \sim i.i.d. N(0, \Omega)$

Using the above assumptions, we have that:

$E[y_{t}|y_{t-1} \, \cdots \, y_{1-p}] = \Pi'x_{t}$

Where $\Pi' = [c \,\, \Phi_{1} \, \cdots \, \Phi_{p}]$ and $x_{t} = [1 \, y_{t-1} \, \cdots \, y_{t-p}]'$

Assuming the error terms are i.i.d normally distributed with mean 0 and covariance matrix $\Omega$ results in the following explicit functional form for the above likelihood function:

$ f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = (2\pi)^{-\frac{n}{2}}|\Omega^{-1}|^{\frac{1}{2}}exp(-\frac{1}{2}(y_{t} - \Pi'x_{t})'\Omega^{-1}(y_{t} - \Pi'x_{t})) $

Further, we can factorize the likelihood function as:

$ f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = f({y_{t}|y_{t-1}, \cdots, y_{1-p}};\theta) * f({y_{t-1}|y_{t-2}, \cdots, y_{1-p}};\theta) * \cdots f({y_{1}|y_{0}, \cdots, y_{1-p}};\theta) $

This can be summarized as:

$f({Y_{T}, Y_{T-1}, \cdots, Y_{1}|Y_{0}, Y_{-1}, \cdots, Y_{1-p}; \theta)} = \Pi_{t=1}^{T} f({y_{t}|y_{t-1}, \cdots, y_{1-p}};\theta) $

Combining our expression using the explicit form of the normal distribution and the joint density expansion and taking logs, we have the log likelihood function:

$ \mathcal{L}(\theta) = -\frac{TN}{2}\log(2\pi) + \frac{T}{2}\log|\Omega^{-1}|-\frac{1}{2}\sum_{t=1}^{T}[((y_{t} - \Pi'x_{t})'\Omega^{-1}(y_{t} - \Pi'x_{t}))] $

Taking the max of the loglikelihood function with respect to $\Pi$ and $\Omega$ yields our ML estimates of the constant term, each of the coefficient matrices, and the covariance matrix:

$\hat{\Pi}' = \bigg[ \sum_{t=1}^{T}y_{t}x'_{t} \bigg] \bigg[ \sum_{t=1}^{T}x_{t}x'_{t} \bigg]^{-1}$

$\hat{\Omega} = \frac{1}{T}\sum_{t=1}^{T}\hat{\epsilon}_{t}\hat{\epsilon}'_{t}$

By construction, the first row of $\hat{\Pi}'$ is the ML estimate for the constant term. The second row of $\hat{\Pi}'$ is the estimate for the coefficient matrix for the first lag, the third row is the coefficient matrix for the second lag and so on until the $p+1$ row.

Information Criteria

Akaike Information Criterion = AIC = $-2L + 2(k + 2kp)$

Bayesian Information Criterion = BIC = $-2L + (k + 2kp)ln(T)$

Where L is the log-likelihood, k is the number of variables in the system (6), T is the sample size, and p is the lag. Note that in the above, we have the lowest BIC score with a lag set to 1. In general, we select the lag based on the lowest AIC (or BIC) score.

For a thorough description of log likelihood see here

In [5]:
#Plot the auto correlation function for each firm
results.plot_acorr()

On the diagonal is the auto-correlation for each of the tickers. For each column, we have the correlation between the lagged observations of each ticker against the ticker represented by that column. The columns are in the order listed above. For example, the graph in the first column and the second row represents the correlation between the current search volume for CVX against one-day lagged observations of the search index for XOM.

In [17]:
#Graph Impulse Response Functions
irf = results.irf(10)

#Orthogonalized impulse response http://economia.unipv.it/pagp/pagine_personali/erossi/dottorato_svar.pdf
orth=True

if orth:
    title = 'Impulse responses (orthogonalized)'
    irfs = irf.orth_irfs
    
impulse=None
response=None
signif=0.05 
plot_params=None
subplot_params=None,
plot_stderr=True 
stderr_type='asym'
if stderr_type == 'asym':
    stderr = irf.cov(orth=orth)

repl=1000
seed=None 
component=None

plotting.irf_grid_plot(irfs, stderr, impulse, response,
                       irf.model.names, title, signif=signif,
                       subplot_params={'fontsize': 14},
                        plot_params=plot_params, figsize=(20,20), stderr_type=stderr_type)

Interpreting Impulse Response Functions

On the diagonal, we have the impulse response of each search term with itself. As we can see, across the board we have that a one unit shock in the weekly Google Search Index leads to a increase in the short term and a gradual decline in the search index over time. The graphs on the off-diagonal shows what happens in the 10 weeks following a one unit increase in the search index for the search term on the right of the arrow on the weekly search index on the left of the arrow.

Most notably, we have the following observations:

  • XOM has a positive and significant impact on CVX but CVX does not have a significant impact on XOM. This suggests that searces for Exxon Mobil lead searches for Chevron.
  • Searches for Oil Price leads XOM and CVX but does not seem to have a significant relationship with SLB or ENI.
  • Searches for BP seems to have a negative relationship with searches for SLB. In both graphs, a positive shock to one results in less searches for the other.
  • Searches for BP lead searches for EPD
  • Searches for ENI does not seem to lead searches for any of the otehr search terms