# Overview

BigVAR is the companion R package to the papers “VARX-L: Structured Regularization for Large Vector Autoregression with Exogenous Variables” (Joint with David Matteson and Jacob Bien) and “High Dimensional Forecasting via Interpretable Vector Autoregression (HLag)” (Joint with Ines Wilms, David Matteson, and Jacob Bien).

BigVAR allows for the simultaneous estimation of high-dimensional time series by applying structured penalties to the standard vector autoregression (VAR) and vector autoregression with exogenous variables (VARX) frameworks. This is useful in many applications which make use of time-dependent data, such as macroeconomics, finance, and internet traffic, as the conventional VAR and VARX are heavily overparameterized.

Our package adapts solution methods from the regularization literature to a multivariate time series setting, allowing for the computationally efficient estimation of high-dimensional VAR and VARX models.

We also allow for refitting based on the nonzero support selected by our procedures as well as the ability to incorporate mild non-stationarity by shrinking toward a random walk. For more information on these extensions, we refer you to our papers VARX-L and HLAG.

The following presents a brief overview of our notation, the models contained in BigVAR, and the functionality of the package.

## Notation and Methodology

Let $$\{ \mathbf{y_t}\}_{t = 1}^T$$ denote a $$k$$-dimensional vector time series and $$\{\mathbf{x}_t\}_{t=1}^{T}$$ denote an $$m$$-dimensional unmodeled series. A vector autoregression with exogenous variables of order $$(p,s)$$, VARX$$_{k,m}$$($$p,s$$), can be expressed as
\begin{align} \label{VAR1} \mathbf{y}_t=\mathbf{\nu}+\sum_{\ell=1}^p\mathbf{\Phi}^{(\ell)}\mathbf{y}_{t-\ell}+\sum_{j=1}^s \mathbf{\beta}^{(j)}\mathbf{x}_{t-j}+\mathbf{u}_t \; \text{ for } \;t=1,\ldots,T, \end{align} in which $$\mathbf{\nu}$$ denotes a $$k\times 1$$ intercept vector, each $$\mathbf{\Phi}^{(\ell)}$$ represents a $$k\times k$$ endogenous (modeled) coefficient matrix, each $$\mathbf{\beta}^{(j)}$$ represents a $$k\times m$$ exogenous (unmodeled) coefficient matrix, and $$\mathbf{u}_t\stackrel{\text{wn}}{\sim}(\mathbf{0},\mathbf{\Sigma}_u)$$. Note that the VAR is a special case of Equation $$\ref{VAR1}$$ in which the second summation ($$\sum_{j=1}^s \mathbf{\beta}^{(j)}\mathbf{x}_{t-j}$$) is not included.
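To make the notation concrete, the following base-R sketch simulates a small VAR(1), i.e. Equation $$\ref{VAR1}$$ with $$p=1$$ and no exogenous series. The coefficient matrix and innovation scale below are arbitrary choices for illustration, not package defaults.

```r
# Base-R sketch: simulate a k = 3 VAR(1) with nu = 0 and Phi = 0.4 * I.
set.seed(42)
k <- 3
T <- 200
Phi <- diag(0.4, k)                 # k x k endogenous coefficient matrix
nu  <- rep(0, k)                    # k x 1 intercept vector
Ysim <- matrix(0, nrow = T, ncol = k)
for (t in 2:T) {
  u <- rnorm(k, sd = 0.1)           # white-noise innovation u_t
  Ysim[t, ] <- nu + Phi %*% Ysim[t - 1, ] + u
}
dim(Ysim)                           # T x k, the orientation BigVAR expects
```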

### Example Sparsity Patterns

1. The Basic VARX-L will zero individual elements in $$[\mathbf{\Phi},\mathbf{\beta}]$$
2. The Lag Group VARX-L assigns a group lasso penalty by lag, with each coefficient matrix $$\mathbf{\Phi}^{(\ell)}$$ forming a group. Each column of the exogenous coefficient matrices $$\mathbf{\beta}$$ is assigned to its own group.
3. The Own/Other VARX-L assigns separate group lasso penalties to own lags ($$\text{diag}(\mathbf{\Phi}^{(\ell)})$$) and other lags ($$\mathbf{\Phi}^{(\ell)^{-}}$$). It applies the same penalty as the Lag Group VARX-L to the exogenous structure.
4. The Sparse Lag and Sparse Own/Other VARX-L extend the above structures to allow for within-group sparsity.

### VARX-L

$$\tt BigVAR$$ can be used to apply the following penalties to the VARX (Equation $$\ref{VAR1}$$):

| Model Structure | BigVAR Name | Penalty | Solution Algorithm |
|---|---|---|---|
| Basic | $$\tt{Basic}$$ | $$\lambda\|[\mathbf{\Phi},\mathbf{\beta}]\|_1$$ | Coordinate Descent |
| Lag VARX-L | $$\tt{Lag}$$ | $$\lambda\big(\sum_{\ell=1}^p\|\mathbf{\Phi}^{(\ell)}\|_F+\sqrt{k}\sum_{j=1}^s\sum_{i=1}^m\|\beta_{\cdot,i}^{(j)}\|_F \big)$$ | Block Coordinate Descent |
| Own/Other VARX-L | $$\tt{OwnOther}$$ | $$\lambda(\rho_1\sum_{\ell=1}^p \|\text{diag}(\mathbf{\Phi}^{(\ell)})\|_F+\gamma_1\sum_{\ell=1}^p\|\mathbf{\Phi}^{(\ell^{-})}\|_F+\sqrt{k}\sum_{j=1}^s\sum_{i=1}^m\|\beta_{\cdot,i}^{(j)}\|_F)$$ | Block Coordinate Descent |
| Sparse Lag VARX-L | $$\tt{SparseLag}$$ | $$\lambda\big[(1-\alpha)\sqrt{k^2}\sum_{\ell=1}^p\|\mathbf{\Phi}^{(\ell)}\|_F+\alpha\|\mathbf{\Phi}\|_1+(1-\alpha)\sqrt{k}\sum_{j=1}^s\sum_{i=1}^m\|\beta_{\cdot,i}^{(j)}\|_F +\alpha\|\mathbf{\beta}\|_1\big]$$ | Generalized Gradient Descent |
| Own/Other Sparse VARX-L | $$\tt{SparseOwnOther}$$ | $$\lambda(1-\alpha)\big[\rho_1\sum_{\ell=1}^p \|\text{diag}(\mathbf{\Phi}^{(\ell)})\|_F+\gamma_1\sum_{\ell=1}^p\|\mathbf{\Phi}^{(\ell)^{-}}\|_F\big]+\alpha\lambda\|\mathbf{\Phi}\|_1$$ | Generalized Gradient Descent |

$$\lambda>0$$ is a penalty parameter that can be estimated empirically or chosen by the user; larger values of $$\lambda$$ encourage a greater degree of sparsity. $$0\leq \alpha\leq 1$$ is an additional penalty parameter set by default to $$\frac{1}{k+1}$$ to control within-group sparsity (though we also allow for it to be estimated empirically with the option $${\tt dual=TRUE}$$ in the function $${\tt constructModel}$$). $$\rho_1=\sqrt{k}$$ and $$\gamma_1=\sqrt{k(k-1)}$$ are weights accounting for the number of elements in each group.
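To give a sense of the solution algorithms listed above, the sketch below implements the soft-thresholding coordinate descent update that underlies a lasso (Basic-style) penalty, in a toy single-response form. This is an illustrative sketch of the technique only, not BigVAR's internal implementation.

```r
# Soft-thresholding operator: the closed-form solution of the
# univariate lasso subproblem.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Cyclic coordinate descent for min_beta (1/2n)||y - X beta||^2 + lambda ||beta||_1
lasso_cd <- function(y, X, lambda, n_iter = 100) {
  beta <- rep(0, ncol(X))
  n <- nrow(X)
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # partial residual excluding coordinate j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # univariate lasso update for coordinate j
      beta[j] <- soft_threshold(sum(X[, j] * r_j) / n, lambda) /
        (sum(X[, j]^2) / n)
    }
  }
  beta
}
```

Larger values of `lambda` drive more coefficients to exactly zero, mirroring the role of $$\lambda$$ in the table above.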

### HVAR

| Model Structure | BigVAR Name | Penalty |
|---|---|---|
| Componentwise | $$\tt{HVARC}$$ | $$\sum_{i=1}^k\sum_{\ell=1}^p\|\mathbf{\Phi}_i^{(\ell:p)}\|_2$$ |
| Own/Other | $$\tt{HVAROO}$$ | $$\sum_{i=1}^k\sum_{\ell=1}^p\left[\|\mathbf{\Phi}_{i}^{(\ell:p)}\|_2+\|(\mathbf{\Phi}_{i,-i}^{(\ell)}, \mathbf{\Phi}_{i}^{([\ell+1]:p)})\|_2\right]$$ |
| Elementwise | $$\tt{HVARELEM}$$ | $$\sum_{i=1}^k\sum_{j=1}^k\sum_{\ell=1}^p\|\mathbf{\Phi}_{ij}^{(\ell:p)}\|_2$$ |
| Lag-weighted Lasso | $$\tt{Tapered}$$ | $$\sum_{\ell=1}^p\ell^{\gamma}\|\mathbf{\Phi}^{(\ell)}\|_1$$ |

We additionally incorporate several VAR-specific penalties that directly address lag order selection. In addition to returning sparse solutions, our $$\text{HVAR}_k(p)$$ procedures induce regularization toward models with low maximum lag order. To allow for greater flexibility, instead of imposing a single, universal lag order (as information criterion minimization based approaches tend to do), we allow it to vary across marginal models (i.e. the rows of the coefficient matrix $$\mathbf{\Phi}=[\mathbf{\Phi}^{(1)},\dots,\mathbf{\Phi}^{(p)}]$$).

$$\tt{BigVAR}$$ includes three HVAR models as well as the “Lag-weighted Lasso,” which incorporates a lasso penalty that increases geometrically as the lag order increases. This penalty does not directly address lag order, but it encourages a greater degree of sparsity at more distant lags (as controlled by the additional penalty parameter $$\gamma \in (0,1)$$).

The componentwise HVAR embeds a conventional lag order selection penalty into the hierarchical group lasso; the maximum lag order can vary across series, but within a series all components have the same maximum lag. The Own/Other HVAR adds another layer of hierarchy: within a lag, a series' own lags are prioritized over the lags of other series. Finally, the Elementwise HVAR allows for the most flexibility, permitting each series in each marginal model to have its own maximum lag order, resulting in $$k^2$$ possible lag orders.
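The hierarchical group norms in the table above can be evaluated directly. The sketch below computes the componentwise (HVARC) penalty $$\sum_{i=1}^k\sum_{\ell=1}^p\|\mathbf{\Phi}_i^{(\ell:p)}\|_2$$ for a coefficient array stored as $$k \times k \times p$$; this storage layout is an assumption made for the example and need not match the package's internal representation.

```r
# Componentwise HLAG penalty: for each row i and lag l, the group
# Phi_i^(l:p) collects row i of every lag matrix from lag l through p.
hvarc_penalty <- function(Phi) {
  k <- dim(Phi)[1]
  p <- dim(Phi)[3]
  total <- 0
  for (i in 1:k) {
    for (l in 1:p) {
      group <- Phi[i, , l:p]          # nested group: lags l, l+1, ..., p
      total <- total + sqrt(sum(group^2))
    }
  }
  total
}
```

Because the groups are nested, a coefficient at lag $$\ell$$ appears in every group from lag 1 through $$\ell$$, which is what drives entire tails of lags to zero.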

### Penalty Parameter selection

In order to account for time dependence, cross-validation is conducted in a rolling manner. Define the time indices $$T_1=\left\lfloor \frac{T}{3} \right\rfloor$$ and $$T_2=\left\lfloor \frac{2T}{3} \right\rfloor$$.

The training period $$T_1+1$$ through $$T_2$$ is used to select $$\lambda$$, and $$T_2+1$$ through $$T$$ is used to evaluate forecast accuracy in a rolling manner. The process is visualized in the following figure.

Define $$\hat{\mathbf{y}}_{t+1}^{\lambda}$$ as the one-step ahead forecast based on $$\mathbf{y}_1,\dots,\mathbf{y}_t$$. We choose the optimal $$\lambda$$ by minimizing one-step ahead mean squared forecast error (MSFE) over the training period: MSFE$$(\lambda)=\frac{1}{(T_2-T_1-1)}\sum_{t=T_1}^{T_2-1} \|\hat{\mathbf{y}}_{t+1}^{\lambda}-\mathbf{y}_{t+1}\|_F^2.$$
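Given a matrix of rolling one-step forecasts, the MSFE criterion above is straightforward to compute. The helper below is an illustrative implementation of the formula, not a function exported by $${\tt BigVAR}$$.

```r
# Rolling MSFE: Y and Yhat are T x k matrices, where Yhat[t, ] is the
# one-step forecast of Y[t, ] made from data through time t - 1.
# Averages squared forecast errors over forecast targets T1+1, ..., T2.
rolling_msfe <- function(Y, Yhat, T1, T2) {
  errs <- Y[(T1 + 1):T2, , drop = FALSE] - Yhat[(T1 + 1):T2, , drop = FALSE]
  sum(errs^2) / (T2 - T1 - 1)
}
```

In practice one evaluates this criterion for every $$\lambda$$ in the grid and keeps the minimizer.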

### Data

A simulated multivariate time series is included with $${\tt BigVAR}$$. The sparsity structure of the generator matrix $$B$$ is visualized in the following figure.

# BigVAR Package

Our package allows for the straightforward estimation of the aforementioned procedures. Unlike Bayesian approaches, we do not require the use of complex or subjective hyperparameters.

## Input Arguments

The end-user completely specifies their model by constructing an object of class $${\tt BigVAR}$$.

To construct an object of class “BigVAR,” simply run the command:

```r
library(BigVAR)
data(Y)
mod1 <- constructModel(Y, p = 4, "Basic", gran = c(150, 10), RVAR = FALSE,
                       h = 1, cv = "Rolling", MN = FALSE, verbose = FALSE, IC = TRUE)
```

There are four required arguments; the rest are only needed if expanded or non-standard functionality is requested.

The input arguments are:

1. Y: Multivariate time series of the form $$T\times k$$
2. p: Maximal lag order
3. Structure: Structure of Penalty. Choices are
1. “Basic:” (Lasso Penalty)
2. “Lag:” (Lag Group Penalty)
3. “OwnOther:” (Own/Other Group Penalty)
4. “SparseLag:” (Lag Sparse Group Lasso Penalty)
5. “SparseOwnOther:” (Own/Other Sparse Group Penalty)
6. “EFX:” (Endogenous-First VARX)
7. “HVARC:” (Componentwise HLAG)
8. “HVAROO:” (Own/Other HLAG)
9. “HVARELEM:” (Elementwise HLAG)
10. “Tapered:” (Lasso with Lag Penalty)
11. “BGR:” Bayesian VAR as detailed in Banbura et al (2010).

The first five structures can be applied to both VAR and VARX models, EFX can only be applied to VARX models, and the remaining five are applicable only to VAR models.

4. Granularity: two options governing the grid of penalty parameters $$\lambda$$. The first controls the depth of the lambda grid (a good default is 50); the second controls the number of grid values (a good default is 10). If the grid does not go deep enough, forecasts may be suboptimal, but if it is too deep, the program may take a substantial amount of time to run. The index of the optimal penalty parameter is monitored by $${\tt cv.BigVAR}$$; if it is on the border of the grid, it is recommended to re-run the function with an adjusted granularity parameter. If the option $${\tt ownlambdas}$$ is set to TRUE, $${\tt gran}$$ can instead be used to supply a user-defined grid of lambdas.
5. “RVAR:” TRUE/FALSE option to use our relaxed VAR(X) procedure to re-estimate the nonzero support selected by one of our models.
6. “h:” Forecast horizon to optimize over (default 1).
7. “MN:” Option to select a pseudo “Minnesota Prior” (shrinks variables toward a univariate random walk; useful for mildly non-stationary data).
8. “Verbose:” TRUE/FALSE. If TRUE, a progress bar is displayed for the cross-validation and evaluation procedures.
9. “IC:” TRUE/FALSE. If TRUE, AIC and BIC VAR(X) benchmarks are returned.
10. “VARX:” VARX characteristics in list form. The list must contain two arguments:
    1. k: number of modeled series.
    2. s: maximum lag order for unmodeled series.
11. “alpha:” User-defined $$0\leq \alpha \leq 1$$ representing the tradeoff between the lasso and group lasso penalties (only applies to the Sparse Lag and Sparse Own/Other structures).
12. “intercept:” Indicator as to whether an intercept should be fit (default $${\tt TRUE}$$). (The intercept is fit separately and is not subject to regularization.)
13. “tol:” Optimization tolerance (default $$1e-4$$).
14. “window.size:” Size of the window for rolling cv (default 0, resulting in an expanding window).
15. “T1:” Start of the rolling cv period (default $$\lfloor \frac{T}{3} \rfloor$$).
16. “T2:” End of the rolling cv period (default $$\lfloor \frac{2T}{3} \rfloor$$).
17. “cv:” Type of cross-validation used to select the penalty parameter (options are “Rolling” (default) and “LOO,” a pseudo “leave one out” procedure that respects time dependence).
18. “ownlambdas:” Indicator as to whether the user supplied their own penalty grid in slot $${\tt gran}$$ (default $${\tt FALSE}$$).
19. “recursive:” Whether recursive (iterated) or direct forecasts are used for multi-step predictions (default FALSE, indicating direct forecasts). For more information on the distinction, consult Marcellino and Watson (2004).
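Under the hood, estimating a VAR amounts to regressing each observation on its stacked lags. The base-R sketch below builds the lagged design matrix for a VAR($$p$$); the function name and column layout (intercept first, then lags $$1,\dots,p$$) are choices made for this example and are not BigVAR internals.

```r
# Build the lagged design matrix for a VAR(p): row t collects
# [1, y_{t-1}', ..., y_{t-p}'] for t = p+1, ..., T.
build_lag_matrix <- function(Y, p) {
  T <- nrow(Y)
  k <- ncol(Y)
  Z <- matrix(NA_real_, nrow = T - p, ncol = 1 + k * p)
  Z[, 1] <- 1                                  # intercept column
  for (l in 1:p) {
    cols <- 1 + (((l - 1) * k + 1):(l * k))    # block of columns for lag l
    Z[, cols] <- Y[(p - l + 1):(T - l), ]      # y_{t-l} for each usable t
  }
  Z
}
```

For a $$T \times k$$ input this yields a $$(T-p) \times (1 + kp)$$ matrix, which is why the VAR becomes heavily overparameterized as $$k$$ and $$p$$ grow.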

Once we construct the model, we can run $${\tt cv.BigVAR}$$, which chooses the optimal penalty parameter via rolling cross-validation, evaluates its forecast accuracy, and compares it against mean, random walk, AIC, and BIC VAR(X) benchmarks over an out-of-sample period.

```r
results <- cv.BigVAR(mod1)
results
## *** BIGVAR MODEL Results ***
## Structure
## [1] "Basic"
## Forecast Horizon
## [1] 1
## Minnesota VAR
## [1] FALSE
## Maximum Lag Order
## [1] 4
## Optimal Lambda
## [1] 0.0276
## Grid Depth
## [1] 150
## Index of Optimal Lambda
## [1] 10
## In-Sample MSFE
## [1] 0.045
## BigVAR Out of Sample MSFE
## [1] 0.037
## *** Benchmark Results ***
## Conditional Mean Out of Sample MSFE
## [1] 0.244
## AIC Out of Sample MSFE
## [1] 0.04
## BIC Out of Sample MSFE
## [1] 0.04
## RW Out of Sample MSFE
## [1] 0.19
```

### Diagnostics

Generally, the only potential post-hoc diagnostic procedures are adjusting the depth/size of the penalty grid as well as the maximal lag order. A more coherent lag selection procedure is a future extension of $${\tt BigVAR}$$. Currently, we suggest setting the maximal lag order based on the frequency of the data (e.g. 4 for quarterly, 12 for monthly, etc.).

The $${\tt cv.BigVAR}$$ function returns an object of class “BigVAR.results.” This object inherits the properties of class BigVAR and contains substantially more information than is displayed.

```r
str(results)
## Formal class 'BigVAR.results' [package "BigVAR"] with 56 slots
##   ..@ InSampMSFE   : num [1:10] 0.1747 0.1585 0.1322 0.1055 0.0853 ...
##   ..@ InSampSD     : num [1:10] 0.024 0.0227 0.02 0.0157 0.0119 ...
##   ..@ LambdaGrid   : num [1:10] 4.146 2.376 1.361 0.78 0.447 ...
##   ..@ index        : int 10
##   ..@ OptimalLambda: num 0.0276
##   ..@ OOSMSFE      : num [1:34] 0.0492 0.0449 0.0515 0.0473 0.0223 ...
##   ..@ seoosmsfe    : num 0.00424
##   ..@ MeanMSFE     : num 0.244
##   ..@ AICMSFE      : num 0.0396
##   ..@ AICPreds     : num [1:34, 1:3] 0.09081 -0.25179 -0.14488 0.00224 0.05599 ...
##   ..@ BICMSFE      : num 0.0396
##   ..@ BICpvec      : int [1:34] 3 3 3 3 3 3 3 3 3 3 ...
##   ..@ BICsvec      : int [1:34] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ AICpvec      : int [1:34] 3 3 3 3 3 3 3 3 3 3 ...
##   ..@ AICsvec      : int [1:34] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ BICSD        : num 0.00455
##   ..@ BICPreds     : num [1:34, 1:3] 0.09081 -0.25179 -0.14488 0.00224 0.05599 ...
##   ..@ RWMSFE       : num 0.19
##   ..@ RWPreds      : num [1:34, 1:3] 0.27917 0.00805 -0.06027 -0.15444 0.216 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:34] "trainY1" "trainY1" "trainY1" "trainY1" ...
##   .. .. ..$ : NULL
##   ..@ MeanSD       : num 0.0284
##   ..@ MeanPreds    : num [1:34, 1:3] -0.00222 -0.00206 -0.00297 -0.0053 -0.00194 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:34] "ypred" "ypred" "ypred" "ypred" ...
##   .. .. ..$ : NULL
##   ..@ AICSD        : num 0.00455
##   ..@ RWSD         : num 0.0328
##   ..@ betaPred     : num [1:3, 1:13] 0.00618 0.01118 -0.01606 -0.17272 -0.24682 ...
##   ..@ Zvals        : num [1:13, 1] 1 0.1087 0.1347 0.2403 -0.0885 ...
##   ..@ VARXI        : logi FALSE
##   ..@ resids       : num [1:96, 1:3] 0.054 0.0524 -0.0978 -0.0144 0.1777 ...
##   ..@ preds        : num [1:34, 1:3] 0.08452 -0.20426 -0.12849 0.00424 0.05938 ...
##   ..@ dual         : logi FALSE
##   ..@ contemp      : logi(0)
##   ..@ fitted       : num [1:96, 1:3] -0.013 0.0902 0.0178 -0.1123 0.0341 ...
##   ..@ lagmatrix    : num [1:13, 1:96] 1 -0.0529 0.1385 0.0501 0.0174 ...
##   ..@ betaArray    : num [1:3, 1:13, 1:34] 0.000763 -0.003185 -0.028925 -0.154035 -0.185676 ...
##   ..@ Data         : num [1:100, 1:3] -0.00611 -0.00372 0.01744 -0.05285 0.04102 ...
##   ..@ lagmax       : num 4
##   ..@ Structure    : chr "Basic"
##   ..@ Relaxed      : logi FALSE
##   ..@ Granularity  : num [1:2] 150 10
##   ..@ intercept    : logi TRUE
##   ..@ Minnesota    : logi FALSE
##   ..@ horizon      : num 1
##   ..@ verbose      : logi(0)
##   ..@ crossval     : chr "Rolling"
##   ..@ ic           : logi(0)
##   ..@ VARX         : list()
##   ..@ T1           : num 29
##   ..@ T2           : num 62
##   ..@ ONESE        : logi(0)
##   ..@ ownlambdas   : logi FALSE
##   ..@ tf           : logi FALSE
##   ..@ alpha        : num 0.25
##   ..@ recursive    : logi FALSE
##   ..@ dates        : chr(0)
##   ..@ constvec     : num [1:3] 1 1 1
##   ..@ tol          : num 1e-04
##   ..@ lagselect    : logi(0)
```

It also has a plot method, which shows a comparison of in-sample MSFE over the grid of $$\lambda$$ values.

```r
plot(results)
```

Generally, you want this graph to have a parabolic shape with the optimal value in one of the middle indices. In this scenario, since the slope of the line is very flat, it is likely that increasing the depth of the grid would not substantially improve forecasts. It is not recommended to make the depth too large as it substantially increases computation time.

```r
mod2 <- constructModel(Y, p = 4, "Basic", gran = c(5, 10), RVAR = FALSE,
                       h = 1, cv = "Rolling", MN = FALSE, verbose = FALSE, IC = FALSE)
res2 <- cv.BigVAR(mod2)
plot(res2)
```