The economic value of analyzing high-frequency financial data is now widely recognized, in both academia and the financial industry. Such data form the basis of intraday and daily risk monitoring and forecasting, serve as an input to the portfolio allocation process, and underpin high-frequency trading. To support the efficient use of high-frequency data in financial decision making, the highfrequency package implements state-of-the-art techniques for the cleaning and matching of trades and quotes, as well as the calculation and forecasting of liquidity, realized volatility and correlation measures based on high-frequency data.
The highfrequency package is the outcome of a Google Summer of Code project and is an improved and updated version of two packages: RTAQ (Cornelissen and Boudt 2012) and realized (Payseur 2008).
Handling high-frequency data can be particularly challenging because of the specific characteristics of the data, as extensively documented in Yan and Zivot 2003.
The highfrequency package offers high-level tools for the analysis of high-frequency data. These tools tackle three typical challenges of working with high-frequency data. First, the number of observations is enormous and can run to millions of observations per stock per day. Second, transaction-by-transaction data is by nature irregularly spaced over time. Third, the recorded data often contains errors, for various reasons.
The highfrequency package offers a (basic) interface to manage high-frequency trades and quotes data. First and foremost, it offers easy-to-use tools to clean and aggregate high-frequency data, to calculate liquidity measures, and to measure and forecast volatility.
The tools in highfrequency are designed to work with most data input types. The package contains an extensive collection of functions to calculate realized measures, since it arose from merging the package RTAQ (Cornelissen and Boudt 2012) from the TradeAnalytics project and the package realized (Payseur 2008).
highfrequency builds strongly on the functionality offered by the xts package (Ryan and Ulrich 2009). You can use the functionality in the highfrequency package for trades and quotes objects from the most popular data vendors, as long as your objects are stored as xts objects.
A stable version of the highfrequency package has been published on CRAN and can be installed with the following command:
install.packages("highfrequency")
The latest development version of the highfrequency package is hosted on GitHub and can be installed with the following commands:
library("devtools"); install_github("highfrequency", "jonathancornelissen")
See the highfrequency GitHub repository. In a couple of months, I will upload an interactive R tutorial for the highfrequency package on the DataMind platform. Furthermore, the full highfrequency manual and documentation can be read online.
Users who have their own system set up to store their high-frequency data as xts objects can skip this section.
The functions in the highfrequency package to calculate liquidity and other measures can take high-frequency data from various data vendors as input: NYSE TAQ, WRDS TAQ, Reuters, Bloomberg, etc. To achieve this flexibility, we rely on the functionality provided by the excellent quantmod package (Ryan 2011). Additionally, highfrequency provides functionality to convert raw data from several data providers into xts objects. The latter is discussed and illustrated in this section.
High-frequency financial data is typically subdivided into two files, containing information on the trades and quotes respectively.
highfrequency can currently parse three types of input data into xts objects:
(i) ".txt" files extracted from the NYSE TAQ database,
(ii) ".csv" files extracted from the WRDS database,
(iii) ".asc" files from http://www.tickdata.com.
For each of these input data types, we briefly discuss how the convert function can be used to convert the txt/csv/asc files into xts objects (Ryan and Ulrich 2009) that are saved on-disk in the "RData" format. We opt to store the data as xts objects because this type of object can be indexed by an indicator of time and date.^{1}
Parsing raw NYSE TAQ data into xts:
Suppose the folder ~/raw_data on your drive contains the two folders "2008-01-02" and "2008-01-03", and suppose these folders each contain the files AAPL_trades.txt and AA_trades.txt with the daily TAQ data bought from the New York Stock Exchange (NYSE) for the AAPL and AA stocks.
The raw data can then easily be converted into on-disk xts objects as follows:
from = "2008-01-02"
to = "2008-01-03"
datasource = "~/raw_data"
datadestination = "~/xts_data"
convert(from, to, datasource, datadestination, trades = TRUE, quotes = FALSE,
        ticker = c("AA", "AAPL"), dir = TRUE, extension = "txt",
        header = FALSE, tradecolnames = NULL, quotecolnames = NULL,
        format = "%Y%m%d %H:%M:%S")
At this point, the folder ~/xts_data will contain two folders, named 2008-01-02 and 2008-01-03, containing the files AA_trades.RData and AAPL_trades.RData with the respective daily trade data. The data can now be loaded with the TAQLoad function (see below).
Parsing raw WRDS TAQ data into xts:
Suppose the folder ~/raw_data on your drive contains the files IBM_quotes.csv and IBM_trades.csv acquired from Wharton Research Data Service (WRDS).^{2} These files contain, respectively, the quote and trade data for the IBM stock from "2011-12-01" up to "2011-12-02".
The data can then easily be converted into on-disk xts objects as follows:
from = "2011-12-01"
to = "2011-12-02"
datasource = "~/raw_data"
datadestination = "~/xts_data"
convert(from = from, to = to, datasource = datasource,
        datadestination = datadestination, trades = TRUE, quotes = TRUE,
        ticker = "IBM", dir = TRUE, extension = "csv", header = TRUE,
        tradecolnames = NULL, quotecolnames = NULL,
        format = "%Y%m%d %H:%M:%S", onefile = TRUE)
Now, the folder ~/xts_data contains two folders, one for each day in the sample, i.e. "2011-12-01" and "2011-12-02". Each of these folders contains a file IBM_trades.RData with the trades for that day and a file IBM_quotes.RData with the quotes for that day, both saved as xts objects. As a final step, the function TAQLoad can be used to load the on-disk data into your R workspace. For example, you can get the quotes for this sample into your workspace as follows:
> xts_data = TAQLoad(tickers = "IBM", from = "2011-12-01", to = "2011-12-02",
+                    trades = FALSE, quotes = TRUE,
+                    datasource = datadestination)
> head(xts_data)
                    SYMBOL EX  BID      BIDSIZ OFR      OFRSIZ MODE
2011-12-01 04:00:00 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:17 "IBM"  "P" "185.92" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:18 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:25 "IBM"  "P" "176.85" "1"    "187.73" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"
Parsing raw Tickdata.com data into xts:
Suppose the folder ~/raw_data on your drive contains the files GLP_quotes.asc and GLP_trades.asc acquired from www.tickdata.com. These files contain, respectively, the quote and trade data for the GLP stock from "2011-01-11" up to "2011-03-11".
The data can then easily be converted into on-disk xts objects as follows:
from = "2011-01-11"
to = "2011-03-11"
datasource = "~/raw_data"
datadestination = "~/xts_data"
convert(from = from, to = to, datasource = datasource,
        datadestination = datadestination, trades = TRUE, quotes = TRUE,
        ticker = "GLP", dir = TRUE, format = "%d/%m/%Y %H:%M:%OS",
        extension = "tickdatacom", header = TRUE, onefile = TRUE)
At this point, the folder ~/xts_data contains three folders, one for each day in the sample, i.e. "2011-01-11", "2011-02-11" and "2011-03-11". Each of these folders contains a file GLP_trades.RData with the trades for that day and a file GLP_quotes.RData with the quotes for that day, both saved as xts objects. Notice the format = '%d/%m/%Y %H:%M:%OS' argument, which allows for timestamps more accurate than seconds, e.g. milliseconds in this case. As a final step, the function TAQLoad can be used to load the on-disk data into your R workspace. For example, you can get the trade data for the first day of the sample as follows:
> options("digits.secs" = 3)  # Show milliseconds
> xts_data = TAQLoad(tickers = "GLP", from = "2011-01-11", to = "2011-01-11",
+                    trades = TRUE, quotes = FALSE,
+                    datasource = datadestination)
> head(xts_data)
                        SYMBOL EX  PRICE     SIZE   COND  CORR G127
2011-01-11 09:30:00.338 "GLP"  "T" "18.0700" " 500" "O X" "0"  ""
2011-01-11 09:30:00.338 "GLP"  "T" "18.0700" " 500" "Q"   "0"  ""
2011-01-11 09:33:49.342 "GLP"  "T" "18.5000" " 150" "F"   "0"  ""
2011-01-11 09:39:29.280 "GLP"  "N" "19.2000" "4924" "O"   "0"  ""
2011-01-11 09:39:29.348 "GLP"  "D" "19.2400" " 500" "@"   "0"  "T"
2011-01-11 09:39:29.411 "GLP"  "N" "19.2400" " 200" "F"   "0"  ""
Typical trades and quotes data:
Both trades and quotes data should thus be xts objects (Ryan and Ulrich 2009) if you want to use the functionality of highfrequency. Use the code below to see how trade and quote data objects are typically stored:
library(highfrequency)
data("sample_tdataraw")
head(sample_tdataraw)
data("sample_qdataraw")
head(sample_qdataraw)
In this section we discuss two very common first steps in the manipulation of high-frequency financial data: (i) the cleaning and (ii) the aggregation of the data.
For various reasons, raw trade and quote data contain numerous errors. The data is therefore not suited for analysis right away, and data cleaning is an essential step in dealing with tick-by-tick data (Brownlees and Gallo 2006).
highfrequency implements the step-by-step cleaning procedure proposed by Barndorff-Nielsen et al. (2008). Table 1 provides an overview of the cleaning functions.
A user can either use a specific cleaning procedure or a wrapper function that performs multiple cleaning steps. The wrapper functions offer on-disk functionality: they load on-disk raw data and save the cleaned data back to your hard disk. This method is advisable in case you have multiple days of data to clean. Of course, the data on your disk should be organized as discussed in the previous section to benefit from this functionality. More specifically, these functions expect the data to be saved as xts objects in a folder for each day (as the TAQLoad function does), which will always be the case if you used the convert function to convert the raw data into xts objects.
To maintain some insight into the cleaning process, the functions tradesCleanup and quotesCleanup report the total number of remaining observations after each cleaning step:
> data("sample_tdataraw")
> dim(sample_tdataraw)
[1] 48484     7
> tdata_afterfirstcleaning = tradesCleanup(tdataraw = sample_tdataraw, exchanges = "N")
> tdata_afterfirstcleaning$report
      initial number       no zero prices      select exchange
               48484                48479                20795
     sales condition merge same timestamp
               20135                 9105
> dim(tdata_afterfirstcleaning$tdata)
[1] 9105    7
Table 1: Overview of the cleaning functions.

Function                    Description
All data:
  exchangeHoursOnly         Restrict data to exchange hours
  selectexchange            Restrict data to a specific exchange
Trade data:
  noZeroPrices              Delete entries with zero prices
  autoSelectExchangeTrades  Restrict data to the exchange with the highest trade volume
  salesCondition            Delete entries with an abnormal sale condition
  mergeTradesSameTimestamp  Merge entries with the same time stamp, using the median price
  rmTradeOutliers           Delete entries with prices above the ask plus the bid-ask
                            spread, or below the bid minus the bid-ask spread
Quote data:
  noZeroQuotes              Delete entries with zero quotes
  autoSelectExchangeQuotes  Restrict data to the exchange with the highest bid size
                            plus offer size
  mergeQuotesSameTimestamp  Merge entries with the same time stamp, using the median quotes
  rmNegativeSpread          Delete entries with negative spreads
  rmLargeSpread             Delete entries if spread > maxi * median daily spread
  rmOutliers                Delete entries for which the mid-quote is outlying with
                            respect to the surrounding entries
Wrapper cleanup functions (perform the following steps sequentially for on-disk data):
  tradesCleanup             noZeroPrices, selectExchange, salesCondition,
                            mergeTradesSameTimestamp
  quotesCleanup             noZeroQuotes, selectExchange, rmLargeSpread,
                            mergeQuotesSameTimestamp, rmOutliers
  tradesCleanupFinal        rmTradeOutliers (based on cleaned quote data as well)
Prices are typically not recorded at equispaced time points, while many realized volatility measures rely on equispaced returns. Furthermore, prices are often observed at different points in time for different assets, while most multivariate realized volatility estimators rely on synchronized data (see e.g. Table 2). There are several ways to force these asynchronously and/or irregularly recorded series onto a synchronized and/or equispaced time grid.
The most popular method, previous tick aggregation, forces prices to an equispaced grid by taking the last price realized before each grid point.
highfrequency
provides users with the aggregatets
function for fast and easy previous tick aggregation:
> library("highfrequency")
> # Load sample price data
> data("sample_tdata")
> ts = sample_tdata$PRICE
> # Previous-tick aggregation to the 5-minute sampling frequency:
> tsagg5min = aggregatets(ts, on = "minutes", k = 5)
> head(tsagg5min)
                      PRICE
2008-01-04 09:35:00 193.920
2008-01-04 09:40:00 194.630
2008-01-04 09:45:00 193.520
2008-01-04 09:50:00 192.850
2008-01-04 09:55:00 190.795
2008-01-04 10:00:00 190.420
> # Previous-tick aggregation to the 30-second sampling frequency:
> tsagg30sec = aggregatets(ts, on = "seconds", k = 30)
> tail(tsagg30sec)
                      PRICE
2008-01-04 15:57:30 191.790
2008-01-04 15:58:00 191.740
2008-01-04 15:58:30 191.760
2008-01-04 15:59:00 191.470
2008-01-04 15:59:30 191.825
2008-01-04 16:00:00 191.670
In the example above, the prices are forced onto regular equispaced time grids of five minutes and 30 seconds respectively. Furthermore, the aggregatets function is built into all realized measures (see Section 4) and can be invoked by setting the arguments align.by and align.period. In that case, the price(s) are first forced onto an equispaced regular time grid, and the realized measure is then calculated based on the returns over these regular time periods. This has the advantage that the user can input the original price series into the realized measures without having to worry about the asynchronicity or irregularity of the price series.
Another synchronization method (not integrated into the realized measures) is refresh time, initially proposed by Harris and Wood (1995) and recently advocated in Barndorff-Nielsen et al. (2011). The function refreshTime in highfrequency can be used to (very quickly) force time series onto a synchronized but not necessarily equispaced time grid. The so-called refresh times are the time points at which all assets have traded at least once since the last refresh point. More specifically, the first refresh time corresponds to the first time at which all stocks have traded. The subsequent refresh time is defined as the first time at which all stocks have traded again. This process is repeated until the end of one of the time series is reached.
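The refresh-time scheme just described can be sketched in a few lines of base R. Note that refresh_times and the toy timestamp vectors below are hypothetical illustrations, not part of the package; the package's refreshTime function works on lists of xts objects instead.

```r
# A minimal base-R sketch of the refresh-time scheme described above.
refresh_times <- function(t1, t2) {
  out <- c()
  repeat {
    # first point in time at which both assets have traded
    # since the previous refresh point
    tau <- max(min(t1), min(t2))
    out <- c(out, tau)
    t1 <- t1[t1 > tau]
    t2 <- t2[t2 > tau]
    if (length(t1) == 0 || length(t2) == 0) break
  }
  out
}

refresh_times(c(1, 3, 4, 8), c(2, 5, 9))  # → 2 5 9
```

In the toy example, both assets have traded for the first time by time 2, again by time 5, and again by time 9, which yields the refresh-time grid (2, 5, 9).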
Illustration of price aggregation with refresh time and subsequent volatility calculation:
> data("sample_tdata")
> data("sample_qdata")
> # We assume that stock1 and stock2 contain price data on imaginary stocks:
> stock1 = sample_tdata$PRICE
> stock2 = sample_qdata$BID
> # Previous-tick aggregation to one minute:
> mPrice_1min = cbind(aggregatePrice(stock1), aggregatePrice(stock2))
> # Refresh-time aggregation:
> mPrice_Refresh = refreshTime(list(stock1, stock2))
> # Calculate a jump-robust volatility measure
> # based on synchronized data:
> rbpcov1 = rBPCov(mPrice_1min, makeReturns = TRUE)
> rbpcov2 = rBPCov(mPrice_Refresh, makeReturns = TRUE)
> # Calculate a jump- and microstructure-noise-robust volatility measure
> # based on non-synchronous data:
> rtscov = rTSCov(list(stock1, stock2))
The availability of high-frequency data has enabled researchers to estimate ex post realized volatility based on squared intraday returns (Andersen et al. 2003). In practice, the main challenges in univariate volatility estimation are dealing with (i) jumps in the price level and (ii) microstructure noise. Multivariate volatility estimation is additionally challenging because of (i) the asynchronicity of observations between assets and (ii) the need for a positive semidefinite covariance matrix estimator.
The highfrequency package implements many recently proposed realized volatility and covolatility measures.^{3} An overview of the univariate and multivariate volatility estimators implemented in highfrequency is given in Table 2. The first two columns indicate whether the estimator can be applied to univariate or multivariate price series. The following two columns indicate whether the estimator is robust with respect to jumps and microstructure noise, respectively. The next column reports whether asynchronous price series can/should be used as input. The last column indicates whether the estimator always yields a positive semidefinite matrix in the multivariate case.
All realized measures have (at least) the following arguments:
rdata: the return data.
align.by: a string; align the tick data to "seconds", "minutes" or "hours".
align.period: an integer; align the tick data to this many seconds/minutes/hours.
makeReturns: boolean; should be TRUE when rdata contains prices instead of returns. FALSE by default.
cor: boolean; if TRUE, the correlation is returned. FALSE by default.
makePsd: boolean; non positive semidefinite estimates can be transformed into a positive semidefinite matrix by setting this argument to TRUE. The eigenvalue method is used.
...: additional arguments depending on the type of realized measure.
Table 2: Overview of the realized measures.

Estimator                                     Univa-  Multiva- Jump    Microstr.  Tick-by-tick  Positive
                                              riate   riate    robust  noise rob. returns input semidef.
medRV (Andersen et al. 2012)                  x                x                                /
minRV (Andersen et al. 2012)                  x                x                                /
rCov (Andersen et al. 2003)                   x       x                                         x
rBPCov (Barndorff-Nielsen and Shephard 2004)  x       x        x
rOWCov (Boudt et al. 2011a)                   x       x        x                                x
rThresholdCov (Gobbi and Mancini 2009)        x       x        x
rTSCov (Zhang 2011)                           x       x                x          x
rRTSCov (Boudt and Zhang 2012)                x       x        x       x          x
rAVGCov (Ait-Sahalia et al. 2005)             x       x                x          x             x
rKernelCov (Barndorff-Nielsen et al. 2004)    x       x                x          x             x
rHYCov (Hayashi and Yoshida 2005)                     x                           x
Another interesting feature of high-frequency data in the context of volatility measurement is the periodicity in the volatility of high-frequency returns induced by the opening, lunch and closing of financial markets (Andersen and Bollerslev 1997). highfrequency implements both the classic intraday periodicity estimation method of Andersen and Bollerslev 1997 and a jump-robust version proposed by Boudt et al. 2011b. These estimation methods assume that the standard deviation of intraday returns (called vol in the output) can be decomposed into a daily volatility factor (constant for all intraday returns observed on the same day, called dailyvol) and an intraday factor (a deterministic function of the intraday time, called periodicvol). The estimates are either non-parametric (based on a scale estimator) or parametric (based on regression estimation of a flexible specification for the intraday factor).
The sample code below illustrates the estimation of the intraday periodicity, the result of which is plotted in Figure 1:
> data("sample_real5minprices")
>
> # Compute and plot the intraday periodicity
> out = spotVol(sample_real5minprices, P1 = 6, P2 = 4, periodicvol = "TML",
+               k = 5, dummies = FALSE)
> head(out)
                         returns         vol    dailyvol periodicvol
2005-03-04 09:35:00 0.0010966963 0.004081072 0.001896816    2.151539
2005-03-04 09:40:00 0.0005614217 0.003695715 0.001896816    1.948379
2005-03-04 09:45:00 0.0026443880 0.003417950 0.001896816    1.801941
It is broadly accepted among academic researchers that, when appropriately managed, access to high-frequency data leads to a comparative advantage in predicting the volatility of future price evolution, and hence in generating alpha through quantitative investment models. Fleming et al. 2003 estimated that an investor would be willing to pay 50 to 200 basis points per year to capture the gains in portfolio performance from using high-frequency returns instead of daily returns for volatility forecasting. Their findings were later confirmed in a somewhat different setting by De Pooter et al. 2008.
In this section we discuss two classes of univariate forecasting models implemented in the highfrequency package that take high-frequency data as input:
(i) the Heterogeneous Autoregressive model for Realized Volatility (HAR), discussed in Andersen et al. 2007 and Corsi 2009a, and
(ii) the HEAVY model (High frEquency bAsed VolatilitY), introduced in Shephard and Sheppard 2010.
Roughly speaking, the goal of both models is to predict the next day's volatility based on the high-frequency price evolution prior to that day.
Although the HAR and the HEAVY model share the same goal, i.e. modeling the conditional volatility, they take different approaches. Whereas the HAR model focuses on predicting the open-to-close variation, the HEAVY model has two equations: one for the open-to-close variation and one for the close-to-close variation. The main advantages of the HAR model are that it is simple and easy to estimate (since it is essentially a special type of linear model that can be estimated by least squares), while it still manages to reproduce typical features found in volatility such as long memory and fat tails. The main advantages of the HEAVY model are that it models both the close-to-close conditional variance and the conditional expectation of the open-to-close variation. Furthermore, the HEAVY model has momentum and mean-reversion effects, and it adjusts quickly to structural breaks in the level of the volatility process. In contrast to the HAR model, the HEAVY model is estimated by Gaussian quasi-maximum likelihood, for each of the two equations in the model separately.
The next two sections review the HAR and HEAVY models in more detail, and discuss and illustrate how highfrequency can be used to estimate them.
Corsi 2009b proposed the Heterogeneous Autoregressive (HAR) model for Realized Volatility. As a first step, intraday returns are used to measure the daily volatility, e.g. simply by Realized Volatility, typically ignoring the overnight return. As a second step, the realized volatility is parameterized as a linear function of the lagged realized volatilities aggregated over different horizons. Typically, the daily, weekly and monthly horizons are chosen. The simplest version of the model, as proposed in Corsi 2009b, is referred to as the HAR-RV model.
In an interesting paper, Andersen et al. 2007 extend the model of Corsi 2009b by explicitly including the contribution of jumps to the daily Realized Volatility in the model. Their analysis suggests that the volatility originating from jumps in the price level is less persistent than the volatility originating from the continuous component of Realized Volatility, and that separating the rough jump moves from the smooth continuous moves enables better forecasting results. They propose mainly two ways to model this idea, and both are implemented in the highfrequency package. In what follows, we first discuss the different versions of the HAR model in more detail, and then discuss and illustrate how highfrequency can be used to estimate them.
The harModel function in highfrequency implements the Heterogeneous Autoregressive model for Realized Volatility discussed in Andersen et al. 2007 and Corsi 2009a. Let r_{t,i} be the i-th intraday return on day t, for i=1,…,M, with M the number of intraday returns per day. The realized volatility is then given by RV_{t} = ∑_{i=1}^{M} r_{t,i}^{2}. Denote by
RV_{t,t+h} = h^{−1} [ RV_{t+1} + RV_{t+2} + … + RV_{t+h} ]
the Realized Volatility aggregated over h days. Note that by definition RV_{t,t+1} ≡ RV_{t+1}.
The most basic volatility model discussed in Andersen et al. 2007 and Corsi 2009a is given by:
RV_{t,t+h} = β_{0} + β_{D} RV_{t} + β_{W} RV_{t−5,t} + β_{M} RV_{t−22,t} + є_{t,t+h},
for t = 1,2,…,T. Typically h=1, and the model simplifies to
RV_{t+1} = β_{0} + β_{D} RV_{t} + β_{W} RV_{t−5,t} + β_{M} RV_{t−22,t} + є_{t+1}.
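Since the model above is linear in the aggregated regressors, it can be estimated by ordinary least squares. The following base-R sketch makes this explicit for h = 1; the RV series is simulated purely for illustration (in practice it would be computed from intraday returns as described above, and harModel does all of this for you):

```r
# Base-R sketch: the HAR-RV regression for h = 1 is an ordinary linear model.
set.seed(1)
rv <- abs(rnorm(300, mean = 1e-4, sd = 2e-5))   # hypothetical daily RVs

# k-day rolling mean: position t holds (RV_{t-k+1} + ... + RV_t) / k
agg <- function(x, k) as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

d <- data.frame(y    = rv[23:300],          # RV_{t+1}
                rv1  = rv[22:299],          # RV_t
                rv5  = agg(rv, 5)[22:299],  # RV_{t-5,t}
                rv22 = agg(rv, 22)[22:299]) # RV_{t-22,t}
fit <- lm(y ~ rv1 + rv5 + rv22, data = d)
coef(fit)  # beta0, beta_D, beta_W, beta_M
```

On this white-noise toy series the coefficients carry no economic meaning; the point is only the regression structure of the model.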
In the presence of jumps in the intraday price process, the traditional realized volatility measures are dominated by the contribution of the jumps to volatility. Therefore, robust measures of volatility, such as the Realized Bipower Variation BPV_{t}, were developed to estimate (roughly speaking) only the "smooth" price variation. The contribution of the jumps to the quadratic variation of the price process can then be estimated by J_{t} = max[RV_{t} − BPV_{t}, 0]. The HAR-RV-J model is then given by
RV_{t,t+h} = β_{0} + β_{D} RV_{t} + β_{W} RV_{t−5,t} + β_{M} RV_{t−22,t} + β_{J} J_{t} + є_{t,t+h}.
Andersen et al. 2007 argue that it may be desirable to treat small jumps as measurement errors, or as part of the continuous sample path variation process, associating only large values of (RV_{t} − BPV_{t}) with the jump component. Therefore, they define the test statistic
Z_{t} = (1/M)^{−1/2} × [ (RV_{t} − BPV_{t}) RV_{t}^{−1} ] / [ ( µ_{1}^{−4} + 2 µ_{1}^{−2} − 5 ) max{ 1, TQ_{t} BPV_{t}^{−2} } ]^{1/2},   (1)
where TQ_{t} = M µ_{4/3}^{−3} ∑_{j=3}^{M} |r_{t,j}|^{4/3} |r_{t,(j−1)}|^{4/3} |r_{t,(j−2)}|^{4/3}, with µ_{p} = E(|z|^{p}) for z a standard normal random variable. The "significant" jumps are identified by the realizations of Z_{t} in excess of some critical value, say Φ_{α}:
J_{t} = I[ Z_{t} > Φ_{α} ] · ( RV_{t} − BPV_{t} ),
where I[·] denotes the indicator function. To ensure that the estimated continuous sample path component variation and jump variation sum to the total realized variation, we follow Andersen et al. 2007 in estimating the former component as the residual:
C_{t} = I[ Z_{t} ≤ Φ_{α} ] RV_{t} + I[ Z_{t} > Φ_{α} ] BPV_{t}.
The HAR-RV-CJ model is then given by:
RV_{t,t+h} = β_{0} + β_{CD} C_{t} + β_{CW} C_{t−5,t} + β_{CM} C_{t−22,t} + β_{J} J_{t} + є_{t,t+h}.
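To fix ideas, the RV/BPV/jump decomposition underlying these models can be computed in a few lines of base R. The simulated returns and the injected jump below are hypothetical; within the package, these quantities are produced by rCov and rBPCov:

```r
# Base-R sketch of the jump decomposition for one day of simulated
# 5-minute returns with one injected jump.
set.seed(42)
M <- 78                       # number of intraday returns in the day
r <- rnorm(M, sd = 1e-3)      # "smooth" intraday returns
r[40] <- 0.01                 # inject a jump

RV  <- sum(r^2)                                 # realized volatility
mu1 <- sqrt(2 / pi)                             # E|z| for z ~ N(0,1)
BPV <- mu1^(-2) * sum(abs(r[-1]) * abs(r[-M]))  # realized bipower variation
J   <- max(RV - BPV, 0)                         # jump contribution J_t
```

Because the jump inflates RV much more than the bipower products with its neighbors, J picks up (most of) the injected jump's squared size here.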
The harModel function in highfrequency takes as data input an xts object containing the intraday returns. You can specify the type of HAR model that you want to estimate by setting the type argument. By default, the most basic model is chosen, type='HARRV'. Other valid options are type='HARRVJ' and type='HARRVCJ' (see above for a discussion of these models).
Based on the intraday returns, daily realized measures are calculated. With the argument RVest, you can set which volatility estimators should be used to estimate both the daily integrated variance (non-jump-robust) and the continuous component of volatility. By default, total daily volatility is measured by Realized Volatility and the continuous component is measured by the Realized Bipower Variation, hence RVest=c(rCov, rBPCov), but users can set the estimators to their liking (see Section 4 for a discussion of the implemented realized volatility measures).
The daily volatility measures are subsequently aggregated over different time horizons. Use the arguments periods and periodsJ to specify the horizons over which the continuous/total volatility and the jump component should be aggregated. Typically, the daily, weekly and monthly frequencies are used, i.e. one, five and twenty-two days, which translates into the default settings periods=c(1,5,22) and periodsJ=c(1,5,22).
Corsi and Reno 2012 propose to further extend the HAR model with a leverage component (the Leverage Heterogeneous AutoRegressive model). This takes into account that negative returns affect future volatility differently than positive returns. By default, the leverage effect is not taken into account and leverage=NULL. In case you want to mimic the analysis in Corsi and Reno 2012, with leverage components aggregated at the daily, weekly and monthly frequencies, just set leverage=c(1,5,22).
In the HAR-RV-CJ model, a test determines whether the contribution of the jumps is statistically significant. The argument jumptest should be set to the name of the function that tests whether the jump variability is significant on a given day. By default jumptest='ABDJumptest', i.e. the test statistic in Equation (1) (or Equation (18) of Andersen et al. 2007). The argument alpha sets the confidence level used in testing for jumps, the traditional 5% by default.
Often, the objective will be to model the volatility for the next day, but this needn't be the case. Use the argument h to set the time horizon over which you want to aggregate the dependent variable, e.g. h=5 in case you are interested in modeling the weekly realized volatility.
To conclude, we note that in applications of the HAR model, the dependent and explanatory variables are sometimes transformed by taking the logarithm or the square root. Set the argument transform='log' or transform='sqrt' (or any other function) to transform all variables right before the linear model is estimated. By default, no transformation is done and transform=NULL.
Fitting the HAR-RV model on the Dow Jones Industrial Average in 2008
As a first step, we load the daily realized volatility for the Dow Jones Industrial Average and select only 2008. The sample dataset realized_library in highfrequency contains a very rich set of realized volatility measures computed by Heber et al. 2009. highfrequency contains only a subset of the full realized library, which can be found on their website: http://realized.oxford-man.ox.ac.uk.
> data(realized_library)  # Get sample daily Realized Volatility data
> DJI_RV = realized_library$Dow.Jones.Industrials.Realized.Variance  # Select DJI
> DJI_RV = DJI_RV[!is.na(DJI_RV)]  # Remove NA's
> DJI_RV = DJI_RV['2008']
As a second step, we estimate the traditional Heterogeneous AutoRegressive model (HAR). The output of the model is an S3 object. Since the HAR model is simply a special type of linear model, it is also implemented that way: the output of the harModel function is of the subclass harModel of lm, the standard class for linear models.
Figure 2 plots the output object of the harModel function, with time on the horizontal axis and the observed and forecasted realized volatility on the vertical axis (this analysis is in-sample, but the estimated coefficients of the model can of course be used for real out-of-sample forecasts). It is clear from a visual inspection of Figure 2 that one of the features of the harModel is that it can adapt relatively fast to changes in the volatility level, which explains its popularity in recent years.
> x = harModel(data = DJI_RV, periods = c(1,5,22), RVest = c("rCov"),
+              type = "HARRV", h = 1, transform = NULL)
> class(x)
[1] "harModel" "lm"
> x
Model:
RV1 = beta0 + beta1 * RV1 + beta2 * RV5 + beta3 * RV22

Coefficients:
    beta0      beta1      beta2      beta3
4.432e-05  1.586e-01  6.213e-01  8.721e-02

r.squared  adj.r.squared
   0.4679         0.4608

> summary(x)
Call: "RV1 = beta0 + beta1 * RV1 + beta2 * RV5 + beta3 * RV22"

Residuals:
       Min         1Q     Median         3Q        Max
-0.0017683 -0.0000626 -0.0000427  0.0000087  0.0044331

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
beta0 4.432e-05  3.695e-05   1.200   0.2315
beta1 1.586e-01  8.089e-02   1.960   0.0512 .
beta2 6.213e-01  1.362e-01   4.560 8.36e-06 ***
beta3 8.721e-02  1.217e-01   0.716   0.4745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0004344 on 227 degrees of freedom
Multiple R-squared: 0.4679, Adjusted R-squared: 0.4608
F-statistic: 66.53 on 3 and 227 DF, p-value: < 2.2e-16

> plot(x)
Fitting the HAR-RV-CJ model on a small example dataset
Using highfrequency, it is also very easy to estimate more sophisticated versions of the HAR model. For example, the HAR-RV-CJ model discussed in Andersen et al. 2007 can be estimated on a small example dataset as follows:
> data("sample_5minprices_jumps")
> data = sample_5minprices_jumps[,1]
> data = makeReturns(data)  # Get the high-frequency return data
> x = harModel(data, periods = c(1,5,10), periodsJ = c(1,5,10),
+              RVest = c("rCov","rBPCov"), type = "HARRVCJ",
+              transform = "sqrt")
> x
Model:
sqrt(RV1) = beta0 + beta1 * sqrt(C1) + beta2 * sqrt(C5) + beta3 * sqrt(C10)
          + beta4 * sqrt(J1) + beta5 * sqrt(J5) + beta6 * sqrt(J10)

Coefficients:
 beta0   beta1    beta2    beta3   beta4   beta5   beta6
0.8835  1.1957  25.1922  38.9909  0.4483  0.8084  6.8305

r.squared  adj.r.squared
   0.9915         0.9661
This last example shows how easily a special type of HAR model can be estimated with solely the intraday returns as input. Note that it serves a purely didactic purpose: the high R-squared is explained by the small number of observations (i.e. days) in the dataset.
Shephard and Sheppard 2010 introduced the HEAVY model (High frEquency bAsed VolatilitY) to leverage high-frequency data for volatility modeling. Essentially, the model boils down to two equations: Equation (2) models the close-to-close conditional variance, while Equation (3) models the conditional expectation of the open-to-close variation. Denote by F_{t−1}^{HF} the past high-frequency data (i.e. all intraday returns) and by F_{t−1}^{LF} the past low-frequency data (i.e. daily returns). The HEAVY model is thus based on F_{t−1}^{HF}, in contrast to traditional GARCH models that use only F_{t−1}^{LF}.
Denote the daily returns:
r_{1},r_{2},…,r_{T}. 
Based on intraday data, one can calculate the daily realized measures:
RM_{1}, RM_{2}, …, RM_{T}, 
with T the total number of days in the sample. The HEAVY model is given by:

var(r_t | F_{t−1}^{HF}) = h_t = ω + α RM_{t−1} + β h_{t−1},  (2)

with ω, α ≥ 0 and β ∈ [0,1], and

E[RM_t | F_{t−1}^{HF}] = μ_t = ω_R + α_R RM_{t−1} + β_R μ_{t−1},  (3)

with ω_R, α_R ≥ 0 and α_R + β_R ∈ [0,1].
Equation (2) models the close-to-close conditional variance, and Equation (3) models the conditional expectation of the open-to-close variation. In matrix notation, the model becomes:

[ h_t ]   [ ω   ]   [ 0  α   ] [ r_{t−1}^2 ]   [ β  0   ] [ h_{t−1} ]
[ μ_t ] = [ ω_R ] + [ 0  α_R ] [ RM_{t−1}  ] + [ 0  β_R ] [ μ_{t−1} ]   (4)
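Once the parameters are known, the recursions in Equations (2) and (3) are straightforward to iterate forward day by day. A minimal sketch (the function name, the simulated series and the parameter values are our own illustrative choices, not estimates):

```r
# Iterate the two HEAVY recursions for given parameters (illustrative values).
heavy_filter <- function(r2, rm, omega, alpha, beta, omegaR, alphaR, betaR) {
  T  <- length(r2)
  h  <- numeric(T)    # conditional close-to-close variance, Eq. (2)
  mu <- numeric(T)    # conditional expectation of the realized measure, Eq. (3)
  h[1]  <- mean(r2)   # backcast with unconditional estimates
  mu[1] <- mean(rm)
  for (t in 2:T) {
    h[t]  <- omega  + alpha  * rm[t - 1] + beta  * h[t - 1]
    mu[t] <- omegaR + alphaR * rm[t - 1] + betaR * mu[t - 1]
  }
  list(condvar = h, condmean_rm = mu)
}

set.seed(1)
rm  <- rexp(100, 1e4)                 # toy daily realized measures
r2  <- rm * rlnorm(100, 0, 0.1)       # toy squared daily returns
out <- heavy_filter(r2, rm, 1e-6, 0.45, 0.74, 1e-6, 0.41, 0.56)
```

Note that only the lagged realized measure RM_{t−1}, not the lagged squared return, drives both recursions; this is exactly the "high-frequency based" feature of the model.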
The function heavyModel
in the highfrequency
package implements the above model in a very general way.
It enables the user to obtain estimates of the parameters (ω, ω_R, α, α_R, β, β_R) using quasi-maximum likelihood.^{4}
The heavyModel
^{5} takes a (T × K) matrix containing the data as input, with T the number of days. For the traditional HEAVY model K = 2: the first column contains the squared demeaned daily returns, i.e. r_{1}^{2}, …, r_{T}^{2}, and the second column contains the realized measures, i.e. RM_{1}, …, RM_{T}.
As can be seen in Equation (4), the matrix notation of the HEAVY model contains two parameter matrices, named α and β respectively. This matrix structure makes clear that the traditional HEAVY model in (4) is just a special case of a larger set of models that can be written this way. The arguments p and q of the heavyModel function specify, for each of these two matrices, (i) which parameters in the matrix should be estimated, by setting a non-zero integer in the corresponding matrix cell, and (ii) how many lags should be included in the model. More formally, p is a (K × K) matrix containing the lag lengths for the model innovations. Position (i,j) in the matrix indicates the number of lags in equation i of the model for the innovations in data column j. For the traditional HEAVY model introduced above, p is given by (cf. the matrix with the αs in Equation (4)):
        [ 0  1 ]
    p = [ 0  1 ] .
Similarly to the matrix p, the matrix q is a (K × K) matrix containing the lag lengths for the conditional variances. Position (i,j) in the matrix indicates the number of lags in equation i of the model for the conditional variances corresponding to series j. For the traditional HEAVY model introduced above, q is given by (cf. the matrix with the βs in Equation (4)):
        [ 1  0 ]
    q = [ 0  1 ] .
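In R, the two lag-length matrices for the traditional two-equation HEAVY model can thus be set up as follows (a short sketch; note that matrix() fills column by column):

```r
# Lag-length matrices for the traditional HEAVY model.
# Row i = equation i; column j = data column j
# (column 1: squared returns, column 2: realized measure).
p <- matrix(c(0, 0,    # column 1: no lagged squared returns in either equation
              1, 1),   # column 2: one lag of the realized measure in both
            ncol = 2)
q <- diag(2)           # one own lag of the conditional variance per equation
```

Setting other cells to non-zero integers extends the model, e.g. letting lagged squared returns also enter the second equation.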
The starting values used in the optimization can be set by the argument startingvalues. The arguments LB and UB set the lower and upper bounds for the parameters. By default, the lower bound is set to zero and the upper bound to infinity. To specify how the estimation should be initialized, you can use the backcast argument, which defaults to the unconditional estimates. Finally, the compconst argument is a boolean indicating how the ωs are obtained. If TRUE, the ωs are estimated in the optimization; if FALSE, volatility targeting is applied and each ω equals one minus the sum of the relevant αs and βs, multiplied by the unconditional variance.
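The volatility-targeting computation of ω just described can be written as a one-line function (a sketch of the formula as stated above; the function and variable names are our own):

```r
# Volatility targeting: back out the intercept from the unconditional
# variance instead of estimating it (the compconst = FALSE case above).
target_omega <- function(alphas, betas, uncond_var) {
  (1 - sum(alphas) - sum(betas)) * uncond_var
}
target_omega(alphas = 0.05, betas = 0.90, uncond_var = 1e-4)  # approx 5e-06
```

This reduces the number of free parameters in the optimization and guarantees that the model reproduces the sample's unconditional variance.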
The output of the heavyModel is a list. Most interestingly, the list item estparams contains a matrix with the parameter estimates and their names. The parameters are reported in the following order: first the estimates for ω, then the estimates for the non-zero αs (most recent lags first in case max(p) > 1), and then the estimates for the non-zero βs (most recent lag first in case max(q) > 1).
Whereas the loglikelihood list item reports the total log likelihood, we offer users a bit more insight with the list item likelihoods, which contains an xts object with the daily log likelihood evaluated at the parameters.
The list item convergence provides more information on the convergence of the optimization; see optim for details.
The list item condvar is a (T × K) xts object containing the conditional variances. For the traditional HEAVY model, for example, Figure 3 plots the conditional close-to-close variance for the DJIA from 1996 up to 2009 (see the next subsection for more information).
Fitting the HEAVY model on the Dow Jones Industrial Average from 1996 up to 2009
As a first step, we load the realized_library
(Heber et al. 2009), which contains information on the Dow Jones Industrial Average.
We then select from this library the daily returns and the daily Realized Kernel estimates (Barndorff-Nielsen et al. 2004).
The data matrix that serves as an input for the heavyModel
now has the returns as the first column and the Realized Kernel estimates in the second column. We further set the argument backcast
to the variance of the daily returns and the average realized kernel over the sample period. We now have all ingredients to estimate the HEAVY model. Based on the output of the model, Figure 3 plots the conditional open-to-close variance, as estimated by the second equation of the model. Note the striking peak in 2008, at the start of the financial crisis.
> # Implementation of the heavy model on DJIA:
> data("realized_library");
> returns = realized_library$Dow.Jones.Industrials.Returns;
> rk      = realized_library$Dow.Jones.Industrials.Realized.Kernel;
> returns = returns[!is.na(rk)]; rk = rk[!is.na(rk)]; # Remove NA's
> data     = cbind(returns^2, rk);
> backcast = matrix(c(var(returns), mean(rk)), ncol = 1);
>
> startvalues = c(0.004, 0.02, 0.44, 0.41, 0.74, 0.56); # Initial values
> output = heavyModel(data = as.matrix(data, ncol = 2), compconst = FALSE,
+            startingvalues = startvalues, backcast = backcast);
> output$estparams
             [,1]
omega1 0.01750506
omega2 0.06182249
alpha1 0.45118753
alpha2 0.41204541
beta1  0.73834594
beta2  0.56367558
Trades and quotes are often supplied as separate data objects.
For many research and practical questions related to transaction data, one needs to merge trades and quotes.
Since trades and quotes can be subject to different reporting lags, this is not a straightforward operation (Lee and Ready 1991).
The function matchTradesQuotes
can be used for matching trades and quotes.
One should supply the number of seconds by which quotes are registered faster than trades.
Based on the research of Vergote 2005, the default is set to 2 seconds.
Many trades and quotes databases do not indicate whether individual trades are market buy or market sell orders.
highfrequency
implements with getTradeDirection
the Lee-Ready rule (Lee and Ready 1991)
to infer the trade direction based on the matched trades and quotes.
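The quote-based part of the Lee-Ready rule is simple to state: trades above the prevailing midquote are classified as buyer-initiated, trades below as seller-initiated, and trades at the midquote fall back on a tick test against the previous trade price. A simplified sketch of this logic (our own toy code and data, not the getTradeDirection implementation, which also handles the zero-tick refinements):

```r
# Classify trade direction with a simplified Lee-Ready rule:
# +1 = buyer-initiated, -1 = seller-initiated.
lee_ready <- function(price, bid, ofr) {
  mid <- (bid + ofr) / 2
  dir <- ifelse(price > mid, 1L, ifelse(price < mid, -1L, NA_integer_))
  # Tick test for trades at the midquote: compare with the previous price
  for (t in which(is.na(dir))) {
    prev   <- if (t > 1) price[t - 1] else NA
    dir[t] <- if (!is.na(prev) && price[t] >= prev) 1L else -1L
  }
  dir
}

price <- c(10.1, 10.4, 10.25, 10.0)
bid   <- rep(10.0, 4)
ofr   <- rep(10.5, 4)
lee_ready(price, bid, ofr)   # -1  1 -1 -1
```

The third trade sits exactly at the midquote of 10.25, so its direction is resolved by the tick test against the preceding price of 10.4.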
Numerous liquidity measures can be calculated based on matched trade and quote data, using the function tqLiquidity
(see Bessembinder 2003, Boehmer 2005, Hasbrouck and Seppi 2001 and Venkataraman 2001 for more information on the implemented liquidity measures).
The main implemented liquidity measures are listed in Table 3 and can be used as arguments of the function tqLiquidity
.
Argument(s)                        Liquidity measure
es, rs                             effective and realized spread
value_trade                        trade value (price × size)
signed_value_trade                 signed trade value
signed_trade_size                  signed trade size
di_diff, di_div                    depth imbalance
pes, prs                           proportional effective and realized spread
price_impact, prop_price_impact    price impact and proportional price impact
tspread, pts                       half traded and proportional half-traded spread
qs, logqs                          quoted spread and log quoted spread
qsslope, logqslope                 quoted slope and log quoted slope
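To make the table concrete: the quoted spread is the difference between the best ask and bid, while the effective spread compares the trade price with the prevailing midquote, signed by the trade direction. A minimal sketch of these two measures on toy data (our own illustrative code, not the tqLiquidity implementation):

```r
# Quoted spread and effective spread for matched trade/quote data.
quoted_spread <- function(bid, ofr) ofr - bid
effective_spread <- function(price, bid, ofr, direction) {
  mid <- (bid + ofr) / 2
  2 * direction * (price - mid)   # standard definition of the effective spread
}

price <- c(10.5, 10.0)            # a buy at the ask, a sell at the bid
bid   <- c(10.0, 10.0); ofr <- c(10.5, 10.5)
dir   <- c(1, -1)
quoted_spread(bid, ofr)           # 0.5 0.5
effective_spread(price, bid, ofr, dir)   # 0.5 0.5
```

Here both trades execute exactly at the quotes, so the effective spread coincides with the quoted spread; trades inside the quotes would yield a smaller effective spread.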
The example below illustrates how to: (i) match trades and quotes, (ii) get the trade direction and (iii) calculate liquidity measures.
> # Load data samples
> data("sample_tdata");
> data("sample_qdata");
> # Match the trade and quote data
> tqdata = matchTradesQuotes(sample_tdata, sample_qdata);
> # Display information in tqdata
> colnames(tqdata)[1:6];
[1] "SYMBOL" "EX"     "PRICE"  "SIZE"   "COND"   "CORR"
> colnames(tqdata)[7:12];
[1] "G127"   "BID"    "BIDSIZ" "OFR"    "OFRSIZ" "MODE"
> # Get the inferred trade direction according to the Lee-Ready rule
> x = getTradeDirection(tqdata);
> # Calculate the proportional realized spread:
> prs = tqLiquidity(tqdata, sample_tdata, sample_qdata, type = "prs");
> # Calculate the effective spread:
> es = tqLiquidity(tqdata, type = "es");
We thank Google for the financial support through the Google Summer of Code 2012 program and Brian G. Peterson for his mentorship. Furthermore, we thank the participants of the 1st R/Rmetrics Summer School and 4th User/Developer Meeting on Computational Finance and Financial Engineering, Switzerland (2010), and the "R/Finance: Applied finance with R" conference, Chicago, USA (2010), for their comments and suggestions. In particular, we thank Christophe Croux, Brian Peterson and Eric Zivot for their insights and help in developing this package. Financial support from the Flemish IWT (Institute for Science and Innovation), the National Bank of Belgium and the Research Fund K.U.Leuven is gratefully acknowledged.