The highfrequency package contains a toolkit for the analysis of high-frequency financial data in R. The package is an updated and extended version of the RTAQ and realized packages.




1  General information

The economic value of analyzing high-frequency financial data is now obvious, both in the academic and in the financial world. Such data form the basis of intraday and daily risk monitoring and forecasting, are an input to the portfolio allocation process, and are essential for high-frequency trading. To support the efficient use of high-frequency data in financial decision making, the highfrequency package implements state-of-the-art techniques for the cleaning and matching of trades and quotes, as well as the calculation and forecasting of liquidity and realized volatility and correlation measures based on high-frequency data. The highfrequency package is the outcome of a Google Summer of Code project and an improved and updated version of two earlier packages: RTAQ (Cornelissen and Boudt 2012) and realized (Payseur 2008).

Handling high-frequency data can be particularly challenging because of the specific characteristics of the data, as extensively documented in Yan and Zivot 2003. The highfrequency package offers high-level tools for the analysis of high-frequency data that tackle three typical challenges. First, the number of observations is enormous and can run into millions of observations per stock per day. Second, transaction-by-transaction data is by nature irregularly spaced over time. Third, the recorded data often contains errors, for various reasons.

The highfrequency package offers a (basic) interface to manage high-frequency trades and quotes data. First and foremost, it offers easy-to-use tools to clean and aggregate high-frequency data, to calculate liquidity measures, and to measure and forecast volatility. The tools in highfrequency are designed to work with most data input types. The package contains an extensive collection of functions to calculate realized measures, since it arose from merging the package RTAQ (Cornelissen and Boudt 2012) from the TradeAnalytics project and the package realized (Payseur 2008).

highfrequency builds strongly on the functionality offered by the xts package (Ryan and Ulrich 2009). You can use the functionality in the highfrequency package for trades and quotes data from the most popular vendors, as long as your data are stored as xts objects.

A stable version of the highfrequency package has been published on CRAN and can be installed with the following command:

install.packages("highfrequency")

The latest development version of the highfrequency package is hosted on GitHub and can be installed with the following commands:

library("devtools"); install_github("highfrequency","jonathancornelissen");

see the highfrequency GitHub repository. In a couple of months, I'll upload an interactive R tutorial for the highfrequency package on the DataMind platform. Furthermore, the full highfrequency manual and documentation can be read online.


2  Organizing highfrequency data

Users who have their own system set up to store their high-frequency data as xts objects can skip this section. The functions in the highfrequency package to calculate liquidity measures, realized measures, etc. can take high-frequency data from various data vendors as input: NYSE TAQ, WRDS TAQ, Reuters, Bloomberg, etc. To achieve this flexibility, we rely on the functionality provided by the excellent quantmod package (Ryan 2011). Additionally, highfrequency provides functionality to convert raw data from several data providers into xts objects. The latter is discussed and illustrated in this section.

High-frequency financial data is typically subdivided into two files, containing information on the trades and the quotes respectively. highfrequency can currently parse three types of input data into xts objects: (i) ".txt" files extracted from the NYSE TAQ database, (ii) ".csv" files extracted from the WRDS database, (iii) ".asc" files from http://www.tickdata.com. For each of these input data types, we briefly discuss how the convert function can be used to convert the txt/csv/asc files into xts objects (Ryan and Ulrich 2009) that are saved on disk in the "RData" format. We opt to store the data as xts objects because this type of object can be indexed by an indicator of time and date.1
Parsing raw NYSE TAQ data into xts:
Suppose the folder ~/raw_data on your drive contains the two folders: "2008-01-02" and "2008-01-03". Suppose these folders each contain the files AAPL_trades.txt and AA_trades.txt with the daily TAQ data bought from the New York Stock Exchange (NYSE) for the AAPL and AA stock. The raw data can then easily be converted into on-disk xts objects as follows:

from = "2008-01-02";
to = "2008-01-03";
datasource = "~/raw_data";
datadestination = "~/xts_data";

convert(from, to, datasource, datadestination, trades=TRUE,
          quotes=FALSE,ticker=c("AA","AAPL"), dir=TRUE, extension="txt",
          header=FALSE,tradecolnames=NULL,quotecolnames=NULL,
          format="%Y%m%d %H:%M:%S");

At this point, the folder ~/xts_data will contain two folders named 2008-01-02 and 2008-01-03, each containing the files AA_trades.RData and AAPL_trades.RData with the respective daily trade data. The data can now be loaded with the TAQLoad function (see below).
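For instance, the AAPL trade data could be loaded along these lines (a minimal sketch mirroring the TAQLoad calls shown further below, with datadestination as defined above):

xts_data = TAQLoad(tickers="AAPL", from="2008-01-02", to="2008-01-03",
                   trades=TRUE, quotes=FALSE, datasource=datadestination);
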
Parsing raw WRDS TAQ data into xts:
Suppose the folder ~/raw_data on your drive contains the files IBM_quotes.csv and
IBM_trades.csv acquired from Wharton Research Data Service (WRDS).2 These files contain the quote and trade data, respectively, for the "IBM" stock for "2011-12-01" up to "2011-12-02". The data can then easily be converted into on-disk xts objects as follows:

from = "2011-12-01"; 
to = "2011-12-02"; 
datasource = "~/raw_data";
datadestination = "~/xts_data";
convert( from=from, to=to, datasource=datasource, 
              datadestination=datadestination, trades = T,  quotes = T, 
              ticker="IBM", dir = TRUE, extension = "csv", 
              header = TRUE, tradecolnames = NULL, quotecolnames = NULL, 
              format="%Y%m%d %H:%M:%S", onefile = TRUE )

Now, the folder ~/xts_data contains two folders, one for each day in the sample, i.e. "2011-12-01" and "2011-12-02". Each of these folders contains a file named IBM_trades.RData with the trades for that day and a file IBM_quotes.RData with the quotes for that day, both saved as xts objects. As a final step, the function TAQLoad can be used to load the on-disk data into your R workspace. You can for example get the quotes for this sample in your workspace as follows:

>   xts_data = TAQLoad( tickers="IBM", from="2011-12-01",
+              to="2011-12-02",trades=F,
+              quotes=TRUE, datasource=datadestination)
>   head(xts_data)
                    SYMBOL EX  BID      BIDSIZ OFR      OFRSIZ MODE
2011-12-01 04:00:00 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:17 "IBM"  "P" "185.92" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:18 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:25 "IBM"  "P" "176.85" "1"    "187.73" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"


Parsing raw Tickdata.com data into xts:
Suppose the folder ~/raw_data on your drive contains the files GLP_quotes.asc and
GLP_trades.asc acquired from www.tickdata.com. These files contain the quote and trade data, respectively, for the "GLP" stock for "2011-01-11" up to "2011-03-11". The data can then easily be converted into on-disk xts objects as follows:

from = "2011-01-11"; 
to     = "2011-03-11"; 
datasource = "~/raw_data"; 
datadestination = "~/xts_data";
convert(from=from, to=to, datasource=datasource, 
             datadestination=datadestination, trades = TRUE, 
             quotes = TRUE, ticker="GLP", dir = TRUE, format = "%d/%m/%Y %H:%M:%OS",
             extension = "tickdatacom", header = TRUE,  onefile = TRUE );

At this point, the folder ~/xts_data contains three folders, one for each day in the sample, i.e. "2011-01-11", "2011-02-11" and "2011-03-11". Each of these folders contains a file named GLP_trades.RData with the trades for that day and a file GLP_quotes.RData with the quotes for that day, both saved as xts objects. Notice the
format = '%d/%m/%Y %H:%M:%OS' argument, which allows for timestamps more accurate than seconds, e.g. milliseconds in this case. As a final step, the function TAQLoad can be used to load the on-disk data into your R workspace. You can for example get the trade data for the first day of the sample as follows:

>  options("digits.secs"=3); #Show milliseconds
>  xts_data = TAQLoad(tickers="GLP", from="2011-01-11", to="2011-01-11",
+             trades=T, quotes=F, 
+             datasource=datadestination)
>  head(xts_data)
                        SYMBOL EX  PRICE     SIZE   COND  CORR G127
2011-01-11 09:30:00.338 "GLP"  "T" "18.0700" " 500" "O X" "0"  ""  
2011-01-11 09:30:00.338 "GLP"  "T" "18.0700" " 500" "Q"   "0"  ""  
2011-01-11 09:33:49.342 "GLP"  "T" "18.5000" " 150" "F"   "0"  ""  
2011-01-11 09:39:29.280 "GLP"  "N" "19.2000" "4924" "O"   "0"  ""  
2011-01-11 09:39:29.348 "GLP"  "D" "19.2400" " 500" "@"   "0"  "T" 
2011-01-11 09:39:29.411 "GLP"  "N" "19.2400" " 200" "F"   "0"  "" 


Typical trades and quotes data description:
Both trade and quote data should thus be xts objects (Ryan and Ulrich 2009) if you want to use the functionality of highfrequency. Use the code below to see how trade and quote data objects are typically stored.

library(highfrequency);
data("sample_tdataraw");
head(sample_tdataraw);
data("sample_qdataraw");
head(sample_qdataraw);

3  Manipulation of highfrequency data

In this section we discuss two very common first steps in the manipulation of high-frequency financial data: (i) the cleaning and (ii) the aggregation of the data.

3.1  Cleaning of highfrequency data

For various reasons, raw trade and quote data contain numerous errors. The data is therefore not suited for analysis right away, and data cleaning is an essential step in dealing with tick-by-tick data (Brownlees and Gallo 2006). highfrequency implements the step-by-step cleaning procedure proposed by Barndorff-Nielsen et al. (2008). Table 1 provides an overview of the cleaning functions. A user can either use a specific cleaning procedure or a wrapper function that performs multiple cleaning steps. The wrapper functions offer on-disk functionality: they load on-disk raw data and save the cleaned data back to your hard disk. This approach is advisable in case you have multiple days of data to clean. Of course, the data on your disk should be organized as discussed in the previous section to benefit from this functionality. More specifically, these functions expect that the data is saved as xts objects in a folder for each day (the structure the TAQLoad function also expects), which will always be the case if you used the convert function to convert the raw data into xts objects.

To maintain some insight in the cleaning process, the functions tradesCleanup and quotesCleanup report the total number of remaining observations after each cleaning step:

> data("sample_tdataraw");
> dim(sample_tdataraw);
[1] 48484     7
> tdata_afterfirstcleaning = tradesCleanup(tdataraw=sample_tdataraw,exchanges="N");
> tdata_afterfirstcleaning$report;
      initial number       no zero prices      select exchange 
               48484                48479                20795 
     sales condition merge same timestamp 
               20135                 9105 
> dim(tdata_afterfirstcleaning$tdata)
[1] 9105    7
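
The same in-memory workflow should apply to the quote data; the sketch below assumes that quotesCleanup mirrors the interface above, taking a qdataraw argument and returning a list with a report and the cleaned qdata:

> data("sample_qdataraw");
> qdata_aftercleaning = quotesCleanup(qdataraw=sample_qdataraw, exchanges="N");
> qdata_aftercleaning$report;    #number of remaining observations after each step
> dim(qdata_aftercleaning$qdata);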

Table 1: Cleaning functions

Function                    Description
All data:
ExchangeHoursOnly           Restrict data to exchange hours
selectexchange              Restrict data to a specific exchange
Trade data:
noZeroPrices                Delete entries with zero prices
autoSelectExchangeTrades    Restrict data to the exchange with the highest trade volume
salesCondition              Delete entries with abnormal sale condition
mergeTradesSameTimestamp    Delete entries with the same time stamp and use the median price
rmTradeOutliers             Delete entries with prices above/below the ask/bid plus/minus the bid-ask spread
Quote data:
noZeroQuotes                Delete entries with zero quotes
autoSelectExchangeQuotes    Restrict data to the exchange with the highest bid size + offer size
mergeQuotesSameTimestamp    Delete entries with the same time stamp and use the median quotes
rmNegativeSpread            Delete entries with negative spreads
rmLargeSpread               Delete entries if the spread is larger than maxi times the median daily spread
rmOutliers                  Delete entries for which the mid-quote is outlying with respect to
                            surrounding entries
Wrapper cleanup functions (perform the following steps sequentially for on-disk data):
tradesCleanup               noZeroPrices, selectExchange, salesCondition, mergeTradesSameTimestamp
quotesCleanup               noZeroQuotes, selectExchange, rmLargeSpread, mergeQuotesSameTimestamp,
                            rmOutliers
tradesCleanupFinal          rmTradeOutliers (based on cleaned quote data as well)

3.2  Aggregation of highfrequency data

Prices are typically not recorded at equispaced time points, while e.g. many realized volatility measures rely on equispaced returns. Furthermore, prices are often observed at different points in time for different assets, while e.g. most multivariate realized volatility estimators rely on synchronized data (see e.g. Table 2). There are several ways to force these asynchronously and/or irregularly recorded series onto a synchronized and/or equispaced time grid.

The most popular method, previous tick aggregation, forces prices to an equispaced grid by taking the last price realized before each grid point. highfrequency provides users with the aggregatets function for fast and easy previous tick aggregation:

> library("highfrequency");
> # Load sample price data
> data("sample_tdata");
> ts = sample_tdata$PRICE;
> # Previous tick aggregation to the 5-minute sampling frequency:
> tsagg5min = aggregatets(ts,on="minutes",k=5);
> head(tsagg5min);
                      PRICE
2008-01-04 09:35:00 193.920
2008-01-04 09:40:00 194.630
2008-01-04 09:45:00 193.520
2008-01-04 09:50:00 192.850
2008-01-04 09:55:00 190.795
2008-01-04 10:00:00 190.420
> # Previous tick aggregation to the 30-second sampling frequency:
> tsagg30sec = aggregatets(ts,on="seconds",k=30);
> tail(tsagg30sec);
                      PRICE
2008-01-04 15:57:30 191.790
2008-01-04 15:58:00 191.740
2008-01-04 15:58:30 191.760
2008-01-04 15:59:00 191.470
2008-01-04 15:59:30 191.825
2008-01-04 16:00:00 191.670

In the example above the prices are forced to a regular equispaced time grid of 5 minutes and 30 seconds respectively. Furthermore, the aggregatets function is built into all realized measures (see Section 4) and can be invoked by setting the arguments align.by and align.period. In that case, the price(s) are first forced to an equispaced regular time grid, and the realized measure is then calculated based on the returns over these regular time periods. This has the advantage that the user can input the original price series into the realized measures without having to worry about the asynchronicity or the irregularity of the price series.
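
For instance, a realized variance over 5-minute returns can be computed directly from the irregularly spaced price series ts used above (a minimal sketch; rCov and the other realized measures are discussed in Section 4):

> # Prices are first aligned to a 5-minute grid, then converted into returns:
> rv5min = rCov(ts, align.by="minutes", align.period=5, makeReturns=TRUE);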

Another synchronization method (not integrated into the realized measures) is refresh time, initially proposed by Harris and Wood (1995) and recently advocated in Barndorff-Nielsen et al. (2011). The function refreshTime in highfrequency can be used to force time series (very fast) to a synchronized but not necessarily equispaced time grid. The so-called refresh times are the time points at which all assets have traded at least once since the last refresh point. More specifically, the first refresh time corresponds to the first time at which all stocks have traded. The subsequent refresh time is defined as the first time when all stocks have been traded again. This process is repeated until the end of one time series is reached.

Illustration on price aggregation with refresh time and subsequent volatility calculation:

> data("sample_tdata");
> data("sample_qdata");
> #We assume that stock1 and stock2 contain price data on imaginary stocks:
> stock1 = sample_tdata$PRICE;
> stock2 = sample_qdata$BID;
> #Previous-tick aggregation to one minute:
> mPrice_1min = cbind(aggregatePrice(stock1),aggregatePrice(stock2));
> #Refresh time aggregation:
> mPrice_Refresh = refreshTime(list(stock1,stock2));
> #Calculate jump robust volatility measures 
> #based on synchronized data:
> rbpcov1 = rBPCov(mPrice_1min,makeReturns=TRUE);
> rbpcov2 = rBPCov(mPrice_Refresh,makeReturns=TRUE);
> #Calculate a jump and microstructure noise robust volatility measure
> #based on nonsynchronous data:
> rtscov = rTSCov(list(stock1,stock2));

4  Realized volatility measures

The availability of high-frequency data has enabled researchers to estimate the ex post realized volatility based on squared intraday returns (Andersen et al. 2003). In practice, the main challenges in univariate volatility estimation are dealing with (i) jumps in the price level and (ii) microstructure noise. Multivariate volatility estimation is additionally challenging because of (i) the asynchronicity of observations between assets and (ii) the need for a positive semidefinite covariance matrix estimator. The highfrequency package implements many recently proposed realized volatility and covolatility measures.3

An overview of the univariate and multivariate volatility estimators implemented in
highfrequency is given in Table 2. The first two columns indicate whether the estimator can be applied to univariate or multivariate price series. The following two columns indicate whether the estimator is robust with respect to jumps and to microstructure noise respectively. The next column reports whether asynchronous price series can/should be used as input. The last column indicates whether the estimator always yields a positive semidefinite matrix in the multivariate case. All realized measures take the price or return series as input; most also accept the align.by and align.period arguments for prior previous tick aggregation and a makeReturns argument indicating that prices should first be converted into returns (see the example below Table 2).


Table 2: Overview of volatility estimators

Estimator                                      Univariate  Multivariate  Jump    Microstructure  Tick-by-tick       Positive
                                                                         robust  noise robust    returns as input   semidefinite
medRV (Andersen et al. 2012)                       x                       x                                        /
minRV (Andersen et al. 2012)                       x                       x                                        /
rCov (Andersen et al. 2003)                        x           x                                                    x
rBPCov (Barndorff-Nielsen and Shephard 2004)       x           x           x
rOWCov (Boudt et al. 2011a)                        x           x           x                                        x
rThresholdCov (Gobbi and Mancini 2009)             x           x           x
rTSCov (Zhang 2011)                                x           x                   x               x
rRTSCov (Boudt and Zhang 2012)                     x           x           x       x               x
rAVGCov (Ait-Sahalia et al. 2005)                  x           x                   x               x                x
rKernelCov (Barndorff-Nielsen et al. 2004)         x           x                   x               x                x
rHYCov (Hayashi and Yoshida 2005)                              x                                   x
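
As a small illustration of the difference between a non-robust and a jump-robust estimator, the sketch below applies rCov and medRV to the one-day price series of the sample trade data (the align.by/align.period/makeReturns arguments of medRV are assumed to follow the same convention as for the other realized measures):

> data("sample_tdata");
> p = sample_tdata$PRICE;
> # Realized variance (not jump robust) versus medRV (jump robust), 5-minute returns:
> rv5    = rCov(p, align.by="minutes", align.period=5, makeReturns=TRUE);
> medrv5 = medRV(p, align.by="minutes", align.period=5, makeReturns=TRUE);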

Another interesting feature of high-frequency data in the context of volatility measurement is the periodicity in the volatility of high-frequency returns induced by the opening, lunch and closing of financial markets (Andersen and Bollerslev 1997). highfrequency implements both the classic intraday periodicity estimation method of Andersen and Bollerslev 1997 and a jump-robust version proposed by Boudt et al. 2011b. These estimation methods assume that the standard deviation of intraday returns (called vol in the output) can be decomposed into a daily volatility factor (constant for all intraday returns observed on the same day, called dailyvol) and an intraday factor (a deterministic function of intraday time, called periodicvol). The estimates are either non-parametric (based on a scale estimator) or parametric (based on regression estimation of a flexible specification for the intraday factor).

The sample code below illustrates the estimation of the intraday periodicity; the estimated pattern is plotted in Figure 1:

> data("sample_real5minprices");
> 
> #compute and plot intraday periodicity
> out = spotVol(sample_real5minprices,P1=6,P2=4,periodicvol="TML",k=5, 
+ dummies=FALSE);
> head(out);
                          returns         vol    dailyvol periodicvol
2005-03-04 09:35:00 -0.0010966963 0.004081072 0.001896816    2.151539
2005-03-04 09:40:00 -0.0005614217 0.003695715 0.001896816    1.948379
2005-03-04 09:45:00 -0.0026443880 0.003417950 0.001896816    1.801941

Figure 1: Estimated periodicity pattern using spotVol function.

5  Volatility forecasting

It is broadly accepted by academic researchers that, when appropriately managed, access to high-frequency data leads to a comparative advantage in predicting the volatility of future price evolution and hence in generating alpha through quantitative investment models. Fleming et al. 2003 estimated that an investor would be willing to pay 50 to 200 basis points per year to capture the gains in portfolio performance from using high-frequency returns instead of daily returns for volatility forecasting. Their findings were later confirmed in a somewhat different setting by De Pooter et al. 2008.

In this section we discuss two classes of univariate forecasting models implemented in the
highfrequency package that take high-frequency data as input: (i) the Heterogeneous Autoregressive model for Realized Volatility (HAR) discussed in Andersen et al. 2007 and Corsi 2009 and (ii) the HEAVY model (High frEquency bAsed VolatilitY) introduced in Shephard and Sheppard 2010. Roughly speaking, the goal of both models is to predict the next day's volatility based on the high-frequency price evolution prior to that day.

Although the HAR and the HEAVY model have the same goal, i.e. modeling the conditional volatility, they take a different approach. Whereas the HAR model focuses on predicting the open-to-close variation, the HEAVY model has two equations: one for the open-to-close variation and one for the close-to-close variation. The main advantages of the HAR model are that it is simple and easy to estimate (since it is essentially a special type of linear model that can be estimated by least squares), while it still manages to reproduce typical features of volatility such as long memory, fat tails, etc. The main advantages of the HEAVY model are that it models both the close-to-close conditional variance and the conditional expectation of the open-to-close variation. Furthermore, the HEAVY model has momentum and mean-reversion effects, and it adjusts quickly to structural breaks in the level of the volatility process. In contrast to the HAR model, the HEAVY model is estimated by Gaussian quasi-maximum likelihood, for each of the two equations in the model separately.

The next two sections review the HAR model and HEAVY model each in more detail, and of course discuss and illustrate how highfrequency can be used to estimate these models.

5.1  The HAR model

Corsi 2009 proposed the Heterogeneous Autoregressive (HAR) model for Realized Volatility. As a first step, intraday returns are used to measure the daily volatility, using e.g. simply the Realized Volatility and typically ignoring the overnight return. As a second step, the realized volatility is parameterized as a linear function of the lagged realized volatilities over different horizons. Typically, the realized volatility is aggregated over the daily, weekly and monthly horizons. The simplest version of the model, as proposed in Corsi 2009, is referred to as the HAR-RV model. In an interesting paper, Andersen et al. 2007 extend the model of Corsi 2009 by explicitly including the contribution of jumps to the daily Realized Volatility in the model. Their analysis suggests that the volatility originating from jumps in the price level is less persistent than the volatility originating from the continuous component of Realized Volatility, and that separating the rough jump moves from the smooth continuous moves enables better forecasting results. They propose mainly two ways to model this idea, and both are implemented in the highfrequency package. In what follows, we first discuss the different versions of the HAR model in more detail, and then we discuss and illustrate how highfrequency can be used to estimate them.

The harModel function in highfrequency implements the Heterogeneous Autoregressive model for Realized Volatility discussed in Andersen et al. 2007 and Corsi 2009. Let r_{t,i} be the i-th intraday return on day t, for i = 1, ..., M, with M the number of intraday returns per day. The realized volatility is then given by RV_t = ∑_{i=1}^{M} r_{t,i}^2. Denote by

RV_{t,t+h} = h^{-1} [ RV_{t+1} + RV_{t+2} + ... + RV_{t+h} ], 

the Realized Volatility aggregated over h days. Note that by definition RV_{t,t+1} ≡ RV_{t+1}.

5.1.1  The HARRV model

The most basic volatility model discussed in Andersen et al. 2007 and Corsi 2009 is given by:

RV_{t,t+h} = β_0 + β_D RV_t + β_W RV_{t−5,t} + β_M RV_{t−22,t} + ε_{t,t+h},

for t = 1,2,…,T. Typically h=1, and the model simplifies to

RV_{t+1} = β_0 + β_D RV_t + β_W RV_{t−5,t} + β_M RV_{t−22,t} + ε_{t+1}.

5.1.2  The HARRVJ model

In the presence of jumps in the intraday price process, the traditional realized volatility measures are dominated by the contribution of the jumps to volatility. Therefore, robust measures of volatility were developed in order to estimate only the "smooth" price variation (roughly speaking). The contribution of the jumps to the quadratic variation of the price process can then be estimated by J_t = max[RV_t − BPV_t, 0], where BPV_t denotes the (jump-robust) Realized Bipower Variation. The HAR-RV-J model is then given by

RV_{t,t+h} = β_0 + β_D RV_t + β_W RV_{t−5,t} + β_M RV_{t−22,t} + β_J J_t + ε_{t,t+h}.

5.1.3  The HARRVCJ model

Andersen et al. 2007 argue that it may be desirable to treat small jumps as measurement errors, or as part of the continuous sample path variation process, associating only large values of (RV_t − BPV_t) with the jump component. Therefore, they define Z_t as

Z_t = (1/M)^{-1/2} × [ (RV_t − BPV_t) RV_t^{-1} ] / [ (µ_1^{-4} + 2µ_1^{-2} − 5) max{1, TQ_t BPV_t^{-2}} ]^{1/2},     (1)

where TQ_t = M µ_{4/3}^{-3} ∑_{j=3}^{M} |r_{t,j}|^{4/3} |r_{t,j−1}|^{4/3} |r_{t,j−2}|^{4/3}, with µ_p = E(|Z|^p) for a standard normal random variable Z. Suppose you identify the "significant" jumps by the realizations of Z_t in excess of some critical value, say Φ_α,

J_t = I[Z_t > Φ_α] · (RV_t − BPV_t), 

where I[·] denotes the indicator function. To ensure that the estimated continuous sample path component variation and jump variation sum to the total realized variation, we follow Andersen et al. 2007 in estimating the former component as the residual,

C_t = I[Z_t ≤ Φ_α] · RV_t + I[Z_t > Φ_α] · BPV_t.

The HAR-RV-CJ model is then given by:

RV_{t,t+h} = β_0 + β_CD C_t + β_CW C_{t−5,t} + β_CM C_{t−22,t} + β_J J_t + ε_{t,t+h}.
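
To make the test statistic concrete, the following minimal sketch computes Z_t of Equation (1) for a single day of intraday returns. It is purely didactic and not the package's own ABDJumptest implementation (which is used internally when jumptest='ABDJumptest'); it uses the basic Bipower Variation without finite-sample corrections.

# Didactic sketch of the jump test statistic in Equation (1),
# for a numeric vector r containing one day of M intraday returns:
ABDZstat = function(r) {
  M    = length(r)
  mu1  = sqrt(2/pi)                          # E|Z| for standard normal Z
  mu43 = 2^(2/3) * gamma(7/6) / gamma(1/2)   # E|Z|^(4/3)
  RV   = sum(r^2)
  BPV  = mu1^(-2) * sum(abs(r[-1]) * abs(r[-M]))
  TQ   = M * mu43^(-3) *
         sum(abs(r[3:M])^(4/3) * abs(r[2:(M-1)])^(4/3) * abs(r[1:(M-2)])^(4/3))
  sqrt(M) * ((RV - BPV) / RV) /
    sqrt((mu1^(-4) + 2 * mu1^(-2) - 5) * max(1, TQ / BPV^2))
}
# A jump is deemed significant on days where the statistic exceeds the
# critical value, e.g. qnorm(0.999).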

5.1.4  Usage and arguments

The harModel function in highfrequency takes as data input an xts object containing the intraday returns. You can specify the type of HAR model that you want to estimate by setting the type argument. By default, the most basic model is chosen type='HARRV'. Other valid options are type='HARRVJ' or type='HARRVCJ' (see above for a discussion of these models).

Based on the intraday returns, daily realized measures are calculated. With the argument RVest you can set which volatility estimators should be used to estimate both the daily integrated variance (non-jump-robust) and the continuous component of volatility. By default, total daily volatility is measured by Realized Volatility and the continuous component is measured by Realized Bipower Variation, hence RVest=c("rCov","rBPCov"), but users can set the estimators to their liking (see Section 4 for a discussion of the implemented Realized Volatility measures).

The daily volatility measures are subsequently aggregated over different time horizons. Use the arguments periods and periodsJ to specify the horizons over which the continuous/total volatility and the jump component should be aggregated. Typically, the daily, weekly and monthly frequencies are used, i.e. one, five and twenty-two days, which translates into the default settings periods=c(1,5,22) and periodsJ=c(1,5,22).

Corsi and Reno 2012 propose to further extend the HAR model with a leverage component (the Leverage Heterogeneous Auto-Regressive model). This makes it possible to take into account that negative returns affect future volatility differently than positive returns. By default, the leverage effect is not taken into account and the argument leverage=NULL. In case you want to mimic the analysis in Corsi and Reno 2012, with leverage components aggregated at the daily, weekly and monthly frequency, just set leverage=c(1,5,22).

In the HARRVCJ model, a test determines whether the contribution of the jumps is statistically significant. The argument jumptest should be set to the function name of the test used to determine whether the jump variability is significant that day. By default jumptest='ABDJumptest', hence using the test statistic in Equation (1) (or Equation (18) of Andersen et al. 2007). The argument alpha can then be used to set the confidence level used in testing for jumps, which is set to the traditional 5% by default.

Often, the objective will be to model the volatility for the next day, but this needn't be the case. Use the argument h to set the time horizon over which you want to aggregate the dependent variable, e.g. h=5 in case you are interested in modeling the weekly realized volatility.

To conclude, we note that in applications of the HAR model the dependent and explanatory variables are sometimes transformed by taking the logarithm or the square root. Set the argument transform='log', transform='sqrt' or any other function name to transform all variables right before the linear model is estimated. By default, no transformation is done and transform=NULL.

5.1.5  Examples

Fitting the HARRV model on the Dow Jones Industrial Average in 2008
As a first step, we load the daily realized volatility for the Dow Jones Industrial Average and select only 2008. The sample dataset realized_library in highfrequency contains a very rich set of realized volatility measures computed by Heber et al. 2009. highfrequency only contains a subset of the full realized library, which can be found on their website: http://realized.oxford-man.ox.ac.uk.

> data(realized_library); #Get sample daily Realized Volatility data
> DJI_RV = realized_library$Dow.Jones.Industrials.Realized.Variance; #Select DJI
> DJI_RV = DJI_RV[!is.na(DJI_RV)]; #Remove NA's
> DJI_RV = DJI_RV['2008'];

As a second step, we estimate the traditional Heterogeneous AutoRegressive model (HAR). The output of the model is an S3 object. Since the HAR model is simply a special type of linear model, it is also implemented that way: the output of the harModel function is an object of class harModel, a subclass of lm, the standard class for linear models. Figure 2 plots the output object of the harModel function, with time on the horizontal axis and the observed and forecasted realized volatility on the vertical axis (of course this analysis is in-sample, but the estimated coefficients of the model can be used for genuine out-of-sample forecasts). It is clear from a visual inspection of Figure 2 that the harModel can adapt relatively fast to changes in the volatility level, which explains its popularity in recent years.

> x = harModel(data=DJI_RV , periods = c(1,5,22), RVest = c("rCov"),
+                       type="HARRV",h=1,transform=NULL);
> class(x); 
[1] "harModel" "lm"      
> x;
Model:
RV1 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22

Coefficients:
    beta0      beta1      beta2      beta3  
4.432e-05  1.586e-01  6.213e-01  8.721e-02  


    r.squared  adj.r.squared  
       0.4679         0.4608  
> summary(x);
Call:
"RV1 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22"

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0017683 -0.0000626 -0.0000427 -0.0000087  0.0044331 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
beta0 4.432e-05  3.695e-05   1.200   0.2315    
beta1 1.586e-01  8.089e-02   1.960   0.0512 .  
beta2 6.213e-01  1.362e-01   4.560 8.36e-06 ***
beta3 8.721e-02  1.217e-01   0.716   0.4745    
---
Signif. codes   0 '***' 0.001 '**' 0.01  '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0004344 on 227 degrees of freedom
Multiple R-squared: 0.4679, Adjusted R-squared: 0.4608 
F-statistic: 66.53 on 3 and 227 DF,  p-value: < 2.2e-16 
> plot(x);

Figure 2: HAR-model for the Realized Volatility of the DJIA in 2008

Fitting the HARRVCJ model on small example dataset
Using highfrequency, it is also very easy to estimate more sophisticated versions of the harModel. For example, the HARRVCJ model discussed in Andersen et al. 2007 can easily be estimated on a small example dataset as follows:

> data("sample_5minprices_jumps"); 
> data = sample_5minprices_jumps[,1];
> data = makeReturns(data); #Get the high-frequency return data
> x      = harModel(data, periods = c(1,5,10), periodsJ=c(1,5,10), 
+                   RVest = c("rCov","rBPCov"), type="HARRVCJ",
+                   transform="sqrt");
> x

Model:
sqrt(RV1) = beta0  +  beta1 * sqrt(C1) +  beta2 * sqrt(C5) +  beta3 * sqrt(C10) 
       +  beta4 * sqrt(J1) +  beta5 * sqrt(J5) +  beta6 * sqrt(J10)

Coefficients:
   beta0     beta1     beta2     beta3     beta4     beta5  
 -0.8835    1.1957  -25.1922   38.9909   -0.4483    0.8084  
   beta6  
 -6.8305  

    r.squared  adj.r.squared  
       0.9915         0.9661  

This last example shows how easily a special type of HAR model can be estimated with solely the intraday returns as input. Note that this last example purely serves a didactic purpose: the high R-squared can be explained by the small number of observations (i.e. days) in the dataset.

5.2  The HEAVY model

Shephard and Sheppard 2010 introduced the HEAVY model (High frEquency bAsed VolatilitY) to leverage high-frequency data for volatility modeling. Essentially, the model boils down to two equations: Equation (2) models the close-to-close conditional variance, while Equation (3) models the conditional expectation of the open-to-close variation. Denote by F_{t−1}^{HF} the past high-frequency data (i.e. all intraday returns) and by F_{t−1}^{LF} the past low-frequency data (i.e. daily returns). The HEAVY model is based on F_{t−1}^{HF}, in contrast to traditional GARCH models that use only F_{t−1}^{LF}.

Denote the daily returns:

r_1, r_2, …, r_T.

Based on intraday data, one can calculate the daily realized measures:

RM_1, RM_2, …, RM_T,

with T the total number of days in the sample. The HEAVY model is given by:

Var(r_t | F_{t−1}^{HF}) = h_t = ω + α RM_{t−1} + β h_{t−1},     ω, α ≥ 0 and β ∈ [0,1],              (2)

E(RM_t | F_{t−1}^{HF}) = θ_t = ω_R + α_R RM_{t−1} + β_R θ_{t−1},     ω_R, α_R, β_R ≥ 0 and α_R + β_R ∈ [0,1].              (3)

Equation (2) models the close-to-close conditional variance, and equation (3) models the conditional expectation of the open-to-close variation. In matrix notation, the model becomes:

( h_t )   ( ω   )   ( 0  α   ) ( r_{t−1}^2 )   ( β  0   ) ( h_{t−1} )
(     ) = (     ) + (        ) (           ) + (        ) (         ).              (4)
( θ_t )   ( ω_R )   ( 0  α_R ) ( RM_{t−1}  )   ( 0  β_R ) ( θ_{t−1} )

The function heavyModel in the highfrequency package implements the above model in a very general way. It enables the user to obtain estimates of the parameters (ω, ω_R, α, α_R, β, β_R) by quasi-maximum likelihood.4

5.2.1  Usage and arguments

The heavyModel function5 takes a (T x K) matrix containing the data as input, with T the number of days. For the traditional HEAVY model K=2: the first column contains the squared demeaned daily returns, i.e. r_1^2, …, r_T^2, and the second column contains the realized measures, i.e. RM_1, …, RM_T.

As you can see in Equation (4), the matrix notation of the HEAVY model contains two parameter matrices, one with the αs and one with the βs. This matrix structure makes clear that the traditional HEAVY model given in (4) is just a special case of a larger set of models that can be written this way. You can use the arguments p and q of the heavyModel function to specify, for each of these two matrices, (i) which parameters in the matrix should be estimated, by setting a non-zero integer in that matrix cell, and (ii) how many lags should be included in the model. More formally, p is a (K × K) matrix containing the lag lengths for the model innovations. Position (i,j) in the matrix indicates the number of lags in equation i of the model for the innovations in data column j. For the traditional HEAVY model introduced above, p is given by (cf. the matrix with the αs in Equation (4)):

p = [ 0  1
      0  1 ].

Similarly to the matrix p, the matrix q is a (K × K) matrix containing the lag lengths for the conditional variances. Position (i,j) in the matrix indicates the number of lags in equation i of the model for the conditional variances corresponding to series j. For the traditional HEAVY model introduced above, q is given by (cf. the matrix with the βs in Equation (4)):

q = [ 1  0
      0  1 ].

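In R, these two matrices could be constructed as in the sketch below; presumably they correspond to the defaults of heavyModel, so passing them explicitly through the p and q arguments is only needed for non-standard specifications.

> # Traditional HEAVY model: innovation-lag matrix p and variance-lag matrix q
> p = matrix(c(0, 0, 1, 1), ncol = 2);  # both equations load on lagged RM
> q = matrix(c(1, 0, 0, 1), ncol = 2);  # each equation loads on its own lagged variance
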
The starting values to be used in the optimization can be set through the argument
startingvalues. The arguments LB and UB can be used to set the lower and upper bounds for the parameters. By default, the lower bound is set to zero and the upper bound to infinity. To specify how the estimation should be initialized, you can use the backcast argument, which is set to the unconditional estimates by default. Finally, the compconst argument is a boolean indicating how the ωs are estimated. In case it is TRUE, the ωs are estimated in the optimization; in case it is FALSE, volatility targeting is done and each ω is just one minus the sum of all relevant αs and βs, multiplied by the unconditional variance.

5.2.2  Output

The output of the heavyModel function is a list. Most interestingly, the list item estparams contains a matrix with the parameter estimates and their names. The order in which the parameters are reported is as follows: first the estimates for ω, then the estimates for the non-zero αs with the most recent lags first in case max(p) > 1, then the estimates for the non-zero βs with the most recent lags first in case max(q) > 1.

Whereas the loglikelihood list item reports the total log likelihood, the list item likelihoods offers a bit more insight: it contains an xts object with the daily log likelihood evaluated at the estimated parameters. The list item convergence provides more information on the convergence of the optimization; see optim for more information.

The list item condvar is a (T × K) xts object containing the conditional variances. For the traditional HEAVY model, Figure 3 plots for example the conditional open-to-close variation for the DJIA for 1996 up to 2009 (see the next subsection for more information).

5.2.3  Examples

Fitting the HEAVY model on the Dow Jones Industrial Average from 1996 up to 2009
As a first step, we load the realized_library (Heber et al. 2009), which contains information on the Dow Jones Industrial Average. We then select from this library the daily returns and the daily Realized Kernel estimates (Barndorff-Nielsen et al. 2004). The data matrix that serves as input for the heavyModel has the squared returns as the first column and the Realized Kernel estimates as the second column. We further set the argument backcast to the variance of the daily returns and the average realized kernel over the sample period. We now have all ingredients to estimate the HEAVY model. Based on the output of the model, Figure 3 plots the conditional open-to-close variation, as estimated by the second equation of the model. Note the striking peak in 2008 at the start of the financial crisis.

>  # Implementation of the heavy model on DJIA:
>  data("realized_library");
>  returns  =  realized_library$Dow.Jones.Industrials.Returns; 
>  rk       =  realized_library$Dow.Jones.Industrials.Realized.Kernel; 
>  returns  = returns[!is.na(rk)];  rk = rk[!is.na(rk)]; # Remove NA's 
>  data     = cbind( returns^2, rk ); 
>  backcast = matrix( c(var(returns),mean(rk)) ,ncol=1);  
> 
>  startvalues = c(0.004,0.02,0.44,0.41,0.74,0.56); # Initial values 
>  output = heavyModel( data = as.matrix(data,ncol=2), compconst=FALSE, 
+                       startingvalues = startvalues, backcast=backcast); 
>  output$estparams
             [,1]
omega1 0.01750506
omega2 0.06182249
alpha1 0.45118753
alpha2 0.41204541
beta1  0.73834594
beta2  0.56367558

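The conditional variances behind Figure 3 can be retrieved from the condvar list item; a minimal sketch (assuming the second column corresponds to the open-to-close equation for θ_t):

> condvar = output$condvar;
> plot(condvar[,2]);   # conditional open-to-close variation, cf. Figure 3
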
Figure 3: HEAVY model for the Realized Volatility of the DJIA from 1996 up to 2009.

6  Liquidity

6.1  Matching trades and quotes

Trades and quotes are often supplied as separate data objects. For many research and practical questions related to transaction data, one needs to merge trades and quotes. Since trades and quotes can be subject to different reporting lags, this is not a straightforward operation (Lee and Ready 1991). The function matchTradesQuotes can be used to match trades and quotes. One should supply the number of seconds by which quotes are registered faster than trades. Based on the research of Vergote 2005, we set two seconds as the default.

6.2  Inferred trade direction

Many trades and quotes databases do not indicate whether individual trades are market buy or market sell orders. highfrequency implements the Lee-Ready rule (Lee and Ready 1991) in the function getTradeDirection to infer the trade direction based on the matched trades and quotes.

6.3  Liquidity measures

Numerous liquidity measures can be calculated based on matched trade and quote data, using the function tqLiquidity (see Bessembinder 2003, Boehmer 2005, Hasbrouck and Seppi 2001 and Venkataraman 2001 for more information on the implemented liquidity measures). The main implemented liquidity measures are listed in Table 3 and can be used as arguments of the function tqLiquidity.


Table 3: Overview of liquidity measures

Argument(s)                        Liquidity measure
es, rs                             Compute effective (realized) spread
value_trade                        Compute trade value (price × size)
signed_value_trade                 Compute signed trade value
signed_trade_size                  Compute signed trade size
di_diff, di_div                    Compute depth imbalance
pes, prs                           Compute proportional effective and realized spread
price_impact, prop_price_impact    Compute price impact
tspread, pts                       Compute half traded and proportional half traded spread
qs, logqs                          Compute quoted spread
qsslope, logqslope                 Compute quoted slope

The example below illustrates how to: (i) match trades and quotes, (ii) get the trade direction and (iii) calculate liquidity measures.

 
> #Load data samples
> data("sample_tdata");
> data("sample_qdata");
> #Match the trade and quote data
> tqdata = matchTradesQuotes(sample_tdata,sample_qdata);
> #Display information in tqdata
> colnames(tqdata)[1:6];
[1] "SYMBOL" "EX"     "PRICE"  "SIZE"   "COND"   "CORR"  
> colnames(tqdata)[7:12];
[1] "G127"   "BID"    "BIDSIZ" "OFR"    "OFRSIZ" "MODE"  
> #Get the inferred trade direction according to the Lee-Ready rule
> x = getTradeDirection(tqdata);
> #Calculate the proportional realized spread:
> prs = tqLiquidity(tqdata,sample_tdata,sample_qdata,type="prs");
> #Calculate the effective spread:
> es = tqLiquidity(tqdata,type="es");

7  Acknowledgements:

We thank Google for the financial support through the Google summer of code 2012 program and Brian G. Peterson for his mentorship. Furthermore, we thank the participants of the 1st R/Rmetrics Summer School and 4th User/Developer Meeting on Computational Finance and Financial Engineering, Switzerland (2010) and the "R/Finance: Applied finance with R" conference, USA, Chicago (2010), for their comments and suggestions. In particular, we thank Christophe Croux, Brian Peterson and Eric Zivot for their insights and help in developing this package. Financial support from the Flemish IWT (Institute for Science and Innovation), the National Bank of Belgium and the Research Fund K.U.Leuven is gratefully acknowledged.

8  References:

Ait-Sahalia, Y., Mykland, P. A. and Zhang, L.
A Tale of Two Time Scales: Determining Integrated Volatility With Noisy High-Frequency Data
Journal of the American Statistical Association, 2005, Vol. 100, pp. 1394-1411
Andersen, T. G. and Bollerslev, T.
Intraday periodicity and volatility persistence in financial markets
Journal of Empirical Finance, 1997, Vol. 4, pp. 115-158
Andersen, T. G., Bollerslev, T. and Diebold, F.
Roughing it up: including jump components in the measurement, modelling and forecasting of return volatility
The Review of Economics and Statistics, 2007, Vol. 89, pp. 701-720
Andersen, T. G., Bollerslev, T., Diebold, F. and Labys, P.
Modeling and forecasting realized volatility
Econometrica, 2003, Vol. 71, pp. 579-625
Andersen, T. G., Dobrev, D. and Schaumburg, E.
Jump-Robust Volatility Estimation using Nearest Neighbor Truncation
Journal of Econometrics, 2012, Vol. 169, pp. 75-93
Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A. and Shephard, N.
Multivariate realised kernels: consistent positive semi-definite estimators of the covariation of equity prices with noise and non-synchronous trading
Journal of Econometrics, 2011, Vol. 162, pp. 149-169
Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A. and Shephard, N.
Realised Kernels in Practice: Trades and Quotes
Econometrics Journal, 2008, Vol. 4, pp. 1-32
Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A. and Shephard, N.
Regular and modified kernel-based estimators of integrated variance: The case with independent noise
Working paper, 2004
Barndorff-Nielsen, O. E. and Shephard, N.
Measuring the impact of jumps in multivariate price processes using bipower covariation
Discussion paper, Nuffield College, Oxford University, 2004
Bessembinder, H.
Issues in Assessing Trade Execution Costs
Journal of Financial Markets, 2003, Vol. 6, pp. 223-257
Boehmer, E.
Dimensions of execution quality: Recent evidence for US equity markets
Journal of Financial Economics, 2005, Vol. 78, pp. 553-582
Boudt, K., Croux, C. and Laurent, S.
Robust estimation of intraweek periodicity in volatility and jump detection
Journal of Empirical Finance, 2011, Vol. 18, pp. 353-367
Boudt, K., Croux, C. and Laurent, S.
Outlyingness weighted covariation
Journal of Financial Econometrics, 2011, Vol. 9, pp. 657-684
Boudt, K. and Zhang, J.
Jump robust two time scale covariance estimation and realized volatility budgets
Mimeo, 2010
Brownlees, C. and Gallo, G.
Financial econometric analysis at ultra-high frequency: Data handling concerns
Computational Statistics & Data Analysis, 2006, Vol. 51, pp. 2232-2245
Cornelissen, J. and Boudt, K.
RTAQ: Tools for the analysis of trades and quotes in R
2012
Corsi, F.
A Simple Approximate Long Memory Model of Realized Volatility
Journal of Financial Econometrics, 2009, Vol. 7, pp. 174-196
Corsi, F. and Reno, R.
Discrete-time volatility forecasting with persistent leverage effect and the link with continuous-time volatility modeling
Journal of Business and Economic Statistics, 2012, pp. forthcoming
De Pooter, M., Martens, M. P. and Van Dijk, D. J.
Predicting the Daily Covariance Matrix for S&P 100 Stocks using Intraday Data - But Which Frequency to Use?
Econometric Reviews, 2008, Vol. 27, pp. 199-229
Fleming, J., Kirby, C. and Ostdiek, B.
The Economic Value of Volatility Timing Using "Realized" Volatility
Journal of Financial Economics, 2003, Vol. 67, pp. 473-509
Gobbi, F. and Mancini, C.
Identifying the covariation between the diffusion parts and the co-jumps given discrete observations
Mimeo, 2009
Harris, F. H. deB., McInish, T. H., Shoesmith, G. L. and Wood, R. A.
Cointegration, error correction, and price discovery on informationally linked security markets
Journal of Financial and Quantitative Analysis, 1995, Vol. 30, pp. 563-581
Hasbrouck, J. and Seppi, D. J.
Common Factors in Prices, Order Flows and Liquidity
Journal of Financial Economics, 2001, Vol. 59, pp. 383-411
Hayashi, T. and Yoshida, N.
On covariance estimation of non-synchronously observed diffusion processes
Bernoulli, 2005, Vol. 11, pp. 359-379
Heber, G., Lunde, A., Shephard, N. and Sheppard, K.
Oxford-Man Institute's realized library, version 0.1
", Oxford-Man Institute, University of Oxford, 2009
Lee, C. M. C. and Ready, M. J.
Inferring Trade Direction from Intraday Data
Journal of Finance, 1991, Vol. 46, pp. 733-746
Payseur, S.
realized: Realized
2008
Ryan, J. A.
quantmod: Quantitative Financial Modelling Framework
2011
Ryan, J. A. and Ulrich, J. M.
xts: Extensible Time Series
2009
Shephard, N. and Sheppard, K.
Realising the future: forecasting with high frequency based volatility (HEAVY) models
Journal of Applied Econometrics, 2010, Vol. 25, pp. 197-231
Venkataraman, K.
Automated Versus Floor Trading: An Analysis of Execution Costs on the Paris and New York Exchanges
The Journal of Finance, 2001, Vol. 56, pp. 1445-1485
Vergote, O.
How to Match Trades and Quotes for NYSE Stocks?
K.U.Leuven working paper, 2005
Yan, B. and Zivot, E.
Analysis of High-Frequency Financial Data with S-Plus
UWEC-2005-03, University of Washington, Department of Economics, 2003
Zhang, L.
Estimating covariation: Epps effect, microstructure noise
Journal of Econometrics, 2011, Vol. 160, pp. 33-47


1 Examples of the use of xts objects can be found on http://www.quantmod.com/examples/data/.
2 http://wrds-web.wharton.upenn.edu/wrds
3 Throughout we use the terms realized volatility and realized variance interchangeably to denote high-frequency data based estimators for the daily variation in the returns. All outputted values should be interpreted as variances, unless they clearly indicate standard deviations, e.g. in the spotVol function.
4 The implementation is (loosely) based on the matlab code from Kevin Sheppard: http://www.kevinsheppard.com/wiki/MFE_Toolbox
5 The implementation of the heavyModel is not completely finished. For the moment, only bound constraints on the parameters are imposed in the optimization. Future developments include outputting standard errors and a C implementation of the likelihood function to speed up the QML estimation.