The economic value of analyzing high-frequency financial data is now widely recognized, both in academia and in the financial industry. High-frequency data form the basis of intraday and daily risk monitoring and forecasting, serve as an input to the portfolio allocation process, and are essential for high-frequency trading. For the efficient use of high-frequency data in financial decision making, the
highfrequency package implements state-of-the-art techniques for the cleaning and matching of trades and quotes, as well as the calculation and forecasting of liquidity measures and realized volatility and correlation measures based on high-frequency data.
The highfrequency package is the outcome of a Google Summer of Code project and is an improved and updated version of two packages:
RTAQ (Cornelissen and Boudt 2012) and
realized (Payseur 2008).
Handling high-frequency data can be particularly challenging because of the specific characteristics of the data, as extensively documented in Yan and Zivot 2003.
The highfrequency package offers high-level tools for the analysis of high-frequency data.
These tools tackle three typical challenges of working with high-frequency data.
A first specific challenge is the enormous number of observations, which can reach millions of observations per stock per day.
Secondly, transaction-by-transaction data is by nature irregularly spaced over time.
Thirdly, the recorded data often contains errors for various reasons.
The highfrequency package offers a (basic) interface to manage high-frequency trades and quotes data.
First and foremost, it offers easy-to-use tools
to clean and aggregate high-frequency data, to calculate liquidity measures, and to measure and forecast volatility.
The tools in
highfrequency are designed in such a way that they will work with most data input types.
The package contains an extensive collection of functions to calculate realized measures, since it arose by merging the package
RTAQ (Cornelissen and Boudt 2012) from
the TradeAnalytics project and the package
realized (Payseur 2008).
highfrequency strongly builds on the functionality offered by the
xts package (Ryan and Ulrich 2009).
You can use the functionality in the
highfrequency package for trades and quotes objects from the most popular vendors,
as long as your objects are stored as xts objects.
A stable version of the
highfrequency package has been published on CRAN and
can be installed with the following command:
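> install.packages("highfrequency");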
The latest version of the
highfrequency package is hosted on R-forge and can be downloaded through the following command:
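> install.packages("highfrequency", repos = "http://R-Forge.R-project.org");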
The development version can be found in the highfrequency GitHub repository. In a couple of months, I'll upload an interactive R tutorial for the highfrequency package on the DataMind platform. Furthermore, the full highfrequency manual and documentation can be read online.
Users who have their own system set up to store their high-frequency data as xts objects can skip this section.
The functions in the
highfrequency package to calculate liquidity measures, realized volatility, etc. can take high-frequency data from various data vendors as input:
NYSE TAQ, WRDS TAQ, Reuters, Bloomberg, etc. To achieve this flexibility, we rely on the functionality provided by the excellent
quantmod package (Ryan 2011). Additionally,
highfrequency provides functionality to convert raw data from several data providers into
xts objects. The latter is discussed and illustrated in this section.
High-frequency financial data is typically subdivided into two files, containing information on the trades and the quotes respectively.
highfrequency can currently parse three types of input data into xts objects:
(i) ".txt" files extracted from the NYSE TAQ database,
(ii) ".csv" "files extracted from the WRDS database,
(iii) ".asc" files from http://www.tickdata.com.
For each of these input data types, we briefly discuss how the
convert function can be used to convert the txt/csv/asc files into xts objects (Ryan and Ulrich 2009) that are saved on-disk in the "RData" format. We opt to store the data as xts objects because this type of object can be indexed by time and date.
Parsing raw NYSE TAQ data into xts:
Suppose the folder ~/raw_data on your drive contains the two folders "2008-01-02" and "2008-01-03", and that these folders each contain the files AA_trades.txt and AAPL_trades.txt with the daily TAQ data bought from the New York Stock Exchange (NYSE) for the tickers "AA" and "AAPL".
The raw data can then easily be converted into on-disk xts objects as follows:
from = "2008-01-02"; to = "2008-01-03"; datasource = "~/raw_data"; datadestination = "~/xts_data"; convert(from, to, datasource, datadestination, trades=TRUE, quotes=FALSE,ticker=c("AA","AAPL"), dir=TRUE, extension="txt", header=FALSE,tradecolnames=NULL,quotecolnames=NULL, format="%Y%m%d %H:%M:%S");
At this point, the folder
~/xts_data will contain two folders named 2008-01-02 and 2008-01-03, each containing the files
AA_trades.RData and AAPL_trades.RData with the respective daily trade data. The data can now be loaded with the
TAQLoad function (see below).
Parsing raw WRDS TAQ data into xts:
Suppose the folder
~/raw_data on your drive contains the files
IBM_trades.csv and IBM_quotes.csv acquired from Wharton Research Data Service (WRDS).
These files contain respectively the trade and quote data for the "IBM" stock for "2011-12-01" up to "2011-12-02".
The data can then easily be converted into on-disk xts objects as follows:
from = "2011-12-01"; to = "2011-12-02"; datasource = "~/raw_data"; datadestination = "~/xts_data"; convert( from=from, to=to, datasource=datasource, datadestination=datadestination, trades = T, quotes = T, ticker="IBM", dir = TRUE, extension = "csv", header = TRUE, tradecolnames = NULL, quotecolnames = NULL, format="%Y%m%d %H:%M:%S", onefile = TRUE )
Now, the folder
~/xts_data contains two folders, one for each day in the sample, i.e.
"2011-12-01" and "2011-12-02".
Each of these folders contains a file named
IBM_trades.RData in which the trades for that day can be found and a file
IBM_quotes.RData in which the quotes for that day can be found, both saved as xts objects.
As a final step, the function
TAQLoad can be used to load the on-disk data into your R workspace.
You can for example get the trades for this sample in your workspace as follows:
> xts_data = TAQLoad(tickers = "IBM", from = "2011-12-01",
+                    to = "2011-12-02", trades = FALSE,
+                    quotes = TRUE, datasource = datadestination)
> head(xts_data)
                    SYMBOL EX  BID      BIDSIZ OFR      OFRSIZ MODE
2011-12-01 04:00:00 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:17 "IBM"  "P" "185.92" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:18 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"
2011-12-01 04:00:25 "IBM"  "P" "176.85" "1"    "187.73" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "188.00" " 1"   "12"
2011-12-01 04:00:26 "IBM"  "P" "176.85" "1"    "187.74" " 1"   "12"
Parsing raw Tickdata.com data into xts:
Suppose the folder
~/raw_data on your drive contains the files
GLP_trades.asc and GLP_quotes.asc acquired from www.tickdata.com.
These files contain respectively the trade and quote data for the "GLP" stock for "2011-01-11" up to "2011-03-11".
The data can then easily be converted into on-disk xts objects as follows:
from = "2011-01-11"; to = "2011-03-11"; datasource = "~/raw_data"; datadestination = "~/xts_data"; convert(from=from, to=to, datasource=datasource, datadestination=datadestination, trades = TRUE, quotes = TRUE, ticker="GLP", dir = TRUE, format = "%d/%m/%Y %H:%M:%OS", extension = "tickdatacom", header = TRUE, onefile = TRUE );
At this point, the folder
~/xts_data contains three folders, one for each day in the sample, i.e.
"2011-01-11", "2011-02-11" and "2011-03-11".
Each of these folders contains a file named
GLP_trades.RData in which the trades for that day can be found and a file
GLP_quotes.RData in which the quotes for that day can be found, both saved as xts objects.
Note the format = "%d/%m/%Y %H:%M:%OS" argument, which allows for timestamps more accurate than seconds, e.g. milliseconds in this case.
As a final step, the function
TAQLoad can be used to load the on-disk data into your R workspace.
You can for example get the trade data for the first day of the sample as follows:
> options("digits.secs"=3); #Show milliseconds > xts_data = TAQLoad(tickers="GLP", from="2011-01-11", to="2011-01-11", + trades=T, quotes=F, + datasource=datadestination) > head(xts_data) SYMBOL EX PRICE SIZE COND CORR G127 2011-01-11 09:30:00.338 "GLP" "T" "18.0700" " 500" "O X" "0" "" 2011-01-11 09:30:00.338 "GLP" "T" "18.0700" " 500" "Q" "0" "" 2011-01-11 09:33:49.342 "GLP" "T" "18.5000" " 150" "F" "0" "" 2011-01-11 09:39:29.280 "GLP" "N" "19.2000" "4924" "O" "0" "" 2011-01-11 09:39:29.348 "GLP" "D" "19.2400" " 500" "@" "0" "T" 2011-01-11 09:39:29.411 "GLP" "N" "19.2400" " 200" "F" "0" ""
Typical trades and quotes data description:
Both trades and quotes data should thus be stored as xts objects (Ryan and Ulrich 2009) if you want to use the functionality of the highfrequency package.
Use the code below to see how trade and quote data objects are typically stored.
library(highfrequency);
data("sample_tdataraw");
head(sample_tdataraw);
data("sample_qdataraw");
head(sample_qdataraw);
In this section we discuss two very common first steps in the manipulation of high-frequency financial data: (i) the cleaning and (ii) the aggregation of the data.
For various reasons, raw trade and quote data contain numerous data errors.
The data is therefore not suited for analysis right away, and data cleaning is an essential step in dealing with tick-by-tick data (Brownlees and Gallo 2006).
highfrequency implements the step-by-step cleaning procedure proposed by Barndorff-Nielsen et al. (2008).
Table 1 provides an overview of the cleaning functions.
A user can either use a specific cleaning procedure or a wrapper function that performs multiple cleaning steps.
The wrapper functions offer on-disk functionality: they load the raw data from disk and save the cleaned data back to your hard disk.
This approach is advisable in case you have multiple days of data to clean.
Of course, the data on your disk should be organized as discussed in the previous section to benefit from this functionality.
More specifically, these functions expect the data to be saved as xts objects in one folder per day (the structure from which the
TAQLoad function reads), which will always be the case if you used the
convert function to convert the raw data into xts objects.
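For example, a minimal sketch of an on-disk cleaning run, assuming the on-disk arguments of tradesCleanup mirror those of convert (from, to, datasource, datadestination):
> from = "2008-01-02"; to = "2008-01-03";
> datasource = "~/xts_data"; datadestination = "~/xts_data_cleaned";
> tradesCleanup(from = from, to = to, datasource = datasource,
+               datadestination = datadestination,
+               ticker = c("AA", "AAPL"), exchanges = "N");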
To maintain some insight into the cleaning process, the functions report
the total number of remaining observations after each cleaning step:
> data("sample_tdataraw"); > dim(sample_tdataraw);  48484 7 > tdata_afterfirstcleaning = tradesCleanup(tdataraw=sample_tdataraw,exchanges="N"); > tdata_afterfirstcleaning$report; initial number no zero prices select exchange 48484 48479 20795 sales condition merge same timestamp 20135 9105 > dim(tdata_afterfirstcleaning$tdata)  9105 7
Table 1: Overview of the cleaning functions.

Function                     Description

All data:
ExchangeHoursOnly            Restrict data to exchange hours
selectexchange               Restrict data to a specific exchange

Trade data:
noZeroPrices                 Delete entries with zero prices
autoSelectExchangeTrades     Restrict data to the exchange with the highest trade volume
salesCondition               Delete entries with an abnormal sale condition
mergeTradesSameTimestamp     Delete entries with the same timestamp and use the median price
rmTradeOutliers              Delete entries with prices above the ask plus the bid-ask spread,
                             or below the bid minus the bid-ask spread

Quote data:
noZeroQuotes                 Delete entries with zero quotes
autoSelectExchangeQuotes     Restrict data to the exchange with the highest bid size plus offer size
mergeQuotesSameTimestamp     Delete entries with the same timestamp and use the median quotes
rmNegativeSpread             Delete entries with negative spreads
rmLargeSpread                Delete entries if the spread exceeds maxi times the median daily spread
rmOutliers                   Delete entries for which the mid-quote is outlying with respect to
                             surrounding entries

Wrapper cleanup functions (sequentially perform the following steps for on-disk data):
tradesCleanup                noZeroPrices, selectExchange, salesCondition, mergeTradesSameTimestamp
quotesCleanup                noZeroQuotes, selectExchange, rmLargeSpread, mergeQuotesSameTimestamp, rmOutliers
tradesCleanupFinal           rmTradeOutliers (based on cleaned quote data as well)
Prices are typically not recorded at equispaced time points, while e.g. many realized volatility measures rely on equispaced returns. Furthermore, prices are often observed at different points in time for different assets, while e.g. most multivariate realized volatility estimators rely on synchronized data (see e.g. Table 2). There are several ways to force such asynchronously and/or irregularly recorded series onto a synchronized and/or equispaced time grid.
The most popular method, previous tick aggregation, forces prices to an equispaced grid by taking the last price realized before each grid point.
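In symbols, with observation times $t_1 < t_2 < \dots$ and grid points $\tau_j$, previous tick aggregation sets
$$p(\tau_j) = p\left(\max\{t_i : t_i \le \tau_j\}\right).$$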
highfrequency provides users with the
aggregatets function for fast and easy previous tick aggregation:
> library("highfrequency"); > # Load sample price data > data("sample_tdata"); > ts = sample_tdata$PRICE; > # Previous tick aggregation to the 5-minute sampling frequency: > tsagg5min = aggregatets(ts,on="minutes",k=5); > head(tsagg5min); PRICE 2008-01-04 09:35:00 193.920 2008-01-04 09:40:00 194.630 2008-01-04 09:45:00 193.520 2008-01-04 09:50:00 192.850 2008-01-04 09:55:00 190.795 2008-01-04 10:00:00 190.420 > # Previous tick aggregation to the 30-second sampling frequency: > tsagg30sec = aggregatets(ts,on="seconds",k=30); > tail(tsagg30sec); PRICE 2008-01-04 15:57:30 191.790 2008-01-04 15:58:00 191.740 2008-01-04 15:58:30 191.760 2008-01-04 15:59:00 191.470 2008-01-04 15:59:30 191.825 2008-01-04 16:00:00 191.670
In the example above the prices are forced to a regular equispaced time grid of respectively 5 minutes and 30 seconds.
The aggregatets function is built into all realized measures (see Section 4) and can be invoked by setting the arguments
align.by and align.period. In that case, the price series is first forced onto an equispaced regular time grid, and the realized measure is then calculated based on the returns over these regular time intervals. This has the advantage that the user can input the original price series into the realized measures without having to worry about the asynchronicity or the irregularity of the price series.
Another synchronization method (not integrated into the realized measures) is refresh time, initially proposed by Harris and Wood (1995) and recently advocated in Barndorff-Nielsen et al. (2011). The refreshTime function in
highfrequency can be used to (very quickly) force time series onto a synchronized but not necessarily equispaced time grid. The so-called refresh times are the time points at which all assets have traded at least once since the last refresh point. More specifically, the first refresh time corresponds to the first time at which all stocks have traded. The subsequent refresh time is defined as the first time when all stocks have been traded again.
This process is repeated until the end of one time series is reached.
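Formally, for $K$ assets with ordered observation times $t^{(k)}_1 < t^{(k)}_2 < \dots$, the refresh times are
$$\tau_1 = \max\left(t^{(1)}_1, \dots, t^{(K)}_1\right), \qquad \tau_{v+1} = \max\left(t^{(1)}_{N^{(1)}(\tau_v)+1}, \dots, t^{(K)}_{N^{(K)}(\tau_v)+1}\right),$$
with $N^{(k)}(\tau)$ the number of observations of asset $k$ up to time $\tau$.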
Illustration on price aggregation with refresh time and subsequent volatility calculation:
> data("sample_tdata"); > data("sample_qdata"); > #We assume that stock1 and stock2 contain price data on imaginary stocks: > stock1 = sample_tdata$PRICE; > stock2 = sample_qdata$BID; > #Previous-tick aggregation to one minute: > mPrice_1min = cbind(aggregatePrice(stock1),aggregatePrice(stock2)); > #Refresh time aggregation: > mPrice_Refresh = refreshTime(list(stock1,stock2)); > #Calculate a jump robust volatility measures > #based on synchronized data: > rbpcov1 = rBPCov(mPrice_1min,makeReturns=TRUE); > rbpcov2 = rBPCov(mPrice_Refresh,makeReturns=TRUE); > #Calculate a jump and microstructure noise robust volatility measure > #based on nonsynchronous data: > rtscov = rTSCov(list(stock1,stock2));
The availability of high-frequency data has enabled researchers to estimate the ex post realized volatility based on squared intraday returns (Andersen et al. 2003).
In practice, the main challenges in univariate volatility estimation are dealing with (i) jumps in the price level and (ii) microstructure noise.
Multivariate volatility estimation is additionally challenging because of
(i) the asynchronicity of observations between assets and
(ii) the need for a positive semidefinite covariance matrix estimator.
The highfrequency package implements many recently proposed realized volatility and covolatility measures.
An overview of the univariate and multivariate volatility estimators implemented in
highfrequency is given in Table 2.
The first two columns indicate whether the estimator can be applied to univariate or multivariate price series.
The following two columns indicate whether the estimator is robust with respect to jumps and microstructure noise respectively.
The next column reports whether asynchronous price series can/should be used as input.
The last column indicates whether the estimator always yields a positive semidefinite matrix in the multivariate case.
All realized measures have (at least) the following arguments:
rdata: The return data.
align.by: a string, align the tick data to "seconds"|"minutes"|"hours".
align.period: an integer, align the tick data to this many [seconds|minutes|hours].
makeReturns: boolean, should be TRUE when rdata contains prices instead of returns. FALSE by default.
cor: boolean, in case it is TRUE, the correlation is returned. FALSE by default.
makePsd: boolean. Non positive semidefinite estimates can be transformed into a positive semidefinite matrix by setting this argument to TRUE. The eigenvalue method is used.
...: additional arguments depending on the type of realized measure.
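As an illustration of these common arguments, the sketch below computes a daily realized variance from the raw transaction prices in the sample_tdata dataset, previous-tick aligned to the 5-minute frequency:
> data("sample_tdata");
> # Realized variance from tick-by-tick prices,
> # aligned to the 5-minute sampling frequency:
> rv5min = rCov(rdata = sample_tdata$PRICE, align.by = "minutes",
+               align.period = 5, makeReturns = TRUE);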
Table 2: Overview of the implemented realized measures.

Estimator                                     Uni-     Multi-   Jump    Noise   Tick-by-tick  Positive
                                              variate  variate  robust  robust  input         semidefinite
medRV (Andersen et al. 2012)                  x                 x                             /
minRV (Andersen et al. 2012)                  x                 x                             /
rCov (Andersen et al. 2003)                   x        x                                      x
rBPCov (Barndorff-Nielsen and Shephard 2004)  x        x        x
rOWCov (Boudt et al. 2011a)                   x        x        x                             x
rThresholdCov (Gobbi and Mancini 2009)        x        x        x
rTSCov (Zhang 2011)                           x        x                x       x
rRTSCov (Boudt and Zhang 2012)                x        x        x       x       x
rAVGCov (Ait-Sahalia et al. 2005)             x        x                x       x             x
rKernelCov (Barndorff-Nielsen et al. 2004)    x        x                x       x             x
rHYCov (Hayashi and Yoshida 2005)                      x                        x
Another interesting feature of high-frequency data in the context of volatility measurement is the periodicity in the volatility of high-frequency returns induced by opening, lunch and closing of financial markets (Andersen and Bollerslev 1997).
highfrequency implements both the classic intraday periodicity estimation method of Andersen and Bollerslev 1997 and the jump robust version proposed by Boudt et al. 2011b.
These estimation methods assume that the standard deviation of the intraday returns (called vol in the output) can be decomposed into a daily volatility
factor (constant for all intraday returns observed on the same day, called dailyvol) and an intraday factor (a deterministic function of the intraday time, called
periodicvol). The estimates are either non-parametric (based on a scale estimator) or parametric (based on regression estimation of a flexible specification for the intraday factor).
The sample code below illustrates the estimation of the intraday periodicity; the result is plotted in Figure 1:
> data("sample_real5minprices"); > > #compute and plot intraday periodicity > out = spotVol(sample_real5minprices,P1=6,P2=4,periodicvol="TML",k=5, + dummies=FALSE); > head(out); returns vol dailyvol periodicvol 2005-03-04 09:35:00 -0.0010966963 0.004081072 0.001896816 2.151539 2005-03-04 09:40:00 -0.0005614217 0.003695715 0.001896816 1.948379 2005-03-04 09:45:00 -0.0026443880 0.003417950 0.001896816 1.801941
It is broadly accepted by academic researchers that, when appropriately managed, access to high-frequency data leads to a comparative advantage in predicting the volatility of future price evolution, and hence in generating alpha through quantitative investment models. Fleming et al. 2003 estimated that an investor would be willing to pay 50 to 200 basis points per year to capture the gains in portfolio performance from using high-frequency returns instead of daily returns for volatility forecasting. Their findings were later confirmed in a somewhat different setting by De Pooter et al. 2008.
In this section we discuss two classes of univariate forecasting models implemented in the
highfrequency package that take high-frequency data as input:
(i) the Heterogeneous Autoregressive model for Realized Volatility (HAR) discussed in Andersen et al. 2007 and Corsi 2009a, and
(ii) the HEAVY model (High frEquency bAsed VolatilitY) introduced in Shephard and Sheppard 2010.
Roughly speaking, the goal of both models is to predict the next day's volatility based on the high-frequency price evolution prior to that day.
Although the HAR and the HEAVY model share the same goal, i.e. modeling the conditional volatility, they take different approaches. Whereas the HAR model focuses on predicting the open-to-close variation, the HEAVY model has two equations: one for the open-to-close variation and one for the close-to-close variation. The main advantages of the HAR model are, among other things, that it is simple and easy to estimate (since it is essentially a special type of linear model that can be estimated by least squares), while it still manages to reproduce typical features of volatility such as long memory and fat tails. The main advantages of the HEAVY model are, among other things, that it models both the close-to-close conditional variance and the conditional expectation of the open-to-close variation. Furthermore, the HEAVY model has momentum and mean reversion effects, and it adjusts quickly to structural breaks in the level of the volatility process. In contrast to the HAR model, the HEAVY model is estimated by Gaussian quasi-maximum likelihood, for each of the two equations in the model separately.
The next two sections review the HAR model and HEAVY model each in more detail, and of course discuss and illustrate how
highfrequency can be used to estimate these models.
Corsi 2009b proposed the Heterogeneous Autoregressive (HAR) model for Realized Volatility.
As a first step, the intraday returns are used to measure the daily volatility, e.g. simply by the Realized Volatility, typically ignoring the overnight return.
As a second step, the realized volatility is parameterized as a linear function of the lagged realized volatilities over different horizons.
Typically, the daily, weekly and monthly horizons are chosen to aggregate the realized volatility over.
The simplest version of the model, as proposed in Corsi 2009b, is referred to as the HAR-RV model.
In an interesting paper, Andersen et al. 2007 extend the model of Corsi 2009b by explicitly including the contribution of jumps to the daily Realized Volatility in the model. Their analysis suggests that the volatility originating from jumps in the price level is less persistent than the volatility originating from the continuous component of Realized Volatility, and that separating the rough jump moves from the smooth continuous moves enables better forecasting results. They propose mainly two ways to model this idea, and both are implemented in the
highfrequency package. In what follows, we first discuss the different versions of the HAR model in more detail, and then we discuss and illustrate how
highfrequency can be used to estimate them.
The harModel function in
highfrequency implements the Heterogeneous Autoregressive model for Realized Volatility
discussed in Andersen et al. 2007 and Corsi 2009a. Let $r_{t,i}$ be the $i$-th intraday return on day $t$, for $i = 1, \dots, M$, with $M$ the number of intraday returns per day. The daily Realized Volatility is then given by $RV_t = \sum_{i=1}^{M} r_{t,i}^2$. Denote by
$$RV_{t,t+h} = h^{-1}\left[RV_{t+1} + RV_{t+2} + \dots + RV_{t+h}\right]$$
the Realized Volatility aggregated over $h$ days. Note that by definition $RV_{t,t+1} \equiv RV_{t+1}$.
The HAR-RV model is then given by
$$RV_{t,t+h} = \beta_0 + \beta_D\, RV_t + \beta_W\, RV_{t-5,t} + \beta_M\, RV_{t-22,t} + \varepsilon_{t,t+h},$$
for $t = 1, 2, \dots, T$. Typically $h = 1$, and the model simplifies to
$$RV_{t+1} = \beta_0 + \beta_D\, RV_t + \beta_W\, RV_{t-5,t} + \beta_M\, RV_{t-22,t} + \varepsilon_{t+1}.$$
In the presence of jumps in the intraday price process, the traditional realized volatility measures are dominated by the contribution of the jumps to volatility. Therefore, robust measures of volatility, such as the Realized Bipower Variation $BPV_t$, were developed in order to estimate only the "smooth" price variation (roughly speaking). The contribution of the jumps to the quadratic variation of the price process can then be estimated by $J_t = \max\left[RV_t - BPV_t,\, 0\right]$. The HAR-RV-J model is then given by
$$RV_{t,t+h} = \beta_0 + \beta_D\, RV_t + \beta_W\, RV_{t-5,t} + \beta_M\, RV_{t-22,t} + \beta_J\, J_t + \varepsilon_{t,t+h}.$$
Andersen et al. 2007 argue that it may be desirable to treat small jumps as measurement errors, or as part of the continuous sample path variation process, associating only large values of $(RV_t - BPV_t)$ with the jump component. Therefore, they define the statistic
$$Z_t = \left(\frac{1}{M}\right)^{-1/2} \frac{\left(RV_t - BPV_t\right) RV_t^{-1}}{\sqrt{\left(\mu_1^{-4} + 2\mu_1^{-2} - 5\right) \max\left\{1,\, TQ_t\, BPV_t^{-2}\right\}}}, \qquad (1)$$
where $TQ_t = M\, \mu_{4/3}^{-3} \sum_{j=3}^{M} |r_{t,j}|^{4/3}\, |r_{t,j-1}|^{4/3}\, |r_{t,j-2}|^{4/3}$, with $\mu_p = \mathrm{E}(|z|^p)$ for a standard normal random variable $z$. Suppose you identify the "significant" jumps by the realizations of $Z_t$ in excess of some critical value, say $\Phi_\alpha$. The jump component is then estimated as
$$J_t = I\left[Z_t > \Phi_\alpha\right]\left(RV_t - BPV_t\right),$$
where I[·] denotes the indicator function. To ensure that the estimated continuous sample path component variation and jump variation sum to the total realized variation, we follow Andersen et al. 2007 in estimating the former component as the residual,
$$C_t = I\left[Z_t \le \Phi_\alpha\right] RV_t + I\left[Z_t > \Phi_\alpha\right] BPV_t.$$
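To make these formulas concrete, here is a minimal sketch of the computation for a single day, assuming r is a plain numeric vector of the M intraday returns (the package's own ABDJumptest implements this test; the sketch below is ours):
> # Sketch: jump test statistic of Equation (1) for one day of returns r
> M    = length(r);
> mu1  = sqrt(2/pi);                          # E|z|, z ~ N(0,1)
> mu43 = 2^(2/3) * gamma(7/6) / gamma(1/2);   # E|z|^(4/3)
> RV   = sum(r^2);                            # realized volatility
> BPV  = mu1^(-2) * sum(abs(r[-1]) * abs(r[-M]));  # bipower variation
> TQ   = M * mu43^(-3) * sum(abs(r[3:M])^(4/3) *
+                            abs(r[2:(M-1)])^(4/3) *
+                            abs(r[1:(M-2)])^(4/3)); # tripower quarticity
> Z    = sqrt(M) * ((RV - BPV) / RV) /
+        sqrt((mu1^(-4) + 2 * mu1^(-2) - 5) * max(1, TQ * BPV^(-2)));
> J    = ifelse(Z > qnorm(0.95), RV - BPV, 0); # significant jump at alpha = 5%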
The HAR-RV-CJ model is then given by:
$$RV_{t,t+h} = \beta_0 + \beta_{CD}\, C_t + \beta_{CW}\, C_{t-5,t} + \beta_{CM}\, C_{t-22,t} + \beta_J\, J_t + \varepsilon_{t,t+h}.$$
The harModel function in
highfrequency takes as
data input an xts object containing the intraday returns.
You can specify the type of HAR model that you want to estimate by setting the type argument.
By default, the most basic model is chosen,
type='HARRV'. Other valid options are
type='HARRVJ' and type='HARRVCJ' (see above for a discussion of these models).
Based on the intraday returns, daily realized measures are calculated. With the argument
RVest, you can set which volatility estimators should be used to estimate both the daily integrated variance (non-jump-robust) and the continuous component of volatility. By default, total daily volatility is measured by Realized Volatility and the continuous component is measured by the Realized Bipower Variation, hence
RVest=c(rCov, rBPCov), but users can set estimators to their liking (see Section 4 for a discussion on the implemented Realized Volatility measures).
The daily volatility measures are subsequently aggregated over different time horizons. Use the arguments periods and
periodsJ to specify the horizons over which the continuous/total volatility and the jump component should be aggregated. Typically, the daily, weekly and monthly frequencies are used, i.e. one, five and twenty-two days, which translates into the default setting periods = c(1,5,22) and periodsJ = c(1,5,22).
Corsi and Reno 2012 propose to further extend the HAR model with a leverage component (the Leverage Heterogeneous Auto-Regressive model). This makes it possible to take into account that negative returns affect future volatility differently than positive returns. By default, the leverage effect is not taken into account and the argument
leverage=NULL. In case you want to mimic the analysis in Corsi and Reno 2012, with leverage components aggregated at the daily, weekly and monthly frequency, just set leverage=c(1,5,22).
For the models with a jump component, a test determines whether the contribution of the jumps is statistically significant. The argument
jumptest should be set to the function name of the test that determines whether the jump variability is significant that day. By default,
jumptest='ABDJumptest', hence using the test statistic in Equation (1) (or Equation (18) of Andersen et al. 2007). The argument
alpha can then be used to set the confidence level used in testing for jumps, which is set to the traditional 5% by default.
Often, the objective will be to model the volatility for the next day, but this needn't be the case. Use the argument
h to set the time horizon over which you want to aggregate the dependent variable, e.g.
h=5 in case you are interested in modeling the weekly realized volatility.
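For example, a sketch of a weekly-horizon fit, reusing the DJI_RV series constructed in the next subsection:
> xweek = harModel(data = DJI_RV, periods = c(1,5,22), RVest = c("rCov"),
+                  type = "HARRV", h = 5, transform = NULL);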
To conclude, we note that in applications of the HAR model the dependent and explanatory variables are sometimes transformed by taking the logarithm or the square root. Set the argument
transform='sqrt' or the name of any other function to transform all variables right before the linear model is estimated. By default, no transformation is done and transform=NULL.
Fitting the HARRV model on the Dow Jones Industrial Average in 2008
As a first step, we load the daily realized volatility for the Dow Jones Industrial Average and select only 2008. The sample dataset realized_library in
highfrequency contains a very rich set of realized volatility measures
computed by Heber et al. 2009.
highfrequency contains only a subset of the full realized library, which can be found on their website: http://realized.oxford-man.ox.ac.uk.
> data(realized_library); # Get sample daily Realized Volatility data
> DJI_RV = realized_library$Dow.Jones.Industrials.Realized.Variance; # Select DJI
> DJI_RV = DJI_RV[!is.na(DJI_RV)]; # Remove NAs
> DJI_RV = DJI_RV['2008'];
As a second step, we estimate the traditional Heterogeneous AutoRegressive model (HAR).
The output of the model is an S3 object. Since the HAR model is simply a special type of linear model,
it is also implemented that way:
the output of the
harModel function is a subclass of
lm, the standard class for linear models.
Figure 2 plots the output object of the
harModel function, with time on the horizontal axis and
the observed and forecasted realized volatility on the vertical axis (this analysis is in-sample, but the estimated coefficients of the model can of course be used for real out-of-sample forecasts). It is clear from a visual inspection of Figure 2 that one of the features of the
harModel is that it adapts relatively fast to changes in the volatility level, which explains its popularity in recent years.
> x = harModel(data = DJI_RV, periods = c(1,5,22), RVest = c("rCov"),
+              type = "HARRV", h = 1, transform = NULL);
> class(x);
[1] "harModel" "lm"
> x;
Model:
RV1 = beta0 + beta1 * RV1 + beta2 * RV5 + beta3 * RV22

Coefficients:
    beta0      beta1      beta2      beta3
4.432e-05  1.586e-01  6.213e-01  8.721e-02

    r.squared  adj.r.squared
       0.4679         0.4608

> summary(x);
Call:
"RV1 = beta0 + beta1 * RV1 + beta2 * RV5 + beta3 * RV22"

Residuals:
       Min         1Q     Median         3Q        Max
-0.0017683 -0.0000626 -0.0000427 -0.0000087  0.0044331

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
beta0 4.432e-05  3.695e-05   1.200   0.2315
beta1 1.586e-01  8.089e-02   1.960   0.0512 .
beta2 6.213e-01  1.362e-01   4.560 8.36e-06 ***
beta3 8.721e-02  1.217e-01   0.716   0.4745
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0004344 on 227 degrees of freedom
Multiple R-squared: 0.4679, Adjusted R-squared: 0.4608
F-statistic: 66.53 on 3 and 227 DF,  p-value: < 2.2e-16

> plot(x);
Fitting the HARRVCJ model on a small example dataset
With highfrequency, it is also easy to estimate more sophisticated versions of the harModel.
For example, the HARRVCJ model discussed in Andersen et al. 2007 can be estimated on a small example dataset as follows:
> data("sample_5minprices_jumps"); > data = sample_5minprices_jumps[,1]; > data = makeReturns(data); #Get the high-frequency return data > x = harModel(data, periods = c(1,5,10), periodsJ=c(1,5,10), + RVest = c("rCov","rBPCov"), type="HARRVCJ", + transform="sqrt"); > x Model: sqrt(RV1) = beta0 + beta1 * sqrt(C1) + beta2 * sqrt(C5) + beta3 * sqrt(C10) + beta4 * sqrt(J1) + beta5 * sqrt(J5) + beta6 * sqrt(J10) Coefficients: beta0 beta1 beta2 beta3 beta4 beta5 -0.8835 1.1957 -25.1922 38.9909 -0.4483 0.8084 beta6 -6.8305 r.squared adj.r.squared 0.9915 0.9661
This last example shows how easily a special type of HAR model can be estimated with solely the intraday returns as input. Note that this last example purely serves a didactic purpose: the high R-squared is explained by the small number of observations (i.e. days) in the dataset.
Shephard and Sheppard 2010 introduced the HEAVY model (High frEquency bAsed VolatilitY) to leverage high-frequency data for volatility modeling. Essentially, the model boils down to two equations: Equation (2) models the close-to-close conditional variance, while Equation (3) models the conditional expectation of the open-to-close variation. Denote by $\mathcal{F}^{HF}_{t-1}$ the past high-frequency data (i.e. all intraday returns) and by $\mathcal{F}^{LF}_{t-1}$ the past low-frequency data (i.e. daily returns). The HEAVY model is based on $\mathcal{F}^{HF}_{t-1}$, in contrast to traditional GARCH models that use only $\mathcal{F}^{LF}_{t-1}$.
Denote the daily returns by $r_1, r_2, \dots, r_T$. Based on the intraday data, one can calculate the daily realized measures
$RM_1, RM_2, \dots, RM_T,$
with $T$ the total number of days in the sample. The HEAVY model is then given by
$$\mathrm{var}\left(r_t \mid \mathcal{F}^{HF}_{t-1}\right) = h_t = \omega + \alpha\, RM_{t-1} + \beta\, h_{t-1}, \qquad (2)$$
$$\mathrm{E}\left(RM_t \mid \mathcal{F}^{HF}_{t-1}\right) = \mu_t = \omega_R + \alpha_R\, RM_{t-1} + \beta_R\, \mu_{t-1}, \qquad (3)$$
or, in matrix notation,
$$\begin{pmatrix} h_t \\ \mu_t \end{pmatrix} = \begin{pmatrix} \omega \\ \omega_R \end{pmatrix} + \begin{pmatrix} 0 & \alpha \\ 0 & \alpha_R \end{pmatrix} \begin{pmatrix} r_{t-1}^2 \\ RM_{t-1} \end{pmatrix} + \begin{pmatrix} \beta & 0 \\ 0 & \beta_R \end{pmatrix} \begin{pmatrix} h_{t-1} \\ \mu_{t-1} \end{pmatrix}. \qquad (4)$$
The heavyModel function in the
highfrequency package implements the above model in a very general way.
It enables the user to obtain estimates for the parameters $(\omega, \omega_R, \alpha, \alpha_R, \beta, \beta_R)$ using quasi-maximum likelihood.
heavyModel takes a $(T \times K)$ matrix containing the data as input, with $T$ the number of days. For the traditional HEAVY model, $K = 2$: the first column contains the squared demeaned daily returns $r_1^2, \dots, r_T^2$ and the second column contains the realized measures $RM_1, \dots, RM_T$.
As you can see in Equation (4), the matrix notation of the HEAVY model contains two parameter matrices, one with the $\alpha$s and one with the $\beta$s. This matrix structure makes clear that the traditional HEAVY model given in (4) is just a special case of a larger set of models that can be written this way. You can use the arguments p and
q of the
heavyModel function to specify, for each of these two matrices, (i) which parameters in the matrix should be estimated, by setting a non-zero integer in that matrix cell, and (ii) how many lags should be included in the model. More formally:
p is a $(K \times K)$ matrix containing the lag lengths for the model innovations. Position $(i,j)$ in the matrix indicates the number of lags in equation $i$ of the model for the innovations in data column $j$. For the traditional HEAVY model introduced above, p is given by (cfr. the matrix with the $\alpha$s in Equation (4)):
$$p = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$
Similarly to the matrix
p, the matrix
q is a $(K \times K)$ matrix containing the lag lengths for the conditional variances. Position $(i,j)$ in the matrix indicates the number of lags in equation $i$ of the model for the conditional variances corresponding to series $j$. For the traditional HEAVY model introduced above, q is given by (cfr. the matrix with the $\beta$s in Equation (4)):
$$q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
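In R, these two matrices can be constructed as follows (a sketch; note that matrix() fills column-wise, and we believe these correspond to the heavyModel defaults):
> p = matrix(c(0, 0, 1, 1), ncol = 2); # alphas: both equations load on lagged RM
> q = matrix(c(1, 0, 0, 1), ncol = 2); # betas: each equation has its own variance lag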
The starting values used in the optimization can be set by the argument
startingvalues. The arguments LB and
UB can be used to set the lower and upper bounds for the parameters. By default, the lower bound is set to zero and the upper bound to infinity. To specify how the estimation should be initialized, you can use the
backcast argument, which is set to the unconditional estimates by default. Finally, the
compconst argument is a boolean indicating how the $\omega$s should be estimated. In case it is TRUE, the $\omega$s are estimated in the optimization; in case it is FALSE, volatility targeting is done and each $\omega$ is just one minus the sum of all relevant $\alpha$s and $\beta$s, multiplied by the unconditional variance.
The output of
heavyModel is a list. Most interestingly, the list item
estparams contains a matrix with the parameter estimates and their names. The order in which the parameters are reported is as follows: first the estimates for $\omega$, then the estimates for the non-zero $\alpha$s with the most recent lags first in case max(p) > 1, then the estimates for the non-zero $\beta$s with the most recent lags first in case max(q) > 1.
The loglikelihood list item reports the total log likelihood; for more insight, the list item
likelihoods contains an xts object with the daily log likelihoods evaluated at the estimated parameters.
The list item convergence provides more information on the convergence of the optimization; see
optim for more information.
Finally, condvar is a $(T \times K)$ xts object containing the conditional variances. For the traditional HEAVY model, Figure 3 for example plots the conditional close-to-close variance for the DJIA for 1996 up to 2009 (see the next subsection for more information).
Fitting the HEAVY model on the Dow Jones Industrial Average from 1996 up to 2009
As a first step, we load the
realized_library (Heber et al. 2009), which contains information on the Dow Jones Industrial Average.
We then select from this library the daily returns and the daily Realized Kernel estimates (Barndorff-Nielsen et al. 2004).
The data matrix that serves as input for the
heavyModel now has the squared daily returns as its first column and the Realized Kernel estimates in the second column. We further set the argument
backcast to the variance of the daily returns and the average realized kernel over the sample period. We now have all the ingredients to estimate the HEAVY model. Based on the output of the model, Figure 3 plots the conditional open-to-close variance, as estimated by the second equation in the model. Note the striking peak in 2008 at the start of the financial crisis.
> # Implementation of the HEAVY model on the DJIA:
> data("realized_library");
> returns = realized_library$Dow.Jones.Industrials.Returns;
> rk = realized_library$Dow.Jones.Industrials.Realized.Kernel;
> returns = returns[!is.na(rk)]; rk = rk[!is.na(rk)]; # Remove NAs
> data = cbind(returns^2, rk);
> backcast = matrix(c(var(returns), mean(rk)), ncol = 1);
>
> startvalues = c(0.004, 0.02, 0.44, 0.41, 0.74, 0.56); # Initial values
> output = heavyModel(data = as.matrix(data, ncol = 2), compconst = FALSE,
+                     startingvalues = startvalues, backcast = backcast);
> output$estparams
             [,1]
omega1 0.01750506
omega2 0.06182249
alpha1 0.45118753
alpha2 0.41204541
beta1  0.73834594
beta2  0.56367558
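A figure like Figure 3 can then be produced along these lines (a sketch; the column selection is an assumption based on the description of condvar above):
> # Second column of condvar: conditional open-to-close variance
> plot(output$condvar[, 2]);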
Trades and quotes are often supplied as separate data objects.
For many research and practical questions related to transaction data, one needs to merge trades and quotes.
Since trades and quotes can be subject to different reporting lags, this is not a straightforward operation (Lee and Ready 1991).
The function matchTradesQuotes can be used for matching trades and quotes.
One should supply the number of seconds by which quotes are registered faster than trades.
Based on the research of Vergote 2005, we set 2 seconds as the default.
Many trades and quotes databases do not indicate whether individual trades are market buy or market sell orders.
highfrequency implements with
getTradeDirection the Lee-Ready rule (Lee and Ready 1991)
to infer the trade direction based on the matched trades and quotes.
Numerous liquidity measures can be calculated based on matched trade and quote data, using the function tqLiquidity
(see Bessembinder 2003, Boehmer 2005, Hasbrouck and Seppi 2001 and Venkataraman 2001 for more information on the implemented liquidity measures).
The main implemented liquidity measures are listed in Table 3 and can be used as arguments of the tqLiquidity function.
Table 3: Implemented liquidity measures.

Argument(s)                        Liquidity measure(s)
es, rs                             Effective and realized spread
value_trade                        Trade value (price × size)
signed_value_trade                 Signed trade value
signed_trade_size                  Signed trade size
di_diff, di_div                    Depth imbalance
pes, prs                           Proportional effective and realized spread
price_impact, prop_price_impact    Price impact and proportional price impact
tspread, pts                       Half traded and proportional half-traded spread
qs, logqs                          Quoted spread and log quoted spread
qsslope, logqslope                 Quoted slope and log quoted slope
The example below illustrates how to: (i) match trades and quotes, (ii) get the trade direction and (iii) calculate liquidity measures.
> # Load data samples
> data("sample_tdata");
> data("sample_qdata");
> # Match the trade and quote data
> tqdata = matchTradesQuotes(sample_tdata, sample_qdata);
> # Display information in tqdata
> colnames(tqdata)[1:6];
[1] "SYMBOL" "EX"     "PRICE"  "SIZE"   "COND"   "CORR"
> colnames(tqdata)[7:12];
[1] "G127"   "BID"    "BIDSIZ" "OFR"    "OFRSIZ" "MODE"
> # Get the inferred trade direction according to the Lee-Ready rule
> x = getTradeDirection(tqdata);
> # Calculate the proportional realized spread:
> prs = tqLiquidity(tqdata, sample_tdata, sample_qdata, type = "prs");
> # Calculate the effective spread:
> es = tqLiquidity(tqdata, type = "es");
We thank Google for the financial support through the Google summer of code 2012 program and Brian G. Peterson for his mentorship. Furthermore, we thank the participants of the 1st R/Rmetrics Summer School and 4th User/Developer Meeting on Computational Finance and Financial Engineering, Switzerland (2010) and the "R/Finance: Applied finance with R" conference, USA, Chicago (2010), for their comments and suggestions. In particular, we thank Christophe Croux, Brian Peterson and Eric Zivot for their insights and help in developing this package. Financial support from the Flemish IWT (Institute for Science and Innovation), the National Bank of Belgium and the Research Fund K.U.Leuven is gratefully acknowledged.