Predicting Abnormal Returns From News Using Text Classification
Ronny Luss (ORFE Department, Princeton University, Princeton, NJ 08544; rluss@princeton.edu)
Alexandre d'Aspremont (ORFE Department, Princeton University, Princeton, NJ 08544; aspremon@princeton.edu)

May 28, 2018

Abstract

We show how text from news articles can be used to predict intraday price movements of financial assets using support vector machines. Multiple kernel learning is used to combine equity returns with text as predictive features to increase classification performance, and we develop an analytic center cutting plane method to solve the kernel learning problem efficiently. We observe that while the direction of returns is not predictable using either text or returns, their size is, with text features producing significantly better performance than historical returns alone.

1 Introduction

Asset pricing models often describe the arrival of novel information by a jump process, but the characteristics of the underlying jump process are only coarsely, if at all, related to the underlying source of information. Similarly, time series models such as ARCH and GARCH have been developed to forecast volatility using asset returns data, but these methods also ignore one key source of market volatility: financial news. Our objective here is to show that text classification techniques allow a much more refined analysis of the impact of news on asset prices.

Empirical studies that examine stock return predictability can be traced back to Fama (1965) among others, who showed that there is no significant autocorrelation in the daily returns of thirty stocks from the Dow-Jones Industrial Average. Similar studies were conducted by Taylor (1986) and Ding et al. (1993), who find significant autocorrelation in squared and absolute returns (i.e. volatility). These effects are also observed in intraday volatility patterns, as demonstrated by Wood et al. (1985) and by Andersen & Bollerslev (1997) on absolute returns. These findings tend to demonstrate that, given solely historical stock returns, future stock returns are not predictable while volatility is.

The impact of news articles has also been studied extensively. Ederington & Lee (1993), for example, studied price fluctuations in interest rate and foreign exchange futures markets following macroeconomic announcements and showed that prices mostly adjusted within one minute of major announcements. Mitchell & Mulherin (1994) aggregated daily announcements by Dow Jones & Company into a single variable and found no correlation with market absolute returns and weak correlation with firm-specific absolute returns. However, Kalev et al. (2004) aggregated intraday news concerning companies listed on the Australian Stock Exchange into an exogenous variable in a GARCH model and found significant predictive power. These findings are attributed to the conditioning of volatility on news. Results were further improved by restricting the type of news articles included.

The most common techniques for forecasting volatility are often based on the Autoregressive Conditional Heteroskedasticity (ARCH) and Generalized ARCH (GARCH) models mentioned above. For example, intraday volatility in foreign exchange and equity markets is modeled with MA-GARCH in Andersen & Bollerslev (1997) and with ARCH in Taylor & Xu (1997).
See Bollerslev et al. (1992) for a survey of ARCH and GARCH models and various other applications. Machine learning techniques such as neural networks and support vector machines have also been used to forecast volatility. Neural networks are used in Malliaris & Salchenberger (1996) to forecast implied volatility of options on the S&P 100 index, and support vector machines are used to forecast volatility of the S&P 500 index using daily returns in Gavrishchaka & Banerjee (2006).

Here, we show that information from press releases can be used to predict intraday abnormal returns with relatively high accuracy. Consistent with Taylor (1986) and Ding et al. (1993), however, the direction of returns is not found to be predictable. We form a text classification problem where press releases are labeled positive if the absolute return jumps at some (fixed) time after the news is made public. Support vector machines (SVM) are used to solve this classification problem using both equity returns and word frequencies from press releases. Furthermore, we use multiple kernel learning (MKL) to optimally combine equity returns with text as predictive features and increase classification performance.

Text classification is a well-studied problem in machine learning (Dumais et al. (1998) and Joachims (2002), among many others, show that SVM significantly outperforms classic methods such as naive Bayes). Initially, naive Bayes classifiers were used in Wuthrich et al. (1998) to do three-class classification of an index using daily returns as labels, with news taken from several sources such as Reuters and The Wall Street Journal. Five-class classification with naive Bayes classifiers is used in Lavrenko et al. (2000) to classify intraday price trends when articles are published on the YAHOO!Finance website. Support vector machines were also used to classify intraday price trends in Fung et al. (2003) using Reuters articles, and in Mittermayer & Knolmayer (2006a) to do four-class classification of stock returns using press releases by PR Newswire. Text classification has also been used to directly predict volatility (see Mittermayer & Knolmayer (2006b) for a survey of trading systems that use text). Recently, Robertson et al. (2007) used SVM to predict whether articles from the Bloomberg service are followed by abnormally large volatility; articles deemed important are then aggregated into a variable and used in a GARCH model similar to Kalev et al. (2004). Kogan et al. (2009) use Support Vector Regression (SVR) to forecast stock return volatility based on the text of SEC-mandated 10-K reports; they found that reports published after the Sarbanes-Oxley Act of 2002 improved forecasts over baseline methods that did not use text. Generating trading rules with genetic programming (GP) is another way to incorporate text into financial trading systems. Trading rules are created in Dempster & Jones (2001) using GP for foreign exchange markets based on technical indicators, and extended in Austin et al. (2004) to combine technical indicators with non-publicly available information. Ensemble methods were used in Thomas (2003) on top of GP to create rules based on headlines posted on Yahoo internet message boards.

Our contribution here is twofold.
First, abnormal returns are predicted using text classification techniques similar to Mittermayer & Knolmayer (2006a). Given a press release, we predict whether or not an abnormal return will occur in the next 10, 20, ..., 250 minutes using text and past absolute returns. The algorithm in Mittermayer & Knolmayer (2006a) uses text to predict whether returns jump up 3%, down 3%, remain within these bounds, or are "unclear" within 15 minutes of a press release. They consider a nine-month subset of the eight years of press releases used here. Our experiments analyze the predictability of absolute returns at many horizons and demonstrate significant initial intraday predictability that decreases throughout the trading day. Second, we optimally combine text information with asset price time series to significantly enhance classification performance using multiple kernel learning (MKL). We use an analytic center cutting plane method (ACCPM) to solve the resulting MKL problem. ACCPM is particularly efficient on problems where the objective function and gradient are hard to evaluate but whose feasible set is simple enough so that analytic centers can be computed efficiently. Furthermore, because it does not suffer from conditioning issues, ACCPM can achieve higher precision targets than other first-order methods.

The rest of the paper is organized as follows. Section 2 details the text classification problem we solve here and provides predictability results using either text or absolute returns as features. Section 3 describes the multiple kernel learning framework and details the analytic center cutting plane algorithm used to solve the resulting optimization problem. Finally, we use MKL to enhance the prediction performance.

2 Predictions with support vector machines

Here, we describe how support vector machines can be used to make binary predictions on equity returns. The experimental setup follows, with results that use text and stock return data separately to make predictions.

2.1 Support vector machines

Support vector machines (SVMs) form a linear classifier by maximizing the distance, known as the margin, between two parallel hyperplanes which separate two groups of data (see Cristianini & Shawe-Taylor (2000) for a detailed reference on SVM). This is illustrated in Figure 1 (right), where the linear classifier, defined by the hyperplane ⟨w, x⟩ + b = 0, is midway between the separating hyperplanes. Given a linear classifier, the margin can be computed explicitly as 2/‖w‖, so finding the maximum margin classifier can be formulated as the linearly constrained quadratic program

\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\epsilon_i\\[2pt]
\text{subject to} & y_i(\langle w, \Phi(x_i)\rangle + b) \geq 1 - \epsilon_i\\[2pt]
& \epsilon_i \geq 0
\end{array}
\tag{1}
\]

in the variables w ∈ R^d, b ∈ R, and ε ∈ R^l, where x_i ∈ R^d is the ith data point with d features, y_i ∈ {−1, 1} is its label, and there are l points. The first constraint dictates that points with equivalent labels are on the same side of the separating hyperplane. The slack variable ε allows data to be misclassified while being penalized at rate C in the objective, so SVMs also handle nonseparable data. The optimal objective value in (1) can be viewed as an upper bound on the probability of misclassification for the given task. These results can be readily extended to nonlinear classification.
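As a concrete illustration (ours, not from the paper, which solves these problems with LIBSVM), the sketch below minimizes the unconstrained hinge-loss form of problem (1), (1/2)‖w‖² + C Σ_i max(0, 1 − y_i(⟨w, x_i⟩ + b)), by subgradient descent; the synthetic data, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on the hinge-loss form of problem (1):
    minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(<w, x_i> + b))."""
    l, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points inside the margin or misclassified
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy example: two separable Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", (np.sign(X @ w + b) == y).mean())
```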
Given a nonlinear classification task, the function Φ : x → Φ(x) maps data from an input space (Figure 1, left) to a linearly separable feature space (Figure 1, right) where linear classification is performed. Problem (1) becomes numerically difficult in high dimensional feature spaces but, crucially, the complexity of solving its dual

\[
\begin{array}{ll}
\text{maximize} & \alpha^T e - \frac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha\\[2pt]
\text{subject to} & \alpha^T y = 0, \quad 0 \leq \alpha \leq C
\end{array}
\tag{2}
\]

in the variables α ∈ R^l, does not depend on the dimension of the feature space. The input to problem (2) is now an l × l matrix K where K_ij = ⟨Φ(x_i), Φ(x_j)⟩. Given K, the mapping Φ need not be specified, hence this l-dimensional linearly constrained quadratic program does not suffer from the high (possibly infinite) dimensionality of the mapping Φ.

[Figure 1: Input space vs. feature space. For nonlinear classification, data is mapped from the input space to the feature space. Linear classification is performed by support vector machines on mapped data in the feature space.]

An explicit classifier can be constructed as a function of K:

\[
f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{l} y_i \alpha_i^* K(x_i, x) + b^*\Big)
\tag{3}
\]

where x_i is the ith training sample in input space, α* solves (2), and b* is computed from the KKT conditions of problem (1). The data features are entirely described by the matrix K, which is called a kernel and must satisfy K ⪰ 0, i.e. K is positive semidefinite (this is called Mercer's condition in machine learning). If K ⪰ 0, then there exists a mapping Φ such that K_ij = ⟨Φ(x_i), Φ(x_j)⟩. Thus, SVMs only require as input a kernel function k : (x_i, x_j) → K_ij such that K ⪰ 0. Table 1 lists several classic kernel functions used in text classification, each corresponding to a different implicit mapping to feature space.

  Linear kernel        k(x_i, x_j) = ⟨x_i, x_j⟩
  Gaussian kernel      k(x_i, x_j) = exp(−‖x_i − x_j‖² / σ)
  Polynomial kernel    k(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d
  Bag-of-words kernel  k(x_i, x_j) = ⟨x_i, x_j⟩ / (‖x_i‖ ‖x_j‖)

Table 1: Several classic kernel functions.

Many efficient algorithms have been developed for solving the quadratic program (2). A common technique uses sequential minimal optimization (SMO), a coordinate descent method in which all but two variables are fixed and the remaining two-dimensional problem is solved explicitly. All experiments in this paper use the LIBSVM (Chang & Lin 2001) package implementing this method.
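The kernels of Table 1 can be passed to a solver as precomputed Gram matrices. The sketch below (our illustration on synthetic data) builds two of them and trains scikit-learn's SVC, which wraps LIBSVM, on the precomputed l × l matrix; prediction then requires the (test × train) kernel block.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X1, X2, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / sigma), as in Table 1."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma)

def bag_of_words_kernel(X1, X2):
    """Cosine-normalized linear kernel: <x_i, x_j> / (||x_i|| ||x_j||)."""
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    return (X1 @ X2.T) / (n1 * n2.T)

rng = np.random.default_rng(0)
X_train, X_test = rng.random((80, 619)), rng.random((20, 619))
y_train = rng.choice([-1, 1], 80)

clf = SVC(C=1.0, kernel="precomputed")
clf.fit(bag_of_words_kernel(X_train, X_train), y_train)    # l x l matrix
y_hat = clf.predict(bag_of_words_kernel(X_test, X_train))  # test x train block
```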
2.2 Data

Data vectors x_i in the following experiments are formed using text features and equity returns features. Text features are extracted from press releases as a bag-of-words. A fixed set of important words, referred to as the dictionary, is predetermined; in this instance, 619 words such as increas, decreas, acqui, lead, up, down, bankrupt, powerful, potential, and integrat are considered. Stems of words are used so that words such as acquired and acquisition are considered identical. We use the following Microsoft press release and its bag-of-words representation in Figure 2 as an example. Here, x_ij is the number of times that the jth word in the dictionary occurs in the ith press release.

LONDON Dec. 12, 2007 Microsoft Corp. has acquired Multimap, one of the United Kingdom's top 100 technology companies and one of the leading online mapping services in the world. The acquisition gives Microsoft a powerful new location and mapping technology to complement existing offerings such as Virtual Earth, Live Search, Windows Live services, MSN and the aQuantive advertising platform, with future integration potential for a range of other Microsoft products and platforms. Terms of the deal were not disclosed.

  increas  decreas  acqui  lead  up  down  bankrupt  powerful  potential  integrat
  0        0        2      1     0   0     0         1         1          1

Figure 2: Example of a Microsoft press release and the corresponding bag-of-words representation. Note that words in the dictionary are stems.

These counts are transformed using term frequency-inverse document frequency (tf-idf) weighting, defined by

\[
\text{TF-IDF}(i,j) = \text{TF}(i,j)\cdot\text{IDF}(i), \qquad \text{IDF}(i) = \log\frac{N}{\text{DF}(i)}
\tag{4}
\]

where TF(i,j) is the number of times that term i occurs in document j (normalized by the number of words in document j) and DF(i) is the number of documents in which term i appears. This weighting increases the importance of words that show up often within a document but decreases the importance of terms that appear in too many documents, because the latter are not useful for discrimination. Other advanced text representations include latent semantic analysis (Deerwester et al. 1990), probabilistic latent semantic analysis (Hofmann 2001), and latent Dirichlet allocation (Blei et al. 2003).
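The weighting (4) differs slightly from common library defaults (scikit-learn's TfidfVectorizer smooths the IDF term, for instance), so the sketch below (ours) applies (4) directly to a term-count matrix; the tiny two-document corpus reuses the Figure 2 counts for illustration.

```python
import numpy as np

def tf_idf(counts):
    """Apply the weighting of equation (4) to a term-count matrix.
    counts[j, i] is the number of occurrences of dictionary term i
    in press release j, as in Figure 2."""
    N = counts.shape[0]
    doc_len = counts.sum(axis=1, keepdims=True)   # words per document
    tf = counts / np.maximum(doc_len, 1)          # normalized term frequency
    df = (counts > 0).sum(axis=0)                 # documents containing term i
    idf = np.log(N / np.maximum(df, 1))           # IDF(i) = log(N / DF(i))
    return tf * idf

counts = np.array([[0, 0, 2, 1, 0, 0, 0, 1, 1, 1],   # the Microsoft example
                   [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]])  # a second, made-up document
print(tf_idf(counts))
```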
As for the equity returns features, x_i corresponds to a time series of 5 returns (taken at 5 minute intervals and calculated with 15 minute lags) based on equity prices leading up to the time when the press release is published. Press releases published before 10:10 am thus do not have sufficient stock price data to create the equity returns features used here, and most experiments will only consider news published after 10:10 am.

Experiments are based on press releases issued during the eight-year period 2000-2007 by PRNewswire. We focus on news related to publicly traded companies that issued at least 500 press releases through PRNewswire in this time frame. Press releases tagged with multiple stock tickers are discarded from the experiments. Intraday price data is taken from the NYSE Trade and Quote Database (TAQ) through Wharton Research Data Services. The eight-year horizon is divided into monthly data. In order to simulate a practical environment, all decision models are calibrated on one year of press release data and used to make predictions on articles released in the following month; thus all tests are out-of-sample. After making predictions on a particular month, the one-year training window slides forward by one month, as does the one-month test window. Price data is used for each press release for a fixed period prior to the release and at each 10 minute interval following the release of the article, up to 250 minutes. When, for example, news is released at 3 pm, price data exists only for 60 minutes following the news (because the business day ends at 4 pm), so this particular article is discarded from experiments that make predictions with time horizons longer than 60 minutes. Overall, this means that training and testing data sizes decrease with the forecasting horizon. Figure 3 displays the overall amount of testing data (left) and the average amount of training and testing data used in each time window (right).

[Figure 3: Aggregate (over all windows) amount of test press releases (left) and average training/testing set per window (right). Average training and testing windows are one year and one month, respectively. Aggregated test data over all windows is used to calculate all performance measures.]

2.3 Performance Measures

Most kernel functions in Table 1 contain parameters requiring calibration. A set of reasonable values for each parameter is chosen, and for each combination of parameter values we perform n-fold cross-validation to optimize the parameter values. Training data is separated into n equal folds. Each fold is pulled out successively, and a model is trained on the remaining data and tested on the extracted fold. A predefined classification performance measure is averaged over the n test folds, and the optimal set of parameters is determined as the one that gives the best performance. Since the distribution of words occurring in press releases may change over time, we perform chronological one-fold cross-validation here: training data is ordered according to release dates, after which a model is trained on all news published before a fixed date and tested on the remaining press releases (the single fold). Several potential measures are defined in Table 2. Note that the SVM problem (2) also has a parameter C that must be calibrated using cross-validation.

Beyond standard accuracy and recall measures, we measure prediction performance with a more financially intuitive metric, the Sharpe ratio, defined here as the ratio of the expected return to the standard deviation of returns for the following (fictitious) trading strategy: every time a news article is released, a bet is made on the stock return, and we either win or lose $1 according to whether or not the prediction is correct. Daily returns are computed as the return of playing this game on each press release published on a given day. The Sharpe ratio is estimated using the mean and standard deviation of these daily returns, then annualized.

  Annualized Sharpe ratio:  √T · E[r] / σ
  Accuracy:                 (TP + TN) / (TP + TN + FP + FN)
  Recall:                   TP / (TP + FN)

Table 2: Performance measures. T is the number of periods per year (12 for monthly, 252 for daily). E[r] is the expected return per period of a given trading strategy, and σ is the standard deviation of r. For binary classification, TP, TN, FP, and FN are, respectively, true positives, true negatives, false positives, and false negatives.

Additional results are given using the classic performance measure accuracy, defined as the percentage of correct predictions made; however, all results are based on cross-validating over Sharpe ratios. Accuracy is displayed due to its intuitive meaning in binary classification, but it has no direct financial interpretation.
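To illustrate the Sharpe ratio of Table 2, the following sketch (ours) scores the fictitious $1-per-press-release betting game from a vector of prediction outcomes; aggregating the bets settled on each day by a simple average, and the simulated 60% accuracy, are our assumptions.

```python
import numpy as np

def annualized_sharpe(correct, day_index, periods_per_year=252):
    """Annualized Sharpe ratio of the $1-per-bet strategy of Section 2.3.
    correct: boolean array, one entry per press release.
    day_index: integer array mapping each press release to its trading day."""
    pnl = np.where(correct, 1.0, -1.0)            # win or lose $1 per bet
    days = np.unique(day_index)
    daily = np.array([pnl[day_index == d].mean() for d in days])
    return np.sqrt(periods_per_year) * daily.mean() / daily.std()

rng = np.random.default_rng(0)
correct = rng.random(1000) < 0.6                  # simulate 60% accuracy
days = rng.integers(0, 250, 1000)
print("annualized Sharpe:", annualized_sharpe(correct, days))
```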
Another potential measure is recall, defined as the percentage of positive data points that are predicted positive. In general, a tradeoff between accuracy and recall would be used as a measure in cross-validation. Here, instead, we trade off risk versus returns by optimizing the Sharpe ratio.

2.4 Predicting equity movements with text or returns

Support vector machines are used here to make predictions on stock returns when news regarding the company is published. In this section, the input feature vector to the SVM is either a bag-of-words text vector or a time series of past equity returns, since the SVM takes only a single feature vector as input. Predictions are considered at every 10 minute interval following the release of an article, up to either a maximum of 250 minutes or the close of the business day; i.e. if the article comes out at 10:30 am, we make predictions on the equity returns at 10:40 am, 10:50 am, ..., until 2:40 pm. Only articles released during the business day are considered here.

Two different classification tasks are performed. In one experiment, the direction of returns is predicted by labeling press releases according to whether the future return is positive or negative. In the other experiment, we predict abnormal returns, defined as an absolute return greater than a predefined threshold. Different thresholds correspond to different classification tasks, and we expect larger jumps to be easier to predict than smaller ones, because the latter may not correspond to true abnormal returns. This will be verified in the experiments below.
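The labeling rule just described can be sketched as follows (our illustration; the simulated returns are placeholders): a press release is labeled positive when the subsequent absolute return exceeds a percentile threshold estimated on the training window.

```python
import numpy as np

def label_abnormal(train_abs_returns, abs_returns, pct=75):
    """Label +1 when the absolute return following a press release exceeds
    the pct-th percentile of absolute returns in the training window."""
    threshold = np.percentile(train_abs_returns, pct)
    return np.where(abs_returns > threshold, 1, -1)

rng = np.random.default_rng(0)
train = np.abs(rng.normal(0, 0.01, 5000))   # training-window absolute returns
test = np.abs(rng.normal(0, 0.01, 100))     # returns following new releases
labels = label_abnormal(train, test, pct=75)
```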
Performance when predicting the direction of equity returns following press releases is displayed in Figure 4 and shows the weakest results, using either a time series of returns (left) or text (right) as features. No predictability is found in the direction of equity returns (the Sharpe ratio is near zero and the accuracy remains close to 50%). This is consistent with the literature regarding stock return predictability. All results displayed here use linear kernels with a single feature type. Instead of the fictitious trading strategy used for abnormal return predictions, the directional results use a buy and sell (or sell and buy) strategy based on the true equity returns. Similar performance using Gaussian kernels was observed in independent experiments.

While predicting the direction of returns is a difficult task, abnormal returns appear to be predictable using either a time series of absolute returns or the text of press releases. Figure 4 shows that a time series of absolute returns contains useful information for intraday predictions (left), while even better predictions can be made using text (right). The threshold for defining abnormal returns in each window is the 75th percentile of absolute returns observed in the training data. As described above, experiments with returns features only use news published after 10:10 am. Thus, performance using text kernels is given both for the full data set, with all press releases published during the business day, and for the reduced data set, to compare against experiments with returns features. Performance on the full data set is also broken down according to press releases published before and after 10:10 am.

The difference between the curves labeled ≥ 10:10 AM and ≥ 10:10 AM2 is that the former trains models using the complete data set, including articles released at the open of the business day, while the latter does not use the first 40 minutes of news to train models. The difference in performance might be attributed to the importance of these articles. The Sharpe ratio using the reduced data set is greater than that for news published before 10:10 am because fewer articles are published in the first 40 minutes than during the remainder of the business day. Note that these very high Sharpe ratios are most likely due to the simple strategy that is traded here; this does not imply that such a high Sharpe ratio can be generated in practice, but rather indicates a potential statistical arbitrage. The decreasing trend observed in all performance measures over the intraday time horizon is intuitive: public information is absorbed into prices over time, hence articles slowly lose their predictive power as the prediction horizon increases.

Figure 5 compares performance in predicting abnormal returns when the threshold is taken at either the 50th or 85th percentile of absolute returns within the training set. Results using linear kernels and annualized Sharpe ratios (using daily returns) are shown here. Decreasing the threshold to the 50th percentile slightly decreases performance when using absolute returns; however, there is a huge decrease in performance when using text. Increasing the threshold to the 85th percentile improves performance relative to the 75th percentile in all measures. This demonstrates the sensitivity of performance with respect to this threshold. The 50th percentile of absolute returns from the data set is not large enough to define a true abnormal return, whereas the 75th and 85th percentiles do define abnormal jumps. Absolute returns are known to have predictability for small movements, but the question remains as to why text is a poor source of information for predicting small jumps. Figure 6 illustrates the impact of this percentile threshold on performance. Predictions are made 20 minutes into the future. For 25-35% of press releases, news has a bigger impact on future returns than past market data.

2.5 Time of Day Effect

Other publicly available information aside from returns and text should be considered when predicting movements of equity returns. The time of day has a strong impact on absolute returns, as demonstrated by Andersen & Bollerslev (1997) for the S&P 500. Figure 7 shows the time of day effect following the release of press releases from the PRNewswire data set. It is clear that absolute returns following press releases published early (and late) in the day are on average much higher than during midday. We use the time stamp of the press release as a feature for making the same predictions as above. A binary feature vector x ∈ R³ is created to label each press release as published before 10:30 am, after 3 pm, or in between. Linear kernels are created from these features and used in SVM for the same experiments as above with absolute returns and text features, and results are displayed in Figure 8. Note that Gaussian kernels have exactly the same performance when using these binary features.
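A minimal sketch (ours) of the three-dimensional binary time-of-day feature described above, with the press release time stamp given in minutes since midnight:

```python
import numpy as np

def time_of_day_feature(minute_of_day):
    """Binary vector in R^3: before 10:30 am, midday, or after 3 pm."""
    x = np.zeros(3)
    if minute_of_day < 10 * 60 + 30:
        x[0] = 1.0   # published before 10:30 am
    elif minute_of_day > 15 * 60:
        x[2] = 1.0   # published after 3 pm
    else:
        x[1] = 1.0   # published midday
    return x

print(time_of_day_feature(9 * 60 + 45))   # 9:45 am -> [1, 0, 0]
```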
As was done for the analysis with text data, performance is shown both when using all press releases published during the business day and on the reduced data set with news published only after 10:10 am (labels are the same as described for text). Training SVM with data from the beginning of the day is clearly important, since the curve labeled ≥ 10:10 AM2 has the weakest performance. The improved performance of the curve labeled ≥ 10:10 AM over ≥ 10:10 AM2 can be attributed to the pattern seen in Figure 7: training with the full data set allows the model to distinguish between absolute returns early in the day versus midday. Similar experiments using day of the week features showed very weak performance and are thus not displayed. While the time of day effect exhibits predictability, note that the experiments with text and absolute returns data do not use any time stamp features, and hence performance with text and absolute returns should not be attributed to any time of day effects. Furthermore, the experiments below for combining the different pieces of publicly available information will show that these time of day effects are less useful than the text and returns data. There are of course other related market microstructure effects that could be useful for predictability, such as the amount of news released throughout the day or the industry of the respective companies.

[Figure 4: Accuracy and annualized daily Sharpe ratio for predicting abnormal returns (Abn) or direction of returns (Dir) using returns and text data with linear kernels. Performance using text is given for both the full data set and the reduced data set that is used for experiments with returns features. The curves labeled ≥ 10:10 AM train models using the complete data set, including articles released at the open of the business day, while the curves labeled ≥ 10:10 AM2 do not use the first 40 minutes of news to train models. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued. The 75th percentile of absolute returns observed in the training data is used as the threshold for defining an abnormal return.]

[Figure 5: Accuracy and annualized daily Sharpe ratio for predicting abnormal returns using returns and text data with linear kernels. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued. The 50th and 85th percentiles of absolute returns observed in the training data are used as thresholds for defining abnormal returns.]
[Figure 6: Accuracy and annualized Sharpe ratio for predicting abnormal returns 20 minutes into the future as the percentile threshold is increased from 50% to 95%. Linear kernels with absolute returns and text are used. For 25-35% of press releases, news has a bigger impact on future returns than past market data.]

2.6 Predicting daily equity movements and trading covered call options

While the main focus is intraday movements, we next use text and absolute returns to make daily predictions on abnormal returns and show how one can trade on these predictions. These experiments use the same text data as above for a subset of 101 companies (daily options data was not obtained for all companies). Returns data is also an intraday time series as above, but is here computed as the 5, 10, ..., 25 minute returns prior to the press release. Daily equity data is obtained from the YAHOO!Finance website, and the options data is obtained using OptionMetrics through Wharton Research Data Services.

[Figure 7: Average absolute (10 minute) returns following press releases published during the business day. Red lines are drawn between business days.]

[Figure 8: Accuracy and annualized daily Sharpe ratio for predicting abnormal returns using time of day. Performance using time of day is given for both the full data set and the reduced data set that is used for experiments with returns features. The curves labeled ≥ 10:10 AM train models using the complete news data set, including articles released at the open of the business day, while the curves labeled ≥ 10:10 AM2 do not use the first 40 minutes of news to train models. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued. The 75th percentile of absolute returns observed in the training data is used as the threshold for defining abnormal returns.]

Rather than the fictitious trading strategy above, delta-hedged covered call options are used to bet on abnormal returns (intraday options data was not available, hence the use of a fictitious strategy above).
In order to bet on the occurrence of an abnormal return, the strategy takes a long position in a call option and, since the bet is not on the direction of the price movement, the position is kept delta neutral by taking a short position in delta shares of stock (delta is defined as the change in the call option price resulting from a $1 increase in the stock price, here taken from the OptionMetrics data). The position is exited the following day by going short the call option and long delta shares of stock. A bet against an abnormal return takes the opposite positions. Equity positions use the closing price following the release of the press release and the closing price the following day. Option prices (buy and sell) use an average of the highest closing bid and lowest closing ask price observed on the day of the press release. To normalize the size of positions, we always take a position in delta times $100 worth of the respective stock and the proper amount of the call option.
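The mechanics of the delta-hedged bet can be sketched as follows (our reading of the strategy described above; the position scaling and the example prices are illustrative assumptions):

```python
def delta_hedged_pnl(bet_abnormal, call_0, call_1, stock_0, stock_1, delta):
    """P&L of one delta-hedged option bet, unwound at the next day's close.
    Betting on an abnormal return: long the calls, short delta shares per
    call; betting against it takes the opposite positions. Positions are
    scaled so that the stock leg is worth delta * $100 at entry."""
    shares = delta * 100.0 / stock_0      # stock leg: delta times $100
    contracts = shares / delta            # option leg that the shares hedge
    pnl_long = contracts * (call_1 - call_0) - shares * (stock_1 - stock_0)
    return pnl_long if bet_abnormal else -pnl_long

# Example: the stock jumps 5% by the next close and the call gains value.
print(delta_hedged_pnl(True, call_0=2.0, call_1=3.1,
                       stock_0=40.0, stock_1=42.0, delta=0.5))
```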
The profit and loss (P&L) of these strategies is displayed in Figure 9 using the equity and options data. The left side shows the P&L of predicting that an abnormal return will occur, and the right side shows the P&L of predicting no price movement. There is a potentially large upside to predicting abnormal returns, but only a limited upside to predicting no movement, while an incorrect prediction of no movement has a potentially large downside. Text features were used in the related experiments, but figures using returns features do exhibit similar patterns.

[Figure 9: Profit and loss (P&L) of trading delta-hedged covered call options. The left figure displays the P&L of trading on predictions that an abnormal return follows the press release, while the right displays the P&L resulting from predictions that no abnormal return occurs. There is a potentially large upside to predicting abnormal returns, but only a limited upside to predicting no movement, while an incorrect prediction of no movement has a potentially large downside. Text features were used in the related experiments, but figures using returns features do exhibit similar patterns.]

Table 3 displays results for three strategies. TRADE ALL makes the appropriate trade based on all predictions, LONG ONLY takes positions only when an abnormal return is predicted, and SHORT ONLY takes positions only when no price movement is predicted. The 75th percentile of absolute returns observed in the training data is used as the threshold for defining abnormal returns. The results imply that the downside of predicting no movement greatly decreases performance. The LONG ONLY strategy performs best due to the large upside and only limited downside. In addition, the number of no movement predictions made using absolute returns features is much larger than when using text; this is likely the cause of the negative Sharpe ratio for TRADE ALL with absolute returns. Results using higher thresholds show similar performance trends, and the associated P&L figures have even clearer U-shaped patterns (not displayed).

  Features     Strategy    Accuracy  Sharpe Ratio  # Trades
  Text         TRADE ALL   .63         .75         3752
  Abs Returns  TRADE ALL   .54       -1.01         3752
  Text         LONG ONLY   .63        2.02         1953
  Abs Returns  LONG ONLY   .54        1.15          597
  Text         SHORT ONLY  .62       -1.28         1670
  Abs Returns  SHORT ONLY  .54       -1.95         3155

Table 3: Performance of delta-hedged covered call option strategies. TRADE ALL makes the appropriate trade based on all predictions, LONG ONLY takes positions only when an abnormal return is predicted, and SHORT ONLY takes positions only when no price movement is predicted. The 75th percentile of absolute returns observed in the training data is used as the threshold for defining abnormal returns.

These results do not account for transaction costs. Separate experiments set the buy and sell option prices to the highest bid and lowest ask closing prices, respectively. The somewhat large spreads mean that the portfolios performed poorly, with uniformly negative Sharpe ratios.

3 Combining text and returns

We now discuss multiple kernel learning (MKL), which provides a method for optimally combining text with returns data in order to make predictions. A cutting plane algorithm amenable to large-scale kernels is described and compared with another recent method for MKL.

3.1 Multiple kernel learning framework

Multiple kernel learning (MKL) seeks to minimize the upper bound on misclassification probability in (1) by learning an optimal linear combination of kernels (see Bousquet & Herrmann (2003), Lanckriet et al. (2004a), Bach et al. (2004), Ong et al. (2005), Sonnenburg et al. (2006), Rakotomamonjy et al. (2008), Zien & Ong (2007), Micchelli & Pontil (2007)). The kernel learning problem as formulated in Lanckriet et al. (2004a) is written

\[
\min_{K \in \mathcal{K}} \; \omega_C(K)
\tag{5}
\]

where ω_C(K) is the minimum of problem (1) and can be viewed as an upper bound on the probability of misclassification. For general sets K, enforcing Mercer's condition (i.e. K ⪰ 0) on the kernel K ∈ K makes kernel learning a computationally challenging task. The MKL problem in Lanckriet et al. (2004a) is a particular instance of kernel learning and solves problem (5) with

\[
\mathcal{K} = \Big\{ K \in \mathbf{S}^n : K = \textstyle\sum_i d_i K_i, \; \sum_i d_i = 1, \; d \geq 0 \Big\}
\tag{6}
\]

where the K_i ⪰ 0 are predefined kernels. Note that cross-validation over kernel parameters is no longer required, because a new kernel is included for each set of desired parameters; however, calibration of the SVM parameter C is still necessary. The kernel learning problem (5) can be written as a semidefinite program when there are no nonnegativity constraints on the kernel weights d in (6), as shown in Lanckriet et al. (2004a), but there are currently no semidefinite programming solvers that can handle large kernel learning problem instances efficiently. The restriction d ≥ 0 enforces Mercer's condition and reduces problem (5) to the quadratically constrained optimization problem

\[
\begin{array}{ll}
\text{maximize} & \alpha^T e - \lambda\\[2pt]
\text{subject to} & \alpha^T y = 0, \quad 0 \leq \alpha \leq C\\[2pt]
& \lambda \geq \frac{1}{2}\alpha^T \mathrm{diag}(y)\, K_i\, \mathrm{diag}(y)\, \alpha \quad \forall i.
\end{array}
\tag{7}
\]

This problem is still numerically challenging for large-scale kernels, and several algorithmic approaches have been tested since the initial formulation in Lanckriet et al. (2004a).
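A small numpy sketch (ours) of the feasible set (6): simplex-weighted combinations of trace-normalized base kernels, which satisfy Mercer's condition whenever the base kernels do; the toy data and weights are arbitrary.

```python
import numpy as np

def combine_kernels(kernels, d):
    """Form K = sum_i d_i K_i with simplex weights d, as in (6)."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0)
    kernels = [K / np.trace(K) for K in kernels]   # unit-trace normalization
    return sum(w * K for w, K in zip(d, kernels))

rng = np.random.default_rng(0)
X = rng.random((50, 10))
K_lin = X @ X.T
K_gauss = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
K = combine_kernels([K_lin, K_gauss], [0.3, 0.7])
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding
```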
The first method, described in Bach et al. (2004), solves a smooth reformulation of the nondifferentiable dual problem obtained by switching the max and min in problem (5),

\[
\begin{array}{ll}
\text{maximize} & \alpha^T e - \max_i \big\{ \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K_i\, \mathrm{diag}(y)\, \alpha \big\}\\[2pt]
\text{subject to} & \alpha^T y = 0, \quad 0 \leq \alpha \leq C
\end{array}
\tag{8}
\]

in the variables α ∈ R^n. A regularization term is added in the primal to problem (8), which makes the dual a differentiable problem with the same constraints as the SVM. A sequential minimal optimization (SMO) algorithm that iteratively optimizes over pairs of variables is used to solve problem (8).

Other approaches for solving larger scale problems are written as a wrapper around an SVM computation. For example, an approach detailed in Sonnenburg et al. (2006) solves the semi-infinite linear program (SILP) formulation

\[
\begin{array}{ll}
\text{maximize} & \lambda\\[2pt]
\text{subject to} & \sum_i d_i = 1, \quad d \geq 0\\[2pt]
& \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\big(\sum_i d_i K_i\big)\mathrm{diag}(y)\,\alpha - \alpha^T e \geq \lambda\\[2pt]
& \quad\text{for all } \alpha \text{ with } \alpha^T y = 0, \; 0 \leq \alpha \leq C
\end{array}
\tag{9}
\]

in the variables λ ∈ R and d ∈ R^K. This problem can be derived from (5) by moving the objective ω_C(K) into the constraints. The algorithm iteratively adds cutting planes to approximate the infinitely many linear constraints until the solution is found. Each cut is found by solving an SVM using the current kernel Σ_i d_i K_i. This formulation is adapted to multiclass MKL in Zien & Ong (2007), where a similar SILP is solved. The latest formulation, in Rakotomamonjy et al. (2008), is

\[
\min_d \; J(d) \quad \text{s.t.} \quad \sum_i d_i = 1, \; d_i \geq 0
\tag{10}
\]

where

\[
J(d) = \max_{0 \leq \alpha \leq C,\; \alpha^T y = 0} \; \alpha^T e - \frac{1}{2}\alpha^T \mathrm{diag}(y)\Big(\textstyle\sum_i d_i K_i\Big)\mathrm{diag}(y)\,\alpha
\tag{11}
\]

is simply the initial formulation of problem (5) with the constraints in (6) plugged in. The authors consider the objective J(d) as a differentiable function of d with gradient

\[
\frac{\partial J}{\partial d_i} = -\frac{1}{2}\,\alpha^{*T} \mathrm{diag}(y)\, K_i\, \mathrm{diag}(y)\, \alpha^*
\tag{12}
\]

where α* is the optimal solution of the SVM using the kernel Σ_i d_i K_i. This becomes a smooth minimization problem subject to box constraints and one linear equality constraint, which is solved using a reduced gradient method with a line search. Each computation of the objective and gradient requires solving an SVM. Experiments in Rakotomamonjy et al. (2008) show this method to be more efficient than the semi-infinite linear program above; more SVMs are required, but warm-starting the SVM makes this method somewhat faster. Still, the reduced gradient method suffers numerically on large kernels, as it requires computing many gradients, hence solving many numerically expensive SVM classification problems.
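Both simpleMKL and the cutting plane method of the next section repeatedly evaluate J(d) from (11) and the gradient (12), each obtained from a single SVM solve. A sketch (ours) using scikit-learn, whose dual_coef_ attribute stores the products y_i α*_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

def mkl_objective_and_gradient(kernels, d, y, C=1000.0):
    """Evaluate J(d) of (11) and its gradient (12) with one SVM solve."""
    K = sum(w * Ki for w, Ki in zip(d, kernels))
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    sv = svm.support_                    # indices of the support vectors
    ya = svm.dual_coef_.ravel()          # entries y_i * alpha_i^*
    quad = lambda Ki: ya @ Ki[np.ix_(sv, sv)] @ ya
    J = np.abs(ya).sum() - 0.5 * quad(K)            # alpha^T e minus quadratic
    grad = np.array([-0.5 * quad(Ki) for Ki in kernels])
    return J, grad

rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = np.where(X[:, 0] > 0.5, 1, -1)
Ks = [X @ X.T, np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))]
print(mkl_objective_and_gradient(Ks, np.array([0.5, 0.5]), y))
```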
3.2 Multiple kernel learning via an analytic center cutting plane method

We next detail a more efficient algorithm for solving problem (10) that requires far fewer SVM computations than gradient descent methods. The analytic center cutting plane method (ACCPM) iteratively reduces the volume of a localizing set L containing the optimum, using cuts derived from a first-order convexity property, until the volume of the reduced localizing set converges to the target precision. At each iteration i, a new center is computed in a smaller localizing set L_i, and a cut through this point is added to split L_i and create L_{i+1}. The method can be modified according to how the center is selected; in our case the center selected is the analytic center of L_i, defined below. Note that this method does not require differentiability but still exhibits linear convergence.

We set L_0 = {d ∈ R^n : Σ_i d_i = 1, d_i ≥ 0}, which we can write as {d ∈ R^n : A_0 d ≤ b_0} (the single equality constraint can be removed by a different parameterization of the problem), to be our first localization set for the optimal solution. Our method is then described as Algorithm 1 below (see Bertsekas (1999) for a more complete reference on cutting plane methods).

Algorithm 1: Analytic center cutting plane method

1: Compute d_{i+1} as the analytic center of L_i = {d ∈ R^n : A_i d ≤ b_i} by solving

\[
d_{i+1} = \mathop{\mathrm{argmin}}_{y \in \mathbf{R}^n} \; -\sum_{j=1}^{m} \log\big(b_j - a_j^T y\big)
\]

where a_j^T represents the jth row of coefficients of A_i, m is the number of rows of A_i, and n is the dimension of d (the number of kernels).

2: Compute ∇J(d) from (12) at the center d_{i+1} and update the (polyhedral) localization set:

\[
L_{i+1} = L_i \cap \{ d \in \mathbf{R}^n : \nabla J(d_{i+1})^T (d - d_{i+1}) \geq 0 \}
\]

3: If m ≥ 3n, reduce the number of constraints to 3n.

4: If gap ≤ ε stop, otherwise go back to step 1.

The complexity of each iteration breaks down as follows.

• Step 1 computes the analytic center of a polyhedron and can be solved in O(n³) operations using interior point methods, for example.

• Step 2 updates the polyhedral description. Computing ∇J(d) requires a single SVM computation, which can be sped up by warm-starting with the SVM solution of the previous iteration.

• Step 3 requires ordering the constraints according to their relevance in the localization set. One relevance measure for the jth constraint at iteration i is

\[
\frac{a_j^T \nabla^2 f(d_i)^{-1} a_j}{(a_j^T d_i - b_j)^2}
\tag{13}
\]

where f is the objective function of the analytic center problem. Computing the Hessian is easy: it requires a matrix multiplication of the form A^T D A, where A is m × n (matrix multiplication is kept inexpensive in this step by pruning redundant constraints) and D is diagonal.

• Step 4: an explicit duality gap can be calculated at no extra cost at each iteration, because we can obtain the dual MKL solution without further computations. The duality gap (as shown in Rakotomamonjy et al. (2008)) is

\[
\max_i \big( \alpha^{*T} \mathrm{diag}(y)\, K_i\, \mathrm{diag}(y)\, \alpha^* \big) - \alpha^{*T} \mathrm{diag}(y)\Big(\textstyle\sum_i d_i K_i\Big)\mathrm{diag}(y)\,\alpha^*
\tag{14}
\]

where α* is the optimal solution of the SVM using the kernel Σ_i d_i K_i.
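A schematic numpy sketch (ours) of steps 1 and 2 of Algorithm 1: a damped Newton method computes the analytic center of {d : A d ≤ b}, and the MKL gradient then adds one cut. It assumes a strictly feasible starting point and omits the constraint pruning of step 3 and the stopping test of step 4.

```python
import numpy as np

def analytic_center(A, b, d0, newton_iters=50):
    """Step 1: damped Newton method minimizing -sum_j log(b_j - a_j^T d)."""
    d = d0.copy()
    for _ in range(newton_iters):
        s = b - A @ d                         # slacks, must remain positive
        g = A.T @ (1.0 / s)                   # gradient of the log barrier
        H = A.T @ np.diag(1.0 / s**2) @ A     # Hessian of the log barrier
        step = np.linalg.solve(H, -g)
        t = 1.0
        while np.any(b - A @ (d + t * step) <= 0):
            t *= 0.5                          # backtrack to stay feasible
        d = d + 0.99 * t * step
    return d

def add_cut(A, b, d_center, grad):
    """Step 2: keep only {d : grad^T (d - d_center) >= 0} in the set."""
    return np.vstack([A, -grad]), np.append(b, -grad @ d_center)

# Localize over the box 0 <= d <= 1 (the simplex after reparameterization).
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
d_c = analytic_center(A, b, np.array([0.3, 0.6]))
A, b = add_cut(A, b, d_c, np.array([1.0, -1.0]))   # a made-up gradient
```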
Complexity. ACCPM is provably convergent in O(n (log 1/ε)²) iterations when using a cut elimination scheme as in Atkinson & Vaidya (1995), which keeps the complexity of the localization set bounded. Other schemes are available with slightly different complexities: O(n²/ε²) is achieved in Goffin & Vial (2002) using (cheaper) approximate centers, for example. In practice, ACCPM usually converges linearly, as seen in Figure 10 (left), which uses kernels of dimension 500 on text data. To illustrate the effect of increasing the number of kernels on the analytic center problem, Figure 10 (right) shows CPU time increasing as the number of kernels increases.

[Figure 10: The convergence semilog plot for ACCPM (left) shows the average duality gap versus iteration number; CPU time for the first 10 iterations is plotted versus the number of kernels (right). Both plots give averages over 20 experiments, with dashed lines at plus and minus one standard deviation.]

In all these experiments, ACCPM converges linearly to a high precision. Gradient methods such as the reduced gradient method used in simpleMKL converge linearly (see Luenberger (2003)) but require expensive line searches. Therefore, while gradient methods may sometimes converge linearly at a faster rate than ACCPM on certain problems, they are often much slower due to the need to solve many SVM problems per iteration. Empirically, gradient methods tend to require many more gradient evaluations than the localization techniques discussed here. ACCPM computes the objective and gradient exactly once per iteration, and the analytic center problem remains relatively cheap with respect to the SVM computation, because the dimension of the analytic centering problem (i.e. the number of kernels) is small in our application. Thresholding small kernel weights in MKL to zero can further reduce the dimension of the analytic center problem.

3.3 Computational Savings

As described above, ACCPM performs one SVM computation per iteration and converges linearly. We compare this method, which we denote accpmMKL, with the simpleMKL algorithm, which uses a reduced gradient method and also converges quickly but computes many more SVMs to perform line searches. The SVMs in the line search are sped up using warm-starting as described in Rakotomamonjy et al. (2008), but in practice we observe that the savings in MKL from warm-starting often do not suffice to make this gradient method more efficient than ACCPM.

Few kernels are usually required in MKL, because most kernels can be eliminated more efficiently beforehand using cross-validation, hence we use several families of kernels (linear, Gaussian, and polynomial) but very few kernels from each family. Each experiment uses one linear kernel and the same number of Gaussian and polynomial kernels, giving a total of 3, 7, or 11 kernels (each normalized to unit trace) in each experiment. We set the duality gap to .01 (a very loose gap) and C to 1000 (after cross-validation for C ranging between 500 and 5000) for each experiment, in order to compare the algorithms on identical problems. For fairness, we compare simpleMKL with our implementation of accpmMKL using the same SVM package as simpleMKL, which allows warm-starting (the SVM package in simpleMKL is based on the SVM-KM toolbox (Canu et al. 2005) and implemented in Matlab). In the final column, we also give the running time for accpmMKL using the LIBSVM solver without warm-starting. The following tables demonstrate computational efficiency and do not show predictive performance; both algorithms solve the same optimization problem with the same stopping criterion. High precision for MKL does not significantly increase prediction performance. Results are averages over 20 experiments done on Linux 64-bit servers with 2.6 GHz CPUs.

Table 4 shows that ACCPM is more efficient for the multiple kernel learning problem in a text classification example. Savings from warm-starting the SVM in simpleMKL do not overcome the benefit of fewer SVM computations at each iteration in accpmMKL. Furthermore, using a faster SVM solver such as LIBSVM produces better performance even without warm-starting.
The number of kernels used in accpmMKL is higher than with simpleMKL because of the very loose duality gap here. The reduced gradient method of simpleMKL often stops at a much higher precision, because the gap is checked after a line search that can achieve high precision in a single iteration, and it is this higher precision that reduces the number of kernels. However, for a slightly higher target precision, simpleMKL will often stall or converge very slowly; the method is very sensitive to the target precision. The accpmMKL method stops at the desired duality gap (meaning more kernels), because the gap is checked at each iteration during the linear convergence; however, its convergence is much more stable and consistent for all data sets. For accpmMKL, the number of SVMs is equivalent to the number of iterations.

Table 5 shows an example where accpmMKL is outperformed by simpleMKL. This occurs when the classification task is extremely easy and the optimal mix of kernels is a singleton; in this case, simpleMKL converges with fewer SVMs. Note, though, that accpmMKL with LIBSVM is still faster here. Both examples illustrate that simpleMKL trains many more SVMs whenever the optimal mix of kernels includes more than one input kernel. Overall, accpmMKL has the advantages of consistent convergence rates for all data sets, fewer SVM computations for relevant data sets, and the ability to achieve high precision targets.

         Max     simpleMKL                          accpmMKL                  accpmMKL
  Dim    # Kern  # Kern  # Iters  # SVMs  Time     # Kern  # SVMs  Time      Time (LIBSVM)
  500    3       2.0     3.4      27.2    48.6     3.0     7.1     13.7      0.6
  500    7       2.6     3.4      39.5    47.9     7.0     12.0    15.5      1.8
  500    11      3.6     3.2      41.0    37.3     10.9    15.3    17.4      3.3
  1000   3       2.0     2.0      29.3    164.5    3.0     6.3     36.7      2.4
  1000   7       2.4     3.6      53.3    240.3    6.8     11.7    40.0      6.8
  1000   11      3.9     3.6      57.8    214.6    10.6    14.9    48.1      12.7
  2000   3       2.0     1.0      24.0    265.8    3.0     5.0     79.4      7.2
  2000   7       3.3     1.5      30.4    209.6    7.0     10.5    110.5     25.2
  2000   11      6.0     2.3      40.5    253.2    11.0    14.4    141.4     46.5
  3000   3       2.0     1.0      24.0    435.5    3.0     6.0     248.9     17.9
  3000   7       4.0     2.0      38.0    591.4    7.0     6.8     221.7     39.0
  3000   11      6.0     2.0      39.8    648.9    11.0    8.0     244.8     66.8

Table 4: Numerical performance of simpleMKL versus accpmMKL for classification on text classification data. accpmMKL outperforms simpleMKL in terms of SVM iterations and time. Using LIBSVM to solve the SVM problems further enhances performance. Results are averages over 20 runs. Experiments are done using the SVM solver in the simpleMKL toolbox, except for the final column, which uses LIBSVM. Time is in seconds. Dim is the number of training samples in each kernel.

         Max     simpleMKL                          accpmMKL                  accpmMKL
  Dim    # Kern  # Kern  # Iters  # SVMs  Time     # Kern  # SVMs  Time      Time (LIBSVM)
  500    3       2.0     1.9      32.8    22.3     2.0     11.1    5.8       0.8
  500    7       1.6     2.8      22.6    19.2     7.0     14.7    3.7       1.9
  500    11      1.0     2.0      11.6    7.1      8.2     20.4    9.1       4.1
  1000   3       2.0     2.0      32.6    70.6     3.0     5.0     8.7       1.5
  1000   7       1.0     2.0      9.9     10.6     7.0     15.7    17.2      8.2
  1000   11      1.0     2.0      11.6    38.4     8.0     21.0    48.6      16.8
  2000   3       1.0     1.0      4.0     36.5     3.0     6.0     41.8      7.0
  2000   7       1.0     2.0      10.3    54.0     7.0     16.0    85.5      34.0
  2000   11      1.0     2.0      12.1    261.7    8.0     21.0    294.8     67.5
  3000   3       1.0     1.0      4.0     89.4     3.0     6.0     100.9     15.1
  3000   7       1.0     2.0      10.5    158.3    7.0     16.0    235.4     79.9
  3000   11      1.0     2.0      12.2    925.9    8.0     21.0    959.5     163.4

Table 5: Numerical performance of simpleMKL versus accpmMKL for classification on UCI Mushroom data. simpleMKL outperforms accpmMKL when the classification task is very easy, as demonstrated by the optimality of a single kernel, but otherwise performs slower.
Experiments are done using the SVM solver in the simpleMKL toolbox, except for the final column, which uses LIBSVM. Time is in seconds. Dim is the number of training instances in each kernel.

3.4 Predicting abnormal returns with text and returns

Multiple kernel learning is used here to combine text with returns data in order to predict abnormal equity returns. Kernels K_1, ..., K_i are created using only text features, as done in Section 2.4, and additional kernels K_{i+1}, ..., K_d are created from a time series of absolute returns. Experiments here use one linear and four Gaussian kernels, each normalized to have unit trace, for each feature type. The MKL problem is solved using K_1, ..., K_d, two linear kernels based on time of day and day of week, and an additional identity matrix in the set K described by (6); hence we obtain a single optimal kernel K* = Σ_i d*_i K_i that is a convex combination of the input kernels (a sketch of this kernel set follows below). The same technique (referred to as data fusion) was applied in Lanckriet et al. (2004b) to combine protein sequences with gene expression data in order to recognize different protein classes.
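A sketch (ours) of how this kernel set can be assembled; the Gaussian bandwidths are placeholders, since the calibrated values are not reported here.

```python
import numpy as np

def unit_trace(K):
    return K / np.trace(K)

def gaussian(X, sigma):
    sq_dists = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma)

def build_kernels(X_text, X_ret, X_tod, X_dow, sigmas=(0.5, 1.0, 2.0, 4.0)):
    """The 13 candidate kernels: linear and 4 Gaussian kernels per feature
    type, linear time-of-day and day-of-week kernels, and the identity."""
    n = X_text.shape[0]
    Ks = [X_text @ X_text.T, X_ret @ X_ret.T]            # linear text/returns
    Ks += [gaussian(X_text, s) for s in sigmas]          # Gaussian text
    Ks += [gaussian(X_ret, s) for s in sigmas]           # Gaussian abs. returns
    Ks += [X_tod @ X_tod.T, X_dow @ X_dow.T, np.eye(n)]  # time stamps, identity
    return [unit_trace(K) for K in Ks]

rng = np.random.default_rng(0)
Ks = build_kernels(rng.random((30, 619)), rng.random((30, 5)),
                   rng.integers(0, 2, (30, 3)).astype(float),   # time of day
                   rng.integers(0, 2, (30, 5)).astype(float))   # day of week
print(len(Ks))   # 13
```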
Performance using the 75th percentile of absolute returns as a threshold for abnormality is displayed in Figure 11. Results from Section 2.4 that use SVM with text and absolute-returns linear kernels are superimposed with the performance when combining text, absolute returns, and time stamps. While predictions using only text or returns exhibit good performance, combining them significantly improves performance in both accuracy and annualized daily Sharpe ratio.

Figure 11: Accuracy and Sharpe ratio using multiple kernels. MKL mixes 13 possible kernels (1 linear text, 1 linear absolute returns, 4 Gaussian text, 4 Gaussian absolute returns, 1 linear time of day, 1 linear day of week, 1 identity matrix). Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued. The 75th percentile of absolute returns observed in the training data is used as the threshold for defining an abnormal return.

We next analyze the impact of the various kernels. Figure 12 displays the optimal kernel weights d_i found from solving (10) at each time horizon (weights are averaged from results over each window). Kernel weights are represented as colored fractions of a single bar of length one. The five kernels with the largest coefficients are two Gaussian text kernels, a linear text kernel, the identity kernel, and one Gaussian absolute-returns kernel. Note that the magnitudes of the coefficients are not perfectly indicative of the importance of the respective features. Hence, the optimal mix of kernels here supports the above evidence that mixing news with absolute returns improves performance. Another important observation is that kernel weights remain relatively constant over time. Each bar of kernel weights corresponds to an independent classification task (i.e. each predicts abnormal returns at a different time in the future), and the persistent kernel weights imply that combining important kernels detects a meaningful signal beyond that found by using only text or returns features.

Figure 12: Optimal kernel coefficients when using 13 possible kernels (1 linear text, 1 linear absolute returns, 4 Gaussian text, 4 Gaussian absolute returns, 1 linear time of day, 1 linear day of week, 1 identity matrix) with the 75th percentile threshold to define abnormal returns. Only the top 5 kernels are labeled. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued.

Figure 13 shows the performance of using multiple kernels for predicting abnormal returns when we change the threshold to the 50th and 85th percentiles of absolute returns in the training data. In both cases, there is a slight improvement in performance over using single kernels. Figure 14 displays the optimal kernel weights for these experiments, and, indeed, both experiments use a mix of text and absolute returns. Previously, text was shown to have more predictability with a higher threshold, while absolute returns performed better with a lower threshold. Kernel weights here versus those with the 75th percentile threshold reflect this observation.

Figure 13: Accuracy and annualized daily Sharpe ratio for predicting abnormal returns using multiple kernels. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued. The 50th and 85th percentiles of absolute returns are used to define abnormal returns.

Figure 14: Optimal kernel coefficients when using 13 possible kernels (1 linear text, 1 linear absolute returns, 4 Gaussian text, 4 Gaussian absolute returns, 1 linear time of day, 1 linear day of week, 1 identity matrix) with the 50th and 85th percentiles as thresholds. Only the top 5 kernels are labeled. Each point z on the x-axis corresponds to predicting an abnormal return z minutes after each press release is issued.

3.5 Sensitivity of MKL

Successful performance using multiple kernel learning is highly dependent on a proper choice of input kernels. Here, we show that high accuracy of the optimal mix of kernels is not crucial for good performance, while including the optimal kernels in the mix is necessary. In addition, we show that MKL is insensitive to the inclusion of kernels carrying no information (such as random kernels). The following four experiments with different kernel sets exemplify these observations. First, only linear kernels using text, absolute returns, time of day, and day of week are included. Next, an equal weighting (d_i = 1/13) of thirteen kernels (one linear and four Gaussian each from text and absolute returns, one linear each for time of day and day of week, and an identity kernel) is used. Another test performs MKL using the same thirteen kernels in addition to three random kernels, and a final experiment uses four bad Gaussian kernels (two text and two absolute returns). A construction of these kernel sets is sketched below.
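The following sketch, with hypothetical data and bandwidths, shows how the four kernel sets above might be assembled; the random kernels are built as ZZ' for Gaussian noise Z, and the "bad" kernels are Gaussian kernels with severely misspecified bandwidths.

    # Minimal sketch (hypothetical data) of the four sensitivity kernel sets.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X_text = rng.random((n, 50))                        # placeholder text features
    X_absret = np.abs(rng.standard_normal((n, 30)))     # placeholder |returns| features
    t_day = rng.random((n, 1))                          # time-of-day feature
    t_week = rng.integers(1, 6, (n, 1)).astype(float)   # day-of-week feature

    def unit_trace(K):
        return K / np.trace(K)

    def gauss(X, sigma):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    # Thirteen-kernel base set: 2 linear + 8 Gaussian + 2 time-stamp + identity.
    base = [unit_trace(X @ X.T) for X in (X_text, X_absret)]
    base += [unit_trace(gauss(X, s)) for X in (X_text, X_absret) for s in (0.5, 1, 2, 4)]
    base += [unit_trace(t @ t.T) for t in (t_day, t_week)]
    base += [unit_trace(np.eye(n))]

    linear_only = [unit_trace(X @ X.T) for X in (X_text, X_absret, t_day, t_week)]
    equal_weight = sum(K / len(base) for K in base)      # fixed d_i = 1/13, no learning

    random_kerns = []
    for _ in range(3):                                   # pure-noise PSD kernels
        Z = rng.standard_normal((n, n))
        random_kerns.append(unit_trace(Z @ Z.T))
    with_random = base + random_kerns                    # 13 + 3 candidate kernels

    bad_kerns = [unit_trace(gauss(X, s))                 # wildly misspecified widths
                 for X in (X_text, X_absret) for s in (1e-3, 1e3)]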
Figure 15 displays the accuracy and Sharpe ratios of these experiments. Performance using only linear kernels is high, since linear kernels achieved performance equivalent to Gaussian kernels using SVM. Adding three random kernels to the mix of thirteen kernels that achieve high performance does not significantly impact the results either; the three random kernels have negligible coefficients across the horizon (not displayed). A noticeable decrease in performance is seen when using equally weighted kernels, while an even more significant decrease is observed when using highly suboptimal kernels. A small data set (using only data after 11 pm) showed an even smaller decrease in performance with equally weighted kernels. This demonstrates that MKL need not be solved to a high tolerance in order to achieve good performance in this application, while it is still, as expected, necessary to include good kernels in the mix.

Figure 15: Accuracy and Sharpe ratio for MKL with different kernel sets. Linear Kerns uses 4 linear kernels. Equal Coeffs uses 13 equally weighted kernels. With Rand Kerns adds 3 random kernels to the 13 kernels. Bad Kerns uses 4 Gaussian kernels with misspecified constants (2 text and 2 absolute returns). The 75th percentile is used as the threshold to define abnormal returns.

4 Conclusion

We found significant performance when predicting abnormal returns using text and absolute returns as features. In addition, multiple kernel learning was introduced to this application and greatly improved performance. Finally, a cutting plane algorithm for solving large-scale MKL problems was described and its efficiency relative to current MKL solvers was demonstrated.

These experiments could of course be further refined by implementing a tradeable strategy based on abnormal return predictions, as done for daily predictions in Section 2.6. Unfortunately, while equity options are liquid assets and would produce realistic performance metrics, intraday options prices are not publicly available.

An important direction for further research is feature selection, i.e. choosing the words in the dictionary. The above experiments use a simple handpicked set of words. Techniques such as recursive feature elimination (RFE-SVM) were used to select words, but performance was similar to the results obtained with the handpicked dictionary; the procedure is sketched below.
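This is a minimal sketch, not the experiment run here, of RFE-SVM word selection with scikit-learn: a linear SVM is refit repeatedly and the lowest-weight words are pruned at each pass. The dictionary, counts, and labels are placeholders.

    # Minimal sketch of RFE-SVM word selection (placeholder data and names).
    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    words = [f"word_{i}" for i in range(100)]                    # hypothetical dictionary
    X = rng.poisson(1.0, size=(500, len(words))).astype(float)   # word-count features
    y = rng.choice([-1, 1], size=500)                            # abnormal-return labels

    selector = RFE(LinearSVC(C=1.0, max_iter=5000), n_features_to_select=20, step=5)
    selector.fit(X, y)
    selected = [w for w, keep in zip(words, selector.support_) if keep]
    print(selected)   # the 20 words the recursive elimination retains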
More advanced methods such as latent semantic analysis, probabilistic latent semantic analysis, and latent Dirichlet allocation should be considered. Additionally, industry-specific dictionaries can be developed and used with the associated subset of companies.

Another natural extension of our work is regression analysis. Support vector regression (SVR) is the regression counterpart to SVM and extends to MKL. Text can be combined with returns in order to forecast both intraday volatility and abnormal returns using SVR and MKL.
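As a pointer for that extension, the following is a minimal sketch of support vector regression on a precomputed combination of text and returns kernels; the data and the fixed weights are placeholders, and an MKL solver would learn the weights rather than fixing them.

    # Minimal sketch of SVR on a precomputed combined kernel (placeholder data).
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    n = 200
    X_text = rng.random((n, 50))
    X_absret = np.abs(rng.standard_normal((n, 30)))
    y = np.abs(rng.standard_normal(n))           # e.g. future absolute return

    def unit_trace(K):
        return K / np.trace(K)

    K_list = [unit_trace(X_text @ X_text.T), unit_trace(X_absret @ X_absret.T)]
    d = [0.6, 0.4]                               # fixed weights; MKL would learn these
    K = sum(di * Ki for di, Ki in zip(d, K_list))

    svr = SVR(kernel="precomputed", C=1.0, epsilon=0.1).fit(K, y)
    print(svr.predict(K[:5]))                    # in-sample fit on 5 points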
Acknowledgements

The authors are grateful to Jonathan Lange and Kevin Fan for superb research assistance. We would also like to acknowledge support from NSF grant DMS-0625352, NSF CDI grant SES-0835550, a NSF CAREER award, a Peek junior faculty fellowship and a Howard B. Wentz Jr. junior faculty award.

References

Andersen, T. G. & Bollerslev, T. (1997), 'Intraday periodicity and volatility persistence in financial markets', Journal of Empirical Finance 4, 115-158.

Atkinson, D. S. & Vaidya, P. M. (1995), 'A cutting plane algorithm for convex programming that uses analytic centers', Mathematical Programming 69, 1-43.

Austin, M. P., Bates, G., Dempster, M. A. H., Leemans, V. & Williams, S. N. (2004), 'Adaptive systems for foreign exchange trading', Quantitative Finance 4, C37-C45.

Bach, F. R., Lanckriet, G. R. G. & Jordan, M. I. (2004), 'Multiple kernel learning, conic duality, and the SMO algorithm', Proceedings of the 21st International Conference on Machine Learning.

Bertsekas, D. (1999), Nonlinear Programming, 2nd Edition, Athena Scientific.

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003), 'Latent Dirichlet allocation', Journal of Machine Learning Research 3, 993-1022.

Bollerslev, T., Chou, R. Y. & Kroner, K. F. (1992), 'ARCH modeling in finance: A review of the theory and empirical evidence', Journal of Econometrics 52, 5-59.

Bousquet, O. & Herrmann, D. J. L. (2003), 'On the complexity of learning the kernel matrix', Advances in Neural Information Processing Systems.

Canu, S., Grandvalet, Y., Guigue, V. & Rakotomamonjy, A. (2005), 'SVM and kernel methods Matlab toolbox', Perception Systemes et Information, INSA de Rouen, Rouen, France.

Chang, C.-C. & Lin, C.-J. (2001), 'LIBSVM: a library for support vector machines'. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cristianini, N. & Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990), 'Indexing by latent semantic analysis', Journal of the American Society for Information Science 41(6), 391-407.

Dempster, M. A. H. & Jones, C. M. (2001), 'A real-time adaptive trading system using genetic programming', Quantitative Finance 1, 397-413.

Ding, Z., Granger, C. W. J. & Engle, R. F. (1993), 'A long memory property of stock market returns and a new model', Journal of Empirical Finance 1, 83-106.

Dumais, S., Platt, J., Heckerman, D. & Sahami, M. (1998), 'Inductive learning algorithms and representations for text categorization', Proceedings of ACM-CIKM98.

Ederington, L. H. & Lee, J. H. (1993), 'How markets process information: News releases and volatility', The Journal of Finance XLVIII(4), 1161-1191.

Fama, E. F. (1965), 'The behavior of stock-market prices', The Journal of Business 38, 34-105.

Fung, G. P. C., Yu, J. X. & Lam, W. (2003), 'Stock prediction: Integrating text mining approach using real-time news', Proceedings of IEEE Conference on Computational Intelligence for Financial Engineering, pp. 395-402.

Gavrishchaka, V. V. & Banerjee, S. (2006), 'Support vector machine as an efficient framework for stock market volatility forecasting', Computational Management Science 3, 147-160.

Goffin, J.-L. & Vial, J.-P. (2002), 'Convex nondifferentiable optimization: A survey focused on the analytic center cutting plane method', Optimization Methods and Software 17(5), 805-867.

Hofmann, T. (2001), 'Unsupervised learning by probabilistic latent semantic analysis', Machine Learning 42, 177-196.

Joachims, T. (2002), Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer Academic Publishers.

Kalev, P. S., Liu, W.-M., Pham, P. K. & Jarnecic, E. (2004), 'Public information arrival and volatility of intraday stock returns', Journal of Banking and Finance 28, 1441-1467.

Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S. & Smith, N. A. (2009), 'Predicting risk from financial reports with regression', Proceedings of the North American Association for Computational Linguistics Human Language Technology Conference, Boulder, CO.

Lanckriet, G. R. G., Bie, T. D., Cristianini, N., Jordan, M. I. & Noble, W. S. (2004b), 'A statistical framework for genomic data fusion', Bioinformatics 20, 2626-2635.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E. & Jordan, M. I. (2004a), 'Learning the kernel matrix with semidefinite programming', Journal of Machine Learning Research 5, 27-72.

Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. & Allan, J. (2000), 'Mining of concurrent text and time series', Proceedings of 6th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining.

Luenberger, D. (2003), Linear and Nonlinear Programming, 2nd Edition, Kluwer Academic Publishers.

Mittermayer, M.-A. & Knolmayer, G. (2006a), 'NewsCATS: A news categorization and trading system', Proceedings of the Sixth International Conference on Data Mining.

Mittermayer, M.-A. & Knolmayer, G. (2006b), 'Text mining systems for predicting the market response to news: A survey', Working Paper No. 184, Institute of Information Systems, Univ. of Bern, Bern.

Malliaris, M. & Salchenberger, L. (1996), 'Using neural networks to forecast the S&P 100 implied volatility', Neurocomputing 10, 183-195.

Micchelli, C. A. & Pontil, M. (2007), 'Feature space perspectives for learning the kernel', Machine Learning 66, 297-319.

Mitchell, M. L. & Mulherin, J. H. (1994), 'The impact of public information on the stock market', The Journal of Finance XLIX(3), 923-950.

Ong, C. S., Smola, A. J. & Williamson, R. C. (2005), 'Learning the kernel with hyperkernels', Journal of Machine Learning Research 6, 1043-1071.

Rakotomamonjy, A., Bach, F., Canu, S. & Grandvalet, Y. (2008), 'SimpleMKL', Journal of Machine Learning Research 9, 2491-2521.

Robertson, C. S., Geva, S. & Wolff, R. C. (2007), 'News aware volatility forecasting: Is the content of news important?', Proc. of the 6th Australasian Data Mining Conference (AusDM'07), Gold Coast, Australia.

Sonnenburg, S., Ratsch, G., Schafer, C. & Scholkopf, B. (2006), 'Large scale multiple kernel learning', Journal of Machine Learning Research 7, 1531-1565.

Taylor, S. (1986), Modelling financial time series, New York, John Wiley & Sons.

Taylor, S. J. & Xu, X. (1997), 'The incremental volatility information in one million foreign exchange quotations', Journal of Empirical Finance 4, 317-340.

Thomas, J. D. (2003), News and Trading Rules, Dissertation, Carnegie Mellon University, Pittsburgh, PA.

Wood, R. A., McInish, T. H. & Ord, J. K. (1985), 'An investigation of transactions data for NYSE stocks', The Journal of Finance XL(3), 723-739.

Wuthrich, B., Cho, V., Leung, S., Perammunetilleke, D., Sankaran, K., Zhang, J. & Lam, W. (1998), 'Daily prediction of major stock indices from textual web data', Proceedings of 4th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining.

Zien, A. & Ong, C. S. (2007), 'Multiclass multiple kernel learning', Proceedings of the 24th International Conference on Machine Learning.