Bayesian regression and Bitcoin


Authors: Devavrat Shah, Kang Zhang

Laboratory for Information and Decision Systems, Department of EECS, Massachusetts Institute of Technology
devavrat@mit.edu, zhangkangj@gmail.com

Abstract—In this paper, we discuss the method of Bayesian regression and its efficacy for predicting price variation of Bitcoin, a recently popularized virtual, cryptographic currency. Bayesian regression refers to utilizing empirical data as proxy to perform Bayesian inference. We utilize Bayesian regression for the so-called "latent source model". Bayesian regression for the "latent source model" was introduced and discussed by Chen, Nikolov and Shah [1] and by Bresler, Chen and Shah [2] for the purpose of binary classification, where they established theoretical as well as empirical efficacy of the method. In this paper, we instead utilize it for predicting a real-valued quantity, the price of Bitcoin. Based on this price prediction method, we devise a simple strategy for trading Bitcoin. The strategy is able to nearly double the investment in less than a 60-day period when run against a real data trace.

I. Bayesian Regression

The problem. We consider the question of regression: we are given n labeled training data points (x_i, y_i) for 1 ≤ i ≤ n, with x_i ∈ R^d and y_i ∈ R for some fixed d ≥ 1. The goal is to use this training data to predict the unknown label y ∈ R for a given x ∈ R^d.

The classical approach. A standard approach from non-parametric statistics (see, for example, [3]) is to assume a model of the following type: the labeled data is generated according to the relation y = f(x) + ε, where ε is an independent random variable representing noise, usually assumed to be Gaussian with mean 0 and (normalized) variance 1. The regression method then boils down to estimating f from the n observations (x_1, y_1), ..., (x_n, y_n) and using this estimate for future prediction. For example, if f(x) = x^T θ*, i.e. f is assumed to be a linear function, then the classical least-squares estimate of θ* (equivalently, of f) is

\[
\hat{\theta}_{\mathrm{LS}} \in \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2. \tag{1}
\]

In the classical setting, d is assumed fixed and n ≫ d, which justifies such an estimator being highly effective. In various modern applications, n ≈ d or even n ≪ d is more realistic, leaving a highly under-determined problem for estimating θ*. Under a reasonable assumption such as 'sparsity' of θ*, i.e. ‖θ*‖₀ ≪ d, where ‖θ*‖₀ = |{i : θ*_i ≠ 0}|, regularized least-squares estimation (also known as Lasso [4]) turns out to be the right solution: for an appropriate choice of λ > 0,

\[
\hat{\theta}_{\mathrm{LASSO}} \in \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 + \lambda \|\theta\|_1. \tag{2}
\]

At this stage, it is worth pointing out that the above framework, with different functional forms, has been extremely successful in practice, and exciting mathematical development has accompanied this progress. The book [3] provides a good overview of this literature. Currently, it is a very active area of research.

Our approach. The key to success for the above stated approach lies in the ability to choose a reasonable parametric function space over which one tries to estimate parameters using observations. In various modern applications (including the one considered in this paper), making such a choice seems challenging, primarily because the data is very high dimensional (e.g. time-series), making the parametric space either too complicated or meaningless. In many such scenarios, however, it seems that there are only a few prominent ways in which the underlying event exhibits itself.
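As a concrete, minimal illustration of the estimators (1) and (2), consider the one-dimensional case d = 1, where both have closed forms (for d > 1 one would use a standard solver); this sketch is for illustration only and is not part of the trading method:

```python
def least_squares_1d(xs, ys):
    """OLS estimate of theta in y = x*theta, as in (1), for d = 1."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def lasso_1d(xs, ys, lam):
    """Lasso estimate, as in (2), for d = 1: soft-thresholding of the
    correlation rho = sum_i x_i*y_i at level lam/2."""
    rho = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if abs(rho) <= lam / 2:
        return 0.0  # the penalty shrinks the coefficient all the way to zero
    shrink = rho - (lam / 2) * (1 if rho > 0 else -1)
    return shrink / sxx
```

For λ = 0 the Lasso estimate coincides with least squares; as λ grows, the coefficient is shrunk toward (and eventually to) zero, which is the 'sparsity' mechanism described above.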
For example, a phrase or collection of words becomes viral on the Twitter social media platform for a few distinct reasons – a public event, a life-changing event for a celebrity, a natural catastrophe, etc. Similarly, there are only a few distinct types of people in terms of their choices of movies – those who like comedies and indie movies, those who like court-room dramas, etc. Such were the insights formalized in works [1] and [2] as the 'latent source model', which we describe formally in the context of the above framework.

There are K distinct latent sources s_1, ..., s_K ∈ R^d; a latent distribution over {1, ..., K} with associated probabilities {μ_1, ..., μ_K}; and K latent distributions over R, denoted P_1, ..., P_K. Each labeled data point (x, y) is generated as follows. Sample an index T ∈ {1, ..., K} with P(T = k) = μ_k for 1 ≤ k ≤ K; set x = s_T + ε, where ε is a d-dimensional independent random variable representing noise, which we shall assume to be Gaussian with mean vector 0 = (0, ..., 0) ∈ R^d and identity covariance matrix; and sample y from R as per the distribution P_T.

Given this model, to predict the label y given the associated observation x, we can utilize the conditional distribution¹ of y given x:

\[
P(y \mid x) = \sum_{k=1}^{K} P(y \mid x, T = k)\, P(T = k \mid x)
\propto \sum_{k=1}^{K} P(y \mid x, T = k)\, P(x \mid T = k)\, P(T = k)
= \sum_{k=1}^{K} P_k(y)\, P(\varepsilon = x - s_k)\, \mu_k
= \sum_{k=1}^{K} P_k(y) \exp\Big(-\tfrac{1}{2}\|x - s_k\|_2^2\Big)\, \mu_k. \tag{3}
\]

Thus, under the latent source model, the problem of regression becomes a very simple Bayesian inference problem. The difficulty, however, is the lack of knowledge of the 'latent' parameters of the source model: K, the sources (s_1, ..., s_K), the probabilities (μ_1, ..., μ_K), and the probability distributions P_1, ..., P_K.
To overcome this challenge, we propose the following simple algorithm: utilize the empirical data as a proxy for estimating the conditional distribution of y given x as given in (3). Specifically, given n data points (x_i, y_i), 1 ≤ i ≤ n, the empirical conditional probability is

\[
P_{\mathrm{emp}}\big(y \mid x\big) = \frac{\sum_{i=1}^{n} \mathbf{1}(y = y_i) \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}{\sum_{i=1}^{n} \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}. \tag{4}
\]

The suggested empirical estimation in (4) has the following implications. In the context of binary classification, y takes values in {0, 1}. Then (4) suggests the following classification rule: compute the ratio

\[
\frac{P_{\mathrm{emp}}\big(y = 1 \mid x\big)}{P_{\mathrm{emp}}\big(y = 0 \mid x\big)} = \frac{\sum_{i=1}^{n} \mathbf{1}(y_i = 1) \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}{\sum_{i=1}^{n} \mathbf{1}(y_i = 0) \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}. \tag{5}
\]

If the ratio is > 1, declare y = 1; else declare y = 0. In general, to estimate the conditional expectation of y given the observation x, (4) suggests

\[
E_{\mathrm{emp}}[y \mid x] = \frac{\sum_{i=1}^{n} y_i \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}{\sum_{i=1}^{n} \exp\big(-\tfrac{1}{4}\|x - x_i\|_2^2\big)}. \tag{6}
\]

The estimation in (6) can be viewed equivalently as a 'linear' estimator: let the vector X(x) ∈ R^n be such that X(x)_i = exp(−¼‖x − x_i‖²₂)/Z(x) with Z(x) = Σ_{i=1}^n exp(−¼‖x − x_i‖²₂), and let y ∈ R^n have i-th component y_i; then ŷ ≡ E_emp[y | x] is

\[
\hat{y} = X(x)\, y. \tag{7}
\]

In this paper, we shall utilize (7) for predicting future variation in the price of Bitcoin. This will further feed into a trading strategy; the details are discussed in Section II.

¹ Here we are assuming that the random variables have well-defined densities over the appropriate space; where appropriate, conditional probabilities effectively represent conditional probability densities.
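The empirical estimator (6)/(7) is simply a kernel-weighted average of the observed labels; a minimal sketch:

```python
import math

def bayesian_regression_estimate(x, xs, ys):
    """E_emp[y | x] of (6): average of the labels y_i, each weighted by
    exp(-||x - x_i||^2 / 4); equivalently the linear estimator (7)."""
    weights = [math.exp(-0.25 * sum((a - b) ** 2 for a, b in zip(x, xi)))
               for xi in xs]
    z = sum(weights)  # the normalizer Z(x)
    return sum(w * y for w, y in zip(weights, ys)) / z
```

A query point near one training point essentially returns that point's label, while a query point equidistant from all training points returns their plain average.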
Related prior work. To begin with, Bayesian inference is foundational, and the use of empirical data as a proxy has been a well-known approach, potentially discovered and re-discovered in a variety of contexts over decades, if not centuries. For example, [5] provides a nice overview of such a method for a specific setting (including classification). The concrete form (4), which results from the assumption of the latent source model, is closely related to the popular rule called 'weighted majority voting' in the literature. Its asymptotic effectiveness is discussed in the literature as well, for example in [6].

The utilization of the latent source model for the purpose of identifying a precise sample complexity for Bayesian regression was first studied in [1]. There, the authors showed the efficacy of such an approach for predicting trends on the Twitter social media platform. For that specific application, the authors had to utilize a noise model different than Gaussian, leading to a minor change in (4) – instead of the quadratic function of the underlying vectors, it was the quadratic function applied to the component-wise logarithm of the vectors; see [1] for further details.

In various modern applications, such as online recommendations, the observations (x_i in the above formalism) are only partially observed. This requires further modification of (4) to make it effective; such a modification was suggested in [2], along with corresponding theoretical guarantees for sample complexity. We note that in both works [1] and [2], Bayesian regression for the latent source model was used primarily for binary classification. In this work, we shall instead utilize it for estimating a real-valued variable.

II. Trading Bitcoin

What is Bitcoin. Bitcoin is a peer-to-peer cryptographic digital currency that was created in 2009 by an unknown person using the alias Satoshi Nakamoto [7], [8].
Bitcoin is unregulated and hence comes with benefits (and potentially a lot of issues): transactions can be done in a frictionless manner – no fees – and anonymously. It can be purchased through exchanges or can be 'mined' by computing/solving complex mathematical/cryptographic puzzles. Currently, 25 Bitcoins are rewarded every 10 minutes (each valued at around US $400 on September 27, 2014). As of September 2014, its daily transaction volume is in the range of US $30–$50 million and its market capitalization has exceeded US $7 billion. With such a large trading volume, it makes sense to think of Bitcoin as a proper financial instrument within any reasonable quantitative trading strategy.

In this paper, our interest is in understanding whether there is 'information' in the historical data related to Bitcoin that can help predict future price variation and thus help develop a profitable quantitative strategy. As mentioned earlier, we shall utilize Bayesian regression inspired by the latent source model for this purpose.

Relevance of the Latent Source Model. Quantitative trading strategies have been extensively studied and applied in the financial industry, although many of them are kept secret. One common approach reported in the literature is technical analysis, which assumes that price movements follow a set of patterns and that one can use past price movements to predict future returns to some extent [9], [10]. Caginalp and Balenovich [11] showed that some patterns emerge from a model involving two distinct groups of traders with different assessments of valuation. Studies found that some empirically developed geometric patterns, such as heads-and-shoulders, triangle, and double-top-and-bottom, can be used to predict future price changes [12], [13], [14].
The latent source model precisely tries to model the existence of such underlying patterns leading to price variation. Trying to develop patterns with the help of a human expert, or trying to identify patterns explicitly in the data, can be challenging and to some extent subjective. Instead, the Bayesian regression approach outlined above allows us to exploit the existence of patterns for the purpose of better prediction without explicitly finding them.

Data. To perform the experiments in this paper, we used data related to price and order book obtained from Okcoin.com – one of the largest exchanges operating in China. The data covers the period between February 2014 and July 2014, with over 200 million raw data points in total. The order book data consists of the 60 best prices at which one is willing to buy or sell at a given point in time. The data points were acquired at an interval of every two seconds. For computational ease, we constructed a new time series with a time interval of 10 seconds; each raw data point was mapped to the closest (future) 10-second point. While this coarsening introduces a slight 'error' in accuracy, it is insignificant since our trading strategy operates at a larger time scale.

Trading Strategy. The trading strategy is very simple: at each time, we maintain a position of +1 Bitcoin, 0 Bitcoin, or −1 Bitcoin. At each time instance, we predict the average price movement over the next 10-second interval, say Δp, using Bayesian regression (precise details explained below). If Δp > t, a threshold, then we buy a bitcoin if the current bitcoin position is ≤ 0; if Δp < −t, then we sell a bitcoin if the current position is ≥ 0; else we do nothing. The time steps at which we make trading decisions are chosen carefully by looking at recent trends.
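The position rule just described (hold +1, 0, or −1 Bitcoin; buy one when Δp exceeds the threshold t from a position ≤ 0, sell one when Δp falls below −t from a position ≥ 0) can be sketched as follows; the function name is illustrative:

```python
def step_position(position, delta_p, t):
    """One trading decision. position is in {-1, 0, +1}; returns
    (new_position, trade) with trade in {+1 buy, -1 sell, 0 hold}."""
    if delta_p > t and position <= 0:
        return position + 1, +1   # buy one bitcoin
    if delta_p < -t and position >= 0:
        return position - 1, -1   # sell one bitcoin
    return position, 0            # do nothing
```

Note that the rule never moves the position outside {−1, 0, +1}: a buy is only allowed from position ≤ 0, and a sell only from position ≥ 0.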
We skip those details, as they do not have a first-order effect on the performance.

Predicting Price Change. The core method for predicting the average price change Δp over a 10-second interval is Bayesian regression as in (7). Given the time series of Bitcoin price variation over an interval of a few months, measured at every 10-second interval, we have a very large time series (or vector). From this historic time series we generate three subsets of time-series data of three different lengths: S_1 of time-length 30 minutes, S_2 of time-length 60 minutes, and S_3 of time-length 120 minutes. Now, at a given point of time, to predict the future change Δp, we use the historical data of three lengths – the previous 30, 60, and 120 minutes – denoted x_1, x_2, and x_3. We use x_j with the historical samples S_j for Bayesian regression (as in (7)) to predict the average price change Δp_j for 1 ≤ j ≤ 3. We also calculate r = (v_bid − v_ask)/(v_bid + v_ask), where v_bid is the total volume people are willing to buy in the top 60 orders and v_ask is the total volume people are willing to sell in the top 60 orders, based on the current order book data. The final estimate Δp is produced as

\[
\Delta p = w_0 + \sum_{j=1}^{3} w_j \Delta p_j + w_4 r, \tag{8}
\]

where w = (w_0, ..., w_4) are learnt parameters. In what follows, we explain how S_j, 1 ≤ j ≤ 3, are collected and how w is learnt; this completes the description of the price change prediction algorithm as well as of the trading strategy.

On finding S_j, 1 ≤ j ≤ 3, and learning w. We divide the entire time duration into three roughly equal-sized periods. We utilize the first period to find the patterns S_j, 1 ≤ j ≤ 3; the second period to learn the parameters w; and the last period to evaluate the performance of the algorithm. The learning of w is done simply by finding the best linear fit over all choices given the selection of S_j, 1 ≤ j ≤ 3.
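The final prediction (8) combines the three regression outputs with the order-book imbalance r through the learnt weights w; a minimal sketch (the helper names are illustrative):

```python
def orderbook_imbalance(v_bid, v_ask):
    """r = (v_bid - v_ask) / (v_bid + v_ask), computed from the total
    bid and ask volumes in the top 60 orders of the book."""
    return (v_bid - v_ask) / (v_bid + v_ask)

def predict_delta_p(w, dps, r):
    """The final estimate (8): Delta p = w0 + sum_j wj * dp_j + w4 * r,
    where dps = (dp_1, dp_2, dp_3) are the three regression estimates."""
    w0, w1, w2, w3, w4 = w
    dp1, dp2, dp3 = dps
    return w0 + w1 * dp1 + w2 * dp2 + w3 * dp3 + w4 * r
```

In the paper the weights w are learnt as the best linear fit on the second data period; here they would simply be passed in after that fit.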
Now, the selection of S_j, 1 ≤ j ≤ 3. For this, we take all possible time series of the appropriate length (effectively vectors of dimension 180, 360, and 720, respectively, for S_1, S_2, and S_3). Each of these forms an x_i (in the notation of the formalism used to describe (7)), and its corresponding label y_i is computed by looking at the average price change in the 10-second interval following the end of the time duration of x_i. This data repository is extremely large. To facilitate computation on a single machine with 128 GB RAM and 32 cores, we clustered the patterns into 100 clusters using the k-means algorithm. From these, we chose the 20 most effective clusters and took representative patterns from them.

The one missing detail is the computation of the 'distance' between a pattern x and x_i – as stated in (7), this is the squared ℓ₂-norm. Computing the ℓ₂-norm is computationally intensive. For faster computation, we use the negative of the 'similarity', defined below, between patterns as the 'distance'.

Definition 1 (Similarity). The similarity between two vectors a, b ∈ R^M is defined as

\[
s(a, b) = \frac{\sum_{z=1}^{M} (a_z - \mathrm{mean}(a))(b_z - \mathrm{mean}(b))}{M\, \mathrm{std}(a)\, \mathrm{std}(b)}, \tag{9}
\]

where mean(a) = (Σ_{z=1}^M a_z)/M and std(a) = ((Σ_{z=1}^M (a_z − mean(a))²)/M)^{1/2} (respectively for b).

In (7), we use exp(c · s(x, x_i)) in place of exp(−‖x − x_i‖²₂/4), with the constant c optimized for better prediction using the fitting data (as for w). We note that this similarity can be computed very efficiently by storing the pre-computed patterns (in S_1, S_2, and S_3) in normalized form (mean 0 and std 1): the computation then effectively boils down to an inner product of vectors, which can be done very efficiently.
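The similarity of Definition 1 is the empirical (Pearson) correlation of the two patterns; a direct sketch of (9):

```python
import math

def similarity(a, b):
    """s(a, b) of (9): mean-centered inner product of a and b,
    scaled by M * std(a) * std(b)."""
    M = len(a)
    mean_a = sum(a) / M
    mean_b = sum(b) / M
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / M)
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / M)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return num / (M * std_a * std_b)
```

On patterns stored in normalized form (mean 0, std 1), this reduces to an inner product divided by M, which is the efficiency observation made above.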
For example, using a straightforward Python implementation, more than 10 million cross-correlations can be computed in 1 second on a 32-core machine with 128 GB RAM.

Fig. 1: The effect of different thresholds on the number of trades, average holding time, and profit.

Fig. 2: The inverse relationship between the average profit per trade and the number of trades.

Results. We simulate the trading strategy described above, in a causal manner, on the last third of the data, covering May 6, 2014 to June 24, 2014, to see how well our strategy does. The training data utilized is all historical (i.e. collected before May 6, 2014). We vary the threshold t and observe how the performance of the strategy changes. As shown in Figures 1 and 2, different thresholds provide different performance. Concretely, as we increase the threshold, the number of trades decreases and the average holding time increases; at the same time, the average profit per trade increases.

Fig. 3: The figure plots two time series – the cumulative profit of the strategy starting May 6, 2014, and the price of Bitcoin. The lower one (in blue) corresponds to the price of Bitcoin, while the other corresponds to the cumulative profit. The scale of the Y-axis on the left corresponds to price, while the scale of the Y-axis on the right corresponds to cumulative profit.

We find that the total profit peaked at 3362 yuan, with 2872 trades in total and an average investment of 3781 yuan. This is roughly an 89% return in 50 days, with a Sharpe ratio of 4.10. To recall, the Sharpe ratio of a strategy over a given time period is defined as follows: let L be the number of trades made during the time interval; let p_1, ..., p_L be the profits (or losses, if negative) made in each of these trades; and let C be the modulus of the difference between the start and end price for this time interval. Then the Sharpe ratio [15] is

\[
\frac{\sum_{\ell=1}^{L} p_\ell - C}{L\, \sigma_p}, \tag{10}
\]

where σ_p = ((1/L) Σ_{ℓ=1}^L (p_ℓ − p̄)²)^{1/2} with p̄ = (Σ_{ℓ=1}^L p_ℓ)/L. Effectively, the Sharpe ratio of a strategy captures how well the strategy performs compared to the risk-free strategy, as well as how consistently it performs.

Figure 3 shows the performance of the best strategy over time. Notably, the strategy performs better in the middle section, when market volatility is high. In addition, the strategy remains profitable even when the price is decreasing in the last part of the testing period.

III. Discussion

Are There Interesting Patterns? The patterns utilized in prediction were clustered using the standard k-means algorithm. The clusters (with high price variation and confidence) were carefully inspected. The cluster centers (means, as defined by the k-means algorithm) are reported in Figure 4. As can be seen, there are "triangle" patterns and "head-and-shoulders" patterns; such patterns have been observed and reported in the technical analysis literature. This seems to suggest that there are indeed such patterns, and it provides evidence for the existence of a latent source model as well as an explanation for the success of our trading strategy.

Fig. 4: Patterns identified resemble the head-and-shoulders and triangle patterns, as shown above.

Scaling of Strategy. The strategy experimented with in this paper holds a minimal position – at most 1 Bitcoin (+ or −). This leads to a near doubling of investment in 50 days. A natural question arises: does this scale for a large volume of investment? Clearly, the number of transactions available at a given price at a given instance will decrease, given that the order book is always finite, and hence linearity of scale is not expected.
On the other hand, if we allow for flexibility in the position (i.e. more than ±1), then it is likely that more profit can be earned. Scaling such a strategy further therefore requires careful research.

Scaling of Computation. To be computationally feasible, we utilized 'representative' prior time series. It is plausible that using all possible time series could have improved the prediction power, and hence the efficacy of the strategy; however, this requires computation at massive scale. Building a scalable computation architecture is feasible in principle, as by design (7) is a trivially parallelizable (and map-reducible) computation. Understanding the role of computation in improving prediction quality remains an important direction for investigation.

Acknowledgments

This work was supported in part by NSF grants CMMI-1335155 and CNS-1161964, and by Army Research Office MURI Award W911NF-11-1-0036.

References

[1] G. H. Chen, S. Nikolov, and D. Shah, "A latent source model for nonparametric time series classification," in Advances in Neural Information Processing Systems, pp. 1088–1096, 2013.
[2] G. Bresler, G. H. Chen, and D. Shah, "A latent source model for online collaborative filtering," in Advances in Neural Information Processing Systems, 2014.
[3] L. Wasserman, All of Nonparametric Statistics. Springer, 2006.
[4] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.
[5] C. M. Bishop and M. E. Tipping, "Bayesian regression and classification," Nato Science Series sub Series III Computer And Systems Sciences, vol. 190, pp. 267–288, 2003.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[7] "Bitcoin." https://bitcoin.org/en/faq.
[8] "What is bitcoin?" http://money.cnn.com/infographic/technology/what-is-bitcoin/.
[9] A. W. Lo and A. C. MacKinlay, "Stock market prices do not follow random walks: Evidence from a simple specification test," Review of Financial Studies, vol. 1, pp. 41–66, 1988.
[10] A. W. Lo and A. C. MacKinlay, Stock Market Prices Do Not Follow Random Walks: Evidence from a Simple Specification Test. Princeton, NJ: Princeton University Press, 1999.
[11] G. Caginalp and D. Balenovich, "A theoretical foundation for technical analysis," Journal of Technical Analysis, 2003.
[12] A. W. Lo, H. Mamaysky, and J. Wang, "Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation," Journal of Finance, vol. 4, 2000.
[13] G. Caginalp and H. Laurent, "The predictive power of price patterns," Applied Mathematical Finance, vol. 5, pp. 181–206, 1988.
[14] C.-H. Park and S. Irwin, "The profitability of technical analysis: A review," AgMAS Project Research Report No. 2004-04, 2004.
[15] W. F. Sharpe, "The Sharpe ratio," Streetwise–the Best of the Journal of Portfolio Management, pp. 169–185, 1998.
