Prediction with a Short Memory
Authors: Vatsal Sharan, Sham Kakade, Percy Liang, Gregory Valiant
Vatsal Sharan (Stanford University) vsharan@stanford.edu
Sham Kakade (University of Washington) sham@cs.washington.edu
Percy Liang (Stanford University) pliang@cs.stanford.edu
Gregory Valiant (Stanford University) valiant@stanford.edu

Abstract

We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I, then a simple Markov model over the most recent I/ε observations obtains expected KL error ε (and hence ℓ1 error √ε) with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by log n, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length O((log n)/ε) windows of observations achieves this error, provided the length of the sequence is d^{Ω((log n)/ε)}, where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with n hidden states, a window length of (log n)/ε is information-theoretically necessary to achieve expected KL error ε, or ℓ1 error √ε.
Second, the d^{Θ((log n)/ε)} samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size d are necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.

1 Memory, Modeling, and Prediction

We consider the problem of predicting the next observation x_t given a sequence of past observations, x_1, x_2, . . . , x_{t−1}, which could have complex and long-range dependencies. This sequential prediction problem is one of the most basic learning tasks and is encountered throughout natural language modeling, speech synthesis, financial forecasting, and a number of other domains that have a sequential or chronological element. The abstract problem has received much attention over the last half century from multiple communities including TCS, machine learning, and coding theory. The fundamental question is: How do we consolidate and reference memories about the past in order to effectively predict the future? Given the immense practical importance of this prediction problem, there has been an enormous effort to explore different algorithms for storing and referencing information about the sequence, which has led to the development of several popular models such as n-gram models and Hidden Markov Models (HMMs). Recently, there has been significant interest in recurrent neural networks (RNNs) [1], which encode the past as a real vector of fixed length that is updated after every observation, and in specific classes of such networks, such as Long Short-Term Memory (LSTM) networks [2, 3]. Other recently popular models that have explicit notions of memory include neural Turing machines [4], memory networks [5], differentiable neural computers [6], attention-based models [7, 8], etc. These models have been quite successful (see e.g.
[9, 10]); nevertheless, consistently learning long-range dependencies, in settings such as natural language, remains an extremely active area of research. In parallel to these efforts to design systems that explicitly use memory, there has been much effort from the neuroscience community to understand how humans and animals are able to make accurate predictions about their environment. Many of these efforts also attempt to understand the computational mechanisms behind the formation of memories (memory “consolidation”) and retrieval [11, 12, 13]. Despite the long history of studying sequential prediction, many fundamental questions remain:

• How much memory is necessary to accurately predict future observations, and what properties of the underlying sequence determine this requirement?
• Must one remember significant information about the distant past, or is a short-term memory sufficient?
• What is the computational complexity of accurate prediction?
• How do answers to the above questions depend on the metric that is used to evaluate prediction accuracy?

Aside from the intrinsic theoretical value of these questions, their answers could serve to guide the construction of effective practical prediction systems, as well as to inform the discussion of the computational machinery of cognition and prediction/learning in nature. In this work, we provide insights into the first three questions. We begin by establishing the following proposition, which addresses the first two questions with respect to the pervasively used metric of average prediction error:

Proposition 1. Let M be any distribution over sequences with mutual information I(M) between the past observations . . . , x_{t−2}, x_{t−1} and future observations x_t, x_{t+1}, . . . .
The best ℓ-th order Markov model, which makes predictions based only on the most recent ℓ observations, predicts the distribution of the next observation with average KL error I(M)/ℓ, or average ℓ1 error √(I(M)/ℓ), with respect to the actual conditional distribution of x_t given all past observations.

The “best” ℓ-th order Markov model is the model which predicts x_t based on the previous ℓ observations, x_{t−ℓ}, . . . , x_{t−1}, according to the conditional distribution of x_t given x_{t−ℓ}, . . . , x_{t−1} under the data generating distribution. If the output alphabet is of size d, then this conditional distribution can be estimated with small error given O(d^{ℓ+1}) sequences drawn from the distribution. Without any additional assumptions on the data generating distribution beyond the bound on the mutual information, it is necessary to observe multiple sequences to make good predictions. This is because the distribution could be highly non-stationary, and could have different behaviors at different times, while still having small mutual information. In some settings, such as the case where the data generating distribution corresponds to observations from an HMM, we will be able to accurately learn this “best” Markov model from a single sequence (see Theorem 1). The intuition behind the statement and proof of this general proposition is the following: at time t, we either predict accurately and are unsurprised when x_t is revealed to us; or, if we predict poorly and are surprised by the value of x_t, then x_t must contain a significant amount of information about the history of the sequence, which can then be leveraged in our subsequent predictions of x_{t+1}, x_{t+2}, etc. In this sense, in every timestep in which our prediction is ‘bad’, we learn some information about the past.
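The “best” ℓ-th order Markov model above has a simple empirical counterpart: estimate, for each length-ℓ window, the distribution of the symbol that follows it from observed frequencies. A minimal sketch, with an illustrative toy sequence and window length (not from the paper):

```python
from collections import Counter, defaultdict

def train_markov(seq, ell):
    """Empirical ell-th order Markov model: for each length-ell window,
    count the symbols that have followed it in the sequence."""
    counts = defaultdict(Counter)
    for i in range(ell, len(seq)):
        counts[tuple(seq[i - ell:i])][seq[i]] += 1
    return counts

def predict(counts, context, alphabet):
    """Empirical distribution of the next symbol given the last ell symbols;
    falls back to uniform if the window was never observed."""
    c = counts.get(tuple(context))
    if not c:
        return {a: 1.0 / len(alphabet) for a in alphabet}
    total = sum(c.values())
    return {a: c[a] / total for a in alphabet}

# Toy usage: a periodic sequence is predicted perfectly once ell spans the period.
seq = list("abcabcabcabcabc")
model = train_markov(seq, ell=2)
dist = predict(model, seq[-2:], alphabet="abc")  # context ('b', 'c')
```

Here `dist` places all its mass on "a", since "bc" has always been followed by "a" in the toy sequence.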
Because the mutual information between the history of the sequence and the future is bounded by I(M), if we were to make I(M) consecutive bad predictions, we would have captured nearly this amount of information about the history, and hence going forward, as long as the window we are using spans these observations, we should expect to predict well. This general proposition, framed in terms of the mutual information of the past and future, has immediate implications for a number of well-studied models of sequential data, such as Hidden Markov Models (HMMs). For an HMM with n hidden states, the mutual information of the generated sequence is trivially bounded by log n, which yields the following corollary to the above proposition. We state this corollary now, as it provides a helpful reference point in our discussion of the more general proposition.

Corollary 1. Suppose observations are generated by a Hidden Markov Model with at most n hidden states. The best (log n)/ε-th order Markov model, which makes predictions based only on the most recent (log n)/ε observations, predicts the distribution of the next observation with average KL error ≤ ε or ℓ1 error ≤ √ε, with respect to the optimal predictor that knows the underlying HMM and has access to all past observations.

In the setting where the observations are generated according to an HMM with at most n hidden states, this “best” ℓ-th order Markov model is easy to learn given a single sufficiently long sequence drawn from the HMM, and corresponds to the naive “empirical” ℓ-th order Markov model (i.e., the (ℓ + 1)-gram model) based on the previous observations. Specifically, this is the model that, given x_{t−ℓ}, x_{t−ℓ+1}, . . . , x_{t−1}, outputs the observed (empirical) distribution of the observation that has followed this length-ℓ sequence. (To predict what comes next in the phrase “. . .
defer the details to the” we look at the previous occurrences of this subsequence, and predict according to the empirical frequency of the subsequent word.) The following theorem makes this claim precise.

Theorem 1. Suppose observations are generated by a Hidden Markov Model with at most n hidden states, and output alphabet of size d. For ε > 1/log^{0.25} n there exists a window length ℓ = O((log n)/ε) and an absolute constant c such that for any T ≥ d^{cℓ}, if t ∈ {1, 2, . . . , T} is chosen uniformly at random, then the expected ℓ1 distance between the true distribution of x_t given the entire history (and knowledge of the HMM), and the distribution predicted by the naive “empirical” ℓ-th order Markov model based on x_0, . . . , x_{t−1}, is bounded by √ε.¹

¹Theorem 1 does not have a guarantee on the average KL loss; such a guarantee is not possible, as the KL loss can be unbounded, for example if there are rare characters which have not been observed so far.

The above theorem states that the window length necessary to predict well is independent of the mixing time of the HMM in question, and holds even if the model does not mix. While the amount of data required to make accurate predictions using length-ℓ windows scales exponentially in ℓ (corresponding to the condition in the above theorem that t is chosen uniformly between 0 and T = d^{O(ℓ)}), our lower bounds, discussed in Section 1.3, argue that this exponential dependence is unavoidable.

1.1 Interpretation of Mutual Information of Past and Future

While the mutual information between the past observations and the future observations is an intuitive parameterization of the complexity of a distribution over sequences, the fact that it is the right quantity is a bit subtle.
It is tempting to hope that this mutual information is a bound on the amount of memory that would be required to store all the information about past observations that is relevant to the distribution of future observations. This is not the case. Consider the following setting: given a joint distribution over random variables X_past and X_future, suppose we wish to define a function f that maps X_past to a binary “advice”/memory string f(X_past), possibly of variable length, such that X_future is independent of X_past given f(X_past). As is shown in Harsha et al. [14], there are joint distributions over (X_past, X_future) such that, even on average, the minimum length of the advice/memory string necessary for the above task is exponential in the mutual information I(X_past; X_future). This setting can also be interpreted as a two-player communication game where one player generates X_past and the other generates X_future given limited communication (i.e., the ability to communicate f(X_past)).² Given the fact that this mutual information is not even an upper bound on the amount of memory that an optimal algorithm (computationally unbounded, and with complete knowledge of the distribution) would require, Proposition 1 might be surprising.

1.2 Implications of Proposition 1 and Corollary 1

These results show that a Markov model (a model that cannot capture long-range dependencies or structure of the data) can predict accurately on any data-generating distribution (even those corresponding to complex models such as RNNs), provided the order of the Markov model scales with the complexity of the distribution, as parameterized by the mutual information between the past and future.
Strikingly, this parameterization is indifferent to whether the dependencies in the sequence are relatively short-range, as in an HMM that mixes quickly, or very long-range, as in an HMM that mixes slowly or does not mix at all. Independent of the nature of these dependencies, provided the mutual information is small, accurate prediction is possible based only on the most recent few observations. (See Figure 1 for a concrete illustration of this result in the setting of an HMM that does not mix and has long-range dependencies.) At a time when increasingly complex models such as recurrent neural networks and neural Turing machines are in vogue, these results serve as a baseline theoretical result. They also help explain the practical success of simple Markov models such as Kneser-Ney smoothing [15, 16] for machine translation and speech recognition systems in the past. Although recent recurrent neural networks have yielded empirical gains (see e.g. [9, 10]), current models still lack the ability to consistently capture long-range dependencies.³ In some settings, such as natural language, capturing such long-range dependencies seems crucial for achieving human-level results.

²It is worth noting that if the advice/memory string s is sampled first, and then X_past and X_future are defined to be random functions of s, then the length of s can be related to I(X_past; X_future) (see [14]). This latter setting where s is generated first corresponds to allowing shared randomness in the two-player communication game; however, this is not relevant to the sequential prediction problem.

Figure 1: A depiction of an HMM on n states that repeats a given length-n binary sequence of outputs, and hence does not mix. Corollary 1 and Theorem 1 imply that accurate prediction is possible based only on short sequences of O(log n) observations.
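The construction in Figure 1 is easy to simulate: the HMM deterministically cycles through its n states and emits a fixed binary pattern, so it never mixes, yet short windows already predict well. A quick numerical sketch (the values of n, the sequence length, and the window length, a small multiple of log₂ n, are illustrative choices, not from the paper):

```python
import math
import random
from collections import Counter, defaultdict

random.seed(0)
n = 256                                      # number of hidden states
pattern = [random.randint(0, 1) for _ in range(n)]

# The Figure 1 HMM deterministically cycles through states 0..n-1 and emits
# pattern[state]; the output is the fixed pattern repeated, so it never mixes.
T = 20 * n
seq = [pattern[t % n] for t in range(T)]

# Empirical window-based predictor with window length ~ 2 * log2(n).
ell = 2 * int(math.log2(n))                  # 16
counts = defaultdict(Counter)
for i in range(ell, T):
    counts[tuple(seq[i - ell:i])][seq[i]] += 1

# In-sample accuracy of predicting the most frequent continuation of each window.
correct = sum(
    counts[tuple(seq[i - ell:i])].most_common(1)[0][0] == seq[i]
    for i in range(ell, T)
)
acc = correct / (T - ell)
```

Despite the chain having period n = 256 and never mixing, `acc` is essentially 1: almost every length-16 window occurs at a unique position of the pattern and so determines the next output.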
Indeed, the main message of a narrative is not conveyed in any single short segment. More generally, higher-level intelligence seems to be about the ability to judiciously decide what aspects of the observation sequence are worth remembering, and to update a model of the world based on these aspects. Thus, for such settings, Proposition 1 can actually be interpreted as a kind of negative result: average error is not a good metric for training and evaluating models, since models such as the Markov model, which are indifferent to the time scale of the dependencies, can still perform well under it as long as the number of dependencies is not too large. It is important to note that average prediction error is the metric that is ubiquitously used in practice, both in the natural language processing domain and elsewhere. Our results suggest that a different metric might be essential to driving progress towards systems that attempt to capture long-range dependencies and leverage memory in meaningful ways. We discuss this possibility of alternate prediction metrics more in Section 1.4. For many other settings, such as financial prediction and lower-level language prediction tasks such as those used in OCR, average prediction error is actually a meaningful metric. For these settings, the result of Proposition 1 is extremely positive: no matter the nature of the dependencies in the financial markets, it is sufficient to learn a Markov model. As one obtains more and more data, one can learn a higher and higher order Markov model, and average prediction accuracy should continue to improve. For these applications, the question now becomes a computational one: the naive approach to learning an ℓ-th order Markov model in a domain with an alphabet of size d might require Ω(d^ℓ) space to store, and as much data to learn. From a computational standpoint, is there a better algorithm?
What properties of the underlying sequence imply that such models can be learned, or approximated, more efficiently or with less data? Our computational lower bounds, described below, provide some perspective on these computational considerations.

³One amusing example is the recent sci-fi short film Sunspring, whose script was automatically generated by an LSTM. Locally, each sentence of the dialogue (mostly) makes sense, though there is no cohesion over longer time frames, and no overarching plot trajectory (despite the brilliant acting).

1.3 Lower bounds

Our positive results show that accurate prediction is possible via an algorithmically simple model (a Markov model that depends only on the most recent observations), which can be learned in an algorithmically straightforward fashion by simply using the empirical statistics of short sequences of examples, compiled over a sufficient amount of data. Nevertheless, the Markov model has d^ℓ parameters, and hence requires an amount of data that scales as Ω(d^ℓ) to learn, where d is a bound on the size of the observation alphabet. This prompts the question of whether it is possible to learn a successful predictor based on significantly less data. We show that, even for the special case where the data sequence is generated from an HMM over n hidden states, this is not possible in general, under a natural complexity-theoretic assumption. An HMM with n hidden states and an output alphabet of size d is defined via only O(n² + nd) parameters, and O(n² + nd) samples are sufficient, from an information-theoretic standpoint, to learn a model that will predict accurately. While learning an HMM is computationally hard (see e.g.
[17]), this raises the question of whether accurate (average) prediction can be achieved via a computationally efficient algorithm and an amount of data significantly less than the d^{Θ(log n)} that the naive Markov model would require. Our main lower bound shows that there exists a family of HMMs for which the d^{Ω((log n)/ε)} sample complexity requirement is necessary for any computationally efficient algorithm that predicts accurately on average, assuming a natural complexity-theoretic assumption. Specifically, we show that this hardness holds provided that the problem of strongly refuting a certain class of CSPs is hard, which was conjectured in Feldman et al. [18] and studied in the related works of Allen et al. [19] and Kothari et al. [20]. See Section 5 for a description of this class and a discussion of the conjectured hardness.

Theorem 2. Assuming the hardness of strongly refuting a certain class of CSPs, for all sufficiently large n and any ε ∈ (1/n^c, 0.1) for some fixed constant c, there exists a family of HMMs with n hidden states and an output alphabet of size d such that any algorithm that runs in time polynomial in d (namely, time f(n, ε) · d^{g(n, ε)} for any functions f, g) and achieves average KL or ℓ1 error ε (with respect to the optimal predictor) for a random HMM in the family must observe d^{Ω((log n)/ε)} observations from the HMM.

As the mutual information of the generated sequence of an HMM with n hidden states is bounded by log n, Theorem 2 directly implies that there are families of data-generating distributions M with mutual information I(M) and observations drawn from an alphabet of size d such that any computationally efficient algorithm requires d^{Ω(I(M)/ε)} samples from M to achieve average error ε.
The above bound holds when d is large compared to log n or I(M), but a different and equally relevant regime is where the alphabet size d is small compared to the scale of dependencies in the sequence (for example, when predicting characters [21]). We show lower bounds in this regime of the same flavor as those of Theorem 2, except based on the problem of learning a noisy parity function; the (very slightly) subexponential algorithm of Blum et al. [22] for this task means that we lose at least a superconstant factor in the exponent in comparison to the positive results of Proposition 1.

Proposition 2. Let f(k) denote a lower bound on the amount of time and samples required to learn parity with noise on uniformly random k-bit inputs. For all sufficiently large n and ε ∈ (1/n^c, 0.1) for some fixed constant c, there exists a family of HMMs with n hidden states such that any algorithm that achieves average prediction error ε (with respect to the optimal predictor) for a random HMM in the family requires at least f(Ω((log n)/ε)) time or samples.

Finally, we also establish the information-theoretic optimality of the results of Proposition 1, in the sense that among (even computationally unbounded) prediction algorithms that predict based only on the most recent ℓ observations, an average KL prediction error of Ω(I(M)/ℓ), and ℓ1 error of Ω(√(I(M)/ℓ)), with respect to the optimal predictor, is necessary.

Proposition 3. There is an absolute constant c < 1 such that for all 0 < ε < 1/4 and sufficiently large n, there exists an HMM with n hidden states such that it is not information-theoretically possible to obtain average KL prediction error less than ε, or ℓ1 error less than √ε (with respect to the optimal predictor), while using only the most recent c(log n)/ε observations to make each prediction.
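Read together, Proposition 1 and Proposition 3 pin down the tradeoff between window length and achievable error. The following display is only a recap of those two statements (worst case over distributions with mutual information I(M)), not a new claim:

```latex
% Proposition 1 (upper bound): the best \ell-th order Markov model achieves
%   average KL error \le I(M)/\ell and average \ell_1 error \le \sqrt{I(M)/\ell}.
% Proposition 3 (lower bound): some HMM with n states (so I(M) \le \log n) forces
%   average KL error \ge \epsilon when only c \log n/\epsilon observations are used.
\text{worst-case avg.\ KL error} \;=\; \Theta\!\left(\frac{I(M)}{\ell}\right),
\qquad
\text{worst-case avg.\ } \ell_1 \text{ error} \;=\; \Theta\!\left(\sqrt{\frac{I(M)}{\ell}}\right).
```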
1.4 Future Directions

As mentioned above, for the settings in which capturing long-range dependencies seems essential, it is worth re-examining the choice of “average prediction error” as the metric used to train and evaluate models. One possibility, which has a more worst-case flavor, is to evaluate the algorithm only at a chosen set of time steps instead of at all time steps. The naive Markov model can then no longer do well just by predicting well on the time steps when prediction is easy. In the context of natural language processing, learning with respect to such a metric intuitively corresponds to training a model to do well with respect to, say, a question answering task instead of a language modeling task. A fertile middle ground between average error (which gives too much reward for correctly guessing common words like “a” and “the”) and worst-case error might be a re-weighted prediction error that provides more reward for correctly guessing less common observations. It seems possible, however, that the techniques used to prove Proposition 1 can be extended to yield analogous statements for such error metrics. In cases where average error is appropriate, given the upper bounds of Proposition 1, it is natural to consider what additional structure might be present that avoids the (conditional) computational lower bounds of Theorem 2. One possibility is a robustness property: for example, the property that a Markov model would continue to predict well even when each observation is obscured or corrupted with some small probability. The lower bound instances rely on parity-based constructions and hence are very sensitive to noise and corruptions. For learning over product distributions, there are well-known connections between noise stability and approximation by low-degree polynomials [23, 24].
Additionally, low-degree polynomials can be learned agnostically over arbitrary distributions via polynomial regression [25]. It is tempting to hope that this thread could be made rigorous by establishing a connection between natural notions of noise stability over arbitrary distributions and accurate low-degree polynomial approximations. Such a connection could lead to significantly better sample complexity requirements for prediction on such “robust” distributions of sequences, perhaps requiring only poly(d, I(M), 1/ε) data. Additionally, such sample-efficient approaches to learning succinct representations of large Markov models may inform the many practical prediction systems that currently rely on Markov models.

1.5 Related Work

Parameter Estimation. It is interesting to compare using a Markov model for prediction with methods that attempt to properly learn an underlying model. For example, method of moments algorithms [26, 27] allow one to estimate a certain class of Hidden Markov models with polynomial sample and computational complexity. These ideas have been extended to learning neural networks [28] and input-output RNNs [29]. Using different methods, Arora et al. [30] showed how to learn certain random deep neural networks. Learning the model directly can result in better sample efficiency, and can also provide insights into the structure of the data. The major drawback of these approaches is that they usually require the true data-generating distribution to be in (or extremely close to) the model family that we are learning. This is a very strong assumption that often does not hold in practice.

Universal Prediction and Information Theory. On the other end of the spectrum is the class of no-regret online learning methods, which assume that the data generating distribution can even be adversarial [31].
However, the nature of these results is fundamentally different from ours: whereas we compare to the perfect model that can look at the infinite past, online learning methods typically compare to a fixed set of experts, which is a much weaker benchmark. We note that information-theoretic tools have also been employed in the online learning literature to show near-optimality of Thompson sampling with respect to a fixed set of experts in the context of online learning with prior information [32]; Proposition 1 can be thought of as an analogous statement about the strong performance of Markov models with respect to the optimal predictions in the context of sequential prediction. There is much work on sequential prediction based on KL error from the information theory and statistics communities. The philosophy of these approaches is often more adversarial, with perspectives ranging from minimum description length [33, 34] to individual sequence settings [35], where no model of the data generating process is assumed. Regarding worst-case guarantees (where there is no data generation process), with regret as the notion of optimality, there is a line of work on both minimax rates and the performance of Bayesian algorithms, the latter of which have favorable guarantees in a sequential setting. Regarding minimax rates, [36] provides an exact characterization of the minimax strategy, though the applicability of this approach is often limited to settings where the number of strategies available to the learner is relatively small (i.e., the normalizing constant in [36] must exist). More generally, there has been considerable work on regret in information-theoretic and statistical settings, such as the works in [35, 37, 38, 39, 40, 41, 42, 43].
Regarding log-loss more broadly, there is considerable work on information consistency (convergence in distribution) and minimax rates with regard to statistical estimation in parametric and non-parametric families [44, 45, 46, 47, 48, 49]. In some of these settings, e.g. minimax risk in parametric, i.i.d. settings, there are characterizations of the regret in terms of mutual information [45]. There is also work on universal lossless data compression algorithms, such as the celebrated Lempel-Ziv algorithm [50]. Here, the setting is rather different, as it is one of coding the entire sequence (in a block setting) rather than minimizing prediction loss.

Sequential Prediction in Practice. Our work was initiated by the desire to understand the role of memory in sequential prediction, and the belief that modeling long-range dependencies is important for complex tasks such as understanding natural language. There have been many proposed models with explicit notions of memory, including recurrent neural networks [51], Long Short-Term Memory (LSTM) networks [2, 3], attention-based models [7, 8], neural Turing machines [4], memory networks [5], differentiable neural computers [6], etc. While some of these models often fail to capture long-range dependencies (for example, in the case of LSTMs, it is not difficult to show that they forget the past exponentially quickly if they are “stable” [1]), the empirical performance in some settings is quite promising (see, e.g., [9, 10]).

2 Proof Sketch of Theorem 1

We provide a sketch of the proof of Theorem 1, which gives stronger guarantees than Proposition 1 but only applies to sequences generated from an HMM. The core of this proof is the following lemma, which guarantees that the Markov model that knows the true marginal probabilities of all short sequences will end up predicting well.
Additionally, the bound on the expected prediction error will hold in expectation over only the randomness of the HMM during the short window, and with high probability over the randomness of when the window begins (our more general results hold in expectation over the randomness of when the window begins). For settings such as financial forecasting, this additional guarantee is particularly pertinent: you do not need to worry about the possibility of choosing an “unlucky” time to begin your trading regime, as long as you plan to trade for a duration that spans an entire short window. Beyond the extra strength of this result for HMMs, the proof approach is intuitive and pleasing, in comparison to the more direct information-theoretic proof of Proposition 1. We first state the lemma and sketch its proof, and then conclude the section by describing how this yields Theorem 1.

Lemma 1. Consider an HMM with n hidden states, let the hidden state at time s = 0 be chosen according to an arbitrary distribution π, and denote the observation at time s by x_s. Let OPT_s denote the conditional distribution of x_s given observations x_0, . . . , x_{s−1} and knowledge of the hidden state at time s = 0. Let M_s denote the conditional distribution of x_s given only x_0, . . . , x_{s−1}, which corresponds to the naive s-th order Markov model that knows only the joint probabilities of sequences of the first s observations. Then with probability at least 1 − 1/n^{c−1} over the choice of initial state, for ℓ = c log n/ε², c ≥ 1 and ε ≥ 1/log^{0.25} n,

\[
\mathbb{E}\Big[ \sum_{s=0}^{\ell-1} \| OPT_s - M_s \|_1 \Big] \le 4\epsilon\ell,
\]

where the expectation is with respect to the randomness in the outputs x_0, . . . , x_{ℓ−1}.
The proof of this lemma hinges on establishing a connection between $OPT_s$ (the Bayes-optimal model that knows the HMM and the initial hidden state $h_0$, and at time $s$ predicts the true distribution of $x_s$ given $h_0, x_0, \ldots, x_{s-1}$) and the naive order-$s$ Markov model $M_s$ that knows the joint probabilities of sequences of $s$ observations (given that the initial state is drawn according to $\pi$) and predicts accordingly. This latter model is precisely the same as the model that knows the HMM and distribution $\pi$ (but not $h_0$), and outputs the conditional distribution of $x_s$ given the observations.

To relate these two models, we proceed via a martingale argument that leverages the following intuition: at each time step, either $OPT_s \approx M_s$, or, if they differ significantly, we expect the $s$-th observation $x_s$ to contain a significant amount of information about the hidden state at time zero, $h_0$, which will then improve $M_{s+1}$. Our submartingale will precisely capture the sense that for any $s$ where there is a significant deviation between $OPT_s$ and $M_s$, we expect the probability of the initial state being $h_0$ conditioned on $x_0, \ldots, x_s$ to be significantly more than the probability of $h_0$ conditioned on $x_0, \ldots, x_{s-1}$. More formally, let $H_0^s$ denote the distribution of the hidden state at time 0 conditioned on $x_0, \ldots, x_s$, and let $h_0$ denote the true hidden state at time 0. Let $H_0^s(h_0)$ be the probability of $h_0$ under the distribution $H_0^s$. We show that the following expression is a submartingale:
$$\log \frac{H_0^s(h_0)}{1 - H_0^s(h_0)} - \frac{1}{2} \sum_{i=0}^{s} \| OPT_i - M_i \|_1^2.$$
The fact that this is a submartingale is not difficult to establish: define $R_s$ as the conditional distribution of $x_s$ given observations $x_0, \cdots, x_{s-1}$ and an initial state drawn according to $\pi$, conditioned on not being at hidden state $h_0$ at time 0.
Note that $M_s$ is a convex combination of $OPT_s$ and $R_s$, hence $\|OPT_s - M_s\|_1 \le \|OPT_s - R_s\|_1$. To verify the submartingale property, note that by Bayes' rule, the change in the log-odds term at any time step $s$ is the log of the ratio of the probability of observing the output $x_s$ according to the distribution $OPT_s$ and the probability of $x_s$ according to the distribution $R_s$. The expectation of this is the KL-divergence between $OPT_s$ and $R_s$, which can be related to the $\ell_1$ error using Pinsker's inequality.

At a high level, the proof then proceeds via concentration bounds (Azuma's inequality) to show that, with high probability, if the error from the first $\ell = c \log n/\epsilon^2$ timesteps is large, then $\log \frac{H_0^{\ell-1}(h_0)}{1 - H_0^{\ell-1}(h_0)}$ is also likely to be large, in which case the posterior distribution of the hidden state, $H_0^{\ell-1}$, will be sharply peaked at the true hidden state $h_0$, unless $h_0$ had negligible mass (less than $n^{-c}$) in distribution $\pi$. There are several slight complications to this approach, including the fact that the submartingale we construct does not necessarily have nicely concentrated or bounded differences, as the first term in the submartingale could change arbitrarily. We address this by noting that the first term should not decrease too much except with tiny probability, as this corresponds to the posterior probability of the true hidden state sharply dropping. For the other direction, we can simply "clip" the deviations to prevent them from exceeding $\log n$ in any timestep, and then show that the submartingale property continues to hold despite this clipping by proving the following modified version of Pinsker's inequality:

Lemma 2.
(Modified Pinsker's inequality.) For any two distributions $\mu(x)$ and $\nu(x)$ defined on $x \in X$, define the $C$-truncated KL-divergence as $\tilde{D}_C(\mu \| \nu) = \mathbb{E}_{\mu}\big[\log \min\big\{\frac{\mu(x)}{\nu(x)}, C\big\}\big]$ for some fixed $C$ such that $\log C \ge 8$. Then $\tilde{D}_C(\mu \| \nu) \ge \frac{1}{2}\|\mu - \nu\|_1^2$.

Given Lemma 1, the proof of Theorem 1 follows relatively easily. Recall that Theorem 1 concerns the expected prediction error at a timestep $t$ drawn uniformly from $\{0, 1, \ldots, d^{c\ell}\}$, based on the model $M_{emp}$ corresponding to the empirical distribution of length-$\ell$ windows that have occurred in $x_0, \ldots, x_t$. The connection between the lemma and the theorem is established by showing that, with high probability, $M_{emp}$ is close to $M_{\hat{\pi}}$, where $\hat{\pi}$ denotes the empirical distribution of the (unobserved) hidden states $h_0, \ldots, h_t$, and $M_{\hat{\pi}}$ is the distribution corresponding to drawing the hidden state $h_0 \leftarrow \hat{\pi}$ and then generating $x_0, x_1, \ldots, x_\ell$. We provide the full proof in Appendix A.

3 Definitions and Notation

Before proving our general Proposition 1, we first introduce the necessary notation. For any random variable $X$, we denote its distribution as $\Pr(X)$. The mutual information between two random variables $X$ and $Y$ is defined as $I(X; Y) = H(Y) - H(Y|X)$, where $H(Y)$ is the entropy of $Y$ and $H(Y|X)$ is the conditional entropy of $Y$ given $X$. The conditional mutual information $I(X; Y | Z)$ is defined as
$$I(X; Y | Z) = H(X|Z) - H(X|Y,Z) = \mathbb{E}_{x,y,z}\Big[\log \frac{\Pr(X|Y,Z)}{\Pr(X|Z)}\Big] = \mathbb{E}_{y,z}\big[D_{KL}(\Pr(X|Y,Z) \,\|\, \Pr(X|Z))\big],$$
where $D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ is the KL-divergence between the distributions $p$ and $q$. Note that we are slightly abusing notation here, as $D_{KL}(\Pr(X|Y,Z) \| \Pr(X|Z))$ should technically be $D_{KL}(\Pr(X|Y=y, Z=z) \| \Pr(X|Z=z))$.
But we will ignore the assignment in the conditioning when it is clear from context. Mutual information obeys the following chain rule: $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1)$.

Given a distribution over infinite sequences $\{x_t\}$ generated by some model $M$, where $x_t$ is a random variable denoting the output at time $t$, we will use the shorthand $x_i^j$ to denote the collection of random variables for the subsequence of outputs $\{x_i, \cdots, x_j\}$. The distribution of $\{x_t\}$ is stationary if the joint distribution of any subset of the sequence of random variables $\{x_t\}$ is invariant with respect to shifts in the time index. Hence $\Pr(x_{i_1}, x_{i_2}, \cdots, x_{i_n}) = \Pr(x_{i_1 + l}, x_{i_2 + l}, \cdots, x_{i_n + l})$ for any $l$ if the process is stationary.

We are interested in studying how well the output $x_t$ can be predicted by an algorithm which only looks at the past $\ell$ outputs. The predictor $A_\ell$ maps a sequence of $\ell$ observations to a predicted distribution of the next observation. We denote the predictive distribution of $A_\ell$ at time $t$ as $Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})$. We refer to the Bayes-optimal predictor using only windows of length $\ell$ as $P_\ell$; hence the prediction of $P_\ell$ at time $t$ is $\Pr(x_t | x_{t-\ell}^{t-1})$. Note that $P_\ell$ is just the naive $\ell$-th order Markov predictor provided with the true distribution of the data. We denote the Bayes-optimal predictor that has access to the entire history of the model as $P_\infty$; the prediction of $P_\infty$ at time $t$ is $\Pr(x_t | x_{-\infty}^{t-1})$. We will evaluate the average performance of the predictions of $A_\ell$ and $P_\ell$ with respect to $P_\infty$ over a long time window $[0 : T-1]$. The crucial property of the distribution that is relevant to our results is the mutual information between past and future observations.
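An empirical version of the $\ell$-th order Markov predictor described above can be built directly from counts of length-$(\ell+1)$ windows. A minimal sketch, assuming a synthetic periodic sequence (the uniform fallback for unseen contexts is our own choice, not from the paper):

```python
from collections import Counter, defaultdict

def markov_predictor(seq, ell, alphabet):
    """Estimate Q(x_t | x_{t-ell}..x_{t-1}) from empirical window frequencies."""
    counts = defaultdict(Counter)
    for t in range(ell, len(seq)):
        counts[tuple(seq[t - ell:t])][seq[t]] += 1
    def predict(context):
        c = counts[tuple(context)]
        total = sum(c.values())
        if total == 0:                      # unseen context: fall back to uniform
            return {a: 1.0 / len(alphabet) for a in alphabet}
        return {a: c[a] / total for a in alphabet}
    return predict

# A sequence that repeats "0,0,1": windows of length 2 suffice to predict it.
seq = [0, 0, 1] * 100
predict = markov_predictor(seq, ell=2, alphabet=[0, 1])
print(predict([0, 0]))   # after "0,0" the next symbol is always 1
print(predict([0, 1]))   # after "0,1" the next symbol is always 0
```

This is exactly the kind of "trivial" window-based predictor whose sample requirements Theorem 1 controls.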
For a stochastic process $\{x_t\}$ generated by some model $M$, we define the mutual information $I(M)$ of the model $M$ as the mutual information between the past and future, averaged over the window $[0 : T-1]$:
$$I(M) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} I(x_{-\infty}^{t-1}; x_t^{\infty}). \tag{3.1}$$
If the process $\{x_t\}$ is stationary, then $I(x_{-\infty}^{t-1}; x_t^{\infty})$ is the same for all time steps, hence $I(M) = I(x_{-\infty}^{-1}; x_0^{\infty})$. If the average does not converge and hence the limit in (3.1) does not exist, then we can define $I(M, [0 : T-1])$ as the mutual information for the window $[0 : T-1]$, and the results hold true with $I(M)$ replaced by $I(M, [0 : T-1])$.

We now define the metrics we consider to compare the predictions of $P_\ell$ and $A_\ell$ with respect to $P_\infty$. Let $F(P, Q)$ be some measure of distance between two predictive distributions. In this work, we consider the KL-divergence, $\ell_1$ distance and the relative zero-one loss between the two distributions. The KL-divergence and $\ell_1$ distance between two distributions are defined in the standard way. We define the relative zero-one loss as the difference between the zero-one loss of the optimal predictor $P_\infty$ and that of the algorithm $A_\ell$. We define the expected loss of any predictor $A_\ell$ with respect to the optimal predictor $P_\infty$ and a loss function $F$ as follows:
$$\delta_F^{(t)}(A_\ell) = \mathbb{E}_{x_{-\infty}^{t-1}}\Big[F\big(\Pr(x_t | x_{-\infty}^{t-1}),\, Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big)\Big], \qquad \delta_F(A_\ell) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \delta_F^{(t)}(A_\ell).$$
We also define $\hat{\delta}_F^{(t)}(A_\ell)$ and $\hat{\delta}_F(A_\ell)$ for the algorithm $A_\ell$ in the same fashion, as the error in estimating $\Pr(x_t | x_{t-\ell}^{t-1})$, the true conditional distribution of the model $M$:
$$\hat{\delta}_F^{(t)}(A_\ell) = \mathbb{E}_{x_{t-\ell}^{t-1}}\Big[F\big(\Pr(x_t | x_{t-\ell}^{t-1}),\, Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big)\Big], \qquad \hat{\delta}_F(A_\ell) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \hat{\delta}_F^{(t)}(A_\ell).$$
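The three loss functions $F$ above are easy to compute for a single pair of predictive distributions; a small sketch follows. The relative zero-one loss implementation is one concrete reading of the definition (each predictor outputs its mode, and the next symbol is drawn from the optimal predictor's distribution):

```python
import numpy as np

def kl(p, q):
    """KL-divergence D(p || q), with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def l1(p, q):
    """l1 distance between two distributions."""
    return float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def relative_zero_one(p_opt, q):
    """Zero-one loss of predicting argmax(q) minus that of predicting
    argmax(p_opt), when the next symbol is actually drawn from p_opt."""
    p_opt = np.asarray(p_opt, float)
    return float(p_opt[np.argmax(p_opt)] - p_opt[np.argmax(q)])

p = [0.6, 0.4]; q = [0.4, 0.6]
print(kl(p, q), l1(p, q), relative_zero_one(p, q))
# Pinsker's inequality, l1(p, q)**2 <= 2 * kl(p, q), holds for this pair.
```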
4 Predicting Well with Short Windows

To establish our general proposition, which applies beyond the HMM setting, we provide an elementary and purely information-theoretic proof.

Proposition 1. For any data-generating distribution $M$ with mutual information $I(M)$ between past and future observations, the best $\ell$-th order Markov model $P_\ell$ obtains average KL-error $\delta_{KL}(P_\ell) \le I(M)/\ell$ with respect to the optimal predictor with access to the infinite history. Also, any predictor $A_\ell$ with $\hat{\delta}_{KL}(A_\ell)$ average KL-error in estimating the joint probabilities over windows of length $\ell$ gets average error $\delta_{KL}(A_\ell) \le I(M)/\ell + \hat{\delta}_{KL}(A_\ell)$.

Proof. We bound the expected error by splitting the time interval $0$ to $T-1$ into blocks of length $\ell$. Consider any block starting at time $\tau$. We find the average error of the predictor from time $\tau$ to $\tau + \ell - 1$, and then average across all blocks. To begin, note that we can decompose the error as the sum of the error due to not knowing the past history beyond the most recent $\ell$ observations, and the error in estimating the true joint distribution of the data over a length-$\ell$ block. Consider any time $t$, and recall the definition of $\delta_{KL}^{(t)}(A_\ell)$:
$$\delta_{KL}^{(t)}(A_\ell) = \mathbb{E}_{x_{-\infty}^{t-1}}\Big[D_{KL}\big(\Pr(x_t | x_{-\infty}^{t-1}) \,\|\, Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big)\Big]$$
$$= \mathbb{E}_{x_{-\infty}^{t-1}}\Big[D_{KL}\big(\Pr(x_t | x_{-\infty}^{t-1}) \,\|\, \Pr(x_t | x_{t-\ell}^{t-1})\big)\Big] + \mathbb{E}_{x_{-\infty}^{t-1}}\Big[D_{KL}\big(\Pr(x_t | x_{t-\ell}^{t-1}) \,\|\, Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big)\Big]$$
$$= \delta_{KL}^{(t)}(P_\ell) + \hat{\delta}_{KL}^{(t)}(A_\ell).$$
Therefore, $\delta_{KL}(A_\ell) = \delta_{KL}(P_\ell) + \hat{\delta}_{KL}(A_\ell)$. It is easy to verify that $\delta_{KL}^{(t)}(P_\ell) = I(x_{-\infty}^{t-\ell-1}; x_t | x_{t-\ell}^{t-1})$.
This relation formalizes the intuition that the current output ($x_t$) has significant extra information about the past ($x_{-\infty}^{t-\ell-1}$) if we cannot predict it as well using the $\ell$ most recent observations ($x_{t-\ell}^{t-1}$) as can be done using the entire past ($x_{-\infty}^{t-1}$). We now upper bound the total error for the window $[\tau, \tau + \ell - 1]$. We expand $I(x_{-\infty}^{\tau-1}; x_\tau^{\infty})$ using the chain rule:
$$I(x_{-\infty}^{\tau-1}; x_\tau^{\infty}) = \sum_{t=\tau}^{\infty} I(x_{-\infty}^{\tau-1}; x_t | x_\tau^{t-1}) \ge \sum_{t=\tau}^{\tau+\ell-1} I(x_{-\infty}^{\tau-1}; x_t | x_\tau^{t-1}).$$
Note that $I(x_{-\infty}^{\tau-1}; x_t | x_\tau^{t-1}) \ge I(x_{-\infty}^{t-\ell-1}; x_t | x_{t-\ell}^{t-1}) = \delta_{KL}^{(t)}(P_\ell)$, as $t - \ell \le \tau$ and $I(X, Y; Z) \ge I(X; Z | Y)$. The proposition now follows from averaging the error across the $\ell$ time steps and using Eq. 3.1 to average over all blocks of length $\ell$ in the window $[0, T-1]$:
$$\frac{1}{\ell} \sum_{t=\tau}^{\tau+\ell-1} \delta_{KL}^{(t)}(P_\ell) \le \frac{1}{\ell} I(x_{-\infty}^{\tau-1}; x_\tau^{\infty}) \implies \delta_{KL}(P_\ell) \le \frac{I(M)}{\ell}.$$

Note that Proposition 1 also directly gives guarantees for the scenario where the task is to predict the distribution of the next block of outputs instead of just the next immediate output, because KL-divergence obeys the chain rule. The following easy corollary relates the KL error to the $\ell_1$ error; it also trivially applies to zero-one loss relative to that of the optimal predictor, as the expected relative zero-one loss at any time step is at most the $\ell_1$ loss at that time step.

Corollary 2. For any data-generating distribution $M$ with mutual information $I(M)$ between past and future observations, the best $\ell$-th order Markov model $P_\ell$ obtains average $\ell_1$-error $\delta_{\ell_1}(P_\ell) \le \sqrt{I(M)/2\ell}$ with respect to the optimal predictor that has access to the infinite history.
Also, any predictor $A_\ell$ with $\hat{\delta}_{\ell_1}(A_\ell)$ average $\ell_1$-error in estimating the joint probabilities gets average prediction error $\delta_{\ell_1}(A_\ell) \le \sqrt{I(M)/2\ell} + \hat{\delta}_{\ell_1}(A_\ell)$.

Proof. We again decompose the error, via the triangle inequality, as the sum of the error in estimating $\Pr(x_t | x_{t-\ell}^{t-1})$ and the error due to not knowing the past history:
$$\delta_{\ell_1}^{(t)}(A_\ell) = \mathbb{E}_{x_{-\infty}^{t-1}}\Big[\big\|\Pr(x_t | x_{-\infty}^{t-1}) - Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big\|_1\Big]$$
$$\le \mathbb{E}_{x_{-\infty}^{t-1}}\Big[\big\|\Pr(x_t | x_{-\infty}^{t-1}) - \Pr(x_t | x_{t-\ell}^{t-1})\big\|_1\Big] + \mathbb{E}_{x_{-\infty}^{t-1}}\Big[\big\|\Pr(x_t | x_{t-\ell}^{t-1}) - Q_{A_\ell}(x_t | x_{t-\ell}^{t-1})\big\|_1\Big]$$
$$= \delta_{\ell_1}^{(t)}(P_\ell) + \hat{\delta}_{\ell_1}^{(t)}(A_\ell).$$
Therefore, $\delta_{\ell_1}(A_\ell) \le \delta_{\ell_1}(P_\ell) + \hat{\delta}_{\ell_1}(A_\ell)$. By Pinsker's inequality and Jensen's inequality, $\delta_{\ell_1}^{(t)}(P_\ell)^2 \le \delta_{KL}^{(t)}(P_\ell)/2$. Using Proposition 1,
$$\delta_{KL}(P_\ell) = \lim_{T\to\infty}\frac{1}{T} \sum_{t=0}^{T-1} \delta_{KL}^{(t)}(P_\ell) \le \frac{I(M)}{\ell}.$$
Therefore, using Jensen's inequality again, $\delta_{\ell_1}(P_\ell) \le \sqrt{I(M)/2\ell}$.

5 Lower Bound for Large Alphabets

Our lower bounds for the sample complexity in the large-alphabet case leverage a class of Constraint Satisfaction Problems (CSPs) with high complexity. A class of (Boolean) $k$-CSPs is defined via a predicate, a function $P : \{0,1\}^k \to \{0,1\}$. An instance of such a $k$-CSP on $n$ variables $\{x_1, \cdots, x_n\}$ is a collection of sets (clauses) of size $k$ whose $k$ elements consist of $k$ variables or their negations. Such an instance is satisfiable if there exists an assignment to the variables $x_1, \ldots, x_n$ such that the predicate $P$ evaluates to 1 for every clause. More generally, the value of an instance is the maximum, over all $2^n$ assignments, of the ratio of the number of satisfied clauses to the total number of clauses. Our lower bounds are based on the presumed hardness of distinguishing random instances of a certain class of CSP versus instances of the CSP with high value.
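The value of a small CSP instance can be computed by brute force over all $2^n$ assignments, which makes the definitions above concrete. A minimal sketch, using a tiny illustrative 3-XOR instance of our own construction (variables are 1-indexed so that a literal's sign can encode negation):

```python
from itertools import product

def csp_value(n, clauses, predicate):
    """Brute-force value of a k-CSP instance: max over all 2^n assignments of
    the fraction of clauses whose predicate evaluates to 1. Each clause is a
    tuple of signed literals: +i means x_i, -i means its negation."""
    best = 0.0
    for assignment in product([0, 1], repeat=n):
        sat = 0
        for clause in clauses:
            bits = tuple(assignment[abs(l) - 1] ^ (l < 0) for l in clause)
            sat += predicate(bits)
        best = max(best, sat / len(clauses))
    return best

xor3 = lambda b: b[0] ^ b[1] ^ b[2]          # the 3-XOR predicate
clauses = [(1, 2, 3), (1, -2, 3), (-1, 2, 3)]
print(csp_value(3, clauses, xor3))   # 2/3: these constraints are inconsistent
```

A random instance would have value close to $\mathbb{E}[P] = 1/2$ for the XOR predicate, whereas a planted instance has value 1; the lower bound rests on this gap being hard to detect with few clauses.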
There has been much work attempting to characterize the difficulty of CSPs. One notion which we will leverage is the complexity of a class of CSPs, first defined in Feldman et al. [18] and studied in Allen et al. [19] and Kothari et al. [20]:

Definition 1. The complexity of a class of $k$-CSPs defined by predicate $P : \{0,1\}^k \to \{0,1\}$ is the largest $r$ such that there exists a distribution supported on the support of $P$ that is $(r-1)$-wise independent (i.e. "uniform"), and no such $r$-wise independent distribution exists.

Example 1. Both $k$-XOR and $k$-SAT are well-studied classes of $k$-CSPs, corresponding, respectively, to the predicate $P_{XOR}$ that is the XOR of the $k$ Boolean inputs, and the predicate $P_{SAT}$ that is the OR of the inputs. These predicates both support $(k-1)$-wise uniform distributions, but not $k$-wise uniform distributions, hence their complexity is $k$. In the case of $k$-XOR, the uniform distribution over $\{0,1\}^k$ restricted to the support of $P_{XOR}$ is $(k-1)$-wise uniform. The same distribution is also supported by $k$-SAT.

A random instance of a CSP with predicate $P$ is an instance such that all the clauses are chosen uniformly at random (by selecting the $k$ variables uniformly, and independently negating each variable with probability $1/2$). A random instance will have value close to $\mathbb{E}[P]$, where $\mathbb{E}[P]$ is the expectation of $P$ under the uniform distribution. In contrast, a planted instance is generated by first fixing a satisfying assignment $\sigma$ and then sampling clauses that are satisfied, by uniformly choosing $k$ variables and picking their negations according to an $(r-1)$-wise independent distribution associated with the predicate. Hence a planted instance always has value 1.
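Example 1's claim, that the uniform distribution on the support of $P_{XOR}$ is $(k-1)$-wise uniform but not $k$-wise uniform, can be verified exhaustively for small $k$. A sketch for $k = 4$:

```python
from itertools import combinations, product

k = 4
# Support of P_XOR: the vectors of odd parity (XOR of the bits is 1).
support = [v for v in product([0, 1], repeat=k) if sum(v) % 2 == 1]

# (k-1)-wise uniformity: every projection onto k-1 coordinates is uniform
# over {0,1}^(k-1), since the omitted bit is determined by the parity.
for coords in combinations(range(k), k - 1):
    counts = {}
    for v in support:
        proj = tuple(v[i] for i in coords)
        counts[proj] = counts.get(proj, 0) + 1
    assert len(counts) == 2 ** (k - 1)
    assert all(c == len(support) // 2 ** (k - 1) for c in counts.values())

# Not k-wise uniform: e.g. the all-zeros vector has probability 0.
print(len(support), (0,) * k in support)   # 8 False
```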
A noisy planted instance with planted assignment $\sigma$ and noise level $\eta$ is generated by sampling consistent clauses (as above) with probability $1 - \eta$ and random clauses with probability $\eta$; hence with high probability it has value close to $1 - \eta + \eta\mathbb{E}[P]$. Our hardness results are based on distinguishing whether a CSP instance is random or has a high value (value close to $1 - \eta + \eta\mathbb{E}[P]$).

As one would expect, the difficulty of distinguishing random instances from noisy planted instances decreases as the number of sampled clauses grows. The following conjecture of Feldman et al. [18] asserts a sharp boundary on the number of clauses, below which this problem becomes computationally intractable while remaining information-theoretically easy.

Conjectured CSP Hardness [Conjecture 1] [18]: Let $Q$ be any distribution over $k$-clauses on $n$ variables of complexity $r$, and $0 < \eta < 1$. Any polynomial-time (randomized) algorithm that, given access to a distribution $D$ that equals either the uniform distribution over $k$-clauses $U_k$ or a (noisy) planted distribution $Q_\sigma^\eta = (1-\eta)Q_\sigma + \eta U_k$ for some $\sigma \in \{0,1\}^n$ and planted distribution $Q_\sigma$, decides correctly whether $D = Q_\sigma^\eta$ or $D = U_k$ with probability at least 2/3, needs $\tilde{\Omega}(n^{r/2})$ clauses.

Feldman et al. [18] proved the conjecture for the class of statistical algorithms.[4] Recently, Kothari et al. [20] showed that the natural Sum-of-Squares (SOS) approach requires $\tilde{\Omega}(n^{r/2})$ clauses to refute random instances of a CSP with complexity $r$, hence proving Conjecture 1 for any polynomial-size semidefinite programming relaxation for refutation. Note that $\tilde{\Omega}(n^{r/2})$ is tight, as Allen et al. [19] give an SOS algorithm for refuting random CSPs beyond this regime.
Other recent papers, such as Daniely and Shalev-Shwartz [53] and Daniely [54], have also used the presumed hardness of strongly refuting random $k$-SAT and random $k$-XOR instances with a small number of clauses to derive conditional hardness for various learning problems.

A first attempt to encode a $k$-CSP as a sequential model is to construct a model which outputs $k$ randomly chosen literals for the first $k$ time steps $0$ to $k-1$, and then their (noisy) predicate value for the final time step $k$. Clauses from the CSP correspond to samples from the model, and the algorithm would need to solve the CSP to predict the final time step $k$. However, as all the outputs up to the final time step are random, the trivial prediction algorithm that guesses randomly and does not try to predict the output at time $k$ would be near optimal. To get strong lower bounds, we will output $m > 1$ functions of the $k$ literals after $k$ time steps, while still ensuring that all the functions remain collectively hard to invert without a large number of samples. We use elementary results from the theory of error-correcting codes to achieve this, and prove hardness via a reduction from a specific family of CSPs to which Conjecture 1 applies. By choosing $k$ and $m$ carefully, we obtain the near-optimal dependence on the mutual information and error, matching the upper bounds implied by Proposition 1.

[4] Statistical algorithms are an extension of the statistical query model. These are algorithms that do not directly access samples from the distribution, but instead have access to estimates of the expectation of any bounded function of a sample through a "statistical oracle". Feldman et al. [52] point out that almost all algorithms that work on random data also work with this limited access to samples; refer to Feldman et al. [52] for more details and examples.
We provide a short outline of the argument, followed by the detailed proof in the appendix.

5.1 Sketch of Lower Bound Construction

We construct a sequential model $M$ such that making good predictions on the model requires distinguishing random instances of a $k$-CSP $\mathcal{C}$ on $n$ variables from instances of $\mathcal{C}$ with a high value. The output alphabet of $M$ is $\{a_i\}$, of size $2n$. We choose a mapping from the $2n$ characters $\{a_i\}$ to the $n$ variables $\{x_i\}$ and their $n$ negations $\{\bar{x}_i\}$. For any clause $C$ and planted assignment $\sigma$ of the CSP $\mathcal{C}$, let $\sigma(C)$ be the $k$-bit string of values assigned by $\sigma$ to the literals in $C$. The model $M$ outputs $k$ characters from time $0$ to $k-1$ chosen uniformly at random, which correspond to literals of the CSP $\mathcal{C}$; hence the $k$ outputs correspond to a clause $C$ of the CSP. For some $m$ (to be specified later) we will construct a binary matrix $A \in \{0,1\}^{m \times k}$, which will correspond to a good error-correcting code. For the time steps $k$ to $k+m-1$, with probability $1-\eta$ the model outputs $y \in \{0,1\}^m$, where $y = Av \bmod 2$ and $v = \sigma(C)$, with $C$ being the clause associated with the outputs of the first $k$ time steps. With the remaining probability $\eta$, the model outputs $m$ uniformly random bits. Note that the mutual information $I(M)$ is at most $m$, as only the outputs from time $k$ to $k+m-1$ can be predicted.

We claim that $M$ can be simulated by an HMM with $2^m(2k+m) + m$ hidden states. This can be done as follows. For every time step from $0$ to $k-1$ there will be $2^{m+1}$ hidden states, for a total of $k2^{m+1}$ hidden states. Each of these hidden states has two labels: the current value of the $m$ bits of $y$, and an "output label" of 0 or 1, corresponding to the output at that time step having an assignment of 0 or 1 under the planted assignment $\sigma$.
The output distribution for each of these hidden states is one of the following: if the state has an "output label" 0, then it is uniform over all the characters which have an assignment of 0 under the planted assignment $\sigma$; similarly, if the state has an "output label" 1, then it is uniform over all the characters which have an assignment of 1 under the planted assignment $\sigma$. The transition matrix for the first $k$ time steps simply connects a state $h_1$ at the $(i-1)$-th time step to a state $h_2$ at the $i$-th time step if the value of $y$ corresponding to $h_1$ should be updated to the value of $y$ corresponding to $h_2$ when the output at the $i$-th time step corresponds to the "output label" of $h_2$. For the time steps $k$ through $k+m-1$, there are $2^m$ hidden states for each time step, each corresponding to a particular choice of $y$. The output of a hidden state corresponding to the $(k+i)$-th time step with a particular label $y$ is simply the $i$-th bit of $y$. Finally, we need an additional $m$ hidden states to output $m$ uniform random bits from time $k$ to $k+m-1$ with probability $\eta$. This accounts for a total of $k2^{m+1} + m2^m + m$ hidden states. After $k+m$ time steps the HMM transitions back to one of the starting states at time 0 and repeats.

Note that the larger $m$ is with respect to $k$, the higher the cost (in terms of average prediction error) of failing to correctly predict the outputs from time $k$ to $k+m-1$. Tuning $k$ and $m$ allows us to control the number of hidden states and the average error incurred by a computationally constrained predictor.

We define the CSP $\mathcal{C}$ in terms of a collection of predicates $P(y)$, one for each $y \in \{0,1\}^m$. While Conjecture 1 does not directly apply to $\mathcal{C}$, as it is defined by a collection of predicates instead of a single one, we will later show a reduction from a related CSP $\mathcal{C}'$ defined by a single predicate for which Conjecture 1 holds.
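The generative process for a single length-$(k+m)$ window of the model $M$ described above can be sketched directly. The planted assignment $\sigma$ and the matrix $A$ below are small arbitrary examples of our own (the paper takes $A$ from a good error-correcting code):

```python
import random

random.seed(1)

n, k, m, eta = 6, 3, 2, 0.1
sigma = [random.randint(0, 1) for _ in range(n)]     # planted assignment (illustrative)
A = [[1, 0, 1],                                      # m x k binary matrix (illustrative;
     [0, 1, 1]]                                      # the paper uses a code matrix)

def sample_window():
    """One window of M: k uniformly random literals, then either
    y = A sigma(C) mod 2 (prob 1 - eta) or m uniform bits (prob eta)."""
    # A literal is a pair (variable index, negated?).
    clause = [(random.randrange(n), random.randint(0, 1)) for _ in range(k)]
    v = [sigma[i] ^ neg for i, neg in clause]        # v = sigma(C)
    if random.random() < eta:
        y = [random.randint(0, 1) for _ in range(m)]
    else:
        y = [sum(A[r][j] * v[j] for j in range(k)) % 2 for r in range(m)]
    # Encode literals in a 2n-letter alphabet (symbol 2i for x_i, 2i+1 for its
    # negation), followed by the m parity bits.
    return [2 * i + neg for i, neg in clause] + y

print(sample_window())
```

Predicting the last $m$ symbols well requires inverting the noisy parities $y$, which is exactly the planted-versus-random CSP distinguishing problem.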
For each $y$, the predicate $P(y)$ of $\mathcal{C}$ is the set of $v \in \{0,1\}^k$ which satisfy $y = Av \bmod 2$. Hence each clause has an additional label $y$ which determines the satisfying assignments, and this label is just the output of our sequential model $M$ from time $k$ to $k+m-1$. Hence for any planted assignment $\sigma$, the set of satisfying clauses $C$ of the CSP $\mathcal{C}$ are all clauses such that $Av = y \bmod 2$, where $y$ is the label of the clause and $v = \sigma(C)$. We define a (noisy) planted distribution over clauses $Q_\sigma^\eta$ by first uniformly sampling a label $y$, and then sampling a consistent clause with probability $1-\eta$; otherwise, with probability $\eta$, we sample a uniformly random clause. Let $U_k$ be the uniform distribution over all $k$-clauses with uniformly chosen labels $y$. We will show that Conjecture 1 implies that distinguishing between the distributions $Q_\sigma^\eta$ and $U_k$ is hard without sufficiently many clauses. This gives us the hardness result we desire for our sequential model $M$: if an algorithm obtains low prediction error on the outputs from time $k$ through $k+m-1$, then it can be used to distinguish between instances of the CSP $\mathcal{C}$ with a high value and random instances, as no algorithm obtains low prediction error on random instances. Hence hardness of strongly refuting the CSP $\mathcal{C}$ implies hardness of making good predictions on $M$.

We now sketch the argument for why Conjecture 1 implies the hardness of strongly refuting the CSP $\mathcal{C}$. We define another CSP $\mathcal{C}'$ which we show reduces to $\mathcal{C}$. The predicate $P$ of the CSP $\mathcal{C}'$ is the set of all $v \in \{0,1\}^k$ such that $Av = 0 \bmod 2$. Hence for any planted assignment $\sigma$, the set of satisfying clauses of the CSP $\mathcal{C}'$ are all clauses such that $v = \sigma(C)$ is in the nullspace of $A$. As before, the planted distribution over clauses is uniform on all satisfying clauses with probability $1-\eta$; with probability $\eta$ we add a uniformly random $k$-clause.
For some $\gamma \ge 1/10$, if we can construct $A$ such that the set of satisfying assignments $v$ (which are the vectors in the nullspace of $A$) supports a $(\gamma k - 1)$-wise uniform distribution, then by Conjecture 1 no polynomial-time algorithm can distinguish between the planted distribution and uniformly randomly chosen clauses with fewer than $\tilde{\Omega}(n^{\gamma k/2})$ clauses. We show that choosing a matrix $A$ whose nullspace is $(\gamma k - 1)$-wise uniform corresponds to finding a binary linear code with rate at least $1/2$ and relative distance $\gamma$, the existence of which is guaranteed by the Gilbert-Varshamov bound.

We next sketch the reduction from $\mathcal{C}'$ to $\mathcal{C}$. The key idea is that the CSPs $\mathcal{C}'$ and $\mathcal{C}$ are defined by linear equations. If a clause $C = (x_1, x_2, \cdots, x_k)$ in $\mathcal{C}'$ is satisfied with some assignment $t \in \{0,1\}^k$ to the variables in the clause, then $At = 0 \bmod 2$. Therefore, for some $w \in \{0,1\}^k$ such that $Aw = y \bmod 2$, the assignment $t + w \bmod 2$ satisfies $A(t+w) = y \bmod 2$. A clause $C' = (x'_1, x'_2, \cdots, x'_k)$ with assignment $t + w \bmod 2$ to the variables can be obtained from the clause $C$ by switching the literal $x'_i = \bar{x}_i$ if $w_i = 1$ and retaining $x'_i = x_i$ if $w_i = 0$. Hence for any label $y$, we can efficiently convert a clause $C$ in $\mathcal{C}'$ to a clause $C'$ in $\mathcal{C}$ which has the desired label $y$ and is only satisfied with a particular assignment to the variables if $C$ in $\mathcal{C}'$ is satisfied with the same assignment to the variables. It is also not hard to ensure that we uniformly sample the consistent clause $C'$ in $\mathcal{C}$ if the original clause $C$ was a uniformly sampled consistent clause in $\mathcal{C}'$.

We provide a small example to illustrate the sequential model constructed above. Let $k = 3$, $m = 1$ and $n = 3$, and let $A \in \{0,1\}^{1 \times 3}$. The output alphabet of the model $M$ is $\{a_i, 1 \le i \le 6\}$. The letter $a_1$ maps to the variable $x_1$, $a_2$ maps to $\bar{x}_1$, and similarly $a_3 \to x_2$, $a_4 \to \bar{x}_2$, $a_5 \to x_3$, $a_6 \to \bar{x}_3$.
Let $\sigma$ be some planted assignment to $\{x_1, x_2, x_3\}$, which defines a particular model $M$. If the output of the model $M$ is $a_1, a_3, a_6$ for the first three time steps, then this corresponds to the clause with literals $(x_1, x_2, \bar{x}_3)$. For the final time step, with probability $1-\eta$ the model outputs $y = Av \bmod 2$, with $v = \sigma(C)$ for the clause $C = (x_1, x_2, \bar{x}_3)$ and planted assignment $\sigma$, and with probability $\eta$ it outputs a uniformly random bit. For an algorithm to make a good prediction at the final time step, it needs to determine whether the output at the final time step is always a random bit or whether it depends on the clause; hence it needs to distinguish random instances of the CSP from planted instances.

We re-state Theorem 2 below in terms of the notation defined in this section, deferring its full proof to Appendix B.

Theorem 2. Assuming Conjecture 1, for all sufficiently large $T$ and $1/T^c < \epsilon \le 0.1$ for some fixed constant $c$, there exists a family of HMMs with $T$ hidden states and an output alphabet of size $n$ such that any prediction algorithm that achieves average KL-error, $\ell_1$ error or relative zero-one error less than $\epsilon$ with probability greater than 2/3 for a randomly chosen HMM in the family, and runs in time $f(T, \epsilon) \cdot n^{g(T, \epsilon)}$ for any functions $f$ and $g$, requires $n^{\Omega(\log T / \epsilon)}$ samples from the HMM.

6 Lower Bound for Small Alphabets

Our lower bounds for the sample complexity in the binary-alphabet case are based on the average-case hardness of the decision version of the parity-with-noise problem, and the reduction is straightforward.
In the parity-with-noise problem on $n$-bit inputs, we are given examples $v \in \{0,1\}^n$ drawn uniformly from $\{0,1\}^n$, along with their noisy labels $\langle s, v \rangle + \epsilon \bmod 2$, where $s \in \{0,1\}^n$ is the (unknown) support of the parity function, and $\epsilon \in \{0,1\}$ is the classification noise such that $\Pr[\epsilon = 1] = \eta$, where $\eta < 0.05$ is the noise level. Let $Q_s^\eta$ be the distribution over examples of the parity-with-noise instance with $s$ as the support of the parity function and $\eta$ as the noise level. Let $U_n$ be the distribution over examples and labels where each label is chosen uniformly from $\{0,1\}$, independent of the example.

The strength of our lower bounds depends on the level of hardness of parity with noise. Currently, the fastest algorithm for the problem, due to Blum et al. [22], runs in time and samples $2^{n/\log n}$. We define the function $f(n)$ as follows:

Definition 2. Define $f(n)$ to be the function such that for a uniformly random support $s \in \{0,1\}^n$, with probability at least $1 - 1/n^2$ over the choice of $s$, any (randomized) algorithm that can distinguish between $Q_s^\eta$ and $U_n$ with success probability greater than $2/3$ over the randomness of the examples and the algorithm requires $f(n)$ time or samples.

Our model will be the natural sequential version of the parity-with-noise problem, where each example is coupled with several parity bits. We denote the model as $M(A_{m \times n})$ for some $A \in \{0,1\}^{m \times n}$, $m \le n/2$. From time 0 through $n-1$ the outputs of the model are i.i.d. and uniform on $\{0,1\}$. Let $v \in \{0,1\}^n$ be the vector of outputs from time 0 to $n-1$. The outputs for the next $m$ time steps are given by $y = Av + \epsilon \bmod 2$, where $\epsilon \in \{0,1\}^m$ is the random noise and each entry $\epsilon_i$ of $\epsilon$ is an i.i.d. random variable such that $\Pr[\epsilon_i = 1] = \eta$, where $\eta$ is the noise level.
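Sampling from the parity-with-noise distribution $Q_s^\eta$ described above is straightforward; a minimal sketch with illustrative parameter values (the support $s$, $n$, and the sample count are our own choices):

```python
import random

random.seed(0)

n, eta = 8, 0.05
s = [random.randint(0, 1) for _ in range(n)]   # hidden support of the parity

def sample():
    """One parity-with-noise example: (v, <s, v> + eps mod 2)."""
    v = [random.randint(0, 1) for _ in range(n)]
    eps = 1 if random.random() < eta else 0
    label = (sum(si * vi for si, vi in zip(s, v)) + eps) % 2
    return v, label

# A fraction of roughly 1 - eta of the labels agree with the clean parity;
# recovering s from such samples is believed to be hard for large n.
examples = [sample() for _ in range(1000)]
clean = sum((sum(si * vi for si, vi in zip(s, v)) % 2) == y for v, y in examples)
frac = clean / len(examples)
print(frac)
```

The sequential model $M(A_{m \times n})$ simply stacks $m$ such noisy parities (the rows of $A$) after the $n$ uniform bits.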
Note that if $A$ has full row rank and $v$ is chosen uniformly at random from $\{0,1\}^n$, the distribution of $y$ is uniform on $\{0,1\}^m$. Also, $I(M(A)) \le m$, as at most the binary bits from time $n$ to $n+m-1$ can be predicted using the past inputs. As in the large-alphabet case, $M(A_{m \times n})$ can be simulated by an HMM with $2^m(2n+m) + m$ hidden states (see Section 5.1).

We define a set of $A$ matrices, which specifies a family of sequential models. Let $S$ be the set of all $m \times n$ matrices $A$ such that $A$ has full row rank. We need this restriction as otherwise the bits of the output $y$ would be dependent. We denote by $R$ the family of models $M(A)$ for $A \in S$. Lemma 3 shows that, with high probability over the choice of $A$, distinguishing outputs from the model $M(A)$ from random examples $U_n$ requires $f(n)$ time or examples.

Lemma 3. Let $A$ be chosen uniformly at random from the set $S$. Then, with probability at least $1 - 1/n$ over the choice of $A \in S$, any (randomized) algorithm that can distinguish the outputs from the model $M(A)$ from the distribution over random examples $U_n$ with success probability greater than $2/3$ over the randomness of the examples and the algorithm needs $f(n)$ time or examples.

The proof of Proposition 2 follows from Lemma 3 and is similar to the proof for the large-alphabet case.

7 Information Theoretic Lower Bounds

We show that, information theoretically, windows of length $cI(M)/\epsilon^2$ are necessary to get expected relative zero-one loss less than $\epsilon$. As the expected relative zero-one loss is at most the $\ell_1$ loss, which can in turn be bounded in terms of the KL-divergence via Pinsker's inequality, this automatically implies that our window-length requirement is also tight for $\ell_1$ loss and KL loss.
In fact, it is easy to show tightness for the KL loss: choose the simple model which emits uniformly random bits from time $0$ to $n-1$ and repeats the bits from time $0$ to $m-1$ at times $n$ through $n+m-1$. One can then choose $n, m$ to obtain the desired error and mutual information $I(M)$. To get a lower bound for the zero-one loss we use the probabilistic method to argue that there exists an HMM for which long windows are required to perform optimally with respect to the zero-one loss. We now state the lower bound and sketch the proof idea.

Proposition 3. There is an absolute constant $c$ such that for all $0 < \epsilon < 1/4$ and sufficiently large $n$, there exists an HMM with $n$ states such that it is not information theoretically possible to get average relative zero-one loss or $\ell_1$ loss less than $\epsilon$ using windows of length smaller than $c\log n/\epsilon^2$, or KL loss less than $\epsilon$ using windows of length smaller than $c\log n/\epsilon$.

We illustrate the construction in Fig. 2 and give the high-level proof idea with respect to it below. We want to show that no predictor $P$ using windows of length $\ell = 3$ can make a good prediction. The transition matrix of the HMM is a permutation and the output alphabet is binary. Each state is assigned a label which determines its output distribution: states labeled 0 emit 0 with probability $0.5 + \epsilon$, and states labeled 1 emit 1 with probability $0.5 + \epsilon$. We choose the labels of the hidden states uniformly at random. Over the randomness in choosing the labels for the permutation, we will show that the expected error of the predictor $P$ is large, which means that there must exist some permutation for which $P$ incurs a high error. The rough proof idea is as follows. Say the Markov model is at hidden state $h_2$ at time 2; this is unknown to the predictor $P$. The outputs for the first three time steps are $(x_0, x_1, x_2)$.
The predictor $P$ only looks at the outputs from time 0 to 2 when making its prediction for time 3. We show that with high probability over the choice of labels of the hidden states and the outputs $(x_0, x_1, x_2)$, the output $(x_0, x_1, x_2)$ from the hidden states $(h_0, h_1, h_2)$ is close in Hamming distance to the label of some other segment of hidden states, say $(h_4, h_5, h_6)$. Hence any predictor using only the past 3 outputs cannot distinguish whether the string $(x_0, x_1, x_2)$ was emitted by $(h_0, h_1, h_2)$ or by $(h_4, h_5, h_6)$, and therefore cannot make a good prediction for time 3 (we actually need to show that there are many segments like $(h_4, h_5, h_6)$ whose label is close to $(x_0, x_1, x_2)$). The proof proceeds via simple concentration bounds.

Figure 2: Lower bound construction, $n = 16$.

A Proof of Theorem 1

Theorem 1. Suppose observations are generated by a Hidden Markov Model with at most $n$ hidden states and an output alphabet of size $d$. For $\epsilon > 1/\log^{0.25} n$ there exists a window length $\ell = O(\log n/\epsilon)$ and an absolute constant $c$ such that for any $T \ge d^{c\ell}$, if $t \in \{1, 2, \dots, T\}$ is chosen uniformly at random, then the expected $\ell_1$ distance between the true distribution of $x_t$ given the entire history (and knowledge of the HMM) and the distribution predicted by the naive "empirical" $\ell$-th order Markov model based on $x_0, \dots, x_{t-1}$ is bounded by $\sqrt{\epsilon}$.

Proof. Let $\pi_t$ be the distribution over hidden states under which the probability of the $i$th hidden state is the empirical frequency of the $i$th hidden state from time 1 to $t-1$, normalized by $(t-1)$. For $0 \le s \le \ell-1$, consider the predictor $P_t$ which predicts the distribution of observation $x_{t+s}$ given observations $x_t, \dots, x_{t+s-1}$, based on the true distribution of $x_{t+s}$ under the HMM, conditioned on the observations $x_t, \dots, x_{t+s-1}$ and on the hidden state at time $t$ being distributed as $\pi_t$. We will show that, in expectation over $t$, $P_t$ incurs small error averaged across the time steps $0 \le s \le \ell-1$, with respect to the optimal prediction of $x_{t+s}$ made with knowledge of the true hidden state $h_t$ at time $t$. To show this, we first establish that the true hidden state $h_t$ at time $t$ does not have very small probability under $\pi_t$, with high probability over the choice of $t$.

Lemma 4. With probability $1 - 2/n$ over the choice of $t \in \{1, \dots, T\}$, the hidden state $h_t$ at time $t$ has probability at least $1/n^3$ under $\pi_t$.

Proof. Consider the ordered set $S_i$ of time indices $t$ at which the hidden state $h_t = i$, sorted in increasing order. We first argue that picking a time step $t$ at which the hidden state $h_t$ is a state $j$ occurring rarely in the sequence is unlikely. For sets corresponding to hidden states $j$ which have probability less than $1/n^2$ under $\pi_T$, the cardinality $|S_j| \le T/n^2$. The sum of the cardinalities of all such small sets is at most $T/n$, and hence the probability that a uniformly random $t \in \{1, \dots, T\}$ lies in one of these sets is at most $1/n$.

Now consider the set of time indices $S_i$ corresponding to some hidden state $i$ which has probability at least $1/n^2$ under $\pi_T$. For all $t$ which are not among the first $T/n^3$ time indices in this set, the hidden state $i$ has probability at least $1/n^3$ under $\pi_t$. We refer to the first $T/n^3$ time indices in any set $S_i$ as the "bad" time steps for hidden state $i$. Note that the fraction of "bad" time steps corresponding to any hidden state with probability at least $1/n^2$ under $\pi_T$ is at most $1/n$, and hence the total fraction of these "bad" time steps across all hidden states is at most $1/n$.
Therefore, by a union bound, with failure probability $2/n$ the hidden state $h_t$ at time $t$ has probability at least $1/n^3$ under $\pi_t$.

Consider any time index $t$; for simplicity assume $t = 0$. Let $OPT_s$ denote the conditional distribution of $x_s$ given observations $x_0, \dots, x_{s-1}$ and knowledge of the hidden state at time 0. Let $M_s$ denote the conditional distribution of $x_s$ given only $x_0, \dots, x_{s-1}$, given that the hidden state at time 0 has distribution $\pi_0$.

Lemma 1. For $\epsilon > 1/n$, if the true hidden state at time 0 has probability at least $1/n^c$ under $\pi_0$, then for $\ell = c\log n/\epsilon^2$,
$$\mathbb{E}\Big[\frac{1}{\ell}\sum_{s=0}^{\ell-1} \|OPT_s - M_s\|_1\Big] \le 4\epsilon,$$
where the expectation is with respect to the randomness in the outputs from time 0 to $\ell - 1$.

By Lemma 4, for a randomly chosen $t \in \{1, \dots, T\}$ the probability that the hidden state at time $t$ has probability less than $1/n^3$ under the prior distribution $\pi_t$ is at most $2/n$. As the $\ell_1$ error at any time step can be at most 2, using Lemma 1 the expected average error of the predictor $P_t$ across all $t$ is at most $4\epsilon + 4/n \le 8\epsilon$ for $\ell = 3\log n/\epsilon^2$.

Now consider the predictor $\hat{P}_t$ which, for $0 \le s \le \ell - 1$, predicts $x_{t+s}$ given $x_t, \dots, x_{t+s-1}$ according to the empirical distribution of $x_{t+s}$ given $x_t, \dots, x_{t+s-1}$, based on the observations up to time $t$. We now argue that the predictions of $\hat{P}_t$ are close in expectation to the predictions of $P_t$. Recall that the prediction of $P_t$ at time $t+s$ is the true distribution of $x_{t+s}$ under the HMM, conditioned on the observations $x_t, \dots, x_{t+s-1}$ and on the hidden state at time $t$ being drawn from $\pi_t$. For any $s < \ell$, let $P_1$ refer to the prediction of $\hat{P}_t$ at time $t+s$ and $P_2$ to the prediction of $P_t$ at time $t+s$. We will show that $\|P_1 - P_2\|_1$ is small in expectation over $t$. We do this using a martingale concentration argument.
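The empirical window-based predictor $\hat{P}_t$ that Theorem 1 analyzes, an $\ell$-th order Markov model fit by raw window counts, can be sketched as follows (a minimal illustration; the unsmoothed counting and the function name are our choices, not the paper's):

```python
from collections import Counter, defaultdict

def empirical_markov_predictor(seq, ell):
    """Fit the naive empirical ell-th order Markov model on seq: estimate
    Pr[x_t = a | last ell symbols] by empirical window frequencies."""
    counts = defaultdict(Counter)
    for i in range(ell, len(seq)):
        counts[tuple(seq[i - ell:i])][seq[i]] += 1

    def predict(window):
        c = counts[tuple(window[-ell:])]
        total = sum(c.values())
        return {a: cnt / total for a, cnt in c.items()} if total else {}

    return predict

predict = empirical_markov_predictor("abababababab", ell=2)
assert predict("ab") == {"a": 1.0}   # "ab" is always followed by "a"
```

An unseen window returns the empty estimate here; the analysis above handles such rarely-occurring strings separately via the $1/d^{c\ell/40}$ probability-mass argument.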
Consider any string $r$ of length $s$. Let $Q_1(r)$ be the empirical probability of the string $r$ up to time $t$, and let $Q_2(r)$ be the true probability of the string $r$ given that the hidden state at time $t$ is distributed as $\pi_t$. Our aim is to show that $|Q_1(r) - Q_2(r)|$ is small. Define the random variable
$$Y_\tau = \Pr[[x_\tau : x_{\tau+s-1}] = r \mid h_\tau] - \mathbb{I}([x_\tau : x_{\tau+s-1}] = r),$$
where $\mathbb{I}$ denotes the indicator function and $Y_0$ is defined to be 0. We claim that $Z_\tau = \sum_{i=0}^{\tau} Y_i$ is a martingale with respect to the filtration $\{\phi\}, \{h_1\}, \{h_2, x_1\}, \{h_3, x_2\}, \dots, \{h_{t+1}, x_t\}$. To verify, note that
$$\mathbb{E}[Y_\tau \mid \{h_1\}, \{h_2, x_1\}, \dots, \{h_\tau, x_{\tau-1}\}] = \Pr[[x_\tau : x_{\tau+s-1}] = r \mid h_\tau] - \mathbb{E}[\mathbb{I}([x_\tau : x_{\tau+s-1}] = r) \mid h_\tau] = 0.$$
Therefore $\mathbb{E}[Z_\tau \mid \{h_1\}, \{h_2, x_1\}, \dots, \{h_\tau, x_{\tau-1}\}] = Z_{\tau-1}$, and hence $Z_\tau$ is a martingale. Also, $|Z_\tau - Z_{\tau-1}| \le 1$, as $0 \le \Pr[[x_\tau : x_{\tau+s-1}] = r \mid h_\tau] \le 1$ and $0 \le \mathbb{I}([x_\tau : x_{\tau+s-1}] = r) \le 1$. Hence, using Azuma's inequality (Lemma 8),
$$\Pr[|Z_{t-s}| \ge K] \le 2e^{-K^2/(2t)}.$$
Note that $Z_{t-s}/(t-s) = Q_2(r) - Q_1(r)$. By Azuma's inequality and a union bound over all $d^s \le d^\ell$ strings $r$ of length $s$, for $c \ge 4$ and $t \ge T/n^2 = d^{c\ell}/n^2 \ge d^{c\ell/2}$, we have $\|Q_1 - Q_2\|_1 \le 1/d^{c\ell/20}$ with failure probability at most $2d^\ell e^{-\sqrt{t}/2} \le 1/n^2$. Similarly, for all strings of length $s+1$, the estimated probability of the string has error at most $1/d^{c\ell/20}$ with failure probability $1/n^2$. The conditional distribution of $x_{t+s}$ given observations $x_t, \dots, x_{t+s-1}$ is the ratio of the joint distributions of $\{x_t, \dots, x_{t+s-1}, x_{t+s}\}$ and $\{x_t, \dots, x_{t+s-1}\}$; therefore, as long as the empirical distributions of the length-$s$ and length-$(s+1)$ strings are estimated with error at most $1/d^{c\ell/20}$ and the string $\{x_t, \dots, x_{t+s-1}\}$ has probability at least $1/d^{c\ell/40}$, the conditional distributions $P_1$ and $P_2$ satisfy $\|P_1 - P_2\|_1 \le 1/n^2$. By a union bound over all $d^s \le d^\ell$ strings, the total probability mass on strings which occur with probability less than $1/d^{c\ell/40}$ is at most $1/d^{c\ell/50} \le 1/n^2$ for $c \ge 100$. Therefore $\|P_1 - P_2\|_1 \le 1/n^2$ with overall failure probability $3/n^2$, and hence the expected $\ell_1$ distance between $P_1$ and $P_2$ is at most $1/n$.

Using the triangle inequality and the fact that the expected average error of $P_t$ is at most $8\epsilon$ for $\ell = 3\log n/\epsilon^2$, it follows that the expected average error of $\hat{P}_t$ is at most $8\epsilon + 1/n \le 9\epsilon$. Note that the expected average error of $\hat{P}_t$ is the average of the expected errors of the empirical $s$-th order Markov models for $0 \le s \le \ell-1$. Hence for $\ell = 3\log n/\epsilon^2$ there must exist some $s < \ell$ such that the $s$-th order Markov model gets expected $\ell_1$ error at most $9\epsilon$.

A.1 Proof of Lemma 1

Let the prior distribution of the hidden states at time 0 be $\pi_0$, and without loss of generality let the true hidden state $h_0$ at time 0 be state 1. We denote the output at time $s$ by $x_s$, and write $x_0^s$ for the sequence of outputs up to time $s$. Let $H_0^s(i) = \Pr[h_0 = i \mid x_0^s]$ be the posterior probability of the $i$th hidden state at time 0 after seeing the observations $x_0^s$, with prior $\pi_0$ on the distribution of the hidden states at time 0. Let $u_s = H_0^s(1)$ and $v_s = 1 - u_s$. Define $P_i^s(j) = \Pr[x_s = j \mid x_0^{s-1}, h_0 = i]$ as the distribution of the output at time $s$ conditioned on the hidden state at time 0 being $i$ and on observations $x_0^{s-1}$. Note that $OPT_s = P_1^s$.
As before, define $R_s$ as the conditional distribution of $x_s$ given observations $x_0, \dots, x_{s-1}$ and initial distribution $\pi_0$, but conditioned on not being at hidden state 1 at time 0, i.e. $R_s = (1/v_s)\sum_{i=2}^n H_0^s(i) P_i^s$. Note that $M_s$ is a convex combination of $OPT_s$ and $R_s$, i.e. $M_s = u_s OPT_s + v_s R_s$. Hence $\|OPT_s - M_s\|_1 \le \|OPT_s - R_s\|_1$. Define $\delta_s = \|OPT_s - M_s\|_1$.

Our proof relies on a martingale concentration argument, and in order to ensure that our martingale has bounded differences we ignore outputs which cause a significant drop in the posterior probability of the true hidden state at time 0. Let $B$ be the set of all outputs $j$ at some time $s$ such that $\frac{OPT_s(j)}{R_s(j)} \le \frac{\epsilon^4}{c\log n}$. Note that
$$\sum_{j\in B} OPT_s(j) \le \frac{\epsilon^4}{c\log n}\sum_{j\in B} R_s(j) \le \frac{\epsilon^4}{c\log n}.$$
Hence, by a union bound, with failure probability at most $\epsilon^2$ no output $j$ with $\frac{OPT_s(j)}{R_s(j)} \le \frac{\epsilon^4}{c\log n}$ is emitted in a window of length $c\log n/\epsilon^2$. We therefore restrict attention to sequences of outputs in which the output $j$ emitted at each step satisfies $\frac{OPT_s(j)}{R_s(j)} > \frac{\epsilon^4}{c\log n}$; let the set of all such output sequences be $S_1$, and note that $\Pr(x_0^s \notin S_1) \le \epsilon^2$. Let $\mathbb{E}_{S_1}[X]$ denote the expectation of a random variable $X$ conditioned on the output sequence lying in $S_1$.

Consider the sequence of random variables $X_s = \log u_s - \log v_s$ for $s \in [-1, \ell-1]$, with $X_{-1} = \log(\pi_1) - \log(1 - \pi_1)$. Let $\Delta_{s+1} = X_{s+1} - X_s$ be the change in $X_s$ on seeing the output $x_{s+1}$ at time $s+1$, and let the output at time $s+1$ be $j$. We first find an expression for $\Delta_{s+1}$. The posterior probabilities after seeing the $(s+1)$th output are updated according to Bayes rule:
$$H_0^{s+1}(1) = \Pr[h_0 = 1 \mid x_0^s, x_{s+1} = j] = \frac{\Pr[h_0 = 1 \mid x_0^s]\,\Pr[x_{s+1} = j \mid h_0 = 1, x_0^s]}{\Pr[x_{s+1} = j \mid x_0^s]} \implies u_{s+1} = \frac{u_s\, OPT_{s+1}(j)}{\Pr[x_{s+1} = j \mid x_0^s]}.$$
Let $\Pr[x_{s+1} = j \mid x_0^s] = d_j$. Note that $H_0^{s+1}(i) = H_0^s(i) P_i^{s+1}(j)/d_j$ if the output at time $s+1$ is $j$. We can write
$$R_{s+1} = \frac{1}{v_s}\sum_{i=2}^n H_0^s(i) P_i^{s+1}, \qquad v_{s+1} = \sum_{i=2}^n H_0^{s+1}(i) = \sum_{i=2}^n H_0^s(i) P_i^{s+1}(j)/d_j = v_s R_{s+1}(j)/d_j.$$
Therefore we can write $\Delta_{s+1}$ and its expectation $\mathbb{E}[\Delta_{s+1}]$ as
$$\Delta_{s+1} = \log\frac{OPT_{s+1}(j)}{R_{s+1}(j)} \implies \mathbb{E}[\Delta_{s+1}] = \sum_j OPT_{s+1}(j)\log\frac{OPT_{s+1}(j)}{R_{s+1}(j)} = D(OPT_{s+1}\,\|\,R_{s+1}).$$
We define $\tilde{\Delta}_{s+1} := \min\{\Delta_{s+1}, \log\log n\}$ to keep the martingale differences bounded. $\mathbb{E}[\tilde{\Delta}_{s+1}]$ then equals a truncated version of the KL divergence, which we define as follows.

Definition 3. For any two distributions $\mu(x)$ and $\nu(x)$, define the truncated KL divergence as $\tilde{D}_C(\mu\|\nu) = \mathbb{E}_\mu\big[\log\min\{\mu(x)/\nu(x), C\}\big]$ for some fixed $C$.

We are now ready to define our martingale. Consider the sequence of random variables $\tilde{X}_s := \tilde{X}_{s-1} + \tilde{\Delta}_s$ for $s \in [0, \ell-1]$, with $\tilde{X}_{-1} := X_{-1}$. Note that $\Delta_s \ge \tilde{\Delta}_s$, and hence $X_s \ge \tilde{X}_s$.

Lemma 5. $\mathbb{E}_{S_1}[\tilde{X}_s - \tilde{X}_{s-1}] \ge \delta_s^2/2$, where the expectation is with respect to the output at time $s$. Hence the sequence of random variables $\tilde{Z}_s := \sum_{i=0}^{s}(\tilde{X}_i - \tilde{X}_{i-1} - \delta_i^2/2)$ is a submartingale with respect to the outputs.

Proof. By definition, $\tilde{X}_s - \tilde{X}_{s-1} = \tilde{\Delta}_s$ and $\mathbb{E}[\tilde{\Delta}_s] = \tilde{D}_C(OPT_s \| R_s)$ with $C = \log n$. By taking the expectation with respect to only the sequences in $S_1$ instead of all possible sequences, we remove events which have a negative contribution to $\mathbb{E}[\tilde{\Delta}_s]$, hence $\mathbb{E}_{S_1}[\tilde{\Delta}_s] \ge \mathbb{E}[\tilde{\Delta}_s] = \tilde{D}_C(OPT_s\|R_s)$. We can now apply Lemma 6.

Lemma 6. (Modified Pinsker's inequality) For any two distributions $\mu(x)$ and $\nu(x)$ defined on $x \in X$, define the $C$-truncated KL divergence as $\tilde{D}_C(\mu\|\nu) = \mathbb{E}_\mu\big[\log\min\{\frac{\mu(x)}{\nu(x)}, C\}\big]$ for some fixed $C$ such that $\log C \ge 8$. Then $\tilde{D}_C(\mu\|\nu) \ge \frac{1}{2}\|\mu - \nu\|_1^2$.

Hence $\mathbb{E}_{S_1}[\tilde{\Delta}_s] \ge \frac{1}{2}\|OPT_s - R_s\|_1^2$, and therefore $\mathbb{E}_{S_1}[\tilde{X}_s - \tilde{X}_{s-1}] \ge \delta_s^2/2$.

We now claim that our submartingale has bounded differences.

Lemma 7. $|\tilde{Z}_s - \tilde{Z}_{s-1}| \le \sqrt{2}\log(c\log n/\epsilon^4)$.

Proof. Note that $\delta_s^2/2$ is at most 2, and $\tilde{Z}_s - \tilde{Z}_{s-1} = \tilde{\Delta}_s - \delta_s^2/2$. By definition $\tilde{\Delta}_s \le \log\log n$. Also, $\tilde{\Delta}_s \ge -\log(c\log n/\epsilon^4)$, as we restrict ourselves to sequences in the set $S_1$. Hence $|\tilde{Z}_s - \tilde{Z}_{s-1}| \le \log(c\log n/\epsilon^4) + 2 \le \sqrt{2}\log(c\log n/\epsilon^4)$.

We now apply the Azuma-Hoeffding inequality to get submartingale concentration bounds.

Lemma 8. (Azuma-Hoeffding inequality) Let $Z_i$ be a submartingale with $|Z_i - Z_{i-1}| \le C$. Then
$$\Pr[Z_s - Z_0 \le -\lambda] \le \exp\Big(\frac{-\lambda^2}{2sC^2}\Big).$$

Applying Lemma 8, we can show
$$\Pr[\tilde{Z}_{\ell-1} - \tilde{Z}_0 \le -c\log n] \le \exp\Big(\frac{-c\log n}{4(1/\epsilon)^2\log^2(c\log n/\epsilon^4)}\Big) \le \epsilon^2, \qquad \text{(A.1)}$$
for $\epsilon \ge 1/\log^{0.25} n$ and $c \ge 1$.

We now bound the average error in the window 0 to $\ell - 1$. With failure probability at most $\epsilon^2$ over the randomness in the outputs, $\tilde{Z}_{\ell-1} - \tilde{Z}_0 \ge -c\log n$ by Eq. A.1. Let $S_2$ be the set of all sequences in $S_1$ which satisfy $\tilde{Z}_{\ell-1} - \tilde{Z}_0 \ge -c\log n$. Note that $\tilde{X}_{-1} = X_{-1} \ge -\log(1/\pi_1)$. Consider the last point after which $v_s$ decreases below $\epsilon^2$ and remains below it for every subsequent step in the window; let this point be $\tau$, and if there is no such point define $\tau$ to be $\ell - 1$. The total contribution of the error at every step after the $\tau$th step to the average error is at most a $2\epsilon^2$ term, as the error after this step is at most $2\epsilon^2$. Note that $X_\tau \le \log(1/\epsilon^2) \implies \tilde{X}_\tau \le \log(1/\epsilon^2)$, as $\tilde{X}_s \le X_s$.
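Definition 3 and the bound of Lemma 6 are easy to sanity-check numerically. The sketch below (ours, illustrative) computes the $C$-truncated KL divergence with $\log C = 8$ and verifies $\tilde{D}_C(\mu\|\nu) \ge \frac{1}{2}\|\mu-\nu\|_1^2$, both on a pair where the truncation actually binds and on random distribution pairs:

```python
import math, random

def truncated_kl(mu, nu, C):
    """C-truncated KL divergence E_mu[log min(mu(x)/nu(x), C)] (natural log)."""
    return sum(p * math.log(min(p / q, C)) for p, q in zip(mu, nu) if p > 0)

def l1_dist(mu, nu):
    return sum(abs(p - q) for p, q in zip(mu, nu))

rng = random.Random(1)
C = math.exp(8)                                   # log C = 8, as in Lemma 6

# A case where the truncation actually binds (mu(x)/nu(x) = 5*10^5 > C):
mu, nu = [0.5, 0.5], [1e-6, 1.0 - 1e-6]
assert truncated_kl(mu, nu, C) >= 0.5 * l1_dist(mu, nu) ** 2

# Random pairs of distributions on small supports:
for _ in range(1000):
    k = rng.randrange(2, 6)
    a = [rng.random() + 0.01 for _ in range(k)]
    b = [rng.random() + 0.01 for _ in range(k)]
    mu = [x / sum(a) for x in a]
    nu = [x / sum(b) for x in b]
    assert truncated_kl(mu, nu, C) >= 0.5 * l1_dist(mu, nu) ** 2 - 1e-12
```

Since $\min\{\mu/\nu, C\} \le \mu/\nu$, the truncated divergence never exceeds the ordinary KL divergence, so Lemma 6 is a genuine strengthening of Pinsker's inequality on the truncated quantity.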
Hence, for all sequences in $S_2$,
$$\tilde{X}_\tau \le \log(1/\epsilon^2) \implies \tilde{X}_\tau - \tilde{X}_{-1} \le \log(1/\epsilon^2) + \log(1/\pi_1) \stackrel{(a)}{\implies} 0.5\sum_{s=0}^{\tau}\delta_s^2 \le 2\log n + \log(1/\pi_1) + c\log n \stackrel{(b)}{\implies} 0.5\sum_{s=0}^{\tau}\delta_s^2 \le 2(c+1)\log n \le 4c\log n \stackrel{(c)}{\implies} \frac{\sum_{s=0}^{\ell-1}\delta_s^2}{c\log n/\epsilon^2} \le 8\epsilon^2 \stackrel{(d)}{\implies} \frac{\sum_{s=0}^{\ell-1}\delta_s}{c\log n/\epsilon^2} \le 3\epsilon,$$
where (a) follows by Eq. A.1 and because $\epsilon \ge 1/n$; (b) follows because $\log(1/\pi_1) \le c\log n$ and $c \ge 1$; (c) follows because the error at each step after $\tau$ is at most $2\epsilon^2$; and (d) follows from Jensen's inequality. As the total probability of sequences outside $S_2$ is at most $2\epsilon^2$, we get $\mathbb{E}\big[\frac{1}{\ell}\sum_{s=0}^{\ell-1}\delta_s\big] \le 4\epsilon$ whenever the hidden state at time 0 has probability at least $1/n^c$ under the prior distribution $\pi_0$.

A.2 Proof of Modified Pinsker's Inequality (Lemma 6)

Lemma 6. (Modified Pinsker's inequality) For any two distributions $\mu(x)$ and $\nu(x)$ defined on $x \in X$, define the $C$-truncated KL divergence as $\tilde{D}_C(\mu\|\nu) = \mathbb{E}_\mu\big[\log\min\{\frac{\mu(x)}{\nu(x)}, C\}\big]$ for some fixed $C$ such that $\log C \ge 8$. Then $\tilde{D}_C(\mu\|\nu) \ge \frac{1}{2}\|\mu-\nu\|_1^2$.

Proof. We rely on the following lemma, which bounds the KL divergence for binary distributions.

Lemma 9. For every $0 \le q \le p \le 1$, we have
1. $p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \ge 2(p-q)^2$.
2. $3p + (1-p)\log\frac{1-p}{1-q} \ge 2(p-q)^2$.

Proof. For the second result, first observe that $\log(1/(1-q)) \ge 0$ and $(p - q) \le p$ as $q \le p$. Both results then follow from standard calculus.

Let $A := \{x \in X : \mu(x) \ge \nu(x)\}$ and $B := \{x \in X : \mu(x) \ge C\nu(x)\}$. Let $\mu(A) = p$, $\mu(B) = \delta$, $\nu(A) = q$ and $\nu(B) = \beta$. Note that $\|\mu - \nu\|_1 = 2(\mu(A) - \nu(A))$. By the log-sum inequality,
$$\tilde{D}_C(\mu\|\nu) = \sum_{x\in B}\mu(x)\log C + \sum_{x\in A - B}\mu(x)\log\frac{\mu(x)}{\nu(x)} + \sum_{x\in X - A}\mu(x)\log\frac{\mu(x)}{\nu(x)} \ge \delta\log C + (p-\delta)\log\frac{p-\delta}{q-\beta} + (1-p)\log\frac{1-p}{1-q}.$$

1. Case 1: $0.5 \le \delta/p \le 1$. Then
$$\tilde{D}_C(\mu\|\nu) \ge \frac{p}{2}\log C + (1-p)\log\frac{1-p}{1-q} \ge 2(p-q)^2 = \frac{1}{2}\|\mu-\nu\|_1^2.$$

2. Case 2: $\delta/p < 0.5$. Then
$$\tilde{D}_C(\mu\|\nu) \ge \delta\log C + (p-\delta)\log\frac{p}{q-\beta} + (p-\delta)\log\Big(1-\frac{\delta}{p}\Big) + (1-p)\log\frac{1-p}{1-q} \ge \delta\log C + (p-\delta)\log\frac{p}{q-\beta} - \frac{2\delta(p-\delta)}{p} + (1-p)\log\frac{1-p}{1-q} \ge \delta(\log C - 2) + (p-\delta)\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$

(a) Sub-case 1: $\log\frac{p}{q} \ge 6$.
$$\tilde{D}_C(\mu\|\nu) \ge (p-\delta)\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \ge 3p + (1-p)\log\frac{1-p}{1-q} \ge 2(p-q)^2 = \frac{1}{2}\|\mu-\nu\|_1^2.$$

(b) Sub-case 2: $\log\frac{p}{q} < 6$.
$$\tilde{D}_C(\mu\|\nu) \ge \delta\Big(\log C - 2 - \log\frac{p}{q}\Big) + p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \ge 2(p-q)^2 = \frac{1}{2}\|\mu-\nu\|_1^2.$$

B Proof of Lower Bound for Large Alphabets

B.1 CSP formulation

We first review some notation for CSP problems, following the notation and setup of Feldman et al. [18]. Consider the following model for generating a random CSP instance on $n$ variables with a satisfying assignment $\sigma$. The $k$-CSP is defined by a predicate $P : \{0,1\}^k \to \{0,1\}$. We represent a $k$-clause by an ordered $k$-tuple of literals from $\{x_1, \dots, x_n, \bar{x}_1, \dots, \bar{x}_n\}$ with no repetition of variables, and let $X_k$ be the set of all such $k$-clauses. For a $k$-clause $C = (l_1, \dots, l_k)$, let $\sigma(C) \in \{0,1\}^k$ be the $k$-bit string of values assigned by $\sigma$ to the literals in $C$, that is, $(\sigma(l_1), \dots, \sigma(l_k))$, where $\sigma(l_i)$ is the value of the literal $l_i$ under the assignment $\sigma$. In the planted model we draw clauses with probabilities that depend on the value of $\sigma(C)$. Let $Q : \{0,1\}^k \to \mathbb{R}^+$ with $\sum_{t\in\{0,1\}^k} Q(t) = 1$ be some distribution over satisfying assignments to $P$.
The distribution $Q_\sigma$ is then defined as follows:
$$Q_\sigma(C) = \frac{Q(\sigma(C))}{\sum_{C'\in X_k} Q(\sigma(C'))}. \qquad \text{(B.1)}$$
Recall that for any distribution $Q$ over satisfying assignments, we define its complexity $r$ as the largest $r$ such that the distribution $Q$ is $(r-1)$-wise uniform (also referred to as $(r-1)$-wise independent in the literature) but not $r$-wise uniform.

Consider the CSP $\mathcal{C}$ defined by a collection of predicates $P(y)$, one for each $y \in \{0,1\}^m$ for some $m \le k/2$. Let $A \in \{0,1\}^{m\times k}$ be a matrix with full row rank over the binary field; we will later choose $A$ to ensure the CSP has high complexity. For each $y$, the predicate $P(y)$ is the set of solutions to the system $y = Av \bmod 2$, where $v = \sigma(C)$. For each $y$ we define $Q_y$ to be the uniform distribution over all consistent assignments, i.e. all $v \in \{0,1\}^k$ satisfying $y = Av \bmod 2$. The planted distribution $Q_{\sigma,y}$ is defined from $Q_y$ according to Eq. B.1. Each clause in $\mathcal{C}$ is chosen by first picking a $y$ uniformly at random and then a clause from the distribution $Q_{\sigma,y}$. For any planted $\sigma$ we define $Q_\sigma$ to be the distribution over all consistent clauses along with their labels $y$. Let $U_k$ be the uniform distribution over $k$-clauses, with each clause assigned a uniformly chosen label $y$. Define $Q^\eta_\sigma = (1-\eta)Q_\sigma + \eta U_k$ for some fixed noise level $\eta > 0$; we take $\eta$ to be a small constant less than 0.05. This corresponds to adding noise to the problem by mixing the planted and the uniform clauses. The problem gets harder as $\eta$ becomes larger; for $\eta = 0$ it can be solved efficiently using Gaussian elimination.

We also define another CSP $\mathcal{C}'$, which we show reduces to $\mathcal{C}$ and for which we can obtain hardness using Conjecture 1. In $\mathcal{C}'$ the label $y$ is fixed to be the all-zero vector. Hence $Q_0$, the distribution over satisfying assignments for $\mathcal{C}'$, is the uniform distribution over all vectors in the null space of $A$ over the binary field.
We refer to the planted distribution in this case as $Q_{\sigma,0}$. Let $U_{k,0}$ be the uniform distribution over $k$-clauses, with each clause now having the label 0. For any planted assignment $\sigma$, we denote the distribution of consistent clauses of $\mathcal{C}'$ by $Q_{\sigma,0}$. As before, define $Q^\eta_{\sigma,0} = (1-\eta)Q_{\sigma,0} + \eta U_{k,0}$ for the same $\eta$.

Let $L$ be the problem of distinguishing between $U_k$ and $Q^\eta_\sigma$, for a randomly and uniformly chosen $\sigma \in \{0,1\}^n$, with success probability at least $2/3$. Similarly, let $L'$ be the problem of distinguishing between $U_{k,0}$ and $Q^\eta_{\sigma,0}$, for a randomly and uniformly chosen $\sigma \in \{0,1\}^n$, with success probability at least $2/3$. $L$ and $L'$ can be thought of as the problems of distinguishing random instances of the CSPs from instances with a high value. Note that $L$ and $L'$ are at least as hard as the problem of refuting the random CSP instances $U_k$ and $U_{k,0}$, which corresponds to the case $\eta = 0$. We claim that an algorithm for $L$ implies an algorithm for $L'$.

Lemma 10. If $L$ can be solved in time $t(n)$ with $s(n)$ clauses, then $L'$ can be solved in time $O(t(n) + s(n))$ and $s(n)$ clauses.

Let the complexity of $Q_0$ be $\gamma k$, with $\gamma \ge 1/10$ (we show how to achieve this next). By Conjecture 1, distinguishing between $U_{k,0}$ and $Q^\eta_{\sigma,0}$ requires at least $\tilde{\Omega}(n^{\gamma k/2})$ clauses. We now discuss how $A$ can be chosen to ensure that the complexity of $Q_0$ is $\gamma k$.

B.2 Ensuring High Complexity of the CSP

Let $N$ be the null space of $A$; note that $N$ has rank $(k - m)$. For any subspace $D$, let $w(D) = (w_1, w_2, \dots, w_k)$ be a uniformly random vector from $D$. To ensure that $Q_0$ has complexity $\gamma k$, it suffices to show that the random variables $w(N) = (w_1, w_2, \dots, w_k)$ are $(\gamma k - 1)$-wise uniform. We use the theory of error correcting codes to find such a matrix $A$.
A binary linear code $B$ of length $k$ and rank $m$ is a linear subspace of $\mathbb{F}_2^k$ (our notation differs from the standard coding theory notation to suit our setting). The rate of the code is defined to be $m/k$. The generator matrix of the code is the matrix $G$ such that $B = \{Gv, v \in \{0,1\}^m\}$, and the parity check matrix is the matrix $H$ such that $B = \{c \in \{0,1\}^k : Hc = 0\}$. The distance $d$ of a code is the weight of the minimum-weight codeword, and the relative distance is $\delta = d/k$. For any code $B$ we define its dual code $B^\perp$ as the code with generator matrix $H^T$ and parity check matrix $G^T$; note that the dual of a code of rank $m$ has rank $(k - m)$. We use the following standard fact about linear codes.

Fact 1. If $B^\perp$ has distance $l$, then $w(B)$ is $(l-1)$-wise uniform.

Hence, finding $A$ reduces to finding a dual code with distance $\gamma k$ and rank $m$, where $\gamma = 1/10$ and $m \le k/2$. We use the Gilbert-Varshamov bound to argue for the existence of such a code. Let $H(p)$ denote the binary entropy function.

Lemma 11. (Gilbert-Varshamov bound) For every $0 \le \delta < 1/2$ and $0 < \epsilon \le 1 - H(\delta)$, there exists a code with rank $m$ and relative distance $\delta$ if $m/k = 1 - H(\delta) - \epsilon$.

Taking $\delta = 1/10$, we have $H(\delta) \le 0.5$, hence such a code $B$ exists whenever $m/k \le 0.5$, which is the setting we are interested in. We choose $A = G^T$, where $G$ is the generator matrix of $B$. The null space of $A$ is then $(k/10 - 1)$-wise uniform, so the complexity of $Q_0$ is $\gamma k$ with $\gamma \ge 1/10$. Hence for all $k$ and $m \le k/2$ we can find an $A \in \{0,1\}^{m\times k}$ ensuring that the complexity of $Q_0$ is $\gamma k$.

B.3 Sequential Model of CSP and Sample Complexity Lower Bound

We now construct a sequential model which derives hardness from the hardness of $L$.
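Fact 1 can be verified directly on a toy example. In the sketch below (ours, illustrative) we take $A = (1\,1\,1\,1)$, so the null space of $A$ is the even-weight code, whose dual $\{0000, 1111\}$ has distance 4; a uniform vector from the null space should therefore be 3-wise uniform but not 4-wise uniform:

```python
from itertools import product, combinations
from collections import Counter

A = [[1, 1, 1, 1]]      # m = 1, k = 4; dual code {0000, 1111} has distance 4
k = 4

# Enumerate the null space of A over GF(2): the even-weight vectors.
null_space = [v for v in product([0, 1], repeat=k)
              if all(sum(a * b for a, b in zip(row, v)) % 2 == 0 for row in A)]
assert len(null_space) == 2 ** (k - len(A))      # rank k - m = 3, so 8 vectors

# 3-wise uniformity: every 3-bit pattern appears equally often
# when projecting onto any 3 coordinates.
for coords in combinations(range(k), 3):
    hist = Counter(tuple(v[i] for i in coords) for v in null_space)
    assert len(hist) == 8 and all(c == len(null_space) // 8 for c in hist.values())

# Not 4-wise uniform: odd-weight patterns never occur.
assert (1, 0, 0, 0) not in null_space
```

So the complexity of the induced distribution $Q_0$ in this toy case is exactly 4, matching the dual distance.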
Here we deviate slightly from the outline presented at the beginning of Section 5: we cannot base our sequential model directly on $L$, since generating random $k$-tuples without repetition increases the mutual information, so we formulate a slight variation $L''$ of $L$ which we show is at least as hard as $L$. We did not define our CSP instance to allow repetition because that differs from the setting examined in Feldman et al. [18], and hardness of the setting with repetition does not follow from hardness of the setting without repetition, though the converse is true.

B.3.1 Constructing the sequential model

Consider the following family of sequential models $R(n, A_{m\times k})$, where $A \in \{0,1\}^{m\times k}$ is chosen as defined previously. The output alphabet of all models in the family is $X = \{a_i, 1 \le i \le 2n\}$, of size $2n$, with $2n/k$ even. We choose a subset $S$ of $X$ of size $n$; each choice of $S$ corresponds to a model $M$ in the family. Each letter of the output alphabet is encoded by a bit indicating whether the letter is included in the set $S$; let $u \in \{0,1\}^{2n}$ be the vector storing this encoding, so $u_i = 1$ whenever the letter $a_i$ is in $S$. Let $\sigma \in \{0,1\}^n$ determine the subset $S$: for all $i$, entry $u_{2i-1}$ is 1 and $u_{2i}$ is 0 when $\sigma_i = 1$, and $u_{2i-1}$ is 0 and $u_{2i}$ is 1 when $\sigma_i = 0$. We choose $\sigma$ uniformly at random from $\{0,1\}^n$; each choice of $\sigma$ represents some subset $S$, and hence some model $M$. We partition the output alphabet $X$ into $k$ subsets of size $2n/k$ each, with the first $2n/k$ letters going to the first subset, the next $2n/k$ to the second, and so on. Let the $i$th subset be $X_i$, and let $S_i$ be the set of elements of $X_i$ which belong to $S$. At time 0, $M$ chooses $v \in \{0,1\}^k$ uniformly at random.
At time $i$, for $i \in \{0, \dots, k-1\}$, if $v_i = 1$ the model chooses a letter uniformly at random from the set $S_i$; otherwise, if $v_i = 0$, it chooses a letter uniformly at random from $X_i - S_i$. With probability $(1-\eta)$ the outputs for the next $m$ time steps, from $k$ to $(k+m-1)$, are $y = Av \bmod 2$; with probability $\eta$ they are $m$ uniformly random bits. The model resets at time $(k+m-1)$ and repeats the process. Recall that $I(M)$ is at most $m$, and $M$ can be simulated by an HMM with $2^m(2k+m) + m$ hidden states (see Section 5.1).

B.3.2 Reducing the sequential model to a CSP instance

We reveal the matrix $A$ to the algorithm (this corresponds to revealing the transition matrix of the underlying HMM), but the encoding $\sigma$ is kept secret. The task of finding the encoding $\sigma$ given samples from $M$ can naturally be seen as a CSP. Each sample is a clause in which the literal corresponding to the output letter $a_i$ is $x_{(i+1)/2}$ whenever $i$ is odd and $\bar{x}_{i/2}$ whenever $i$ is even. We refer the reader to the outline at the beginning of the section for an example. We denote by $\mathcal{C}''$ the CSP $\mathcal{C}$ with the modification that the $i$th literal of each clause is the literal corresponding to a letter in $X_i$, for all $1 \le i \le k$. Define $Q''_\sigma$ as the distribution of consistent clauses for the CSP $\mathcal{C}''$, and define $U''_k$ as the uniform distribution over $k$-clauses with the additional constraint that the $i$th literal of each clause corresponds to a letter in $X_i$, for all $1 \le i \le k$. Define $Q''^\eta_\sigma = (1-\eta)Q''_\sigma + \eta U''_k$, and let $L''$ be the problem of distinguishing between $U''_k$ and $Q''^\eta_\sigma$. Note that samples from the model $M$ are equivalent to clauses drawn from $Q''^\eta_\sigma$. We show that hardness of $L''$ follows from hardness of $L$.

Lemma 12. If $L''$ can be solved in time $t(n)$ with $s(n)$ clauses, then $L$ can be solved in time $t(n)$ with $O(s(n))$ clauses. Hence, if Conjecture 1 is true, $L''$ cannot be solved in polynomial time with fewer than $\tilde{\Omega}(n^{\gamma k/2})$ clauses.
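A sampler for one block of the model $R(n, A_{m\times k})$ of Section B.3.1 can be sketched as follows (our illustrative code, with 0-indexed letters, assuming $k$ divides $2n$; the blockwise application of the noise matches the description above):

```python
import random

def sample_R(A, sigma, eta, rng):
    """One block of outputs from the sequential model R(n, A_{m x k}):
    alphabet {0, ..., 2n-1} is partitioned into k buckets X_i of size 2n/k,
    sigma in {0,1}^n puts one letter of each pair (2i, 2i+1) into S, and
    step i emits a letter from S_i if v_i = 1, else from X_i - S_i."""
    m, k = len(A), len(A[0])
    n = len(sigma)
    u = []                                   # u[j] = 1 iff letter j is in S
    for bit in sigma:
        u += [1, 0] if bit else [0, 1]
    bucket = 2 * n // k
    v = [rng.randrange(2) for _ in range(k)]
    out = []
    for i in range(k):
        letters = range(i * bucket, (i + 1) * bucket)
        choices = [a for a in letters if u[a] == v[i]]   # S_i or X_i - S_i
        out.append(rng.choice(choices))
    if rng.random() < eta:                   # noisy block: m uniform bits
        y = [rng.randrange(2) for _ in range(m)]
    else:                                    # parity labels y = A v mod 2
        y = [sum(a * b for a, b in zip(row, v)) % 2 for row in A]
    return out, v, y

rng = random.Random(0)
A = [[1, 0, 1, 1], [0, 1, 1, 0]]             # full row rank over GF(2)
sigma = [1, 0, 1, 0]                          # secret encoding, n = 4
out, v, y = sample_R(A, sigma, eta=0.0, rng=rng)
assert y == [sum(a * b for a, b in zip(row, v)) % 2 for row in A]
```

With $\eta = 0$ the labels are exact parities of $v$; the observer sees only `out` and `y`, while $\sigma$ (equivalently, which letter of each pair lies in $S$) stays hidden.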
We can now prove Theorem 2 using Lemma 12.

Theorem 2. Assuming Conjecture 1, for all sufficiently large $T$ and $1/T^c < \epsilon \le 0.1$ for some fixed constant $c$, there exists a family of HMMs with $T$ hidden states and an output alphabet of size $n$ such that any prediction algorithm that achieves average KL error, $\ell_1$ error, or relative zero-one error less than $\epsilon$ with probability greater than $2/3$ for a randomly chosen HMM in the family, and runs in time $f(T,\epsilon)\cdot n^{g(T,\epsilon)}$ for any functions $f$ and $g$, requires $n^{\Omega(\log T/\epsilon)}$ samples from the HMM.

Proof. We describe how to choose the family of sequential models $R(n, A_{m\times k})$ for each value of $\epsilon$ and $T$. Recall that the HMM has $T = 2^m(2k+m) + m$ hidden states. Let $T' = 2^{m+2}(k+m)$; note that $T' \ge T$. Let $t = \log T'$. We choose $m = t - \log(1/\epsilon) - \log(t/5)$, and $k$ to be the solution of $t = m + \log(k+m) + 2$, hence $k = t/(5\epsilon) - m - 2$. Note that for $\epsilon \le 0.1$ we have $k \ge m$. Let $\epsilon' = \frac{2m}{9(k+m)}$. We claim that $\epsilon \le \epsilon'$. To verify, note that $k + m = t/(5\epsilon) - 2$. Therefore,
$$\epsilon' = \frac{2m}{9(k+m)} = \frac{10\epsilon\,(t - \log(1/\epsilon) - \log(t/5))}{9t(1 - 10\epsilon/t)} \ge \epsilon,$$
for sufficiently large $t$ and $\epsilon \ge 2^{-ct}$ for a fixed constant $c$. Hence proving hardness of obtaining error $\epsilon'$ implies hardness of obtaining error $\epsilon$. We choose the matrix $A_{m\times k}$ as outlined earlier, and for each vector $\sigma \in \{0,1\}^n$ we define the family of sequential models $R(n, A)$ as earlier. Let $M$ be a randomly chosen model in the family.

We first show the result for the relative zero-one loss. The idea is that any algorithm which does a good job of predicting the outputs from time $k$ through $(k+m-1)$ can be used to distinguish between instances of the CSP with a high value and uniformly random clauses, since it is not possible to make good predictions on uniformly random clauses.
We relate the zero-one error from time k through (k + m − 1) to the relative zero-one error over those same time steps, and to the average zero-one error over all time steps, to get the required lower bounds. Let ρ_01(A) be the average zero-one loss of some polynomial time algorithm A for the output time steps k through (k + m − 1), and let δ′_01(A) be the average relative zero-one loss of A for the output time steps k through (k + m − 1) with respect to the optimal predictions. For the distribution U′_k it is not possible to get ρ_01(A) < 0.5, as the clauses and the label y are independent and y is chosen uniformly at random from {0,1}^m. For Q′^η_σ it is information-theoretically possible to get ρ_01(A) = η/2. Hence any algorithm which gets error ρ_01(A) ≤ 2/5 can be used to distinguish between U′_k and Q′^η_σ. Therefore, by Lemma 12, any polynomial time algorithm which gets ρ_01(A) ≤ 2/5 with probability greater than 2/3 over the choice of M needs at least Ω̃(n^{γk/2}) samples.

Note that δ′_01(A) = ρ_01(A) − η/2. As the optimal predictor P_∞ gets ρ_01(P_∞) = η/2 < 0.05, it follows that δ′_01(A) ≤ 1/3 ⟹ ρ_01(A) ≤ 2/5. Note that δ_01(A) ≥ δ′_01(A) · m/(k + m): δ_01(A) is the average error over all (k + m) time steps, and the contribution to the error from time steps 0 to (k − 1) is non-negative. Also, (1/3) · m/(k + m) > ε′; therefore δ_01(A) < ε′ ⟹ δ′_01(A) < 1/3 ⟹ ρ_01(A) ≤ 2/5. Hence any polynomial time algorithm which gets average relative zero-one loss less than ε′ with probability greater than 2/3 needs at least Ω̃(n^{γk/2}) samples.

The result for ℓ₁ loss follows directly from the result for relative zero-one loss; we next consider the KL loss. Let δ′_KL(A) be the average KL error of the algorithm A from time steps k through (k + m − 1).
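The next step bounds δ′_01(A) in terms of δ′_KL(A). The chain of inequalities below is our sketch of that standard step, assuming (as is implicit in the definitions) that the per-step relative zero-one error is bounded by the total variation distance between the algorithm's conditional distribution P_t and the optimal one Q_t; the notation P_t, Q_t is ours, not the paper's.

```latex
% Relative zero-one error <= total variation, Pinsker, then Jensen:
\delta'_{01}(\mathcal{A})
  \;\le\; \frac{1}{m}\sum_{t=k}^{k+m-1} \|P_t - Q_t\|_{TV}
  \;\le\; \frac{1}{m}\sum_{t=k}^{k+m-1} \sqrt{\tfrac{1}{2}\,\mathrm{KL}(Q_t \,\|\, P_t)}
  \;\le\; \sqrt{\tfrac{1}{2}\,\delta'_{KL}(\mathcal{A})}.
% Hence \delta'_{KL}(\mathcal{A}) \le 2/9 gives
% \delta'_{01}(\mathcal{A}) \le \sqrt{1/9} = 1/3.
```

The last step is Jensen's inequality, moving the concave square root outside the average.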
By an application of Jensen's inequality and Pinsker's inequality, δ′_KL(A) ≤ 2/9 ⟹ δ′_01(A) ≤ 1/3. Therefore, by our previous argument, any algorithm which gets δ′_KL(A) < 2/9 needs Ω̃(n^{γk/2}) samples. But as before, δ_KL(A) ≤ ε′ ⟹ δ′_KL(A) ≤ 2/9. Hence any polynomial time algorithm which succeeds with probability greater than 2/3 and gets average KL loss less than ε′ needs at least Ω̃(n^{γk/2}) samples.

We lower bound k by a linear function of log T/ε to express the result directly in terms of log T/ε. We claim that log T/ε is at most 15k. This follows because

log T/ε ≤ t/ε = 5(k + m) + 10 ≤ 15k.

Hence any polynomial time algorithm needs n^{Ω(log T/ε)} samples to get average relative zero-one loss, ℓ₁ loss, or KL loss less than ε on M.

B.4 Proof of Lemma 10

Lemma 10. If L can be solved in time t(n) with s(n) clauses, then L′ can be solved in time O(t(n) + s(n)) with s(n) clauses.

Proof. We show that a random instance of C′ can be transformed into a random instance of C in time s(n)·O(k), by independently transforming every clause C in C′ into a clause C′ in C such that C is satisfied in the original CSP C′ with some assignment t to x if and only if the corresponding clause C′ in C is satisfied with the same assignment t to x. For every y ∈ {0,1}^m we pre-compute and store a random solution of the system y = Av mod 2; let the solution be v(y). Given any clause C = (x_1, x_2, ..., x_k) in C′, choose y ∈ {0,1}^m uniformly at random. We generate a clause C′ = (x′_1, x′_2, ..., x′_k) in C from the clause C in C′ by choosing the literal x′_i = x̄_i if v_i(y) = 1 and x′_i = x_i if v_i(y) = 0. By the linearity of the system, the clause C′ is a consistent clause of C with some assignment x = t if and only if the clause C was a consistent clause of C′ with the same assignment x = t.
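The linearity step can be checked numerically. The sketch below uses hypothetical sizes and random choices of A, u, v purely for illustration: it verifies that XOR-ing a label-0 negation pattern u (a null-space vector of A over GF(2)) with a solution v of y = Av mod 2 yields a pattern whose label is exactly y.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 6                                # illustrative sizes
A = rng.integers(0, 2, size=(m, k))

# u: a nonzero negation pattern with label 0, i.e. A u = 0 (mod 2).
# Brute force is fine here: the null space has dimension >= k - m = 3.
for bits in range(1, 2 ** k):
    u = np.array([(bits >> i) & 1 for i in range(k)])
    if not (A @ u % 2).any():
        break

# v: a solution of y = A v (mod 2), constructed by picking v first.
v = rng.integers(0, 2, size=k)
y = A @ v % 2

# Adding the two patterns over GF(2) gives a pattern with label y:
w = u ^ v
assert ((A @ w) % 2 == y).all()            # linearity: A(u + v) = Au + Av (mod 2)
```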
We next claim that C′ is a randomly generated clause from the distribution U_k if C was drawn from U_{k,0}, and is a randomly generated clause from the distribution Q_σ if C was drawn from Q_{σ,0}. By our construction, the label y of the clause is chosen uniformly at random. Note that choosing a clause uniformly at random from U_{k,0} is equivalent to first uniformly choosing a k-tuple of un-negated literals and then choosing a negation pattern for the literals uniformly at random. A clause is still uniformly random after adding another negation pattern if it was uniformly random before. Hence, if the original clause C was drawn from the uniform distribution U_{k,0}, then C′ is distributed according to U_k. Similarly, choosing a clause uniformly at random from Q_{σ,y} for some y is equivalent to first uniformly choosing a k-tuple of un-negated literals and then choosing, uniformly at random, a negation pattern which makes the clause consistent. As the original negation pattern corresponds to a v randomly chosen from the null space of A, the final negation pattern, on adding v(y), corresponds to the negation pattern of a uniformly random solution of y = Av mod 2 for the chosen y. Therefore, the clause C′ is a uniformly random clause from Q_{σ,y} if C is a uniformly random clause from Q_{σ,0}. Hence, if it is possible to distinguish U_k and Q^η_σ for some randomly chosen σ ∈ {0,1}^n with success probability at least 2/3 in time t(n) with s(n) clauses, then it is possible to distinguish between U_{k,0} and Q^η_{σ,0} for some randomly chosen σ ∈ {0,1}^n with success probability at least 2/3 in time t(n) + s(n)·O(k) with s(n) clauses.

B.5 Proof of Lemma 12

Lemma 12. If L′ can be solved in time t(n) with s(n) clauses, then L can be solved in time t(n) with O(s(n)) clauses.
Hence, if Conjecture 1 is true, then L′ cannot be solved in polynomial time with fewer than Ω̃(n^{γk/2}) clauses.

Proof. Define E to be the event that a clause generated from the distribution Q_σ of the CSP C has the property that, for all i, the i-th literal belongs to the set X_i; for notational ease we also refer to this property of the clause as E. It is easy to verify that the probability of the event E is 1/k^k. We claim that, conditioned on the event E, the CSPs C and C′ are equivalent. This is verified as follows. Note that for all y, Q_{σ,y} and Q′_{σ,y} are uniform on all consistent clauses. Let U be the set of all clauses with non-zero probability under Q_{σ,y}, and U′ the set of all clauses with non-zero probability under Q′_{σ,y}. Furthermore, for any v which satisfies the constraint y = Av mod 2, let U(v) be the set of clauses C ∈ U such that σ(C) = v. Similarly, let U′(v) be the set of clauses C ∈ U′ such that σ(C) = v. The subset of clauses in U(v) which satisfy E is exactly the set U′(v). As this holds for every consistent v, and the distributions Q′_{σ,y} and Q_{σ,y} are uniform on all consistent clauses, the distribution of clauses from Q_σ conditioned on the event E is identical to the distribution of clauses from Q′_σ. The equivalence of U_k and U′_k conditioned on E follows from the same argument. Note that, as the k-tuples in C are chosen uniformly at random from satisfying k-tuples, with high probability there are s(n) tuples having property E if there are O(k^k s(n)) clauses in C. As the problems L and L′ are equivalent conditioned on the event E, if L′ can be solved in time t(n) with s(n) clauses, then L can be solved in time t(n) with O(k^k s(n)) clauses. From Lemma 10 and Conjecture 1, L cannot be solved in polynomial time with fewer than Ω̃(n^{γk/2}) clauses.
Hence L′ cannot be solved in polynomial time with fewer than Ω̃(n^{γk/2}/k^k) clauses. As k is a constant with respect to n, L′ cannot be solved in polynomial time with fewer than Ω̃(n^{γk/2}) clauses.

C Proof of Lower Bound for Small Alphabets

C.1 Proof of Lemma 3

Lemma 3. Let A be chosen uniformly at random from the set S. Then, with probability at least (1 − 1/n) over the choice of A ∈ S, any (randomized) algorithm that can distinguish the outputs from the model M(A) from the distribution over random examples U_n with success probability greater than 2/3, over the randomness of the examples and the algorithm, needs f(n) time or examples.

Proof. Suppose A ∈ {0,1}^{m×n} is chosen at random with each entry i.i.d. and uniform on {0,1}. Recall that S is the set of all (m × n) matrices A which have full row rank. We claim that Pr(A ∈ S) ≥ 1 − m·2^{−n/2}. To verify, consider adding the rows of A one at a time. The probability of the i-th row being linearly dependent on the previous (i − 1) rows is 2^{i−1−n}. Hence, by a union bound, A has full row rank with failure probability at most m·2^{m−n} ≤ m·2^{−n/2}, using m ≤ n/2. From Definition 2 and a union bound over all the m ≤ n/2 parities, any algorithm that can distinguish the outputs from the model M(A), for uniformly chosen A, from the distribution over random examples U_n with probability at least (1 − 1/(2n)) over the choice of A needs f(n) time or examples. As Pr(A ∈ S) ≥ 1 − m·2^{−n/2} for a uniformly random A, with probability at least (1 − 1/(2n) − m·2^{−n/2}) ≥ (1 − 1/n) over the choice of A ∈ S, any algorithm that can distinguish the outputs of the model M(A) from the distribution over random examples U_n with success probability greater than 2/3, over the randomness of the examples and the algorithm, needs f(n) time or examples.
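The full-row-rank claim is easy to sanity-check. The sketch below (hypothetical sizes m, n, chosen only for illustration) computes row rank over GF(2) by Gaussian elimination and estimates the fraction of random 0/1 matrices with full row rank, which should be at least 1 − m·2^{m−n}.

```python
import numpy as np

def gf2_rank(M):
    """Row rank of a 0/1 matrix over GF(2), by Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue                      # no pivot in this column
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]           # eliminate over GF(2) with XOR
        rank += 1
        if rank == rows:
            break
    return rank

rng = np.random.default_rng(0)
m, n, trials = 5, 20, 2000                # illustrative sizes with m <= n/2
full = sum(gf2_rank(rng.integers(0, 2, size=(m, n))) == m for _ in range(trials))
print(full / trials)                      # should be very close to 1
```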
C.2 Proof of Proposition 2

Proposition 2. With f(T) as defined in Definition 2, for all sufficiently large T and 1/T^c < ε ≤ 0.1 for some fixed constant c, there exists a family of HMMs with T hidden states such that any algorithm that achieves average relative zero-one loss, average ℓ₁ loss, or average KL loss less than ε with probability greater than 2/3 for a randomly chosen HMM in the family requires f(Ω(log T/ε)) time or samples from the HMM.

Proof. We describe how to choose the family of sequential models A_{m×n} for each value of ε and T. Recall that the HMM has T = 2^m(2n + m) + m hidden states. Let T′ = 2^{m+2}(n + m); note that T′ ≥ T. Let t = log T′. We choose m = t − log(1/ε) − log(t/5), and n to be the solution of t = m + log(n + m) + 2; hence n = t/(5ε) − m − 2. Note that for ε ≤ 0.1, n ≥ m. Let ε′ = (2/9) · m/(n + m). We claim ε ≤ ε′. To verify, note that n + m = t/(5ε) − 2. Therefore,

ε′ = 2m / (9(n + m)) = 10ε(t − log(1/ε) − log(t/5)) / (9t(1 − 10ε/t)) ≥ ε,

for sufficiently large t and ε ≥ 2^{−ct} for a fixed constant c. Hence proving hardness for obtaining error ε′ implies hardness for obtaining error ε. We choose the matrix A_{m×n} as outlined earlier; the family is defined by the model M(A_{m×n}) with the matrix A_{m×n} chosen uniformly at random from the set S. Let ρ_01(A) be the average zero-one loss of some algorithm A for the output time steps n through (n + m − 1), and let δ′_01(A) be the average relative zero-one loss of A for the output time steps n through (n + m − 1) with respect to the optimal predictions. For the distribution U_n it is not possible to get ρ_01(A) < 0.5, as the clauses and the label y are independent and y is chosen uniformly at random from {0,1}^m. For Q^η_s it is information-theoretically possible to get ρ_01(A) = η/2.
Hence any algorithm which gets error ρ_01(A) ≤ 2/5 can be used to distinguish between U_n and Q^η_s. Therefore, by Lemma 3, any algorithm which gets ρ_01(A) ≤ 2/5 with probability greater than 2/3 over the choice of M(A) needs at least f(n) time or samples. Note that δ′_01(A) = ρ_01(A) − η/2. As the optimal predictor P_∞ gets ρ_01(P_∞) = η/2 < 0.05, it follows that δ′_01(A) ≤ 1/3 ⟹ ρ_01(A) ≤ 2/5. Note that δ_01(A) ≥ δ′_01(A) · m/(n + m): δ_01(A) is the average error over all (n + m) time steps, and the contribution to the error from time steps 0 to (n − 1) is non-negative. Also, (1/3) · m/(n + m) > ε′; therefore δ_01(A) < ε′ ⟹ δ′_01(A) < 1/3 ⟹ ρ_01(A) ≤ 2/5. Hence any algorithm which gets average relative zero-one loss less than ε′ with probability greater than 2/3 over the choice of M(A) needs f(n) time or samples.

The result for ℓ₁ loss follows directly from the result for relative zero-one loss; we next consider the KL loss. Let δ′_KL(A) be the average KL error of the algorithm A from time steps n through (n + m − 1). By an application of Jensen's inequality and Pinsker's inequality, δ′_KL(A) ≤ 2/9 ⟹ δ′_01(A) ≤ 1/3. Therefore, by our previous argument, any algorithm which gets δ′_KL(A) < 2/9 needs f(n) samples. But as before, δ_KL(A) ≤ ε′ ⟹ δ′_KL(A) ≤ 2/9. Hence any algorithm which gets average KL loss less than ε′ needs f(n) time or samples.

We lower bound n by a linear function of log T/ε to express the result directly in terms of log T/ε. We claim that log T/ε is at most 15n. This follows because

log T/ε ≤ t/ε = 5(n + m) + 10 ≤ 15n.

Hence any algorithm needs f(Ω(log T/ε)) samples and time to get average relative zero-one loss, ℓ₁ loss, or KL loss less than ε with probability greater than 2/3 over the choice of M(A).

D Proof of Information-Theoretic Lower Bound

Proposition 3.
There is an absolute constant c such that for all 0 < ε < 0.5 and sufficiently large n, there exists an HMM with n states such that it is not information-theoretically possible to get average relative zero-one loss or ℓ₁ loss less than ε using windows of length smaller than c·log n/ε², or KL loss less than ε using windows of length smaller than c·log n/ε.

Proof. Consider a Hidden Markov Model whose Markov chain is a permutation on n states. The output alphabet of each hidden state is binary. Each state i is marked with a label l_i which is 0 or 1; let G(i) be the mapping from hidden state h_i to its label l_i. All the states labeled 1 emit 1 with probability (0.5 + ε) and 0 with probability (0.5 − ε). Similarly, all the states labeled 0 emit 0 with probability (0.5 + ε) and 1 with probability (0.5 − ε). Fig. 3 illustrates the construction and provides the high-level proof idea.

Figure 3: Lower bound construction, ℓ = 3, n = 16. A note on notation used in the rest of the proof with respect to this example: r(0) corresponds to the label of h_0, h_1 and h_2, and is (0, 1, 0) in this case. Similarly, r(1) = (1, 1, 0) in this case. The segments between the shaded nodes comprise the set S_1 and are the possible sequences of states from which the last ℓ = 3 outputs could have come. The shaded nodes correspond to the states in S_2, and are the possible predictions for the next time step. In this example S_1 = {(0,1,0), (1,1,0), (0,1,0), (1,1,1)} and S_2 = {1, 1, 0, 0}.

Assume n is a multiple of (ℓ + 1), where (ℓ + 1) = c·log n/ε², for a constant c = 1/33. We will regard ε as a constant with respect to n. Let n/(ℓ + 1) = t. We refer to the hidden states by h_i, where 0 ≤ i ≤ (n − 1), and h_i^j refers to the sequence of hidden states i through j. We will show that a model looking at only the past ℓ outputs cannot get average zero-one loss less than 0.5 − o(1). As the optimal prediction looking at all past outputs gets average zero-one loss 0.5 − ε + o(1) (the hidden state at each time step can be determined with arbitrarily high probability if we are allowed to look at an arbitrarily long past), this proves that windows of length ℓ do not suffice to get average zero-one error less than ε − o(1) with respect to the optimal predictions.

Note that the Bayes-optimal prediction at time (ℓ + 1), minimizing the expected zero-one loss given the outputs from time 1 to ℓ, is to predict the mode of the distribution Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ), where s_1^ℓ is the sequence of outputs from time 1 to ℓ. Also, note that Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ) = Σ_i Pr(h_{i_ℓ} = i | x_1^ℓ = s_1^ℓ) Pr(x_{ℓ+1} | h_{i_ℓ} = i), where h_{i_ℓ} is the hidden state at time ℓ. Hence the prediction is a weighted average of the predictions of the hidden states, the weight being the probability of being at that hidden state.

We index each state h_i of the permutation by a tuple (f(i), g(i)) = (j, k), where j = i mod (ℓ + 1) and k = ⌊i/(ℓ + 1)⌋; hence 0 ≤ j ≤ ℓ, 0 ≤ k ≤ (t − 1), and i = k(ℓ + 1) + j. We help the predictor make the prediction at time (ℓ + 1) by providing it with the index f(i_ℓ) = i_ℓ mod (ℓ + 1) of the true hidden state h_{i_ℓ} at time ℓ. This narrows down the set of possible hidden states at time ℓ (in Fig. 3, the set of possible states given this side information are all the hidden states before the shaded states). The Bayes-optimal prediction at time (ℓ + 1), given outputs s_1^ℓ from time 1 to ℓ and index f(i_ℓ) = j, is to predict the mode of Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ, f(i_ℓ) = j). By the definition of Bayes optimality, the average zero-one loss of the prediction using Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ, f(i_ℓ) = j) cannot be worse than the average zero-one loss of the prediction using Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ).
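The emission process of this construction can be sketched in a few lines. This is only an illustration: the values of n, ε, and the sequence length are hypothetical, the labels are random rather than chosen adversarially, and the permutation is taken to be a single cycle.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, T = 16, 0.2, 12               # illustrative: n states, emission bias eps
labels = rng.integers(0, 2, size=n)   # 0/1 label G(i) of each hidden state

state = rng.integers(n)
outputs = []
for _ in range(T):
    # Emit the state's label with probability 0.5 + eps, its complement otherwise.
    flip = int(rng.random() < 0.5 - eps)
    outputs.append(int(labels[state]) ^ flip)
    state = (state + 1) % n           # the Markov chain is a permutation (one cycle here)
```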
Hence we only need to show that the predictor with access to this side information is poor. We refer to this predictor, which uses Pr(x_{ℓ+1} | x_1^ℓ = s_1^ℓ, f(i_ℓ) = j), as P. We will now show that there exists some permutation for which the average zero-one loss of the predictor P is 0.5 − o(1). We argue this using the probabilistic method: we choose a permutation uniformly at random from the set of all permutations, and show that the expected average zero-one loss of the predictor P, over the randomness in choosing the permutation, is 0.5 − o(1). This means that there must exist some permutation such that the average zero-one loss of the predictor P on that permutation is 0.5 − o(1).

To find the expected average zero-one loss of the predictor P over the randomness in choosing the permutation, we will find the expected average zero-one loss of the predictor P given that we are in some state h_{i_ℓ} at time ℓ. Without loss of generality, let f(i_ℓ) = ℓ − 1 and g(i_ℓ) = 0; hence we were at the (ℓ − 1)-th hidden state at time ℓ. Fix any sequence of labels for the hidden states h_0^{ℓ−1}. For any string s_0^{ℓ−1} emitted by the hidden states h_0^{ℓ−1} from time 0 to ℓ − 1, let E[δ(s_0^{ℓ−1})] be the expected average zero-one error of the predictor P over the randomness in the rest of the permutation. Also, let E[δ(h_{ℓ−1})] = Σ_{s_0^{ℓ−1}} E[δ(s_0^{ℓ−1})] Pr[s_0^{ℓ−1}] be the expected error averaged across all outputs. We will argue that E[δ(h_{ℓ−1})] = 0.5 − o(1).

The set of hidden states h_i with g(i) = k defines a segment of the permutation; let r(k) = G(h_{k(ℓ+1)}^{k(ℓ+1)+ℓ−1}) be the label of segment k, excluding its last bit, which corresponds to the predictions.
Let S_1 = {r(k), ∀k ≠ 0} be the set of all the labels excluding the first label r(0), and let S_2 = {G(h_{k(ℓ+1)+ℓ}), ∀k} be the set of all the predicted bits (refer to Fig. 3 for an example). Consider any assignment of r(0). To begin, we show that with high probability over the output s_0^{ℓ−1}, the Hamming distance D(s_0^{ℓ−1}, r(0)) of the output s_0^{ℓ−1} of the hidden states h_0^{ℓ−1} from r(0) is at least ℓ/2 − 2εℓ. This follows directly from Hoeffding's inequality^5, as the outputs are independent conditioned on the hidden states:

Pr[D(s_0^{ℓ−1}, r(0)) ≤ ℓ/2 − 2εℓ] ≤ e^{−2ℓε²} ≤ n^{−2c}.   (D.1)

We now show that for any k ≠ 0, with decent probability the label r(k) of segment k is closer in Hamming distance to the output s_0^{ℓ−1} than r(0) is. We then argue that with high probability there are many such segments closer to s_0^{ℓ−1} in Hamming distance than r(0). These other segments are then assigned at least as much weight in predicting the next output as r(0), which means that the output cannot be predicted with high accuracy, as the output bits corresponding to different segments are independent.

We first find the probability that the segment corresponding to some k ≠ 0, with label r(k), has Hamming distance less than ℓ/2 − √(ℓ log t / 8) from any fixed binary string x of length ℓ. Let F(l, m, p) be the probability of getting at least l heads in m i.i.d. trials, each trial having probability p of giving a head. F(l, m, p) is bounded below by the following standard inequality:

F(l, m, p) ≥ (1/√(2m)) exp(−m · D_KL(l/m ‖ p)),

where D_KL(q ‖ p) = q log(q/p) + (1 − q) log((1 − q)/(1 − p)).

^5 For n independent random variables {X_i} lying in the interval [0, 1], with X̄ = (1/n)Σ_i X_i: Pr[X̄ ≤ E[X̄] − t] ≤ e^{−2nt²}. In our case t = ε and n = ℓ.
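The binomial-tail lower bound above is easy to sanity-check numerically. The sketch below (with arbitrary illustrative values of l and m, not taken from the proof) compares the exact tail F(l, m, 1/2) against (1/√(2m))·exp(−m·D_KL(l/m ‖ 1/2)):

```python
import math

def binom_tail(l, m, p):
    """F(l, m, p): probability of at least l heads in m i.i.d. p-biased flips."""
    return sum(math.comb(m, j) * p**j * (1 - p)**(m - j) for j in range(l, m + 1))

def kl(q, p):
    """Binary KL divergence D_KL(q || p), in nats."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

m, p, l = 40, 0.5, 26                          # illustrative values with l/m > p
lower = math.exp(-m * kl(l / m, p)) / math.sqrt(2 * m)
assert binom_tail(l, m, p) >= lower            # the exact tail dominates the bound
```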
We can use this to lower bound Pr[D(r(k), x) ≤ ℓ/2 − √(ℓ log t / 8)]:

Pr[D(r(k), x) ≤ ℓ/2 − √(ℓ log t / 8)] = F(ℓ/2 + √(ℓ log t / 8), ℓ, 1/2) ≥ (1/√(2ℓ)) exp(−ℓ · D_KL(1/2 + √(log t/(8ℓ)) ‖ 1/2)).

Note that D_KL(1/2 + v ‖ 1/2) ≤ 4v², using the inequality log(1 + v) ≤ v. Simplifying the KL divergence with this, we can write

Pr[D(r(k), x) ≤ ℓ/2 − √(ℓ log t / 8)] ≥ 1/√(2ℓt).   (D.2)

Let D be the set of all k ≠ 0 such that D(r(k), x) ≤ ℓ/2 − √(ℓ log t / 8) for some fixed x. We argue that, with high probability over the randomness of the permutation, |D| is large. This follows from Eq. D.2 and the Chernoff bound^6, as the labels r(k) of all segments are chosen independently:

Pr[|D| ≤ √(t/(8ℓ))] ≤ e^{−(1/8)√(t/(2ℓ))}.

Note that √(t/(8ℓ)) ≥ n^{0.25}. Therefore, for any fixed x, with probability 1 − exp(−(1/8)√(t/(2ℓ))) ≥ 1 − n^{−0.25}, a randomly chosen permutation has √(t/(8ℓ)) ≥ n^{0.25} segments whose Hamming distance from x is less than ℓ/2 − √(ℓ log t / 8). Note that, by our construction, 2εℓ ≤ √(ℓ log t / 8), because log(ℓ + 1) ≤ (1 − 32c) log n. Hence the segments in D are closer in Hamming distance to the output s_0^{ℓ−1} than r(0) is, provided D(s_0^{ℓ−1}, r(0)) > ℓ/2 − 2εℓ. Therefore, if D(s_0^{ℓ−1}, r(0)) > ℓ/2 − 2εℓ, then with high probability over the random choice of the segments S_1 there is a subset D of segments in S_1 with |D| ≥ n^{0.25} such that every segment in D has Hamming distance less than D(s_0^{ℓ−1}, r(0)) from s_0^{ℓ−1}.

Pick any s_0^{ℓ−1} such that D(s_0^{ℓ−1}, r(0)) > ℓ/2 − 2εℓ, and consider any set of segments S_1 which has such a subset D with respect to the string s_0^{ℓ−1}. For all such permutations, the predictor P places at least as much weight on the hidden states h_i with g(i) = k, for k such that r(k) ∈ D, as on the true hidden state h_{ℓ−1}. The prediction for any hidden state h_i is the corresponding bit in S_2.
Notice that the bits in S_2 are independent and uniform, as we have not used them in any argument so far. The average correlation of an equally weighted average of m independent and uniform random bits with any one of those bits is at most 1/√m. Hence, over the randomness of S_2, the expected zero-one loss of the predictor is at least 0.5 − n^{−0.1}, and we can write

E[δ(s_0^{ℓ−1})] ≥ (0.5 − n^{−0.1}) Pr[|D| ≥ √(t/(8ℓ))] ≥ (0.5 − n^{−0.1})(1 − e^{−n^{0.25}}) ≥ 0.5 − 2n^{−0.1}.

By Equation D.1, for any assignment r(0) to h_0^{ℓ−1},

E[δ(h_{ℓ−1})] ≥ Pr[D(s_0^{ℓ−1}, r(0)) > ℓ/2 − 2εℓ] · E[δ(s_0^{ℓ−1}) | D(s_0^{ℓ−1}, r(0)) > ℓ/2 − 2εℓ] ≥ (1 − n^{−2c})(0.5 − 2n^{−0.1}) = 0.5 − o(1).

As this is true for all assignments r(0) to h_0^{ℓ−1} and for all choices of hidden states at time ℓ, using linearity of expectation and averaging over all hidden states, the expected average zero-one loss of the predictor P over the randomness in choosing the permutation is 0.5 − o(1). This means that there must exist some permutation such that the average zero-one loss of the predictor P on that permutation is 0.5 − o(1). Hence there exists an HMM on n states such that it is not information-theoretically possible to get average zero-one error, with respect to the optimal predictions, less than ε − o(1) using windows of length smaller than c·log n/ε² for a fixed constant c. Therefore, for all 0 < ε < 0.5 and sufficiently large n, there exists an HMM with n states such that it is not information-theoretically possible to get average relative zero-one loss less than ε/2 < ε − o(1) using windows of length smaller than c·ε^{−2} log n.

^6 For independent random variables {X_i} lying in the interval [0, 1], with X = Σ_i X_i and μ = E[X]: Pr[X ≤ (1 − δ)μ] ≤ exp(−δ²μ/2). In our case δ = 1/2 and μ = √(t/(2ℓ)).
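The 1/√m correlation claim is easy to probe empirically. The sketch below (hypothetical m and trial count) measures how often a majority vote over m independent uniform bits agrees with one fixed bit; the advantage over 1/2 stays below 1/√m.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
m, trials = 25, 100_000
bits = rng.integers(0, 2, size=(trials, m))          # m independent uniform bits per trial
majority = (bits.sum(axis=1) * 2 >= m).astype(int)   # equally weighted average, thresholded
advantage = (majority == bits[:, 0]).mean() - 0.5    # agreement with one fixed bit, over 1/2
assert advantage <= 1 / math.sqrt(m)                 # bounded by the 1/sqrt(m) claim
```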
The result for the relative zero-one loss follows on replacing ε/2 by ε and setting c′ = c/4. The result for the ℓ₁ loss follows immediately, as the expected relative zero-one loss is at most the expected ℓ₁ loss. For the KL loss we use Pinsker's inequality and Jensen's inequality.

Acknowledgements

Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and NSF Award CCF-1637360. Gregory Valiant and Sham Kakade acknowledge funding from NSF Award CCF-1703574. Gregory was also supported by NSF CAREER Award CCF-1351108 and a Sloan Research Fellowship.