Kernels for sequentially ordered data


Franz J. Király (Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, United Kingdom; f.kiraly@ucl.ac.uk)
Harald Oberhauser (Mathematical Institute, University of Oxford, Andrew Wiles Building, Oxford OX2 6GG, United Kingdom; oberhauser@maths.ox.ac.uk)

Abstract

We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample (cross-)moments; it allows one to obtain a "sequentialized" version of any static kernel. The sequential kernels are efficiently computable for discrete sequences and are shown to approximate a continuous moment form in a sampling sense. A number of known kernels for sequences arise as "sequentializations" of suitable static kernels: string kernels may be obtained as a special case, and alignment kernels are closely related up to a modification that resolves their open non-definiteness issue. Our experiments indicate that our signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows one to avoid extensive manual pre-processing.

Contents

1 Introduction
  1.1 Signature features, and their universality for sequences
  1.2 The sequential kernel and sequentialization
  1.3 Efficient computation and discretization
  1.4 Prior art
2 Notation for ordered data
3 Signature features: ordered moments for sequential data
  3.1 Paths of bounded variation
  3.2 Signature integrals in Hilbert spaces
  3.3 The signature as a canonical feature set
4 Kernelized signatures and sequentialized kernels
  4.1 Well-definedness of the signature kernel
  4.2 Kernel trick number one: kernelizing the signature
  4.3 Kernel trick number two: sequentialization of kernels
5 Discrete signatures and kernels for sequences
  5.1 Discretizing the signature
  5.2 Kernel trick number one discretized
  5.3 Kernel trick number two discretized
6 The string, alignment and ANOVA kernel seen via sequentializations
  6.1 The string kernel
  6.2 The global alignment kernel
  6.3 The relation-convolution kernel
7 Higher order sequentialization and noisy observations
  7.1 Brownian noise: order 2 approximations
  7.2 Higher order approximations
8 Efficient computation of sequentialized kernels
  8.1 Dynamic programming for tensors
  8.2 Computing the sequential kernel
  8.3 Computing the higher order sequential kernel
  8.4 Large scale strategies
9 Experimental validation
  9.1 Validation and prediction set-up
  9.2 Experiment: Classifying hand movements
  9.3 Experiment: Pendigits
A Second kernelization of the signature kernel
B Integral approximation bounds and proof of Theorem 5

1. Introduction

Sequentially ordered data are ubiquitous in modern science, occurring as time series, location series, or, more generally, sequentially observed samples of numbers, vectors, and structured objects. They occur frequently in structured machine learning tasks, in supervised classification and regression as well as in forecasting, as well as in unsupervised learning. Three stylized facts make learning with sequential data an ongoing challenge:

(A) Sequential data is usually very diverse, with wildly different features being useful. In the state of the art, this is usually addressed by manual extraction of hand-crafted features, the combination of which is often very specific to the application at hand and does not transfer easily.

(B) Sequential data often occurs as sequences of structured objects, such as letters in text, images in video, graphs in network evolution, or heterogeneous combinations of all of these, say in database or internet applications. This is usually solved by ad-hoc approaches adapted to the specific structure.

(C) Sequential data is often large, with sequences easily reaching lengths of hundreds, thousands, or millions. Especially when there are one or more sequences per data point, the data sets, and with them the computational costs, quickly become very large.

In this paper, we present a novel approach to learning with sequential data based on a joining of the theory of signatures / rough paths, and kernels / Gaussian processes, addressing the points above:

(A) The signature of a path is a (large) collection of canonical features that can be intuitively described as an ordered version of sample moments. They completely describe a sequence of vectors (provably), and make sequences of different size and length comparable. The use of signature features is therefore a straightforward way of avoiding manual feature extraction (Section 3).

(B) Combining signatures with the kernel trick, by considering the signature map as a feature map, yields a kernel for sequences. It also allows learning with sequences of structured objects for which non-sequential kernels exist — consequently we call the process of obtaining a sequence kernel from a kernel for structured objects "kernel sequentialization" (Section 4).
(C) The sequentialized kernel can be computed efficiently via dynamic programming ideas similar to those known for string kernels (Section 5). The kernel formalism makes the computations further amenable to low-rank type speed-ups in kernel and Gaussian process learning, such as Nyström-type and Cholesky approximations or inducing point methods (Section 8).

To sum up, we provide a canonical construction to transform any kernel $k : X \times X \to \mathbb{R}$ into a version for sequences $k^+ : X^+ \times X^+ \to \mathbb{R}$, where we have denoted by $X^+$ the set of arbitrary length sequences in $X$. We call $k^+$ the sequentialization of $k$. This sequentialization is canonical in the sense that it converges to an inner product of ordered moments, the signature, when sequences in $X^+$ converge to functions $[[0,1] \to X]$ in a meaningful way. We will see that existing kernels for sequences, such as string or alignment kernels, are closely related to this construction.

We explore the practical use of the sequential kernel in experiments which show that sequentialization of non-linear kernels may be beneficial, and that the sequential kernel we propose can beat the state-of-the-art in sequence classification while avoiding extensive pre-processing. Below we give an informal overview of the main ideas, and a summary of related prior art.

1.1. Signature features, and their universality for sequences

Signatures are universal features for sequences, characterizing sequential structure by quantifying dependencies in their change, similar to sample moments. We showcase how to obtain such signature features for the simple example of a two-dimensional, smooth series
$$x : [0,1] \to \mathbb{R}^2, \quad t \mapsto (a(t), b(t))^\top,$$
whose argument we interpret as "time". As with sample moments or sample cumulants, there are signature features of different degree: first degree, second degree, and so on. The first degree part of the signature is the average change in the series, that is,
$$S_1(x) := \mathbb{E}_t[\dot{x}(t)] = \mathbb{E}_t\begin{pmatrix} \dot{a}(t) \\ \dot{b}(t) \end{pmatrix} = \begin{pmatrix} \int_0^1 da(t) \\ \int_0^1 db(t) \end{pmatrix} = \begin{pmatrix} a(1) - a(0) \\ b(1) - b(0) \end{pmatrix},$$
where we have written $da(t) := \dot{a}(t)\,dt$, $db(t) := \dot{b}(t)\,dt$. The second degree part of the signature is the (non-centered) covariance of changes at two subsequent time points, that is, the expectation
$$S_2(x) := \tfrac{1}{2}\,\mathbb{E}_{t_1 < t_2}\!\left[\dot{x}(t_1)\cdot\dot{x}(t_2)^\top\right] = \tfrac{1}{2}\,\mathbb{E}_{t_1 < t_2}\begin{pmatrix} \dot{a}(t_1)\dot{a}(t_2) & \dot{a}(t_1)\dot{b}(t_2) \\ \dot{b}(t_1)\dot{a}(t_2) & \dot{b}(t_1)\dot{b}(t_2) \end{pmatrix} = \begin{pmatrix} \int_0^1\!\int_0^{t_1} da(t_2)\,da(t_1) & \int_0^1\!\int_0^{t_1} da(t_2)\,db(t_1) \\ \int_0^1\!\int_0^{t_1} db(t_2)\,da(t_1) & \int_0^1\!\int_0^{t_1} db(t_2)\,db(t_1) \end{pmatrix},$$
where the expectations in the first line are uniform over time points in chronological order $t_1 \le t_2$ (that is, $t_1, t_2$ is the order statistic of two points sampled from the uniform distribution on $[0,1]$). This is equivalent to integration over the so-called 2-order-simplex $\{t_1, t_2 : 0 \le t_1 \le t_2 \le 1\}$ in the second line, up to a factor of $1/2$ corresponding to the uniform density (we put it in front of the expectation, and not its inverse in front of the integral, to obtain an exponential generating function later on).

Note that the second order signature differs from the second order moment matrix of the infinitesimal changes by the chronological order imposed in the expectation. Similarly, one defines the degree $M$ part of the signature as the $M$-th order moment tensor of the infinitesimal changes, where the expectation is taken over chronologically ordered time points (which yields a tensor of degree $M$). A basis-free definition over an arbitrary RKHS is given in Section 3.2. Note that the signature tensors are not symmetric: as in the second order matrix, the number arising from the $\dot{b}(t_1)\dot{a}(t_2)$ term is in general different from the number obtained from the $\dot{a}(t_1)\dot{b}(t_2)$ term.
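For concreteness, the two features above can be mirrored on a discrete sequence by replacing differentials with increments. The following is a minimal sketch (our code, assuming NumPy, and omitting the factorial normalization of the expectation form, as in the discretization of Section 5):

```python
import numpy as np

def discrete_signature_levels_1_2(x):
    """First and second degree signature features of a discrete sequence.

    x: array of shape (L, d) holding a sequence of d-dimensional points.
    Level 1 sums the increments (hence equals x[-1] - x[0]); level 2 sums
    outer products of increments over ordered index pairs i < j, a discrete
    analogue of the chronologically ordered second moment.
    """
    dx = np.diff(x, axis=0)                 # increments, shape (L-1, d)
    S1 = dx.sum(axis=0)
    S2 = np.zeros((x.shape[1], x.shape[1]))
    partial = np.zeros(x.shape[1])          # running sum of earlier increments
    for v in dx:
        S2 += np.outer(partial, v)          # pairs (earlier increment, current)
        partial += v
    return S1, S2
```

Note that $S_2$ computed this way is visibly non-symmetric, mirroring the remark above about the chronological order.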
The signature features are in close mathematical analogy to moments, and thus to polynomials, on the domain of multi-dimensional series. Namely, one can show:

• A (sufficiently regular) series is (almost) uniquely determined by its signature — this is not true for higher order moments or cumulants without the order structure (recall that these almost uniquely determine the distribution of values, without order structure).

• Any (sufficiently regular, real-valued) function $f$ on series can be approximated arbitrarily well by a function linear in signature features; that is, for a non-linear functional $f$ of our two-dimensional path $x = (a, b)^\top$,
$$f(x) \approx \alpha + \sum \beta_{i_1, \dots, i_M} \int dx_{i_1} \cdots dx_{i_M},$$
where the sum runs over $M$ and $(i_1, \dots, i_M) \in \{1, 2\}^M$, and we denote $x_1 := a$, $x_2 := b$. Note that $x$ is the indeterminate here, that $\int dx_{i_1} \cdots dx_{i_M}$ is the degree $M$ part of the signature, and that approximation is over a compact set of different paths $x$.

The exact statements are given in Section 3. From a methodological viewpoint, these assertions mean that not only are signature features rich enough to capture all relevant features of the sequential data, but also that any practically relevant feature can be expressed linearly in the signature, addressing point (A) in the sense of a universal methodological approach.

Unfortunately, native signature features in the form above are only practical in low dimension and low degree $M$: already in the example above of a two-dimensional path $x$, there are $2^M$ (scalar) signature features of degree $M$, and in general the computation of a larger number of signature features is infeasible — point (C). Further, all data are discrete sequences, not continuous, and possibly consist of objects which are not necessarily real vectors — point (B).
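A worked example, standard in the rough path literature, illustrates the exponential structure behind these features and the normalization chosen above: for the linear path $x(t) = tv$ with $v \in \mathbb{R}^n$, every differential is $dx(t) = v\,dt$, so the degree-$M$ feature is the simplex volume times $v^{\otimes M}$,
$$\int_{\Delta_M} dx^{\otimes M} = \int_{0 \le t_1 \le \cdots \le t_M \le 1} v^{\otimes M}\, dt_1 \cdots dt_M = \frac{v^{\otimes M}}{M!},$$
so that the full signature $S(x) = \sum_{M \ge 0} v^{\otimes M}/M!$ is a tensor analogue of $\exp(v)$. This is the sense in which the signature behaves as a "non-commutative exponential" (cf. Sections 1.4 and 5.1).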
1.2. The sequential kernel and sequentialization

The two issues mentioned can be addressed by the kernel trick — more precisely, by the kernel trick applied twice: once, to cope with the combinatorial explosion of signature features, akin to the polynomial kernel, which avoids the computation of an exponential number of polynomial features; a second time, to allow the treatment of sequences of arbitrary objects. This double kernelization addresses point (B), and also the combinatorial explosion of the feature space. An additional discretization-approximation, which we discuss in the next paragraph below, makes the so-defined kernel amenable to efficient computation.

We describe the two kernelization steps. The first kernelization step addresses the combinatorial explosion. It simply consists of taking the scalar product of signature features as the kernel, and then observing that this scalar product of integrals is an integral of scalar products. More precisely, this kernel, called the signature kernel, can be defined as follows, continuing the example with two two-dimensional sequences $t \mapsto x(t) = (a(t), b(t))^\top$, $t \mapsto \bar{x}(t) = (\bar{a}(t), \bar{b}(t))^\top$ as above:
$$K^\oplus(x, \bar{x}) := \langle S(x), S(\bar{x}) \rangle = 1 + \langle S_1(x), S_1(\bar{x}) \rangle + \langle S_2(x), S_2(\bar{x}) \rangle + \cdots.$$
The scalar product of the $S_1$'s (vectors in $\mathbb{R}^2$) is the Euclidean scalar product in $\mathbb{R}^2$, the scalar product of the $S_2$'s (matrices in $\mathbb{R}^{2\times 2}$) is the trace product, and so on (with higher degree tensor trace products). The "1" is an "$S_0$"-contribution (for mathematical reasons becoming apparent in paragraph 1.3 below).

For the first degree contribution to the signature kernel, one now notes that
$$\langle S_1(x), S_1(x) \rangle = \mathbb{E}_s[\dot{a}(s)] \cdot \mathbb{E}_t[\dot{a}(t)] + \mathbb{E}_s[\dot{b}(s)] \cdot \mathbb{E}_t[\dot{b}(t)] = \mathbb{E}_{s,t}\!\left[\dot{a}(s)\dot{a}(t) + \dot{b}(s)\dot{b}(t)\right] = \mathbb{E}_{s,t}\!\left[\langle \dot{x}(s), \dot{x}(t) \rangle\right].$$
In analogy, one computes that the second degree contribution to the signature kernel evaluates to
$$\langle S_2(x), S_2(x) \rangle = \frac{1}{2!^2}\, \mathbb{E}_{s_1 < s_2,\, t_1 < t_2}\!\left[\langle \dot{x}(s_1), \dot{x}(t_1) \rangle \cdot \langle \dot{x}(s_2), \dot{x}(t_2) \rangle\right].$$
Similarly, for a higher degree $M$, one obtains a product of $M$ scalar products in the expectation.

This presentation is not only reminiscent of the polynomial kernel in how it copes with the combinatorial explosion, it also directly suggests the second kernelization to cope with sequences of arbitrary objects: since the sequential kernel is now entirely expressed in scalar products in $\mathbb{R}^2$, the scalar products in the expectation may be replaced by any kernel $k : X \times X \to \mathbb{R}$ of arbitrary objects, yielding a sequential kernel, now for sequences in $X$, given as
$$k^\oplus(x, \bar{x}) := 1 + \frac{1}{1!^2}\,\mathbb{E}_{s,t}\!\left[k(\dot{x}(s), \dot{\bar{x}}(t))\right] + \frac{1}{2!^2}\,\mathbb{E}_{s_1 < s_2,\, t_1 < t_2}\!\left[k(\dot{x}(s_1), \dot{\bar{x}}(t_1)) \cdot k(\dot{x}(s_2), \dot{\bar{x}}(t_2))\right] + \frac{1}{3!^2}\,\mathbb{E}\dots \tag{1.1}$$
(for expository convenience we assume here that differentials in $X$ are defined, which in general is untrue; see Section 4.3 for the general statement). Note that (1.1) can be seen as a process that takes any kernel $k$ on $X$ and makes it into a kernel $k^\oplus$ on $X$-sequences; we therefore term it the "sequentialization" of the kernel $k$. This addresses point (B), and can be found in more detail in Section 4.

1.3. Efficient computation and discretization

An efficient way of evaluating the sequential kernel is suggested by a second observation, closely related to (and generalizing) Horner's method of evaluating polynomials. Note that the sequential kernel can be written as an iterated conditional expectation
$$k^\oplus(x, \bar{x}) = 1 + \mathbb{E}_{s_1, t_1}\!\left[k(\dot{x}(s_1), \dot{\bar{x}}(t_1)) \cdot \left(1 + \frac{1}{2^2}\,\mathbb{E}_{s_1 < s_2,\, t_1 < t_2}\!\left[k(\dot{x}(s_2), \dot{\bar{x}}(t_2)) \cdot \left(1 + \frac{1}{3^2}\,\mathbb{E}_{s_2 < s_3,\, t_2 < t_3}[\dots]\right)\right]\right)\right].$$
The iterated expectation directly suggests a discretization by replacing expectations with sums, such as
$$1 + \frac{1}{n^2}\sum_{s_1, t_1}\left(k(\dot{x}(s_1), \dot{\bar{x}}(t_1)) \cdot \left(1 + \frac{1}{(2n)^2}\sum_{s_1 < s_2,\, t_1 < t_2}\left(k(\dot{x}(s_2), \dot{\bar{x}}(t_2)) \cdot \left(1 + \frac{1}{(3n)^2}\sum_{s_2 < s_3,\, t_2 < t_3}[\dots]\right)\right)\right)\right),$$
where the sums range over a discrete set of points $s_i, t_i \in [0,1]$, $i = 1, \dots, L$. A reader familiar with string kernels will immediately notice the similarity: the sequential kernel can in fact be seen as an infinitesimal limit of a string kernel, and the (vanilla) string kernel can be obtained as a special case (see Section 6).
As a final subtlety, we note that the derivatives of $x$, $\bar{x}$ will not be known from observations; therefore one needs to replace $k(\dot{x}(s_i), \dot{\bar{x}}(t_j))$ by a discrete difference approximation
$$k(x(s_{i+1}), \bar{x}(t_{j+1})) + k(x(s_i), \bar{x}(t_j)) - k(x(s_i), \bar{x}(t_{j+1})) - k(x(s_{i+1}), \bar{x}(t_j)),$$
where $s_i, s_{i+1}$ resp. $t_j, t_{j+1}$ denote adjacent support values of the discretization.

Our theoretical results in Section 5 show that the discretization, as described above, converges to the continuous kernel, with a convergence order linear in the sampling density. Moreover, similarly to the Horner scheme for polynomials (or fast string kernel techniques), the iterated sum-product can be efficiently evaluated by dynamic programming techniques on arrays of dimension three, as outlined in Section 8. The computational complexity is quadratic in the length of the sequences and linear in the degree of approximation, and can be further reduced to linear complexity in both with low-rank techniques. This addresses the remaining point (C), and therefore yields an efficiently computable, canonical and universal kernel for sequences of arbitrary objects. A compact sketch of such a dynamic program is given below.
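To make the recursion concrete, here is a minimal sketch (our code and names, assuming NumPy) of the truncated sequentialized kernel of Section 5, evaluated from the double-difference Gram matrix of a static kernel $k$ by a Horner-type dynamic program. Section 8 discusses the efficient algorithms actually proposed, including higher-order variants and large-scale strategies.

```python
import numpy as np

def sequentialized_kernel(X, Y, k, M=4):
    """Level-M truncated sequentialization of a static kernel k (a sketch).

    X, Y: sequences (lists or arrays) of observations of any kind;
    k:    a static kernel on the observations.
    Builds dK[i, j] = k(X[i+1], Y[j+1]) + k(X[i], Y[j])
                    - k(X[i], Y[j+1]) - k(X[i+1], Y[j]),
    the kernelized inner product of increments, then sums products of
    dK-entries over order-matched strictly increasing index tuples.
    """
    Lx, Ly = len(X) - 1, len(Y) - 1
    dK = np.array([[k(X[i + 1], Y[j + 1]) + k(X[i], Y[j])
                    - k(X[i], Y[j + 1]) - k(X[i + 1], Y[j])
                    for j in range(Ly)] for i in range(Lx)])
    A = dK.copy()               # degree m = 1 terms ending at index pair (i, j)
    val = 1.0 + A.sum()         # degrees 0 and 1
    for m in range(2, M + 1):
        P = np.zeros_like(A)    # strict prefix sums: sum_{s<i, t<j} A[s, t]
        P[1:, 1:] = A.cumsum(axis=0).cumsum(axis=1)[:-1, :-1]
        A = dK * P              # degree m terms ending at (i, j)
        val += A.sum()
    return val
```

For instance, with $k(u, v) = \langle u, v \rangle$ on $\mathbb{R}^d$, this computes the scalar product of the discretized signatures of Section 5, truncated at level $M$, at cost $O(M \cdot \ell(x)\ell(y))$.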
1.4. Prior art

Prior art relevant to learning with sequential data may be found in three areas:

1. dynamic programming algorithms for sequence comparison, in the engineering community,
2. kernel learning and Gaussian processes, in the machine learning community,
3. rough paths, in the stochastic analysis community.

The dynamic programming literature (1) from the 70's and 80's has inspired some of the progress in kernels (2) for sequential data over the last decade, but to our knowledge no connections have so far been made between these two and (3), even though (3) pre-dates the kernel literature for sequences by more than a decade. Beyond the above, we are not aware of literature in statistics of time series that deals with sequence-valued data points in a way other than first identifying one-dimensional sequences with real vectors of the same size, or even forgetting the sequence structure entirely and replacing the sequences with (order-agnostic) aggregates such as cumulants, quantiles or principal component scores (this is equally true for forecasting methods). Simple as such a reduction to more classic situations may be, it constitutes an important baseline, since only in comparison can one infer whether the ordering was informative or not.

Dynamic programming for sequence comparison. The earliest occurrence in which the genuine order information in sequences is used for learning can probably be found in the work of Sakoe et al. [29, 30], which introduces the idea of using editing or distortion distances to compare sequences of different length, and of efficiently determining such distances via dynamic programming strategies. These distances are then employed for classification by maximum similarity / minimum distance principles. Through theoretical appeal and efficient computability, sequence comparison methods, later synonymously called dynamic time warping methods, have become one of the standard methods for comparing sequential data [9, 16]. It should be said, though, that sequence comparison methods in their pure form — namely an efficiently computable distance between sequences — have remained somewhat restricted in that, by definition, they can be directly adapted only to relatively heuristic distance-based learning algorithms. This may be one of the reasons why sequence comparison / dynamic time warping methods have not given rise to a closed learning theory, and why in their original practical application, speech recognition and speech classification, they have later been superseded by Hidden Markov Models [26] and more recently by neural network / deep learning methodologies [15] as the gold standard. A possible solution of the above-mentioned shortcomings has been demonstrated in the kernel learning literature.

Kernels for sequences. Kernel learning is a relatively new field, providing a general framework to make non-linear data of arbitrary kind amenable to classical and scalable linear algorithms, such as regression or the support vector machine, in a unified way, by using a non-linear scalar product: this strategy is called "kernelization"; see [32] or [33]. Mathematically, there are close relations to Gaussian process theory [27], which is often considered as a complementary viewpoint to kernels, and to aspects of spatial geostatistics [4], particularly Kriging, an interpolation / prediction method from the 60's [23] which was re-discovered 30 years later in the form of Gaussian process regression [35]. In all three incarnations, coping with a specific kind of data practically reduces to finding a suitable kernel (= covariance function), or a family of kernels, for the type of data at hand — after which one can apply a ready arsenal of learning theory and non-linear methodology to such data. In this sense, providing suitable kernels for sequences has proved to be one of the main strategies for removing the shortcomings of the sequence comparison approach.

Kernels for strings, that is, sequences of symbols, were among the first to be considered [14]. Fast dynamic programming algorithms to compute string kernels were obtained a few years later [17, 19]. Almost in parallel and somewhat separately, kernels based on the above-mentioned dynamic time warping approach were developed, for sequences of arbitrary objects [1, 24]. A re-formulation / modification led to the so-called global alignment kernels [6], for which fast dynamic programming algorithms were later found as well [5]. An interesting subtle issue common to both strains was that initially the dynamic programming algorithms found were quadratic in the length of the sequences, and only later were linear complexity algorithms devised: for string kernels, the transition was made in [17], while for the sequence matching strain, this became possible only after passing to the global alignment kernels [5].

Looking from a general perspective: while in hindsight all of the mentioned kernels can be viewed through Haussler's original, visionary relation-convolution kernel framework, and all above-mentioned kernels for sequences, in some form, admit fast dynamic programming algorithms, the existing literature provides no unifying view on kernels for sequences: the exact relation between string kernels and dynamic time warping / global alignment kernels, or to the classical theory of time series, has remained unclear; further, the only known kernels for sequences of arbitrary objects, the dynamic time warping / global alignment kernels, suffer from the fact that they are not proper kernels, failing to be positive definite. In this paper, we attempt to resolve these issues.
More precisely, the string kernel will arise as a special case, and the global alignment kernel as a deficient version, of our new signature kernel, built on the theory of signatures and rough paths from stochastic analysis.

Iterated integrals, signatures and rough paths. Series of iterated integrals are a classic mathematical object that plays a fundamental role in many areas like control theory, combinatorics, homotopy theory, Feynman–Dyson–Schwinger theory in physics, and more recently probability theory. We refer to [20, Section "Historic papers", p. 97] for a bibliography of influential articles. This series, or certain aspects of it, is treated under various names in the literature, like "Magnus expansion", "time ordered exponential", or the one we chose, which comes from Lyons' rough path theory: "the signature of a path". The reason we work in the rough path setting is that it provides a concise mathematical framework that clearly separates analytic and algebraic aspects, applies in infinite dimensional spaces like our RKHS, and is robust under noise; we refer to [8, 20, 22] as introductions. The role of the signature as a "non-commutative exponential" serves as a guiding principle for many recent developments in stochastic analysis, though it might be less known outside this community.

The major application of rough path theory was and still is to provide a robust understanding of differential equations that are perturbed by noise, going beyond classic Itô calculus; Hairer's work on regularity structures [12], which was recently awarded a Fields medal, can be seen as a vast generalization of such ideas. The interpretation of the signature in a statistical context is more recent: work of Papavasiliou and Ladroue [25] applies it to SDE parameter estimation; work of Gyurkó, Hao, Lyons, Oberhauser [11, 18, 21] applies it to forecasting and classification of financial time series using linear and logistic regression; work of Diehl [7] and Graham [10] uses signature features for handwritten digit recognition; see also [36] for more recent state of the art results. The interpretation of the signature as an expectation already occurs as a technical Lemma 3.9 in [13]. A scalar product formula for the norm, somewhat reminiscent of that of the sequential kernel, can be found in the same Lemma. We would also like to mention the Computational Rough Paths package [3], which contains C++ code to compute signature features directly for $\mathbb{R}^d$-valued paths. However, it does not contain specialized code to calculate inner products of signature features directly. The Horner-type algorithm we describe in Section 1.3 already gives significant speed improvements when it is applied to paths in finite dimensional, linear spaces (that is, the sequentialization of the Euclidean inner product $k(\cdot, \cdot) = \langle \cdot, \cdot \rangle_{\mathbb{R}^d}$; see Remark 8.1).

Remark 1.1. The above overview contains a substantial number of examples in which independent or parallel development of ideas related to sequential data has occurred, possibly due to researchers being unaware of similar ideas in communities socially far from their own. We are aware that this could therefore, unintendedly, also be the case for this very paper. We are thus especially thankful for any pointers from readers of this pre-print version that help us give credit where credit is due.

Acknowledgement. HO is grateful for support by the Oxford-Man Institute of Quantitative Finance.
2. Notation for ordered data

We introduce recurring notation.

Definition 2.1. The set of natural numbers, including 0, will be denoted by $\mathbb{N}$.

Definition 2.2. Let $A$ be a set and $L \in \mathbb{N}$. We denote
1. the set of integers between 1 and $L$ by $[L] := \{1, 2, \dots, L\}$,
2. the set of ordered $L$-tuples in $A$ as usual by $A^L := \{a = (a_1, \dots, a_L) : a_1, \dots, a_L \in A\}$,
3. the set of such tuples, of arbitrary but finite length, by $A^+ := \bigcup_{L \in \mathbb{N}} A^L$, where by convention $A^0 = \emptyset$.

Moreover, we use the following index notation for $a = (a_1, \dots, a_L) \in A^+$:
1. $\ell(a) := L$,
2. $\#a$, the count of the most frequent item in $a$,
3. $a[i] := a_i$ for $i = 1, \dots, L$,
4. $a[1\!:\!N] := (a[1], \dots, a[N])$ for $N \in [L]$,
5. $a[i] := (a[i_1], \dots, a[i_N]) \in A^N$ for $i = (i_1, \dots, i_N) \in [L]^N$,
6. $a! := n_1! \cdots n_k!$ if $a$ consists of $k = |\{a_1, \dots, a_L\}|$ different elements of $A$ and $n_1, \dots, n_k$ denote the numbers of times they occur in $a$,
7. $f(a) := (f(a_1), \dots, f(a_L)) \in B^L$ for $f \in [A \to B]$.

In the case that $A \subset \mathbb{R}$, we can define subsets of $A^+$ that consist of increasing tuples. These tuples play an important role in calculations with the signature features.

Definition 2.3. Let $A \subset \mathbb{R}$ and $M \in \mathbb{N}$. We denote the set of monotonously ordered $M$-tuples in $A$ by $\Delta_M(A) := \{u \in A^M : u[1] \le u[2] \le \cdots \le u[M]\}$. We denote the union of such tuples by $\Delta(A) := \bigcup_{M \in \mathbb{N}} \Delta_M(A)$, where again by convention $\Delta_0(A) = \emptyset$. We call $\Delta_M(A)$ the order $M$ simplex on $A$, and $\Delta(A)$ the order simplex on $A$. A monotonously ordered tuple $u \in \Delta_M(A)$ is called strict if $u[i] < u[i+1]$ for all $i \in [M-1]$. The index notation of Definition 2.2 applies also to $u \in \Delta(A)$, understanding that $\Delta(A) \subseteq A^+$ if $A \subseteq \mathbb{R}$.

Remark 2.4. The above definition of the order simplex is slightly different from the usual one, in which one takes $A = [0,1]$, the counting starts at $t_0$, and one has $t_0 = 0$ and $t_M = 1$.

The following notation is less standard, but becomes very useful for the algorithms and recursions that we discuss (a small executable illustration follows the definition).

Definition 2.5. Let $a, b \in A^+$, $D \in \mathbb{N}$. We use the following notation:
• $a \sqsubseteq b$ if there is an $i \in \Delta(\mathbb{N})$ such that $a = b[i]$,
• $a \sqsubset b$ if there is a strictly ordered tuple $i \in \Delta(\mathbb{N})$ such that $a = b[i]$,
• $a \sqsubseteq_D b$ if there is an $i \in \Delta(\mathbb{N})$ such that $a = b[i]$ and $\#a \le D$.

For $L \in \mathbb{N}$, we also use the notation
1. $a \sqsubseteq [L]$ if $a \in \Delta([L])$,
2. $a \sqsubset [L]$ if $a \in \Delta([L])$ and $a$ is strict (that is, $a[i] < a[i+1]$ for all $i \in [\ell(a) - 1]$).
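For later use (formula (5.1) below sums over exactly these index tuples), here is a small, hypothetical helper enumerating the strictly increasing tuples $i \sqsubset [L]$ of a given length, assuming only the Python standard library:

```python
from itertools import combinations

def strict_tuples(L, m):
    """All index tuples i with i ⊏ [L] and ℓ(i) = m (Definition 2.5),
    i.e. strictly increasing m-tuples drawn from {1, ..., L}."""
    return list(combinations(range(1, L + 1), m))

# Example: strict_tuples(4, 2) yields
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)].
```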
3. Signature features: ordered moments for sequential data

In this section, the signature features are introduced in a mathematically precise way, and the properties which make them canonical features for sequences are derived. We refer the interested reader to [20] for further properties of signatures and their use in stochastic analysis. As outlined in the introductory Section 1.1, the signature can be understood as an ordered, and hence non-commutative, variant of sample moments.

If we model ordered data by a function $x \in [[0,1] \to \mathbb{R}^n]$, the signature features are obtained as iterated integrals of $x$, and the $M$-times iterated integral can be interpreted as an $M$-th ordered moment; examples of such features for $M = 1, 2$ are the (non-commutative) integral moments
$$S_1 : [[0,1] \to \mathbb{R}^n] \to \mathbb{R}^n, \quad x \mapsto \int_0^1 dx(t), \qquad \text{and} \qquad S_2 : [[0,1] \to \mathbb{R}^n] \to \mathbb{R}^{n \times n}, \quad x \mapsto \int_0^1\!\!\int_0^{t_2} dx(t_1)\, dx(t_2)^\top$$
(where the domain of $x$ will be restricted so that the integrals are well-defined). The more general idea of signature features, made mathematically precise, is to consider the integral moments
$$S_M(x) = \left(\int_{\Delta_M} dx_{i_1}(t_1) \cdots dx_{i_M}(t_M)\right)_{i_1, \dots, i_M \in \{1, \dots, n\}} \in \mathbb{R}^{n \times \cdots \times n}, \tag{3.1}$$
where the integration is over the $M$-simplex $\Delta_M$, i.e., all ordered sequences $0 \le t_1 \le t_2 \le \cdots \le t_M \le 1$, and the choice of index $(i_1, \dots, i_M)$ parametrises the features. The features $S_1(x), S_2(x), \dots$ are all elements of a (graded) linear space, and a kernel for sequential data may then be obtained by taking the scalar product of the signature features.

This section is devoted to a description and characterization of these signature features and the kernel obtained from them. This is done in a basis-free form which allows the treatment of ordered data $x : [0,1] \to H$, where $H$ is a Hilbert space (over $\mathbb{R}$) not necessarily identical to $\mathbb{R}^n$. This will allow considering ordered versions of data whose non-ordered variant may already be encoded via its reproducing kernel Hilbert space $H$. For illustration, the reader is invited to keep the prototypical case $H = \mathbb{R}^n$ in mind.

3.1. Paths of bounded variation

The main object modelling a (generative) datum will be a continuous sequence, a path $x : [0,1] \to H$, where $H$ is a Hilbert space which is fixed in the following. For reasons of regularity (integrals need to converge), we will restrict ourselves to paths $[[0,1] \to H]$ which are continuous and of bounded length [1] (called "variation", as defined below).

The variation of a path is defined as the supremum of variations of discrete sub-sequences taken from the path:

Definition 3.1. For an ordered tuple $x \in H^{L+1}$, we define:
(i) $\nabla x \in H^L$ as $\nabla x[i] := x[i+1] - x[i]$, $i \in [L]$,
(ii) $\mathrm{mesh}(x) := \max_i \|(\nabla x)[i]\|$,
(iii) $V(x) := \sum_{i=1}^L \|\nabla x[i]\|_H$.
We call $\nabla x$ the first difference sequence of $x$, $\mathrm{mesh}(x)$ the mesh of $x$, and $V(x)$ the variation of $x$.

Definition 3.2. We define the variation of $x \in [[0,1] \to H]$ as $V(x) := \sup_{t \in \Delta([0,1])} V(x(t))$. The mapping $x$ is said to be of bounded variation if $V(x) < \infty$. A bounded variation path, as the name says, is a path with bounded variation.

Definition 3.3. Let $[a,b] \subseteq [0,1]$. We denote the set of $H$-valued paths of bounded variation on $[a,b]$ by $BV([a,b], H) := \{x \in C([a,b], H) : V(x) < \infty\}$. When $[a,b] = [0,1]$ we often omit the qualifier $[0,1]$ and write $BV(H)$.

Definition 3.4. Let $E$ be a linear space and denote by $L(H, E)$ the set of continuous linear maps from $H$ to $E$. Given $x \in BV([0,1], H)$ and $y \in BV([0,1], L(H, E))$, the Riemann–Stieltjes integral of $y$ over $[a,b] \subseteq [0,1]$ is defined as the element of $E$ given as
$$\int_a^b y\, dx := \lim_{\substack{t \in \Delta([a,b]) \\ \mathrm{mesh}(t) \to 0}}\ \sum_{i=1}^{\ell(t)-1} y(t[i])\,\big(\nabla x(t)[i]\big).$$
We also use the shorter notation $\int y\, dx$ when the integration domain is clear from the context.
As in the finite-dimensional case, the above is indeed well-defined, that is, the Riemann sums converge and the limit is independent of the sequence $t$; see [20, Theorem 1.16] for a proof of a more general result. Note that the integral itself is in general not a scalar, but an element of the Hilbert space $H$.

[1] Considering bounded variation paths is slightly restrictive as it excludes samples from stochastic process models such as the Wiener process / Brownian motion. The theory may be extended to such sequences at the cost of an increase in technicality. For reading convenience, this will be done only at a later stage.
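As a quick illustration of Definition 3.4 in the prototypical case $H = \mathbb{R}^d$, the following sketch (our code and names, assuming NumPy) evaluates the Riemann sum on a fixed grid; refining the grid approximates the Riemann–Stieltjes integral:

```python
import numpy as np

def riemann_stieltjes_sum(y, x, t):
    """Riemann sum for the Riemann–Stieltjes integral of Definition 3.4.

    x: callable mapping time to a vector in R^d (the integrator path);
    y: callable mapping time to an (e, d) matrix, i.e. a linear map R^d -> R^e;
    t: increasing 1-d grid of time points in [a, b].
    Returns an element of R^e; as mesh(t) -> 0 this converges to ∫ y dx.
    """
    xs = np.array([x(s) for s in t])
    return sum(y(t[i]) @ (xs[i + 1] - xs[i]) for i in range(len(t) - 1))

# Example: with y(t) the identity matrix, the sum telescopes to x(b) - x(a),
# matching the first iterated integral of Definition 3.5 below.
```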
3.2. Signature integrals in Hilbert spaces

With the Riemann–Stieltjes integral, we can define the signature integrals in a basis-free way. For this, note that the Riemann–Stieltjes integral of a bounded variation path is again a bounded variation path.

Definition 3.5. Let $[a,b] \subseteq [0,1]$ and $x \in BV([a,b], H)$. We define:
(i) $\int_{\Delta_1([a,b])} dx^{\otimes 1} := \int_a^b dx = x(b) - x(a)$,
(ii) $\int_{\Delta_M([a,b])} dx^{\otimes M} := \int_a^b \left(\int_{\Delta_{M-1}([a,s])} dx^{\otimes (M-1)}\right) \otimes dx(s)$ for integers $M \ge 2$.
We call $\int_{\Delta_M([a,b])} dx^{\otimes M}$ the $M$-th iterated integral of $x$ on $[a,b]$. When $[a,b] = [0,1]$ we often omit the qualifier $[0,1]$ and write $\Delta_M$ for $\Delta_M([0,1])$.

In the prototypical case $H = \mathbb{R}^n$, the first iterated integral is a vector in $\mathbb{R}^n$, the second iterated integral is a matrix in $\mathbb{R}^{n \times n}$, the third iterated integral is a tensor in $\mathbb{R}^{n \times n \times n}$, and so on. For the case of arbitrary Hilbert spaces, we need to introduce some additional notation to write down the space where the iterated integrals live:

Definition 3.6. We denote by $H \otimes H$ the tensor (= outer) product of $H$ with itself. Instead of $H \otimes H \otimes \cdots \otimes H$ ($M$ times), we also write $H^{\otimes M}$. By convention, $H^{\otimes 0} = \mathbb{R}$.

The $M$-th iterated integral of $x$ is an element of $H^{\otimes M}$, a tensor of degree $M$. The natural space containing the iterated integrals of all degrees together is the tensor power algebra over the Hilbert space $H$:

Definition 3.7. The tensor power algebra over $H$ is defined as $\bigoplus_{M=0}^\infty H^{\otimes M}$; addition $+$, multiplication $*$ and an inner product $\langle \cdot, \cdot \rangle_{T(H)}$ are defined by canonical extension from $H$, that is:
$$(g_0, g_1, \dots, g_M, \dots) + (h_0, h_1, \dots, h_M, \dots) := (g_0 + h_0,\ g_1 + h_1,\ \dots,\ g_M + h_M,\ \dots),$$
$$(g_0, g_1, \dots, g_M, \dots) * (h_0, h_1, \dots, h_M, \dots) := \left(g_0 h_0,\ g_0 \otimes h_1 + g_1 \otimes h_0,\ \dots,\ \sum_{i=0}^M g_i \otimes h_{M-i},\ \dots\right),$$
$$\big\langle (g_0, g_1, \dots, g_M, \dots), (h_0, h_1, \dots, h_M, \dots) \big\rangle_{T(H)} := \langle g_0, h_0 \rangle_{\mathbb{R}} + \langle g_1, h_1 \rangle_H + \cdots + \langle g_M, h_M \rangle_{H^{\otimes M}} + \cdots.$$
The elements of the tensor power algebra with finite norm $\|g\|_{T(H)} = \sqrt{\langle g, g \rangle_{T(H)}} < \infty$ are denoted by $T(H)$. [2]

[2] This is slightly non-standard notation: usually $T(H)$ equals $\bigoplus_{M=0}^\infty H^{\otimes M}$.

Further, we canonically identify $H^{\otimes M}$ with the sub-Hilbert space of $T(H)$ that contains exactly the elements of the type $g = (0, \dots, 0, g_M, 0, \dots)$, which we call homogeneous (of degree $M$). Under this identification, for $g \in T(H)$, we will also write, for reading convenience, $g_0 + g_1 + g_2 + \cdots$ instead of $(g_0, g_1, g_2, \dots)$, where we may opt to omit zeros in the sum. We adopt an analogous use of the sum and product signs $\sum$, $\prod$.

Example 3.8. We work out the tensor algebra multiplication and scalar product in the case of the prototypical example $H = \mathbb{R}^n$. Consider homogeneous elements of degree 2: it holds that $H^{\otimes 2} = \mathbb{R}^n \otimes \mathbb{R}^n \cong \mathbb{R}^{n \times n}$, and the tensor product of two vectors is $v \otimes w = vw^\top$. The trace product on $\mathbb{R}^{n \times n}$ is induced by the Euclidean scalar product, since $\langle v_1 w_1^\top, v_2 w_2^\top \rangle = \langle v_1, v_2 \rangle \langle w_1, w_2 \rangle$. Homogeneous elements of degree 3 are similar: it holds that $H^{\otimes 3} \cong \mathbb{R}^{n \times n \times n}$, and the tensor product of three vectors $u \otimes v \otimes w$ is a tensor $T$ of degree 3, where $T_{ijk} = u_i v_j w_k$ for all $i, j, k \in [n]$. The scalar product on $\mathbb{R}^{n \times n \times n}$ is the tensor trace product, $\langle T, T' \rangle_{H^{\otimes 3}} = \sum_{i,j,k=1}^n T_{ijk} T'_{ijk}$.

An element $g \in T(H)$ is of the form $g = c + v + M + T + \cdots$, where $c \in \mathbb{R}$, $v \in \mathbb{R}^n$, $M \in \mathbb{R}^{n \times n}$, $T \in \mathbb{R}^{n \times n \times n}$, etc. As an example of tensor algebra multiplication, it holds that
$$g * g = c^2 + 2cv + \left(2cM + vv^\top\right) + \left(2cT + v \otimes M + M \otimes v\right) + \cdots.$$
Note the difference between $v \otimes M$ and $M \otimes v$: it holds that $(v \otimes M)_{ijk} = v_i M_{jk}$, while $(M \otimes v)_{ijk} = v_k M_{ij}$.

One checks that $T(H)$ is indeed an algebra:

Proposition 3.9. $(T(H), +, *)$ is an associative $\mathbb{R}$-algebra with $(1, 0, \dots, 0, \dots)$ resp. $(0, \dots, 0, \dots)$ as multiplicative resp. additive neutral element. In general, the tensor algebra multiplication $*$ is not commutative.

Proof. Verifying the axioms of an associative $\mathbb{R}$-algebra is a series of non-trivial but elementary calculations. To see that $*$ is not commutative, consider the counter-example where $g, h \in H$ are linearly independent. Then $g \otimes h \ne h \otimes g$ (in the case of $H = \mathbb{R}^n$, this is $gh^\top \ne hg^\top$).

Remark 3.10. We further emphasize the following points, also for a reader who may already be familiar with the tensor product / outer product:
(i) The Cartesian product $H^M$ is different from the tensor product $H^{\otimes M}$, in the same way as $(\mathbb{R}^n)^2$ is different from $\mathbb{R}^{n \times n} = (\mathbb{R}^n)^{\otimes 2}$ and $(\mathbb{R}^n)^3$ from $\mathbb{R}^{n \times n \times n} = (\mathbb{R}^n)^{\otimes 3}$.
(ii) In general, $*$ is different from the formal tensor product / outer product of elements, since for $g, h \in T(H)$, the formal tensor product $g \otimes h$ is an element of $T(H) \otimes T(H)$, while the tensor power algebra product $g * h$ is an element of $T(H)$.
(iii) Under the identification introduced above, there is one case where $*$ coincides with the tensor product — namely, when $g$ and $h$ are homogeneous. Identifying $g \in H^{\otimes m}$ and $h \in H^{\otimes n}$, it holds that $g \otimes h \in H^{\otimes (m+n)}$ may be identified with $g * h$, which is also homogeneous. No equivalence of this kind holds when $g$ and $h$ are not homogeneous.
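To make Definition 3.7 and Example 3.8 concrete, here is a small sketch (our code and names, assuming NumPy) of the truncated tensor power algebra over $H = \mathbb{R}^d$, representing an element as a list of degree-$m$ tensors:

```python
import numpy as np

def ta_product(g, h):
    """Truncated tensor-power-algebra product (Definition 3.7), H = R^d.

    g, h: lists [g_0, ..., g_M], where g_m is an array of shape (d,)*m
    (g_0 is a 0-d scalar array).  Degree m of g * h is sum_i g_i (x) h_{m-i};
    degrees above M are dropped (truncation).
    """
    M = len(g) - 1
    return [sum(np.multiply.outer(g[i], h[m - i]) for i in range(m + 1))
            for m in range(M + 1)]

def ta_inner(g, h):
    """Inner product <g, h>_{T(H)}: the sum of degree-wise trace products."""
    return sum(float(np.sum(gi * hi)) for gi, hi in zip(g, h))

# Example (d = 2, truncated at degree 2): g = c + v + 0.
g = [np.array(1.0), np.array([1.0, 2.0]), np.zeros((2, 2))]
gg = ta_product(g, g)  # degree 1 is 2*c*v; degree 2 is v (x) v (+ 2*c*M if M != 0)
```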
3.3. The signature as a canonical feature set

We are now ready to define the signature features.

Definition 3.11. We call the mapping
$$S : BV([0,1], H) \to T(H), \quad x \mapsto \sum_{M \ge 0} \int_{\Delta_M([0,1])} (dx)^{\otimes M}$$
the signature map of $H$-valued paths, and we refer to $S(x)$ as the signature features of $x \in BV$. Similarly, we define the (level-$M$-)truncated signature map as
$$S_{\le M} : BV([0,1], H) \to T(H), \quad x \mapsto \sum_{m=0}^M \int_{\Delta_m([0,1])} dx^{\otimes m}.$$

The above is well-defined since the signature can be shown to have finite norm:

Lemma 3.12. Let $x \in BV([0,1], H)$. Then:
(i) $\left\| \int_{\Delta_M([0,1])} dx^{\otimes M} \right\|_{H^{\otimes M}} \le \frac{1}{M!} V(x)^M < \infty$,
(ii) $\| S(x) \|_{T(H)} \le \exp(V(x)) < \infty$.

Proof. (i) is classical in the literature on bounded variation paths; it is also proven in Lemma B.4 of the appendix. (ii) follows from (i) by observing that $\|S(x)\|_{T(H)} \le \sum_{M=0}^\infty \left\| \int_{\Delta_M([0,1])} dx^{\otimes M} \right\|_{H^{\otimes M}}$ due to the triangle inequality, then substituting (i) and the Taylor expansion of exp.

There are several reasons why the signature features are (practically and theoretically) attractive, which we summarize before stating the results exactly:

1. The signature features are a mathematically faithful representation of the underlying sequential data $x$: the map $x \mapsto S(x)$ is essentially one-to-one.
2. The signature features are sequentially ordered analogues of polynomial features and moments. The tensor algebra has a natural grading, with $M$ designating the "polynomial degree". It is further canonically compatible with natural operations on $BV([0,1], H)$.
3. Linear combinations of signature features approximate continuous functions of sequential data arbitrarily well. This is in analogy with classic polynomial features and implies that signature features are as rich a class for learning purposes as one can hope for.

Theorem 1 (Signature uniqueness). Let $x, y \in BV([0,1], H)$. Then $S(x) = S(y)$ if and only if $x$ and $y$ are equal up to tree-like equivalence. [3]

Proof. This is [20, Theorem 2.29 (ii)].

Remark 1. Being not tree-like equivalent is a very weak requirement; e.g., if $x$ and $y$ have a strictly increasing coordinate, they are not tree-like equivalent. All the data we will encounter in the experiments is not tree-like. Even if presented with a tree-like path, simply adding time as an extra coordinate (that is, working with $t \mapsto (t, x(t))$ instead of $t \mapsto x(t)$) guarantees that the assumptions of the above Theorem are met.

Remark 2. The above Theorem extends to paths of unbounded variation, cf. [2].

Secondly, the signature features are analogous to polynomial features: the tensor algebra has a natural grading, with $M$ designating the "polynomial degree".

Theorem 2 (Chen's Theorem). Let $x \in BV([0,1], H)$; then
$$\int_{\Delta_M} dx^{\otimes M} \otimes \int_{\Delta_N} dx^{\otimes N} = \sum_\sigma \sigma\!\left(\int_{\Delta_{M+N}} dx^{\otimes (M+N)}\right).$$
Here the sum is taken over all ordered shuffles
$$\sigma \in \mathrm{OS}_{M,N} = \left\{\sigma :\ \sigma \text{ permutation of } \{1, \dots, M+N\},\ \sigma(1) < \cdots < \sigma(M),\ \sigma(M+1) < \cdots < \sigma(M+N)\right\},$$
and $\sigma \in \mathrm{OS}_{M,N}$ acts on $H^{\otimes (M+N)}$ as $\sigma(e_{i_1} \otimes \cdots \otimes e_{i_{M+N}}) = e_{\sigma(i_1)} \otimes \cdots \otimes e_{\sigma(i_{M+N})}$.

Finally, as a direct consequence of the above, and again in analogy with classic polynomial features, linear combinations of signature features approximate continuous functions of sequential(!) data arbitrarily well.

Theorem 3 (Linear approximations). Let $P$ be a compact subset of $BV([0,1], H)$ of paths that are pairwise not tree-like equivalent. Let $f : P \to \mathbb{R}$ be continuous in variation norm. Then for any $\varepsilon > 0$ there exists a $w \in T(H)$ such that
$$\sup_{x \in P} \left| f(x) - \langle w, S(x) \rangle_{T(H)} \right| < \varepsilon.$$

Proof. The statement follows from the Stone–Weierstraß theorem if the set $F \subset C(P, \mathbb{R})$,
$$F := \mathrm{span}\left\{ P \ni x \mapsto \langle e_{i_1} \otimes \cdots \otimes e_{i_M}, S(x) \rangle_{T(H)},\ M \in \mathbb{N} \right\}, \tag{3.2}$$
forms a point-separating algebra. However, this is a direct consequence of the above: by Chen's theorem, Theorem 2, $F$ is an algebra, and by the signature uniqueness, Theorem 1, $F$ separates points.

The above shows more than stated: for a fixed ONB $(e_i)$ of $T(H)$, there exists a finite subset of this ONB such that $w$ can be found in the linear span of this finite set.

[3] We call $x, y$ tree-like equivalent if $x \sqcup y^{-1}$ is tree-like. A path $z \in BV([a,b])$ is called tree-like if there exists a continuous map $h : [a,b] \to [0,\infty)$ with $h(a) = h(b) = 0$ and $\|z(t) - z(s)\| \le h(s) + h(t) - 2\inf_{u \in [s,t]} h(u)$ for all $s \le t \in [a,b]$.
4. Kernelized signatures and sequentialized kernels

Our goal is to construct a kernel for sequential data of arbitrary type, to enable learning with such data. We proceed in two steps and first discuss the case where the sequential data are sequences in the Hilbert space $H$ (for example $H = \mathbb{R}^n$). In this scenario, the properties of the signature, presented in Section 3, suggest as a kernel the scalar product of the signature features. This yields the following kernels.

Definition 4.1. Fix $M \in \mathbb{N}$. We define the kernels
$$K^\oplus : BV(H) \times BV(H) \to \mathbb{R}, \quad (x, y) \mapsto \langle S(x), S(y) \rangle_{T(H)},$$
$$K^\oplus_{\le M} : BV(H) \times BV(H) \to \mathbb{R}, \quad (x, y) \mapsto \langle S_{\le M}(x), S_{\le M}(y) \rangle_{T(H)}.$$
We refer to $K^\oplus$ as the signature kernel and to $K^\oplus_{\le M}$ as the signature kernel truncated at level $M$.

To make these kernels practically meaningful, we need to verify a number of points:

(a) That they are well-defined, positive (semi-)definite kernels. Note that checking finiteness of the scalar product is not immediate (but follows from well-known estimates for the signature features).

(b) That they are efficiently computable. A naive evaluation is infeasible, due to the combinatorial explosion of the number of signature features. However, we show that $K^\oplus$ and $K^\oplus_{\le M}$ can be expressed entirely in integrals of inner products $\langle dx(s), dy(t) \rangle_H$. The formula can be written as an efficient recursion, similar to the Horner scheme for efficient evaluation of polynomials.

(c) That they are robust under discretization: the issue is that paths are never directly observed, since all real observations are discrete sequences. The subsequent Section 5 introduces discretizations for signatures and for the two kernelization steps above, which are canonical and consistent with the continuous versions above, in a sampling sense of discrete sequences in $H^+$ converging to bounded variation paths in $BV([0,1], H)$.

(d) That they are robust under noise: in most situations, our measurements of the underlying path are perturbed by random perturbations and noise. We discuss the common situation of additive white noise / Brownian motion in Section 7.

We refer to the above procedure as "kernel trick one" and discuss it below in Section 4.1 and Section 4.2. In a second step, to which we refer as "kernel trick two", we show that the above is also meaningful for sequential data in an arbitrary set $X$. This second step yields, for any primary kernel $k$ on (static objects in) $X$, a sequential kernel $k^\oplus$ on (a sufficiently regular subset of) paths $P(X) \subset [[0,1] \to X]$. We thus call this procedure, which transforms a static kernel $k$ on $X$ into a kernel $k^\oplus$ on sequences in $X$, the sequentialization (of the kernel $k$).

Definition 4.2. Fix $M \in \mathbb{N}$ and $k : X \times X \to \mathbb{R}$, $k(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle_H$. We define [4]
$$k^\oplus : P(X) \times P(X) \to \mathbb{R}, \quad (\sigma, \tau) \mapsto \langle S(\phi(\sigma)), S(\phi(\tau)) \rangle_{T(H)},$$
$$k^\oplus_{\le M} : P(X) \times P(X) \to \mathbb{R}, \quad (\sigma, \tau) \mapsto \langle S_{\le M}(\phi(\sigma)), S_{\le M}(\phi(\tau)) \rangle_{T(H)}.$$
We refer to $k^\oplus$ as the sequentialization of the kernel $k$, and to $k^\oplus_{\le M}$ as the sequentialization of the kernel $k$ truncated at level $M$.

As we have done for $K^\oplus$ and $K^\oplus_{\le M}$, we again need to verify points (a), (b), (c), (d) for $k^\oplus$ and $k^\oplus_{\le M}$.

[4] For $\phi : X \to H$ and $\sigma \in [[0,1] \to X]$ we denote by $\phi(\sigma) \in [[0,1] \to H]$ the path $t \mapsto \phi(\sigma(t))$.
Point (a) follows, under appropriate regularity assumptions on $P(X)$, immediately from the corresponding statement for $K^\oplus$ and $K^\oplus_{\le M}$. For point (b), note that although the data enters $K^\oplus$ and $K^\oplus_{\le M}$ in the recursion formula only in the form of scalar products of differentials in $H$, it is mathematically somewhat subtle to replace these by evaluations of $k$ over an arbitrary set $X$, due to the differential operators involved. However, we will do so by identifying the kernel $k$ with a signed measure on $[0,1]^2$. Points (a) and (b) will be discussed in this section, point (c) will be the main topic of Section 5, and point (d) is discussed in Section 7.

4.1. Well-definedness of the signature kernel

For (a), well-definedness, note: $K^\oplus$ and $K^\oplus_{\le M}$ are positive definite kernels, since they are explicitly defined as a scalar product of features. Also, these scalar products are always finite (thus well-defined) for paths of bounded variation, as the following Lemma 4.3 shows.

Lemma 4.3. Let $x, y \in BV(H)$. Then it holds that
(i) $|K^\oplus_{\le M}(x, y)| \le \frac{1}{M!^2} V(x)^M V(y)^M < \infty$,
(ii) $|K^\oplus(x, y)| \le \exp(V(x) + V(y)) < \infty$.

Proof. This follows from Lemma 3.12 and the Cauchy–Schwarz inequality.

Hence, $K^\oplus$ and $K^\oplus_{\le M}$ are well-defined (positive definite) kernels.

4.2. Kernel trick number one: kernelizing the signature

The kernel trick consists of defining a kernel which is (b) efficiently computable. For this, we show that $K^\oplus$, $K^\oplus_{\le M}$ can be entirely expressed in terms of $H$-scalar products:

Proposition 4.4. Let $x, y \in BV(H)$. Then:
(i) $K^\oplus(x, y) = \sum_{m=0}^\infty \int_{(s,t) \in \Delta_m \times \Delta_m} \prod_{i=1}^m \langle dx(s[i]), dy(t[i]) \rangle_H$,
(ii) $K^\oplus_{\le M}(x, y) = \sum_{m=0}^M \int_{(s,t) \in \Delta_m \times \Delta_m} \prod_{i=1}^m \langle dx(s[i]), dy(t[i]) \rangle_H$.

Proof. The formulas follow from substituting the definitions of $K^\oplus_{\le M}$, $K^\oplus$ and using the linearity of the integrals (recall that we use the convention $\prod_{i=1}^0 \cdots = 1$).

Furthermore, there are the following Horner-scheme-type recursions:

Proposition 4.5. Let $x, y \in BV(H)$. Then
(i) $K^\oplus(x, y) = 1 + \int_{(s_1, t_1) \in (0,1) \times (0,1)} \left(1 + \int_{(s_2, t_2) \in (0,s_1) \times (0,t_1)} (1 + \cdots) \cdots \langle dx(s_2), dy(t_2) \rangle_H \right) \langle dx(s_1), dy(t_1) \rangle_H$,
(ii) $K^\oplus_{\le M}(x, y) = 1 + \int_{(s_1, t_1) \in (0,1) \times (0,1)} \left(1 + \cdots \int_{(s_M, t_M) \in (0,s_{M-1}) \times (0,t_{M-1})} \langle dx(s_M), dy(t_M) \rangle_H \cdots \right) \langle dx(s_1), dy(t_1) \rangle_H$.

Proof. This follows from Proposition 4.4 and an elementary computation.

The recursion is the mathematically precise form of the iterated expectation from Section 1.3. In Section 5, we show that this recursion is preserved under discretization.

4.3. Kernel trick number two: sequentialization of kernels

As shown in the previous Section 4.2, the signature kernel on bounded variation paths over $H$ can be entirely expressed in scalar products over $H$. However, in general, the data will not be directly observed as sequences in a (potentially infinite dimensional) Hilbert space $H$, but in some arbitrary set $X$. Sequences in $X$ can be treated with a second kernelization step: we choose a primary feature map $\phi : X \to H$ for (non-sequential) data in $X$. By Proposition 4.4, the sequential kernel can be expressed in scalar products in $H$, and therefore, after the second kernelization step, in the primary kernel. We make this point precise.
Assumption 4.6. For the general situation of sequences in an arbitrary set $X$, we assume:
(i) The potential observations, denoted $P(X)$, are sequential data of type $[[0,1] \to X]$.
(ii) The potential observations $P(X)$ are absolutely continuous paths when the corresponding sequences of primary features are considered. That is, for every potential observation $\sigma \in P(X)$, the concatenation $\phi \circ \sigma$ is an element of $BV(H)$ that is absolutely continuous. [5]
(iii) Scalar products of primary features can be efficiently computed via a kernel function $k : X \times X \to \mathbb{R}$, $k(x, y) = \langle \phi(x), \phi(y) \rangle_H$. More precisely, $H$ is the RKHS associated to the kernel $k$.

Under Assumption 4.6, the formulas given in Proposition 4.4 and Proposition 4.5 suggest substituting $x = \phi(\sigma)$, $y = \phi(\tau)$ and replacing $\langle dx(s[i]), dy(t[i]) \rangle_H$ with an evaluation of $k$. However, the differential sits between the scalar product and $\phi$, and hence a naive replacement of the scalar product by the primary kernel $k$ is not possible. Below we show that a slightly more technical form of the iterated expectation formula of the introductory Section 1.2 holds, and agrees with that formula for differentiable paths:

Proposition 4.7. Let Assumption 4.6 be satisfied. Then $k^\oplus$ and $k^\oplus_{\le M}$ are positive definite kernels on $P(X)$, and for $\sigma, \tau \in P(X)$:
(i) $k^\oplus(\sigma, \tau) = \sum_{m=0}^\infty \int_{(s,t) \in \Delta_m \times \Delta_m} d\kappa_{\sigma,\tau}(s[1], t[1]) \cdots d\kappa_{\sigma,\tau}(s[m], t[m])$,
(ii) $k^\oplus_{\le M}(\sigma, \tau) = \sum_{m=0}^M \int_{(s,t) \in \Delta_m \times \Delta_m} d\kappa_{\sigma,\tau}(s[1], t[1]) \cdots d\kappa_{\sigma,\tau}(s[m], t[m])$,
for a suitably chosen signed measure $\kappa_{\sigma,\tau}$ on $[0,1]^2$. If $X$ is an $\mathbb{R}$-vector space and $\sigma, \tau$ are differentiable, then $d\kappa_{\sigma,\tau}(s, t) = k(\dot{\sigma}(s), \dot{\tau}(t))\, ds\, dt$.

Proof. The proof is carried out in Appendix A.

As before, the above can be written as Horner-type recursions:

Proposition 4.8. Let Assumption 4.6 be satisfied and $\sigma, \tau \in P(X)$. Then:
(i) $k^\oplus(\sigma, \tau) = 1 + \int_{(s_1, t_1) \in (0,1) \times (0,1)} \left(1 + \int_{(s_2, t_2) \in (0,s_1) \times (0,t_1)} (1 + \cdots)\, d\kappa_{\sigma,\tau}(s_2, t_2)\right) d\kappa_{\sigma,\tau}(s_1, t_1)$,
(ii) $k^\oplus_{\le M}(\sigma, \tau) = 1 + \int_{(s_1, t_1) \in (0,1) \times (0,1)} \left(1 + \cdots \int_{(s_M, t_M) \in (0,s_{M-1}) \times (0,t_{M-1})} d\kappa_{\sigma,\tau}(s_M, t_M) \cdots \right) d\kappa_{\sigma,\tau}(s_1, t_1)$.

Proof. This follows from Proposition 4.7 and an elementary calculation.

[5] Absolutely continuous paths are dense in variation norm in $BV$. We exclude non-absolutely-continuous paths (like Cantor functions) to avoid technicalities. In Section 7 we show that the above can be generalized to much "rougher" paths than bounded variation.

5. Discrete signatures and kernels for sequences

As mentioned under point (c) in Section 4, there is one major practical issue with the kernels introduced in Section 4: continuous sequences are in reality never fully observed, since observations in practice are always finite, for example given as time points at which a time series is sampled. To be more precise, instead of having full knowledge of $\sigma \in [[0,1] \to X]$, we can use at most a finite number of samples, $\sigma(s) = (\sigma(s_1), \dots, \sigma(s_n)) \in X^+$ for $s \in \Delta$. We address this by providing a discretization of all prior concepts: a discrete version $S : H^+ \to T(H)$ of the signature $S$, a discrete version $K^+ : H^+ \times H^+ \to \mathbb{R}$ of the signature kernel $K^\oplus$, and a discrete version $k^+ : X^+ \times X^+ \to \mathbb{R}$ of the sequentialization $k^\oplus$ of $k$.
The discretization is based on an elementary integral-vs-sum approximation argument, in which the iterated integral
$$\int_{\Delta_m([0,1])} dx^{\otimes m}$$
is approximated by the sum
$$\sum_{\substack{i \sqsubset [\ell(\nabla x)] \\ \ell(i) = m}} \nabla x[i_1] \otimes \cdots \otimes \nabla x[i_m],$$
where $x = x(t)$ is a suitable discretization of the bounded variation path $x \in BV([0,1], H)$ with $t \in \Delta([0,1])$. Both the integral and the sum live in the tensor power algebra, and we will show that the sum approximates the integral in a sampling sense. Similarly, one can define a signature for discrete sequences that approximates the continuous signature. The beginning of this section showcases qualitative statements of this approximative kind.

In analogy to Section 4, we proceed in two steps. In the first step, "kernel trick one", we use the discretized signature $S$ to define a kernel $K^+$ for sequences in $H^+$ (of possibly different length). It can be expressed as a polynomial in scalar products in $H$, as follows [6]:
$$K^+(x, y) = \sum_{\substack{i \sqsubset [\ell(\nabla x)],\ j \sqsubset [\ell(\nabla y)] \\ \ell(i) = \ell(j)}}\ \prod_{r=1}^{\ell(i)} \langle \nabla x[i_r], \nabla y[j_r] \rangle_H, \tag{5.1}$$
and a Horner-type formula is also derived (see Proposition 5.5 below). In a second step, "kernel trick two", we replace the inner products of finite differences in (5.1) by a linear combination of evaluations of the kernel $k$. This gives a kernel $k^+$ defined on sequences in $X$. Primary kernels $k$ on arbitrary sets $X$ can thus be "sequentialized" to kernels on sequences in such sets.

Besides the Horner-type formula, the main theoretical result of this section is that both kernels are robust, in the sense that $K^+(x(s), y(t))$ and $k^+(\sigma(s), \tau(t))$ converge to $K^\oplus(x, y)$ and $k^\oplus(\sigma, \tau)$ as the mesh of $s$ and $t$ vanishes. In Section 6 we will also see that the well-known string kernels can be derived as a special case of sequentialization, for sequences of symbols; in Section 7 we show that the above kernels are also robust under noise, that is, when the bounded variation assumption does not hold; in Section 8 we will further show that, despite an exponential number of terms in the sum-product expansion above, the sequential kernels can be computed efficiently.

[6] Recall our convention $\prod_{r=1}^0 f(r) = 1$ and that $i \sqsubset [L]$ includes the empty tuple.
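Formula (5.1) can also be evaluated literally, enumerating the strictly increasing tuples of Definition 2.5. The following brute-force sketch (our code, assuming NumPy, truncated at tuple length $M$ for computability, and of exponential cost) is useful only as a correctness check against the dynamic program sketched in Section 1.3:

```python
import numpy as np
from itertools import combinations

def K_plus_bruteforce(x, y, M):
    """Literal evaluation of (5.1) for sequences x, y in R^d, truncated
    at tuple length M: sums products of increment inner products over
    order-matched strictly increasing index tuples of equal length."""
    dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
    G = dx @ dy.T                         # G[i, j] = <∇x[i], ∇y[j]>
    total = 1.0                           # the empty tuple, m = 0
    for m in range(1, M + 1):
        for i in combinations(range(len(dx)), m):
            for j in combinations(range(len(dy)), m):
                total += np.prod(G[list(i), list(j)])
    return total
```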
5.1. Discretizing the signature

To obtain a kernel $K^+$ for discrete sequences in $H^+$ that mimics the signature kernel $K^\oplus$, we pass from the Riemann–Stieltjes integral to a discretized, finite-sum approximation. The central mathematical observation is that
$$\bigotimes_{j=1}^{\ell(\nabla x)} (1 + \nabla x[j]) = \sum_{i \sqsubset [\ell(\nabla x)]} \nabla x[i_1] \otimes \cdots \otimes \nabla x[i_{\ell(i)}] \;\approx\; \sum_{m=0}^{\infty} \int_{\Delta^m([0,1])} \mathrm{d}x^{\otimes m},$$
that is, the sum approximating the signature can be written as an outer product reminiscent of an exponential approximation. The mathematical statement underlying our approximation generalizes Euler's famous theorem on the exponential function. Euler's original theorem and its proof are given below, in a form that is slightly more quantitative than how it is usually presented, while being closer to our later approximation result:

Theorem 4. Let $x \in \mathbb{R}$, $n \in \mathbb{N}$. Then
$$\Big(1 + \frac{x}{n}\Big)^n - \exp(x) = g(x,n), \quad \text{where} \quad \|g(x,n)\| \leq \frac{\exp(x)}{n}\Big(1 + \frac{x^n}{(n-2)!}\Big).$$
In particular, it holds that $\lim_{n\to\infty} (1 + x/n)^n = \exp(x)$, where convergence is uniform of order $O(n^{-1})$ on any compact subset of $\mathbb{R}$.

Proof. All statements follow from the first, which we proceed to prove. By the binomial theorem, it holds that
$$\Big(1 + \frac{x}{n}\Big)^n = \sum_{k=0}^{n} \binom{n}{k} \cdot \frac{x^k}{n^k}.$$
From the definition of the binomial coefficient and an elementary computation, one obtains
$$\binom{n}{k} \cdot \frac{x^k}{n^k} = \frac{x^k}{k!} + g(x,n,k), \quad \text{where} \quad \|g(x,n,k)\| \leq \frac{x^k}{k!\,n}, \quad \text{for } k \leq n.$$
For $k \geq n$, one has $\frac{x^k}{k!} \leq \frac{x^n}{n!} \cdot \frac{x^{k-n}}{(k-n)!}$. Putting together all inequalities and using the Taylor expansion of $\exp$ yields the claim.

Approximation bounds for a discretized version of the signature approximating the continuous one will be given by a generalization of Euler's Theorem 4. We introduce a discretized variant of the signature for sequences, as follows:

Definition 5.1. We call the map $S: H^+ \to T(H)$ defined as $S(x) = \prod_{i=1}^{\ell(\nabla x)} (1 + \nabla x[i])$ the (approximate) signature map for discrete sequences.

We are ready to state the main theorem for discrete approximation of the signature. The gist of Theorem 5 is that the signature of sub-sequences of $x$ approximates the actual signature of $x$, in a limit similar to Euler's Theorem 4.

Theorem 5. Let $x \in BV(H)$ and $t \in \Delta_{M+1}([0,1])$. Then
$$\|S(x(t)) - S(x)\|_{T(H)} \leq \exp(V(x)) - \exp(V_t(x)),$$
where $\exp(r) := \prod_{i=1}^{\ell(r)} (1 + r[i])$ for a sequence $r \in \mathbb{R}_+^+$ (for a scalar argument, $\exp$ denotes the usual exponential), and $V_t(x) := (V(x[t_1,t_2]), \ldots, V(x[t_M,t_{M+1}])) \in \mathbb{R}^M$ (as usual, $x[a,b]$ denotes the restriction of $x$ to $BV([a,b],H)$). Further, it holds that:

(i) The right-hand side is always positive and can be bounded as follows:
$$0 \leq \exp(V(x)) - \exp(V_t(x)) \leq V(x)\exp(V(x)) \cdot \|V_t(x)\|_0,$$
where, as usual, we denote $\|V_t(x)\|_0 = \max_i V(x[t_i,t_{i+1}])$.

(ii) One has convergence on the right side, $\lim_{V(x(t)) \to V(x)} \exp(V_t(x)) = \exp(V(x))$ (and similarly for the truncated versions), uniform of order $O(\|V_t(x)\|_0)$ on any compact subset of $BV(H)$.

(iii) In particular, one has convergence of the discretized signature to the continuous signature, $\lim_{V(x(t)) \to V(x)} S(x(t)) = S(x)$ (hence also for $S_{\leq M}$), uniform of order $O(\|V_t(x)\|_0)$ on any compact subset of $BV(H)$.

(iv) If $t$ is chosen such that $V(x[t_i,t_{i+1}]) = V(x)/M$ for all $i$, then one has the asymptotically tight bound
$$\|S(x(t)) - S(x)\|_{T(H)} \leq \frac{\exp(V(x))}{M}\Big(1 + \frac{V(x)^M}{(M-2)!}\Big).$$

Proof. The proof is carried out in Appendix B: the main statement follows from Theorem 6 (ii); points (i) and (ii) follow from applying Proposition B.6 to $V(x) = \sum_{i=1}^{M} V(x[t_i,t_{i+1}])$; point (iii) follows from (ii); point (iv) follows from Theorem 6 (iii).

Euler's original Theorem 4 is recovered by substituting $x: I \to \mathbb{R},\; t \mapsto x' \cdot t$ in Theorem 5 (iii), where $x'$ is the $x$ of Theorem 4. Similar statements for the truncated signature may be derived, where the bounds are truncated versions of the ones given; for an exact mathematical formulation, see Theorem 6 (i).
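As a quick sanity check of the rate claimed in Theorem 4 (our own illustration, not part of the original text), the scaled error $n \cdot |(1+x/n)^n - \exp(x)|$ stabilizes as $n$ grows, consistent with uniform convergence of order $O(n^{-1})$:

    import numpy as np

    x = 1.5
    for n in (10, 100, 1000, 10000):
        err = abs((1 + x / n) ** n - np.exp(x))
        print(n, err, err * n)   # err * n stabilizes: convergence of order O(1/n)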
Remark 5.2 (About geometric approximations). In general, there is no bounded variation path $x' \in BV(H)$ such that $S(x') = S(x(t))$, even though both $S(x(t))$ and $S(x)$ are elements of the tensor algebra $T(H)$. Thus the approximation characterized in Theorem 5 is an algebraic, non-geometric approximation on the level of signatures (= algebraic objects), since, in general, it does not arise from an approximation or discretization on the level of paths (= geometric objects). The latter is the common type of approximation for (rough) paths, see [8]. The reader familiar with such approximations on the level of paths is invited to follow the exposition in parallel with any path-level approximation in mind. The theoretical considerations can be carried out in analogy for a while, but to our knowledge do not lead to a kernel which is as efficiently computable (as opposed to the efficient Algorithm 3 discussed in Section 8); see also Remark 8.1 about approximations of inner products of signature features of piecewise linear paths.

5.2. Kernel trick number one discretized

In the discrete setting, we have access only to the discrete signature $S(x(t))$ of a path $x$, at a finite sequence of points $t \in \Delta$. Theorem 5 guarantees that $S(x(t))$ is a good approximation of the continuous signature $S(x)$ when $t$ is densely sampled (as $V(x(t))$ approaches $V(x)$). Both signatures live in the tensor algebra; it is therefore natural to obtain a discretized variant of the signature kernel as the scalar product of discretized signatures. Again, we also define a truncated version.

Definition 5.3. The discretized signature kernel $K^+$ for sequences in $H$, and its truncation $K^+_M$, are defined as
$$K^+: H^+ \times H^+ \to \mathbb{R},\; (x,y) \mapsto \langle S(x), S(y) \rangle_{T(H)}, \qquad K^+_M: H^+ \times H^+ \to \mathbb{R},\; (x,y) \mapsto \langle S_{\leq M}(x), S_{\leq M}(y) \rangle_{T(H)}.$$

Both $K^+$ and $K^+_M$ are positive (semi-)definite kernels, since they are explicitly defined as scalar products of features. Through the approximation Theorem 5, the sequential kernel naturally applies to a ground truth of bounded variation paths $x, y \in BV([0,1],H)$ which is sampled at a finite number of consecutive points $s, t \in \Delta([0,1])$, by considering $K^+(x(s), y(t))$ (note that $s$ and $t$ being of different lengths creates no issue in principle). We obtain a corollary of Theorem 5 for approximating the kernel function in this way:

Corollary 5.4. Let $x, y \in BV([0,1],H)$ and $s, t \in \Delta([0,1])$. Then
$$\big|K^+(x(s), y(t)) - K^\oplus(x,y)\big| \leq 4\exp(V(x) + V(y)) - 2\exp(V(y))\exp(V_s(x)) - 2\exp(V(x))\exp(V_t(y)).$$
In particular, it holds that
$$\lim_{\substack{V(x(s)) \to V(x) \\ V(y(t)) \to V(y)}} K^+(x(s), y(t)) = K^\oplus(x,y),$$
where convergence is uniform of order $O(\|V_s(x)\|_0 + \|V_t(y)\|_0)$ on any compact subset of $BV([0,1],H) \times BV([0,1],H)$.

Proof. It holds that
$$K^+(x(s),y(t)) - K^\oplus(x,y) = \langle S(x(s)), S(y(t)) \rangle_{T(H)} - \langle S(x), S(y) \rangle_{T(H)} = \langle S(x(s)), S(y(t)) - S(y) \rangle_{T(H)} + \langle S(x(s)) - S(x), S(y) \rangle_{T(H)}.$$
The Cauchy–Schwarz inequality implies that
$$\big|\langle S(x(s)), S(y(t)) - S(y) \rangle_{T(H)}\big| \leq \|S(x(s))\| \cdot \|S(y(t)) - S(y)\|,$$
and Theorem 5, together with Lemma 3.12 (i), implies that
$$\|S(x(s))\| \cdot \|S(y(t)) - S(y)\| \leq 2\exp(V(x)) \cdot \big(\exp(V(y)) - \exp(V_t(y))\big).$$
Similarly, one obtains
$$\big|\langle S(y(t)), S(x(s)) - S(x) \rangle_{T(H)}\big| \leq 2\exp(V(y)) \cdot \big(\exp(V(x)) - \exp(V_s(x))\big).$$
Putting all (in-)equalities together yields the main claim; the convergence statement follows from Theorem 5 (ii).

Note that the sampling points $s, t$ are crucial for the convergence statement, even though the definition of $K^\oplus$ itself depends only on $x, y$. A similar statement may be obtained for $K^+_M$ in analogy.
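Corollary 5.4 can be observed numerically on an example where $K^\oplus$ is available in closed form: for linear paths $x(t) = a\,t$, $y(t) = b\,t$, the $m$-th signature level is $a^{\otimes m}/m!$ (a standard fact about signatures of linear paths), so $K^\oplus(x,y) = \sum_m \langle a,b \rangle^m / (m!)^2$. The following sketch (our own illustration; `K_plus` is the brute-force function from the sketch in Section 5, repeated so the snippet runs standalone) shows the error shrinking as the mesh is refined; the brute-force cost restricts it to coarse grids.

    import itertools, math
    import numpy as np

    def K_plus(x, y):
        """Brute-force K+ of (5.1); exponential cost, coarse grids only."""
        dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
        total = 1.0
        for m in range(1, min(len(dx), len(dy)) + 1):
            for i in itertools.combinations(range(len(dx)), m):
                for j in itertools.combinations(range(len(dy)), m):
                    total += np.prod([dx[a] @ dy[b] for a, b in zip(i, j)])
        return total

    a, b = np.array([0.3, 0.1]), np.array([0.2, -0.4])
    K_cont = sum((a @ b) ** m / math.factorial(m) ** 2 for m in range(30))

    for n in (2, 4, 8):   # number of sampling intervals
        grid = np.linspace(0.0, 1.0, n + 1)[:, None]
        print(n, abs(K_plus(grid * a, grid * b) - K_cont))   # error shrinks with mesh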
We proceed by giving explicit formulae for $K^+$ that will become crucial for its efficient computation.

Proposition 5.5. Let $x, y \in H^+$. The following identities hold for the discretized signature kernels:

(i) $K^+(x,y) = \sum_{\substack{x' \sqsubset \nabla x,\; y' \sqsubset \nabla y \\ \ell(x') = \ell(y')}} \prod_{i=1}^{\ell(x')} \langle x'[i], y'[i] \rangle_H$,

(ii) $K^+_M(x,y) = \sum_{\substack{x' \sqsubset \nabla x,\; y' \sqsubset \nabla y \\ \ell(x') = \ell(y') \leq M}} \prod_{i=1}^{\ell(x')} \langle x'[i], y'[i] \rangle_H$,

(iii) $K^+(x,y) = 1 + \sum_{\substack{i_1 \geq 1 \\ j_1 \geq 1}} \langle \nabla x[i_1], \nabla y[j_1] \rangle \Big( 1 + \sum_{\substack{i_2 > i_1 \\ j_2 > j_1}} \langle \nabla x[i_2], \nabla y[j_2] \rangle \Big( 1 + \sum_{\substack{i_3 > i_2 \\ j_3 > j_2}} \langle \nabla x[i_3], \nabla y[j_3] \rangle \big( 1 + \cdots \big) \Big) \Big)$,

(iv) $K^+_M(x,y) = 1 + \sum_{\substack{i_1 \geq 1 \\ j_1 \geq 1}} \langle \nabla x[i_1], \nabla y[j_1] \rangle \Big( 1 + \sum_{\substack{i_2 > i_1 \\ j_2 > j_1}} \langle \nabla x[i_2], \nabla y[j_2] \rangle \Big( 1 + \cdots \sum_{\substack{i_M > i_{M-1} \\ j_M > j_{M-1}}} \langle \nabla x[i_M], \nabla y[j_M] \rangle \Big) \Big)$,

where the usual convention that a sum running over an empty index set evaluates to zero applies.

Proof. This follows directly from an explicit computation where both sides are expanded and compared.

Remark 5.6. All summation in Proposition 5.5 is finite, even for $K^+$: there is only a finite number of sub-sequences $x', y'$ in (i), and the summation effectively ends at length $M = \min(\ell(x), \ell(y)) - 1$. An important feature of the sum formula given by Proposition 5.5 (iv) is that it is more efficient to evaluate than the more naive presentation in Proposition 5.5 (ii), since it involves far fewer summands.

5.3. Kernel trick number two discretized

Both equalities in Proposition 5.5 show $K^+$ to be expressible entirely in terms of scalar products in $H$. Thus, if we are again in the situation of Assumption 4.6, that is, there is a primary kernel function $k: X \times X \to \mathbb{R}$ and a feature map $\phi: X \to H$ such that $k(\sigma,\tau) = \langle \phi(\sigma), \phi(\tau) \rangle_H$ for $\sigma, \tau \in X$, the sequential kernel can again be kernelized a second time.

Definition 5.7. Let $k: X \times X \to \mathbb{R}$. The (discrete) sequentialization $k^+$ of $k$ for sequences in $X$, and its truncation $k^+_M$, are defined as
$$k^+: X^+ \times X^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \langle S(\phi(\sigma)), S(\phi(\tau)) \rangle_{T(H)}, \qquad k^+_M: X^+ \times X^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \langle S_{\leq M}(\phi(\sigma)), S_{\leq M}(\phi(\tau)) \rangle_{T(H)}.$$

Following the presentation in the previous Section 5.1 and making the substitution $x = \phi(\sigma)$, $y = \phi(\tau)$ yields again an explicit sum formula and a recursive formula for $k^+$ and $k^+_M$, in direct analogy to Proposition 5.5. This discrete recursion is at the foundation of efficient computation in a practical setting, as demonstrated in Section 6 and Section 8.

Proposition 5.8. Let $\sigma, \tau \in X^+$ and $M \in \mathbb{N}$. The following identities hold for the sequentialization $k^+$ of $k$:

(i) $k^+(\sigma,\tau) = \sum_{\substack{i \sqsubset [\ell(\nabla\sigma)],\; j \sqsubset [\ell(\nabla\tau)] \\ \ell(i) = \ell(j)}} \prod_{r=1}^{\ell(i)} \nabla k(\sigma,\tau)[i[r], j[r]]$,

(ii) $k^+(\sigma,\tau) = 1 + \sum_{\substack{i_1 \geq 1 \\ j_1 \geq 1}} \nabla k(\sigma,\tau)[i_1,j_1] \cdot \Big( 1 + \sum_{\substack{i_2 > i_1 \\ j_2 > j_1}} \nabla k(\sigma,\tau)[i_2,j_2] \cdot \big( 1 + \cdots \big) \Big)$,

(iii) $k^+_M(\sigma,\tau) = \sum_{\substack{i \sqsubset [\ell(\nabla\sigma)],\; j \sqsubset [\ell(\nabla\tau)] \\ \ell(i) = \ell(j) \leq M}} \prod_{r=1}^{\ell(i)} \nabla k(\sigma,\tau)[i[r], j[r]]$,

(iv) $k^+_M(\sigma,\tau) = 1 + \sum_{\substack{i_1 \geq 1 \\ j_1 \geq 1}} \nabla k(\sigma,\tau)[i_1,j_1] \cdot \Big( 1 + \sum_{\substack{i_2 > i_1 \\ j_2 > j_1}} \nabla k(\sigma,\tau)[i_2,j_2] \cdot \Big( 1 + \cdots \sum_{\substack{i_M > i_{M-1} \\ j_M > j_{M-1}}} \nabla k(\sigma,\tau)[i_M,j_M] \Big) \Big)$,

where we use the notation
$$\nabla k(\sigma,\tau)[i,j] := k(\sigma[i+1], \tau[j+1]) + k(\sigma[i], \tau[j]) - k(\sigma[i], \tau[j+1]) - k(\sigma[i+1], \tau[j]).$$

Proof. Substituting $x = \phi(\sigma)$, $y = \phi(\tau)$, we have for $i = (i_1, \ldots, i_r)$, $j = (j_1, \ldots, j_r)$ that
$$\langle (\nabla x)[i_r], (\nabla y)[j_r] \rangle_H = \langle (\nabla\phi(\sigma))[i_r], (\nabla\phi(\tau))[j_r] \rangle_H$$
for the scalar products. One obtains, via an elementary computation, that $\langle (\nabla x)[i_r], (\nabla y)[j_r] \rangle_H = \nabla k(\sigma,\tau)[i_r, j_r]$. The statement then follows from Proposition 5.5.
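The second-difference array $\nabla k(\sigma,\tau)$ is all that is needed to evaluate $k^+$ via Proposition 5.8. A minimal sketch (our own illustration; numpy assumed, with primary kernels passed as vectorized functions returning the Gram matrix):

    import numpy as np

    def nabla_k(sigma, tau, k):
        """The array nabla-k(sigma, tau)[i, j] of Proposition 5.8, obtained from
        the primary Gram matrix by second-order finite differences."""
        G = k(sigma, tau)
        return G[1:, 1:] + G[:-1, :-1] - G[:-1, 1:] - G[1:, :-1]

    rng = np.random.default_rng(0)
    s, t = rng.standard_normal((5, 3)), rng.standard_normal((7, 3))

    # Consistency check: for the linear kernel, nabla-k reduces to inner products
    # of first differences of the sequences.
    lin = lambda X, Y: X @ Y.T
    assert np.allclose(nabla_k(s, t, lin), np.diff(s, axis=0) @ np.diff(t, axis=0).T)

    # Any other primary kernel can be plugged in, e.g. a Gaussian kernel:
    gauss = lambda X, Y: np.exp(-0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    K = nabla_k(s, t, gauss)   # input to the recursions of Proposition 5.8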
Remark 5.9. Again, a direct consequence of the approximation Theorem 5 is the convergence of $k^+$ to $k^\oplus$: $k^+(\sigma(s), \tau(t)) \to k^\oplus(\sigma,\tau)$ as $\operatorname{mesh}(s), \operatorname{mesh}(t)$ vanish (under Assumption 4.6). However, although the rate of convergence is explicitly given by Theorem 5, it depends on $V(\phi(\sigma))$ and $V(\phi(\tau))$.

Remark 5.10 (Variations on a theme). We further point out several more complex variants of the second kernelization:

(i) In the formula for the sequential kernel, one could also simply omit the first differences. This corresponds to cumulative summation of the sequences in feature space, or to replacing signatures by exponentials:
$$k^+: X^+ \times X^+ \to \mathbb{R};\; (\sigma,\tau) \mapsto \sum_{\substack{\sigma' \sqsubset \sigma,\; \tau' \sqsubset \tau \\ \ell(\sigma') = \ell(\tau')}} \prod_{r=1}^{\ell(\sigma')} k(\sigma'[r], \tau'[r]).$$
In practice, we have anecdotally found this to lead to large sums and numerical instabilities — though one could argue that the variant without finite differences is the simpler one.

(ii) More generally, one may consider a family of primary kernels of the form $k_m: X^m \times X^m \to \mathbb{R}$, not necessarily induced as products of kernels of type $k: X \times X \to \mathbb{R}$; sequentialization is possible via Proposition 5.5 (i), namely as
$$k^+: X^+ \times X^+ \to \mathbb{R};\; (\sigma,\tau) \mapsto \sum_{m} \sum_{\substack{i \sqsubset [\ell(\sigma)],\; j \sqsubset [\ell(\tau)] \\ \ell(i) = \ell(j) = m}} k_m(\sigma[i], \tau[j]).$$
This corresponds to choosing different kernels on different levels of the tensor algebra, e.g., for re-normalization.

(iii) One may additionally opt to have the primary kernel remember the position of elements in the sequence. This may be done by mapping sequences in $X^+$ to a position-remembering sequence in $(X \times \mathbb{N})^+$ first, and then sequentializing a primary kernel which is a product kernel of type $k_m((\sigma,i), (\tau,j)) = \kappa(i,j) \cdot k_m(\sigma,\tau)$, where $\kappa: \mathbb{N}^+ \times \mathbb{N}^+ \to \mathbb{R}$ is positive definite. This yields a sequential kernel of the form
$$k^+: X^+ \times X^+ \to \mathbb{R};\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubset [\ell(\sigma)],\; j \sqsubset [\ell(\tau)] \\ \ell(i) = \ell(j)}} \kappa(i,j) \cdot k_m(\sigma[i], \tau[j]).$$

From the viewpoint of the sequential kernel, the above choices, while minor, seem somewhat arbitrary. We will see in the coming Section 6 that they have their justification in explaining prior art. Nevertheless, following Occam's razor, we would argue that such choices should not be made unless they empirically improve the goodness of the method at hand over the mathematically simpler version of the sequential kernel, or of the sequentialization in Remark ??.

6. The string, alignment and ANOVA kernel seen via sequentializations

In this section we show how the sequential kernel is closely related to existing kernels for sequential data:

(a) String kernels [17, 19] may be seen as a special case of the sequential kernel.

(b) The global alignment kernel [5, 6] can be obtained from a special case of the sequential kernel by deleting terms that destroy positive definiteness.

(c) The ANOVA kernel arises by considering only the symmetric part of tensors in $T(H)$.

(d) The sequential kernel may be understood in the general framework of the relation-convolution kernel [14].
The link will be established to one of the more general variants of sequentialization presented in Remark 5.10, of the form
$$k^+: X^+ \times X^+ \to \mathbb{R};\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubset [\ell(\sigma)],\; j \sqsubset [\ell(\tau)] \\ \ell(i) = \ell(j)}} \kappa(i,j) \cdot \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r]),$$
where $k: X \times X \to \mathbb{R}$ and $\kappa: \mathbb{N}^+ \times \mathbb{N}^+ \to \mathbb{R}$ are suitably chosen.

6.1. The string kernel

The string kernel is a kernel for strings over a fixed alphabet $\Sigma$, say $\Sigma = \{A, B\}$. A number of variants exist; we will consider the original definition of the string kernel (Definition 1, page 423 of [19]). Modifications may be treated similarly to the below.

Definition 6.1. Fix a finite alphabet $\Sigma$. The string kernel on $\Sigma$ is
$$K_\Sigma: \Sigma^+ \times \Sigma^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubset [\ell(\sigma)] \\ j \sqsubset [\ell(\tau)]}} \lambda^{d(i)+d(j)} \cdot \mathbb{1}(\sigma[i] = \tau[j]) = \sum_{u \in \Sigma^+} \sum_{i:\, u = \sigma[i]} \sum_{j:\, u = \tau[j]} \lambda^{d(i)+d(j)},$$
where $\lambda \in \mathbb{R}_+$ is a parameter, $d(i) := i[\ell(i)] - i[1] + 1$ is the distance between the last and the first symbol in the sub-string given by $i$, and $\mathbb{1}(\sigma[i] = \tau[j])$ is the indicator function of the event that $\sigma[i]$ and $\tau[j]$ agree, i.e., one if $\sigma[i] = \tau[j]$ and zero otherwise.

For $\lambda = 1$, one has
$$K_\Sigma(\sigma,\tau) = \sum_{\substack{i \sqsubset [\ell(\sigma)] \\ j \sqsubset [\ell(\tau)]}} \mathbb{1}(\sigma[i] = \tau[j]) = \sum_{\substack{\sigma' \sqsubset \sigma,\; \tau' \sqsubset \tau \\ \ell(\sigma') = \ell(\tau')}} \prod_{r=1}^{\ell(\sigma')} k(\sigma'[r], \tau'[r])$$
for the choice of primary kernel $k(a,b) = \mathbb{1}(a = b)$ for symbols $a, b \in \Sigma$. Comparing with Proposition 5.5, this canonically identifies the string kernel with parameter $\lambda = 1$ with the sequentialization of the Euclidean scalar product $k(\cdot,\cdot) = \langle \cdot, \cdot \rangle$ on $X = \mathbb{R}^{|\Sigma|}$, via identifying a string $a \in \Sigma^L$ with the sequence
$$\Big( 0,\; e_{a[1]},\; e_{a[1]} + e_{a[2]},\; \ldots,\; \sum_{r=1}^{L} e_{a[r]} \Big) \in X^{L+1}.$$

We state the result for arbitrary $\lambda$, where one can also express the string kernel as a sequentialization:

Proposition 6.2. Consider the string kernel $K_\Sigma$ on an alphabet $\Sigma$. Then
$$K_\Sigma(\sigma,\tau) = \sum_{\substack{i \sqsubset [\ell(\sigma)] \\ j \sqsubset [\ell(\tau)]}} \lambda^{d(i)+d(j)} \cdot \mathbb{1}(\sigma[i] = \tau[j]) =: \sum_{\substack{i \sqsubset [\ell(\sigma)],\; j \sqsubset [\ell(\tau)] \\ \ell(i) = \ell(j)}} \kappa(i,j) \cdot \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r])$$
with the positive definite kernels $k(a,b) = \mathbb{1}(a = b)$ and $\kappa(i,j) = \lambda^{d(i)+d(j)}$.

Proof. The equality follows from substituting definitions. It remains to show that $k$ and $\kappa$ are positive definite. For $\kappa$, note that we have the scalar product representation $\kappa(i,j) = \langle \lambda^{d(i)}, \lambda^{d(j)} \rangle_{\mathbb{R}}$ (over the real numbers), therefore $\kappa$ is positive definite. Positive definiteness of $k$ follows similarly from the scalar product representation $k(a,b) = \sum_{c \in \Sigma} \mathbb{1}(a = c) \cdot \mathbb{1}(b = c)$.

Remark 6.3. The above shows that the string kernel arises as a sequentialization as introduced in Section 5. However, it is interesting to note that it is not in exact agreement with the continuous signature kernel $\langle S(x), S(y) \rangle$ when we associate with a string $a$ the path $x \in BV([0,L], \mathbb{R}^{|\Sigma|})$,
$$x(t) = \{t\} \cdot e_{a[\lfloor t \rfloor]} + \sum_{r=1}^{\lfloor t \rfloor} e_{a[r]},$$
where, as usual, $\lfloor t \rfloor$ is the floor of $t$, and $\{t\}$ is the fractional part of $t$ (not the set containing $t$ as its one element). In fact, one can show that the string kernel cannot be expressed as a signature kernel evaluated at a suitable continuous path. This difference is due to the fact that our sequentialization / discretization is on the level of the tensor algebra and not on the level of paths, see Remark 5.2.
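The identification at $\lambda = 1$ can be confirmed numerically. The following sketch (our own illustration; numpy assumed; `K_plus` is the brute-force function from Section 5, repeated so the snippet runs standalone) compares a brute-force string kernel with $K^+$ applied to the cumulated unit-vector sequences; the constant $1$ in $K^+$ accounts for the empty sub-sequence pair:

    import itertools
    import numpy as np

    def string_kernel(s, t):   # lambda = 1 case of Definition 6.1, nonempty matches
        total = 0.0
        for m in range(1, min(len(s), len(t)) + 1):
            for i in itertools.combinations(range(len(s)), m):
                for j in itertools.combinations(range(len(t)), m):
                    total += all(s[a] == t[b] for a, b in zip(i, j))
        return total

    def K_plus(x, y):           # brute-force discretized signature kernel, (5.1)
        dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)
        total = 1.0
        for m in range(1, min(len(dx), len(dy)) + 1):
            for i in itertools.combinations(range(len(dx)), m):
                for j in itertools.combinations(range(len(dy)), m):
                    total += np.prod([dx[a] @ dy[b] for a, b in zip(i, j)])
        return total

    def embed(s, alphabet="AB"):   # string -> (0, e_a1, e_a1 + e_a2, ...)
        E = np.eye(len(alphabet))
        steps = np.array([E[alphabet.index(c)] for c in s])
        return np.vstack([np.zeros(len(alphabet)), np.cumsum(steps, axis=0)])

    s, t = "ABAB", "BAB"
    assert np.isclose(1.0 + string_kernel(s, t), K_plus(embed(s), embed(t)))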
While $\kappa$ cannot be pulled into the product, the string kernel nevertheless admits a sum-formula presentation similar to Proposition 5.5 (iii), namely
$$K_\Sigma(\sigma,\tau) = \sum_{\substack{i_1 \geq 1 \\ j_1 \geq 1}} \lambda^{2 - i_1 - j_1}\, k(\sigma[i_1], \tau[j_1]) \sum_{\substack{i_2 > i_1 \\ j_2 > j_1}} k(\sigma[i_2], \tau[j_2]) \cdots \sum_{\substack{i_L > i_{L-1} \\ j_L > j_{L-1}}} \lambda^{i_L + j_L}\, k(\sigma[i_L], \tau[j_L]),$$
where $L = \min(\ell(\sigma), \ell(\tau))$. Note that each sum runs over a double index, and the $\lambda$-s occur only next to the first and the last sum. The usual convention that a sum running over an empty index set evaluates to zero applies. This sum formula is well known and a main tool for efficiently evaluating string kernels; see, for example, the section "efficient computation of SSK", page 425 of [19]. Gappy and other string kernel variants, such as those in [17], may be obtained from the truncated sequential kernel and other suitable choices of $\kappa$.

Remark 6.4. The interpretation as a sequential kernel directly highlights an alternative interpretation, and a natural generalization, of the string kernel: suppose each character in the string was not an exact character but a weighted sum $\alpha A + \beta B$, where, for example, the character $1A + 0B$ is the same as $A$, $0A + 1B$ is the same as $B$, and $\frac{1}{2}A + \frac{1}{2}B$ is half-$A$-half-$B$. Such a scenario could make practical sense if the exact meaning of a character is not entirely known, for example when the string was obtained through character recognition, where, say, the letter l (lower-case L) and the number 1 are often confounded. This situation can be coped with by mapping, say, $\frac{1}{2}A + \frac{1}{2}B$ to the vector $\frac{1}{2}e_A + \frac{1}{2}e_B$, with the sequential kernel left unchanged.

6.2. The global alignment kernel

The global alignment kernel is one of the most used kernels for sequences. We recapitulate its definition in modern terminology (Section 2.2 of [5]).

Definition 6.5. A 2-dimensional integer sequence $s \in (\mathbb{N}^2)^+$ is called an alignment if $\nabla s[i] \in \{(0,1), (1,0), (1,1)\}$ for all $i \in [\ell(s) - 1]$. The set of such alignments will be denoted by $\mathcal{A}$; that is, we write $s \in \mathcal{A}$ if $s$ is an alignment.

Definition 6.6. Fix an arbitrary set $X$ and a primary kernel $k: X \times X \to \mathbb{R}$. The global alignment kernel is defined as
$$K_{GA}: X^+ \times X^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubseteq [\ell(\sigma)],\; j \sqsubseteq [\ell(\tau)] \\ \ell(i) = \ell(j) \leq \ell(\sigma) + \ell(\tau) \\ (i,j) \in \mathcal{A}}} \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r]).$$

In its native form, the global alignment kernel cannot be written as a sequential kernel. A simple proof of this is that the sequential kernels are all positive definite, while the global alignment kernel need not be (see [5]). While one can write
$$K_{GA}: X^+ \times X^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubseteq [\ell(\sigma)],\; j \sqsubseteq [\ell(\tau)] \\ \ell(i) = \ell(j)}} \kappa(i,j) \cdot \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r])$$
with $\kappa(i,j) = \mathbb{1}((i,j) \in \mathcal{A})$, this is not a sequential kernel, since $\kappa$ is not positive definite. However, a simple modification turns the global alignment kernel into a sequential (and thus positive definite) kernel:

Definition 6.7. A 1-dimensional integer sequence $\sigma \in (\mathbb{N}^1)^+$ is called a half-alignment if $\nabla\sigma[i] \in \{0,1\}$ for all $i \in [\ell(\sigma) - 1]$. The set of such half-alignments will be denoted by $\frac{1}{2}\mathcal{A}$; that is, we write $\sigma \in \frac{1}{2}\mathcal{A}$ if $\sigma$ is a half-alignment.
With this, we can formulate a slightly modified global alignment kernel:
$$K_{G\frac{1}{2}\mathcal{A}}: X^+ \times X^+ \to \mathbb{R},\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubseteq [\ell(\sigma)],\; j \sqsubseteq [\ell(\tau)] \\ \ell(i) = \ell(j) \leq \ell(\sigma) + \ell(\tau) \\ i,\, j \in \frac{1}{2}\mathcal{A}}} \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r]).$$
With a simple re-formulation, one obtains
$$K_{G\frac{1}{2}\mathcal{A}}(\sigma,\tau) = \sum_{\substack{i \sqsubseteq [\ell(\sigma)],\; j \sqsubseteq [\ell(\tau)] \\ \ell(i) = \ell(j) \leq \ell(\sigma) + \ell(\tau)}} \kappa'(i,j) \cdot \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r]),$$
where $\kappa'(i,j) = \mathbb{1}(i \in \frac{1}{2}\mathcal{A}) \cdot \mathbb{1}(j \in \frac{1}{2}\mathcal{A})$. Note that $\kappa'$ is positive definite, since it is explicitly given as a scalar product of features in $\mathbb{R}$; thus $K_{G\frac{1}{2}\mathcal{A}}$ is positive definite as a sequential kernel.

The terms missing in $K_{GA}$, when compared to $K_{G\frac{1}{2}\mathcal{A}}$, are exactly those arising from sequence pairs $(i,j) \in (\mathbb{N}^2)^+$ in which there is an increment $(0,0)$. In view of the discussion in Section 3.2 of [5], these missing terms are exactly the locus of non-transitivity in the sense of [34]. One can now follow the authors of [5] and heuristically continue studying sufficient conditions under which the original global alignment kernel is positive definite, keeping the missing terms out. However, we would argue, especially in view of the violated transitivity condition, that it may be more natural to add the missing terms back, unless there is a clear empirical reason in favour of leaving them out. In particular, we would further argue that there is no first-principles reason to leave the terms out, since the definition of an alignment has been heuristic and somewhat arbitrary to start with.

6.3. The relation-convolution kernel

Finally, we would like to point out that the sequential kernels are closely related to the relation-convolution kernels in the sense of Haussler [14], Section 2.2. We cite the definition in a slightly less general form than originally given:

Definition 6.8. Fix arbitrary sets $Y, Z$. Further, fix a relation $R \subseteq Y \times Z$ and a kernel $k_Z: Z \times Z \to \mathbb{R}$. A relation-convolution kernel is a kernel of the form
$$K_{RC}: Y \times Y \to \mathbb{R},\; (x,y) \mapsto \sum_{\substack{r \in R^{-1}(x) \\ s \in R^{-1}(y)}} k_Z(r,s),$$
where we have written $R^{-1}(x) := \{z : (x,z) \in R\}$.

In the presented form, the sequential kernel
$$k^+: X^+ \times X^+ \to \mathbb{R};\; (\sigma,\tau) \mapsto \sum_{\substack{i \sqsubset [\ell(\sigma)],\; j \sqsubset [\ell(\tau)] \\ \ell(i) = \ell(j)}} \kappa(i,j) \cdot \prod_{r=1}^{\ell(i)} k(\sigma[i_r], \tau[j_r])$$
is a special case of Haussler's relation-convolution kernel, with the relation being the ordered sub-sequence relation. Note, though, the main discrepancy: the native sequential kernel is evaluated at first differences, so a relation-convolution kernel is only obtained after the summation step described in Remark 5.10 (i). It is also interesting to note that the sub-sequence relation is the same from which the ANOVA kernel is obtained, see Section 2.4 of [14]. However, for the ANOVA kernel, the primary kernel is restricted to be zero for unequal sequence elements.

7. Higher order sequentialization and noisy observations

We have seen in Section 5 that the sequentialization $k^+$ of a kernel $k$ converges to the inner product of signature features, that is,
$$k^+(\sigma(s), \tau(t)) \to k^\oplus(\sigma,\tau) = \langle S(\phi(\sigma)), S(\phi(\tau)) \rangle_{T(H)} \quad \text{as } \operatorname{mesh}(s) \vee \operatorname{mesh}(t) \to 0. \qquad (7.1)$$
This requires (see Assumption 4.6) that $\phi(\sigma), \phi(\tau) \in BV(H)$, so that the signature features $S(\phi(\sigma))$, $S(\phi(\tau))$ are well defined via Riemann–Stieltjes integration.
However, a common situation is that observations are perturbed by noise, in which case the bounded variation assumption is typically not fulfilled. An insight of rough path theory and stochastic analysis is that, despite the breakdown of classic integration, it is possible for large classes of paths to define a map that replaces the signature $\sigma \mapsto S(\phi(\sigma))$ or, in our learning context, becomes a feature map for $\sigma$. The price is that we need to replace $S$ by a higher order approximation $S^{(D)}$, respectively $\langle S \circ \phi, S \circ \phi \rangle$ by $\langle S^{(D)} \circ \phi, S^{(D)} \circ \phi \rangle$; here $D \geq 1$ denotes the order of approximation and has to be chosen higher the more irregular the underlying paths are. A thorough discussion for general noise (truly "rough" paths) requires some background in stochastic analysis and rough path theory, is beyond the scope of this article, and we refer to [8, 20]. The point we want to make in this section is that, with a few modifications, the methodology of the previous chapter extends to sequences that are sampled from unbounded variation paths.

7.1. Brownian noise: order 2 approximations

Below we demonstrate the needed adaptations for multivariate white noise / Brownian motion; moreover, we focus on the sequentialization of the trivial kernel $k = \langle \cdot, \cdot \rangle_{\mathbb{R}^d}$, that is, the primary feature map $\phi(x) = x$. In this case, we deal with (semi)martingale paths, and an approximation of degree $D = 2$ is needed.

Proposition 7.1. Let $\mu$ be the Wiener measure on $C([0,1], \mathbb{R}^d)$, that is, $x \sim \mu$ is a $d$-dimensional Brownian motion. Let $t_n \in \Delta$ be dyadic, $t_n[i] := i\,2^{-n}$, and define $(x_n) \subset BV([0,1], \mathbb{R}^d)$ as the piecewise linear interpolation on $t_n$,
$$x_n(t) := x(t_n)[i] + \frac{t - t_n[i]}{t_n[i+1] - t_n[i]}\, \nabla x(t_n)[i] \quad \text{for } t \in [t_n[i], t_n[i+1]]. \qquad (7.2)$$
For any multi-index $(i_1, i_2) \in \{1, \ldots, d\}^2$, it holds for $\mu$-a.e. $x$ that:

1. $V(x) = \infty$;

2. $\lim_{n\to\infty} S_{(i_1,i_2)}(x_n)$ exists in $T(H)$ and equals the stochastic Itô–Stratonovich integral
$$\int_{\Delta^2([0,1])} \circ\,\mathrm{d}x^{i_1} \circ \mathrm{d}x^{i_2}; \qquad (7.3)$$

3. $S(x(t_n))_{(i_1,i_2)}$ does not converge to (7.3) as $n \to \infty$ if $i_1 = i_2$. However, if we denote
$$S^{(2)}: H^+ \to T(H), \quad S^{(2)}(x) = \prod_{i=1}^{\ell(\nabla x)} \Big( 1 + \nabla x[i] + \frac{(\nabla x[i])^{\otimes 2}}{2!} \Big), \qquad (7.4)$$
then the $(i_1,i_2)$ coordinate of $S^{(2)}(x(t_n))$ converges to (7.3) as $n \to \infty$, for $\mu$-a.e. $x$.

Proof. Point (1) can be found in any textbook on stochastic analysis, e.g. [28]; similarly, point (2) is classic, e.g. [8, Chapter 13]. Point (3) and the convergence of $S^{(2)}(x(t_n))$ follow by a simple calculation: projected to tensor degree $\leq 2$,
$$S^{(2)}(x(t_n)) = \prod_{i=1}^{\ell(t_n)-1} \Big( 1 + \nabla x(t_n)[i] + \frac{(\nabla x(t_n)[i])^{\otimes 2}}{2} \Big) = 1 + \sum_{i \in [\ell(t_n)-1]} \nabla x(t_n)[i] + \sum_{\substack{i,j \in [\ell(t_n)-1] \\ i < j}} \nabla x(t_n)[i] \otimes \nabla x(t_n)[j] + \frac{1}{2} \sum_{i \in [\ell(t_n)-1]} (\nabla x(t_n)[i])^{\otimes 2}.$$
Now the first sum equals $x(t_n[\ell(t_n)]) = \int_{\Delta^1(0,1)} \mathrm{d}x$; the second equals
$$\sum_{j \in [\ell(t_n)-1]} \sum_{i \in [j-1]} \nabla x(t_n)[i] \otimes \nabla x(t_n)[j] = \sum_{j \in [\ell(t_n)-1]} x(t_n[j]) \otimes \nabla x(t_n)[j], \qquad (7.5)$$
and a classic convergence result in stochastic calculus [28] shows that
$$\sum_{j \in [\ell(t_n)-1]} x(t_n[j]) \otimes \nabla x(t_n)[j] + \frac{1}{2} \sum_{i \in [\ell(t_n)-1]} (\nabla x(t_n)[i])^{\otimes 2} \qquad (7.6)$$
converges to the Stratonovich integral $\int \circ\,\mathrm{d}x \otimes \circ\,\mathrm{d}x$. Point (3) then follows by recalling that Brownian motion has non-vanishing quadratic variation, that is, $\sum_{i \in [\ell(t_n)-1]} (\nabla x(t_n)[i])^{\otimes 2}$ does not converge to $0$ as $n \to \infty$.
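The dichotomy of Proposition 7.1 is easy to observe in simulation (our own illustration; numpy assumed; $d = 1$, so the diagonal coordinate $(i_1,i_2) = (1,1)$ is considered). The order-1 level-2 sum approaches the Itô value $(x(1)^2 - 1)/2$, while the order-2 correction recovers the Stratonovich value $x(1)^2/2$ of (7.3); in fact, for the diagonal coordinate, the corrected sum equals $x(1)^2/2$ exactly, by the binomial identity:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 18                                             # dyadic level: 2^n increments
    dx = rng.standard_normal(2 ** n) * 2 ** (-n / 2)   # Brownian increments on [0, 1]
    x1 = dx.sum()                                      # x(1)

    x_left = np.concatenate([[0.0], np.cumsum(dx)[:-1]])   # x at left endpoints
    order1 = (x_left * dx).sum()              # sum_{i<j} dx_i dx_j: level 2 of S(x(t_n))
    order2 = order1 + 0.5 * (dx ** 2).sum()   # plus half the quadratic variation, cf. (7.6)

    print(order1, (x1 ** 2 - 1) / 2)   # approximately equal: Ito value, QV of BM is 1
    print(order2, x1 ** 2 / 2)         # exactly equal: Stratonovich value of (7.3)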
Corollary 7.2. Let $\mu$ be the Wiener measure on $C([0,1], \mathbb{R}^d)$. Denote by $S^{(2),2}$ the projection of $S^{(2)}$ to $\bigoplus_{m=0}^{2} (\mathbb{R}^d)^{\otimes m}$. If $x, y \sim \mu$ independently, then with probability one,
$$\langle S^{(2),2}(x(s_m)), S^{(2),2}(y(t_n)) \rangle_{T(H)} \to \langle X, Y \rangle_{T(H)} \quad \text{as } m, n \to \infty, \qquad (7.7)$$
where $X := 1 + \int_{\Delta^1} \circ\,\mathrm{d}x + \int_{\Delta^2} (\circ\,\mathrm{d}x)^{\otimes 2}$ and $Y := 1 + \int_{\Delta^1} \circ\,\mathrm{d}y + \int_{\Delta^2} (\circ\,\mathrm{d}y)^{\otimes 2}$ are given by Itô–Stratonovich integrals, and $s_m, t_n$ are dyadic partitions, $s_m[i] = i\,2^{-m}$, $t_n[i] = i\,2^{-n}$.

Remark 7.3. Essentially the same result holds for partitions other than the dyadic ones, provided their mesh vanishes quickly enough, and for continuous semimartingales, that is, bounded variation paths with additive noise of the same path regularity as Brownian motion (see [8]). Similarly, one can show convergence for higher iterated integrals, though the calculation gets a bit cumbersome, so we do not address it here. Useful tools to study such convergence questions are the notions of multiplicative functionals and the extension theorem; we refer to [22].

7.2. Higher order approximations

In the Brownian (and semimartingale) case discussed above, the Riemann–Stieltjes integral was replaced by a stochastic Itô–Stratonovich integral, thus providing a map from (semimartingale) paths to $T(H)$. A general strategy to construct a function that maps a path $x$ to $T(H)$ and behaves like iterated integrals is to find a sequence of bounded variation paths $(x_n) \subset BV([0,1], H)$ such that the $T(H)$-valued paths $t \mapsto \sum_m \int_{\Delta^m([0,t])} \mathrm{d}x_n^{\otimes m}$ converge, in an appropriate sense, to a $T(H)$-valued path $t \mapsto X(t)$. This is the notion of a geometric rough path, and it allows to study classes of much "rougher" paths. Loosely speaking: if we deal with a geometric $p$-rough path, then we have to consider an approximation $S^{(D)}$ of order at least $D = \lfloor p \rfloor$; thus, the rougher the path, the higher the order of the signature approximation.

Example 7.4. As pointed out above, the details of how to map a trajectory to an element in $T(H)$ vary and exploit probabilistic structure; we refer to [8, 20] for details. Here we just mention that such constructions are well known for

• Brownian motion (leading to $p$-rough paths for any $p > 2$),
• more generally, continuous semimartingales (leading to $p$-rough paths for any $p > 2$),
• fractional Brownian motion with Hurst parameter $H > \frac{1}{4}$,
• more generally, Gaussian processes (leading to $p$-rough paths where $p$ depends on the regularity of the covariance),
• Markov processes in continuous time (leading to $p$-rough paths with $p$ depending on the generator of the Markov process).

Again, a rigorous treatment requires knowledge of geometric $p$-rough paths and is beyond the aim of this paper. However, we still give the general definition of an order $D$ approximation and the associated signature and sequentialized kernel that are needed to treat the paths in Example 7.4.

Definition 7.5. Let $D \in \mathbb{N}$. Define
$$S^{(D)}: H^+ \to T(H), \quad S^{(D)}(x) = \prod_{i=1}^{\ell(\nabla x)} \sum_{m=0}^{D} \frac{(\nabla x[i])^{\otimes m}}{m!},$$
and denote by $S^{(D),M}$ the projection of $S^{(D)}$ to $\bigoplus_{m=0}^{M} H^{\otimes m}$. Let $k: X \times X \to \mathbb{R}$.
Define the sequentialization of $k$ of order $D$, truncated at $M$, as
$$k^+_{(D),M}: X^+ \times X^+ \to \mathbb{R}, \quad k^+_{(D),M}(\sigma,\tau) = \langle S^{(D),M}(\phi(\sigma)), S^{(D),M}(\phi(\tau)) \rangle_{T(H)}.$$

Remark 7.6. The sequentialization $k^+_M$ of a kernel $k$ from Section 5.1 arises as a special case of the above general definition: $k^+_M = k^+_{(D),M}$ for $D = 1$, $M \in \mathbb{N}$.

In analogy to the order $D = 1$ approximations discussed in Section 5, the central mathematical identity is now
$$\prod_{i=1}^{\ell(\nabla x)} \sum_{m=0}^{D} \frac{(\nabla x[i])^{\otimes m}}{m!} = \sum_{i \sqsubseteq_D [\ell(\nabla x)]} \frac{1}{i!} \bigotimes_{r=1}^{\ell(i)} \nabla x[i_r] \;\approx\; \sum_{m=0}^{\infty} \int_{\Delta^m([0,1])} \mathrm{d}x^{\otimes m},$$
where $x = x(t)$ is a suitable discretization of the (unbounded variation!) path $x \in C([0,1], H)$ with $t \in \Delta([0,1])$. Note the appearance of the $i!$ term in the first identity, together with the restriction $i \sqsubseteq_D [\ell(\nabla x)]$, which makes the recursion formula for the inner product more complex. However, a variation of the recursive Horner-type formula still holds for $k^+_{(D),M}$, and we give an efficient algorithm for $k^+_{(D),M}$ for general $D, M \in \mathbb{N}$ in Section 8 below. (Without loss of generality, it is sufficient to consider $D \leq M$: for $D > M$, the only difference occurs at the level of more than $M$ times iterated integrals, which are cut off at degree $M$.) Since it is a multi-way recursion, better formulated in terms of a dynamic programming algorithm, we defer its formulation to the following Section 8. It relies on the formula below for the discretized higher order signature kernel:

Proposition 7.7. For $x, y \in H^+$ and $D, M \in \mathbb{N}$:

(a) $S^{(D)}(x) = \sum_{i \sqsubseteq_D [\ell(\nabla x)]} \frac{1}{i!} \bigotimes_{r=1}^{\ell(i)} \nabla x[i_r]$,

(b) $\langle S^{(D),M}(x), S^{(D),M}(y) \rangle_{T(H)} = \sum_{\substack{i,j \in \Delta(\mathbb{N}) \\ \ell(i) = \ell(j) \leq M \\ \#i,\, \#j \leq D}} \frac{1}{i!\,j!} \prod_{r=1}^{\ell(i)} \langle \nabla x[i_r], \nabla y[j_r] \rangle_H$,

where $\#i$ denotes the maximal number of repetitions of an index in $i$.

Proof. Writing out the definition of $S^{(D)}(x)$ yields
$$S^{(D)}(x) = \prod_{r=1}^{\ell(\nabla x)} \sum_{d=0}^{D} \frac{1}{d!} (\nabla x[r])^{\otimes d}.$$
Application of the (non-commutative) associative law yields
$$\prod_{r=1}^{\ell(\nabla x)} \sum_{d=0}^{D} \frac{1}{d!} (\nabla x[r])^{\otimes d} = \sum_{i \sqsubseteq_D [\ell(\nabla x)]} \frac{1}{i!} \bigotimes_{r=1}^{\ell(i)} \nabla x[i_r].$$
Taking the analogous expression for $S^{(D)}(y)$ and the scalar product, while noting
$$\Big\langle \bigotimes_{r=1}^{\ell(i)} \nabla x[i_r],\; \bigotimes_{r'=1}^{\ell(j)} \nabla y[j_{r'}] \Big\rangle_{T(H)} = \delta_{\ell(i),\ell(j)} \cdot \prod_{r=1}^{\ell(i)} \langle \nabla x[i_r], \nabla y[j_r] \rangle_H,$$
and truncating at tensor degree $M$, yields the claim.
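For very short sequences, Proposition 7.7 (b) can be evaluated by direct enumeration. The following sketch (our own illustration; numpy assumed; here $i!$ denotes the product of the factorials of the repetition counts, matching the expansion in Proposition 7.7 (a)) is exponential in cost and is meant only as a reference for the dynamic programming Algorithm 4 below:

    import itertools, math
    import numpy as np

    def higher_order_kernel_bruteforce(x, y, D, M):
        """Brute force of Proposition 7.7 (b): <S^(D),M(x), S^(D),M(y)>.

        x, y: arrays of shape (length, dim). Monotone index sequences with
        at most D repetitions per index are enumerated directly."""
        dx, dy = np.diff(x, axis=0), np.diff(y, axis=0)

        def chains(n, m):
            # non-decreasing index sequences of length m over range(n),
            # each index repeated at most D times; also yields the weight i!
            for c in itertools.combinations_with_replacement(range(n), m):
                runs = [len(list(g)) for _, g in itertools.groupby(c)]
                if max(runs) <= D:
                    yield c, math.prod(math.factorial(r) for r in runs)

        total = 1.0   # length-0 term
        for m in range(1, M + 1):
            for i, ifac in chains(len(dx), m):
                for j, jfac in chains(len(dy), m):
                    total += np.prod([dx[a] @ dy[b] for a, b in zip(i, j)]) / (ifac * jfac)
        return total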
8. Efficient computation of sequentialized kernels

Naive evaluation of the signature kernels $K^+_M$ and sequentialized kernels $k^+_M$ (or, more generally, $k^+_{(D),M}$) incurs a cost exponential in the length of the input or in the degree. Inspired by dynamic programming, we present in this section a number of algorithms for signature and sequential kernels whose time and memory requirements are polynomial in the length of the sequence.

Table 1 presents an overview of the different algorithms: given $N$ sequences $\sigma_1, \ldots, \sigma_N \in X^+$, each of length at most $L$, that is, $\ell(\sigma_i) \leq L$ for $i = 1, \ldots, N$, Table 1 shows the computational complexity and storage requirements for calculating the Gram matrix $\big(k^+_M(\sigma_i, \sigma_j)\big)_{i,j=1}^N$ of the sequentialization $k^+_M$ of a kernel $k$ on $X$. The simplest variant involving dynamic programming, Algorithm 3, allows to evaluate the sequential kernel in a number of elementary computations that is linear in the length of either sequence (thus quadratic for two sequences of equal length). Further combining this strategy with low-rank ideas allows to compute a full sequential kernel matrix at a cost that is linear in both the maximum length of the sequences and the number of sequences, culminating in Algorithm 6.

The higher order sequential kernel $k^+_{(D),M}$ will not be discussed to the same degree of detail. Algorithm 4 demonstrates how the simple dynamic programming Algorithm 3 changes when the approximation is carried out to a higher order $D$. All dynamic programming algorithms listed in Table 1 may be modified in the same way, incurring an additional factor $D$ in computation and storage requirements.

method                | algorithm       | complexity                     | storage
naive evaluation      | Proposition 5.5 | O(N^2 · L^(2M))                | O(1)
dynamic programming   | Algorithm 3     | O(N^2 · L^2 · M)               | O(L^2)
DP & LR, per-element  | Algorithm 5     | O(N^2 · L · ρ · M)             | O(L · ρ)
DP & LR, per-sequence | Algorithm 3     | O((N + ρ) · L^2 · ρ^2 · M)     | O(N · L^2)
DP & LR, simultaneous | Algorithm 6     | O(N · L · ρ · M)               | O(N · L · ρ)

Table 1: Overview of the presented algorithms to compute the sequential kernel $k^+_M$, with their computational and storage costs when evaluating an $(N \times N)$ kernel matrix between sequences of length at most $L$. Methods: DP = dynamic programming, LR = low-rank. In the low-rank methods, $\rho$ is a meta-parameter which also controls the accuracy of approximation and prediction, not necessarily in the same way for the different algorithms.

Remark 8.1. Let $\sigma, \tau \in BV(\mathbb{R}^d)$ and $s, t \in \Delta$ with $\ell(s), \ell(t) \leq L$. Denote by $\sigma_s, \tau_t \in BV(\mathbb{R}^d)$ the bounded variation paths that are the piecewise linear interpolations of the points $\sigma(s), \tau(t)$. Note that $S_{\leq M}(\sigma_s)$ and $S_{\leq M}(\tau_t)$ are composed of $O(d^M)$ real numbers, which makes a naive evaluation of $\langle S_{\leq M}(\sigma_s), S_{\leq M}(\tau_t) \rangle_{T(H)}$ infeasible for moderately high $d$ or $M$. On the other hand, Table 1, applied with $X = \mathbb{R}^d$, $k = \langle \cdot, \cdot \rangle_{\mathbb{R}^d}$ and $D = M$, provides efficient methods for calculating this inner product, since $k^+_{(D),M}(\sigma(s), \tau(t)) = \langle S_{\leq M}(\sigma_s), S_{\leq M}(\tau_t) \rangle_{T(H)}$.

8.1. Dynamic programming for tensors

Before giving algorithms for computing the sequential kernels, we introduce notation and fast algorithms for dynamic programming subroutines, which will allow us to state the latter algorithms concisely and which are at the basis of the fast computations.

Notation 8.2. We denote the $(i_1, \ldots, i_k)$-th element of a $k$-fold array (= degree $k$ tensor) $A$ by $A[i_1, \ldots, i_k]$. Occasionally, for ease of reading, we use "|" instead of "," as a separator, for example $A[i_1, i_2 \,|\, i_3, \ldots, i_k]$, if the indices on the left side of "|" are semantically distinct from those on the right side — in the example, to separate the group of indices $i_1, i_2$ from the group $i_3, \ldots, i_k$. The arrays in the remainder will all contain elements of $\mathbb{R}$, and the indices will always be positive integers, excluding zero.

Notation 8.3. For a function $f: \mathbb{R} \to \mathbb{R}$ and an array $A$, we denote by $f(A)$ the array where $f$ is applied element-wise, i.e., $f(A)[i_1, \ldots, i_k] = f(A[i_1, \ldots, i_k])$. Similarly, for $f: \mathbb{R}^m \to \mathbb{R}$ and arrays $A_1, \ldots, A_m$, we denote $f(A_1, \ldots, A_m)[i_1, \ldots, i_k] = f(A_1[i_1, \ldots, i_k], \ldots, A_m[i_1, \ldots, i_k])$. For example, $\frac{1}{2} \cdot A^2$ is the array $A$ with all elements squared, then divided by two, and the array $A + B$ contains, element-wise, the sums of elements of $A$ and $B$.

Notation 8.4.
We introduce notation for sub-setting, shifting, and cumulative sums. Let $A$ be a $k$-fold array of size $(n_1 \times \cdots \times n_k)$.

(i) For an index $i_j$ (at the $j$-th position), we write $A[:, \ldots, :, i_j, :, \ldots, :]$ for the $(k-1)$-fold array of size $(n_1 \times \cdots \times n_{j-1} \times n_{j+1} \times \cdots \times n_k)$ such that $A[:, \ldots, :, i_j, :, \ldots, :][i_1, \ldots, i_{j-1}, i_{j+1}, \ldots, i_k] = A[i_1, \ldots, i_k]$. We define, in analogy and iteratively, $A[:, \ldots, :, i_j, :, \ldots, :, i_{j'}, :, \ldots, :]$, and so on. Arrays of this type are called slices (of $A$).

(ii) For an integer $m$, we write $A[:, \ldots, :, +m, :, \ldots, :]$ for the $k$-fold array of size $(n_1 \times \cdots \times n_{j-1} \times (n_j + m) \times n_{j+1} \times \cdots \times n_k)$ such that $A[:, \ldots, :, +m, :, \ldots, :][i_1, \ldots, i_{j-1}, i_j + m, i_{j+1}, \ldots, i_k] = A[i_1, \ldots, i_k]$, where non-existing indices of $A$ are treated as zero. Arrays of this type are called shifted (versions of $A$). For negative $m$, the shifts are denoted with a "minus" sign instead of a "plus" sign.

(iii) We write $A[:, \ldots, :, \boxplus, :, \ldots, :]$, where $\boxplus$ is at the $j$-th position, for the $k$-fold array of size $(n_1 \times \cdots \times n_k)$ such that $A[:, \ldots, :, \boxplus, :, \ldots, :][i_1, \ldots, i_k] = \sum_{\kappa=1}^{i_j} A[i_1, \ldots, i_{j-1}, \kappa, i_{j+1}, \ldots, i_k]$. Arrays of this type are called slice-wise cumulative sums (of $A$).

(iv) We write $A[:, \ldots, :, \Sigma, :, \ldots, :]$, where $\Sigma$ is at the $j$-th position, for the $(k-1)$-fold array of size $(n_1 \times \cdots \times n_{j-1} \times n_{j+1} \times \cdots \times n_k)$ such that $A[:, \ldots, :, \Sigma, :, \ldots, :][i_1, \ldots, i_{k-1}] = \sum_{\kappa=1}^{n_j} A[i_1, \ldots, i_{j-1}, \kappa, i_j, \ldots, i_{k-1}]$. Arrays of this type are called slice-wise sums (of $A$).

We will further use iterations and mixtures of the above notation, noting that index-wise sub-setting, shifting, and cumulation commute with each other. Therefore expressions such as $A[+1, : \,|\, \Sigma, -3]$ or $A[j \,|\, :, +3, \boxplus]$ are well-defined, for example. We will also use the notation $A[\boxplus - m, \ldots]$ and $A[\boxplus + m, \ldots]$ to indicate the shifted variant of the cumulative sum array.

Before continuing, we note that cumulative sums can be computed efficiently, in time linear in the size of the array, as opposed to the quadratic complexity of more naive approaches. The algorithm is classical; we present it for the convenience of the reader as Algorithms 1 and 2.

Algorithm 1 Computing the cumulative sum of a vector.
Input: A 1-fold array A of size (n)
Output: The cumulative sum array A[⊞]
1: Let Q ← A.
2: for κ = 2 to n do
3:   Q[κ] ← Q[κ − 1] + A[κ]
4: end for
5: Return Q

Algorithm 2 Computing the cumulative sum of an array.
Input: A k-fold array A of size (n_1 × ··· × n_k)
Output: The cumulative sum array A[⊞, ..., ⊞, :, ..., :] (up to the m-th index)
1: Let Q ← A.
2: for κ = 1 to m do
3:   Let Q ← Q[:, ..., :, ⊞, :, ..., :] (at the κ-th index), where the right-hand side is computed by applying Algorithm 1 slice-wise.
4: end for
5: Return Q

8.2. Computing the sequential kernel

We give a fast algorithm to compute the sequential kernel $k^+_M$, using the recursive presentation from Proposition 5.5.
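In array-oriented languages, the operations of Notation 8.4 map directly onto built-in primitives. For instance, in Python with numpy (our own illustration), the slice-wise cumulative sum ⊞ is numpy's cumsum — the vectorized form of Algorithms 1 and 2 — and the shifted cumulative sum [⊞+1, ⊞+1] used below in Algorithm 3 is obtained by zero-padding:

    import numpy as np

    A = np.arange(12, dtype=float).reshape(3, 4)

    cum = np.cumsum(np.cumsum(A, axis=0), axis=1)   # A[boxplus, boxplus]

    # A[boxplus + 1, boxplus + 1] (restricted to the original index range):
    # entry (i, j) holds the sum of A over all strictly smaller index pairs
    # (i' < i, j' < j); out-of-range indices are treated as zero.
    Q = np.zeros_like(A)
    Q[1:, 1:] = cum[:-1, :-1]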
Algorithm 3 includes a cut-off degree $M$, which for exact computation can be set to $M = \min(L, L')$, but which can be set lower to reduce computational cost when an approximation is good enough. Note that, following our convention on arrays, all multiplications in Algorithm 3 are entry-wise, not matrix multiplications, even though some arrays have the format of compatible matrices. Correctness of Algorithm 3 is proved in Proposition 5.5. At the end of the algorithm, the array $A$ contains as elements $A[m \,|\, i, j]$ the contributions from paired index chains with final indices $i$ and $j$, of total length at most $m$.

Disregarding the cost of computing $K$ — which can vary depending on the exact form of $k$, but which is usually $O(n\,\ell(\sigma)\,\ell(\tau))$ time and $O(\ell(\sigma)\,\ell(\tau))$ storage, with constants that may depend on $k$ — the computational cost of Algorithm 3 is $O(M\,\ell(\sigma)\,\ell(\tau))$ elementary arithmetic operations (= the number of loop iterations) and $O(M\,\ell(\sigma)\,\ell(\tau))$ units of elementary storage. The storage requirement can be reduced to $O(\ell(\sigma)\,\ell(\tau))$ by discarding $A[m-1 \,|\, :, :]$ from memory after step 7 each time.

Algorithm 3 Evaluation of the sequential kernel k+M
Input: Ordered sequences σ, τ ∈ X+. A kernel k: X × X → R to sequentialize. A cut-off degree M.
Output: k+M(σ, τ), the sequentialization of k evaluated at (σ, τ)
1: Compute an (L × L′) array K such that K[i, j] = ∇k(σ, τ)[i, j]
2:   (or, alternatively, obtain it as additional input to the algorithm).
3: Initialize an (M × L × L′)-array A.
4: Set A[1 | :, :] ← K.
5: for m = 2 to M do
6:   Compute Q ← A[m − 1 | ⊞, ⊞].
7:   Set A[m | :, :] ← K · (1 + Q[+1, +1])
8: end for
9: Compute R ← 1 + A[M | Σ, Σ]
10: Return R

Note that in each loop over the index $m$, a matrix $Q$ is pre-computed in order to avoid the five-fold loop that would be necessary with the more naive version of line 7,
$$A[m \,|\, i, j] \leftarrow K[i,j] \cdot \Big( 1 + \sum_{i' < i} \sum_{j' < j} A[m-1 \,|\, i', j'] \Big),$$
which leads to a blown-up computational cost of $O(M\,\ell(\sigma)^2\,\ell(\tau)^2)$, at the asymptotically insignificant gain of storing one $(\ell(\sigma) \times \ell(\tau))$ matrix less (the matrix $Q$). Note that the whole code can be directly translated to the vector and matrix operations commonly available in programming languages such as R, MATLAB, or Python.

8.3. Computing the higher order sequential kernel

Algorithm 4 Evaluation of the higher order sequential kernel k+(D),M
Input: Ordered sequences σ, τ ∈ X+. A kernel k: X × X → R to sequentialize. A cut-off degree M and an approximation order D, D ≤ M.
Output: k+(D),M(σ, τ), the order-D sequentialization of k evaluated at (σ, τ)
1: Compute an (L × L′) array K such that K[i, j] = ∇k(σ, τ)[i, j]
2:   (or, alternatively, obtain it as additional input to the algorithm).
3: Initialize an (M × D × D × L × L′)-array A, all entries zero.
4: for m = 2 to M do
5:   D′ ← min(D, m − 1)
6:   A[m | 1, 1 | :, :] ← K · (1 + A[m − 1 | Σ, Σ | ⊞+1, ⊞+1])
7:   for d = 2 to D′ do
8:     A[m | d, 1 | :, :] ← A[m | d, 1 | :, :] + (1/d) · K · A[m − 1 | d − 1, Σ | :, ⊞+1]
9:     A[m | 1, d | :, :] ← A[m | 1, d | :, :] + (1/d) · K · A[m − 1 | Σ, d − 1 | ⊞+1, :]
10:    for d′ = 2 to D′ do
11:      A[m | d, d′ | :, :] ← A[m | d, d′ | :, :] + (1/(d·d′)) · K · A[m − 1 | d − 1, d′ − 1 | :, :]
12:    end for
13:  end for
14: end for
15: Compute R ← 1 + A[M | Σ, Σ | Σ, Σ]
16: Return R
All multiplications in Algorithm 4 are entry-wise, not matrix multiplications. At the end of Algorithm 4, the array $A$ contains as elements $A[m \,|\, d, d' \,|\, i, j]$ the contributions from paired index chains with initial indices $i$ and $j$, whose initial index is repeated $d$ resp. $d'$ times, of total length at most $m$. We prove the recursion used in Algorithm 4, and thus the correctness of this statement, after introducing some necessary notation:

Notation 8.5. Let $t = (t_1, \ldots, t_M) \in U^M$ be a sequence, for any set $U$ and some $M \in \mathbb{N}$. We denote by $d(t) := \max\{m : t_1 = t_2 = \cdots = t_m\}$ the number of repetitions of the initial symbol.

Proposition 8.6. Keep the notations of Algorithm 4, and let $\phi: X \to H$ be the feature map associated with the kernel $k$. Let $x, y \in H^+$ be such that $x = \phi(\sigma)$, $y = \phi(\tau)$. Denote
$$A[m \,|\, d, d' \,|\, i, j] := \sum_{\substack{x' \sqsubseteq_D \nabla x \\ y' \sqsubseteq_D \nabla y}} \frac{1}{x'!\,y'!} \langle x', y' \rangle = \sum_{\substack{i \sqsubseteq_D [\ell(\nabla x)],\; j \sqsubseteq_D [\ell(\nabla y)] \\ \ell(i) = \ell(j) \leq m}} \frac{1}{i!\,j!} \prod_{\kappa=1}^{\ell(i)} \langle (\nabla x)[i_\kappa], (\nabla y)[j_\kappa] \rangle_H,$$
where the sums are additionally restricted in the following way: $i = i_1 = i_2 = \cdots = i_d \neq i_{d+1}$ and $j = j_1 = j_2 = \cdots = j_{d'} \neq j_{d'+1}$; or, equivalently, $x[i] = x'[1] = x'[2] = \cdots = x'[d] \neq x'[d+1]$ and $y[j] = y'[1] = y'[2] = \cdots = y'[d'] \neq y'[d'+1]$. Then the following recursion equalities hold:
$$A[m \,|\, 1, 1 \,|\, i, j] = \langle \nabla x[i], \nabla y[j] \rangle_H \cdot \Big( 1 + \sum_{i' < i} \sum_{j' < j} \sum_{\kappa, \kappa' = 1}^{D} A[m-1 \,|\, \kappa, \kappa' \,|\, i', j'] \Big),$$
$$A[m \,|\, d, 1 \,|\, i, j] = \frac{1}{d} \cdot \langle \nabla x[i], \nabla y[j] \rangle_H \cdot \sum_{j' < j} \sum_{\kappa=1}^{D} A[m-1 \,|\, d-1, \kappa \,|\, i, j'] \quad \text{for } d \geq 2,$$
$$A[m \,|\, 1, d' \,|\, i, j] = \frac{1}{d'} \cdot \langle \nabla x[i], \nabla y[j] \rangle_H \cdot \sum_{i' < i} \sum_{\kappa=1}^{D} A[m-1 \,|\, \kappa, d'-1 \,|\, i', j] \quad \text{for } d' \geq 2,$$
$$A[m \,|\, d, d' \,|\, i, j] = \frac{1}{d\,d'} \cdot \langle \nabla x[i], \nabla y[j] \rangle_H \cdot A[m-1 \,|\, d-1, d'-1 \,|\, i, j] \quad \text{for } d, d' \geq 2.$$

Proof. Note that the sums on the right-hand side define $A[m \,|\, d, d' \,|\, i, j]$ as a (weighted) sum over elements parameterized by paired index sequences $i, j$ of the same length; by definition, the sum runs over all index sequences such that $\ell(i) = \ell(j) \leq m$, $d(i) = d$, and $d(j) = d'$. With this, the statement follows from comparing the summation in the loop between Lines 5 and 13 of Algorithm 4 with the explicit formula in Proposition 7.7.

The computational cost of Algorithm 4 is $O(D^2 M\,\ell(\sigma)\,\ell(\tau))$ elementary arithmetic operations (= the number of loop iterations) and $O(D^2\,\ell(\sigma)\,\ell(\tau))$ units of elementary storage (when freeing up space for array entries directly after the last time they are read).

8.4. Large scale strategies

Even though a computational cost of $O(M\,\ell(\sigma)\,\ell(\tau))$ for the sequential kernel via Algorithm 3 can be considered efficient, being polynomial time (for ease of reading, we do not discuss the higher order kernel; most considerations hold in analogy), one has to note that this is the cost of evaluating $k^+_M(\sigma,\tau)$ for a single pair of sequences $\sigma, \tau \in X^+$. Thus, computation of a symmetric kernel matrix of $N$ sequences, of length at most $L$, costs $O(N^2 \cdot L^2 \cdot M)$ elementary arithmetic operations when done by iterating over entries.
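For concreteness, a minimal numpy rendering of Algorithm 3 (our own illustration under the conventions above, not the authors' reference code); the primary kernel is passed as a vectorized function returning the Gram matrix:

    import numpy as np

    def seq_kernel_dp(sigma, tau, k, M):
        """Numpy sketch of Algorithm 3: the sequentialization k+_M of k.

        sigma, tau: (length, dim) arrays of primary features; k: vectorized
        primary kernel; M: cut-off degree (chains longer than the number of
        increments vanish, so a large M gives the exact value)."""
        G = k(sigma, tau)                                       # primary Gram matrix
        K = G[1:, 1:] + G[:-1, :-1] - G[:-1, 1:] - G[1:, :-1]   # line 1: nabla-k
        A = K.copy()                                            # line 4: A[1] = K
        for m in range(2, M + 1):                               # lines 5-8
            C = np.cumsum(np.cumsum(A, axis=0), axis=1)         # line 6: A[m-1 | ⊞, ⊞]
            Q = np.zeros_like(A)
            Q[1:, 1:] = C[:-1, :-1]                             # shift [+1, +1]
            A = K * (1.0 + Q)                                   # line 7, entry-wise
        return 1.0 + A.sum()                                    # lines 9-10

    # usage: sequentialized Euclidean (linear) kernel on two random sequences
    rng = np.random.default_rng(3)
    x, y = rng.standard_normal((20, 2)), rng.standard_normal((25, 2))
    print(seq_kernel_dp(x, y, lambda X, Y: X @ Y.T, M=6))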
While this is still achievable on contemporary desktop computers for moderate sizes of $N$ and $L$, it may quickly become prohibitive when combined with parameter tuning or cross-validation schemes (as later in our experiments). Furthermore, there exist regimes (low signal dimension $n$, large length $L$) in which an explicit computation of the features plus a subsequent inner product, with a complexity of $O(N^2 \cdot L \cdot n^M)$, may be faster, due to the linear dependence on $L$ at the cost of an exponential dependence on $M$. We present below a number of approaches by which the above-mentioned issues may be addressed. These are somewhat independent and address different parts of the total complexity in different ways, but they can in principle be combined.

8.4.1. Low-rank methods for the sequence-vs-sequence kernel matrix

The sequential kernel is a kernel on ordered data; therefore, learning algorithms which use the kernel matrix as an input, such as support vector machines, kernel ridge regression, or kernel principal components, are in principle directly amenable to large-scale variants of low-rank type. Strategies of this kind include the incomplete Cholesky decomposition, the Nyström approximation, or the inducing point formalism in a Gaussian process framework. For $N$ sequences, all strategies mentioned above require evaluation of at most an $(N \times r)$ and an $(r \times r)$ matrix (where $r$ is a meta-parameter), which costs $O((r+N) \cdot r \cdot L^2 \cdot M)$ elementary operations and $O((r+N) \cdot r \cdot L^2)$ storage, followed by a slightly modified variant of the learning algorithm itself, which usually costs (at most) $O((r+N) \cdot r^2)$ elementary operations and storage, instead of the unmodified variant, which usually costs (at most) $O(N^3)$. This alleviates the dependency of the sequential kernel evaluation on $N$, but not on $L$; this is not unexpected, since the strategy is completely independent of how the sequential kernel was evaluated. For the same reason, any improvement in the cost of single evaluations will combine with the above improvement in the number of evaluations.

8.4.2. Low-rank methods for the element-vs-element kernel matrix

A second kernel matrix is crucial to obtaining a single evaluation of type $k^+_M(\sigma,\tau)$: the cross-kernel matrix between the elements $\sigma[i], \tau[j]$ of both sequences, which is the object underlying the computations; see Line 1 of Algorithm 3 and Line 1 of Algorithm 4. Summations and multiplications are performed on this cross-kernel matrix until the final result, $k^+_M(\sigma,\tau)$, is obtained and returned.

The low-rank methods of the previous paragraph cannot be applied naively to the cross-kernel matrix. A minor issue is the fact that the cross-kernel matrix is in general non-symmetric, but the above-mentioned strategies (incomplete Cholesky, Nyström, inducing points) translate verbatim to the context of non-symmetric cross-kernel matrices, by replacing the respective symmetric decomposition with its non-symmetric analogue. The major issue consists in the summation- and multiplication-type operations which are performed on the kernel matrix. In their naive form, these operations require access to the full cross-kernel matrix, and without modification they will therefore give rise to the same computational complexity, irrespective of whether a low-rank decomposition of the initial cross-kernel matrix is considered or not.
We show how this can be circumvented by working exclusively on low-rank factorizations.

Definition 8.7. Let $A$ be an $(a \times b)$-array. For $U$ an $(a \times r)$-array and $V$ a $(b \times r)$-array, we say that $(U,V)$ is a low-rank presentation of $A$, of rank $r$, if $A[i,j] = (U[i,:] \cdot V[j,:])[\Sigma]$. In matrix notation, this is equivalent to saying that $A = UV^\top$. Note that, in matrix terms, the fact that $A$ has a low-rank presentation of rank $r$ implies that $A$ is of rank $r$ or less (by the equivalence of matrix rank with decomposition rank), but it does not imply that $A$ is of rank exactly $r$.

We state a number of straightforward but computationally useful lemmas:

Lemma 8.8. Let $A$ be an $(a \times b)$-array with low-rank presentation $(U,V)$. Then:

(i) For $m \in \mathbb{N}$, a low-rank presentation of $A[+m, :]$ is $(U[+m, :], V)$; a low-rank presentation of $A[:, +m]$ is $(U, V[+m, :])$.

(ii) A low-rank presentation of $A[\boxplus, :]$ is $(U[\boxplus, :], V)$; a low-rank presentation of $A[:, \boxplus]$ is $(U, V[\boxplus, :])$.

(iii) A low-rank presentation of $A[\Sigma, :]$ is $(U[\Sigma, :], V)$; a low-rank presentation of $A[:, \Sigma]$ is $(U, V[\Sigma, :])$.

Proof. The statements follow directly from writing out the decompositions.

Lemma 8.9. Let $A_1, A_2$ be $(a \times b)$-arrays, let $U_i$ be arrays of size $(a \times r_i)$ and $V_i$ arrays of size $(b \times r_i)$, for $i = 1, 2$ and some $r_i \in \mathbb{N}$. Let $A := A_1 + A_2$, write $U$ for the $(a \times (r_1+r_2))$-array obtained by concatenating $U_1, U_2$, and $V$ for the $(b \times (r_1+r_2))$-array obtained by concatenating $V_1, V_2$ (such that the order of indices matches). Then the following are equivalent:

(i) $(U, V)$ is a low-rank presentation of $A$, of rank $r_1 + r_2$.

(ii) $(U_1, V_1)$ is a low-rank presentation of $A_1$, of rank $r_1$, and $(U_2, V_2)$ is a low-rank presentation of $A_2$, of rank $r_2$.

Proof. The statement follows directly from writing out the decompositions of $A$ and $A_1 + A_2$ in terms of $U_1, U_2, V_1, V_2$ and observing that they are formally equal.

Notation 8.10. We introduce notation for index repetition. Let $A$ be a $k$-fold array of size $(n_1 \times \cdots \times n_k)$.

(i) For an integer $m$, we write $A[:, \ldots, :, m \cdot :, :, \ldots, :]$ for the $k$-fold array of size $(n_1 \times \cdots \times n_{j-1} \times (m \cdot n_j) \times n_{j+1} \times \cdots \times n_k)$ such that $A[:, \ldots, :, m \cdot :, :, \ldots, :][i_1, \ldots, i_{j-1}, i_j, i_{j+1}, \ldots, i_k] = A[i_1, \ldots, i_{j-1}, r, i_{j+1}, \ldots, i_k]$, where $r$ is the remainder of $i_j$ in integer division by $n_j$. This is intuitively equivalent to concatenating $m$ copies of $A$ along the $j$-th direction.

(ii) For an integer $m$, we write $A[:, \ldots, :, : \cdot m, :, \ldots, :]$ for the $k$-fold array of size $(n_1 \times \cdots \times n_{j-1} \times (m \cdot n_j) \times n_{j+1} \times \cdots \times n_k)$ such that $A[:, \ldots, :, : \cdot m, :, \ldots, :][i_1, \ldots, i_{j-1}, i_j, i_{j+1}, \ldots, i_k] = A[i_1, \ldots, i_{j-1}, q, i_{j+1}, \ldots, i_k]$, where $q$ is the quotient of $i_j$ in integer division by $m$. This is intuitively equivalent to repeating each slice of $A$ along the $j$-th direction $m$ times.

Lemma 8.11. Let $A_1, A_2$ be $(a \times b)$-arrays, let $U_i$ be arrays of size $(a \times r_i)$ and $V_i$ arrays of size $(b \times r_i)$, for $i = 1, 2$ and some $r_i \in \mathbb{N}$.
Let $A := A_1 \cdot A_2$ (i.e., by our convention, component-wise multiplication), write $U$ for the $(a \times (r_1 \cdot r_2))$-array $U_1[:, : \cdot\, r_2] \cdot U_2[:, r_1 \cdot :]$, and $V$ for the $(b \times (r_1 \cdot r_2))$-array $V_1[:, : \cdot\, r_2] \cdot V_2[:, r_1 \cdot :]$ (multiplication is, per convention, component-wise). Then the following are equivalent:

(i) $(U, V)$ is a low-rank presentation of $A$, of rank $r_1 \cdot r_2$.

(ii) $(U_1, V_1)$ is a low-rank presentation of $A_1$, of rank $r_1$, and $(U_2, V_2)$ is a low-rank presentation of $A_2$, of rank $r_2$.

Proof. The statement follows directly from writing out the decompositions of $A$ and $A_1 \cdot A_2$ in terms of $U_1, U_2, V_1, V_2$ and observing that they are formally equal.

Lemmas 8.8, 8.9 and 8.11 cover all operations performed on the kernel matrix in Algorithms 3 and 4 — namely shifting, summation and cumulative summation, and component-wise addition and multiplication. Therefore, the respective manipulations of low-rank factors may be used to replace all operations within the algorithm. For the convenience of the reader, we explicitly present Algorithm 5, which is a low-rank version of Algorithm 3.

Algorithm 5 Evaluation of the sequential kernel k+M, with low-rank speed-up
Input: Ordered sequences σ, τ ∈ X+. A kernel k: X × X → R to sequentialize. A cut-off degree M.
Output: k+M(σ, τ), the sequentialization of k evaluated at (σ, τ)
1: Let L ← ℓ(σ), L′ ← ℓ(τ).
2: Compute arrays U, V such that (U, V) is a low-rank presentation of rank υ, approximating the kernel matrix K with K[i, j] = ∇k(σ, τ)[i, j]
3:   (or, alternatively, obtain it as additional input to the algorithm).
4: Initialize an (M × L × ∗)-array B and an (M × L′ × ∗)-array C (where ∗ means that the size may change dynamically).
5: Set B[1 | :, :] ← U and C[1 | :, :] ← V.
6: for m = 2 to M do
7:   Compute P ← B[m − 1 | ⊞+1, :] and Q ← C[m − 1 | ⊞+1, :].
8:   Append an (L × 1)-array of ones to P, and an (L′ × 1)-array of ones to Q.
9:   Set ρ such that (L × ρ) is the size of P, and (L′ × ρ) is the size of Q.
10:  Set B[m | :, :] ← U[:, : · ρ] · P[:, υ · :]
11:  Set C[m | :, :] ← V[:, : · ρ] · Q[:, υ · :]
12:  optional: "simplify" the low-rank presentation (B[m | :, :], C[m | :, :]), reducing its rank
13: end for
14: Compute R ← B[M | Σ, :]
15: Compute S ← C[M | Σ, :]
16: Return 1 + (R · S)[Σ]

Algorithm 5 is obtained from Algorithm 3 as follows: Line 7 is via Lemma 8.8 (i) and (ii). Line 8 is via Lemma 8.9, observing that a low-rank presentation of the all-ones matrix is a pair of all-ones vectors. Lines 10 and 11 are via Lemma 8.11. Lines 14 and 15 are via Lemma 8.8 (iii), and Line 16 is an evaluation following the definition of a low-rank presentation. Line 12 is optional, aiming at keeping the size of the low-rank presentations low (and thus the computational cost); it can be achieved, for example, via singular value decomposition type techniques, sub-sampling techniques, or random projection type techniques.

The higher order Algorithm 4 can be treated in a similar way, by applying the low-rank representation to the matrices / 2D-arrays $A[m \,|\, d, d' \,|\, :, :]$. All assignments and manipulations can be re-phrased in terms of those matrices; therefore, the same strategy applies.
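The factor-level operations of Lemmas 8.8, 8.9 and 8.11 are easy to verify numerically. In the sketch below (our own illustration; numpy assumed), `kr` is the column-wise Khatri–Rao-type product realizing the index-repetition constructions of Notation 8.10:

    import numpy as np

    rng = np.random.default_rng(1)
    a, b, r1, r2 = 6, 5, 2, 3
    U1, V1 = rng.standard_normal((a, r1)), rng.standard_normal((b, r1))
    U2, V2 = rng.standard_normal((a, r2)), rng.standard_normal((b, r2))
    A1, A2 = U1 @ V1.T, U2 @ V2.T

    # Lemma 8.8 (ii): cumulative sum along rows acts on the left factor only.
    assert np.allclose(np.cumsum(A1, axis=0), np.cumsum(U1, axis=0) @ V1.T)

    # Lemma 8.9: the sum of two matrices corresponds to concatenated factors.
    assert np.allclose(A1 + A2, np.hstack([U1, U2]) @ np.hstack([V1, V2]).T)

    # Lemma 8.11: the entry-wise (Hadamard) product corresponds to the
    # column-wise Kronecker product of the factors, of rank r1 * r2.
    kr = lambda X, Y: np.repeat(X, Y.shape[1], axis=1) * np.tile(Y, (1, X.shape[1]))
    assert np.allclose(A1 * A2, kr(U1, U2) @ kr(V1, V2).T)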
The computational cost of one run of Algorithm 5 is of the same order as the maximum size of $B$ and $C$. That is, if $\rho$ is the smallest integer such that at any time $B$ requires $\ell(\sigma) \cdot M \cdot \rho$ space and $C$ requires $\ell(\tau) \cdot M \cdot \rho$ space, then the computational complexity of Algorithm 5 is $O((L + L') \cdot \rho \cdot M)$. Noting that one can always choose a low-rank representation of $B[m\,|\,:,:]$ and $C[m\,|\,:,:]$ of rank $\min(L, L')$ or less, the computational complexity of the low-rank algorithm is always bounded by $O(L \cdot L' \cdot M)$, which is the complexity of Algorithm 3. For the linear kernel, one can infer that the rank will be bounded by $\rho \leq n^M$, by using that $K$ admits a low-rank presentation of rank $n$ and then keeping track of matrix sizes: each of the $(M-1)$ repetitions of Lines 10 resp. 11 enlarges the size of $B$, $C$ by a multiplicative factor of $n$.

8.4.3. Simultaneous low-rank methods Algorithm 5 yields an efficient low-rank speed-up for computing a single entry of the kernel matrix. When employing this speed-up for each entry of the final kernel matrix of size $N$, the computational cost is $O(N^2 \cdot L \cdot \rho \cdot M)$. As stated before, a cost quadratic in $N$ may be prohibitive on large scale data.

This can be addressed by combining both low-rank strategies mentioned before, on a sequence-vs-sequence and an element-vs-element basis. For this, note that the computation of $R$ depends only on $U$, and the computation of $S$ depends only on $V$. If $U$ is chosen to depend only on the first sequence, and $V$ only on the second, one notes that Algorithm 5 can be split into a computation of $R$ and $S$ for each sequence separately, thus allowing a further reduction of the computational cost from $O(N^2 \cdot (L + L') \cdot \rho \cdot M)$ to $O(N \cdot (L + L') \cdot \rho \cdot M)$.

Pseudo-code is given in Algorithm 6. It is obtained from Algorithm 5 by starting with a joint low-rank decomposition of the element-vs-element kernel matrices, then separating the otherwise redundant computations of the factors $R$, $S$.

Algorithm 6 Computation of the sequential kernel matrix $k^+_M$, with (double) low-rank speed-up
Input: Ordered sequences $\sigma_1, \dots, \sigma_N \in X^+$. A kernel $k: X \times X \to \mathbb{R}$ to sequentialize. A cut-off degree $M$.
Output: A matrix $U$ such that $(U, U)$ is a low-rank presentation of the kernel matrix $K \in \mathbb{R}^{N \times N}$ where $K_{ij} = k^+_M(\sigma_i, \sigma_j)$.
1: Compute arrays $U^{(i)}$ such that each $(U^{(i)}, U^{(j)})$ forms a (joint) low-rank presentation of rank $\upsilon$, approximating the element-vs-element kernel matrices $K^{(ij)}$ with $K^{(ij)}[a,b] = k(\nabla\sigma_i[a], \nabla\sigma_j[b])$
2: (or, alternatively, obtain them as additional input to the algorithm).
3: Initialize an $(N \times M \times * \times *)$-array $B$ (where $*$ means that the sizes may change dynamically).
4: Set $B[i\,|\,1\,|\,:,:] \leftarrow U^{(i)}$ for all $i \in [N]$.
5: for $m = 2$ to $M$ do
6:   Compute $P \leftarrow B[:\,|\,m-1\,|\,\boxplus+1,:]$.
7:   Append an $(N \times \kappa \times 1)$-array of ones to $P$, where $\kappa$ is the size of $P$ in the second direction.
8:   Set $\rho$ such that $(N \times \kappa \times \rho)$ is the size of $P$.
9:   Set $B[:\,|\,m\,|\,:,:] \leftarrow B[:\,|\,1\,|\,:, : \cdot \rho] \cdot P[:,:,\upsilon \cdot :]$
10:  optional: "simplify" the low-rank presentation encoded in $B$, reducing its rank
11: end for
12: Compute $U \leftarrow B[:\,|\,M\,|\,\Sigma, :]$
13: Return $U$

In Lines 1 and 2, the algorithm starts with a joint low-rank presentation of the element-wise kernel matrices; that is, the matrices $U^{(i)}$, when row-concatenated, should have low rank. For example, if $k$ is the Euclidean scalar product, $U^{(i)}$ can be taken as the raw data matrix of increments of $\sigma_i$, where rows are different time points and columns are features.
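As an illustration of the remark on the Euclidean scalar product (a sketch under our own naming, not the paper's code): the element-vs-element kernel matrices then factor jointly through the increment matrices, at rank at most the dimension $n$ of the state space $X$.

```python
import numpy as np

def increments(sigma):
    # sigma: (L x n)-array; rows are time points, columns are features
    return np.diff(sigma, axis=0)

rng = np.random.default_rng(0)
n, L = 2, 100
sigma_i, sigma_j = rng.random((L, n)), rng.random((L, n))

# joint low-rank factors U^(i) for the Euclidean kernel: the increment matrices
Ui, Uj = increments(sigma_i), increments(sigma_j)

# element-vs-element kernel matrix K^(ij)[a, b] = <increment a of sigma_i,
# increment b of sigma_j>, presented at rank n by (Ui, Uj)
K_ij = Ui @ Uj.T
assert K_ij.shape == (L - 1, L - 1)
```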
More generally, jointly low-rank $U^{(i)}$ can be obtained by running a suitable joint diagonalization or singular value decomposition scheme on the element-vs-element kernel matrices $K^{(ij)}$. Note that such a joint low-rank decomposition may require the choice of a higher rank $\rho$ for some kernels $k$ than when only a single entry of the kernel matrix is computed as in Algorithm 5.

8.4.4. Fast sequential kernel methods Following the analogy of sequential kernels to string kernels established in Section 6, fast string kernel methods such as the gappy, substitution, or mismatch kernels presented in [17] may be transferred to general sequential kernels. In general, this amounts to a small modification of Algorithm 3; for example, to obtain a gappy variant of the sequential kernel, the summation in line 6 of Algorithm 3 over the whole matrix, of quadratic size, is replaced by summation over a linear part of it. The scaling factor λ also may or may not be added in.

It should be noted that not all fast string kernel ideas combine straightforwardly with the low-rank methods introduced above, though they can be adapted. For example, for the gappy kernel, one may consider a joint low-rank decomposition of element-vs-element cross-kernel matrices in which suitable entries have been set to zero.

9. Experimental validation

We perform two experiments to validate the practical usefulness of the sequential kernel:

(1) On a real-world dataset of hand movement classification (the eponymous UCI dataset [31]), we show that the sequential kernel outperforms the best previously reported predictive performance [31], as well as non-sequential kernel and aggregate baselines.

(2) On a real-world dataset of handwritten digit recognition (pendigits), we show that the sequentialization of the Euclidean kernel (= linear use of signature features) achieves only sub-baseline performance, similarly to previously reported results [7]. Using the sequentialized Gaussian kernel improves prediction accuracy to the baseline region.

We would like to stress that our experiments do not constitute a systematic benchmark comparison to prior work, only a validation that the sequential kernel is a practically meaningful concept: result (1) validates the first kernelization step in the sense that the order information captured by the sequential kernel can be useful when compared to alternatives which ignore it. Result (2) validates the second kernelization step in the sense that using a non-linear kernel in sequentialization may outperform the linear kernel.

A benchmark comparison is likely to require a larger amount of work, since it would have to include a number of previous methods (multiple variants of the string and general alignment kernels, dynamic time warping, naive use of signatures), for most of which there is no freely available code with an interface to a machine learning toolbox, as well as benchmark methods (order-agnostic baselines such as summary aggregation and chunking; distributional regression; naive baselines) which have not been compared to in the literature previously.

For the benefit of the scientific community, we have decided to share our more theoretical results early and to provide others with the opportunity to work with a toolbox-compatible implementation of the sequential kernels (code link will be provided here shortly), acknowledging that further experimentation is desirable. We will supply benchmark comparisons at a later point in time.
9.1. Validation and prediction set-up

9.1.1. Prediction tasks In all datasets, samples are multi-variate (time) series. All learning tasks are supervised classification tasks of predicting class labels attached to series of equal length.

9.1.2. Prediction methods For prediction, we use eps-support vector classification (as available in the python/scikit-learn package) on the kernel matrices obtained from the following kernels:

(1.a) the Euclidean kernel $k(x,y) = \langle x, y\rangle$. This kernel has no parameters.
(1.b) the Gaussian kernel $k(x,y) = \exp\left(-\frac{1}{2\gamma^2}\|x-y\|^2\right)$. This kernel has one parameter, a scaling constant $\gamma$.
(2.a) the (truncated) sequentialization $k^+_{\leq M}$ of the linear/Euclidean kernel $k(x,y) = \gamma\langle x, y\rangle$. This sequential kernel has two parameters, a scaling constant $\gamma$ and the truncation level $M$.
(2.b) the (truncated) sequentialization $k^+_{\leq M}$ of the Gaussian kernel $k(x,y) = \theta\exp\left(-\frac{1}{2\gamma^2}\|x-y\|^2\right)$. This sequential kernel has three parameters: scaling constants $\gamma$ and $\theta$, and the truncation level $M$.

(1.a) and (1.b) are considered standard kernels; (2.a) and (2.b) are sequential kernels. Note that the non-sequential kernels (1.a) and (1.b) can only be applied to sequential data samples of equal length, which is the case for the datasets considered. Note also that even though (1.a), (1.b) may be applied to sequences of the same length, they do not use any information about their ordering: both the Euclidean and the Gaussian kernel are invariant under (joint) permutation of the order of the indexing in the arguments.

We would also like to note a further subtlety: the sequential kernels (2.a), (2.b) do use information about the ordering of the sequences, but only for a truncation level $M \geq 2$. For $M = 1$, the kernel corresponds to choosing the increment/mean aggregate feature (Euclidean) or a type of distributional classification (Gaussian). We will therefore explicitly compare truncation level 1 versus levels 2 and higher, to enable us to make a statement about whether using the order information was beneficial (or not).

There will be no further baseline, naive, or state-of-the-art predictors in the set-up; the comparison will be conducted between the kernel classifiers and the performances reported in the literature. We note that this is a limitation of our set-up which we will rectify in future (more time-consuming) instances of a larger benchmarking experiment. The current experiments merely aim to validate whether the sequential kernel is a practically meaningful concept, in particular whether each of the two kernelization steps is practically useful, and whether making use of order information is beneficial.

9.1.3. Tuning and error estimation In all experiments, we use nested (double) cross-validation for parameter tuning (inner loop) and estimation of error metrics (outer loop). In both instances of cross-validation, we perform uniform 5-fold cross-validation. Unless stated otherwise, parameters are tuned on the tuning grid given in Table 2 (when applicable). Kernel parameters are the same as in the above section "Prediction methods". The best parameter is selected by 5-fold cross-validation, as the parameter yielding the best test f1-score, averaged over the five folds.

parameter            range
kernel param. γ      0.01, 0.1, 1
kernel param. θ      0.01, 0.1, 1
truncation level M   1, 2, 3
SVC regularizer      0.1, 1, 10, 100, 1000

Table 2: Tuning grid
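For concreteness, a minimal sketch of this nested cross-validation set-up in python/scikit-learn, assuming a precomputed kernel matrix K; for the sequential kernels, the kernel parameters of Table 2 would additionally be tuned by recomputing K on the grid, which we omit here. The function name `nested_cv_f1` is ours, for illustration only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

def nested_cv_f1(K, y):
    """Nested 5-fold cross-validation on a precomputed kernel matrix K."""
    svc = SVC(kernel="precomputed")
    grid = {"C": [0.1, 1, 10, 100, 1000]}  # SVC regularizer grid of Table 2
    inner = GridSearchCV(svc, grid, scoring="f1_macro", cv=5)  # tuning loop
    # outer loop: error estimation; scikit-learn slices rows and columns of K
    # consistently for estimators with kernel="precomputed"
    return cross_val_score(inner, K, y, scoring="f1_macro", cv=5)

# toy example with a linear Gram matrix on random data
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = rng.integers(0, 2, size=60)
print(nested_cv_f1(X @ X.T, y).mean())
```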
9.1.4. Error metrics The out-of-sample classification error is reported as precision, recall, and f1-score of out-of-sample prediction on the test fold. Error measures are aggregated with equal weights on classes and folds. These aggregates are reported in the result tables.

9.2. Experiment: Classifying hand movements

We performed classification with the eps-support vector machine (SVC) on the hand movements dataset from UCI [31]. The first database in the dataset, which we considered for this experiment, contains, for each of five subjects (two male, three female), 180 samples of hand movement sEMG recordings. Each sample is a time series in two variables (channels) at 3000 time points. The time series fall into six classes of distinct types of hand movement (spherical, tip, palmar, lateral, cylindrical, hook). For each subject, 30 samples of each class were recorded. Hence, for each subject, there is a total of 180 sequences in $X^{3000}$ with $X = \mathbb{R}^2$.

For each of the five subjects, we conducted the classification experiment as described in Section 9.1, comparing prediction via SVC using one of the following three kernels: (1.a) the Euclidean kernel, (1.b) the Gaussian kernel, (2.a) the sequentialized Euclidean kernel. For the non-sequential kernels (1.a), (1.b), prediction was performed with and without prior standardization of the data. For the sequential kernel, the tuning grid was considered in two parts: a cut-off level of M = 1, corresponding to mean aggregation, and cut-off levels of M = 2, 3, corresponding to the case where genuine sequence information is used.

The results are reported in Tables 3 to 7. Jackknife standard errors (pooling the five folds) are all 0.04 or smaller. Baseline performance of an uninformed estimator is 1/6 ≈ 0.17.

method                         precision  recall  f1-score
(1.a) linear                   0.37       0.38    0.36
(1.a) linear, standardized     0.33       0.32    0.29
(1.b) Gaussian                 0.57       0.59    0.56
(1.b) Gaussian, standardized   0.54       0.50    0.50
(2.a) mean aggregation         0.19       0.20    0.18
(2.a) sequential, level ≥ 2    0.87       0.86    0.86

Table 3: female1.mat

method                         precision  recall  f1-score
(1.a) linear                   0.47       0.39    0.37
(1.a) linear, standardized     0.31       0.28    0.27
(1.b) Gaussian                 0.71       0.71    0.70
(1.b) Gaussian, standardized   0.59       0.58    0.56
(2.a) mean aggregation         0.18       0.20    0.18
(2.a) sequential, level ≥ 2    0.94       0.97    0.95

Table 4: female2.mat

method                         precision  recall  f1-score
(1.a) linear                   0.48       0.46    0.46
(1.a) linear, standardized     0.47       0.42    0.43
(1.b) Gaussian                 0.66       0.64    0.63
(1.b) Gaussian, standardized   0.54       0.51    0.50
(2.a) mean aggregation         0.26       0.23    0.20
(2.a) sequential, level ≥ 2    0.96       0.96    0.96

Table 5: female3.mat

method                         precision  recall  f1-score
(1.a) linear                   0.37       0.33    0.33
(1.a) linear, standardized     0.38       0.36    0.36
(1.b) Gaussian                 0.59       0.57    0.57
(1.b) Gaussian, standardized   0.53       0.54    0.53
(2.a) mean aggregation         0.20       0.18    0.17
(2.a) sequential, level ≥ 2    0.96       0.96    0.96

Table 6: male1.mat
method                         precision  recall  f1-score
(1.a) linear                   0.36       0.33    0.32
(1.a) linear, standardized     0.37       0.29    0.27
(1.b) Gaussian                 0.72       0.71    0.70
(1.b) Gaussian, standardized   0.34       0.39    0.35
(2.a) mean aggregation         0.22       0.23    0.20
(2.a) sequential, level ≥ 2    0.93       0.93    0.93

Table 7: male2.mat

One can observe that for all five subjects, SVC with a sequential kernel of level 2 or higher outperforms SVC using any of the other kernels, which do not use any sequence information. The sequential kernel appears to outperform the methods reported in the original paper [31] as well (Figures 11 and 12 there), though this comparison cannot be entirely clarified due to three issues: (i) the authors provide no code; (ii) it is not described how the "subject index" in Figures 11 and 12 relates to the subject file names; (iii) Figures 11 and 12, reporting the results and supposedly pertaining to two different classification methods, are exactly identical; thus, likely, one of the two is an erroneous copy of the other.

9.3. Experiment: Pendigits

We performed classification on the pendigits dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits). It contains 10992 samples of digits between 0 and 9, written by 44 different writers with a digital pen on a tablet. One sample consists of a pair of horizontal and vertical coordinates sampled at 8 different time points; hence we deal with a sequence in $X^8$ with $X = \mathbb{R}^2$. The dataset comes with a pre-specified training fold of 7494 samples and a test fold of 3498 samples. Estimation of the prediction error is performed in this validation split, while tuning is done, as described, via nested 5-fold cross-validation inside the pre-specified training fold.

We compared prediction via SVC using one of the following two kernels: (2.a) the sequentialized Euclidean kernel, and (2.b) the sequentialized Gaussian kernel. For both, the truncation level was set to M = 4. The results are reported in Table 8. Jackknife standard errors (pooling the five folds) are all 0.01 or smaller. Baseline performance of an uninformed estimator is 1/10 = 0.10.

method                 precision  recall  f1-score
sequential, linear     0.91       0.90    0.89
sequential, Gaussian   0.97       0.97    0.97

Table 8: Pendigits

The quality of SVC prediction with the sequentialized linear kernel roughly corresponds to that of Diehl [7]. It is outperformed by SVC prediction with the sequentialized Gaussian kernel, which is similar to the baseline performance of k-nearest neighbors reported in the documentation of the pendigits dataset.

A. Second kernelization of the signature kernel

We need to give meaning to $k(\mathrm{d}\sigma, \mathrm{d}\tau)$ when $\sigma$ takes values in an arbitrary set $X$ and hence the differentials do not make sense. To motivate the definition below, consider first the case of two paths $\sigma, \tau$ such that $x := \varphi(\sigma)$ and $y := \varphi(\tau)$ are piecewise linear between time points $s = (s[1], \dots, s[m])$ resp. $t = (t[1], \dots, t[n])$. In this case,

$$\int_{(s,t)\in(s[1],s[m])\times(t[1],t[n])} \langle \mathrm{d}x(s), \mathrm{d}y(t)\rangle_H = \sum_{\substack{i\in[m-1]\\ j\in[n-1]}} \int_{(s,t)\in(s[i],s[i+1])\times(t[j],t[j+1])} \langle \mathrm{d}x(s), \mathrm{d}y(t)\rangle_H = \sum_{\substack{i\in[m-1]\\ j\in[n-1]}} \langle \nabla x(s)[i], \nabla y(t)[j]\rangle_H = \sum_{\substack{i\in[m-1]\\ j\in[n-1]}} k\begin{bmatrix}\sigma(s[i]) & \sigma(s[i+1])\\ \tau(t[j]) & \tau(t[j+1])\end{bmatrix},$$

where we use the notation

$$k\begin{bmatrix} a & b\\ c & d\end{bmatrix} := k(b,d) + k(a,c) - k(b,c) - k(a,d).$$

If $s[1] = t[1] = 0$ and $s[m] = t[n] = 1$, then Proposition 4.5 reads as

$$K^{\oplus}_{\leq M}(\varphi(\sigma), \varphi(\tau)) = 1 + \sum_{\substack{i_1\in[m-1]\\ j_1\in[n-1]}} \left(1 + \dots \sum_{\substack{i_M\in[i_{M-1}-1]\\ j_M\in[j_{M-1}-1]}} k\begin{bmatrix}\sigma(s[i_M]) & \sigma(s[i_M+1])\\ \tau(t[j_M]) & \tau(t[j_M+1])\end{bmatrix} \dots\right) k\begin{bmatrix}\sigma(s[i_1]) & \sigma(s[i_1+1])\\ \tau(t[j_1]) & \tau(t[j_1+1])\end{bmatrix}. \tag{A.1}$$
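The bracket $k[\cdot]$ is obtained from the Gram matrix by a second-order finite difference; a small numpy sketch (the name `bracket_matrix` is ours, for illustration):

```python
import numpy as np

def bracket_matrix(k, sigma, tau):
    """Matrix of brackets k[sigma(s_i) sigma(s_{i+1}); tau(t_j) tau(t_{j+1})].

    A second-order finite difference of the Gram matrix realizes
    k(b, d) + k(a, c) - k(b, c) - k(a, d) for consecutive entries.
    """
    G = np.array([[k(s, t) for t in tau] for s in sigma])
    return G[1:, 1:] + G[:-1, :-1] - G[1:, :-1] - G[:-1, 1:]

rng = np.random.default_rng(0)
sigma, tau = rng.random((10, 2)), rng.random((12, 2))

# for the linear kernel the bracket reduces to the inner product of increments
K = bracket_matrix(np.dot, sigma, tau)
assert np.allclose(K, np.diff(sigma, axis=0) @ np.diff(tau, axis=0).T)
```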
Now define a signed measure on $[0,1]^2$ via the rule

$$\kappa_{\sigma,\tau}([r,s]\times[u,v]) := k\begin{bmatrix}\sigma(r) & \sigma(s)\\ \tau(u) & \tau(v)\end{bmatrix}$$

and note that (A.1) reads as

$$1 + \int_{[0,1]\times[0,1]} \left(1 + \dots \int_{[0,s_{M-1}]\times[0,t_{M-1}]} \mathrm{d}\kappa_{\sigma,\tau}(s_M, t_M)\dots\right)\mathrm{d}\kappa_{\sigma,\tau}(s_1, t_1).$$

The content of the definition and the theorem below is that this formula makes sense for arbitrary paths $\sigma, \tau$ such that $\varphi(\sigma), \varphi(\tau) \in BV(H)$.

Definition 3. Let $k: X\times X \to \mathbb{R}$ and define the signed-Borel-measure-valued map $\kappa: \mathrm{Paths}(X)\times\mathrm{Paths}(X) \to \mathcal{M}([0,1]\times[0,1])$, $(\sigma,\tau)\mapsto \kappa_{\sigma,\tau}$ via

$$\kappa_{\sigma,\tau}([a,b]\times[c,d]) := k(\sigma(b),\tau(d)) + k(\sigma(a),\tau(c)) - k(\sigma(b),\tau(c)) - k(\sigma(a),\tau(d)).$$

Theorem 4. Under the Assumptions 4.6, it holds that

$$k^{\oplus}_{\leq M}(\sigma,\tau) = 1 + \sum_{m=1}^{M}\int_{(s,t)\in\Delta_m\times\Delta_m} \mathrm{d}\kappa_{\sigma,\tau}(s[1],t[1])\cdots \mathrm{d}\kappa_{\sigma,\tau}(s[m],t[m]) = 1 + \int_{(s_1,t_1)\in(0,1)\times(0,1)}\left(1 + \dots \int_{(s_M,t_M)\in(0,s_{M-1})\times(0,t_{M-1})} \mathrm{d}\kappa_{\sigma,\tau}(s_M,t_M)\dots\right)\mathrm{d}\kappa_{\sigma,\tau}(s_1,t_1).$$

If $X$ is an $\mathbb{R}$-vector space and $\sigma, \tau$ are differentiable, then $\mathrm{d}\kappa_{\sigma,\tau}(s,t) = k(\dot\sigma(s), \dot\tau(t))\,\mathrm{d}s\,\mathrm{d}t$.

Proof. Let $x := \varphi(\sigma)$, $y := \varphi(\tau)$ and note that $k^{\oplus}_{\leq M}(\sigma,\tau) = K^{\oplus}_{\leq M}(x, y)$. Fix a sequence $(t_n) \subset \Delta([0,1])$ with $\mathrm{mesh}(t_n)\to 0$ as $n\to\infty$, and denote by $x_n, y_n$ the paths given by piecewise linear interpolation of the points $x(t_n) = (x(t_n[1]),\dots,x(t_n[n]))$, $y(t_n) = (y(t_n[1]),\dots,y(t_n[n]))$. By the above discussion, the statement holds for $x_n, y_n$ if we replace the measure $\kappa_{\sigma,\tau}$ by the measure $\kappa_{\sigma,\tau,n}$ on $[0,1]^2$ given by

$$\kappa_{\sigma,\tau,n}([a,b]\times[c,d]) := \langle x_n(b), y_n(d)\rangle_H + \langle x_n(a), y_n(c)\rangle_H - \langle x_n(b), y_n(c)\rangle_H - \langle x_n(a), y_n(d)\rangle_H.$$

As $\mathrm{mesh}(t_n)\to 0$, the right-hand side converges to $k\begin{bmatrix}\sigma(a) & \sigma(b)\\ \tau(c) & \tau(d)\end{bmatrix}$. Hence the measure $\kappa_{\sigma,\tau,n}$ converges weakly to the measure $\kappa_{\sigma,\tau}$. On the other hand, $\langle S(x_n), S(y_n)\rangle_{T(H)}$ converges to $\langle S(x), S(y)\rangle_{T(H)}$, which finishes the proof by sending $n\to\infty$ in (A.1).

B. Integral approximation bounds and proof of Theorem 5

Definition B.1. Let $[a,b]\subseteq[0,1]$ and $x\in BV([a,b],H)$. For $V = [a',b'] \subseteq U := [a,b]$, we write $x[a',b']$ for the element of $BV(V,H)$ which is obtained by restriction of $x$ to $V$, i.e., $x[a',b'] := [x: V\to H] \in BV(V,H)$.

The following sum identity will be very useful.

Proposition B.2. Let $x\in H^+$. Then $S(x) = \sum_{x'\sqsubset \nabla x}\prod_{i=1}^{\ell(x')} x'[i]$.

Proof. This follows from an explicit algebraic computation.

Lemma B.3. Let $x\in BV(U)$ for $U = [a,b]\subseteq\mathbb{R}$. Let $m\in\mathbb{N}$, let $V_i := [a_i,b_i]\subseteq U$ for $1\le i\le m$, and let $V = V_1\times\dots\times V_m$. It holds that

$$\int_V \mathrm{d}x^{\otimes m} = \prod_{i=1}^{m}\left(x(b_i) - x(a_i)\right).$$

Proof. Observe that the integral on the left-hand side can be split as

$$\int_V \mathrm{d}x^{\otimes m} = \int_{V_1}\dots\int_{V_m}\mathrm{d}x_1\otimes\dots\otimes\mathrm{d}x_m.$$

Separating the differential operators, one obtains

$$\int_{V_1}\dots\int_{V_m}\mathrm{d}x_1\otimes\dots\otimes\mathrm{d}x_m = \int_{V_1}\mathrm{d}x_1\otimes\dots\otimes\int_{V_m}\mathrm{d}x_m.$$

The claim follows from observing that $\int_{V_i}\mathrm{d}x_i = x(b_i) - x(a_i)$.

Lemma B.4. Let $x\in BV(U,H)$ for $U=[a,b]\subseteq[0,1]$. It holds that

$$\left\|\int_{\Delta_m(U)}\mathrm{d}x^{\otimes m}\right\|_H \le \frac{1}{m!}V(x)^m.$$
Proof. By the (continuous / integral) triangle inequality, it holds that

$$\left\|\int_{\Delta_m(U)}\mathrm{d}x^{\otimes m}\right\|_H \le \int_{\Delta_m(U)}\|\mathrm{d}x\|_H^m.$$

Further observe that

$$\int_{\Delta_m(U)}\|\mathrm{d}x\|_H^m = \frac{1}{m!}\int_{U^m}\|\mathrm{d}x\|_H^m.$$

The integral on the right-hand side can be split, i.e.,

$$\frac{1}{m!}\int_{U^m}\|\mathrm{d}x\|_H^m = \frac{1}{m!}\left(\int_U \|\mathrm{d}x\|_H\right)^m.$$

Observing that $\int_U\|\mathrm{d}x\|_H = V(x)$ yields the claim.

Lemma B.5. Fix $m, M\in\mathbb{N}$. Let $x\in BV([a,b],H)$ and $t = (t_1,\dots,t_{M+1})\in\Delta_{M+1}([a,b])$. Write

$$C := \int_{\Delta_m(U)}\mathrm{d}x^{\otimes m}\quad\text{and}\quad D := \sum_{\substack{\mathbf{i}\sqsubset[M]\\ \ell(\mathbf{i})=m}} (\nabla x(t))[i_1]\otimes\dots\otimes(\nabla x(t))[i_m].$$

Further write $U := [a,b]$ and $U_i := [t_i, t_{i+1}]$ for $1\le i\le M$, and for $\mathbf{i}\sqsubseteq[M]$, write $U_{\mathbf{i}} := \times_{i\in\mathbf{i}}[t_i,t_{i+1}]$. Then the following hold:

(i) $C - D = \sum_{\substack{\mathbf{i}\sqsubseteq[M],\ \ell(\mathbf{i})=m\\ \mathbf{i}!\neq 1}} \int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m}$,

(ii) $\left\|\int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m}\right\|_H \le \frac{1}{\mathbf{i}!}\prod_{i\in\mathbf{i}} V(x[t_i,t_{i+1}])$.

Proof. (i) By splitting the integral, we can write

$$C = \sum_{\substack{\mathbf{i}\sqsubseteq[M]\\ \ell(\mathbf{i})=m}}\int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m}.$$

Note that this is not the same summation over $\mathbf{i}$ as in $D$: in $C$, indices in the index sequence $\mathbf{i}$ may repeat, while in $D$ they may not, as they have to increase strictly monotonically. We thus write

$$C_1 := \sum_{\substack{\mathbf{i}\sqsubset[M]\\ \ell(\mathbf{i})=m}}\int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m}\quad\text{and}\quad C_2 := \sum_{\substack{\mathbf{i}\sqsubseteq[M],\ \ell(\mathbf{i})=m\\ \mathbf{i}!\neq 1}}\int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m},$$

that is, $C_1$ collects index sequences without repeated indices, and $C_2$ collects those with repeats. Now note that for a non-repeating index sequence $\mathbf{i}$ we have $\Delta_m(U)\cap U_{\mathbf{i}} = U_{\mathbf{i}}$, therefore

$$C_1 = \sum_{\substack{\mathbf{i}\sqsubset[M]\\ \ell(\mathbf{i})=m}}\int_{U_{\mathbf{i}}}\mathrm{d}x^{\otimes m}.$$

Subtracting and collecting terms with the same index yields

$$C_1 - D = \sum_{\substack{\mathbf{i}\sqsubset[M]\\ \ell(\mathbf{i})=m}}\left(\int_{U_{\mathbf{i}}}\mathrm{d}x^{\otimes m} - (\nabla x(t))[i_1]\otimes\dots\otimes(\nabla x(t))[i_m]\right).$$

This is zero by Lemma B.3; therefore $C - D = C_1 + C_2 - D = C_2$, which was the claimed statement.

(ii) Fix an index sequence $\mathbf{i}$. Let $i_1,\dots,i_k$ be the distinct occurring indices in $\mathbf{i}$, and $n_1,\dots,n_k$ the total counts of their respective occurrences. Note that therefore

$$\Delta_m(U)\cap U_{\mathbf{i}} = \Delta_{n_1}([t[i_1],t[i_1+1]])\times\dots\times\Delta_{n_k}([t[i_k],t[i_k+1]]).$$

Write $S_j := \Delta_{n_j}([t[i_j],t[i_j+1]])$. Then

$$\int_{\Delta_m(U)\cap U_{\mathbf{i}}}\mathrm{d}x^{\otimes m} = \bigotimes_{j=1}^{k}\int_{S_j}\mathrm{d}x^{\otimes n_j}.$$

Therefore, we obtain the norm bound

$$\left\|\bigotimes_{j=1}^{k}\int_{S_j}\mathrm{d}x^{\otimes n_j}\right\|_H = \prod_{j=1}^{k}\left\|\int_{S_j}\mathrm{d}x^{\otimes n_j}\right\|_H \le \frac{1}{\mathbf{i}!}\prod_{i\in\mathbf{i}}V(x[t_i,t_{i+1}]),$$

where the rightmost inequality follows from applying Lemma B.4 to every factor in the product.

Theorem 6. Let $x\in BV(U,H)$ for $U=[a,b]\subseteq\mathbb{R}$, and let $t\in\Delta_{M+1}(U)$. Write $C := S(x)$ and

$$D := \sum_{\mathbf{i}\sqsubset[M]}\prod_{r=1}^{\ell(\mathbf{i})}(\nabla x(t))[i_r].$$

Write $C_m, D_m$ for the respective homogeneous parts of $C, D$. Further define

$$G(z) := \exp(z\cdot V(x)) - \prod_{i=1}^{M}\left(1 + z\cdot V(x[t_i,t_{i+1}])\right) =: \sum_{m=1}^{\infty} g_m\cdot z^m.$$

That is, the first equality defines $G$ as a function of $z$; the second defines the $g_m$ as the Taylor coefficients of the expansion of $G$ in $z$ around $0$. Then $G$ is the generating function for an upper bound on the error of approximating $C$ with $D$; namely, more precisely:

(i) $\|C_m - D_m\|_{H^{\otimes m}} \le g_m$.

(ii) $\|C - D\|_{T(H)} \le G(1) = \exp(V(x)) - \prod_{i=1}^{M}\left(1 + V(x[t_i,t_{i+1}])\right)$.

(iii) If $t$ is chosen such that $V(x[t_i,t_{i+1}]) = \frac{1}{M}V(x)$ for all $i$, then $\|C - D\|_{T(H)} \le \frac{\exp(V(x))}{M}\left(1 + \frac{V(x)^M}{(M-2)!}\right)$.
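Before the proof, a small numerical illustration (our own, not from the paper) of the objects in Theorem 6, for a monotone one-dimensional path: there all homogeneous terms are non-negative and the level-wise errors sum exactly to $G(1)$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
dx = rng.random(6) * 0.3   # non-negative increments of a scalar path x
V = dx.sum()               # total variation V(x)

# discrete approximation D (all levels): sum over strictly increasing index
# sequences of products of increments; expanding the product shows that this
# equals prod_i (1 + dx_i), as in the proof of Proposition B.6 below
D = sum(
    np.prod([dx[i] for i in idx])
    for m in range(len(dx) + 1)
    for idx in combinations(range(len(dx)), m)
)
assert np.isclose(D, np.prod(1.0 + dx))

# continuous signature C (all levels): level m contributes V^m / m! for a
# monotone scalar path, so the sum over all levels is exp(V)
C = np.exp(V)

print("error summed over levels:", C - D)
print("generating-function bound G(1):", np.exp(V) - np.prod(1.0 + dx))
```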
Proof. (i) The bounds given by $G$ are those from Lemma B.5 (i) and (ii), where $C_m$ here is $C$ in Lemma B.5, and $D_m$ here is $D$ in the lemma. The statement follows from explicitly writing out the coefficient $g_m$ and Lemma B.5 (i) and (ii).

(ii) Applying the triangle inequality and then part (i), one obtains

$$\|C - D\|_{T(H)} \le \sum_{m=1}^{\infty}\|C_m - D_m\|_{H^{\otimes m}} \le \sum_{m=1}^{\infty} g_m = G(1),$$

and therefore the statement, when writing out $G(1)$.

(iii) This follows from observing that for such a choice of $t$, it holds that

$$G(1) = \exp(V(x)) - \left(1 + \frac{1}{M}V(x)\right)^{M},$$

to which Euler's Theorem 4 may be applied.

Proposition B.6. Let $x\in\mathbb{R}$, $x\ge 0$, and $M\in\mathbb{N}$. Let $x_1,\dots,x_M\in\mathbb{R}$, $x_i\ge 0$ for $i=1,\dots,M$, such that $\sum_{i=1}^{M}x_i = x$. Then

$$\exp(x) = \prod_{i=1}^{M}(1+x_i) + g(x, x_1,\dots,x_M),\quad\text{where}\quad 0\le g(x,x_1,\dots,x_M)\le x\exp(x)\cdot\max_i x_i.$$

In particular, it holds that

$$\lim_{\max_i x_i\to 0}\prod_{i=1}^{M}(1+x_i) = \exp(x),$$

where convergence is uniform of order $O(\max_i x_i)$ on any compact subset of $[0,\infty)$.

Proof. All statements follow from the first, which we proceed to prove. Writing out the product, we obtain

$$\prod_{i=1}^{M}(1+x_i) = \sum_{\mathbf{i}\sqsubset[M]} x_{\mathbf{i}},$$

where, abbreviating, we have written $x_{\mathbf{i}} := \prod_{i\in\mathbf{i}} x_i$. The Taylor expansion of the exponential, on the other hand, yields

$$\exp(x) = \exp\left(\sum_{i=1}^{M}x_i\right) = \sum_{\mathbf{i}\sqsubseteq[M]}\frac{1}{\mathbf{i}!}x_{\mathbf{i}}.$$

Note that the major difference between the two sums above is the repeating indices which may occur in the expansion of $\exp(x)$. More precisely, we obtain

$$\exp(x) - \prod_{i=1}^{M}(1+x_i) = \sum_{\substack{\mathbf{i}\sqsubseteq[M]\\ \mathbf{i}!\neq 1}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}}.$$

We further split up the sum by the length of $\mathbf{i}$:

$$\exp(x) - \prod_{i=1}^{M}(1+x_i) = \sum_{m=2}^{\infty}\sum_{\substack{\mathbf{i}\sqsubseteq[M],\ \ell(\mathbf{i})=m\\ \mathbf{i}!\neq 1}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}}.$$

Positivity of $g$ follows from this equation and positivity of the $x_i$. Now consider the map $\varphi$ which removes the first duplicated index in an ordered index sequence $\mathbf{i}$, yielding a sequence of length $\ell(\mathbf{i})-1$. On sequences of length $m$, the map $\varphi$ is at most $m$-to-one, and surjective onto sequences of length $m-1$. Therefore,

$$\sum_{m=2}^{\infty}\sum_{\substack{\mathbf{i}\sqsubseteq[M],\ \ell(\mathbf{i})=m\\ \mathbf{i}!\neq 1}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}} \;\le\; X\cdot\sum_{m=2}^{\infty} m\sum_{\substack{\mathbf{i}\sqsubseteq[M]\\ \ell(\mathbf{i})=m-1}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}},\qquad\text{where } X := \max_i x_i.$$

Thus,

$$\sum_{m=2}^{\infty}\sum_{\substack{\mathbf{i}\sqsubseteq[M],\ \ell(\mathbf{i})=m\\ \mathbf{i}!\neq 1}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}} \;\le\; X\cdot\sum_{m=1}^{\infty} m\cdot\sum_{\substack{\mathbf{i}\sqsubseteq[M]\\ \ell(\mathbf{i})=m}}\frac{1}{\mathbf{i}!}x_{\mathbf{i}}.$$

Comparing to the expansion of $\exp(x)$ above, one observes that the right-hand side is equal to

$$X\cdot\sum_{m=1}^{\infty} m\cdot\frac{x^m}{m!} = X\cdot x\sum_{m=0}^{\infty}\frac{x^m}{m!} = X\cdot x\cdot\exp(x).$$

Note that the bounds from Proposition B.6 are worse than those from Euler's Theorem 4 in the case of equal $x_i$, by a factor of $x$. This is due to the fact that the bound also needs to be valid for heavily imbalanced partitions of $x$ into the $x_i$.

References

[1] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition with support vector machines - a kernel approach. In Frontiers in Handwriting Recognition, Proceedings of the Eighth International Workshop on, pages 49-54. IEEE, 2002.

[2] H. Boedihardjo, X. Geng, T. Lyons, and D. Yang. The signature of a rough path: Uniqueness. ArXiv e-prints, June 2014.

[3] Djèlil Chafaï, Terry Lyons, Christoph Ladroue, and Anastasia Papavasiliou. Computational Rough Paths Project on SourceForge, 2006 (accessed January 8, 2016).

[4] Noel Cressie. Statistics for Spatial Data. John Wiley & Sons, 2015.

[5] Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929-936, 2011.
[6] Marco Cuturi, J.-P. Vert, Oystein Birkenes, and Takashi Matsui. A kernel for time series based on global alignments. In Acoustics, Speech and Signal Processing (ICASSP 2007), IEEE International Conference on, volume 2, pages II-413. IEEE, 2007.

[7] Joscha Diehl. Rotation invariants of two dimensional curves based on iterated integrals. CoRR, abs/1305.6883, 2013.

[8] Peter K. Friz and Nicolas B. Victoir. Multidimensional Stochastic Processes as Rough Paths: Theory and Applications, volume 120. Cambridge University Press, 2010.

[9] Toni Giorgino. Computing and visualizing dynamic time warping alignments in R: the dtw package. Journal of Statistical Software, 31(7):1-24, 2009.

[10] Benjamin Graham. Sparse arrays of signatures for online character recognition. CoRR, abs/1308.0371, 2013.

[11] Lajos Gergely Gyurkó, Terry Lyons, Mark Kontkowski, and Jonathan Field. Extracting information from the signature of a financial data stream. arXiv preprint arXiv:1307.7244, 2013.

[12] Martin Hairer. A theory of regularity structures. Inventiones Mathematicae, 198(2):269-504, 2014.

[13] Ben Hambly and Terry Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Annals of Mathematics, 171(1):109-167, 2010.

[14] David Haussler. Convolution kernels on discrete structures. Technical report, Citeseer, 1999.

[15] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

[16] Joseph B. Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25(2):201-237, 1983.

[17] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 2004.

[18] Daniel Levin, Terry Lyons, Hao Ni, et al. Learning from the past, predicting the statistics for the future, learning an evolving system. ArXiv e-prints, September 2013.

[19] Huma Lohdi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2002.

[20] Terry Lyons. Differential Equations Driven by Rough Paths. Springer Berlin Heidelberg New York, 2004.

[21] Terry Lyons, Hao Ni, and Harald Oberhauser. A feature set for streams and an application to high-frequency financial tick data. In Proceedings of the 2014 International Conference on Big Data Science and Computing, page 5. ACM, 2014.

[22] Terry J. Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215-310, 1998.

[23] Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246-1266, 1963.

[24] Hiroshi Shimodaira and Ken-ichi Noma. Dynamic time-alignment kernel in support vector machine. Advances in Neural Information Processing Systems, 14:921, 2002.

[25] Anastasia Papavasiliou, Christophe Ladroue, et al. Parameter estimation for rough differential equations. The Annals of Statistics, 39(4):2047-2073, 2011.

[26] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[27] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
[28] Daniel Revuz and Marc Yor. Continuous Martingales and Brownian Motion, volume 293. Springer Science & Business Media, 1999.

[29] Hiroaki Sakoe. Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(6):588-595, 1979.

[30] Hiroaki Sakoe and Seibi Chiba. A similarity evaluation of speech patterns by dynamic programming. In Nat. Meeting of Institute of Electronic Communications Engineers of Japan, page 136, 1970.

[31] Christos Sapsanis, George Georgoulas, and Anthony Tzes. EMG based classification of basic hand movements based on time-frequency features. In Control & Automation (MED), 2013 21st Mediterranean Conference on, pages 716-722. IEEE, 2013.

[32] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[33] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[34] Kilho Shin and Tetsuji Kuboyama. A generalization of Haussler's convolution kernel: mapping kernel. In Proceedings of the 25th International Conference on Machine Learning, pages 944-951. ACM, 2008.

[35] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. 1996.

[36] Weixin Yang, Lianwen Jin, and Manfei Liu. Character-level Chinese writer identification using path signature feature, DropStroke and deep CNN. CoRR, abs/1505.04922, 2015.
