Introduction to Coresets: Accurate Coresets


Authors: Ibrahim Jubran, Alaa Maalouf, Dan Feldman

Introduction to Coresets: Accurate Coresets
The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Israel
{ibrahim.jub, Alaamalouf12, dannyf.post}@gmail.com

Abstract

A coreset (or core-set) of an input set is its small summation, such that solving a problem on the coreset as its input provably yields the same result as solving the same problem on the original (full) set, for a given family of problems (models, classifiers, loss functions). Over the past decade, coreset construction algorithms have been suggested for many fundamental problems in e.g. machine/deep learning, computer vision, graphics, databases, and theoretical computer science. This introductory paper was written following requests from readers (usually non-experts, but also colleagues) regarding the many inconsistent coreset definitions, the lack of available source code, the required deep theoretical background from different fields, and the dense papers that make it hard for beginners to apply coresets and develop new ones. The paper provides folklore, classic and simple results, including step-by-step proofs and figures, for the simplest (accurate) coresets of very basic problems, such as: sum of vectors, minimum enclosing ball, SVD/PCA and linear regression. Nevertheless, we did not find most of their constructions in the literature. Moreover, we expect that putting them together in a retrospective context will help the reader to grasp modern results that usually extend and generalize these fundamental observations. Experts might appreciate the unified notation and the comparison table that links existing results. Open-source code with example scripts is provided for all the presented algorithms, to demonstrate their practical usage and to support readers who are more familiar with programming than math.
1 Introduction

A coreset (or core-set) is a modern data summarization that approximates the original data in some provable sense with respect to a (usually infinite) set of questions, queries or models, and an objective loss/cost function. The goal is usually to compute the model that minimizes this objective function on the small coreset instead of the original (possibly big) data, without compromising the accuracy by more than a small multiplicative factor. Moreover, coresets have many other applications, such as handling constraints, streaming, distributed data, parallel computation, model compression, parameter tuning, model selection and many more.

The simplest coreset is a (possibly weighted) subset of the input data. The advantages of such subset coresets are: (i) preserved sparsity of the input, (ii) interpretability, (iii) the coreset may be used (heuristically) for other problems, and (iv) fewer numerical issues, which occur when non-exact linear combinations of points are used (Maalouf, Jubran, & Feldman, 2019). Unfortunately, not all problems admit such a subset coreset, as we show throughout the paper.

Although coreset constructions are usually practical and not hard to implement, the theory behind them may be complicated and based on a good understanding of linear algebra, statistics, probability, computational geometry and machine learning. Similarly to approximation algorithms in computer science, there are some generic techniques for coreset constructions, but many constructions are heavily tailored to the problem at hand and its existing solvers. Furthermore, there are many inconsistent definitions of coresets in the papers. Nevertheless, it seems that after understanding the intuition and math behind simple coreset constructions, it is much easier to read modern academic papers and construct coresets for new problems.
To this end, this paper focuses only on what seems to be the simplest type of coresets, namely "accurate coresets", which do not introduce any approximation error when compressing the original data, but give accurate solutions.

Most of the coresets in this paper are easy to construct and may be considered "folklore" results. However, we did not find them in the literature, and we realized that many experts in the field are not familiar with them. Furthermore, since most of these results are easy to construct and explain, we found them suitable for tutorials, as is the case in this paper. These results may also be of great interest to people from various fields of study, who may not be familiar even with the simple techniques presented in this paper. We assume no previous knowledge except basic linear algebra, and therefore we target both experts and beginners in the field, as well as data scientists and analysts. To better understand the results presented in this paper and to encourage people to use them, we provide full open-source code for these results (Jubran, Maalouf, & Feldman, 2019).

Another motivation for this introductory survey is to show the many possible different definitions of coresets and the resulting different constructions, as well as to summarize them in a single place. Table 1 summarizes the different accurate coresets that we present in this paper.

2 Preliminaries

In this section, we give basic notations and definitions that will be used throughout this paper. The set of all real numbers is denoted by ℝ. We denote [n] = {1, ..., n} for every integer n ≥ 1, by ‖p‖ = ‖p‖₂ = √(p₁² + ... + p_d²) the ℓ₂ norm of a point p = (p₁, ..., p_d) ∈ ℝ^d, by ‖p‖_q = (Σ_{i=1}^d |p_i|^q)^{1/q} the ℓ_q norm of p for every q > 0, by ‖p‖_∞ = max_i |p_i| the ℓ_∞ norm, and by ‖A‖_F = √(Σ_{i=1}^m Σ_{j=1}^n a_{ij}²) the Frobenius norm of a matrix A ∈ ℝ^{m×n}, where a_{ij} is the jth entry of the ith row of A. The d-dimensional identity matrix is denoted by I_d ∈ ℝ^{d×d}. For a function f we denote f²(·,·) = (f(·,·))². For a set Z of elements, we denote by P(Z) the power set of Z, i.e., the set of all subsets of Z.

A weighted set is a pair P′ = (P, w) where P is a set of items called points, and w : P → ℝ is a function that maps every p ∈ P to w(p) ∈ ℝ, called the weight of p. A weighted point is a weighted set of size |P| = 1. A weighted set (P, 1), where 1 is the weight function w : P → {1} that assigns w(p) = 1 for every p ∈ P, may be denoted by P for short.

Table 1: Coresets that are presented in this paper. The input set, the query set, and the roles of the functions f and loss are as defined in Definition 1. The first and second arguments of the function f are elements of the input set and the query set, respectively. We assume that the input set is of size |P| = n, and we wish to compute the loss over the n fitting errors that are defined by f, each input point and a given query. Each entry lists the input weighted set (P, w), the query set X, the cost function f, the loss, the coreset C and its weights u, the construction time, the query time, and the section where it appears.

- 1-Center (Section 3.1): P ⊆ ℓ ⊆ ℝ^d (points on a line ℓ), w ≡ 1; X = ℝ^d; f(p, x) = ‖p − x‖; loss = ‖·‖_∞; C ⊆ P, |C| = 2, u ≡ 1; construction O(n), query O(d).
- Monotonic function (Section 3.2): P ⊆ ℝ, w ≡ 1; X = {g | g is a monotonic decreasing or increasing function, or a decreasing-then-increasing function}; f(p, g) = g(p); loss = ‖·‖_∞; C ⊆ P, |C| = 2, u ≡ 1; construction O(n), query O(1).
- Vectors sum (1) (Section 3.3): P ⊆ ℝ^d, w : P → ℝ; X = ℝ^d; f(p, x) = p − x; loss = Σ; C ⊆ ℝ^d, |C| = 1, u ≡ Σ_{p∈P} w(p); construction O(n), query O(d).
- Vectors sum (2) (Section 3.3.1): P ⊆ ℝ^d, w : P → ℝ; X = ℝ^d; f(p, x) = p − x; loss = Σ; C ⊆ P, |C| ≤ d + 1, u : C → ℝ with Σ_{p∈C} u(p) = Σ_{p∈P} w(p); construction O(nd²), query O(d²).
- Vectors sum (3) (Section 3.3.2): P ⊆ ℝ^d, w : P → [0, ∞); X = ℝ^d; f(p, x) = p − x; loss = Σ; C ⊆ P, |C| ≤ d + 2, u : C → [0, Σ_{p∈P} w(p)] with Σ_{p∈C} u(p) = Σ_{p∈P} w(p); construction O(min{n²d², nd + d⁴ log n}), query O(d²).
- 1-Mean (1) (Section 3.4): P ⊆ ℝ^d, w : P → ℝ; X = ℝ^d; f(p, x) = w(p)‖p − x‖²; loss = ‖·‖₁; C consists of |C| = 3 items (a vector in ℝ^d and two scalars; not a subset of P); different loss, unweighted; construction O(nd), query O(d).
- 1-Mean (2) (Section 3.4.1): P ⊆ ℝ^d, w : P → ℝ; X = ℝ^d; f(p, x) = w(p)‖p − x‖²; loss = ‖·‖₁; C ⊆ P, |C| ≤ d + 2, u : C → ℝ with Σ_{p∈C} u(p) = Σ_{p∈P} w(p) and Σ_{p∈C} u(p)‖p‖² = Σ_{p∈P} w(p)‖p‖²; construction O(nd²), query O(d²).
- 1-Mean (3) (Section 3.4.2): P ⊆ ℝ^d, w : P → [0, ∞); X = ℝ^d; f(p, x) = w(p)‖p − x‖²; loss = ‖·‖₁; C ⊆ P, |C| ≤ d + 3, u : C → [0, Σ_{p∈P} w(p)] with Σ_{p∈C} u(p) = Σ_{p∈P} w(p) and Σ_{p∈C} u(p)‖p‖² = Σ_{p∈P} w(p)‖p‖²; construction O(min{n²d², nd + d⁴ log n}), query O(d²).
- 1-Segment (Section 3.5): P = {(t_i | p_i)}_{i=1}^n ⊆ ℝ^{d+1}, w : P → [0, ∞); X = {g | g : ℝ → ℝ^d}; f((t, p), g) = ‖p − g(t)‖²; loss = ‖·‖₁; C ⊆ ℝ^{d+1}, |C| = d + 2, u ≡ 1; construction O(nd²), query O(d²).
- Matrix 2-norm (1) (Section 3.6): P ⊆ ℝ^d, w : P → [0, ∞); X = ℝ^d; f(p, x) = (pᵀx)²; loss = ‖·‖₁; C ⊆ ℝ^d, |C| = d, u ≡ 1; construction O(nd²), query O(d²).
- Matrix 2-norm (2) (Section 3.6.1): P ⊆ ℝ^d, w : P → [0, ∞); X = ℝ^d; f(p, x) = (pᵀx)²; loss = ‖·‖₁; C ⊆ P, |C| ≤ d² + 1, u : C → [0, Σ_{p∈P} w(p)] with Σ_{p∈C} u(p) = Σ_{p∈P} w(p); construction O(min{n²d⁴, nd² + d⁸ log n}), query O(d³).
- Least Mean Squares (Section 3.7): P = {(a_iᵀ | b_i)}_{i=1}^n ⊆ ℝ^{d+1}, w : P → [0, ∞); X = ℝ^d; f((aᵀ | b), x) = (aᵀx − b)²; loss = ‖·‖₁; C ⊆ P, |C| ≤ (d + 1)² + 1, u : C → [0, Σ_{p∈P} w(p)] with Σ_{p∈C} u(p) = Σ_{p∈P} w(p); construction O(min{n²d⁴, nd² + d⁸ log n}), query O(d³).
In order to have a unified framework, we make the following definition of a query space, which will simplify and unify the definitions of every example coreset presented in this paper. A query space basically includes all the ingredients needed to define a coreset for a new problem that we wish to tackle.

Definition 1 (query space). Let X be a (possibly infinite) set called the query set, P′ = (P, w) be a weighted set called the input set, f : P × X → [0, ∞) be called a cost function, and loss be a function that assigns a non-negative real number to every real vector. The tuple (P, w, X, f, loss) is called a query space. For every weighted set C′ = (C, u) such that C = {c₁, ..., c_m}, and every x ∈ X, we define the overall fitting error of C′ to x by

f_loss(C′, x) := loss((u(c) f(c, x))_{c∈C}) = loss(u(c₁) f(c₁, x), ..., u(c_m) f(c_m, x)).

An accurate coreset that approximates a set of models (queries) for a specific problem is defined as follows:

Definition 2 (accurate coreset). Let P′ = (P, w) be a weighted set and (P, w, X, f, loss) be a query space. The weighted set C′ is called an accurate coreset for (P, w, X, f, loss) if for every x ∈ X we have

f_loss(P′, x) = f_loss(C′, x).

3 Accurate Coresets

In what follows, each subsection presents an accurate coreset construction for a given query space. Each section is marked with one of the following difficulty indicators: ★ (easy), ★★ (intermediate) or ★★★ (advanced).

3.1 1-Center ★

Suppose that we want to open a shop on our street that will be close to all the residents of the street, and suppose that the street is represented by a linear segment, say the x-axis, while the residents are represented by points on this segment.
Since we want to be close to all of the potential n ≥ 1 residents of the street, if we decide to open the shop at some location, our loss will be measured as the distance to the farthest resident. Suppose that tomorrow we will be given a few locations to choose from to position our store. Can we pre-process the given positions of the residents so that computing the cost (farthest resident) from a suggested store location will take only O(1) time for each suggested location? That is, constant time that is independent of the number of residents n?

Formally, in the query-space notation from Definition 1, we have P = {p₁, ..., p_n} ⊆ ℝ, w ≡ 1, X = ℝ, f(p, x) = |p − x|, and loss(·) = ‖·‖_∞. Our goal is to compute a data structure (accurate coreset) C such that for every number x ∈ X we can compute

f_loss((P, 1), x) = ‖(|p₁ − x|, ..., |p_n − x|)‖_∞ = max_{p∈P} |p − x|

in O(1) time using only C. This can easily be done by observing that for every x ∈ X, the farthest point from x is either the smallest point p_min or the largest point p_max of P; see Fig. 1a. That is, simply choosing C = {p_min, p_max} ⊆ P yields

f_loss((P, 1), x) = max_{p∈P} |p − x| = max{|p_min − x|, |p_max − x|} = f_loss((C, 1), x),

i.e., the distance to the farthest input point from x is either the distance to the leftmost or to the rightmost point of P. Even in the case where X = ℝ^d and P is contained in a line ℓ in ℝ^d, we would still have f_loss((P, 1), x) = f_loss((C, 1), x), where C contains the two edge points of P on ℓ; see Fig. 1b. Indeed, denote by x′ the projection (closest point) of x onto ℓ to obtain, by the Pythagorean Theorem, that

f²_loss((P, 1), x) = max_{p∈P} ‖p − x‖² = max_{p∈P} (‖p − x′‖² + ‖x′ − x‖²) = max_{p∈C} (‖p − x′‖² + ‖x′ − x‖²) = max_{p∈C} ‖p − x‖² = f²_loss((C, 1), x).
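The two-point construction above can be written in a few lines. The following is a minimal NumPy sketch (not the authors' one center implementation from their repository):

```python
import numpy as np

def one_center_coreset(P):
    """Accurate coreset for 1-center queries over points P on the real line:
    only the two extreme points p_min and p_max are needed."""
    return np.array([P.min(), P.max()])

def farthest_distance(C, x):
    """Evaluate max_{p in P} |p - x| in O(1) time using only the coreset C."""
    return max(abs(C[0] - x), abs(C[1] - x))
```

For any query x, `farthest_distance(C, x)` equals `np.max(np.abs(P - x))` computed on the full set, but its running time does not depend on n.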
See the implementation of the function one center in (Jubran et al., 2019).

Figure 1: An accurate coreset for the 1-center query space (red points). (a) An input set P ⊆ ℝ (in blue) on the x-axis and a query point x ∈ ℝ (in green). (b) A line ℓ in ℝ^d, an input set P ⊆ ℓ (in blue) and a query point x ∈ ℝ^d (in green). In both Fig. 1a and Fig. 1b, the farthest point p ∈ P from the query x is either the first point (p_min) or the last point (p_max) on the line, i.e., f_loss((P, 1), x) = max{‖p_min − x‖, ‖p_max − x‖}.

The solution in this section does not generalize to an arbitrary set of points in ℝ^d. In fact, the following set of points in the plane does not have any subset which is an accurate coreset. Let P ⊆ ℝ² denote n points on the unit circle in the plane, and let C ⊂ P. For p ∈ P \ C and x = −p we have that f_loss((P, 1), x) = f_loss({p}, x) ≠ f_loss((C, 1), x). Hence, there is no subset C which is a coreset for P in this sense. Nevertheless, approximated coresets can be found in (Paul, Feldman, Rus, & Newman, 2014) and the references therein.

Furthermore, the above solution does not generalize to the case where the input is weighted, even for d = 1. That is, there is a weighted set of points (P, w), where P ⊆ ℝ, such that for every p ∈ P there is a query x ∈ ℝ that satisfies f_loss((P, w), x) = w(p)|p − x|. In other words, if p_j ∈ P was not chosen for the coreset (C, u), then there is some query x ∈ ℝ such that f_loss((P, w), x) = w(p_j)|p_j − x| ≠ f_loss((C, u), x). We now construct such an example. Let (P, w) be a weighted set of |P| = n points where

w(p₁) = 2,  w(p_i) = w(p_{i−1}) + (1/4)^{i−1},  and  p_i = 2^{5−i} / w(p_i).

For every p_i ∈ P, by defining x_i = −2^{4+i} + 1, it is easy to verify that

f_loss((P, w), x_i) = max_{p∈P} w(p)|p − x_i| = w(p_i)|p_i − x_i|,

as illustrated in Fig. 2. Therefore, any coreset for this problem must include all the input points of (P, w). Thus there is no accurate coreset for the weighted 1-center problem, even in 1-dimensional space.

Figure 2: No accurate coreset exists for the weighted 1-center problem. A weighted set (P, w), P ⊆ ℝ, where w(p₁) = 2, w(p_i) = w(p_{i−1}) + (1/4)^{i−1}, and p_i = 2^{5−i} / w(p_i). For every p_i ∈ P, there is a query x_i = −2^{4+i} + 1 such that f_loss((P, w), x_i) = max_{p∈P} w(p)|p − x_i| = w(p_i)|p_i − x_i|.

3.2 Monotonic functions ★

What if the function f(p, x) = ‖p − x‖ from Section 3.1 is not a Euclidean distance, but a function of this distance g(‖p − x‖), e.g., g(y) = y², so that f(p, x) = ‖p − x‖² is the squared Euclidean distance from x, or g(y) = min{y², 1}, so that f(p, x) = min{1, ‖p − x‖²}? The latter is called an M-estimator and is robust to points that are very far from x (outliers). It turns out that the coreset from the previous section holds for the following cases.

Consider the query space (P, w, X, f, loss) where P = {p₁, ..., p_n} ⊆ ℝ, w ≡ 1, the query set X is the union over every function g : ℝ → [0, ∞) that is a non-negative decreasing, increasing, or decreasing-then-increasing monotonic function, f(p, g) = g(p), and loss(·) = ‖·‖_∞. Note that here every query is actually a function and not a point. Hence,

f_loss((P, 1), g) = ‖(g(p₁), ..., g(p_n))‖_∞ = max_{p∈P} g(p).

Again, the main observation is that the maximum value of g(p) over p ∈ P is attained at one of the points p_max ∈ arg max_{p∈P} p or p_min ∈ arg min_{p∈P} p; see Fig. 3. Therefore, the coreset C = {p_min, p_max} from Section 3.1 is also valid here.

Figure 3: Accurate coreset for monotonic functions. A set P ⊆ ℝ and a "decreasing then increasing" monotonic function g : ℝ → [0, ∞). The point that maximizes g(p) over every p ∈ P is either p_min or p_max, i.e., f_loss((P, 1), g) = max{g(p_min), g(p_max)}.

3.3 Vectors sum ★

The accurate coreset for the vectors sum example presented in Sections 3.3–3.3.2 is a warm-up example that will be used in later sections. Consider the query space (P, w, ℝ^d, f, loss) where P = {p₁, ..., p_n} is a set of n points in ℝ^d, w : P → ℝ, X = ℝ^d, f(p, x) = p − x, and loss maps every tuple (v₁, v₂, ...) of vectors to their sum Σ_i v_i (which is also a vector). In this section, unlike other sections, the function f, as well as loss, returns a vector and not a non-negative scalar as in Definition 1. This query space defines the weighted mean or sum of the differences (p − x) over p ∈ P, i.e.,

f_loss((P, w), x) = Σ_{p∈P} w(p)(p − x).

It is easy to see that there is a coreset consisting of a single weighted point for this query space. Indeed, let

c = (Σ_{p∈P} w(p) p) / (Σ_{p∈P} w(p)),  C = {c},  and  u(c) = Σ_{p∈P} w(p).

Here we assume that Σ_{p∈P} w(p) ≠ 0; otherwise, set c = Σ_{p∈P} w(p) p and u(c) = 1. We now have that the weighted sum of (c − x) over c ∈ C (a single point) is

f_loss((C, u), x) = u(c)(c − x) = (Σ_{p∈P} w(p))(c − x) = Σ_{p∈P} w(p) p − x · Σ_{p∈P} w(p) = Σ_{p∈P} w(p)(p − x) = f_loss((P, w), x).

See the implementation of the function vectors sum 1 in (Jubran et al., 2019).

3.3.1 Subset coreset

The coreset for the vectors sum in the previous section was not a subset of the input set P, i.e., C = {c} ⊄ P. In this section, the goal is to compute a weighted set C′ = (C, u), where C ⊆ P is a subset of the input, such that f_loss(C′, x) = f_loss(P′, x). The motivation for a subset coreset is explained in Section 1.

Let p̂ = (pᵀ | 1)ᵀ ∈ ℝ^{d+1} for every p ∈ P denote the concatenation of p with 1, and let P̂ = {p̂ | p ∈ P}. Given a distinct pair of points p, q on a line in ℝ^d, we can describe any third point z on the line as a linear combination of p and q. More generally, every point in ℝ^d can be described by d independent vectors (points); recall that in linear algebra such a set of d vectors is called a basis of ℝ^d. Specifically, since Σ_{p̂∈P̂} w(p) p̂ ∈ ℝ^{d+1} is spanned by (i.e., is a linear combination of) the points of P̂, there exists a set Ĉ ⊆ P̂ of |Ĉ| = d + 1 points and a weight function û : Ĉ → ℝ such that

Σ_{p̂∈Ĉ} û(p̂) p̂ = Σ_{p̂∈P̂} w(p) p̂ = ( Σ_{p∈P} w(p) pᵀ | Σ_{p∈P} w(p) )ᵀ.   (1)

The set Ĉ can be computed as explained in Section A. Here we assume |P| > d + 1; otherwise we let (C, u) = (P, w) be our coreset, since P is already small. More generally, it is not hard to verify that |C| can be as small as the rank of the matrix whose rows are the points of P̂. Let C = {p ∈ P | p̂ ∈ Ĉ} and u : C → ℝ such that u(p) = û(p̂) for every p ∈ C. It now follows that

Σ_{p∈C} u(p) (pᵀ | 1)ᵀ = Σ_{p̂∈Ĉ} û(p̂) p̂ = ( Σ_{p∈P} w(p) pᵀ | Σ_{p∈P} w(p) )ᵀ,   (2)

where the first equality is by the definition of u and C, and the second equality is by (1). By (2) we obtain that

Σ_{p∈C} u(p) p = Σ_{p∈P} w(p) p  and  Σ_{p∈C} u(p) = Σ_{p∈P} w(p).   (3)

Hence, the weighted sum of (p − x) over p ∈ C is

f_loss((C, u), x) = Σ_{p∈C} u(p)(p − x) = Σ_{p∈C} u(p) p − x Σ_{p∈C} u(p) = Σ_{p∈P} w(p) p − x Σ_{p∈P} w(p) = Σ_{p∈P} w(p)(p − x) = f_loss(P′, x),

where the third equality is by (3). Observe that by (3), the sum of the weights of the coreset is the same as the sum of the original weights. We thus have that C′ = (C, u) is an accurate coreset of size |C| ≤ d + 1, which is also a subset of the input set P, for the given query space.
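The spanning argument above can be sketched in NumPy as follows. The greedy rank-based row selection below is a simple stand-in for the procedure of Section A (not the authors' implementation, and not the fastest option):

```python
import numpy as np

def vectors_sum_subset_coreset(P, w):
    """Subset coreset (C, u) with sum_C u(p) * p = sum_P w(p) * p and
    sum_C u(p) = sum_P w(p), where |C| <= d + 1 (Section 3.3.1)."""
    n, d = P.shape
    P_hat = np.hstack([P, np.ones((n, 1))])   # lift: p -> (p | 1)
    target = w @ P_hat                        # sum_p w(p) * p_hat
    # Greedily collect rows of P_hat that are linearly independent;
    # they span the row space of P_hat, hence also the target vector.
    idx, rank = [], 0
    for i in range(n):
        if np.linalg.matrix_rank(P_hat[idx + [i]]) > rank:
            idx.append(i)
            rank += 1
        if rank == d + 1:
            break
    # Express the target in the chosen basis (exact up to round-off).
    u, *_ = np.linalg.lstsq(P_hat[idx].T, target, rcond=None)
    return P[idx], u
```

The last coordinate of the lifted points enforces equation (3): the returned weights automatically sum to the total input weight.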
3.3.2 Subset coreset of bounded positive weights

In the coreset of the previous section, the coreset's weights might be both negative and unbounded, even if the weights of the input points are bounded. This may cause serious numerical issues, as explained in (Maalouf et al., 2019). In this section we prove that if the input weights are non-negative and sum to one, i.e., w : P → [0, 1] and Σ_{p∈P} w(p) = 1, then there is an accurate coreset C′ = (C, u), where C ⊆ P consists of |C| = d + 2 points (larger by one than the previous coreset), but with a non-negative weight function u : C → [0, 1] which also sums to one, i.e., Σ_{p∈C} u(p) = 1, instead of the previous unbounded weight function. This means that the weights are both non-negative and cannot be arbitrarily large, which reduces numerical issues. We then show how to naturally extend this result to the case where the input weights are non-negative but do not necessarily sum to one. In this generalized case, we compute an accurate coreset (C, u), also of size |C| ≤ d + 2, such that u : C → [0, Σ_{p∈P} w(p)] and Σ_{p∈P} w(p) = Σ_{p∈C} u(p).

Input weights are non-negative and sum to one. Let p̂ = (pᵀ | 1)ᵀ ∈ ℝ^{d+1} for every p ∈ P, and let P̂ = {p̂ | p ∈ P} as in the previous example. Observe that if P is a set of points on a line, its mean z must lie in the interval between the rightmost and leftmost points p and q, respectively, of P. This implies that the mean is a convex combination of p and q, i.e., z = w₁p + w₂q for some w₁, w₂ ≥ 0 such that w₁ + w₂ = 1. For a set P of points in the plane, the mean of P lies inside the convex hull of P, which is the smallest polygon that contains P, and there must be a triangle whose vertices are in P that also contains z; see Fig. 4.

More generally, Caratheodory's Theorem (Carathéodory, 1907; Cook & Webster, 1972) states that if a point z lies inside the convex hull of a set P̂ ⊆ ℝ^{d′}, then z also lies inside the convex hull of (i.e., is a convex combination of) at most d′ + 1 points of P̂. When the input weights sum to one, i.e., Σ_{p∈P} w(p) = 1, the weighted mean Σ_{p̂∈P̂} w(p) p̂ of P̂ lies inside the convex hull of P̂. Therefore, since P̂ ⊆ ℝ^{d′} = ℝ^{d+1} and each point p̂ is given a weight of w(p) where Σ_{p̂∈P̂} w(p) = 1, by Caratheodory's theorem there is a subset Ĉ ⊆ P̂ with |Ĉ| = d′ + 1 = d + 2 and û : Ĉ → [0, 1] such that

Σ_{p̂∈Ĉ} û(p̂) p̂ = Σ_{p̂∈P̂} w(p) p̂ = ( Σ_{p∈P} w(p) pᵀ | Σ_{p∈P} w(p) )ᵀ,   (4)

where the second equality is by the definition of p̂. Ĉ and û can be computed as explained in Section 4. Let C = {p ∈ P | p̂ ∈ Ĉ} and u : C → [0, 1] such that u(p) = û(p̂) for every p ∈ C. It now follows that

Σ_{p∈C} u(p) (pᵀ | 1)ᵀ = Σ_{p̂∈Ĉ} û(p̂) p̂ = ( Σ_{p∈P} w(p) pᵀ | Σ_{p∈P} w(p) )ᵀ,   (5)

where the first equality holds by the definition of u and C, and the second equality holds by (4). We now have that

f_loss((C, u), x) = Σ_{p∈C} u(p)(p − x) = Σ_{p∈C} u(p) p − x Σ_{p∈C} u(p) = Σ_{p∈P} w(p) p − x Σ_{p∈P} w(p) = f_loss((P, w), x),

where the third equality is by (5). Hence, we obtain that C′ = (C, u) is an accurate coreset for P′ = (P, w), where C is of size |C| = d + 2, and its weight function u is non-negative and sums to one over the points of C.

Generalized case of non-negative weights. We first remind the reader that p̂ = (pᵀ | 1)ᵀ for every p ∈ P, and P̂ = {p̂ | p ∈ P}.
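The Caratheodory step used above can be sketched as follows: a NumPy implementation of the classic iterative proof for generic points and convex weights (applied in the paper to the lifted points p̂; this is a sketch, not the optimized implementation referenced in Section 4):

```python
import numpy as np

def caratheodory(Q, lam, tol=1e-12):
    """Caratheodory's theorem: given points Q (m x D) and nonnegative
    weights lam summing to one, return indices of at most D + 1 points
    and new convex weights with the same weighted sum."""
    Q = np.asarray(Q, dtype=float)
    lam = np.array(lam, dtype=float)           # copy; caller's weights untouched
    idx = np.arange(len(Q))
    D = Q.shape[1]
    while len(idx) > D + 1:
        m = len(idx)
        # Find alpha != 0 with sum_i alpha_i Q[i] = 0 and sum_i alpha_i = 0:
        # a null-space vector of the (D + 1) x m matrix below (m > D + 1).
        A = np.vstack([Q[idx].T, np.ones(m)])
        alpha = np.linalg.svd(A)[2][-1]
        pos = alpha > tol
        # Largest step keeping all weights nonnegative.
        t = np.min(lam[idx][pos] / alpha[pos])
        new = lam[idx] - t * alpha
        new[np.argmin(new)] = 0.0              # at least one weight vanishes
        lam[idx] = new
        idx = idx[lam[idx] > 0]
    return idx, lam[idx]
```

Each iteration removes at least one point while preserving both the weighted sum and the total weight, so the loop ends with at most D + 1 support points.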
Since Caratheodory's Theorem cannot be applied to the weighted set (P, w) when its weights do not necessarily sum to one, we apply the following steps: (i) define a new weighted set (P̂, ŵ), where ŵ : P̂ → [0, 1] is such that ŵ(p̂) = w(p) / Σ_{q∈P} w(q) for every p ∈ P; (ii) apply Caratheodory's Theorem to (P̂, ŵ) to compute a weighted set (Ĉ, û) with the same weighted sum as (P̂, ŵ), as explained in Section 4; and (iii) return the weighted set (C, u) such that C = {p ∈ P | p̂ ∈ Ĉ} and u : C → [0, Σ_{p∈P} w(p)], where u(p) = û(p̂) · Σ_{q∈P} w(q). Similarly to the proof of the previous case, it is easy to verify that u(p) ∈ [0, Σ_{q∈P} w(q)] for every p ∈ C, that Σ_{p∈P} w(p) = Σ_{p∈C} u(p), and that for every x ∈ ℝ^d,

f_loss((C, u), x) = f_loss((P, w), x).

See the implementation of the function vectors sum 3 in (Jubran et al., 2019).

Figure 4: Caratheodory's Theorem. A set P ⊆ ℝ² (blue points), its mean z (red point), and the convex hull of P (green segments). The mean z is contained inside the convex hull of P. There exists a set P̂ ⊆ P of |P̂| = d + 1 = 3 points (bigger blue points) such that z lies inside the convex hull of P̂ (the black lines).

3.4 1-Mean queries ★

Trying to minimize the distance from our shop to the farthest client is very sensitive to what are called outliers, in the sense that the location of a single client p (e.g., p approaching infinity) may significantly change our cost function f_loss((P, w), x). A less sensitive cost function may instead select the location that minimizes the average sum of squared (least mean squared, LMS) distances from the clients to our shop. To this end, let P ⊆ ℝ be a set of |P| = n numbers, X = ℝ, f(p, x) = (p − x)², and loss(v) = (1/n)‖v‖₁ = (1/n) Σ_{i=1}^n v_i for every v = (v₁, ..., v_n) ∈ ℝⁿ.
The name "1-mean queries" (or 1-mean coreset) was given since the mean of P is the query x ∈ X that minimizes this cost. We wish to be able to compute the average sum of squared distances f_loss(P, x) = (1/n) Σ_{p∈P} (p − x)² in O(1) time for a (currently unknown) location x ∈ X which will be given tomorrow, after a pre-processing time of O(n) today; see Fig. 5. This can be done by observing that

f_loss(P, x) = (1/n) Σ_{p∈P} (p − x)² = (1/n) Σ_{p∈P} (p² − 2xp + x²) = (1/n) Σ_{p∈P} p² + x² − 2x · (1/n) Σ_{p∈P} p.

Figure 5: 1-mean queries. A set of resident locations, marked by the same humans in each of the two images. We are given O(n) time to pre-process the resident locations. Then, given two potential locations for a shop as in the left or right images, we need to select the location that minimizes the average sum of squared distances (blue lines / red lines) in O(1) time.

Hence, to evaluate the average sum of squared distances from P to x, all we need is to store the (coreset) C = {(1/n) Σ_{p∈P} p², (1/n) Σ_{p∈P} p}, which consists of two numbers. Clearly, C can be computed in O(n) time, and using C we can compute f_loss(P, x) exactly for every number x ∈ X = ℝ by defining a new function h : P(ℝ) × ℝ → ℝ where h({a, b}, x) = a + x² − 2xb. Hence, h(C, x) = f_loss(P, x) for every x ∈ ℝ.

A similar solution holds for a set P in ℝ^d and X = ℝ^d for any d ≥ 1, and for any w : P → ℝ, as

f_loss((P, w), x) = Σ_{p∈P} w(p)‖p − x‖² = Σ_{p∈P} w(p)‖p‖² + ‖x‖² Σ_{p∈P} w(p) − 2xᵀ Σ_{p∈P} w(p) p,

where we assume that every p ∈ P and x ∈ ℝ^d is a column vector.
Here, by letting C = { Σ_{p∈P} w(p)‖p‖², Σ_{p∈P} w(p), Σ_{p∈P} w(p) p } contain two numbers and the weighted sum vector of P, and modifying h as h({a, b, c}, x) = a + ‖x‖² · b − 2xᵀc for every a, b ≥ 0 and c ∈ ℝ^d, we obtain f_loss((P, w), x) = h(C, x) for every x ∈ ℝ^d. This set C contains the second (variance), first (center of mass, or mean) and zeroth moments of P. Unlike in previous sections, in this example the coreset C is not a subset of the input data, and we also use a new cost function.

3.4.1 Subset coreset

In this section we wish to construct a coreset for 1-mean queries which uses the same cost function and is also a subset of the input. For every p ∈ P, let p̂ = (pᵀ | ‖p‖² | 1)ᵀ be a corresponding vector in ℝ^{d+2}, and let P̂ = {p̂ | p ∈ P} be the union of these vectors. Since the weighted sum Σ_{p̂∈P̂} w(p) p̂ is spanned by P̂ (i.e., it is a linear combination of a subset of P̂), there is a subset Ĉ ⊆ P̂ of at most |Ĉ| = d + 2 points with a corresponding weight function û : Ĉ → ℝ such that

Σ_{p̂∈Ĉ} û(p̂) p̂ = Σ_{p̂∈P̂} w(p) p̂.   (6)

The set Ĉ can be computed as explained in Section A. Let C = {p ∈ P | p̂ ∈ Ĉ}, and let u : C → ℝ such that u(p) = û(p̂) for every p ∈ C. We now have that

( Σ_{p∈C} u(p) pᵀ | Σ_{p∈C} u(p)‖p‖² | Σ_{p∈C} u(p) )ᵀ = Σ_{p̂∈Ĉ} û(p̂) p̂ = Σ_{p̂∈P̂} w(p) p̂ = ( Σ_{p∈P} w(p) pᵀ | Σ_{p∈P} w(p)‖p‖² | Σ_{p∈P} w(p) )ᵀ,   (7)

where the first equality is by the definitions of C and u, the second equality is by (6), and the last equality is by the definition of P̂.
Therefore, for every x ∈ ℝ^d, we have that

f_loss((C, u), x) = Σ_{p∈C} u(p)‖p − x‖² = Σ_{p∈C} u(p)‖p‖² + ‖x‖² · Σ_{p∈C} u(p) − 2xᵀ Σ_{p∈C} u(p) p = Σ_{p∈P} w(p)‖p‖² + ‖x‖² · Σ_{p∈P} w(p) − 2xᵀ Σ_{p∈P} w(p) p = f_loss((P, w), x),

where the third derivation holds by (7). Unlike in the previous case, here the coreset is simply a scaled (weighted) subset of P, and the cost function f_loss is the same as for the input.

A natural question that comes to mind at this point is: "can we have a subset which is not weighted, i.e., w(p) = 1 for every p ∈ P?" Probably not, since this would imply that the mean of P is the mean of a (non-weighted) subset of P, which cannot hold in general, even for a set P of 3 points on a line. Nevertheless, we can construct such a coreset that yields approximated answers to 1-mean queries (Inaba, Katoh, & Imai, 1994). For the case of an exact solution, we can still bound the weights as follows.

3.4.2 Subset coreset of bounded weights

In Section 3.4.1 we used a linear combination of d + 2 points from P̂ to represent the weighted sum Σ_{p̂∈P̂} w(p) p̂. Instead, if the input weights are non-negative, i.e., w : P → [0, ∞), we can apply Caratheodory's theorem, similarly to Section 3.3.2, to compute a subset Ĉ ⊆ P̂ of size |Ĉ| = d + 3, along with a weight function û : Ĉ → [0, Σ_{p∈P} w(p)] that satisfies Σ_{p̂∈Ĉ} û(p̂) = Σ_{p∈P} w(p) and

Σ_{p̂∈Ĉ} û(p̂) p̂ = Σ_{p̂∈P̂} w(p) p̂.

Figure 6: A signal P = (t₁ | p₁ᵀ), ..., (t₄ | p₄ᵀ) (red dots), where t₁ = 1, t₂ = 2.2, t₃ = 4, and t₄ = 5, and a 1-segment g : ℝ → ℝ² (blue line). The cost f_loss((P, 1), g) is the sum over the squared vertical distances (segments in green) ‖p_i − g(t_i)‖² for every i ∈ {1, ..., 4}.
Now, by defining $C = \{p \in P \mid \hat{p} \in \hat{C}\}$ and $u(p) = \hat{u}(\hat{p})$ for every $p \in C$, we obtain for every $x \in \mathbb{R}^d$ that $f_{loss}((C,u),x) = f_{loss}((P,w),x)$. See the implementation of the function one_mean_3 in (Jubran et al., 2019).

3.5 Coreset for 1-segment queries

As stated in (Rosman, Volkov, Feldman, Fisher III, & Rus, 2014), there is an increasing demand for systems that learn from long-term, high-dimensional data streams. Examples include video streams from wearable cameras, mobile sensors, GPS data, financial data, audio signals, and many more. In such data, a time instance is usually represented as a high-dimensional signal, for example location vectors, stock prices, or image content feature histograms. In other words, such data is usually represented as a set of linear segments. Fast and real-time algorithms for summarization and segmentation of such large streams are of great importance, and can be made possible by compressing the input signals into a compact, meaningful representation, which we call a coreset for 1-segment.

Let $(P,w)$ be a weighted set where $P = \{(t_i \mid p_i^T)\}_{i=1}^n \subseteq \mathbb{R}^{d+1}$ represents a (discrete) signal; for every $i \in [n]$, $t_i \in \mathbb{R}$ is a time stamp and $p_i \in \mathbb{R}^d$ is a point, and $w: P \to [0,\infty)$. For simplicity we abuse notation and use $w(p)$ to denote $w((t \mid p^T))$ for every $(t \mid p^T) \in P$. Let $X = \{g \mid g: \mathbb{R} \to \mathbb{R}^d\}$ be the set of all 1-segments, let $f((t \mid p^T), g) = \|p - g(t)\|^2$ for every $(t \mid p^T) \in P$ and $g \in X$, and let $loss(\cdot) = \|\cdot\|_1$. Therefore, as shown in Fig. 6,

$$f_{loss}((P,w),g) = \sum_{(t\mid p^T)\in P} w(p)\|p - g(t)\|^2.$$

The goal is to compute a weighted set $(C,u)$ that represents a weighted (discrete) signal of size $|C| = d+2$, with $u: C \to [0,\infty)$, such that for every 1-segment $g \in X$ it holds that $f_{loss}((P,w),g) = f_{loss}((C,u),g)$.

Put $g \in X$.
Since $g$ is a linear segment, there exist $a, b \in \mathbb{R}^d$ such that $g(t) = a + b \cdot t$ for every $t \in \mathbb{R}$. Let $X \in \mathbb{R}^{n\times(d+2)}$ be a matrix whose $i$th row is $\sqrt{w(p_i)} \cdot (1 \mid t_i \mid p_i^T)$ for every $i \in [n]$. Let $U\Sigma V^T$ be the thin SVD of $X$ (see Section A), and let $u \in \mathbb{R}^{d+2}$ be the leftmost column of $\Sigma V^T$. Let $c = \frac{\|u\|^2}{d+2}$ and let $Z \in \mathbb{R}^{(d+2)\times(d+2)}$ be an orthogonal matrix such that

$$Zu = (\sqrt{c}, \dots, \sqrt{c})^T \in \mathbb{R}^{d+2}, \quad (8)$$

i.e., $Z$ can be regarded as a rotation matrix that rotates $u$ to the vector $(\sqrt{c},\dots,\sqrt{c})^T$. Such a matrix $Z$ exists since $\|(\sqrt{c},\dots,\sqrt{c})\| = \|u\|$. Let $B \in \mathbb{R}^{(d+2)\times(d+1)}$ be the $(d+1)$ rightmost columns of $\frac{Z\Sigma V^T}{\sqrt{c}}$. Combining the definitions of $u$, $B$, and the fact that $Zu = (\sqrt{c},\dots,\sqrt{c})^T$ yields that

$$Z\Sigma V^T = \begin{pmatrix} \sqrt{c} & \\ \vdots & \sqrt{c}\,B \\ \sqrt{c} & \end{pmatrix}. \quad (9)$$

Let $C \subseteq \mathbb{R}^{d+1}$ be the union of rows of $B$, and let $u: C \to [0,\infty)$ be such that $u(p) = c$ for every $p \in C$. Then

$$f_{loss}((P,w),g) = \sum_{(t\mid p^T)\in P} w(p)\|p-g(t)\|^2 = \sum_{(t\mid p^T)\in P} w(p)\|a + b\cdot t - p\|^2 \quad (10)$$
$$= \sum_{(t\mid p^T)\in P} \left\|\sqrt{w(p)}(a + b\cdot t - p)\right\|^2 = \left\| \begin{pmatrix} \sqrt{w(p_1)}(1\mid t_1\mid p_1^T) \\ \vdots \\ \sqrt{w(p_n)}(1\mid t_n\mid p_n^T) \end{pmatrix} \begin{pmatrix} a^T \\ b^T \\ -I \end{pmatrix} \right\|^2$$
$$= \left\| U\Sigma V^T \begin{pmatrix} a^T \\ b^T \\ -I \end{pmatrix} \right\|^2 = \left\| Z\Sigma V^T \begin{pmatrix} a^T \\ b^T \\ -I \end{pmatrix} \right\|^2 = \left\| \begin{pmatrix} \sqrt{c} & \\ \vdots & \sqrt{c}\,B \\ \sqrt{c} & \end{pmatrix}\begin{pmatrix} a^T \\ b^T \\ -I \end{pmatrix} \right\|^2 \quad (11)$$
$$= c\left\| \begin{pmatrix} 1 & \\ \vdots & B \\ 1 & \end{pmatrix}\begin{pmatrix} a^T \\ b^T \\ -I \end{pmatrix} \right\|^2 = \sum_{(t\mid p^T)\in B} c\,\|a + b\cdot t - p\|^2 = f_{loss}((C,u),g),$$

where $(t\mid p^T) \in B$ denotes a row $(t\mid p^T)$ of $B$, which is the concatenation of a scalar $t \in \mathbb{R}$ and $p^T \in \mathbb{R}^d$; the first derivation in (11) is by the definition of $U\Sigma V^T = X$, the second derivation in (11) holds since $U$ and $Z$ are orthogonal matrices, and the last derivation in (11) is by (9). Therefore, the coreset $(C,u)$ is an accurate coreset for the given query space.
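The construction above can be sketched directly in NumPy. This is a hedged sketch with synthetic data, not the authors' released code; in particular, building $Z$ via a Householder reflection is one possible choice of orthogonal matrix satisfying (8), and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
t = rng.uniform(0, 10, n)              # time stamps t_i
P = rng.normal(size=(n, d))            # points p_i
w = rng.uniform(0.5, 2.0, n)           # positive weights w(p_i)

# X in R^{n x (d+2)}: i-th row is sqrt(w_i) * (1 | t_i | p_i^T)
X = np.sqrt(w)[:, None] * np.column_stack([np.ones(n), t, P])

U, sig, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD of X
SV = np.diag(sig) @ Vt                 # Sigma V^T, a (d+2) x (d+2) matrix
u = SV[:, 0]                           # leftmost column of Sigma V^T
c = (u @ u) / (d + 2)

# Orthogonal Z with Z u = (sqrt(c), ..., sqrt(c))^T, built as the
# Householder reflection that swaps the two unit directions (a
# reflection is orthogonal, which is all the construction needs).
a = u / np.linalg.norm(u)
b = np.full(d + 2, np.sqrt(c)) / np.linalg.norm(u)   # normalized target
v = a - b
Z = np.eye(d + 2) - 2 * np.outer(v, v) / (v @ v)

B = (Z @ SV)[:, 1:] / np.sqrt(c)       # (d+1) rightmost columns of Z Sigma V^T / sqrt(c)
tc, Pc = B[:, 0], B[:, 1:]             # coreset signal: d+2 rows (t | p^T), each of weight c

# sanity check: equal cost for a random 1-segment g(t) = a0 + b0 * t
a0, b0 = rng.normal(size=d), rng.normal(size=d)
full = np.sum(w * np.sum((a0 + np.outer(t, b0) - P) ** 2, axis=1))
core = c * np.sum((a0 + np.outer(tc, b0) - Pc) ** 2)
assert np.isclose(full, core)
```

Note that the coreset has only $d+2$ rows regardless of $n$, and the final assertion checks the exact (not approximate) equality proved in (10)-(11).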
See the implementation of the function one_segment in (Jubran et al., 2019).

3.6 Coreset for Matrix 2-norm

A common approach to reducing the dimension of a high-dimensional data set $P$ in $\mathbb{R}^d$ is to project the vectors of $P$ (database records) onto some low $k$-dimensional affine subspace ($k$ is usually much smaller than $d$); for example, a subspace that minimizes the sum of squared distances ($\ell_2$ norm) to these input vectors, possibly under some constraints. Example algorithms include Principal Component Analysis (PCA), low-rank approximation ($k$-rank SVD), Latent Dirichlet Allocation (LDA), and non-negative matrix factorization (NNMF). Learning algorithms such as $k$-means clustering can then be applied to the low-dimensional data to obtain faster approximations with provable guarantees (Feldman, Schmidt, & Sohler, 2013). Dimensionality reduction is also used to avoid overfitting: a small number of features usually implies faster running/classification times and simpler models. Furthermore, a smaller dimension means faster training, less storage, fewer redundant features, and many more advantages. However, dimensionality reduction algorithms may be both time and space consuming. Therefore, to boost the running time of such algorithms, we can use accurate coresets as follows.

Let $(P,w)$ be a weighted set where $P = \{p_1,\dots,p_n\}$ is a set of $n$ points in $\mathbb{R}^d$ and $w: P \to [0,\infty)$ is a non-negative weight function. Let $X = \mathbb{R}^d$, and let $f(p,x) = (p^Tx)^2$, where $p^Tx = \langle p, x\rangle$ is the inner product of $p$ and $x$, for every $p \in \mathbb{R}^d$ and $x \in X$, and $loss(\cdot) = \|\cdot\|_1$. If we define $A$ as an $n \times d$ matrix whose $i$th row is $\sqrt{w(p_i)}\,p_i^T$, then

$$f_{loss}((P,w),x) = \left\|\left(w(p_1)(p_1^Tx)^2, \dots, w(p_n)(p_n^Tx)^2\right)\right\|_1 = \sum_{p\in P} w(p)(p^Tx)^2 = \|Ax\|^2.$$

We aim to compute a matrix $\mathcal{C} \in \mathbb{R}^{d\times d}$ such that $\|Ax\|^2 = \|\mathcal{C}x\|^2$.
This can be done by letting $Q \in \mathbb{R}^{n\times d}$ be a matrix with orthogonal columns ($Q^TQ = I$) that span the $d$ columns of $A$, e.g., via Gram-Schmidt (also known as the $A = QR$ decomposition, as shown in Fig. 7), or via the thin Singular Value Decomposition ($A = U_rD_rV_r^T$, as shown in Figs. 8-9). By letting $\mathcal{C} = R$ be the $d\times d$ matrix whose columns correspond to the columns of $A$ in the basis $Q$, we obtain $A = QR = Q\mathcal{C}$. Since $Q$ has orthogonal columns, we have $\|Qx\| = \|x\|$ for every $x \in \mathbb{R}^d$. Therefore, by defining $C \subseteq \mathbb{R}^d$ to contain the rows of $\mathcal{C}$, we obtain that

$$f_{loss}((P,w),x) = \|Ax\|^2 = \|Q\mathcal{C}x\|^2 = \|\mathcal{C}x\|^2 = f_{loss}((C,\mathbf{1}),x). \quad (12)$$

Note that, without loss of generality, we can assume that $\|x\| = 1$, i.e., that $X$ is a set of only unit vectors. This is because for every vector $y \in \mathbb{R}^d$ there are $c \ge 0$ and a unit vector $x = y/\|y\|$ such that

$$f_{loss}((P,w),y) = \|Ay\|^2 = \|Acx\|^2 = c^2\|Ax\|^2 = c^2\|\mathcal{C}x\|^2 = \|\mathcal{C}cx\|^2 = \|\mathcal{C}y\|^2 = f_{loss}((C,\mathbf{1}),y).$$

Geometrically, if $x$ is a unit vector, $f(p,x) = (p^Tx)^2$ is the squared distance between a point $p \in \mathbb{R}^d$ and the hyperplane that intersects the origin and is orthogonal to $x$. More generally, the coreset $(C,\mathbf{1})$ from (12) can be used to compute the weighted sum of squared distances $\mathrm{dist}^2((P,w),S)$ from the points of $P$ to a $j$-dimensional subspace of $\mathbb{R}^d$ that is spanned by the orthonormal columns of a matrix $S \in \mathbb{R}^{d\times j}$, i.e., $S^TS = I$. Let $S^\perp \in \mathbb{R}^{d\times(d-j)}$ denote a matrix whose columns span the subspace orthogonal to $S$, i.e., $[S \mid S^\perp]^T[S \mid S^\perp] = I$. Observe that by the definition of $S^\perp$ we have $\mathrm{dist}^2((P,w),S) = \|AS^\perp\|_F^2$. Then, by the Pythagorean theorem,

$$\mathrm{dist}^2((P,w),S) = \|AS^\perp\|_F^2 = \sum_{i=1}^{d-j}\|AS^\perp_{*i}\|^2 = \sum_{i=1}^{d-j}\|\mathcal{C}S^\perp_{*i}\|^2 = \mathrm{dist}^2((C,\mathbf{1}),S), \quad (13)$$

where $S^\perp_{*i}$ denotes the $i$th column of $S^\perp$, and where the third equality is by the coreset property in (12).
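A minimal NumPy sketch of this construction, with synthetic data (a hedged illustration under our own variable names, not the paper's released code): the $d \times d$ factor $R$ of a QR-decomposition answers both the quadratic-form queries in (12) and the subspace-distance queries in (13) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, j = 1000, 4, 2
P = rng.normal(size=(n, d))
w = rng.uniform(0.1, 1.0, n)

A = np.sqrt(w)[:, None] * P        # n x d, i-th row sqrt(w_i) p_i^T
Q, R = np.linalg.qr(A)             # A = QR; the d x d factor R is the coreset

# coreset property (12): ||Ax||^2 = ||Rx||^2 for every x
x = rng.normal(size=d)
assert np.isclose(np.sum(w * (P @ x) ** 2), np.sum((R @ x) ** 2))

# the same d rows also answer subspace-distance queries as in (13)
S, _ = np.linalg.qr(rng.normal(size=(d, j)))   # orthonormal basis of a j-dim subspace
S_perp = np.linalg.svd(S.T)[2][j:].T           # orthonormal basis of its complement
dist_full = np.linalg.norm(A @ S_perp) ** 2    # squared Frobenius norm
dist_core = np.linalg.norm(R @ S_perp) ** 2
assert np.isclose(dist_full, dist_core)
```

The compression is from $n$ weighted points to $d$ rows, independent of $n$, and the equalities are exact rather than approximate.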
Figure 7: QR-Decomposition. Left: given a set of linearly independent vectors $\{a_1, a_2\} \subseteq \mathbb{R}^2$, the goal is to generate an orthogonal set $\{e_1, e_2\} \subseteq \mathbb{R}^2$ that spans the same subspace as the given set. This process is also called the Gram-Schmidt algorithm in linear algebra. Right: the result is a QR-decomposition of the matrix $A$, whose columns are $a_1, a_2$, into an orthogonal matrix $Q$ (i.e., $Q^TQ = I_2$) and an upper triangular matrix $R$.

3.6.1 Subset coreset of bounded weights

Recall that $A \in \mathbb{R}^{n\times d}$ is a matrix whose rows are the weighted points $\{\sqrt{w(p_i)}\,p_i\}_{i=1}^n$. The set $C \subseteq \mathbb{R}^d$ from the previous example contains $d$ points in $\mathbb{R}^d$, but they are not a subset of the input set $P$. This has a few disadvantages that were discussed in Section 1. We now show how to obtain such a weighted set $(C,u)$ where $C$ is a subset of $P$, and every point $p \in C$ is assigned a non-negative weight $u(p) \in [0,\infty)$.

Observe that for every $x \in \mathbb{R}^d$,

$$f_{loss}((P,w),x) = \|Ax\|^2 = x^TA^TAx = x^T\left(\sum_{p\in P} w(p)pp^T\right)x.$$

For every $p \in P$, the $d\times d$ matrix $pp^T$ corresponds to a vector $\hat{p} \in \mathbb{R}^{d^2}$ obtained by concatenating its rows. Hence, $\sum_{p\in P} w(p)\,\mathrm{vec}(pp^T) = \sum_{p\in P} w(p)\hat{p}$, where $\mathrm{vec}(M) \in \mathbb{R}^{d^2}$ is a row stacking of the matrix $M \in \mathbb{R}^{d\times d}$. Let $\hat{P} = \{\hat{p} \mid p \in P\}$ and let $\hat{w}(\hat{p}) = \frac{w(p)}{\sum_{q\in P} w(q)}$ for every $\hat{p} \in \hat{P}$. Since $\hat{P} \subseteq \mathbb{R}^{d^2}$ and the weighted sum of $(\hat{P},\hat{w})$ lies inside the convex hull of $\hat{P}$, by applying Caratheodory's theorem to the weighted set $(\hat{P},\hat{w})$ we obtain that there is a subset $\hat{C} \subseteq \hat{P}$ of size $|\hat{C}| = d^2+1$ and a weight function $\hat{u}: \hat{C} \to [0,1]$ such that

$$\frac{1}{\sum_{p\in P} w(p)}\sum_{\hat{p}\in\hat{P}} w(p)\hat{p} = \sum_{\hat{p}\in\hat{P}} \hat{w}(\hat{p})\hat{p} = \sum_{\hat{p}\in\hat{C}} \hat{u}(\hat{p})\hat{p}. \quad (14)$$

Multiplying (14) by $\sum_{p\in P} w(p)$, we obtain

$$\sum_{\hat{p}\in\hat{P}} w(p)\hat{p} = \sum_{p\in P} w(p) \cdot \sum_{\hat{p}\in\hat{C}} \hat{u}(\hat{p})\hat{p}. \quad (15)$$

Figure 8: Singular Value Decomposition.
Left: a Singular Value Decomposition $UDV^T$ of a matrix $A \in \mathbb{R}^{m\times n}$, where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $D \in \mathbb{R}^{m\times n}$ is a diagonal matrix that contains the singular values of $A$. Right: visualization of the SVD of a matrix $A \in \mathbb{R}^{2\times 2}$. The matrix $A$ distorts the unit disc into an ellipse. The SVD is a decomposition of $A$ into three simple transformations: an initial rotation (possibly with reflection) $V^T$, a scaling $D$ along the coordinate axes, and a final rotation (possibly with reflection) $U$. The lengths $\sigma_1$ and $\sigma_2$ of the semi-axes of the ellipse are the singular values of $A$, namely $d_1$ and $d_2$. Illustrations taken from (Wikipedia contributors, 2019).

Figure 9: Thin Singular Value Decomposition. A thin Singular Value Decomposition (thin SVD) $U_rD_rV_r^T$ of a matrix $A \in \mathbb{R}^{m\times n}$ of rank $r \le n$, where $U_r \in \mathbb{R}^{m\times r}$, $V_r \in \mathbb{R}^{n\times r}$, $U_r^TU_r = I_r$, $V_r^TV_r = I_r$, and $D_r \in \mathbb{R}^{r\times r}$ is a diagonal matrix containing the singular values $d_1 \ge \dots \ge d_r > 0$.

Let $C = \{p \in P \mid \hat{p} \in \hat{C}\}$ and let $u: C \to [0, \sum_{p\in P} w(p)]$ be such that

$$u(p) = \hat{u}(\hat{p}) \cdot \sum_{q\in P} w(q)$$

for every $p \in C$. Combining the definitions of $C$, $u$ and (15) yields

$$\sum_{p\in P} w(p)pp^T = \sum_{p\in P} w(p) \cdot \sum_{p\in C} \hat{u}(\hat{p})pp^T = \sum_{p\in C} u(p)pp^T.$$

Hence, by letting $Z = \left(\sqrt{u(p)}\cdot p\right)^T_{p\in C} \in \mathbb{R}^{(d^2+1)\times d}$ be a matrix whose rows $\sqrt{u(p)}\cdot p^T$ are the weighted (scaled) points of $C$, we obtain that for every $x \in \mathbb{R}^d$

$$f_{loss}((P,w),x) = \|Ax\|^2 = x^TA^TAx = x^T\left(\sum_{p\in P} w(p)pp^T\right)x = x^T\left(\sum_{p\in C} u(p)pp^T\right)x = x^TZ^TZx = \|Zx\|^2 = f_{loss}((C,u),x). \quad (16)$$

Therefore, $(C,u)$ is an accurate coreset for the query space $(P,w,\mathbb{R}^d,f,\|\cdot\|_1)$, where $f(p,x) = (p^Tx)^2$.
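The identity that drives this construction is worth checking numerically: $f_{loss}$ depends on the data only through the $d\times d$ matrix $\sum_p w(p)pp^T$, i.e., through the weighted sum of the $n$ vectors $\hat{p} = \mathrm{vec}(pp^T) \in \mathbb{R}^{d^2}$, which is exactly the sum that Caratheodory's theorem then reduces to $d^2+1$ terms. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
P = rng.normal(size=(n, d))
w = rng.uniform(size=n)

A = np.sqrt(w)[:, None] * P
M = (w[:, None] * P).T @ P                 # sum_p w(p) p p^T, a d x d matrix

# f_loss((P,w),x) = ||Ax||^2 = x^T M x for every x
x = rng.normal(size=d)
assert np.isclose(np.sum((A @ x) ** 2), x @ M @ x)

# M is the weighted sum of the n vectors vec(p p^T) in R^{d^2} -- the
# point set on which Caratheodory's theorem is applied
P_hat = np.array([np.outer(p, p).ravel() for p in P])   # n x d^2
assert np.allclose(w @ P_hat, M.ravel())
```

So any weighted subset of at most $d^2+1$ points whose $\hat{p}$-sum matches $w \cdot \hat{P}$ reproduces $M$, and hence every query value, exactly.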
Similarly to (13) in Section 3.6, for every $j$-dimensional subspace of $\mathbb{R}^d$ that is spanned by the orthonormal columns of a matrix $S \in \mathbb{R}^{d\times j}$, and its orthogonal complement $S^\perp$, we have that

$$\mathrm{dist}^2((P,w),S) = \|AS^\perp\|_F^2 = \sum_{i=1}^{d-j}\|AS^\perp_{*i}\|^2 = \sum_{i=1}^{d-j}\|ZS^\perp_{*i}\|^2 = \mathrm{dist}^2((C,u),S),$$

where the third derivation holds by (16). See the implementation of the function matrix_norm2 in (Jubran et al., 2019).

3.7 Least-Mean-Squares Solvers

Least-mean-squares solvers are very common optimization methods in machine learning and statistics. They are typically used for normalization, spectral clustering, feature selection, prediction, classification, and many more tasks. In this section we define and derive a coreset for such problems, based on the coreset presented in Section 3.6.1.

The corresponding query space $(P,w,\mathbb{R}^d,f,loss)$ for least-mean-squares problems is as follows. Let $(P,w)$ be a weighted set where $P = \{(a_1^T \mid b_1), \dots, (a_n^T \mid b_n)\} \subseteq \mathbb{R}^{d+1}$, $w: P \to [0,\infty)$, and for every $i \in [n]$, $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$. Let $X = \mathbb{R}^d$; for every $(a^T \mid b)^T \in \mathbb{R}^{d+1}$, where $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$, let $f((a^T \mid b), x) = (a^Tx - b)^2$, and define $loss = \|\cdot\|_1$. For simplicity, let $w_i = w((a_i^T \mid b_i))$ for every $i \in [n]$. Therefore,

$$f_{loss}((P,w),x) = \sum_{i=1}^n w_i(a_i^Tx - b_i)^2. \quad (17)$$

To obtain an accurate coreset for the above query space $(P,w,\mathbb{R}^d,f,loss)$, we shall define a new and slightly different query space, and use the accurate coreset from Section 3.6.1 as follows. Let $f_2(p,x) = (p^Tx)^2$ for every $p, x \in \mathbb{R}^{d+1}$. Let $(C,u)$ be an accurate coreset of size $|C| = (d+1)^2+1$ for the new query space $(P,w,\mathbb{R}^{d+1},f_2,\|\cdot\|_1)$, as explained in Section 3.6.1, where $C = \{(\hat{a}_1^T \mid \hat{b}_1), \dots, (\hat{a}_{|C|}^T \mid \hat{b}_{|C|})\} \subseteq P$. For simplicity, define $u_i = u((\hat{a}_i^T \mid \hat{b}_i))$ for every $i \in [|C|]$.
Then for every $x' \in \mathbb{R}^{d+1}$,

$$\sum_{i=1}^n w_i((a_i^T \mid b_i)x')^2 = \sum_{i=1}^{|C|} u_i((\hat{a}_i^T \mid \hat{b}_i)x')^2.$$

Since the last equality holds for every $x' \in \mathbb{R}^{d+1}$, in particular for every $x' = (x^T \mid -1)^T \in \mathbb{R}^{d+1}$ where $x \in \mathbb{R}^d$, we have that

$$f_{loss}((P,w),x) = \sum_{i=1}^n w_i(a_i^Tx - b_i)^2 = \sum_{i=1}^n w_i((a_i^T \mid b_i)x')^2 \quad (18)$$
$$= \sum_{i=1}^{|C|} u_i((\hat{a}_i^T \mid \hat{b}_i)x')^2 = \sum_{i=1}^{|C|} u_i(\hat{a}_i^Tx - \hat{b}_i)^2 = f_{loss}((C,u),x). \quad (19)$$

Hence, $(C,u)$ is an accurate coreset for the query space $(P,w,\mathbb{R}^d,f,loss)$.

Commonly in the field of machine learning, least-mean-squares optimization problems are defined using matrix notation as follows. Let $A = (\sqrt{w_1}a_1 \mid \dots \mid \sqrt{w_n}a_n)^T \in \mathbb{R}^{n\times d}$ and $b = (\sqrt{w_1}b_1, \dots, \sqrt{w_n}b_n)^T \in \mathbb{R}^n$. Least-mean-squares solvers typically aim to minimize

$$g\left(\sum_{i=1}^n w_i(a_i^Tx - b_i)^2\right) + h(x) = g\left(\|Ax-b\|_2^2\right) + h(x) \quad (20)$$

over every $x \in X \subseteq \mathbb{R}^d$. Here, $h: \mathbb{R}^d \to [0,\infty)$ is called a regularization function on the parameters $x$, and it is independent of $(P,w)$, and $g: \mathbb{R} \to \mathbb{R}$ is a real function. See Table 2 for example objective functions.

| Solver name | Objective function | $g(x)$ | $h(x)$ |
|---|---|---|---|
| Linear regression (Bjorck, 1967) | $\|Ax-b\|_2^2$ | $x$ | $0$ |
| Ridge regression (Hoerl & Kennard, 1970) | $\|Ax-b\|_2^2 + \alpha\|x\|_2^2$ | $x$ | $\alpha\|x\|_2^2$ |
| Lasso regression (Tibshirani, 1996) | $\frac{1}{2n}\|Ax-b\|_2^2 + \alpha\|x\|_1$ | $\frac{x}{2n}$ | $\alpha\|x\|_1$ |
| Elastic-Net regression (Zou & Hastie, 2005) | $\frac{1}{2n}\|Ax-b\|_2^2 + \rho\alpha\|x\|_2^2 + \frac{(1-\rho)}{2}\alpha\|x\|_1$ | $\frac{x}{2n}$ | $\rho\alpha\|x\|_2^2 + \frac{(1-\rho)}{2}\alpha\|x\|_1$ |

Table 2: Example solvers that aim to minimize objective functions of the form (20). Each solver gets a matrix $A \in \mathbb{R}^{n\times d}$ and a vector $b \in \mathbb{R}^n$, and aims to compute $x \in \mathbb{R}^d$ that minimizes the objective function. Additional given regularization parameters include $\alpha > 0$ and $\rho \in [0,1]$.
Hence, just as we defined the matrix $A$ and the vector $b$ based on the weighted set $(P,w)$, we define the matrix $Z$ and the vector $v$ based on the coreset $(C,u)$ of the query space $(P,w,\mathbb{R}^d,f,loss)$: let $m = (d+1)^2+1$, let $Z \in \mathbb{R}^{m\times d}$ be the matrix whose $i$th row is $\sqrt{u_i}\,\hat{a}_i^T$, and let $v = (\sqrt{u_1}\hat{b}_1, \dots, \sqrt{u_{|C|}}\hat{b}_{|C|})^T \in \mathbb{R}^m$. Now, as desired, the family of functions from (20) satisfies

$$g\left(\|Ax-b\|_2^2\right) + h(x) = g\left(\sum_{i=1}^n w_i(a_i^Tx-b_i)^2\right) + h(x) = g\left(\sum_{i=1}^m u_i(\hat{a}_i^Tx-\hat{b}_i)^2\right) + h(x) = g\left(\|Zx-v\|_2^2\right) + h(x).$$

See the implementation of the function LMS_solvers in (Jubran et al., 2019).

4 Caratheodory's Theorem

The Caratheodory Theorem (Carathéodory, 1907; Cook & Webster, 1972) is a fundamental result in computational geometry. It states that if a point $x \in \mathbb{R}^d$ lies inside the convex hull of a set $P \subseteq \mathbb{R}^d$ of $|P| = n$ points, i.e., $x$ is a convex combination of the points of $P$, then there is a subset of at most $d+1$ points from $P$ that contains $x$ in its convex hull, i.e., $x$ can be represented as a convex combination of at most $d+1$ points from $P$; see Theorem 3. The proof of Caratheodory's theorem is constructive, and the above subset of $d+1$ points can be computed in $O(n^2d^2)$ time; see Algorithm 1, which implements this constructive proof. An intuition behind the proof of correctness is shown in Fig. 10. The algorithm takes as input a weighted set $(P,w)$ such that $w: P \to [0,1]$ and $\sum_{p\in P} w(p) = 1$, and computes in $O(n^2d^2)$ time a new weighted set $(S,u)$ such that $S \subseteq P$, $|S| \le d+1$, $u: S \to [0,1]$, $\sum_{s\in S} u(s) = 1$, and $\sum_{s\in S} u(s)s = \sum_{p\in P} w(p)p$.

Theorem 3. If a point $x \in \mathbb{R}^d$ is in the convex hull of a set $P \subseteq \mathbb{R}^d$, then $x$ is also in the convex hull of a set of at most $d+1$ points from $P$.
Figure 10: A weighted set $(P,w)$ whose weighted sum is $\sum_{i=1}^4 w(p_i)p_i$ corresponds to four points (in blue) whose weighted sum is the point $x$ (in orange), which we assume is the origin $x = \vec{0}$. Algorithm 1 first computes a weight vector $v = (v_1,\dots,v_4)^T$ such that the weighted sum $\sum_{i=1}^4 v_ip_i$ (red points) is the origin, and the sum of weights is $\sum_{i=1}^4 v_i = 0$. The weights are scaled by $\alpha > 0$ until $\alpha v_i = w(p_i)$ for some $i$ ($i = 1$ in the figure). Every point in the resulting set $\{\alpha v_ip_i\}_{i=1}^n$ (in green) is subtracted from its corresponding point in the input set $\{w(p_i)p_i\}_{i=1}^n$ to obtain the output set $\{u(p_i)p_i\}_{i=1}^n = \{(w(p_i)-\alpha v_i)p_i\}_{i=1}^n$, where $u(p_1) = 0$, so $p_1$ can be removed. Algorithm 1 then continues iteratively with the remaining points until $(P,u)$ has $|P| = d+1$ weighted points. The figure is taken from (Nasser et al., 2015).

Proof of correctness for Theorem 3. The proof of correctness for Theorem 3 follows from the correctness of the procedure CARATHEODORY, and vice versa; see Algorithm 1. The procedure CARATHEODORY takes as input a weighted set $(P,w)$ whose points are denoted by $P = \{p_1,\dots,p_n\}$ and where $\sum_{p\in P} w(p) = 1$. We assume $n > d+1$; otherwise $(S,u) = (P,w)$ is the desired output. Hence, the $n-1 > d$ points $p_2 - p_1, p_3 - p_1, \dots, p_n - p_1$ must be linearly dependent. This implies that there are reals $v_2,\dots,v_n$, not all zeros, such that

$$\sum_{i=2}^n v_i(p_i - p_1) = 0. \quad (21)$$

Algorithm 1: CARATHEODORY$(P,w)$
Input: A weighted set $(P,w)$ of $n$ points in $\mathbb{R}^d$, where $P = \{p_1,\dots,p_n\}$, $w: P \to [0,1]$, and $\sum_{p\in P} w(p) = 1$.
Output: A weighted set $(S,u)$, computed in $O(n^2d^2)$ time, such that $S \subseteq P$, $|S| \le d+1$, $u: S \to [0,1]$, $\sum_{s\in S} u(s) = 1$, and $\sum_{s\in S} u(s)s = \sum_{p\in P} w(p)p$.
1  if $n \le d+1$ then
2      return $(P,w)$
3  for every $i \in \{2,\dots,n\}$ do
4      $a_i := p_i - p_1$    // $p_i$ is the $i$th point of $P$
5  $A := (a_2 \mid \dots \mid a_n)$    // $A \in \mathbb{R}^{d\times(n-1)}$
6  Compute $v = (v_2,\dots,v_n)^T \ne 0$ such that $Av = 0$
7  $v_1 := -\sum_{i=2}^n v_i$
8  $\alpha := \min\left\{\frac{w(p_i)}{v_i} \;\middle|\; i \in \{1,\dots,n\} \text{ and } v_i > 0\right\}$
9  $u(p_i) := w(p_i) - \alpha v_i$ for every $i \in \{1,\dots,n\}$ such that $u(p_i) > 0$
10 $S := \{p_i \mid i \in \{1,\dots,n\} \text{ and } u(p_i) > 0\}$
11 if $|S| > d+1$ then
12     $(S,u) := $ CARATHEODORY$(S,u)$    // recursive call that reduces $S$ by at least 1
13 return $(S,u)$

These reals are computed in Line 6 by solving a system of linear equations. This step dominates the running time of the algorithm and takes $O(nd^2)$ time using, e.g., the SVD, where the desired vector of coefficients $(v_2,\dots,v_n)^T$ is simply the right singular vector that corresponds to the smallest singular value in the SVD of the matrix $M = (p_2-p_1, \dots, p_n-p_1)^T \in \mathbb{R}^{(n-1)\times d}$. The definition

$$v_1 = -\sum_{i=2}^n v_i \quad (22)$$

in Line 7 guarantees that

$$v_j < 0 \text{ for some } j \in [n], \quad (23)$$

and that

$$\sum_{i=1}^n v_ip_i = v_1p_1 + \sum_{i=2}^n v_ip_i = \left(-\sum_{i=2}^n v_i\right)p_1 + \sum_{i=2}^n v_ip_i \quad (24)$$
$$= \sum_{i=2}^n v_i(p_i - p_1) = 0, \quad (25)$$

Figure 11: Overview of the improved Caratheodory algorithm in (Maalouf et al., 2019). Images left to right. Steps (i) and (ii): a partition of the input weighted set of $n = 48$ points (in blue) into $k = 8$ equal clusters (in circles) whose corresponding means are $\mu_1,\dots,\mu_8$ (in red). The mean of $P$ (and of these means) is $x$ (in green). Step (iii): a Caratheodory (sub)set of $d+1 = 3$ points (bold red), with corresponding weights (in green), is computed only for these $k = 8 \ll n$ means. Step (iv): the Caratheodory set is replaced by its corresponding original points (dark blue). The remaining points in $P$ (bright blue) are deleted.
Step (v): previous steps are repeated until only $d+1 = 3$ points remain. This takes $O(\log n)$ iterations for $k = \Theta(d)$. This figure was taken from (Maalouf et al., 2019).

where (24) is by (22) and the second equality in (25) is by (21). Hence, for every $\alpha \in \mathbb{R}$, the weighted sum of $P$ is

$$\sum_{i=1}^n w(p_i)p_i = \sum_{i=1}^n w(p_i)p_i - \alpha\sum_{i=1}^n v_ip_i = \sum_{i=1}^n \left(w(p_i) - \alpha v_i\right)p_i, \quad (26)$$

where the first equality holds since $\sum_{i=1}^n v_ip_i = 0$ by (25). The definition of $\alpha$ in Line 8 guarantees that $\alpha v_{i^*} = w(p_{i^*})$ for some $i^* \in [n]$, and that $w(p_i) - \alpha v_i \ge 0$ for every $i \in [n]$. Hence, the set $S$ defined in Line 10 contains at most $n-1$ points, its weighted sum equals the weighted sum of $(P,w)$, and its set of weights $\{w(p_i) - \alpha v_i\}$ is non-negative. Notice that if $\alpha = 0$, we have that $u(p_k) = w(p_k) > 0$ for some $k \in [n]$; otherwise, by (23), there is $j \in [n]$ such that $u(p_j) = w(p_j) - \alpha v_j > 0$. Hence, $S \ne \emptyset$. The sum of the positive weights is thus equal to the sum of the input weights:

$$\sum_{p_i\in S} u(p_i) = \sum_{i=1}^n \left(w(p_i) - \alpha v_i\right) = \sum_{i=1}^n w(p_i) - \alpha\cdot\sum_{i=1}^n v_i = 1,$$

where the last equality holds by (22) and since the sum of the input weights is 1, i.e., $\sum_{i=1}^n w(p_i) = 1$. This and (26) prove the desired properties of $(S,u)$, which is of size at most $n-1$. In Line 12 we repeat this process recursively until at most $d+1$ points are left in $S$. For $O(n)$ iterations, the overall time is thus $O(n^2d^2)$.

4.1 Faster Construction

As explained in Section 4, Algorithm 1 takes as input a weighted set of $n$ points, and aims to compute a weighted subset of at most $d+1$ points which has the same weighted sum and whose weights sum to one. At each iteration, the algorithm sets at least one of the weights of the remaining points to zero.
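Algorithm 1 translates almost line by line into NumPy. The sketch below is a hedged reimplementation following the pseudocode above, not the authors' released code; the small tolerance used when dropping zero weights is our own numerical safeguard.

```python
import numpy as np

def caratheodory(P, w):
    """Return (S, u): at most d+1 rows of P whose u-weighted sum equals
    the w-weighted sum of P. Assumes w >= 0 and sum(w) == 1."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    while True:
        n, d = P.shape
        if n <= d + 1:                          # Lines 1-2
            return P, w
        A = (P[1:] - P[0]).T                    # Lines 3-5: d x (n-1)
        # Line 6: v != 0 with A v = 0; the last right-singular vector
        # of A lies in its null space since n - 1 > d
        v = np.linalg.svd(A)[2][-1]
        v = np.concatenate([[-v.sum()], v])     # Line 7: weights sum to 0
        alpha = np.min(w[v > 0] / v[v > 0])     # Line 8
        u = w - alpha * v                       # Line 9: new weights >= 0
        keep = u > 1e-12                        # Line 10: drop zeroed points
        P, w = P[keep], u[keep]                 # Lines 11-12: iterate

rng = np.random.default_rng(3)
P = rng.normal(size=(40, 4))                    # n = 40 points in R^4
w = rng.uniform(size=40)
w /= w.sum()                                    # convex weights
S, u = caratheodory(P, w)
assert len(S) <= 5                              # at most d + 1 points
assert np.isclose(u.sum(), 1.0)
assert np.allclose(u @ S, w @ P)                # same weighted sum
```

Each pass removes at least one point (the one attaining the minimum in Line 8), so the loop runs at most $n - d - 1$ times, matching the $O(n)$ iteration bound above.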
Hence, $O(n)$ such iterations are required to compute the required output, for a total running time of $O(n^2d^2)$.

However, it was suggested in (Nasser et al., 2015) that instead of taking as input the entire set of $n$ weighted points at once, it would be more efficient to run Algorithm 1 in a streaming fashion as follows. Start with a set of $d+1$ weighted input points. Then, apply the following process $O(n)$ times: add one new weighted input point to the existing set, and apply Caratheodory's theorem to reduce the $d+2$ points to $d+1$ points in $O(d^3)$ time. The running time of this algorithm is $O(nd^3)$, since it executes the above procedure $n$ times on only $d+2$ points.

Inspired by the streaming fashion of the previous algorithm, a more efficient algorithm was then proposed in (Maalouf et al., 2019). It suggests the following. (i) Partition the input weighted set into $k \in O(n/d)$ subsets $P_1,\dots,P_k$, each of size at most $O(d)$ (specifically, $2d$). (ii) Compute the weighted sum $\mu_i$ of each chunk $P_i$. (iii) Apply the Caratheodory theorem only to the set $\mu = \{\mu_1,\dots,\mu_k\}$ of $k$ weighted sums to obtain a weighted subset $\hat{\mu}$ of $|\hat{\mu}| = d+1$ elements from $\mu$. (iv) Delete every chunk $P_j$ whose weighted mean $\mu_j$ was not chosen by the Caratheodory theorem (i.e., $\mu_j \notin \hat{\mu}$). (v) Continue recursively until only $d+1$ input points remain. This algorithm is illustrated in Fig. 11. The running time of this algorithm is $O(nd + d^4\log n)$; see Theorem 4. The following theorem is a restatement of Theorem 3.1 in (Maalouf et al., 2019) for $k = 2d$.

Theorem 4 (Theorem 3.1 in (Maalouf et al., 2019)). Let $(P,w)$ be a weighted set of $n$ points in $\mathbb{R}^d$ such that $w: P \to [0,1]$ and $\sum_{p\in P} w(p) = 1$.
Then a weighted set $(S,u)$ that satisfies $S \subseteq P$, $|S| \le d+1$, $u: S \to [0,1]$, $\sum_{s\in S} u(s) = 1$, and $\sum_{s\in S} u(s)s = \sum_{p\in P} w(p)p$ can be computed in $O(nd + d^4\log n)$ time.

References

Bjorck, A. (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT Numerical Mathematics, 7(1), 1-21.

Carathéodory, C. (1907). Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1), 95-115.

Cook, W., & Webster, R. (1972). Caratheodory's theorem. Canadian Mathematical Bulletin, 15(2), 293-293.

Feldman, D., Schmidt, M., & Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on discrete algorithms (pp. 1434-1453).

Golub, G., & Van Loan, C. (1996). Matrix computations (3rd ed.). The Johns Hopkins University Press, Baltimore, MD.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

Inaba, M., Katoh, N., & Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on computational geometry (pp. 332-339).

Jubran, I., Maalouf, A., & Feldman, D. (2019). Open source code for all the algorithms presented in this paper. (Link for open-source code.)

Klema, V., & Laub, A. (1980). The singular value decomposition: Its computation and some applications. IEEE Transactions on Automatic Control, 25(2), 164-176.

Maalouf, A., Jubran, I., & Feldman, D. (2019). Fast and accurate least-mean-squares solvers. arXiv preprint arXiv:1906.04705.

Nasser, S., Jubran, I., & Feldman, D. (2015). Coresets for kinematic data: From theorems to real-time systems.
arXiv preprint arXiv:1511.09120.

Paul, R., Feldman, D., Rus, D., & Newman, P. (2014). Visual precis generation using coresets. In 2014 IEEE international conference on robotics and automation (ICRA) (pp. 1304-1311).

Rosman, G., Volkov, M., Feldman, D., Fisher III, J. W., & Rus, D. (2014). Coresets for k-segmentation of streaming data. In Advances in neural information processing systems (pp. 559-567).

Schmidt, E. (1908). Über die Auflösung linearer Gleichungen mit unendlich vielen Unbekannten. Rendiconti del Circolo Matematico di Palermo (1884-1940), 25(1), 53-77.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). SIAM.

Wikipedia contributors. (2019). Singular value decomposition. https://en.wikipedia.org/wiki/Singular_value_decomposition

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

A The Basics of Linear Algebra

In this section we give a brief overview of tools and definitions from linear algebra which are used throughout the paper.

Recall that the $d\times d$ identity matrix is denoted by $I_d$. An orthogonal matrix is a square matrix $M \in \mathbb{R}^{d\times d}$ whose columns and rows are orthogonal unit vectors, i.e., $M^TM = MM^T = I_d$. The rank of a matrix $A \in \mathbb{R}^{m\times n}$ is the dimension of its column space, i.e., the vector space spanned by its columns, which is identical to the dimension of its row space. This is also the maximal number of linearly independent columns of $A$. The matrix $A$ is said to have full rank if its rank equals the smaller of $m$ and $n$.
The Gram-Schmidt process, or QR-decomposition (Schmidt, 1908; Golub & Van Loan, 1996), is a method for orthonormalising a set of vectors in an inner product space (for example, the Euclidean space $\mathbb{R}^d$ equipped with the standard inner product). The Gram-Schmidt process takes a finite, linearly independent set $S = \{a_1,\dots,a_k\}$ of vectors for $k \le d$ and generates an orthogonal set $S' = \{u_1,\dots,u_k\}$ that spans the same $k$-dimensional subspace of $\mathbb{R}^d$ as $S$. The application of the Gram-Schmidt process to the column vectors of a full column rank matrix $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ yields the QR decomposition $A = QR$ (Trefethen & Bau III, 1997), where $Q \in \mathbb{R}^{m\times m}$ is an orthogonal matrix and $R \in \mathbb{R}^{m\times n}$ is an upper (right) triangular matrix; see Fig. 7.

A Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{R}^{m\times n}$ (Klema & Laub, 1980) is a factorization $A = UDV^T$ such that $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $D \in \mathbb{R}^{m\times n}$ is a diagonal matrix whose diagonal entries, called the singular values of $A$, are non-negative and non-increasing; see Fig. 8. A Thin Singular Value Decomposition (thin SVD) of a matrix $A \in \mathbb{R}^{m\times n}$ of rank $r \le n$ is a factorization $A = U_rD_rV_r^T$ such that $U_r \in \mathbb{R}^{m\times r}$, $V_r \in \mathbb{R}^{n\times r}$, $U_r^TU_r = I_r$, $V_r^TV_r = I_r$, and $D_r \in \mathbb{R}^{r\times r}$ is a diagonal matrix containing the singular values $d_1 \ge \dots \ge d_r > 0$; see Fig. 9. Every matrix $A \in \mathbb{R}^{m\times n}$ has QR, SVD, and thin SVD decompositions.

Let $L \subseteq \mathbb{R}^d$ be a $j$-dimensional linear subspace and let $S \in \mathbb{R}^{d\times j}$ be a matrix whose columns are mutually orthogonal unit vectors and span $L$. Let $S^\perp \in \mathbb{R}^{d\times(d-j)}$ be a matrix whose columns are mutually orthogonal unit vectors and span the orthogonal complement $L^\perp$ of $L$. By the Pythagorean theorem, the squared Euclidean distance between a point $p \in \mathbb{R}^d$ and $L$ can be computed by taking the norm of the projection of $p$ onto $L^\perp$, namely

$$\mathrm{dist}^2(p,L) = \|p\|^2 - \|p^TS\|^2 = \|p^TS^\perp\|^2.$$
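The Pythagorean identity above is easy to sanity-check numerically. A small sketch with a random subspace (the use of the SVD to obtain an orthonormal basis of the complement is our own choice of construction):

```python
import numpy as np

rng = np.random.default_rng(4)
d, j = 5, 2
S, _ = np.linalg.qr(rng.normal(size=(d, j)))    # orthonormal basis of L
S_perp = np.linalg.svd(S.T)[2][j:].T            # orthonormal basis of L-perp

p = rng.normal(size=d)
proj = S @ (S.T @ p)                            # projection of p onto L
dist2 = np.sum((p - proj) ** 2)                 # squared distance from p to L

assert np.isclose(dist2, p @ p - np.sum((p @ S) ** 2))   # ||p||^2 - ||p^T S||^2
assert np.isclose(dist2, np.sum((p @ S_perp) ** 2))      # ||p^T S_perp||^2
```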
The sum of squared distances from the rows of a matrix $A \in \mathbb{R}^{n\times d}$ to $L$ is thus $\|AS^\perp\|_F^2$.

A natural application of the SVD is to compute the unit vector $x \in \mathbb{R}^n$ that minimizes $\|Ax\|_2^2$ for a given matrix $A \in \mathbb{R}^{m\times n}$ (Klema & Laub, 1980). The desired vector of coefficients $x$ is simply the column of $V$ that corresponds to the smallest singular value in the SVD of $A$.

Minimizing the overdetermined system $\|Ax - b\|_2^2$, given an additional non-zero vector $b \in \mathbb{R}^m$, can be done as follows. Let $UDV^T$ be the SVD of $A$. We seek $x$ such that $Ax = UDV^Tx = b$. Multiplying both sides by $U^T$ yields $DV^Tx = U^Tb$. Multiplying again by the pseudo-inverse $D^\dagger$ of $D$ yields $V^Tx = D^\dagger(U^Tb)$. Hence, by multiplying both sides by $V$, we get that $x = V(D^\dagger(U^Tb))$ is the desired vector of coefficients.

If the matrix $A$ contains at least $n$ linearly independent vectors among its rows, then the rows of $A$ span $\mathbb{R}^n$. This implies that every vector $v \in \mathbb{R}^n$ is a linear combination of only $n$ such vectors. Section 3.7 suggests such a "coreset" with only positive coefficients in the linear combination. Computing these coefficients can be done by solving the linear system $Bx = v$, where the columns of $B$ are the vectors that span $v$.
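The SVD-based least-squares recipe above can be written out directly. A sketch with synthetic data; np.linalg.lstsq is used only to cross-check the result:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 4
A = rng.normal(size=(m, n))                        # overdetermined: m > n
b = rng.normal(size=m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V^T (thin SVD)
x = Vt.T @ ((U.T @ b) / s)                         # x = V D^+ U^T b

# cross-check against NumPy's least-squares solver
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])

# the unit vector minimizing ||Ax||_2 is the right-singular vector of the
# smallest singular value, and attains the value sigma_min
x_min = Vt[-1]
assert np.isclose(np.linalg.norm(A @ x_min), s.min())
```

Dividing by the singular values implements the pseudo-inverse $D^\dagger$ for a full-rank $A$; a production solver would also guard against (near-)zero singular values.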
