Quantum Maximum Likelihood Prediction via Hilbert Space Embeddings

Recent works have proposed various explanations for the ability of modern large language models (LLMs) to perform in-context prediction. We propose an alternative conceptual viewpoint from an information-geometric and statistical perspective. Motivated by Bach [2023], we model training as learning an embedding of probability distributions into the space of quantum density operators, and in-context learning as maximum-likelihood prediction over a specified class of quantum models. We provide an interpretation of this predictor in terms of quantum reverse information projection and a quantum Pythagorean theorem when the class of quantum models is sufficiently expressive. We further derive non-asymptotic performance guarantees in terms of convergence rates and concentration inequalities, both in trace norm and quantum relative entropy. Our approach provides a unified framework to handle both classical and quantum LLMs.

Authors: Sreejith Sreekumar, Nir Weinberger

Keywords and phrases: Rényi divergence, deviation inequality, variational estimator, smoothed plug-in estimator, differential privacy.

1. Introduction

Prediction is a primitive task with numerous applications in machine learning, statistics, information theory, physics, finance, and engineering. The basic prediction problem pertains to providing an informed guess about the next outcome, $X_{n+1}$, of a sequence, given the previous observations $X_1,\dots,X_n$. This problem has been studied from multiple perspectives, e.g., machine learning [Cesa-Bianchi and Lugosi, 2006] and information theory [Merhav and Feder, 1998].

At their core, and with much simplification, modern large language models (LLMs) are such predictors, in the sense that during inference they induce a probability distribution over the next token (symbol, word, sentence, etc.) based on previous ones. Nonetheless, the recent remarkable success of LLMs is challenging to theoretically explain and align with the standard classical results. In the classical theory, the performance of the predictor degrades with the dimension of the parameter vector used to determine the predictor. Specifically, under the common log loss (cross-entropy), the expected regret in predicting that a sample from a probability distribution $P$ follows a probability distribution $Q$ is given by the Kullback–Leibler (KL) divergence, $D_{\mathrm{KL}}(P\|Q)$. In the minimax framework, the regret scales linearly with the dimension of the model class of $Q$ (e.g., Krichevsky and Trofimov [1981], Rissanen [1983], Yang and Barron [1999], Haussler and Opper [1997], Polyanskiy and Wu [2025, Sec. 13.4]). For LLMs, the dimensionality of the parameter may be attributed to either (i) the number of parameters in the architecture (which can be on the order of billions) and various refinements of effective dimension (e.g., Golowich et al. [2020]), or (ii) the cardinality of the vocabulary, assuming that the architecture is expressive enough to allow any probability distribution over the vocabulary. The vocabulary $\mathcal{W}$ of the tokens corresponds to the ones used in a useful language, and thus is typically large, say on the order of hundreds of thousands.
Even in a simplistic setting, in which the model operates on independent and identically distributed (i.i.d.) sentences of tokens, say of length $k$, the resulting vocabulary of sentences is an astronomical $|\mathcal{W}|^k$. This suggests that training and inference must rely on some inductive bias, whether explicit or implicit. Moreover, following the LLM training, the ability of the model to perform in-context learning (ICL) emerges: given only a short prompt, the model is capable of accurate prediction of the next tokens (e.g., as an answer to a question described in the prompt), even without any fine-tuning for this specific task.

Various theoretical works suggest that ICL can be viewed as an inference-time activation of a simple statistical procedure induced by the learned representations. For example, in stylized generative settings, pretrained sequence models were shown to perform implicit Bayesian inference over latent concepts, turning a short context into a posterior predictive rule [Xie et al., 2022]. Other works show that transformers implement standard estimators and learning dynamics in-context, e.g., ridge/least-squares rules and gradient-based updates [Akyürek et al., 2023]. The function-class ICL framework models a prompt as i.i.d. samples from an underlying task prior, and shows that LLMs can match various standard statistical inference procedures such as optimal least squares, even on unseen tasks. This suggests that the core of the prediction is achieved by learning a representation in which prompt-conditioned prediction reduces to a low-complexity update [Garg et al., 2022]. Recent results substantially broaden this point of view, showing that transformers can realize a family of in-context procedures (least squares, ridge, Lasso, generalized linear models, gradient descent) and even select among them based on the sequence, hence reinforcing the interpretation of ICL as statistical decision rules instantiated at inference time [Bai et al., 2023].

From a bird's-eye view, an LLM takes a token sequence as input and embeds each token as a vector in a Hilbert space. The model architecture is designed to capture dependencies and relations between tokens at various scales. For example, in the transformer, this is achieved by cascading several multi-head attention units and feedforward neural nets that capture dependencies between different features in neighboring tokens by projecting features and computing their interactions. The embedding into a Hilbert space provides a geometry in which token similarity is quantified by inner products, whereas the original discrete space lacks an intrinsic geometric structure. Additionally, this Hilbert space, equipped with the notion of orthogonality, provides a natural setting in which optimization algorithms such as stochastic gradient descent (SGD) can be performed efficiently and analyzed.

In this paper, we propose and analyze a simple conceptual model of ICL based on embedding probability distributions as quantum density operators, and we characterize how such embeddings can improve prediction relative to classical prediction approaches. Our analysis leads to several useful insights on properties of efficient embeddings and models.
Our model assumes that during training, an effective embedding of sequences of tokens $x=(w_1,w_2,\dots,w_k)\in\mathcal{X}=\mathcal{W}^k$ to vectors $|\varphi(x)\rangle$ in a Hilbert space has been learned. Intuitively, since the embedding is to a finite-dimensional Hilbert space of dimension much smaller than the vocabulary, the curse of large vocabulary is alleviated, and the discrete geometry between words in the vocabulary (Hamming distance or "zero-one" similarity) is replaced by the inner product in the embedded space. To model the prompt during the in-context learning phase, we assume that a sequence of $n$ i.i.d. symbols $X_1^n=(X_1,X_2,\dots,X_n)$ from an unknown distribution $P$ is observed, and the model predicts the next symbol $X_{n+1}$ via a probability distribution $Q$, while aiming to minimize the mean log loss $-\mathbb{E}_P[\log Q(X_{n+1})]$.

Our proposed predictor utilizes the learned embedding by mapping the empirical distribution of the symbols $X_1^n$ into the embedded domain. A natural baseline is the mean embedding, in which a probability distribution $Q$ is mapped to $\mathbb{E}_{X\sim Q}[|\varphi(X)\rangle]$. Although widely used [Muandet et al., 2017], this embedding induces a Euclidean geometry on the space of probability distributions, which does not align with the log loss, the geometry of the probability simplex, and information-theoretic objectives. As recently argued and explored by Bach [2023], a more natural embedding of probability distributions is to the covariance operator $\mathbb{E}_{X\sim Q}[|\varphi(X)\rangle\langle\varphi(X)|]$, or, in quantum-information terms [Wilde, 2017, Hayashi, 2016], to a density operator. The resulting geometry is based on the quantum relative entropy, which is a natural generalization of the KL divergence (which is the expected log loss up to an entropy correction). This induces the bias that closeness in quantum relative entropy correlates with predictive accuracy after processing back to classical probability distributions over the output.

Therefore, our proposed simple predictor for the in-context learning step is the quantum maximum likelihood predictor (QMLP), based on the embedding of the empirical distribution of $X_1^n$ to a density operator. Given a density operator, a classical output distribution is obtained by applying a measurement and using the Born rule. While the choice of measurement also induces bias, an accurate QMLP in terms of the quantum relative entropy directly implies an accurate predictor in terms of measured relative entropies, i.e., the maximum KL divergence between probability distributions induced by quantum measurements. This is because any processing of the density operator to a probability distribution over a classical vocabulary is a quantum measurement channel, and by the quantum data-processing inequality, the KL divergence between the measurement output distributions is smaller than the quantum relative entropy between the density operators. Hence, as an additional benefit, our model allows us to address prediction in the context of quantum LLMs (see e.g., [Basile and Tamburini, 2017, Kim et al., 2025]) on the same footing.
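To make the covariance embedding concrete, here is a minimal numerical sketch (our own illustration, not from the paper): it builds $\rho_p=\sum_x p(x)\,|\varphi(x)\rangle\langle\varphi(x)|$ for a toy vocabulary with a synthetic unit-norm embedding (`phi` and `p` are stand-in assumptions) and checks that the result is a valid quantum state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all choices here are illustrative assumptions):
# a vocabulary of 50 symbols embedded into a d = 8 dimensional Hilbert space.
vocab_size, d = 50, 8

# A hypothetical embedding phi: X -> H_d with ||phi(x)|| = 1 for every x.
phi = rng.normal(size=(vocab_size, d)) + 1j * rng.normal(size=(vocab_size, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)

# A source distribution p over the vocabulary.
p = rng.dirichlet(np.ones(vocab_size))

# Covariance embedding: rho_p = sum_x p(x) |phi(x)><phi(x)|.
rho_p = sum(p[x] * np.outer(phi[x], phi[x].conj()) for x in range(vocab_size))

# rho_p is a density operator: Hermitian, PSD, unit trace.
assert np.allclose(rho_p, rho_p.conj().T)
assert np.min(np.linalg.eigvalsh(rho_p)) >= -1e-12
assert np.isclose(np.trace(rho_p).real, 1.0)
print("min eigenvalue of rho_p:", np.linalg.eigvalsh(rho_p).min())
```

Note that because distinct symbols are mapped to non-orthogonal directions, the state lives in only $d\ll|\mathcal{X}|$ dimensions and its minimal eigenvalue is typically bounded away from zero, which is the effect exploited below.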
In such quantum LLM systems [Nielsen and Chuang, 2001], a quantum state is prepared, then processed by quantum gates, and finally measured by a positive operator-valued measure (POVM) to obtain a classical probability distribution over the output vocabulary.

Given the above context for our proposed predictor, several natural basic questions emerge: What are the information-geometric and non-asymptotic statistical properties of performing MLP in the space of embedded states rather than the original space of probability distributions? As we show below, the smaller dimension of the Hilbert space (e.g., in the context of LLMs), compared to the original alphabet size of the discrete distributions (which can be large or even infinite), enables the QMLP to have a much better convergence rate than its analogue in the distribution space.

1.1. Contributions. Our main contributions can be summarized as follows:

(i) We show that under model-class symmetries that are natural for learning via embedding (unitary invariance and closure under the pinching operation), the QMLP objective reduces to a classical KL divergence objective on eigenvalues (Proposition 1), and provide a geometric interpretation of this equivalence in terms of a refined version of the quantum Pythagorean theorem (Theorem 1).

(ii) We obtain non-asymptotic upper bounds (Theorem 2) quantifying the expected rate of convergence of the QMLP (and MLP) to the target state, both in trace norm and quantum relative entropy. We also obtain the corresponding concentration inequalities. These bounds depend on the dimension $d$ of the Hilbert space, the prompt length $n$, and the minimal positive eigenvalue of the embedded quantum state corresponding to the true data distribution. For instance, our bounds show that the rate of convergence in trace norm is $\tilde{O}(d/\sqrt{n})$ (up to an additive approximation slack factor) in general, and $\tilde{O}(d^3/n)$ when the class of quantum models is sufficiently expressive, where $\tilde{O}(\cdot)$ hides logarithmic factors and the dependence on the minimal eigenvalue. Under additional conditions on the expressivity of the quantum models, measured in terms of quantum relative entropy, the corresponding rates of convergence under quantum relative entropy are $\tilde{O}(d/\sqrt{n})$ and $\tilde{O}(d^2/n)$, respectively.

The proof of Proposition 1 relies on elementary arguments involving quantum measurements and the data-processing inequality. The proof of the quantum Pythagorean theorem involves adapting the proof of the corresponding classical result given in Csiszár and Shields [2004] to the non-commutative setting. The essential ingredients in the proof of Theorem 2 are the variational expression for quantum relative entropy (see (2)), matrix Hoeffding and Bernstein inequalities, a second-order Taylor expansion for quantum relative entropy, and Proposition 2, which relates the trace norm between the QMLP and the embedded state (corresponding to the true distribution) to terms involving quantum relative entropy. Our bounds are tailored to embedded empirical states and the QMLP, giving explicit dependence on the embedding dimension and spectral conditioning.
This yields prediction-oriented bounds relating the prompt length and the embedding dimension, which differ from existing tomography-oriented bounds, e.g., [Koltchinskii and Xia, 2015, Ferrie and Blume-Kohout, 2016].

1.2. Related work. Several recent studies have attempted to unlock the learning phenomenon in LLMs from different viewpoints, e.g., modeling the transformer as a finite state machine and quantifying the ability of transformers for learning via approximation of linear functions. For instance, Basu et al. [2023] considers Markov models for both the data and the transformer, and shows that the average empirical loss as measured by the log loss function converges to the conditional entropy asymptotically, or equivalently that the asymptotic regret in prediction vanishes. Edelman et al. [2024] show that ICL can be explained by the emergence of statistical induction heads in transformers, which implement simple Markovian predictors over token sequences, revealing ICL as the activation of low-complexity statistical rules encoded during training. Jeon et al. [2024] provide an information-theoretic characterization of ICL, showing that in-context predictions in transformers correspond to optimal inference under implicit information constraints, where the prompt supplies limited mutual information about the task that is efficiently exploited by the pretrained model. Sander and Peyré [2025] investigate the expressiveness of transformers for next-token prediction by considering an embedding of tokens into a reproducing kernel Hilbert space (RKHS). Pan et al. [2025] proposed a theoretical framework for LLM scaling laws and knowledge acquisition by modeling training as an information-theoretic compression process governed by a hierarchical generative model of syntax and knowledge.

Embedding probability distributions into Hilbert spaces for inference has been explored in a variety of other contexts. For instance, the mean embedding [Berlinet and Thomas-Agnan, 2004, Muandet et al., 2017] and the associated maximum mean discrepancy (MMD) metric have been used for designing kernel-based two-sample tests [Gretton et al., 2006], defining independence/dependence measures [Fukumizu et al., 2004], feature extraction [Smola et al., 2007], and graphical models [Song et al., 2010], among other applications. The covariance embedding [Bach, 2023] of probability distributions into density operators has also found several applications. To mention a few, Hoyos-Osorio and Sanchez-Giraldo [2024] considered a representation Jensen-Shannon divergence defined in terms of the von Neumann entropy of embedded density operators, Kachaiev and Recanatesi [2024] proposed unsupervised learning of the kernel via maximum von Neumann entropy, and Santoro and Panaretos [2025] proposed a kernel-based likelihood test utilizing both the mean and covariance embeddings. The covariance embedding of probability distributions and the associated QMLP also capture, at a high level, the problem of prediction in quantum LLMs [Basile and Tamburini, 2017], where the expressivity of the quantum architecture used for prediction is reflected in the class of models over which the QMLP is optimized.
Another related topic corresponds to quantum generative modeling and simulation of probability distributions, in which the hope is to use quantum models to efficiently simulate probability distributions which are difficult to model classically. Two popular frameworks used in this regard pertain to quantum circuit Born machines [Liu and Wang, 2018, Benedetti et al., 2019] and quantum Boltzmann machines [Benedetti et al., 2017, Kieferová and Wiebe, 2017, Amin et al., 2018], which use quantum circuits and thermal (or Gibbs) states, respectively, to achieve the modeling objective.

2. Preliminaries

2.1. Notation. Let $\mathcal{X}$ be a discrete set or a subset of $\mathbb{R}^d$. Denote the set of all Borel probability measures whose support is contained in $\mathcal{X}$ by $\mathcal{P}(\mathcal{X})$. Let $\mu$ denote a sigma-finite measure on $\mathcal{X}$. For $P\ll\mu$ (i.e., $P$ absolutely continuous with respect to $\mu$), let $p$ denote its Radon-Nikodym derivative with respect to $\mu$. Unless specified otherwise, we will take $\mu$ to be the counting measure when $\mathcal{X}$ is discrete, or the Lebesgue measure when $\mathcal{X}\subseteq\mathbb{R}^d$, in which case $p$ becomes the usual probability mass function (pmf) or probability density function (pdf).

Let $\mathcal{H}$ denote a complex (or real) separable Hilbert space. Unless specified otherwise, we will assume that $\mathcal{H}$ is complex. When $\mathcal{H}$ is of dimension $d$, we will denote it as $\mathcal{H}_d$. For an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ on $\mathcal{H}$, we will use the physics-inspired convention in which the inner product is anti-linear (or conjugate) in the first argument and linear in the second. Whenever there is no confusion, we will denote $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ simply by $\langle\cdot,\cdot\rangle$. The space of continuous linear operators on $\mathcal{H}$ is denoted by $\mathcal{L}(\mathcal{H})$, equipped with the operator norm topology. The set of density operators or quantum states is denoted by $\mathcal{S}(\mathcal{H})$, i.e., self-adjoint operators $\rho$ such that $\rho\ge 0$ and $\mathrm{Tr}[\rho]=1$. We will employ the bra-ket notation from quantum information theory (see e.g. Hayashi [2016]). To mention it briefly, let $h$ and $\tilde{h}$ be any two elements of some Hilbert space $\mathcal{H}$. Then $|h\rangle$, $\langle\tilde{h}|$, and $|h\rangle\langle\tilde{h}|$ correspond, respectively, to the element $h\in\mathcal{H}$, the linear functional on $\mathcal{H}$ with action $\langle\tilde{h}||h\rangle = \langle\tilde{h},h\rangle$, and the linear operator with action $|h\rangle\langle\tilde{h}||\bar{h}\rangle = \langle\tilde{h},\bar{h}\rangle\,|h\rangle$ for every $\bar{h}\in\mathcal{H}$. For $L\in\mathcal{L}(\mathcal{H})$, $\lambda_L$ denotes the vector composed of the eigenvalues of $L$. When $\rho\in\mathcal{S}(\mathcal{H})$, $\lambda_\rho$ is a pmf. The support of $L\in\mathcal{L}(\mathcal{H})$, defined as the orthogonal complement of the kernel of $L$ in $\mathcal{H}$, is denoted by $\mathrm{spt}(L)$. $L_1\ll L_2$ denotes that $\mathrm{spt}(L_1)\subseteq\mathrm{spt}(L_2)$, and $L_1\not\ll L_2$ means that $\mathrm{spt}(L_1)\not\subseteq\mathrm{spt}(L_2)$. For $a,b\in\mathbb{R}$, $a\vee b$ and $a\wedge b$ denote $\max\{a,b\}$ and $\min\{a,b\}$, respectively. Next, we introduce the preliminary notions used in the paper.

2.2. Quantum relative entropy and measured relative entropy. The quantum relative entropy between $\rho,\sigma\in\mathcal{S}(\mathcal{H})$ is
$$ D(\rho\|\sigma) := \begin{cases} \mathrm{Tr}[\rho(\log\rho-\log\sigma)], & \text{if } \rho\ll\sigma,\\ \infty, & \text{otherwise}. \end{cases} \tag{1} $$
The quantum relative entropy admits the variational expression [Petz, 1988, Berta et al., 2015]:
$$ D(\rho\|\sigma) = \sup_{H}\; \mathrm{Tr}[H\rho] - \log\mathrm{Tr}\big[e^{H+\log\sigma}\big] = \sup_{H}\; \mathrm{Tr}[H\rho] - \mathrm{Tr}\big[e^{H+\log\sigma}\big] + 1. \tag{2} $$
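As a quick numerical sanity check (our own sketch, not part of the paper), the following code evaluates (1) for two full-rank states, verifies that the choice $H=\log\rho-\log\sigma$ attains the supremum in the second form of (2), and confirms the data-processing inequality for the distributions induced by a projective measurement; the helpers `random_state` and `qre` are illustrative names.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
d = 4

def random_state(d):
    # Random full-rank density operator via a Wishart-style construction.
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T + 0.1 * np.eye(d)   # strictly positive
    return rho / np.trace(rho).real

def qre(rho, sigma):
    # Quantum relative entropy (1), assuming rho, sigma > 0.
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

rho, sigma = random_state(d), random_state(d)
D = qre(rho, sigma)

# Variational expression (2): H* = log(rho) - log(sigma) attains the supremum,
# since then e^{H + log sigma} = rho has unit trace.
H = logm(rho) - logm(sigma)
val = np.trace(H @ rho).real - np.trace(expm(H + logm(sigma))).real + 1
assert np.isclose(D, val)

# Data-processing: measure in a random orthonormal basis (a projective POVM)
# and compare the classical KL of the outcome pmfs with D(rho||sigma).
U, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
p = np.real(np.diag(U.conj().T @ rho @ U))
q = np.real(np.diag(U.conj().T @ sigma @ U))
kl = np.sum(p * np.log(p / q))
assert kl <= D + 1e-10
print(f"D(rho||sigma) = {D:.4f}, measured KL = {kl:.4f}")
```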
The measured relative entropy [Donald, 1986, Piani, 2009] between $\rho,\sigma\in\mathcal{S}(\mathcal{H})$ is defined as
$$ D_M(\rho\|\sigma) := \sup_{\mathcal{Z},\{M_z\}_{z\in\mathcal{Z}}} \sum_{z\in\mathcal{Z}} \mathrm{Tr}[M_z\rho]\,\log\!\left(\frac{\mathrm{Tr}[M_z\rho]}{\mathrm{Tr}[M_z\sigma]}\right), \tag{3} $$
where the supremum is over all finite sets $\mathcal{Z}$ and positive operator-valued measures (POVMs) $\{M_z\}_{z\in\mathcal{Z}}$ indexed by $\mathcal{Z}$. (Recall that a POVM $\{M_z\}_{z\in\mathcal{Z}}$ on a separable Hilbert space $\mathcal{H}$, indexed by a discrete $\mathcal{Z}$, is a set of positive semi-definite operators $M_z$ such that $\sum_{z\in\mathcal{Z}} M_z$ equals the identity operator on $\mathcal{H}$.) By the data-processing inequality for quantum relative entropy, $D(\rho\|\sigma)\ge D_M(\rho\|\sigma)$. This shows that if two density operators $\rho$ and $\sigma$ are close in quantum relative entropy, then the KL divergence between any two measured probability distributions is only lower (or equal). Moreover, when $\rho,\sigma>0$, equality holds if and only if $\rho$ and $\sigma$ commute, and then this common value also coincides with $D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma)$, where $\lambda_\rho$ denotes the pmf composed of the eigenvalues of $\rho$, and the KL divergence between probability measures $P,Q\in\mathcal{P}(\mathcal{X})$ is
$$ D_{\mathrm{KL}}(P\|Q) := \begin{cases} \mathbb{E}_P\!\left[\log\frac{dP}{dQ}\right], & \text{if } P\ll Q,\\ \infty, & \text{otherwise}. \end{cases} \tag{4} $$

3. Prediction task

Let $m,n\in\mathbb{N}$ and let $\mathcal{X}$ be a discrete set or a subset of $\mathbb{R}^m$. Let $\mathcal{Q}\subseteq\mathcal{P}(\mathcal{X})$ be a class of probability measures. We will assume that $\mathcal{Q}$ is a compact convex set and contains a $Q>0$. The classical prediction problem can then be framed as determining the best predictor (or model for prediction) $\hat{Q}_n:\mathcal{X}^n\to\mathcal{Q}$ based on the samples $X^n\sim P^{\otimes n}$, where $P\in\mathcal{P}(\mathcal{X})$. The accuracy of prediction is measured by a loss function $d:\mathcal{X}\times\mathcal{P}(\mathcal{X})\to[0,\infty]$. A particular loss function of interest is the log loss $d_L(x,Q):=-\log q(x)$, where $q$ denotes the pmf in the discrete case and the pdf in the continuous case. Let $\{\hat{Q}_n(\cdot)\}$ be shorthand for $\{\hat{Q}_n(x^n)\}_{x^n\in\mathcal{X}^n}$. Given $x^n$, we find the predictor that minimizes the cumulative (or equivalently the average) empirical loss according to the log loss function. Consider the case of discrete $\mathcal{X}$. Since the empirical loss can be written as
$$ \inf_{Q\in\mathcal{Q}} \frac{1}{n}\sum_{i=1}^n d_L(x_i,Q) = H(\hat{P}_{x^n}) + \inf_{Q\in\mathcal{Q}} D_{\mathrm{KL}}\big(\hat{P}_{x^n}\,\big\|\,Q\big), \tag{5} $$
where $H(P)$ denotes the Shannon entropy of $P$, this procedure is equivalent to finding the predictor
$$ \hat{Q}^\star_n(x^n) := \arg\min_{Q\in\mathcal{Q}} D_{\mathrm{KL}}\big(\hat{P}_{x^n}\,\big\|\,Q\big) = \arg\max_{Q\in\mathcal{Q}} \prod_{i=1}^n q(x_i), \tag{6} $$
when the minimum (and maximum) above exists. We refer to $\hat{Q}^\star_n(x^n)$ as an MLP when it exists. The MLP when $\mathcal{X}\subseteq\mathbb{R}^m$ can be defined similarly using the first equality in (6). When multiple solutions to the optimization in (6) exist, MLP will mean any one of them, chosen arbitrarily.

Consider the optimization (minimization) problem in (6). Fixing $x^n\in\mathcal{X}^n$ and setting $Q=\hat{Q}_n(x^n)$, one may without loss of generality consider the following optimization problem:
$$ \inf_{Q\in\mathcal{Q}} D_{\mathrm{KL}}(P\|Q) = \min_{Q\in\mathcal{Q}} D_{\mathrm{KL}}(P\|Q) = D_{\mathrm{KL}}(P\|Q^\star_P). \tag{7} $$
In the above, the equalities follow by noting that the resulting infimum is achieved by some $Q^\star_P$ (due to the lower semi-continuity of $D_{\mathrm{KL}}(P\|Q)$ in $(P,Q)$ and the compactness of $\mathcal{Q}$). Since $\mathcal{Q}$ is also convex, there exists a distribution $Q'$ whose support contains the support of all other $Q\in\mathcal{Q}$; define the support of $\mathcal{Q}$ as $\mathrm{spt}(\mathcal{Q}) = \mathrm{spt}(Q')$ for such a $Q'$. When $\mathrm{spt}(P)=\mathrm{spt}(\mathcal{Q})$, $Q^\star_P$ is unique.
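The identity (5)–(6) says that minimizing the empirical log loss is the same as projecting the empirical distribution onto the model class in KL. A minimal sketch under illustrative assumptions (a three-symbol alphabet, with $\mathcal{Q}$ taken as the full simplex, in which case the MLP is just the empirical distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy alphabet and source (illustrative choices).
alphabet = 3
P = np.array([0.6, 0.3, 0.1])
x = rng.choice(alphabet, size=200, p=P)

# Empirical distribution of the prompt x^n.
P_hat = np.bincount(x, minlength=alphabet) / len(x)

def log_loss(Q):
    # Average empirical log loss (1/n) * sum_i -log q(x_i).
    return -np.mean(np.log(Q[x]))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# With Q = P(X) (the full simplex), the MLP in (6) is P_hat itself,
# and (5) reads: empirical log loss = H(P_hat) + KL(P_hat || Q).
Q = rng.dirichlet(np.ones(alphabet))  # an arbitrary competitor model
assert np.isclose(log_loss(Q), entropy(P_hat) + kl(P_hat, Q))
assert log_loss(P_hat) <= log_loss(Q)  # the MLP minimizes the empirical loss
print("MLP (empirical distribution):", P_hat)
```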
Even when $\mathcal{X}$ is discrete and $\mathcal{Q}=\mathcal{P}(\mathcal{X})$, solving (6) to find the maximum likelihood estimate quickly becomes infeasible when $|\mathcal{X}|$ is large, since it involves optimization over a simplex of dimension $|\mathcal{X}|-1$. This is a typical case in LLMs, for example, when $\mathcal{X}=\mathcal{W}^{\otimes k}$ is the set of all possible sentences of length $k$ composed of tokens or words selected from a vocabulary $\mathcal{W}$. Further, although (6) is a convex optimization problem, common algorithms such as SGD are not directly applicable on an arbitrary discrete set $\mathcal{X}$ without a metric structure. Indeed, as explained below, modern LLMs overcome the latter challenge by representing tokens or words as vectors in a Euclidean space, where gradient-based optimization algorithms can be applied. In the following, we interpret this through Hilbert-space embeddings, quantum information projection, and bounds on QMLP performance.

3.1. Embeddings and Quantum Maximum Likelihood Predictor. When considering an embedding of probability distributions into Hilbert spaces, a natural possibility, which has been extensively studied in the context of RKHSs, is the mean embedding $P\mapsto\mathbb{E}_P[\varphi(X)]$, where $\varphi$ is the feature map corresponding to an RKHS (see Appendix B). However, as noted in Bach [2023], the mean embedding induces an MMD metric between probability distributions in the RKHS, which does not have a direct connection to the log loss (and information theory). Accordingly, Bach [2023] proposed a second-order covariance embedding as a suitable framework to study the interaction between probability distributions and perform sample-based inference. To define it, let $\mathcal{H}$ be a Hilbert space and $\varphi:\mathcal{X}\to\mathcal{H}$ be a Borel-measurable map such that $\|\varphi(x)\|_{\mathcal{H}}=1$ for all $x\in\mathcal{X}$. For a probability measure $P$ with density $p$ with respect to a dominating positive $\sigma$-finite measure $\mu$ (the counting measure in the discrete case and the Lebesgue measure in the continuous case), set
$$ \rho_p = \int_{\mathcal{X}} p(x)\,|\varphi(x)\rangle\langle\varphi(x)|\,d\mu(x). \tag{8} $$
The covariance embedding $\phi:\mathcal{P}(\mathcal{X})\to\mathcal{L}(\mathcal{H})$ is then the linear map induced by (8). Here, we explore the utility of covariance embeddings for prediction. Note that $\rho_p\in\mathcal{S}(\mathcal{H})$, since it is positive semidefinite and has unit trace (see (74) in Appendix B). Hence, $\phi$ is a map of $\mathcal{P}(\mathcal{X})$ into $\mathcal{S}(\mathcal{H})$. Also, recall that $\mathcal{S}(\mathcal{H})$ is a subset of the trace-class operators on $\mathcal{H}$, and hence lies in a Hilbert space within $\mathcal{L}(\mathcal{H})$ equipped with the Hilbert-Schmidt inner product $\langle\rho,\sigma\rangle := \mathrm{Tr}[\rho^*\sigma]$, where $\rho^*$ denotes the adjoint of $\rho$ (for $\mathcal{H}_d$, the whole of $\mathcal{L}(\mathcal{H}_d)$ becomes a Hilbert space equipped with the Hilbert-Schmidt inner product). Several interesting properties of the map $\phi$ when $\varphi$ is the feature map corresponding to an RKHS with kernel $K$ were studied in Bach [2023]. As noted therein, the image of $\phi$ lies within a Hilbert space that is isomorphic to the RKHS generated by the kernel $K^2$. Also, the quantum data-processing inequality yields $D_{\mathrm{KL}}(P\|Q)\ge D(\rho_p\|\rho_q)$.

Given the embedding of probability distributions as density operators in a Hilbert space (which we interpret as an internal representation), we model ICL as selecting a density operator $\sigma$ for prediction from a model class $\Sigma$, determined by the QMLP.
To obtain a classical distribution on the output vocabulary, we apply a measurement channel (or POVM) $\mathcal{M}_n$, viewed as a quantum-to-classical channel. The appropriate embedding $\varphi$ and measurement are selected during training and then held fixed during ICL. Let
$$ \hat{\rho}_n(x^n) := \frac{1}{n}\sum_{i=1}^n |\varphi(x_i)\rangle\langle\varphi(x_i)| \tag{9} $$
denote the embedding of the empirical distribution $\hat{p}_n$ of $x^n$, i.e., $\hat{\rho}_n(x^n) = \rho_{\hat{p}_n} := \phi(\hat{p}_n)$. Let $\hat{P}_n$ and $\hat{Q}_n$ denote the measurement-output distributions corresponding to $\hat{\rho}_n(x^n)$ and $\sigma$. Prediction accuracy is quantified by the minimal average log loss, which is equivalent to $\min_{Q\in\mathcal{Q}} D_{\mathrm{KL}}(\hat{P}_n\|Q)$. Here, $\mathcal{Q} := \{\mathcal{M}_n(\sigma): \sigma\in\Sigma\}$ and $\mathcal{M}_n(\sigma)$ denotes the classical distribution obtained as the output of the measurement channel with input $\sigma$. Since the measurement $\mathcal{M}_n$ is unspecified, one may consider the above optimization with the measured relative entropy in place of the KL divergence, replacing $\hat{P}_n$, $Q$, and $\mathcal{Q}$ by $\hat{\rho}_n(x^n)$, $\sigma$, and $\Sigma$, respectively. However, the measured relative entropy is typically intractable, as it requires optimizing over all possible measurements, leading to a nonconvex inner problem without a closed form in general. We therefore define the QMLP given samples $x^n\in\mathcal{X}^n$ as any minimizer of
$$ \inf_{\sigma\in\Sigma} D(\hat{\rho}_n(x^n)\|\sigma), \tag{10} $$
whenever the infimum is attained. This formulation has the additional advantage that it leads to convergence guarantees for the MLP in terms of quantum relative entropy, which is a stronger discrepancy measure than the measured relative entropy. When $\Sigma$ is a compact set, the infimum in (10) is achieved. (In what follows, the compactness assumption is used only to guarantee that the relevant infimum is achieved, and hence can be dispensed with if the minimum is known to exist; this is a useful relaxation, especially in infinite-dimensional Hilbert space settings, where compactness can be a restrictive assumption.) Moreover, when $\Sigma$ is also convex and $\mathrm{spt}(\hat{\rho}_n(x^n)) = \mathrm{spt}(\Sigma)$, this minimum is unique. (Since $\Sigma$ is a compact convex set of positive semi-definite operators, there exists an element $\sigma\in\Sigma$ such that $\sigma'\ll\sigma$ for every $\sigma'\in\Sigma$; define $\mathrm{spt}(\Sigma):=\mathrm{spt}(\sigma)$ for such a $\sigma$. To see that such a $\sigma$ exists, note that $\mathrm{spt}(\sigma_1)\cup\mathrm{spt}(\sigma_2) = \mathrm{spt}(\alpha\sigma_1+(1-\alpha)\sigma_2)$ for any $\sigma_1,\sigma_2\in\mathcal{S}(\mathcal{H})$ and $\alpha\in(0,1)$. Using this, one can construct either a finite or infinite sequence of density operators $\sigma_n\in\Sigma$ with strictly increasing support; in the finite case, the last element of the sequence gives the desired $\sigma$, and in the infinite case, take $\sigma$ to be any limit point.) However, the last condition is restrictive and hard to ensure for all $x^n\in\mathcal{X}^n$. Accordingly, we call the "QMLP" an arbitrary minimizer of (10) when the minimizer is not unique.

The motivation behind considering the QMLP stems from the possibility that with an appropriately chosen embedding $\varphi$ into a Hilbert space $\mathcal{H}_d$ of dimension $d\ll|\mathcal{X}|$, $\hat{\rho}_n(x^n)$ is a good proxy of $\hat{p}_n$ for the prediction task, while simultaneously enabling optimization over a subset of a much smaller Hilbert space $\mathcal{L}(\mathcal{H}_d)$, i.e., over the space of $d\times d$ matrices. This occurs, for instance, when "neighboring symbols" which should lead to similar prediction outcomes are embedded to closely aligned elements in the Hilbert space, so that the span of $\{\varphi(x): x\in\mathcal{X}\}$ has a low effective dimension compared to an orthogonal embedding. Another rationale, as will become evident in Section 4.2, is that the minimal eigenvalue of $\rho_p$ increases in general when multiple elements of $\mathcal{X}$ are mapped to non-orthogonal elements. From an optimization standpoint, (10) can be solved by optimization algorithms over Hilbert spaces that can exploit its rich structure, such as orthogonality and inner products, which are lacking in $\mathcal{P}(\mathcal{X})$. In fact, there is a rich theory of optimization in Hilbert spaces (e.g., Houska and Chachuat [2017]), and gradient-based algorithms can exploit the aforementioned structure while performing optimization.
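As a toy illustration of (10) (our own sketch, not the paper's general setting), take $\Sigma$ to be the states diagonal in a fixed orthonormal basis $\{|b_i\rangle\}$. For this class the QMLP has a closed form: writing $D(\hat{\rho}\|\sigma) = -S(\hat{\rho}) - \sum_i \langle b_i|\hat{\rho}|b_i\rangle\log q_i$ for $\sigma=\sum_i q_i|b_i\rangle\langle b_i|$, the optimal eigenvalues $q$ match the diagonal of $\hat{\rho}$ in that basis. The code checks this closed form against a generic optimizer.

```python
import numpy as np
from scipy.linalg import logm
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d = 3

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T + 0.1 * np.eye(d)
rho /= np.trace(rho).real                       # embedded empirical state

# A fixed orthonormal basis B defining the diagonal model class Sigma.
B, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))

def qre(rho, sigma):
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

def sigma_of(q):
    # State diagonal in the basis B with eigenvalues q.
    return (B * q) @ B.conj().T

# Closed form: the QMLP over this class matches the diagonal of rho in B.
q_star = np.real(np.diag(B.conj().T @ rho @ B))

# Brute-force check via an optimizer over the simplex (softmax parametrization).
def objective(theta):
    q = np.exp(theta) / np.exp(theta).sum()
    return qre(rho, sigma_of(q))

res = minimize(objective, np.zeros(d), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 5000})
q_opt = np.exp(res.x) / np.exp(res.x).sum()
print("closed form :", np.round(q_star, 4))
print("optimizer   :", np.round(q_opt, 4))
assert np.allclose(q_star, q_opt, atol=1e-3)
```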
4. Main Results

We next obtain performance guarantees for the QMLP. From (10), the optimization problem of interest for determining the QMLP is the analogue of (7) in $\mathcal{L}(\mathcal{H})$, i.e.,
$$ \inf_{\sigma\in\Sigma} D(\rho\|\sigma), \tag{11} $$
where $\Sigma$ is a non-empty compact subset of $\mathcal{S}(\mathcal{H})$. Note that when $\Sigma$ is also convex, (11) is a convex optimization problem, owing to the joint convexity of quantum relative entropy in its arguments (see e.g. Wilde [2017]). In the following, we characterize some information-geometric and statistical properties of the QMLP.

4.1. Geometric Properties of the Quantum Maximum Likelihood Predictor. Solving (11) involves optimization of the quantum relative entropy over a class of density operators. It is of interest to determine conditions under which this optimization reduces to its relatively simpler classical analogue given in (7). We show that this indeed happens when $\Sigma$ has a sufficiently rich structure. To state the condition precisely, let $E_\rho := \{|e_i(\rho)\rangle\}_{i=1}^\infty$ denote an orthonormal eigenbasis of $\rho$, and let $P_i(\rho) := |e_i(\rho)\rangle\langle e_i(\rho)|$ denote the orthogonal projection corresponding to $|e_i(\rho)\rangle$. For $\Sigma\subseteq\mathcal{S}(\mathcal{H})$, define the following set, obtained by pinching the elements of $\Sigma$ by the rank-one orthogonal projections $\{P_i(\rho)\}_{i=1}^\infty$ (this is slightly different from standard pinching, where projectors of similar eigenvalues are combined; e.g., Hayashi [2002]):
$$ \mathcal{J}(E_\rho,\Sigma) := \left\{ \sum_{i=1}^\infty P_i(\rho)\,\sigma\,P_i(\rho) : \sigma\in\Sigma \right\}. \tag{12} $$
It is easy to verify that the operators in $\mathcal{J}(E_\rho,\Sigma)$ commute with $\rho$. Next, we call a set $\Sigma\subseteq\mathcal{L}(\mathcal{H})$ unitarily invariant if for all $\sigma\in\Sigma$ and unitaries $U$, $U\sigma U^\dagger\in\Sigma$. We have the following proposition, which provides conditions on $\Sigma$ under which the QMLP reduces to an MLP.

Proposition 1 (Relation between MLP and QMLP). Let $\rho\in\mathcal{S}(\mathcal{H})$, $E_\rho$ be any orthonormal eigenbasis of $\rho$, and $\Sigma\subseteq\mathcal{S}(\mathcal{H})$ be a non-empty set. If $\mathcal{J}(E_\rho,\Sigma)\subseteq\Sigma$, then
$$ \inf_{\sigma\in\Sigma} D(\rho\|\sigma) = \inf_{\sigma\in\mathcal{J}(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma). \tag{13} $$
If $\Sigma$ is also unitarily invariant, then the above terms also equal $\inf_{\sigma\in\Sigma} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma)$.

The proof of Proposition 1, given below in Section 5.1, is based on elementary arguments involving the data-processing inequality for quantum relative entropy, quantum measurements, and the unitary invariance of $\Sigma$.
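A small numerical sketch of the pinching map in (12) (our own illustration): it pinches a state $\sigma$ by the rank-one eigenprojections of $\rho$ and checks that the result is again a state, commutes with $\rho$, and can only decrease the relative entropy from $\rho$, consistent with the $\ge$ direction of (13).

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(4)
d = 4

def random_state(d, reg=0.1):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T + reg * np.eye(d)
    return rho / np.trace(rho).real

def qre(rho, sigma):
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

rho, sigma = random_state(d), random_state(d)

# Rank-one eigenprojections P_i(rho) from an eigenbasis E_rho of rho.
evals, V = np.linalg.eigh(rho)

# Pinching of sigma by {P_i(rho)}: sum_i P_i sigma P_i, i.e. keep only the
# diagonal of sigma in the eigenbasis of rho (cf. (12)).
diag = np.real(np.diag(V.conj().T @ sigma @ V))
sigma_pinched = (V * diag) @ V.conj().T

# The pinched operator is a state and commutes with rho.
assert np.isclose(np.trace(sigma_pinched).real, 1.0)
assert np.allclose(rho @ sigma_pinched, sigma_pinched @ rho)

# Data-processing: pinching in E_rho can only decrease D(rho || .),
# and the value equals the classical KL between the eigenvalue pmfs.
kl = np.sum(evals * np.log(evals / diag))
assert qre(rho, sigma_pinched) <= qre(rho, sigma) + 1e-10
assert np.isclose(qre(rho, sigma_pinched), kl)
print("D(rho||sigma) =", qre(rho, sigma), " D(rho||pinched) =", kl)
```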
Proposition 1 has an interesting interpretation from the prism of quantum information geometry (see e.g., Amari and Nagaoka [2000], Csiszár and Matus [2003], Csiszár and Shields [2004], Hayashi [2016]). To introduce the relevant notions, consider a non-empty compact convex set $\bar{\mathcal{S}}\subseteq\mathcal{S}(\mathcal{H})$. The information projection (or I-projection) of $\sigma\in\mathcal{S}(\mathcal{H})$ onto $\bar{\mathcal{S}}$ is
$$ \rho^\star_{\sigma,\bar{\mathcal{S}}} := \arg\min_{\rho\in\bar{\mathcal{S}}} D(\rho\|\sigma), \tag{14} $$
if it exists. When $\mathrm{spt}(\bar{\mathcal{S}})\subseteq\mathrm{spt}(\sigma)$, $D(\rho\|\sigma)$ is a continuous strictly convex function of $\rho$ over the compact set $\bar{\mathcal{S}}$ (in general, $D(\rho\|\sigma)$ is a lower semi-continuous convex function of its arguments; see e.g. Polyanskiy and Wu [2025]). Hence, $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ exists and is unique. We emphasize that the information projection is a minimization of relative entropy over the first argument, while the QMLP involves minimization over the second argument (which is thus a quantum reverse I-projection [Csiszár and Shields, 2004]). Although solving the QMLP and the information projection are distinct problems, we will see below that an interpretation of Proposition 1 can be obtained by combining them. The connecting link will be provided by the so-called quantum Pythagorean theorem from information geometry (see e.g., Amari and Nagaoka [2000], Hayashi [2016], Hayashi and Ito [2025]). However, we will need a more general version of this theorem than is available in the literature, to the best of our knowledge.

To state our version of the quantum Pythagorean theorem, we need to introduce a few notions. Call a set $\mathcal{S}\subseteq\mathcal{S}(\mathcal{H})$ of density operators linearly closed if $\alpha\rho+(1-\alpha)\tilde{\rho}\in\mathcal{S}$ for every $\rho,\tilde{\rho}\in\mathcal{S}$ and $\alpha\in\mathbb{R}$ such that $\alpha\rho+(1-\alpha)\tilde{\rho}\in\mathcal{S}(\mathcal{H})$. Denote the set of bounded linear operators by $\mathcal{B}(\mathcal{H})$, and let $\mathcal{C}=\{L_i\}_{i=1}^k\subseteq\mathcal{B}(\mathcal{H})$ be a finite set. For a given $\rho_0\in\mathcal{S}(\mathcal{H})$ and $\mathcal{C}$, the set of all density operators $\rho\in\mathcal{S}(\mathcal{H})$ such that $\mathrm{Tr}[\rho L_i]=\mathrm{Tr}[\rho_0 L_i]$ for $1\le i\le k$ is called a mixture or linear family generated by $\mathcal{C}$ (and $\rho_0$), and denoted by $\mathcal{M}(\rho_0,\mathcal{C})$. For a sub-density operator $\sigma_0$, i.e., $\sigma_0\ge 0$ and $\mathrm{Tr}[\sigma_0]\in[0,1]$, and $\mathcal{C}$, the set of all density operators of the form
$$ \rho = \frac{e^{\log\sigma_0+\sum_{i=1}^k \beta_i L_i}}{c(\sigma_0,\beta,\mathcal{C})}, \quad \beta=(\beta_1,\dots,\beta_k)\in\mathbb{C}^k, \quad c(\sigma_0,\beta,\mathcal{C}) := \mathrm{Tr}\big[e^{\log\sigma_0+\sum_{i=1}^k \beta_i L_i}\big], $$
is called the exponential family generated by $\mathcal{C}$ (and $\sigma_0$), and denoted by $\mathrm{Exp}(\sigma_0,\mathcal{C})$. Here, $c(\sigma_0,\beta,\mathcal{C})$ is the normalization factor that makes $\mathrm{Tr}[\rho]=1$, and $e^{\log\sigma_0+\sum_i \beta_i L_i}$ is interpreted to act as the zero operator on the kernel of $\sigma_0$. Note that $\mathcal{C}$ can contain non-self-adjoint operators, and so not every $\beta$ in the definition of the exponential family corresponds to a density operator; specifically, $\beta$ and $\mathcal{C}$ have to be such that $\sum_{i=1}^k \beta_i L_i$ is self-adjoint. Also, observe that a linearly closed set is convex by definition, and mixture families are linearly closed. On the other hand, the exponential family is in general neither closed nor linearly closed. We can now state our version of the quantum Pythagorean theorem (illustrated in Figure 1).

Theorem 1 (Quantum Pythagorean Theorem). Let $\sigma\in\mathcal{S}(\mathcal{H})$ and $\bar{\mathcal{S}}\subseteq\mathcal{S}(\mathcal{H})$ be a non-empty compact convex set. Then, the following hold:
[Figure 1. The quantum Pythagorean theorem: the I-projection $\rho^\star_{\sigma,\mathcal{M}_\sigma}$ of $\sigma$ onto the mixture family $\mathcal{M}(\rho_0,\mathcal{C})$ lies in the exponential family $\mathrm{Exp}(P_{\mathcal{M}_\sigma}\sigma P_{\mathcal{M}_\sigma}, P_{\mathcal{M}_\sigma}\mathcal{C}P_{\mathcal{M}_\sigma})$, and $D(\rho\|\sigma)$ decomposes as $D(\rho\|\rho^\star_{\sigma,\mathcal{M}_\sigma}) + D(\rho^\star_{\sigma,\mathcal{M}_\sigma}\|\sigma)$.]

(i) If $\mathrm{spt}(\bar{\mathcal{S}})\subseteq\mathrm{spt}(\sigma)$, the I-projection $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ satisfies $\mathrm{spt}(\rho^\star_{\sigma,\bar{\mathcal{S}}}) = \mathrm{spt}(\bar{\mathcal{S}})$ and
$$ D(\rho\|\sigma) \ge D\big(\rho\,\big\|\,\rho^\star_{\sigma,\bar{\mathcal{S}}}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal{S}}}\,\big\|\,\sigma\big), \quad \forall\,\rho\in\bar{\mathcal{S}}, \tag{15} $$
with equality holding in (15) if $\bar{\mathcal{S}}$ is linearly closed.

(ii) Suppose $\mathrm{spt}(\bar{\mathcal{S}})\not\subseteq\mathrm{spt}(\sigma)$. Setting $\bar{\mathcal{S}}_\sigma := \{\rho\in\bar{\mathcal{S}}: \rho\ll\sigma\}$, Part (i) holds with $\bar{\mathcal{S}}$ replaced by $\bar{\mathcal{S}}_\sigma$ if $\bar{\mathcal{S}}_\sigma$ is non-empty; in particular, $\rho^\star_{\sigma,\bar{\mathcal{S}}}\in\bar{\mathcal{S}}_\sigma$ and $\mathrm{spt}(\rho^\star_{\sigma,\bar{\mathcal{S}}}) = \mathrm{spt}(\bar{\mathcal{S}}_\sigma)$. If $\bar{\mathcal{S}}_\sigma$ is empty, then both sides of (15) are infinite and any $\rho\in\bar{\mathcal{S}}$ can be taken as $\rho^\star_{\sigma,\bar{\mathcal{S}}}$.

(iii) For a finite set $\mathcal{C}\subseteq\mathcal{B}(\mathcal{H})$, let $\mathcal{M}(\rho_0,\mathcal{C})$ be a compact mixture family generated by $\mathcal{C}$. Let $\mathcal{M}_\sigma := \mathcal{M}_\sigma(\rho_0,\mathcal{C}) := \{\rho\in\mathcal{M}(\rho_0,\mathcal{C}): \rho\ll\sigma\}$ be non-empty, and let $P_{\mathcal{M}_\sigma}$ be the projection onto the support of $\mathcal{M}_\sigma$. Then, $\rho^\star_{\sigma,\mathcal{M}_\sigma}\in\mathrm{Exp}\big(P_{\mathcal{M}_\sigma}\sigma P_{\mathcal{M}_\sigma},\, P_{\mathcal{M}_\sigma}\mathcal{C} P_{\mathcal{M}_\sigma}\big)\cap\mathcal{M}_\sigma$, and
$$ D(\rho\|\sigma) = D\big(\rho\,\big\|\,\rho^\star_{\sigma,\mathcal{M}_\sigma}\big) + D\big(\rho^\star_{\sigma,\mathcal{M}_\sigma}\,\big\|\,\sigma\big), \quad \forall\,\rho\in\mathcal{M}(\rho_0,\mathcal{C}). \tag{16} $$

(iv) Let $\rho,\sigma\in\mathcal{S}(\mathcal{H}_d)$ be such that $\rho\ll\sigma$, and let $E_\rho$ be an orthonormal eigenbasis of $\rho$. Set $\bar{\mathcal{S}}(E_\rho)$ as the set of density operators with eigenbasis $E_\rho$. Then, (16) holds with $\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)} = \sum_{i=1}^d P_i(\rho)\,\sigma\,P_i(\rho) \in \mathcal{J}(E_\rho,\{\sigma\})$.

The proof of Theorem 1 is given in Section 5.2 below. It extends the proof of Csiszár and Shields [2004, Theorem 3.2] to the non-commutative setting, carefully accounting for support conditions. By specializing Part (i) to the commutative setting with $0<\sigma\in\mathcal{S}(\mathcal{H}_d)$, the classical result, as stated in Csiszár and Shields [2004, Theorem 3.2], is recovered. Moreover, Part (iii) considers more general exponential and mixture families, generated by bounded linear operators, than we are aware of in the literature. This generalization is used in the proof of Part (iv), which involves exponential and mixture families generated by the bounded (not necessarily self-adjoint) class $\mathcal{C} = \{|e_i(\rho)\rangle\langle e_j(\rho)|\}_{1\le i\ne j\le d}$, where $\{e_i(\rho)\}_{i=1}^\infty$ denotes an orthonormal eigenbasis of $\rho$. It is also interesting that the I-projection of $\sigma\in\mathcal{S}(\mathcal{H}_d)$ onto $\bar{\mathcal{S}}(E_\rho)$ is simply characterized as the pinched operator $\sum_{i=1}^d P_i(\rho)\,\sigma\,P_i(\rho)$.

We can now provide an interpretation of Proposition 1 in terms of Theorem 1 when $\mathcal{H}$ is finite dimensional. To this end, consider the interesting case where $\Sigma_\rho := \{\sigma\in\Sigma: \rho\ll\sigma\}$ is non-empty (otherwise, both sides of (28) are infinite). The idea is to consider any $\sigma\in\Sigma_\rho$ and apply the quantum Pythagorean theorem, with equality in (15), to $\bar{\mathcal{S}} = \bar{\mathcal{S}}(E_\rho)$, i.e., the set of all density operators with the same eigenbasis as $\rho$, which is a compact convex set. This implies that $D(\rho\|\sigma)$ is lower bounded by $D\big(\rho\,\big\|\,\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\big) = D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}}\big)$ for $\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\in\mathcal{J}(E_\rho,\{\sigma\})$, where the equality is because $\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}$ and $\rho$ commute. Hence $\inf_{\sigma\in\Sigma} D(\rho\|\sigma) \ge \inf_{\sigma\in\mathcal{J}(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma)$. The opposite inequality is obtained from Theorem 1 (iv) and the condition $\mathcal{J}(E_\rho,\Sigma)\subseteq\Sigma$, and both inequalities show (13).
Given (13), the last claim in Proposition 1 can be easily argued using the unitary invariance of $\Sigma$. See details in Section 5.3 below.

4.2. Statistical Properties of the Quantum Maximum Likelihood Predictor. Having seen some information-geometric aspects of the QMLP, we next study its statistical behaviour. The following technical proposition will play a key role towards this purpose.

Proposition 2 (QMLP distance bound). Let $\rho,\tilde{\rho}\in\mathcal{S}(\mathcal{H})$ and $\Sigma\subseteq\mathcal{S}(\mathcal{H})$ be compact. Let
$$ \sigma^\star := \arg\min_{\sigma\in\Sigma} D(\rho\|\sigma) \quad\text{and}\quad \tilde{\sigma}^\star := \arg\min_{\sigma\in\Sigma} D(\tilde{\rho}\|\sigma). \tag{17} $$
If $\Sigma$ satisfies $D(\rho\|\sigma^\star)\le\epsilon$ for some $\epsilon\ge 0$, then
$$ \|\tilde{\sigma}^\star-\sigma^\star\|_1^2 \le 4\big(D(\tilde{\rho}\|\sigma^\star) - D(\rho\|\sigma^\star)\big) + 4\,\|\tilde{\rho}-\rho\|_1^2 + 12\,\epsilon, \tag{18a} $$
$$ \|\tilde{\sigma}^\star-\rho\|_1 \le \|\tilde{\sigma}^\star-\sigma^\star\|_1 + \sqrt{2\epsilon}. \tag{18b} $$
Additionally, if $\Sigma$ is unitarily invariant and $\mathcal{J}(E_\rho,\Sigma)\subseteq\Sigma$ (see (12)) for some orthonormal eigenbasis $E_\rho$ of $\rho$, then (18a) holds with $D(\tilde{\rho}\|\sigma^\star) - D(\rho\|\sigma^\star)$ replaced by $D_{\mathrm{KL}}(\lambda_{\tilde{\rho}}\|\lambda_{\sigma^\star}) - D_{\mathrm{KL}}(\lambda_\rho\|\lambda_{\sigma^\star})$.

Proposition 2, proved in Section 5.4, is based on elementary arguments involving the quantum Pinsker inequality (see, e.g., Hayashi [2016, Equation 3.53]). We will use it to prove Theorem 2, which provides non-asymptotic statistical performance guarantees for the QMLP. To this end, we specialize to embeddings into $\mathcal{H}_d$ for some $d\in\mathbb{N}$. Embeddings into finite-dimensional Hilbert spaces are practically important, since LLMs and QLLMs are based on such embeddings. For $\rho,\sigma>0$, let the Thompson metric [Thompson, 1963] on the space of positive density operators be
$$ T(\rho,\sigma) := \log\Big( \big\|\sigma^{-1/2}\rho\,\sigma^{-1/2}\big\| \vee \big\|\rho^{-1/2}\sigma\,\rho^{-1/2}\big\| \Big). $$
Note that $T(\rho,\sigma)=0$ if and only if $\rho=\sigma$.

For technical reasons, we will consider a slightly perturbed version of $\hat{\rho}_n(x^n)$, obtained using the maximally mixed state $\pi_d$ on $\mathcal{H}_d$, such that the perturbation vanishes asymptotically with $n$. Accordingly, define
$$ \rho_n(x^n) := \Big(1-\frac{1}{n}\Big)\hat{\rho}_n(x^n) + \frac{1}{n}\pi_d, \tag{19} $$
where $\hat{\rho}_n(x^n)$ is as defined in (9), and set
$$ \hat{\sigma}^\star_n(x^n) := \Big(1-\frac{1}{n}\Big)\sigma^\star_n(x^n) + \frac{1}{n}\pi_d \quad\text{with}\quad \sigma^\star_n(x^n) := \arg\min_{\sigma\in\Sigma} D(\rho_n(x^n)\|\sigma). \tag{20} $$
Note that the definition in (19) involves a perturbation of $\hat{\rho}_n(x^n)$ with $\pi_d$, which requires knowledge of $\mathcal{H}_d$. However, this can be determined given the embedding $\varphi$ and knowledge of the support of $P$, as the span of $\{\varphi(x): p(x)>0,\ x\in\mathcal{X}\}$. The next theorem establishes consistency and convergence properties of the QMLP.

Theorem 2 (Consistency and concentration of QMLP). Let $\rho_p>0$ and $\rho_n(x^n)>0$ be density operators on $\mathcal{H}_d$ as defined in (8) and (19), respectively. Suppose $\Sigma$ is a compact convex set with $\mathrm{spt}(\Sigma)=\mathcal{H}_d$ such that $\sigma^\star_p := \arg\min_{\sigma\in\Sigma} D(\rho_p\|\sigma)$ satisfies $D(\rho_p\|\sigma^\star_p)\le\epsilon$. Set
$$ b_n := \log\big\|\sigma^{\star\,-1}_p\big\| \vee \log(dn) \vee T(\rho_p,\sigma^\star_p) \quad\text{and}\quad \bar{b}_n := \log(dn) \vee \log\big\|\rho_p^{-1}\big\|. \tag{21} $$
Then, with $\sigma^\star_n(x^n)$ and $\hat{\sigma}^\star_n(x^n)$ as defined in (20), the following hold for $X^n\sim P^{\otimes n}$:

(i) Performance guarantees under (squared) trace norm:
$$ \mathbb{E}\big[\|\sigma^\star_n(X^n)-\rho_p\|_1^2\big] \le \begin{cases} 16\,d\,(b_n+4)\big(\sqrt{2\log(2d)}+\sqrt{\pi/2}\big)\,n^{-\frac{1}{2}} + 16\,b_n\,n^{-1} + 64\,n^{-2} + 28\,\epsilon, & \text{if } \epsilon>0,\\[4pt] \big(32\,d\,\|\rho_p^{-1}\| + 8\,d^2\big)\Big(\dfrac{8}{n^2}+\dfrac{28\,d}{n}\Big), & \text{if } \epsilon=0. \end{cases} \tag{22} $$
Moreover, for all $t\ge 0$, we have
$$ \mathbb{P}\Big( \|\sigma^\star_n(X^n)-\rho_p\|_1^2 \ge \frac{16\,b_n}{n} + \frac{64}{n^2} + 28\,\epsilon + 8\,d\,(b_n+4)\,t \Big) \le 4\,d\,e^{-\big(\frac{n t^2}{4}\wedge\frac{3 n t}{4}\big)}, \quad \text{if } \epsilon>0, \tag{23a} $$
$$ \mathbb{P}\Big( \|\sigma^\star_n(X^n)-\rho_p\|_1^2 \ge \big(32\,d\,\|\rho_p^{-1}\|+8\,d^2\big)\Big(\frac{8}{n^2}+2t\Big) \Big) \le 2\,d\,e^{-\big(\frac{n t}{4}\wedge\frac{3 n\sqrt{t}}{4}\big)}, \quad \text{if } \epsilon=0. \tag{23b} $$

(ii) Performance guarantees under quantum relative entropy: If $\Sigma$ is such that
$$ \mathbb{E}\big[D(\rho_n(X^n)\|\sigma^\star_n(X^n))\big] \wedge \mathbb{E}\big[D(\rho_n(X^n)\|\hat{\sigma}^\star_n(X^n))\big] \le \epsilon, \tag{24} $$
then
$$ \mathbb{E}\big[D(\rho_p\|\hat{\sigma}^\star_n(X^n))\big] \le \begin{cases} (2\bar{b}_n+\log d)\,n^{-1} + d\,\bar{b}_n\big(\sqrt{2\log(2d)}+\sqrt{\pi/2}\big)\,n^{-\frac{1}{2}} + \epsilon, & \text{if } \epsilon>0,\\[4pt] \dfrac{28\,d^2\,\|\rho_p^{-1}\| + 2\log d}{n}, & \text{if } \epsilon=0. \end{cases} \tag{25} $$
If $\Sigma$ also satisfies
$$ D(\rho_n(X^n)\|\sigma^\star_n(X^n)) \wedge D(\rho_n(X^n)\|\hat{\sigma}^\star_n(X^n)) \le \epsilon \quad \text{almost surely}, \tag{26} $$
then for all $t\ge 0$,
$$ \mathbb{P}\Big( D(\rho_p\|\hat{\sigma}^\star_n(X^n)) \ge \frac{2\bar{b}_n}{n} + \frac{\log d}{n} + \epsilon + d\,\bar{b}_n\,t \Big) \le 2\,d\,e^{-\big(\frac{n t^2}{4}\wedge\frac{3 n t}{4}\big)}, \quad \text{if } \epsilon>0, \tag{27a} $$
$$ \mathbb{P}\Big( D(\rho_p\|\hat{\sigma}^\star_n(X^n)) \ge 2\,d\,t\,\|\rho_p^{-1}\| + \frac{2\log d}{n} \Big) \le 2\,d\,e^{-\big(\frac{n t}{4}\wedge\frac{3 n\sqrt{t}}{4}\big)}, \quad \text{if } \epsilon=0. \tag{27b} $$

The proof of Theorem 2 in Section 5.5 combines the variational form of quantum relative entropy (2) with Proposition 2 to control $\|\sigma^\star_n(X^n)-\rho_p\|_1$ and $D(\rho_p\|\hat{\sigma}^\star_n(X^n))$, and then applies matrix Bernstein/Hoeffding bounds (Theorem 3).

A few remarks are in order. In a finite-dimensional Hilbert space $\mathcal{H}_d$, a set $\Sigma$ satisfying the conditions in Theorem 2 exists. For instance, $\Sigma = \mathcal{S}(\mathcal{H}_d)$ is a convex compact set satisfying (24) and (26). More generally, any $\epsilon$-covering of $\mathcal{S}(\mathcal{H}_d)$, with quantum relative entropy as the measure of separation, suffices (see [Tang, 2022] for relevant covering number bounds in the classical case). Thus, a smaller $\epsilon$ can be achieved by increasing the expressivity of the quantum models. Moreover, the assumption $\rho_p>0$ is not very restrictive, since $\hat{\rho}_n(X^n)\ll\rho_p$ almost surely (the complementary event requires $X_i\notin\mathrm{spt}(P)$ for some $1\le i\le n$, which has probability zero), and the support of $\Sigma$ can be restricted to the support of $\rho_p$, which can be determined via knowledge of $\mathrm{spt}(P)$ given an embedding. Next, observe that (22) implies a parametric rate of convergence with respect to $n$ for the QMLP to the target distribution (up to the approximation factor $\epsilon$) when $\epsilon>0$, and an $O_d(1/n)$ rate of convergence when $\epsilon=0$. Hence, there is no curse of dimensionality in the convergence rate. Also, note that (25) and (27), combined with Pinsker's inequality, yield the analogues of (22) and (23), respectively, with $\|\sigma^\star_n(X^n)-\rho_p\|_1$ replaced by $\|\hat{\sigma}^\star_n(X^n)-\rho_p\|_1$. Finally, by the data-processing inequality for trace norm and quantum relative entropy, the claims in Theorem 2 also hold for the respective measured quantities, and hence also for the output distributions of the LLM.

Theorem 2 provides insights on certain properties that a good embedding should satisfy. Specifically, a good embedding $\varphi$ should be such that the minimal eigenvalue of $\rho_p$ is as high as possible, subject to satisfactory prediction performance. As alluded to in Section 3.1, this occurs when neighboring symbols are embedded into closely aligned elements in the Hilbert space, so that the minimal eigenvalue of $\rho_p$ is boosted up.
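To get a feel for the rates in Theorem 2, the following small simulation (our own, with illustrative parameters) tracks the trace-norm error of the perturbed embedded empirical state (19) as the prompt length $n$ grows; with $\Sigma=\mathcal{S}(\mathcal{H}_d)$ the QMLP equals $\rho_n$ itself (the $\epsilon=0$ case), so this is exactly the quantity the theorem controls.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, d = 200, 6

# Hypothetical unit-norm embedding and source distribution (illustrative).
phi = rng.normal(size=(vocab_size, d)) + 1j * rng.normal(size=(vocab_size, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)
p = rng.dirichlet(np.ones(vocab_size))

rho_p = np.einsum('x,xi,xj->ij', p, phi, phi.conj())
pi_d = np.eye(d) / d

def trace_norm(A):
    return np.abs(np.linalg.eigvalsh(A)).sum()  # Schatten-1 norm of Hermitian A

# With Sigma = S(H_d), the QMLP is rho_n itself (epsilon = 0 in Theorem 2),
# so ||sigma_n^* - rho_p||_1 = ||rho_n - rho_p||_1.
for n in [10, 100, 1000, 10000]:
    errs = []
    for _ in range(20):  # Monte Carlo average over prompts X^n ~ P^n
        x = rng.choice(vocab_size, size=n, p=p)
        rho_hat = np.einsum('ki,kj->ij', phi[x], phi[x].conj()) / n
        rho_n = (1 - 1 / n) * rho_hat + pi_d / n   # perturbation (19)
        errs.append(trace_norm(rho_n - rho_p))
    print(f"n = {n:6d}   E||rho_n - rho_p||_1 ~ {np.mean(errs):.4f}")
```

The printed errors decay roughly like $1/\sqrt{n}$, with no dependence on the vocabulary size, only on the embedding dimension $d$.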
In contrast, classical information-theoretic analyses for prediction obtain performance guarantees by considering only the source probability distribution, without taking account of similarity or context, which corresponds to the scenario of an orthogonal embedding. This could possibly explain the success of architectures such as the transformer from a high-level perspective, since non-orthogonal embeddings into Hilbert spaces naturally capture the degree of similarity via inner products.

5. Proofs

5.1. Proof of Proposition 1. Note that performing a quantum measurement in the eigenbasis $E_\rho$ of $\rho$ yields, for $\rho$ and $\sigma$, the pmfs $\lambda_\rho$ and $\lambda_{\sigma'}$ for $\sigma' = \sum_{i=1}^\infty P_i(\rho)\,\sigma\,P_i(\rho)$. Then, the data-processing inequality for quantum relative entropy yields $D(\rho\|\sigma) \ge D_{\mathrm{KL}}(\lambda_\rho\|\lambda_{\sigma'})$. Hence
$$ \inf_{\sigma\in\Sigma} D(\rho\|\sigma) \ge \inf_{\sigma\in\mathcal{J}(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma). \tag{28} $$
Since the opposite inequality holds when $\mathcal{J}(E_\rho,\Sigma)\subseteq\Sigma$ (by using that the infimum can only increase when taken over a subset), (13) follows.

Let $\Sigma(E_\rho)$ denote the set of density operators in $\Sigma$ for which $E_\rho$ is an eigenbasis. If $\mathcal{J}(E_\rho,\Sigma)\subseteq\Sigma$ and $\Sigma$ is unitarily invariant, then the last claim in the proposition holds, since
$$ \inf_{\sigma\in\Sigma} D(\rho\|\sigma) \le \inf_{\sigma\in\Sigma(E_\rho)} D(\rho\|\sigma) = \inf_{\sigma\in\Sigma(E_\rho)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma) = \inf_{\sigma\in\Sigma} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma) \le \inf_{\sigma\in\mathcal{J}(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma), $$
and the quantities at the left and right extremes are equal by (13). In the above equation, only the penultimate equality is non-trivial, i.e., $\inf_{\sigma\in\Sigma(E_\rho)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma) = \inf_{\sigma\in\Sigma} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma)$. The $\ge$ direction is straightforward, as $\Sigma(E_\rho)\subseteq\Sigma$, while the $\le$ direction follows from the fact that for any $\sigma\in\Sigma$, there exists another $\sigma'\in\Sigma(E_\rho)$ with the same eigenvalues $\lambda_\sigma$, since $\Sigma$ is unitarily invariant.

5.2. Proof of Theorem 1. While the proof closely follows that of Csiszár and Shields [2004, Theorem 3.2], certain essential modifications are required to generalize to the non-commutative setting. Consider the proof of Part (i). Assume that $\bar{\mathcal{S}}$ is convex and $\mathrm{spt}(\bar{\mathcal{S}})\subseteq\mathrm{spt}(\sigma)$. Then, $D(\rho\|\sigma)$ is a continuous strictly convex function of $\rho$ for $\rho\in\bar{\mathcal{S}}$. Hence, $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ as defined in (14) exists (as $\bar{\mathcal{S}}$ is compact), and is unique. Fix an arbitrary $\rho\in\bar{\mathcal{S}}$. For $t\in[0,1]$, let $\rho_t = (1-t)\rho^\star_{\sigma,\bar{\mathcal{S}}} + t\rho \in \bar{\mathcal{S}}$. Since $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ must satisfy
$$ \frac{D(\rho_t\|\sigma) - D\big(\rho^\star_{\sigma,\bar{\mathcal{S}}}\,\big\|\,\sigma\big)}{t} \ge 0, \quad \forall\,t\in(0,1), \tag{29} $$
we have, by the mean value theorem, that there exists $\tilde{t}\in(0,t)$ such that
$$ \frac{d}{dt} D(\rho_t\|\sigma)\Big|_{t=\tilde{t}} \ge 0. \tag{30} $$
Let $f_1(t) := \mathrm{Tr}[\rho_t\log\rho_t]$ and $f_2(t) := \mathrm{Tr}[\rho_t\log\sigma]$.
Denoting the derivatives of $f_1$ and $f_2$ by $f_1'$ and $f_2'$, respectively, we have
$$ f_1'(\tilde{t}) = \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})\log\rho_{\tilde{t}}\big] + \mathrm{Tr}\Big[\rho_{\tilde{t}}\,\frac{d}{dt}\log\rho_t\Big|_{t=\tilde{t}}\Big] = \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})\log\rho_{\tilde{t}}\big], \tag{31a} $$
$$ f_2'(\tilde{t}) = \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})\log\sigma\big], \tag{31b} $$
where we used
$$ \mathrm{Tr}\Big[\rho_{\tilde{t}}\,\frac{d}{dt}\log\rho_t\Big|_{t=\tilde{t}}\Big] \overset{(a)}{=} \mathrm{Tr}\Big[\rho_{\tilde{t}} \int_0^\infty \big(\tau I+\rho_{\tilde{t}}\big)^{-1}\big(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}}\big)\big(\tau I+\rho_{\tilde{t}}\big)^{-1}\,d\tau\Big] \overset{(b)}{=} \mathrm{Tr}\Big[\big(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}}\big)\,\rho_{\tilde{t}} \int_0^\infty \big(\tau I+\rho_{\tilde{t}}\big)^{-2}\,d\tau\Big] \overset{(c)}{=} \mathrm{Tr}\big[\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}}\big] \overset{(d)}{=} 0, $$
where in (a) we used an integral expression for the logarithm of an operator $A>0$:
$$ \log A = \int_0^\infty \Big( \frac{1}{\tau+1}\,I - (\tau I+A)^{-1} \Big)\,d\tau; \tag{32} $$
(b) follows by cyclicity of the trace, (c) is because the integral evaluates to $\rho_{\tilde{t}}^{-1}$, and (d) is because $\mathrm{Tr}[\rho] = \mathrm{Tr}[\rho^\star_{\sigma,\bar{\mathcal{S}}}] = 1$. Hence
$$ \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})(\log\rho_{\tilde{t}}-\log\sigma)\big] \ge 0, \quad \forall\,\tilde{t}>0. $$
Note that if $\rho\not\ll\rho^\star_{\sigma,\bar{\mathcal{S}}}$, then the LHS above tends to $-\infty$ as $\tilde{t}\downarrow 0$, thus ruling out this possibility. Hence, $\rho\ll\rho^\star_{\sigma,\bar{\mathcal{S}}}$ for all $\rho\in\bar{\mathcal{S}}$, implying that $\mathrm{spt}(\rho^\star_{\sigma,\bar{\mathcal{S}}}) = \mathrm{spt}(\bar{\mathcal{S}})$. Taking the limit $\tilde{t}\downarrow 0$, we obtain that
$$ \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})(\log\rho^\star_{\sigma,\bar{\mathcal{S}}}-\log\sigma)\big] \ge 0, \quad \forall\,\rho\in\bar{\mathcal{S}}, $$
which is equivalent to (15). Now, consider the case where $\bar{\mathcal{S}}$ is also linearly closed. Then, for any $\rho\in\bar{\mathcal{S}}$, there exists $t_-<0$ (depending on $\rho$) such that for all $t\in[t_-,1]$, $\rho_t = (1-t)\rho^\star_{\sigma,\bar{\mathcal{S}}} + t\rho \in \bar{\mathcal{S}}$. Hence, $D(\rho_t\|\sigma)$ is differentiable at $t=0$, and the optimizer $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ given in (14) must satisfy
$$ 0 = \frac{d}{dt} D(\rho_t\|\sigma)\Big|_{t=0} = \mathrm{Tr}\big[(\rho-\rho^\star_{\sigma,\bar{\mathcal{S}}})(\log\rho^\star_{\sigma,\bar{\mathcal{S}}}-\log\sigma)\big], \quad \forall\,\rho\in\bar{\mathcal{S}}, \tag{33} $$
which is equivalent to equality in (15).

To prove Part (ii), note that $D(\rho\|\sigma)$ is infinite if $\rho\not\ll\sigma$. Hence, if $\bar{\mathcal{S}}_\sigma$ is empty, the minimum in (14) is infinite and $\rho^\star_{\sigma,\bar{\mathcal{S}}}$ can be chosen to be any $\rho\in\bar{\mathcal{S}}$. In this case, both sides of (15) are infinite.
On the other hand, if $\bar{\mathcal{S}}_\sigma$ is non-empty, then $\rho^\star_{\sigma,\bar{\mathcal{S}}}\in\bar{\mathcal{S}}_\sigma$. Note that $\bar{\mathcal{S}}_\sigma$ is also compact and convex, being the intersection of two compact convex sets. Moreover, if $\bar{\mathcal{S}}$ is linearly closed, so is $\bar{\mathcal{S}}_\sigma$. Hence, by applying Part (i) with $\bar{\mathcal{S}}_\sigma$ in place of $\bar{\mathcal{S}}$, (15) holds for any $\rho\in\bar{\mathcal{S}}_\sigma$. Also, for any $\rho\in\bar{\mathcal{S}}\setminus\bar{\mathcal{S}}_\sigma$, both sides of (15) are simultaneously infinite.

Next, we prove Part (iii). We will denote $\mathcal{M}(\rho_0,\mathcal{C})$ by $\mathcal{M}$ and $\mathcal{M}_\sigma(\rho_0,\mathcal{C})$ by $\mathcal{M}_\sigma$ for simplicity. Note that since $\mathcal{M}$ is compact and linearly closed, so is $\mathcal{M}_\sigma$. Applying Part (ii), we obtain that the optimizer $\rho^\star_{\sigma,\mathcal{M}_\sigma}\in\mathcal{M}_\sigma$ with $\mathrm{spt}(\rho^\star_{\sigma,\mathcal{M}_\sigma}) = \mathrm{spt}(\mathcal{M}_\sigma)$, since $\mathcal{M}_\sigma$ is assumed to be non-empty. Hence, the proof will be completed if we show that
$$ \rho^\star_{\sigma,\mathcal{M}_\sigma} \in \mathrm{Exp}\big(P_{\mathcal{M}_\sigma}\sigma P_{\mathcal{M}_\sigma},\, P_{\mathcal{M}_\sigma}\mathcal{C} P_{\mathcal{M}_\sigma}\big), \tag{34} $$
where $P_{\mathcal{M}_\sigma}$ is the projection onto the support of $\mathcal{M}_\sigma$. To see this, let $\alpha_i := \mathrm{Tr}[\rho_0 L_i]$ and note that every $\rho\in\mathcal{M}_\sigma$ must satisfy $\mathrm{Tr}[\rho(L_i-\alpha_i I)] = 0$. Since $\mathrm{spt}(\rho)\subseteq\mathrm{spt}(\mathcal{M}_\sigma)$, this implies, by cyclicity of the trace, that
$$ \mathrm{Tr}\big[\rho\,P_{\mathcal{M}_\sigma}(L_i-\alpha_i I)P_{\mathcal{M}_\sigma}\big] = 0, \tag{35} $$
where $I$ denotes the identity operator. Also, we have similarly from (33) that
$$ \mathrm{Tr}\Big[\rho\,P_{\mathcal{M}_\sigma}\big(\log\rho^\star_{\sigma,\mathcal{M}_\sigma} - \log\sigma - D(\rho^\star_{\sigma,\mathcal{M}_\sigma}\|\sigma)\,I\big)P_{\mathcal{M}_\sigma}\Big] = 0. \tag{36} $$
Let $\mathcal{A}$ denote the subspace spanned by $\{P_{\mathcal{M}_\sigma}(L_i-\alpha_i I)P_{\mathcal{M}_\sigma}\}_{i=1}^k$. Note that (35) means that $\rho\in\mathcal{M}_\sigma$ belongs to the orthogonal complement of $\mathcal{A}$ in $\mathrm{spt}(\mathcal{M}_\sigma)$, denoted by $\mathcal{A}^\perp$. Moreover, since $\mathrm{spt}(\rho^\star_{\sigma,\mathcal{M}_\sigma}) = \mathrm{spt}(\mathcal{M}_\sigma)$, density operators within $\mathcal{M}_\sigma$ span the real vector space of self-adjoint operators whose support is contained in $\mathcal{A}^\perp$. This can be seen by considering an orthogonal self-adjoint basis for this space, adding a scalar multiple of $\rho^\star_{\sigma,\mathcal{M}_\sigma}$ to each of its elements to obtain a set of positive operators on $\mathrm{spt}(\mathcal{M}_\sigma)$, and normalizing each such operator by the sum of its eigenvalues, to yield the desired basis of density operators (including $\rho^\star_{\sigma,\mathcal{M}_\sigma}$). Given this, (36) holding for all $\rho\in\mathcal{M}_\sigma$ implies that
$$ P_{\mathcal{M}_\sigma}\big(\log\rho^\star_{\sigma,\mathcal{M}_\sigma} - \log\sigma - D(\rho^\star_{\sigma,\mathcal{M}_\sigma}\|\sigma)\,I\big)P_{\mathcal{M}_\sigma} = \sum_{i=1}^k \beta_i\,P_{\mathcal{M}_\sigma}(L_i-\alpha_i I)P_{\mathcal{M}_\sigma}, $$
for some $\beta\in\mathbb{C}^k$. Noting that $\mathrm{spt}(\rho^\star_{\sigma,\mathcal{M}_\sigma}) = \mathrm{spt}(\mathcal{M}_\sigma)$, we obtain
$$ \rho^\star_{\sigma,\mathcal{M}_\sigma} = c\,P_{\mathcal{M}_\sigma}\,e^{\log\sigma+\sum_{i=1}^k \beta_i L_i}\,P_{\mathcal{M}_\sigma} = c\,e^{\log(P_{\mathcal{M}_\sigma}\sigma P_{\mathcal{M}_\sigma})+\sum_{i=1}^k \beta_i P_{\mathcal{M}_\sigma} L_i P_{\mathcal{M}_\sigma}}, $$
for some $c$, which implies (34).

Finally, we prove Part (iv). Let $\rho\in\mathcal{S}(\mathcal{H}_d)$ be such that $\rho\ll\sigma$, and let $E_\rho := \{e_i(\rho)\}_{i=1}^\infty$ denote an orthonormal eigenbasis of $\rho$. For $1\le i,j\le d$, let $L_{i,j} = |e_i(\rho)\rangle\langle e_j(\rho)|$ and $\mathcal{C} = \{L_{i,j}: 1\le i\ne j\le d\}$. Consider the mixture family
$$ \mathcal{M} := \mathcal{M}(\rho,\mathcal{C}) := \big\{\rho'\in\mathcal{S}(\mathcal{H}_d): \mathrm{Tr}[\rho' L_{i,j}] = \mathrm{Tr}[\rho L_{i,j}] = 0\big\}, $$
generated by $\mathcal{C}$ and $\rho$, and set $\mathcal{M}_\sigma := \mathcal{M}_\sigma(\rho,\mathcal{C}) := \{\rho'\in\mathcal{M}(\rho,\mathcal{C}): \rho'\ll\sigma\}$, the set of density operators with eigenbasis $E_\rho$ whose support is contained in that of $\sigma$. Observe that $\mathcal{M} = \bar{\mathcal{S}}(E_\rho)$ and $\mathcal{M}_\sigma = \bar{\mathcal{S}}_\sigma(E_\rho)$. Further, note that $\mathcal{M}_\sigma$ is non-empty and $\mathrm{spt}(\mathcal{M}_\sigma) = \mathrm{spt}(\sigma)$. Hence, we obtain from Part (iii) that $\rho^\star_{\sigma,\mathcal{M}} = \rho^\star_{\sigma,\mathcal{M}_\sigma} \in \mathcal{M}_\sigma\cap\mathrm{Exp}(\sigma,\mathcal{C})$. Hence
$$ \rho^\star_{\sigma,\mathcal{M}_\sigma} = e^{\log\sigma+\sum_{1\le i\ne j\le d}\beta_{i,j}|e_i(\rho)\rangle\langle e_j(\rho)|}, $$
where $\beta_{i,j} = -\langle e_i(\rho)|\log\sigma|e_j(\rho)\rangle$ for all $i,j$ such that $e_i(\rho),e_j(\rho)\in\mathrm{spt}(\sigma)$, and $0$ otherwise. Hence, the off-diagonal entries in the matrix representation of $\rho^\star_{\sigma,\mathcal{M}_\sigma}$ in the orthonormal basis $E_\rho$ are zero. In other words,
$$ \rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)} = \rho^\star_{\sigma,\mathcal{M}_\sigma} = \sum_{i=1}^d P_i(\rho)\,\sigma\,P_i(\rho) \in \mathcal{J}(E_\rho,\{\sigma\}). $$

5.3. Interpretation of Proposition 1 in terms of the Quantum Pythagorean Theorem. Let $\Sigma_\rho := \{\sigma\in\Sigma: \rho\ll\sigma\}$. If $\Sigma_\rho$ is empty, then the left-hand side of (13) is infinite. The right-hand side of (13) is also infinite. This can be seen by noting that $\Sigma_\rho$ being empty means that for any $\sigma\in\Sigma$, there exists a $P_i(\rho)$ such that $\mathrm{Tr}[P_i(\rho)\sigma] = 0$ and $\mathrm{Tr}[P_i(\rho)\rho] > 0$, which implies that $\mathrm{Tr}[P_i(\rho)\sigma P_i(\rho)] = 0$. Hence $\rho\not\ll\sum_i P_i(\rho)\,\sigma\,P_i(\rho)$ and $\lambda_\rho\not\ll\lambda_\sigma$, implying the desired claim.

Now, assume that $\Sigma_\rho$ is non-empty. Observe that $\bar{\mathcal{S}}(E_\rho)$ as defined in Theorem 1 (iv), i.e., the set of all density operators that share the same eigenbasis $E_\rho$ as $\rho$, is compact and linearly closed. Fix any $\sigma\in\Sigma_\rho$ and let $\bar{\mathcal{S}}_\sigma(\rho) := \{\rho'\in\bar{\mathcal{S}}(E_\rho): \rho'\ll\sigma\}$. Since $\bar{\mathcal{S}}_\sigma(\rho)$ is also a non-empty compact linearly closed subset such that $\mathrm{spt}(\bar{\mathcal{S}}_\sigma(\rho)) = \mathrm{spt}(\sigma)$, from Theorem 1 (i) with $\bar{\mathcal{S}} = \bar{\mathcal{S}}_\sigma(\rho)$, we have $\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\in\bar{\mathcal{S}}_\sigma(\rho)$ with $\mathrm{spt}(\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}) = \mathrm{spt}(\sigma)$, and
$$ D(\rho\|\sigma) = D\big(\rho\,\big\|\,\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\,\big\|\,\sigma\big) = D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal{S}}(E_\rho)}\,\big\|\,\sigma\big). \tag{37} $$
5.3. Interpretation of Proposition 1 in terms of Quantum Pythagorean Theorem. Let $\Sigma_\rho := \{\sigma \in \Sigma : \rho \ll \sigma\}$. If $\Sigma_\rho$ is empty, then the left hand side of (13) is infinite. The right hand side of (13) is also infinite. This can be seen by noting that $\Sigma_\rho$ being empty means that for any $\sigma \in \Sigma$, there exists a $P_i(\rho)$ such that $\mathrm{Tr}[P_i(\rho)\sigma] = 0$ and $\mathrm{Tr}[P_i(\rho)\rho] > 0$, which implies that $\mathrm{Tr}[P_i(\rho)\sigma P_i(\rho)] = 0$. Hence, $\rho \not\ll \sum_i P_i(\rho)\sigma P_i(\rho)$ and $\lambda_\rho \not\ll \lambda_\sigma$, implying the desired claim.

Now, assume that $\Sigma_\rho$ is non-empty. Observe that $\bar{\mathcal S}(E_\rho)$ as defined in Theorem 1 (iv), i.e., the set of all density operators that share the same eigenbasis $E_\rho$ as that of $\rho$, is compact and linearly closed. Fix any $\sigma \in \Sigma_\rho$ and let $\bar{\mathcal S}_\sigma(\rho) := \{\rho' \in \bar{\mathcal S}(E_\rho) : \rho' \ll \sigma\}$. Since $\bar{\mathcal S}_\sigma(\rho)$ is also a non-empty compact linearly closed subset such that $\mathrm{spt}(\bar{\mathcal S}_\sigma(\rho)) = \mathrm{spt}(\sigma)$, from Theorem 1 (i) with $\bar{\mathcal S} = \bar{\mathcal S}_\sigma(\rho)$, we have $\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)} \in \bar{\mathcal S}_\sigma(\rho)$ with $\mathrm{spt}\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\big) = \mathrm{spt}(\sigma)$, and
\[
D(\rho\|\sigma) = D\big(\rho\,\big\|\,\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\,\big\|\,\sigma\big) = D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\,\big\|\,\sigma\big). \tag{37}
\]
The penultimate equality above is because $\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}$ and $\rho$ share the common eigenbasis $E_\rho$. Since quantum relative entropy is non-negative, minimizing over $\sigma \in \Sigma$ yields
\[
\inf_{\sigma\in\Sigma} D(\rho\|\sigma) = \inf_{\sigma\in\Sigma_\rho} D(\rho\|\sigma) \ge \inf_{\sigma\in\Sigma_\rho} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) \ge \inf_{\sigma\in\Sigma} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big). \tag{38}
\]
Now, assume that $\Sigma$ satisfies $\mathcal J(E_\rho, \Sigma) \subseteq \Sigma$. Then $\mathcal J(E_\rho, \Sigma_\rho) \subseteq \Sigma_\rho$, and we have
\[
\inf_{\sigma\in\Sigma} D(\rho\|\sigma) = \inf_{\sigma\in\Sigma_\rho} D(\rho\|\sigma) = \inf_{\sigma\in\Sigma_\rho}\Big\{ D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\,\big\|\,\sigma\big)\Big\}
\overset{(a)}{\le} \inf_{\sigma\in\mathcal J(E_\rho,\Sigma_\rho)}\Big\{ D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) + D\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\,\big\|\,\sigma\big)\Big\}
\overset{(b)}{=} \inf_{\sigma\in\mathcal J(E_\rho,\Sigma_\rho)} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big)
\overset{(c)}{=} \inf_{\sigma\in\Sigma_\rho} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big)
\overset{(d)}{=} \inf_{\sigma\in\Sigma} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big), \tag{39}
\]
where $(a)$ is because $\mathcal J(E_\rho,\Sigma_\rho) \subseteq \Sigma_\rho$; $(b)$ is because for $\sigma \in \mathcal J(E_\rho,\Sigma_\rho)$, $\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)} = \sigma$, $\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}} = \lambda_\sigma$, and $D\big(\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}\|\sigma\big) = 0$; $(c)$ is because $\big\{\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}} : \sigma \in \mathcal J(E_\rho,\Sigma_\rho)\big\} = \big\{\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}} : \sigma \in \Sigma_\rho\big\}$, since for any $\sigma \in \Sigma_\rho$, $\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)} \in \mathcal J(E_\rho,\Sigma_\rho)$ by Theorem 1 (iv) and $\mathcal J(E_\rho,\Sigma_\rho) \subseteq \Sigma_\rho$; and $(d)$ is because for $\sigma \in \Sigma\setminus\Sigma_\rho$, $\lambda_\rho \not\ll \lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}$, so that $D_{\mathrm{KL}}\big(\lambda_\rho\|\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) = \infty$ and such a $\sigma$ cannot achieve the infimum over $\Sigma$.

To summarize, (38) and (39) imply that when $\mathcal H$ is finite dimensional and $\mathcal J(E_\rho, \Sigma) \subseteq \Sigma$,
\[
\inf_{\sigma\in\Sigma} D(\rho\|\sigma) = \inf_{\sigma\in\Sigma} D_{\mathrm{KL}}\big(\lambda_\rho\,\big\|\,\lambda_{\rho^\star_{\sigma,\bar{\mathcal S}(E_\rho)}}\big) = \inf_{\sigma\in\mathcal J(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma),
\]
thus establishing (13). The proof of the last claim in Proposition 1 follows analogously.

5.4. Proof of Proposition 2. We have
\[
2\big(D(\tilde\rho\|\tilde\sigma^\star) - D(\rho\|\sigma^\star)\big)
\overset{(a)}{\ge} \|\tilde\rho - \tilde\sigma^\star\|_1^2 - 2\epsilon
\overset{(b)}{\ge} 0.5\,\|\tilde\sigma^\star - \sigma^\star\|_1^2 - \|\tilde\rho - \sigma^\star\|_1^2 - 2\epsilon
\overset{(c)}{\ge} 0.5\,\|\tilde\sigma^\star - \sigma^\star\|_1^2 - 2\|\tilde\rho - \rho\|_1^2 - 2\|\rho - \sigma^\star\|_1^2 - 2\epsilon
\overset{(d)}{\ge} 0.5\,\|\tilde\sigma^\star - \sigma^\star\|_1^2 - 2\|\tilde\rho - \rho\|_1^2 - 6\epsilon,
\]
where $(a)$ and $(d)$ are due to $D(\rho\|\sigma^\star) \le \epsilon$ and the quantum Pinsker inequality $2D(\rho\|\sigma) \ge \|\rho - \sigma\|_1^2$ for $\rho, \sigma \in \mathcal S(\mathcal H)$, while $(b)$ and $(c)$ follow by applying $(a+b)^2 \le 2(a^2 + b^2)$. Hence, (18a) follows since
\[
\|\tilde\sigma^\star - \sigma^\star\|_1^2 \le 4\big(D(\tilde\rho\|\tilde\sigma^\star) - D(\rho\|\sigma^\star)\big) + 4\|\tilde\rho - \rho\|_1^2 + 12\epsilon \le 4\big(D(\tilde\rho\|\sigma^\star) - D(\rho\|\sigma^\star)\big) + 4\|\tilde\rho - \rho\|_1^2 + 12\epsilon. \tag{40}
\]
Equation (18b) is a simple consequence of the triangle inequality for trace distance and the quantum Pinsker inequality, along with $D(\rho\|\sigma^\star) \le \epsilon$:
\[
\|\tilde\sigma^\star - \rho\|_1 \le \|\tilde\sigma^\star - \sigma^\star\|_1 + \|\sigma^\star - \rho\|_1 \le \|\tilde\sigma^\star - \sigma^\star\|_1 + \sqrt{2\epsilon}.
\]
Next, consider the case of unitarily invariant $\Sigma$ such that $\mathcal J(E_\rho, \Sigma) \subseteq \Sigma$. Applying (13), we have
\[
D(\rho\|\sigma^\star) = \min_{\sigma\in\Sigma} D(\rho\|\sigma) = \min_{\sigma\in\mathcal J(E_\rho,\Sigma)} D_{\mathrm{KL}}(\lambda_\rho\|\lambda_\sigma) = D_{\mathrm{KL}}(\lambda_\rho\|\lambda_{\sigma^\star}). \tag{41}
\]
On the other hand,
\[
D(\tilde\rho\|\tilde\sigma^\star) = \min_{\sigma\in\Sigma} D(\tilde\rho\|\sigma) = \min_{\sigma\in\Sigma} D_{\mathrm{KL}}(\lambda_{\tilde\rho}\|\lambda_\sigma) \le D_{\mathrm{KL}}(\lambda_{\tilde\rho}\|\lambda_{\sigma^\star}), \tag{42}
\]
where the second equality above is due to the last statement in Proposition 1, which holds since $\Sigma$ is unitarily invariant, compact, and satisfies $\mathcal J(E_\rho, \Sigma) \subseteq \Sigma$. Combining (40), (41) and (42) yields the final claim.
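Steps $(a)$ and $(d)$ above rest on the quantum Pinsker inequality; the brute-force sketch below (random test states of our own choosing) illustrates it numerically.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(2)

def rand_density(d):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    m = g @ g.conj().T
    return m / np.trace(m).real

def qre(a, b):     # D(a || b) in nats
    return np.trace(a @ (logm(a) - logm(b))).real

def tnorm(a):      # trace norm ||a||_1 of a Hermitian matrix
    return np.abs(np.linalg.eigvalsh(a)).sum()

# Quantum Pinsker: 2 D(rho || sigma) >= ||rho - sigma||_1^2 on random pairs.
for _ in range(500):
    rho, sigma = rand_density(5), rand_density(5)
    assert 2 * qre(rho, sigma) >= tnorm(rho - sigma) ** 2 - 1e-9
print("quantum Pinsker holds on all sampled pairs")
```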
5.5. Proof of Theorem 2. First, we prove Part (i). We require two concentration results. Let $\|\cdot\|$ denote the operator norm on $\mathcal L(\mathcal H)$.

Theorem 3 (Matrix Hoeffding and Bernstein inequalities, see e.g. [Vershynin, 2018]). Let $\{H_i\}_{i=1}^n$ be fixed Hermitian $d\times d$ matrices such that $\|H_i\| \le M$ for all $1 \le i \le n$, and let $\{\varepsilon_i\}_{i=1}^n$ be i.i.d. Rademacher variables. Then,
\[
\mathbb P\Big(\Big\|\sum_{i=1}^n \varepsilon_i H_i\Big\| \ge t\Big) \le 2d\, e^{-\frac{t^2}{2V_n^2}}, \quad \forall\, t \ge 0, \tag{43}
\]
\[
\mathbb E\Big[\Big\|\sum_{i=1}^n \varepsilon_i H_i\Big\|\Big] \le \Big(\sqrt{2\log(2d)} + \sqrt{\pi/2}\Big)\, V_n, \tag{44}
\]
where $V_n^2 := \big\|\sum_{i=1}^n H_i^2\big\|$. If $\{H_i\}_{i=1}^n$ are random independent zero-mean $d\times d$ Hermitian matrices such that $\|H_i\| \le M$ almost surely, then
\[
\mathbb P\Big(\Big\|\sum_{i=1}^n H_i\Big\| \ge t\Big) \le 2d\, e^{-\left(\frac{t^2}{4\bar V_n^2} \wedge \frac{3t}{4M}\right)}, \quad \forall\, t \ge 0, \tag{45}
\]
where $\bar V_n^2 := \big\|\sum_{i=1}^n \mathbb E[H_i^2]\big\|$.
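A small simulation can make the Rademacher bound (43) concrete. In the sketch below (dimensions, matrices, and thresholds are our own illustrative choices), the $H_i$ are random rank-one projectors, so $\|H_i\| = 1$, and the empirical tail of $\|\sum_i \varepsilon_i H_i\|$ is compared against $2d\,e^{-t^2/(2V_n^2)}$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, trials = 8, 200, 2000

# Fixed Hermitian H_i: random rank-one projectors, so ||H_i|| = 1.
H = np.empty((n, d, d), dtype=complex)
for i in range(n):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    v /= np.linalg.norm(v)
    H[i] = np.outer(v, v.conj())

V2 = np.linalg.norm(sum(Hi @ Hi for Hi in H), 2)  # V_n^2 = ||sum_i H_i^2||

for t in (10.0, 15.0, 20.0):
    exceed = 0
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=n)
        S = np.tensordot(eps, H, axes=1)          # sum_i eps_i H_i
        exceed += np.linalg.norm(S, 2) >= t
    # Empirical tail vs. the matrix Hoeffding bound (43).
    print(t, exceed / trials, 2 * d * np.exp(-t**2 / (2 * V2)))
```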
We proceed with the proof of (22) when $\epsilon > 0$. Note that since $\Sigma$ is a compact convex set, $\sigma^\star_p$ is unique and $\sigma^\star_p > 0$. From (18), using $(a+b)^2 \le 2(a^2+b^2)$, we have
\[
\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \le 8\big(D(\rho_n(X^n)\|\sigma^\star_p) - D(\rho_p\|\sigma^\star_p)\big) + 8\|\rho_n(X^n) - \rho_p\|_1^2 + 28\epsilon. \tag{46}
\]
To control the first term in (46), we will use the variational expression for quantum relative entropy given in (2). When $\rho, \sigma > 0$, the supremum in (2) is achieved by $H^\star(\rho,\sigma) := \log\rho - \log\sigma$. Since $\rho_p > 0$ and $\sigma^\star_p > 0$, it follows that
\[
D(\rho_p\|\sigma^\star_p) = \sup_H\, \mathrm{Tr}[\rho_p H] - \mathrm{Tr}\big[e^{H + \log\sigma^\star_p}\big] + 1, \tag{47}
\]
where the supremum is taken over $H$ such that
\[
\|H\| \le \big\|H^\star(\rho_p, \sigma^\star_p)\big\| = \max_{|v\rangle : \|v\|_2 = 1} \big|\langle v|H^\star(\rho_p,\sigma^\star_p)|v\rangle\big| = \log\bigg(\max_{|v\rangle : \|v\|_2=1} \frac{\langle v|\rho_p|v\rangle}{\langle v|\sigma^\star_p|v\rangle} \vee \max_{|v\rangle : \|v\|_2=1} \frac{\langle v|\sigma^\star_p|v\rangle}{\langle v|\rho_p|v\rangle}\bigg) \le T(\rho_p, \sigma^\star_p).
\]
Similarly,
\[
D(\rho_n(X^n)\|\sigma^\star_p) = \sup_H\, \mathrm{Tr}[\rho_n(X^n) H] - \mathrm{Tr}\big[e^{H + \log\sigma^\star_p}\big] + 1, \tag{48}
\]
where the supremum is taken over $H$ such that
\[
\|H\| \le \big\|H^\star(\rho_n(X^n), \sigma^\star_p)\big\| \le -\log\Big(\min_{z\in\mathcal Z} \lambda_{\sigma^\star_p}(z)\Big) \vee \log(dn) = \log\big\|\sigma^{\star\,-1}_p\big\| \vee \log(dn).
\]
In the above, we used that
\[
\max_{|v\rangle : \|v\|_2=1} \frac{\langle v|\rho_n(X^n)|v\rangle}{\langle v|\sigma^\star_p|v\rangle} \le \frac{1}{\min_{|v\rangle : \|v\|_2=1} \langle v|\sigma^\star_p|v\rangle} \le \frac{1}{\min_{z\in\mathcal Z} \lambda_{\sigma^\star_p}(z)}, \qquad
\max_{|v\rangle : \|v\|_2=1} \frac{\langle v|\sigma^\star_p|v\rangle}{\langle v|\rho_n(X^n)|v\rangle} \le \frac{1}{\min_{|v\rangle : \|v\|_2=1} \langle v|\rho_n(X^n)|v\rangle} \le nd.
\]
Hence, with $b_n$ as in (21), we can upper bound the first term on the RHS above as follows:
\[
\big| D(\rho_n(X^n)\|\sigma^\star_p) - D(\rho_p\|\sigma^\star_p) \big| = \bigg| \sup_{H : \|H\|\le b_n} \Big\{\mathrm{Tr}[\rho_n(X^n)H] - \mathrm{Tr}\big[e^{H+\log\sigma^\star_p}\big]\Big\} - \sup_{H : \|H\|\le b_n} \Big\{\mathrm{Tr}[\rho_p H] - \mathrm{Tr}\big[e^{H+\log\sigma^\star_p}\big]\Big\} \bigg|
\overset{(a)}{\le} \sup_{H : \|H\|\le b_n} \big|\mathrm{Tr}[\rho_n(X^n)H] - \mathrm{Tr}[\rho_p H]\big|
\overset{(b)}{\le} \Big(1 - \frac1n\Big) \sup_{H : \|H\|\le b_n} \big|\mathrm{Tr}[\hat\rho_n(X^n)H] - \mathrm{Tr}[\rho_p H]\big| + \sup_{H : \|H\|\le b_n} \Big\{\frac1n\big|\mathrm{Tr}[\pi_d H]\big| + \frac1n\big|\mathrm{Tr}[\rho_p H]\big|\Big\}
\overset{(c)}{\le} \|\hat\rho_n(X^n) - \rho_p\|_1\, b_n + \frac{2b_n}{n},
\]
where $(a)$ follows from
\[
\Big|\sup_{t\in\mathcal T} f(t) - \sup_{t\in\mathcal T} g(t)\Big| \le \sup_{t\in\mathcal T} |f(t) - g(t)|, \tag{49}
\]
for any functions $f, g$ on $\mathcal T$ when at least one of $\sup_{t\in\mathcal T} g(t)$ and $\sup_{t\in\mathcal T} f(t)$ is finite; $(b)$ follows by applying the definition of $\rho_n(X^n)$ and the triangle inequality; and $(c)$ is via Hölder's inequality. Substituting this in (46) and using $\|\rho_n(X^n) - \rho_p\|_1^2 \le 2\big(\|\hat\rho_n(X^n) - \rho_p\|_1^2 + (4/n^2)\big)$, we obtain
\[
\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \le 8b_n \|\hat\rho_n(X^n) - \rho_p\|_1 + \frac{16 b_n}{n} + 16\|\hat\rho_n(X^n) - \rho_p\|_1^2 + \frac{64}{n^2} + 28\epsilon. \tag{50}
\]
Taking expectations, we can bound the first term on the RHS as
\[
\mathbb E\big[\|\hat\rho_n(X^n) - \rho_p\|_1\big]
\overset{(a)}{\le} d\, \mathbb E\Big[\Big\|\frac1n \sum_{i=1}^n \big(|\varphi(X_i)\rangle\langle\varphi(X_i)| - \mathbb E_P[|\varphi(X)\rangle\langle\varphi(X)|]\big)\Big\|\Big]
\overset{(b)}{\le} 2d\, \mathbb E\Big[\Big\|\frac1n \sum_{i=1}^n \varepsilon_i |\varphi(X_i)\rangle\langle\varphi(X_i)|\Big\|\Big]
\overset{(c)}{\le} 2d\Big(\sqrt{2\log(2d)} + \sqrt{\pi/2}\Big)\, n^{-\frac12}, \tag{51}
\]
where $(a)$ is due to $\|H\|_1 \le d\|H\|$ for a self-adjoint $H$; $(b)$ is due to the standard symmetrization inequality (see e.g. [Vershynin, 2018, Lemma 6.4.2]); and $(c)$ follows by an application of (44) with $H_i = |\varphi(x_i)\rangle\langle\varphi(x_i)|/n$ for a given $X^n = x^n$, noting that
\[
V_n^2 := \frac{1}{n^2}\Big\|\sum_{i=1}^n |\varphi(x_i)\rangle\langle\varphi(x_i)|\,|\varphi(x_i)\rangle\langle\varphi(x_i)|\Big\| = \frac{1}{n^2}\Big\|\sum_{i=1}^n \langle\varphi(x_i),\varphi(x_i)\rangle\, |\varphi(x_i)\rangle\langle\varphi(x_i)|\Big\| \le \frac1n,
\]
where the final inequality uses $\langle\varphi(x),\varphi(x)\rangle = 1$ for all $x \in \mathcal X$, $\big\||\varphi(x_i)\rangle\langle\varphi(x_i)|\big\| \le 1$, and the triangle inequality for the operator norm. For the second term in (50), using $0.5\,\|\hat\rho_n(X^n) - \rho_p\|_1 \le 1$ and $z^2 \le z$ for $0 \le z \le 1$, (51) implies
\[
\mathbb E\big[\|\hat\rho_n(X^n) - \rho_p\|_1^2\big] \le 2\,\mathbb E\big[\|\hat\rho_n(X^n) - \rho_p\|_1\big] \le 4d\Big(\sqrt{2\log(2d)} + \sqrt{\pi/2}\Big)\, n^{-\frac12}. \tag{52}
\]
Substituting (51) and (52) in (50), we obtain (22) when $\epsilon > 0$.

Before proving (22) when $\epsilon = 0$, we first establish (23a). We have from (50) that, for $t \ge 0$,
\[
\mathbb P\Big(\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \ge \frac{16 b_n}{n} + \frac{64}{n^2} + 28\epsilon + 8d(b_n + 4)t\Big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|_1 \ge dt\big) + \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|_1^2 \ge 2dt\big). \tag{53}
\]
For the first term in (53), an application of (45) with $H_i = |\varphi(X_i)\rangle\langle\varphi(X_i)| - \mathbb E_P[|\varphi(X)\rangle\langle\varphi(X)|]$ yields
\[
\mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|_1 \ge dt\big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\| \ge t\big) \le 2d\, e^{-\left(\frac{nt^2}{4} \wedge \frac{3nt}{4}\right)}, \tag{54}
\]
where we used $\|H_i\| \le 1$ and $\big\|\sum_{i=1}^n H_i^2\big\| \le \sum_{i=1}^n \|H_i^2\| \le \sum_{i=1}^n \|H_i\| \le n$. For the second term in (53), we have via (45) that
\[
\mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|_1^2 \ge 2dt\big) \le \mathbb P\Big(\frac{\|\hat\rho_n(X^n) - \rho_p\|_1^2}{4} \ge \frac{dt}{2}\Big) \le \mathbb P\Big(\frac{\|\hat\rho_n(X^n) - \rho_p\|_1}{2} \ge \frac{dt}{2}\Big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\| \ge t\big) \le 2d\, e^{-\left(\frac{nt^2}{4} \wedge \frac{3nt}{4}\right)}.
\]
Substituting this and (54) in (53) yields (23a).
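The $n^{-1/2}$ rate in (51) can be observed directly on a toy model. In the sketch below, the alphabet, the unit-norm feature vectors $\varphi(x)$, and the sampling distribution are all synthetic choices of ours; $\hat\rho_n$ is the empirical average of $|\varphi(X_i)\rangle\langle\varphi(X_i)|$ as above.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 400

# Synthetic embedding: alphabet of size d, unit vectors phi(x) in C^d.
phi = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)     # row x is phi(x)
p = rng.dirichlet(np.ones(d))                         # sampling distribution P

proj = np.einsum('xi,xj->xij', phi, phi.conj())       # |phi(x)><phi(x)|
rho_p = np.einsum('x,xij->ij', p, proj)               # population embedding

errs = []
for _ in range(200):
    xs = rng.choice(d, size=n, p=p)
    rho_hat = proj[xs].mean(axis=0)                   # empirical estimator
    errs.append(np.linalg.norm(rho_hat - rho_p, 2))   # operator-norm error
print(np.mean(errs), np.sqrt(np.log(2 * d) / n))      # both on the n^{-1/2} scale
```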
Next, we show (22) when $\epsilon = 0$, i.e., $\sigma^\star_p = \rho_p$. Then, (46) yields
\[
\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \le 8 D(\rho_n(X^n)\|\rho_p) + 8\|\rho_n(X^n) - \rho_p\|_1^2. \tag{55}
\]
Applying a second-order Taylor expansion as in [Sreekumar and Berta, 2025, Equation 41], we have
\[
D(\rho_n(X^n)\|\rho_p) = 2\,\mathrm{Tr}\Big[\int_0^1 (1-t)\big(\rho_n(X^n) - \rho_p\big) \int_0^\infty u_1(\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big]\, dt - 2\int_0^1 (1-t)\, \mathrm{Tr}\Big[\big((1-t)\rho_p + t\rho_n(X^n)\big) \int_0^\infty u_2(\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big]\, dt, \tag{56}
\]
where
\[
v(\rho_n(X^n), \rho_p, \tau, t) := \big(\tau I + (1-t)\rho_p + t\rho_n(X^n)\big)^{-1}, \tag{57a}
\]
\[
u_1(\rho_n(X^n), \rho_p, \tau, t) := v(\rho_n(X^n), \rho_p, \tau, t)\big(\rho_n(X^n) - \rho_p\big)\, v(\rho_n(X^n), \rho_p, \tau, t), \tag{57b}
\]
\[
u_2(\rho_n(X^n), \rho_p, \tau, t) := u_1(\rho_n(X^n), \rho_p, \tau, t)\big(\rho_n(X^n) - \rho_p\big)\, v(\rho_n(X^n), \rho_p, \tau, t). \tag{57c}
\]
Using Hölder's inequality and sub-multiplicativity of Schatten norms, we obtain
\[
\mathrm{Tr}\Big[\int_0^1 (1-t)\big(\rho_n(X^n) - \rho_p\big) \int_0^\infty u_1(\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big]\, dt
\le \|\rho_n(X^n) - \rho_p\|_1 \int_0^1 (1-t)\Big(\int_0^\infty \|u_1(\rho_n(X^n), \rho_p, \tau, t)\|\, d\tau\Big)\, dt
\le \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\| \int_0^1 (1-t) \int_0^\infty \big\|\big(\tau I + (1-t)\rho_p + t\rho_n(X^n)\big)^{-1}\big\|^2\, d\tau\, dt
\le \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\| \int_0^1 (1-t) \int_0^\infty \big\|\big(\tau I + (1-t)\rho_p\big)^{-1}\big\|^2\, d\tau\, dt
\le \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\| \int_0^1 (1-t) \int_0^\infty \big(\tau + (1-t)\lambda_{\min}(\rho_p)\big)^{-2}\, d\tau\, dt
\le \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\|. \tag{58}
\]
Similarly, we have
\[
\int_0^1 (1-t)\, \mathrm{Tr}\Big[\big((1-t)\rho_p + t\rho_n(X^n)\big) \int_0^\infty u_2(\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big]\, dt
\overset{(a)}{\le} \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\| \int_0^1 (1-t)\Big(\int_0^\infty \big\|\big(\tau I + (1-t)\rho_p\big)^{-1}\big\|^2\, d\tau\Big)\, dt
\le \|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\|, \tag{59}
\]
where $(a)$ uses $\big\|\big((1-t)\rho_p + t\rho_n(X^n)\big)\big(\tau I + (1-t)\rho_p + t\rho_n(X^n)\big)^{-1}\big\| \le 1$. Substituting (58) and (59) in (56), we obtain
\[
D(\rho_n(X^n)\|\rho_p) \le 4\,\|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\|.
\]
Substituting the above inequality in (55), we obtain
\[
\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \le 32\,\|\rho_n(X^n) - \rho_p\|_1\, \|\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\| + 8\,\|\rho_n(X^n) - \rho_p\|_1^2 \le 32d\,\|\rho_n(X^n) - \rho_p\|^2\, \big\|\rho_p^{-1}\big\| + 8\,\|\rho_n(X^n) - \rho_p\|_1^2 \le \big(32d\,\big\|\rho_p^{-1}\big\| + 8d^2\big)\, \|\rho_n(X^n) - \rho_p\|^2, \tag{60}
\]
where the final inequality uses $\|\rho_n(X^n) - \rho_p\|_1 \le d\,\|\rho_n(X^n) - \rho_p\|$. Taking expectations, we obtain
\[
\mathbb E\big[\|\sigma^\star_n(X^n) - \rho_p\|_1^2\big] \le \big(32d\,\big\|\rho_p^{-1}\big\| + 8d^2\big)\, \mathbb E\big[\|\rho_n(X^n) - \rho_p\|^2\big]. \tag{61}
\]
We will show that the last expectation is $O(1/n)$. An application of (45) with $H_i = |\varphi(X_i)\rangle\langle\varphi(X_i)| - \mathbb E_P[|\varphi(X)\rangle\langle\varphi(X)|]$ yields
\[
\mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\| \ge t\big) \le 2d\, e^{-\left(\frac{nt^2}{4} \wedge \frac{3nt}{4}\right)}, \tag{62}
\]
where we used $\|H_i\| \le 1$ and $\big\|\sum_{i=1}^n H_i^2\big\| \le \sum_{i=1}^n \|H_i^2\| \le \sum_{i=1}^n \|H_i\| \le n$. Thus, we have
\[
\mathbb E\big[\|\hat\rho_n(X^n) - \rho_p\|^2\big] = \int_0^\infty \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\| \ge \sqrt t\big)\, dt \le 2d \int_0^\infty e^{-\left(\frac{nt}{4} \wedge \frac{3n\sqrt t}{4}\right)}\, dt = 2d\int_0^1 e^{-\frac{nt}{4}}\, dt + 2d\int_1^\infty e^{-\frac{3n\sqrt t}{4}}\, dt = \frac{8d}{n} + \frac{16d\, e^{-\frac{3n}{4}}}{3n}\Big(1 + \frac{4}{3n}\Big). \tag{63}
\]
Substituting (63) in (61) and using $\|\rho_n(X^n) - \rho_p\|_1^2 \le 2\big(\|\hat\rho_n(X^n) - \rho_p\|_1^2 + (4/n^2)\big)$, we obtain
\[
\mathbb E\big[\|\sigma^\star_n(X^n) - \rho_p\|_1^2\big] \le \big(32d\,\|\rho_p^{-1}\| + 8d^2\big)\, \mathbb E\big[\|\rho_n(X^n) - \rho_p\|^2\big] \le \big(32d\,\|\rho_p^{-1}\| + 8d^2\big)\bigg(\frac{8}{n^2} + \frac{16d}{n} + \frac{32d\, e^{-\frac{3n}{4}}}{3n}\Big(1 + \frac{4}{3n}\Big)\bigg) \le \big(32d\,\|\rho_p^{-1}\| + 8d^2\big)\Big(\frac{8}{n^2} + \frac{28d}{n}\Big),
\]
thus proving (22) when $\epsilon = 0$. The proof of Part (i) is completed by noting that (23b) follows using (60) and $\|\rho_n(X^n) - \rho_p\|_1^2 \le 2\big(\|\hat\rho_n(X^n) - \rho_p\|_1^2 + (4/n^2)\big)$ as follows:
\[
\mathbb P\Big(\|\sigma^\star_n(X^n) - \rho_p\|_1^2 \ge \big(32d\,\|\rho_p^{-1}\| + 8d^2\big)\Big(\frac{8}{n^2} + 2t\Big)\Big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|^2 \ge t\big) \le 2d\, e^{-\left(\frac{nt}{4} \wedge \frac{3n\sqrt t}{4}\right)}.
\]
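Both parts of the proof lean on the variational expression (2), with maximizer $H^\star = \log\rho - \log\sigma$ when $\rho, \sigma > 0$ (as used in (47)–(48) above and (66)–(67) below). A minimal numerical sketch, with random test states of our own choosing:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(5)

def rand_density(d):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    m = g @ g.conj().T
    return m / np.trace(m).real

def objective(H, rho, sigma):
    # Variational objective Tr[rho H] - Tr[exp(H + log sigma)] + 1 from (2).
    return (np.trace(rho @ H) - np.trace(expm(H + logm(sigma)))).real + 1

d = 4
rho, sigma = rand_density(d), rand_density(d)
H_star = logm(rho) - logm(sigma)
D = np.trace(rho @ (logm(rho) - logm(sigma))).real
print(D, objective(H_star, rho, sigma))   # equal up to numerical error

# Random Hermitian perturbations should not beat the maximizer.
for _ in range(100):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    P = (g + g.conj().T) / 2
    assert objective(H_star + 0.1 * P, rho, sigma) <= D + 1e-9
```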
Next, we prove Part (ii), starting with the proof of (25) when $\epsilon > 0$. We have
\[
D(\rho_p\|\hat\sigma^\star_n(X^n)) = D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n)) + D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n))
\overset{(a)}{\le} D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n)) + D(\rho_n(X^n)\|\sigma^\star_n(X^n)) + \frac1n D(\rho_n(X^n)\|\pi_d)
\overset{(b)}{\le} D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n)) + D(\rho_n(X^n)\|\sigma^\star_n(X^n)) + \frac{\log d}{n}, \tag{64}
\]
where $(a)$ follows by convexity of quantum relative entropy and $(b)$ follows since $D(\rho_n(X^n)\|\pi_d) \le \log d$. Taking expectations with respect to $X^n \sim P^{\otimes n}$, we obtain under the assumption (24) that
\[
\mathbb E[D(\rho_p\|\hat\sigma^\star_n(X^n))] \le \mathbb E[D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n))] + \frac{\log d}{n} + \epsilon. \tag{65}
\]
Since $\rho_p > 0$ and $\hat\sigma^\star_n(x^n) > 0$ for every $x^n \in \mathcal X^n$, it follows that
\[
D(\rho_p\|\hat\sigma^\star_n(X^n)) = \sup_H\, \mathrm{Tr}[\rho_p H] - \mathrm{Tr}\big[e^{H + \log\hat\sigma^\star_n(X^n)}\big] + 1, \tag{66}
\]
where the supremum is over all $H$ such that $\|H\| \le \big\|H^\star\big(\rho_p, \hat\sigma^\star_n(X^n)\big)\big\| \le \log(dn) \vee \log\|\rho_p^{-1}\|$. Similarly,
\[
D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n)) = \sup_H\, \mathrm{Tr}[\rho_n(X^n) H] - \mathrm{Tr}\big[e^{H + \log\hat\sigma^\star_n(X^n)}\big] + 1, \tag{67}
\]
where the supremum is over all $H$ such that $\|H\| \le \big\|H^\star\big(\rho_n(X^n), \hat\sigma^\star_n(X^n)\big)\big\| \le \log(dn)$. Hence, we can restrict the suprema in (66) and (67) to $H$ such that $\|H\| \le \bar b_n := \log(dn) \vee \log\|\rho_p^{-1}\|$. Then, we obtain
\[
|D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n))|
\overset{(a)}{\le} \sup_{H : \|H\|\le \bar b_n} \big|\mathrm{Tr}\big[\big(\rho_n(X^n) - \rho_p\big)H\big]\big|
\overset{(b)}{\le} \sup_{H : \|H\|\le \bar b_n} \frac{|\mathrm{Tr}[H]|}{dn} + \frac{|\mathrm{Tr}[\rho_p H]|}{n} + \big|\mathrm{Tr}\big[\big(\hat\rho_n(X^n) - \rho_p\big)H\big]\big|
\overset{(c)}{\le} \frac{2\bar b_n}{n} + \sup_{H : \|H\|\le \bar b_n} \big|\mathrm{Tr}\big[\big(\hat\rho_n(X^n) - \rho_p\big)H\big]\big|
\overset{(d)}{\le} \frac{2\bar b_n}{n} + d\,\bar b_n\, \|\hat\rho_n(X^n) - \rho_p\|,
\]
where $(a)$ follows by applying (49); $(b)$ follows by using the definition of $\rho_n(X^n)$; $(c)$ is because $|\mathrm{Tr}[H]| \le d\bar b_n$ and, by Hölder's inequality, $|\mathrm{Tr}[\rho_p H]| \le \|\rho_p\|_1 \|H\| \le \bar b_n$; and $(d)$ is again via Hölder's inequality and $\|H\|_1 \le d\bar b_n$. Hence, applying (44) in Theorem 3 similarly to (51), we obtain
\[
\mathbb E[|D(\rho_p\|\hat\sigma^\star_n(X^n)) - D(\rho_n(X^n)\|\hat\sigma^\star_n(X^n))|] \le \frac{2\bar b_n}{n} + d\,\bar b_n\, \mathbb E[\|\hat\rho_n(X^n) - \rho_p\|] \le 2\bar b_n\, n^{-1} + d\,\bar b_n \Big(\sqrt{2\log(2d)} + \sqrt{\pi/2}\Big)\, n^{-\frac12}.
\]
The claim in (25) for the case $\epsilon > 0$ then follows by substituting this in (65).
To obtain the bound in (25) for the case $\epsilon = 0$, we note that (24) and non-negativity of quantum relative entropy imply that either $\sigma^\star_n(X^n) = \rho_n(X^n)$ or $\hat\sigma^\star_n(X^n) = \rho_n(X^n)$ holds almost surely. In both cases, using convexity of quantum relative entropy and the definitions in (19)–(20), we have
\[
D(\rho_p\|\hat\sigma^\star_n(X^n)) \le D(\rho_p\|\rho_n(X^n)) + \frac{\log d}{n} \le D(\rho_p\|\hat\rho_n(X^n)) + \frac{2\log d}{n}. \tag{68}
\]
Applying a second-order Taylor expansion as in [Sreekumar and Berta, 2025, Equation 41], we have
\[
D(\rho_p\|\hat\rho_n(X^n)) = 2\int_0^1 (1-t)\, \mathrm{Tr}\Big[\rho_p \int_0^\infty u_2(\hat\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big]\, dt - \mathrm{Tr}\Big[\rho_p \int_0^\infty (\tau I + \rho_p)^{-1}\big(\hat\rho_n(X^n) - \rho_p\big)(\tau I + \rho_p)^{-1}\, d\tau\Big] \le 2\,\|\hat\rho_n(X^n) - \rho_p\|_1\, \|\hat\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\| \le 2d\, \|\hat\rho_n(X^n) - \rho_p\|^2\, \big\|\rho_p^{-1}\big\|. \tag{69}
\]
Here, the penultimate inequality uses that
\[
\mathrm{Tr}\Big[\rho_p \int_0^\infty (\tau I + \rho_p)^{-1}\big(\hat\rho_n(X^n) - \rho_p\big)(\tau I + \rho_p)^{-1}\, d\tau\Big] = 0,
\]
which follows using cyclicity of trace as shown in Sreekumar and Berta [2025, Equation 33], and
\[
\mathrm{Tr}\Big[(1-t)\rho_p \int_0^\infty u_2(\hat\rho_n(X^n), \rho_p, \tau, t)\, d\tau\Big] \le \|\hat\rho_n(X^n) - \rho_p\|_1\, \|\hat\rho_n(X^n) - \rho_p\|\, \big\|\rho_p^{-1}\big\|.
\]
Substituting (69) in (68), taking expectations on both sides, and using (63) yields
\[
\mathbb E[D(\rho_p\|\hat\sigma^\star_n(X^n))] \le \frac{28 d^2\, \|\rho_p^{-1}\| + 2\log d}{n},
\]
as desired.

Next, we prove (27a), given that (26) holds. Under this assumption, following the same steps as in the expectation bounds in Part (ii) for the case $\epsilon > 0$ yields, almost surely,
\[
D(\rho_p\|\hat\sigma^\star_n(X^n)) \le \frac{2\bar b_n}{n} + d\,\bar b_n\, \|\hat\rho_n(X^n) - \rho_p\| + \frac{\log d}{n} + \epsilon.
\]
Hence, (27a) follows:
\[
\mathbb P\Big(D(\rho_p\|\hat\sigma^\star_n(X^n)) \ge \frac{2\bar b_n}{n} + \frac{\log d}{n} + \epsilon + d\,\bar b_n\, t\Big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\| \ge t\big) \le 2d\, e^{-\left(\frac{nt^2}{4} \wedge \frac{3nt}{4}\right)}.
\]
Finally, (27b) is an outcome of substituting (69) in (68) and using the resulting bound to obtain
\[
\mathbb P\Big(D(\rho_p\|\hat\sigma^\star_n(X^n)) \ge 2dt\, \|\rho_p^{-1}\| + \frac{2\log d}{n}\Big) \le \mathbb P\big(\|\hat\rho_n(X^n) - \rho_p\|^2 \ge t\big) \le 2d\, e^{-\left(\frac{nt}{4} \wedge \frac{3n\sqrt t}{4}\right)}.
\]

6. Discussion and Future Work

We proposed a conceptual model for ICL: a QMLP obtained by embedding into a learned Hilbert space. We explored information-geometric properties of the QMLP and established statistical performance guarantees given an arbitrary embedding. Our performance guarantees indicate that a good embedding should aim for minimal embedded Hilbert space dimension and maximal smallest eigenvalue of $\rho_p$. Often, additional properties of embeddings may be desirable, such as injectivity for specific sub-classes of probability distributions. In the context of embeddings into an RKHS generated by a symmetric positive definite translation-invariant kernel, a necessary and sufficient condition for the corresponding feature map to be injective (or for the kernel to be characteristic) for sub-classes of distributions was obtained by Gretton et al. [2006], Sriperumbudur et al. [2008], Fukumizu et al. [2009]. Relaxing the assumption of translation invariance of kernels, integral strict positive definiteness was shown to be a sufficient condition for a kernel to be characteristic by Sriperumbudur et al. [2010]. However, the latter condition is tailored towards kernels that are injective over the entire $\mathcal P(\mathcal X)$, and hence is too restrictive if the class of probability distributions $\mathcal P$ of interest is a strict subset. Hence, it is worthwhile to investigate in depth when embeddings are injective for specific sub-classes of $\mathcal P(\mathcal X)$ when the associated kernel is not translation invariant.

A simple criterion for injectivity can be obtained as follows. Suppose $\mathcal P \subseteq \mathcal P(\mathcal X)$ is a class of discrete or continuous distributions on $\mathbb R^m$ of interest. Consider the class of pairwise differences of densities (with respect to the counting or Lebesgue measure) within $\mathcal P$, i.e.,
\[
\mathcal P_2 := \{p - p' : P \ne P',\ P, P' \in \mathcal P\}.
\]
Then, it is clear that the feature map $\varphi : \mathcal X \to \mathcal H$ corresponding to a symmetric positive definite kernel $K$ (see Appendix B) is injective on $\mathcal P$ if and only if $\mathcal P_2$ lies in the support of $K$ viewed as an element of $\mathcal L(\mathcal H)$.
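For a finite alphabet, this type of injectivity question reduces to linear algebra: since differences $p - p'$ lie in the span of the outer products $\{|\varphi(x)\rangle\langle\varphi(x)|\}_{x\in\mathcal X}$, linear independence of these outer products suffices for injectivity of the covariance embedding. The sketch below (the Gaussian kernel and the sample points are our own illustrative choices, not the paper's) checks this rank condition numerically.

```python
import numpy as np

# Rank check of injectivity of the covariance embedding on a finite alphabet.
xs = np.linspace(0.0, 3.0, 5)                          # alphabet X
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)    # Gaussian Gram, K(x,x)=1
F = np.linalg.cholesky(K + 1e-12 * np.eye(len(xs)))    # rows of F act as phi(x)

# Vectorized outer products |phi(x)><phi(x)|; full rank => injectivity on P(X).
outer = np.array([np.outer(f, f).ravel() for f in F])
print(np.linalg.matrix_rank(outer) == len(xs))         # True for these points
```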
It would be fruitful to explore the interplay between specific kernels and classes $\mathcal P$ for which they are injective from this viewpoint, or to determine alternative criteria that are simpler to verify. Another aspect to explore pertains to contraction coefficients for specific embeddings, which quantify how much the quantum relative entropy in the embedded domain decreases compared to the KL divergence in the distribution space (see the recent study by Belzig et al. [2025]). It would also be beneficial to establish the optimality (or the gap thereof) of the performance guarantees in Theorem 2. Finally, it is of interest to develop computationally practical algorithms and associated theoretical guarantees for solving the optimization problem in (11), for which [Csiszár and Shields, 2004, Chapter 5] and [Wilde, 2025] could be starting points for investigation.

Acknowledgement

S. Sreekumar is partially supported by the CNRS-L2S-CentraleSupélec funding WRP630. N. Weinberger is partially supported by the Israel Science Foundation (ISF), grant no. 1782/22. S. Sreekumar thanks Gereon Koßmann for interesting discussions on topics related to the paper.

Appendix A. Background: Classical and Quantum Transformer-based Large Language Models

Modern LLMs are transformer-based models composed of an input (embedding) layer followed by multi-head (masked self) attention units, feedforward nets, and finally an output layer which produces probabilities over a vocabulary. The input layer converts data (text, images, etc.) into tokens and further represents them as vectors in a Euclidean space. Each multi-head attention unit consists of a stack of attention units which are processed in parallel, each head performing independent projections of tokens within the context window into some internal representation space and computing correlations among tokens, which are concatenated at its output.

To describe this in more detail, let $\mathcal W$ be a finite set of tokens and $k$ be the context window. The input embedding layer stacks a sequence of tokens $w_1, \ldots, w_k$ as a $k \times d_{\mathrm{in}}$ matrix $Z = (z_1, \ldots, z_k)^T$, where the row vector $z_j \in \mathbb R^{d_{\mathrm{in}}}$ is the representation of token $w_j$ for $1 \le j \le k$. Each attention head involves three matrices of dimension $d_{\mathrm{in}} \times d_{\mathrm{mod}}$, the query matrix $W_Q$, the key matrix $W_K$, and the value matrix $W_V$, which are learned during the training phase. Using these, the output of the attention mechanism is computed as
\[
A = \mathrm{softmax}\bigg(\mathrm{mask}\bigg(\frac{QK^T}{\sqrt{d_{\mathrm{mod}}}}\bigg)\bigg)\, V, \tag{70}
\]
where $Q = ZW_Q$, $K = ZW_K$, and $V = ZW_V$. In the above, the masking operation enforces causality (i.e., dependence of tokens only on past tokens) by setting the relevant matrix entries to a sufficiently large negative value, and softmax performs the standard softmax operation (taking the exponential of each entry and normalizing by the row sum to produce a probability distribution). The outputs from the attention units (say $L$ of them) are concatenated into a matrix of size $k \times L d_{\mathrm{mod}}$, each row of which is processed by a feedforward network. Finally, the output projection matrix $W_O$ takes as input the output of the feedforward net of size $k \times L d_{\mathrm{mod}}$ and outputs a matrix of size $k \times |\mathcal V|$, where the rows correspond to the prediction probabilities over the vocabulary $\mathcal V$.
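The following sketch implements a single masked attention head as in (70) with plain NumPy; the dimensions and random weights are placeholders of ours, and the feedforward and output layers described above are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention_head(Z, W_Q, W_K, W_V):
    """One masked self-attention head, following (70)."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    d_mod = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_mod)
    # Causal mask: token j attends only to tokens i <= j.
    k = Z.shape[0]
    scores[np.triu(np.ones((k, k), dtype=bool), k=1)] = -1e9
    return softmax(scores) @ V

rng = np.random.default_rng(6)
k, d_in, d_mod = 4, 8, 3                          # placeholder dimensions
Z = rng.normal(size=(k, d_in))                    # stacked token embeddings
W = [rng.normal(size=(d_in, d_mod)) for _ in range(3)]
A = masked_attention_head(Z, *W)
print(A.shape)   # (k, d_mod): row j mixes value vectors of tokens <= j
```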
Quantum LLMs (QLLMs) have a similar mechanism when the input layer operation is viewed as an embedding into a Hilbert space. The desired classical and quantum computations such as (70) are then implemented using a hybrid classical-quantum architecture involving classical computations and quantum gates (which are unitary operators on the underlying Hilbert space). These operations induce a quantum state as input to the output layer, to which a POVM $\mathcal M = \{M_v\}_{v\in\mathcal V}$ is applied, producing outcomes according to some probability distribution over the vocabulary $\mathcal V$. Denoting the set of all possible output probability distributions induced by such measurements by $\hat{\mathcal Q}_n$, the QLLM optimization problem to find the optimal predictor is equivalent to solving (6).

Appendix B. Reproducing Kernel Hilbert Space (RKHS) and Covariance Embedding

While reproducing kernels [Aronszajn, 1950] have appeared in several different contexts in learning theory and machine learning, their utility for the task of prediction has remained underexplored. Consider a kernel function $K : \mathcal X \times \mathcal X \to \mathbb C$ (or $\mathbb R$) which satisfies the following:

- $K$ is a continuous and positive definite kernel on $\mathcal X$, i.e., $K(x,y) = K(y,x)^*$ and
\[
\sum_{i,j=1}^n c_i^* c_j K(x_i, x_j) \ge 0, \tag{71}
\]
for every $x^n \in \mathcal X^n$ and $c^n \in \mathbb C^n$ (or $\mathbb R^n$), such that $K(x,x) = 1$ for all $x \in \mathcal X$.

Every such $K$ defines a Hilbert space $\mathcal H$ of functions $f : \mathcal X \to \mathbb C$ (or $\mathbb R$), called an RKHS, as well as a map $\varphi : \mathcal X \to \mathcal H$ known as the feature map, such that for all $x, y \in \mathcal X$,
\[
f(x) = \langle K(\cdot,x), f\rangle, \quad \varphi(x) := K(\cdot,x), \quad \text{and} \quad K(x,y) := \langle K(\cdot,x), K(\cdot,y)\rangle. \tag{72}
\]
Any function $f$ in $\mathcal H$ can be obtained as a linear combination of kernel functions, i.e.,
\[
f(x) = \sum_{i=1}^k \alpha_i K(x, x_i), \quad \text{for some } \alpha \in \mathbb C^k,\ x \in \mathcal X^k, \tag{73}
\]
or as a limit of such functions in $\mathcal H$. Note that the above properties imply that $\|\varphi(x)\|_{\mathcal H} = 1$ for all $x \in \mathcal X$. We refer to Berlinet and Thomas-Agnan [2004] for further details.

The feature map induces the so-called mean embedding of probability distributions into the associated RKHS, and has been used in a variety of applications (see e.g. [Muandet et al., 2017]). Recently, the covariance embedding $\phi$ (see Section 3.1) induced by (8) was studied in Bach [2023], which embeds probability distributions into the subclass of density operators $\mathcal S(\mathcal H)$ within $\mathcal L(\mathcal H)$. This can be seen by noting that $\phi$ is a positive linear map, and hence the image $\rho_p$ of each probability distribution $P$ satisfies $\rho_p \ge 0$. Moreover, $\mathrm{Tr}[\rho_p] = 1$. Denoting by $\{e_i\}_{i=1}^\infty$ an orthonormal basis of $\mathcal H$, the latter follows as
\[
\mathrm{Tr}[\rho_p] := \sum_{i=1}^\infty \langle e_i, \rho_p e_i\rangle = \int_{\mathcal X} \sum_{i=1}^\infty \langle e_i|\varphi(x)\rangle\langle\varphi(x)|e_i\rangle\, p(x)\, d\mu(x) = \int_{\mathcal X} \sum_{i=1}^\infty \langle\varphi(x), e_i\rangle\langle e_i, \varphi(x)\rangle\, p(x)\, d\mu(x) = \int_{\mathcal X} \|\varphi(x)\|_{\mathcal H}^2\, p(x)\, d\mu(x) = 1, \tag{74}
\]
where the last equality is because $\|\varphi(x)\|_{\mathcal H} = 1$ for every $x \in \mathcal X$ and $\int p\, d\mu = 1$. Density operators are trace-class and hence lie within the class of Hilbert–Schmidt operators equipped with the Hilbert–Schmidt inner product
\[
\langle\rho, \sigma\rangle := \mathrm{Tr}[\rho^*\sigma] := \sum_{i=1}^\infty \langle\rho e_i, \sigma e_i\rangle, \quad \forall\, \rho, \sigma \in \mathcal L(\mathcal H),
\]
where $\rho^*$ denotes the adjoint of $\rho$.
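To make the covariance embedding concrete on a finite alphabet, the sketch below (kernel, sample points, and distribution are our own toy choices) builds an explicit finite-dimensional feature map from the Gram matrix via a Cholesky factor and verifies $\mathrm{Tr}[\rho_p] = 1$ and $\rho_p \ge 0$, mirroring (74).

```python
import numpy as np

# Toy covariance embedding on a finite alphabet via an explicit feature map.
xs = np.array([0.0, 0.7, 1.5, 2.2])                    # alphabet X
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)    # Gaussian kernel, K(x,x)=1
F = np.linalg.cholesky(K + 1e-12 * np.eye(len(xs)))    # K = F F^T; rows = phi(x)
p = np.array([0.1, 0.4, 0.3, 0.2])                     # a distribution P on X

rho_p = sum(px * np.outer(f, f) for px, f in zip(p, F))  # covariance embedding
print(np.trace(rho_p))                                 # = 1, as in (74)
print(np.linalg.eigvalsh(rho_p).min() >= -1e-12)       # rho_p >= 0
```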
References

E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations, 2023.
S. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical Society, Oxford University Press, 2000.
M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko. Quantum Boltzmann machine. Phys. Rev. X, 8:021050, May 2018. doi: 10.1103/PhysRevX.8.021050.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
F. Bach. Information theory with kernel methods. IEEE Transactions on Information Theory, 69(2):752–775, 2023. doi: 10.1109/TIT.2022.3211077.
Y. Bai, F. Chen, H. Wang, C. Xiong, and S. Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In Advances in Neural Information Processing Systems, volume 36, 2023.
I. Basile and F. Tamburini. Towards quantum language models. In M. Palmer, R. Hwa, and S. Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1840–1849, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1196.
S. Basu, M. Choraria, and L. R. Varshney. Transformers are universal predictors. ArXiv, abs/2307.07843, 2023.
P. Belzig, L. Gao, G. Smith, and P. Wu. Reverse-type data processing inequality. Communications in Mathematical Physics, 406(12):295, 2025.
M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz. Quantum-assisted learning of hardware-embedded probabilistic graphical models. Phys. Rev. X, 7:041052, Nov 2017. doi: 10.1103/PhysRevX.7.041052.
M. Benedetti, D. Garcia-Pintos, O. Perdomo, V. Leyton-Ortega, Y. Nam, and A. Perdomo-Ortiz. A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Information, 5(1):45, 2019. doi: 10.1038/s41534-019-0157-8.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, MA, USA, 2004. ISBN 978-1-4020-7679-4. doi: 10.1007/978-1-4419-9096-9.
M. Berta, O. Fawzi, and M. Tomamichel. On variational expressions for quantum relative entropies. Letters in Mathematical Physics, 107(12):2239–2265, Dec. 2015. doi: 10.1007/s11005-017-0990-7.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
I. Csiszár and P. C. Shields. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
M. J. Donald. On the relative entropy. Communications in Mathematical Physics, 105(1):13–34, 1986. ISSN 1432-0916. doi: 10.1007/BF01212339.
E. Edelman, N. Tsilivis, B. L. Edelman, E. Malach, and S. Goel. The evolution of statistical induction heads: In-context learning Markov chains. Advances in Neural Information Processing Systems, 37:64273–64311, 2024.
C. Ferrie and R. Blume-Kohout. Minimax quantum tomography: estimators and relative entropy bounds. Physical Review Letters, 116(9):090407, 2016.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871–1905, 2009.
S. Garg, D. Tsipras, P. Liang, and G. Valiant. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems, volume 35, 2022.
N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. Information and Inference: A Journal of the IMA, 9(2):473–504, 2020.
A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS ’06, pages 513–520, Cambridge, MA, USA, 2006. MIT Press.
D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492, 1997.
M. Hayashi. Optimal sequence of quantum measurements in the sense of Stein’s lemma in quantum hypothesis testing. Journal of Physics A: Mathematical and General, 35(50):10759, Dec. 2002.
M. Hayashi. Quantum Information Theory: Mathematical Foundation. Springer Berlin, Heidelberg, 2016.
M. Hayashi and Y. Ito. Entanglement measures for detectability. IEEE Transactions on Information Theory, 71(6):4385–4405, 2025. doi: 10.1109/TIT.2025.3557056.
B. Houska and B. Chachuat. Global optimization in Hilbert space. Mathematical Programming, 173, Dec. 2017. doi: 10.1007/s10107-017-1215-7.
J. K. Hoyos-Osorio and L. G. Sanchez-Giraldo. The representation Jensen-Shannon divergence, 2024.
H. J. Jeon, J. D. Lee, Q. Lei, and B. Van Roy. An information-theoretic analysis of in-context learning. In Forty-first International Conference on Machine Learning, 2024.
O. Kachaiev and S. Recanatesi. Learning to embed distributions via maximum kernel entropy. Advances in Neural Information Processing Systems, 37:44710–44734, 2024.
M. Kieferová and N. Wiebe. Tomography and generative training with quantum Boltzmann machines. Phys. Rev. A, 96:062327, Dec 2017. doi: 10.1103/PhysRevA.96.062327.
S. H. Kim, J. Mei, C. Girotto, M. Yamada, and M. Roetteler. Quantum large language model fine-tuning. In 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), volume 01, pages 01–12, 2025. doi: 10.1109/QCE65121.2025.00258.
V. Koltchinskii and D. Xia. Optimal estimation of low rank density matrices. J. Mach. Learn. Res., 16(53):1757–1792, 2015.
R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981. doi: 10.1109/TIT.1981.1056331.
J.-G. Liu and L. Wang. Differentiable learning of quantum circuit Born machines. Phys. Rev. A, 98:062324, Dec 2018. doi: 10.1103/PhysRevA.98.062324.
N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998.
K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017. doi: 10.1561/2200000060.
M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information, volume 2. Cambridge University Press, Cambridge, 2001.
Z. Pan, S. Wang, L. Pengfei, and J. Li. Understanding LLM behaviors via compression: Data generation, knowledge acquisition and scaling laws. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
D. Petz. A variational expression for the relative entropy. Communications in Mathematical Physics, 114(2):345–349, 1988. ISSN 1432-0916.
M. Piani. Relative entropy of entanglement and restricted measurements. Physical Review Letters, 103(16):160504, Oct. 2009. doi: 10.1103/PhysRevLett.103.160504.
Y. Polyanskiy and Y. Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2025.
J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, 1983.
M. E. Sander and G. Peyré. Towards understanding the universality of transformers for next-token prediction. In The Thirteenth International Conference on Learning Representations, 2025.
L. V. Santoro and V. M. Panaretos. Likelihood ratio tests by kernel Gaussian embedding. arXiv preprint, Aug 2025.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
L. Song, A. Gretton, and C. Guestrin. Nonparametric tree graphical models. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 765–772, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
S. Sreekumar and M. Berta. Limit distribution theory for quantum divergences. IEEE Transactions on Information Theory, 71(1):459–484, 2025.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Annual Conference on Computational Learning Theory, 2008.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, Aug. 2010. ISSN 1532-4435.
J. Tang. Divergence Covering. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2022.
A. C. Thompson. On certain contraction mappings in a partially ordered vector space. Proceedings of the American Mathematical Society, 14:438–443, 1963. ISSN 0002-9939, 1088-6826. doi: 10.1090/S0002-9939-1963-0149237-7.
R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, 2018.
M. M. Wilde. Quantum Information Theory. Cambridge University Press, second edition, 2017. doi: 10.1017/9781316809976.001.
M. M. Wilde. Fundamentals of quantum Boltzmann machine learning with visible and hidden units. arXiv:2512.19819, 2025.
S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations, 2022.
Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599, 1999.

(S. Sreekumar) L2S, CNRS, CentraleSupélec, University of Paris-Saclay, France
Email address: sreejith.sreekumar@centralesupelec.fr

(N. Weinberger) Electrical and Computer Engineering, Technion, Israel
Email address: nirwein@technion.ac.il
