On Learning from Ghost Imaging without Imaging
Issei Sato
The University of Tokyo / RIKEN
sato@k.u-tokyo.ac.jp

Abstract

Computational ghost imaging is an imaging technique in which an object is imaged from light collected using a single-pixel detector with no spatial resolution. Recently, ghost cytometry has been proposed as a high-speed cell-classification method that involves ghost imaging and machine learning in flow cytometry. Ghost cytometry skips the reconstruction of cell images from signals and directly uses the signals for cell classification, because this reconstruction is what creates the bottleneck in high-speed analysis. In this paper, we provide a theoretical analysis for learning from ghost imaging without imaging.

1 Introduction

Ghost imaging was first observed with entangled photon pairs and viewed as a quantum phenomenon [1]. It acquires object information through correlation calculations of the light-intensity fluctuations of two beams: object and reference [2, 3]. Computational ghost imaging is an imaging technique in which an object is imaged from illumination patterns and light collected using a single-pixel detector with no spatial resolution [4, 5], which simplifies the operations in comparison with conventional two-detector ghost imaging. A photomultiplier tube (PMT) is often used as the single-pixel detector that collects the scattered light. Using the detected signals and illumination patterns enables us to computationally reconstruct images.

Let $T(x,y)$ be the transmission function of an object. The object is illuminated by a speckle field generated by passing a laser beam through an optical diffuser, a material that diffuses transmitted light, such as a diffractive optical element (DOE). A detector measures the total intensity $G_m$ transmitted through the object, given by

$$G_m = \int I_m(x,y)\, T(x,y)\, dx\, dy, \qquad (1)$$

where $I_m(x,y)$ is the $m$-th speckle field, also referred to as the $m$-th structured illumination pattern. We can reconstruct the transmission function as

$$\widetilde{T}(x,y) = \frac{1}{M}\sum_{m=1}^{M}\left(G_m - \langle G\rangle\right) I_m(x,y), \quad \text{where } \langle G\rangle = \frac{1}{M}\sum_{m=1}^{M}G_m. \qquad (2)$$

The transmission function $\widetilde{T}(x,y)$ is a reconstructed image of the target object.
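To make Eqs. (1) and (2) concrete, the following is a minimal numerical sketch of the discretized measurement and reconstruction; the toy object, the binary illumination patterns, and all sizes are illustrative assumptions rather than the physical setup.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, M = 16, 16, 5000            # object size and number of illumination patterns
T = np.zeros((H, W))              # toy transmission function: a bright square
T[5:11, 5:11] = 1.0

# M random illumination patterns I_m (sparse binary masks for simplicity)
I = rng.binomial(1, 0.1, size=(M, H, W)).astype(float)

# Eq. (1), discretized: single-pixel measurements G_m = sum_{x,y} I_m(x,y) T(x,y)
G = np.einsum('mhw,hw->m', I, T)

# Eq. (2): correlation-based reconstruction from signals and patterns
T_hat = np.einsum('m,mhw->hw', G - G.mean(), I) / M

# The reconstruction correlates with the true object
corr = np.corrcoef(T.ravel(), T_hat.ravel())[0, 1]
print(f"correlation between object and reconstruction: {corr:.3f}")
```

It is exactly this reconstruction step that ghost cytometry skips, as described next.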
Ghost cytometry [6] has been proposed as a high-speed cell-classification method that involves ghost imaging and machine learning in flow cytometry. Flow cytometry is a technique for measuring, at high speed, the characteristics of a population of particles (cells, bacteria, etc.), such as cell size, cell count, cell morphology (shape and structure), and cell-cycle phase. "Cyto-" and "-metry" mean cell and measure, respectively. With flow cytometry, we can measure the information of a single cell. A sample including cells, e.g., blood cells, is injected into a flow cytometer, which is composed of three systems: flow/fluid, optical, and electric. It detects scattered light and the fluorescence of cells. From the detected scattered-light and fluorescence signals, we can obtain information on the relative size and internal structure of a cell and on the cell membrane, the cytoplasm, various antigens present in the nucleus, and quantities of nucleic acids.

Computational ghost imaging is well known as an imaging method. However, a breakthrough occurred with ghost cytometry, in which the reconstruction of cell images from raw signals $\{G_m\}_{m=1}^{M}$ can be skipped because this reconstruction is what causes the bottleneck in high-speed analysis. Ghost cytometry directly uses the raw signals to classify cells. Moreover, computational ghost imaging uses multiple randomly generated illumination patterns to reconstruct an image, whereas in ghost cytometry cells pass through a single randomly allocated illumination pattern and the signals are detected in time series by a single-pixel detector. That is, unlike in ghost imaging, we do not need to switch the illumination pattern to obtain fluorescence-intensity features extracted from multiple illumination patterns.

In ghost cytometry, morphologically similar but different types of cells are classified. According to a subsequent study [7], it was not possible to classify these different types of cells using the two commonly available features, namely the fluorescence intensity and the forward-scattering intensity, of a commercial flow cytometer, JSAN. However, an image-based support vector machine (SVM), trained on 28×28-pixel images obtained using a commercial imaging flow cytometer, ImageStreamX, achieved a test AUC of 0.967. This result is the same as that of the image-free SVM in ghost cytometry. Thus, we consider that ghost features capture some morphological information in the raw signals.

In this paper, we provide a theoretical analysis for learning from ghost imaging without imaging, covering both the general ghost imaging setting and the ghost cytometry setting. We show that ghost features approximate the radial basis function (RBF) kernel between object images by using signals without imaging. That is,

$$\kappa_{\mathrm{RBF}}(X, Y) \approx \kappa_{\mathrm{RBF}}(G(X), G(Y)), \qquad (3)$$

where $\kappa_{\mathrm{RBF}}(X, Y)$ is the RBF kernel between image objects $X$ and $Y$, and $\kappa_{\mathrm{RBF}}(G(X), G(Y))$ is the RBF kernel between the signals $G(X)$ and $G(Y)$ in ghost cytometry.
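As a numerical preview of claim (3), the following sketch compares the RBF kernel on two random image objects with the RBF kernel on their centered signals, using the independent-mask (ghost imaging) setting analyzed in Theorem 3.11 below; the mask sparsity $q$, the sizes, and the kernel-parameter rescaling are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

H, W, M, q = 16, 16, 10_000, 0.1
X, Y = rng.random((H, W)), rng.random((H, W))

B = rng.binomial(1, q, size=(M, H, W)).astype(float)  # independent binary masks

def g(Z):
    G = np.einsum('mhw,hw->m', B, Z)  # signals G_m(Z)
    return G - G.mean()               # centered signals g_m(Z)

gamma = 1e-2
beta = gamma / (M * q * (1 - q))      # rescale so the two kernels are comparable

k_image = np.exp(-gamma * np.linalg.norm(X - Y, 'fro') ** 2)
k_ghost = np.exp(-beta * np.sum((g(X) - g(Y)) ** 2))
print(f"kernel on images: {k_image:.4f}, kernel on ghost features: {k_ghost:.4f}")
```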
2 Related work

Recently, optical machine learning, a fusion field of optics and machine learning, has been receiving increased attention. Lin et al. [8] developed an all-optical diffractive deep neural network, a physical mechanism fabricated with a 3D printer, and used it to classify handwritten digits and fashion images. This framework can perform all-optical image analysis, feature detection, and object classification. Ghost cytometry can be regarded as a study along this line, in which random projection is used as the machine learning method and is implemented with diffractive optical elements (DOEs) to extract features from a cell object. The feature extraction, which is essential yet creates a bottleneck in image processing, is performed at the speed of light by optical elements. The relationship between optics and neural networks goes back to the 1980s [9, 10]; in recent years this trend may re-emerge with new technologies on both sides. Compared with an all-optical system, ghost cytometry uses a hybrid system in which the feature extraction is implemented on DOEs and a PMT, and the classifier is implemented on a field-programmable gate array (FPGA). This hybrid system seems more flexible than an all-optical one because fine-tuning is easier on an FPGA than in DOEs.

The algorithm that creates ghost features can be regarded as a kind of random projection method. Dimensionality reduction based on random projection, which has been well studied in machine learning, is built on the idea that high-dimensional data in fact lie in a lower-dimensional subspace. The breakthrough occurred with the Johnson-Lindenstrauss lemma [11] in 1984. It states that there exists a mapping from a high-dimensional space into a lower-dimensional space that preserves the pairwise Euclidean distances between data points up to a bounded relative error. Dasgupta and Gupta [12] proved this lemma using elementary probabilistic techniques. Achlioptas [13] relaxed the Gaussian distribution to a discrete random distribution with zero mean and unit variance, i.e., a sparse random projection using a random matrix $R$ with i.i.d. entries in $\{\sqrt{s}, 0, -\sqrt{s}\}$ with probabilities $\{\frac{1}{2s}, 1-\frac{1}{s}, \frac{1}{2s}\}$ for $s = 1$ and $s = 3$. Li et al. [14] improved on this work by considering $s > 3$. Matoušek [15] generalized such results to sub-Gaussian random variables.

An interesting intersection of neural networks and random projections is neuron-friendly random projection [16]. Arriaga and Vempala proposed random projection for learning robust concept classes from a few examples with a biologically plausible neuronal mechanism called neuronal random projection, which uses a random matrix whose entries are chosen independently from the standard normal distribution or the uniform distribution over $\pm 1$. Robust concept learning is closely related to large-margin classifiers. Shi et al. [17] analyzed margin preservation for binary classification problems, showing results for a random Gaussian projection matrix. Although they only showed results for the random Gaussian matrix, similar bounds seem achievable for a sub-Gaussian distribution.

As described in Section 3.3, we use a Bernoulli random variable for the random projection matrix. Although a Bernoulli random variable can be regarded as a sub-Gaussian random variable, we cannot use this property to obtain tight bounds in our theory when the probability of the coin toss is biased. Moreover, the projection matrices are not independent. Therefore, in this paper we need to devise an analysis of ghost features that differs from existing work.
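For reference, the classical sparse random projection of Achlioptas discussed above can be sketched in a few lines; the dimensions and the target vectors below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

d, k, s = 10_000, 1_000, 3        # original dim, projected dim, sparsity parameter

# Achlioptas-style entries: sqrt(s) w.p. 1/(2s), 0 w.p. 1 - 1/s,
# -sqrt(s) w.p. 1/(2s); zero mean and unit variance.
R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(d, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

x, y = rng.normal(size=d), rng.normal(size=d)

orig = np.linalg.norm(x - y) ** 2
proj = np.linalg.norm((x - y) @ R) ** 2 / k   # scale by 1/k to unbias

print(f"relative distortion: {abs(proj - orig) / orig:.3f}")
```

In contrast to such i.i.d. constructions, the ghost features analyzed below use biased, sparse Bernoulli entries and are mutually dependent.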
3 Analysis

Ghost features can be regarded as a type of random projection; thus, we analyze them in terms of random projections. We first describe our motivation and then define several terms and show lemmas to derive the main results: Theorems 3.11 and 3.13. The details of the proofs are provided in the supplementary material.

3.1 Motivation and preliminary theorems

In the analysis of random projections, the i.i.d. entries of a random matrix are generally assumed to be sub-Gaussian [15].

Definition 3.1 ($\sigma^2$-sub-Gaussian). A random variable $Z \in \mathbb{R}$ is said to be $\sigma^2$-sub-Gaussian if there exists $\sigma > 0$ such that its moment generating function satisfies

$$\forall \lambda \in \mathbb{R}, \quad \mathbb{E}[\exp(\lambda(Z - \mathbb{E}[Z]))] \le \exp\left(\frac{1}{2}\sigma^2\lambda^2\right). \qquad (4)$$

The constant $\sigma^2$ is called a proxy variance.

Definition 3.2 (Optimal proxy variance [18, 19]). The smallest proxy variance is called the optimal proxy variance and is denoted $\sigma^2_{\mathrm{opt}}(Z)$, or simply $\sigma^2_{\mathrm{opt}}$. The variance $\mathbb{V}[Z]$ always provides a lower bound on the optimal proxy variance. When $\mathbb{V}[Z] = \sigma^2_{\mathrm{opt}}(Z)$, $Z$ is said to be strictly sub-Gaussian.

Let $b \in \{1, 0\}$ be a Bernoulli random variable with parameter $q$, the occurrence probability of one. Structured illumination patterns can be formulated as Bernoulli random matrices. In ghost imaging and cytometry, one laser is divided into multiple beams using DOEs; thus, structured illumination patterns need to be sparse in order to improve the signal-to-noise (S/N) ratio. That is, $q$ needs to be small, which is problematic in analyzing ghost features. The optimal proxy variance of a Bernoulli random variable has the form

$$\sigma^2_{\mathrm{opt}}(b) = \frac{\frac{1}{2} - q}{\log\left(\frac{1}{q} - 1\right)}.$$

The Bernoulli random variable is strictly sub-Gaussian if and only if $q = \frac{1}{2}$. However, the optimal proxy variance $\sigma^2_{\mathrm{opt}}(b)$ is much larger than the variance $\mathbb{V}[b] = q(1-q)$ when $q$ is small, which means that the exponential inequality is too loose when we rely on the sub-Gaussian property of a Bernoulli random variable. Moreover, the projected features are not independent, in the sense that they share projection matrices. To overcome these problems, we use two theorems: Theorems 3.3 and 3.4. Theorem 3.3 is referred to as the Bernstein inequality [20] and has different forms; we consider the following one.

Theorem 3.3 (Bernstein inequality). Let $Z$ be a random variable satisfying the Bernstein condition

$$\mathbb{E}[|Z - \mathbb{E}[Z]|^k] \le \frac{1}{2}k!\,\sigma^2 C^{k-2} \quad (k = 3, 4, \ldots). \qquad (5)$$

Then, for all $|\lambda| < 1/C$,

$$\mathbb{E}[\exp(\lambda Z)] \le \exp\left(\lambda\mathbb{E}[Z] + \frac{\lambda^2\mathbb{V}[Z]}{2(1 - |\lambda|C)}\right), \qquad (6)$$

and the concentration inequality

$$\mathbb{P}[|Z - \mathbb{E}[Z]| \ge \epsilon] \le 2\exp\left(-\frac{\epsilon^2}{2(\mathbb{V}[Z] + C\epsilon)}\right) \quad \text{for all } \epsilon \ge 0. \qquad (7)$$

One sufficient condition for the Bernstein condition to hold is that $Z$ be bounded; in particular, if $|Z - \mathbb{E}[Z]| \le C$, then the Bernstein condition holds straightforwardly.

The following theorem is known for non-negatively associated random variables [21, 22].

Theorem 3.4. Let $\{Z_i\}_{i=1}^{n}$ be non-negatively associated random variables bounded by a constant $C$, and let $\mathrm{Cov}(Z_i, Z_j)$ be the covariance of $Z_i$ and $Z_j$. Then, for any $\lambda > 0$,

$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n}Z_i\right)\right] - \prod_{i=1}^{n}\mathbb{E}[\exp(\lambda Z_i)] \le \lambda^2\exp(n\lambda C)\sum_{1\le i<j\le n}\mathrm{Cov}(Z_i, Z_j).$$
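The gap between the optimal proxy variance and the true variance for small $q$ can be seen numerically; the closed form above has a removable singularity at $q = 1/2$, handled separately below, and the values of $q$ are arbitrary illustrative choices.

```python
import numpy as np

def proxy_variance_opt(q):
    # optimal proxy variance of Bernoulli(q), valid for q != 1/2
    return (0.5 - q) / np.log(1.0 / q - 1.0)

for q in [0.5, 0.3, 0.1, 0.01, 0.001]:
    var = q * (1 - q)
    # q = 1/2 is the strictly sub-Gaussian case: sigma^2_opt = V[b] = 1/4
    sg = 0.25 if q == 0.5 else proxy_variance_opt(q)
    print(f"q={q:6.3f}  V[b]={var:8.5f}  sigma^2_opt={sg:8.5f}  ratio={sg / var:6.1f}")
```

The ratio grows rapidly as $q$ decreases, which is why the analysis below avoids the sub-Gaussian route and uses the Bernstein inequality instead.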
3.2 Analysis of Ghost Features in Ghost Imaging

Let $B_m$, $m = 1, \ldots, M$, be $H \times W$ random binary masks whose entries $B_m(i,j)$ are drawn independently, taking the value 1 with probability $q$ and 0 with probability $1-q$; $B_m$ plays the role of the $m$-th structured illumination pattern. The signals and the ghost feature vector for an $H \times W$ object $X$ are

$$G_m(X) = \sum_{i=1}^{H}\sum_{j=1}^{W}B_m(i,j)\,X(i,j), \qquad \langle G(X)\rangle = \frac{1}{M}\sum_{m=1}^{M}G_m(X), \qquad (12)$$

$$g(X) = (g_1(X), g_2(X), \ldots, g_M(X))^\top, \qquad g_m(X) = G_m(X) - \langle G(X)\rangle, \qquad (13)$$

where $\top$ denotes the transpose of a vector or matrix.

Definition 3.5 ($L_2$ norm and Frobenius norm). Denote the $L_2$ norm of a vector $g$ by $\|g\|_2$ and the Frobenius norm of a matrix $X$ by $\|X\|_F$.

Definition 3.6 (Summation of matrix elements). Let the summation of matrix elements be

$$S[X] = \sum_{i=1}^{H}\sum_{j=1}^{W}X(i,j). \qquad (14)$$

Definition 3.7 (Remainder and quotient). Denote the remainder and quotient upon division of $A$ by $B$ as $[A \,\%\, B]$ and $\lfloor A/B\rfloor$, respectively.

We define two functions used in the following lemma and theorem; their meanings are explained in Remark 1 below.

Definition 3.8. Denote $i' = [k' \,\%\, H]$, $j' = \lfloor k'/H\rfloor$ and define

$$\Gamma_q(X) \overset{\mathrm{def}}{=} \frac{(1-2q)^2}{q(1-q)}\sum_{j=1}^{W}\sum_{i=1}^{H}X(i,j)^4 + 4\sum_{j=1}^{W}\sum_{i=1}^{H}\sum_{k' > (j-1)H+i}^{WH}\left(X(i,j)\,X(i',j')\right)^2, \qquad (15)$$

$$\Lambda_q(X) \overset{\mathrm{def}}{=} \max\left\{\max_{(i,j)\neq(i',j')}\frac{2(1-q)}{q}X(i,j)\,X(i',j'),\ \max_{(i,j)}\frac{1-2q}{q}X(i,j)^2\right\}. \qquad (16)$$

The feature vector $g(\cdot)$ has the following linearity property.

Lemma 3.9 (Linearity). Let $X$ and $Y$ be $H \times W$ real matrices. Then

$$g_m(X - Y) = g_m(X) - g_m(Y). \qquad (17)$$

Since we consider the two parts of $g_m(X)$ in the main result below,

$$g_m(X) = G_m(X) - \langle G(X)\rangle = \underbrace{G_m(X) - qS[X]}_{\text{Part I}} + \underbrace{qS[X] - \langle G(X)\rangle}_{\text{Part II}}, \qquad (18)$$

we show an exponential inequality for each of the two parts by using Theorems 3.3 and 3.4, as follows.

Lemma 3.10 (Exponential inequality). For any $|t|/M < 1/\Lambda_q(X)$,

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\right] \le \exp\left(t\|X\|_F^2 + \frac{\Gamma_q(X)\,t^2}{2M\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right), \qquad (19)$$

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)}(qS[X] - \langle G(X)\rangle)^2\right)\right] \le \exp\left(\frac{t}{M}\|X\|_F^2 + \frac{\Gamma_q(X)\,t^2}{2M^3\left(1 - \frac{\Lambda_q(X)}{M^2}|t|\right)}\right). \qquad (20)$$

Since $\{g_m(X)\}_{m=1}^{M}$ share $\langle G(X)\rangle$, they are not independent given $X$; this is the difference from existing random projections. By using Lemma 3.10 and Hölder's inequality, we have the following theorem.

Theorem 3.11. Let $X$ and $Y$ be $H \times W$ real matrices. For all $\epsilon > 0$, with probability at least $1 - \delta$,

$$\left(1 - \frac{1}{M} - \epsilon\right)\|X - Y\|_F^2 \le \frac{1}{Mq(1-q)}\|g(X) - g(Y)\|_2^2 \le \left(1 - \frac{1}{M} + \epsilon\right)\|X - Y\|_F^2, \qquad (21)$$

where

$$\delta = 2\exp\left(-\frac{\epsilon^2 M}{2\left(1 + \frac{1}{M^2}\right)\left(\frac{2\Gamma_q(X-Y)}{\|X-Y\|_F^4} + \frac{\Lambda_q(X-Y)\,\epsilon}{\|X-Y\|_F^2}\right)}\right). \qquad (22)$$

Remark 1. The number of illumination patterns, $M$, needs to be larger when $\epsilon^2$ is smaller. This is the same property as that of existing random projections such as the Johnson-Lindenstrauss lemma. In our case, we also need to increase $M$ as $q$ decreases, because $\Gamma_q$ and $\Lambda_q$ increase when $q$ decreases; this is reasonable because obtaining information out of sparse matrices requires increasing their number. In Definition 3.8, $\Gamma_q$ and $\Lambda_q$ consist of two parts: intensity and correlation. In $\Gamma_q(X-Y)$ and $\Lambda_q(X-Y)$, $(X(i,j) - Y(i,j))^2$ indicates the intensity, and $(X(i,j) - Y(i,j))(X(i',j') - Y(i',j'))$ is the pixel-wise correlation of the difference image between $X$ and $Y$. The intensity of $X - Y$ is small when $X$ and $Y$ are similar objects, and large otherwise. The correlation of $X - Y$ is small when $X - Y$ is sparse. When we use the maximum value of $\Gamma_q(X-Y)$ and $\Lambda_q(X-Y)$ over the space of $X - Y$, Theorem 3.11 becomes independent of the objects; in fact, the elements of $X$ are bounded and controllable to some extent.
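The scaling in Theorem 3.11 can be checked by simulation; the sketch below (with arbitrary object sizes, $M$, and $q$) estimates the middle quantity of (21) and compares it with the $(1 - 1/M)\|X - Y\|_F^2$ target.

```python
import numpy as np

rng = np.random.default_rng(3)

H, W, M, q = 16, 16, 20_000, 0.1

X, Y = rng.random((H, W)), rng.random((H, W))

# Independent binary masks B_m: the ghost imaging setting of Section 3.2
B = rng.binomial(1, q, size=(M, H, W)).astype(float)

def g(Z):
    G = np.einsum('mhw,hw->m', B, Z)   # G_m(Z) = sum_ij B_m(i,j) Z(i,j), Eq. (12)
    return G - G.mean()                # g_m(Z) = G_m(Z) - <G(Z)>, Eq. (13)

lhs = np.sum((g(X) - g(Y)) ** 2) / (M * q * (1 - q))
rhs = (1 - 1 / M) * np.linalg.norm(X - Y, 'fro') ** 2

print(f"scaled feature distance: {lhs:.2f}, (1 - 1/M) ||X-Y||_F^2: {rhs:.2f}")
```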
3.3 Analysis of Ghost Features in Ghost Cytometry

In ghost imaging, multiple illumination patterns are independent, i.e., $\{B_m\}_{m=1}^{M}$ are independently and randomly generated. Thus, the detected signals, i.e., the ghost features $\{G_m\}_{m=1}^{M}$, do not share illumination patterns: $G_m$ is generated only from $B_m$. In ghost cytometry, however, objects pass through a single randomly allocated illumination pattern; thus, the detected features share the illumination pattern, as follows. Let $B$ be an $H \times M$ random binary mask whose $(i,j)$-th element $B(i,j)$ is constructed by

$$B(i,j) = \begin{cases} 1 & \text{with probability } q, \\ 0 & \text{with probability } 1 - q, \end{cases} \qquad (23)$$

where $q \in [0,1]$ is a parameter. The matrix $B$ is the illumination pattern in ghost cytometry. The ghost feature for a fluorescence object $X$ is formulated as

$$G_m(X) = \sum_{i=1}^{H}\sum_{j=1}^{W}B(i, j + m - W)\,X(i,j), \qquad (24)$$

where, for simplicity of notation, $B(i,j) = 0$ if $j < 0$ or $j > M$. The problem is that $G_1(X), G_2(X), \ldots, G_{M+W-1}(X)$ are highly correlated because they share the elements of $B$. Ghost cytometry uses the ghost features $G_1(X), G_2(X), \ldots, G_{M+W-1}(X)$ to classify cell types. We analyze the ghost features obtained from Eq. (24). It is worth noting the following: since the time for a cell to pass through the structured illumination is several microseconds and the length of the structured illumination is several micrometers, it can be assumed that, by fluid control, the cell does not rotate and passes through the center of the structured illumination.

Similar to Definition 3.8, we define the following two functions.

Definition 3.12. Denote $i' = [k' \,\%\, H]$ and $m' = \lfloor k'/H\rfloor$ and define

$$\Psi_q(X) \overset{\mathrm{def}}{=} \frac{(1-2q)^2}{q(1-q)}\sum_{i=1}^{H}\left(\sum_{j=1}^{W}X(i,j)^2\right)^2 + 4\sum_{i=1}^{H}\sum_{k' > (m-1)H+i}^{(m-1+W)H}\left(\sum_{j=1}^{W-(m'-m)}X(i,j)\,X(i', j + m' - m)\right)^2, \qquad (25)$$

$$\Phi_q(X) \overset{\mathrm{def}}{=} \max\left\{\max_{(i,m)\neq(i',m')}\frac{2(1-q)}{q}\sum_{j=1}^{W-(m'-m)}X(i,j)\,X(i', j + m' - m),\ \max_{i}\frac{1-2q}{q}\sum_{j=1}^{W}X(i,j)^2\right\}. \qquad (26)$$

Compared with $\Gamma_q$ and $\Lambda_q$, $\Psi_q$ and $\Phi_q$ are a little more complicated, but their basic meanings are the same as those of $\Gamma_q$ and $\Lambda_q$ described in Remark 1. By using Theorems 3.3 and 3.4, we obtain the following theorem.

Theorem 3.13. Let $X$ and $Y$ be $H \times W$ real matrices. For all $\epsilon > 0$, with probability at least $1 - \delta$,

$$\frac{1}{q(1-q)(M+W-1)}\|G(X) - G(Y)\|_2^2 \ge (1 - \epsilon)\|X - Y\|_F^2 - \frac{q}{1-q}S[X-Y]^2, \qquad (27)$$

$$\frac{1}{q(1-q)(M+W-1)}\|G(X) - G(Y)\|_2^2 \le (1 + \epsilon)\|X - Y\|_F^2 + \frac{q}{1-q}S[X-Y]^2, \qquad (28)$$

where

$$\delta = 2\exp\left(-\frac{\epsilon^2 M}{2\left(\frac{2\Psi_q(X-Y)}{\|X-Y\|_F^4} + \frac{\Phi_q(X-Y)\,\epsilon}{\|X-Y\|_F^2}\right)}\right). \qquad (29)$$

Remark 2. The basic property of Theorem 3.13 is the same as that described in Remark 1 for Theorem 3.11. One of the differences between Theorems 3.11 and 3.13 is whether we evaluate $g$ or $G$, because in ghost cytometry $G$ is directly used in classification. The difference appears in the presence or absence of $\frac{q}{1-q}S[X-Y]^2$. It is desirable that $q$ be small because one laser is divided into multiple beams using DOEs and the binary matrix needs to be sparse for a better signal-to-noise ratio; thus, the term $\frac{q}{1-q}S[X-Y]^2$ is small. Moreover, $S[X-Y]$ is typically small. When objects $X$ and $Y$ are similar, $S[X-Y]$ takes a small value. When objects $X$ and $Y$ are dissimilar, $S[X-Y]$ also takes a small value because the elements of the matrix $X - Y$ take both positive and negative values and the operation $S$ is just a summation of the elements. When we use the formulation of $g$, the term $S[X-Y]$ disappears, which might improve the classification performance of ghost cytometry.

Since the following corollary holds, we also obtain the sub-exponential forms of Theorems 3.11 and 3.13.

Corollary 3.14 (Sub-exponentiality). Let $Z$ be a random variable satisfying the Bernstein inequality, Theorem 3.3. Then, for all $|\lambda| < 1/(2C)$,

$$\mathbb{E}[\exp(\lambda Z)] \le \exp\left(\lambda\mathbb{E}[Z] + \lambda^2\mathbb{V}[Z]\right), \qquad (30)$$

and, for all $0 \le \epsilon < \mathbb{V}[Z]/C$,

$$\mathbb{P}[|Z - \mathbb{E}[Z]| \ge \epsilon] \le 2\exp\left(-\frac{\epsilon^2}{4\mathbb{V}[Z]}\right). \qquad (31)$$
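The construction of Eq. (24) and the scaling in Theorem 3.13 can also be simulated. The sketch below (a toy simulation with arbitrary sizes, not the physical system) builds the time-series features from one shared static mask, sliding the object across it one column per time step, and compares the scaled feature distance with the Frobenius distance plus the $\frac{q}{1-q}S[X-Y]^2$ correction.

```python
import numpy as np

rng = np.random.default_rng(4)

H, W, M, q = 16, 16, 2048, 0.05                     # object H x W, mask H x M
B = rng.binomial(1, q, size=(H, M)).astype(float)   # static mask, Eq. (23)
Bpad = np.hstack([np.zeros((H, W - 1)), B, np.zeros((H, W - 1))])

def signal(Z):
    # Eq. (24): the M + W - 1 features share the mask B, hence are correlated
    return np.array([np.sum(Bpad[:, m:m + W] * Z) for m in range(M + W - 1)])

X, Y = rng.random((H, W)), rng.random((H, W))
Z = X - Y

lhs = np.sum((signal(X) - signal(Y)) ** 2) / (q * (1 - q) * (M + W - 1))
rhs = np.linalg.norm(Z, 'fro') ** 2 + q / (1 - q) * np.sum(Z) ** 2
print(f"scaled signal distance: {lhs:.1f},  ||X-Y||_F^2 + q/(1-q) S[X-Y]^2: {rhs:.1f}")
```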
3.4 Discussion

Theorems 3.11 and 3.13 indicate that the RBF kernel function calculated using ghost features approximates the RBF kernel function using image objects, i.e.,

$$\kappa_\gamma(X, Y) = \exp\left(-\gamma\|X - Y\|_F^2\right) \approx \kappa_\beta(g(X), g(Y)) = \exp\left(-\beta\|g(X) - g(Y)\|_2^2\right), \qquad (32)$$

where $\gamma \in (0, +\infty)$ and $\beta \in (0, +\infty)$ are kernel parameters. Note that in cross-validation we can tune $\beta$ instead of tuning $\gamma$.

The Frobenius norm is not rotation- or shift-invariant, so it does not by itself capture morphological information under such transformations. However, in the case of flow cytometry, we can obtain more representative objects by real data augmentation, from which we obtain augmented ghost features by injecting the object into the flow cytometer many times.

It is well known that a kernel function defines feature maps and vice versa. Let $\mathcal{H}$ be a Hilbert space. A feature map $\phi: \mathcal{X} \to \mathcal{H}$ takes input $x \in \mathcal{X}$ to a (possibly infinite-dimensional) feature vector $\phi(x) \in \mathcal{H}$. For every kernel $\kappa$, there exist a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}$ such that $\kappa(x, x') = \langle\phi(x), \phi(x')\rangle$, where $\langle\cdot,\cdot\rangle$ is the inner product in the Hilbert space. That is, on the basis of kernel theory, when we focus on cell-image objects as the input space, $\phi(X)$ indicates some of the features of the cell-image object $X$. The features in the Hilbert space are a black box, but with an SVM we predict the label of a target object by using the labels of representative objects, called support vectors, that are similar to the target object in terms of the Frobenius norm. Thus, the representative objects may have specific morphological features useful for prediction; that is, analyzing the representative cells may lead to an understanding of specific morphological features. Fortunately, ghost cytometry [6, 7] also showed good results for image reconstruction from raw signals on the basis of ghost imaging and compressed sensing [23]. That is, we can obtain the images of the representative cells from raw signals offline.
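To illustrate the image-free classification pipeline that this analysis supports, the following sketch trains an RBF-kernel SVM directly on simulated ghost cytometry signals of two toy "cell" classes; the cell model, the use of scikit-learn, and all parameters are illustrative assumptions, not the setup of [6, 7].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)

H, W, M, q, n = 16, 16, 256, 0.05, 200
B = rng.binomial(1, q, size=(H, M)).astype(float)
Bpad = np.hstack([np.zeros((H, W - 1)), B, np.zeros((H, W - 1))])

def signal(Z):
    # Eq. (24): sliding correlation of the object with the shared static mask
    return np.array([np.sum(Bpad[:, m:m + W] * Z) for m in range(M + W - 1)])

def make_cell(kind):
    # toy morphologies: a centered blob vs. a blob offset toward the edge
    Z = 0.1 * rng.random((H, W))
    r, c = (8, 8) if kind == 0 else (8, 11)
    Z[r - 2:r + 2, c - 2:c + 2] += 1.0
    return Z

labels = rng.integers(0, 2, size=n)
feats = np.array([signal(make_cell(k)) for k in labels])

Xtr, Xte, ytr, yte = train_test_split(feats, labels, random_state=0)
clf = SVC(kernel='rbf', gamma='scale').fit(Xtr, ytr)
print(f"image-free test accuracy: {clf.score(Xte, yte):.2f}")
```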
4 Conclusion

We provided a theoretical analysis of ghost features in ghost imaging and ghost cytometry. It states that there exists a ghost feature map from an object space into a signal space that preserves the pairwise Euclidean distances, in terms of the Frobenius norm, up to a bounded relative error. To the best of our knowledge, this work is the first step toward statistically analyzing and justifying optical machine learning.

One direction in optical machine learning is learning structured illumination patterns from training data, where the learning process for the illumination patterns is first carried out in computational simulation and the learned patterns are then implemented in optical elements. Since the entries of the structured illumination need to be binary, recent advances in binarized neural networks may help with learning structured illumination patterns. One limitation of ghost cytometry is that the illumination pattern is generated by DOEs, as in existing diffractive optical neural networks; that is, the fully hardware-implemented pattern lacks flexibility in the learning process. A spatial light modulator (SLM) can be a solution to this problem. That is, a hybrid system comprising binarized neural networks, an SLM, a single-pixel detector, and an FPGA may be the next trend in optical machine learning. In this case, it is important to construct a solid theory of learning binarized neural networks constrained by SLM-based optical operations. Moreover, we need to consider theories of transfer learning and fine-tuning from computer simulations to experiments with optical elements.

References

[1] T. B. Pittman, Y. H. Shih, D. V. Strekalov, and A. V. Sergienko. Optical imaging by means of two-photon quantum entanglement. Physical Review A, 52:R3429-R3432, 1995.
[2] Baris I. Erkmen and Jeffrey H. Shapiro. Ghost imaging: from quantum to classical to computational. Advances in Optics and Photonics, 2(4):405-450, 2010.
[3] Jeffrey H. Shapiro and Robert W. Boyd. The physics of ghost imaging. Quantum Information Processing, 11(4):949-993, 2012.
[4] Jeffrey H. Shapiro. Computational ghost imaging. Physical Review A, 78:061802, 2008.
[5] Ori Katz, Yaron Bromberg, and Yaron Silberberg. Compressive ghost imaging. Applied Physics Letters, 95:131110, 2009.
[6] Sadao Ota, Ryoichi Horisaki, Yoko Kawamura, Masashi Ugawa, Issei Sato, Kazuki Hashimoto, Ryosuke Kamesawa, Kotaro Setoyama, Satoko Yamaguchi, Katsuhito Fujiu, Kayo Waki, and Hiroyuki Noji. Ghost cytometry. Science, pages 1246-1251, 2018.
[7] Sadao Ota, Ryoichi Horisaki, Yoko Kawamura, Issei Sato, and Hiroyuki Noji. Imaging cytometry without image reconstruction (ghost cytometry). arXiv preprint arXiv:1903.12053, 2019.
[8] Xing Lin, Yair Rivenson, Nezih T. Yardimci, Muhammed Veli, Yi Luo, Mona Jarrahi, and Aydogan Ozcan. All-optical machine learning using diffractive deep neural networks. Science, pages 1004-1008, 2018.
[9] Kelvin Wagner and Demetri Psaltis. Multilayer optical learning networks. Applied Optics, pages 5061-5076, 1987.
[10] H. J. Caulfield, J. Kinser, and S. K. Rogers. Optical neural networks. Proceedings of the IEEE, pages 1573-1583, 1989.
[11] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[12] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60-65, 2003.
[13] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[14] Ping Li, Trevor Hastie, and Kenneth Ward Church. Very sparse random projections. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 287-296, 2006.
[15] Jiří Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Structures and Algorithms, 33(2):142-156, 2008.
[16] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), 1999.
[17] Qinfeng Shi, Chunhua Shen, Rhys Hill, and Anton van den Hengel. Is margin preserved after random projection? In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[18] V. V. Buldygin and K. Moskvichova. The sub-Gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, pages 33-49, 2019.
[19] Julyan Arbel, Olivier Marchal, and Hien D. Nguyen. On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables. arXiv preprint arXiv:1901.09188, 2019.
[20] S. N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, 1946.
[21] Isha Dewan and B. L. S. Prakasa Rao. A general method of density estimation for associated random variables. Journal of Nonparametric Statistics, 10(4):405-420, 1999.
[22] C. M. Newman. Normal fluctuations and the FKG inequalities. Communications in Mathematical Physics, pages 119-128, 1980.
[23] J. M. Bioucas-Dias and M. A. T. Figueiredo. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, pages 2992-3004, 2007.

A Proofs

B Proof of Lemma 3.10

For independent random variables $Z_1$, $Z_2$, and $Z_3$ with $\mathbb{E}[Z_1] = \mathbb{E}[Z_2] = \mathbb{E}[Z_3] = 0$, we have

$$\mathrm{Cov}(Z_1 Z_2, Z_2 Z_3) = \mathbb{E}[Z_1 Z_2^2 Z_3] - \mathbb{E}[Z_1 Z_2]\,\mathbb{E}[Z_2 Z_3] = 0, \qquad (33)$$

$$\mathrm{Cov}(Z_1^2, Z_1 Z_2) = \mathbb{E}[Z_1^3 Z_2] - \mathbb{E}[Z_1^2]\,\mathbb{E}[Z_1 Z_2] = 0. \qquad (34)$$

By applying these results, Theorem 3.4, and the expansion
$$(G_m(X) - qS[X])^2 = \left(\sum_{i=1}^{H}\sum_{j=1}^{W}(B_m(i,j) - q)X(i,j)\right)^2 = \sum_{i=1}^{H}\sum_{j=1}^{W}X(i,j)^2(B_m(i,j) - q)^2 + 2\sum_{j=1}^{W}\sum_{i=1}^{H}\sum_{k' > (j-1)H+i}^{WH}X(i,j)X(i',j')(B_m(i,j) - q)(B_m(i',j') - q), \qquad (35)$$

we have

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\right] = \prod_{m=1}^{M}\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}(G_m(X) - qS[X])^2\right)\right] = \prod_{m=1}^{M}\prod_{i=1}^{H}\prod_{j=1}^{W}\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}X(i,j)^2(B_m(i,j) - q)^2\right)\right] \times \prod_{m=1}^{M}\prod_{j=1}^{W}\prod_{i=1}^{H}\prod_{k' > (j-1)H+i}^{WH}\mathbb{E}\left[\exp\left(\frac{2t}{q(1-q)M}X(i,j)X(i',j')(B_m(i,j) - q)(B_m(i',j') - q)\right)\right]. \qquad (36)$$

Note that

$$\mathbb{E}[(B_m(i,j) - q)^2] = q(1-q), \qquad (37)$$

$$\mathbb{E}[(B_m(i,j) - q)(B_m(i',j') - q)] = 0 \quad ((i,j) \neq (i',j')), \qquad (38)$$

$$\mathbb{V}[(B_m(i,j) - q)^2] = \mathbb{E}[(B_m(i,j) - q)^4] - \mathbb{E}[(B_m(i,j) - q)^2]^2 = q(1-q)^4 + (1-q)(-q)^4 - (q(1-q))^2 = q(1-q)\left((1-q)^3 + q^3 - q(1-q)\right) = q(1-q)(1 - 4q + 4q^2) = q(1-q)(1-2q)^2, \qquad (39)$$

and, for $(i,j) \neq (i',j')$,

$$\mathbb{V}[(B_m(i,j) - q)(B_m(i',j') - q)] = \mathbb{E}[(B_m(i,j) - q)^2(B_m(i',j') - q)^2] - \mathbb{E}[(B_m(i,j) - q)(B_m(i',j') - q)]^2 = \mathbb{E}[(B_m(i,j) - q)^2]\,\mathbb{E}[(B_m(i',j') - q)^2] = q^2(1-q)^2. \qquad (40)$$

By using the Bernstein inequality, for any $|t|/M < 1/\Lambda_q(X)$,

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}X(i,j)^2(B_m(i,j) - q)^2\right)\right] \le \exp\left(\frac{t}{M}X(i,j)^2 + \frac{\left(\frac{t X(i,j)^2}{q(1-q)M}\right)^2\mathbb{V}[(B_m(i,j) - q)^2]}{2\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right) = \exp\left(\frac{t}{M}X(i,j)^2 + \frac{\left(\frac{t X(i,j)^2}{q(1-q)M}\right)^2 q(1-q)(1-2q)^2}{2\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right), \qquad (41)$$

and

$$\mathbb{E}\left[\exp\left(\frac{2t}{q(1-q)M}X(i,j)X(i',j')(B_m(i,j) - q)(B_m(i',j') - q)\right)\right] \le \exp\left(\frac{\left(\frac{2t X(i,j)X(i',j')}{q(1-q)M}\right)^2(q(1-q))^2}{2\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right). \qquad (42)$$

That is, for $i' = [k' \,\%\, H]$ and $j' = \lfloor k'/H\rfloor$,

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\right] \le \exp\left(t\|X\|_F^2 + \frac{\frac{t^2}{M}\frac{(1-2q)^2}{q(1-q)}\sum_{j=1}^{W}\sum_{i=1}^{H}X(i,j)^4}{2\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right)\exp\left(\frac{\frac{4t^2}{M}\sum_{j=1}^{W}\sum_{i=1}^{H}\sum_{k' > (j-1)H+i}^{WH}(X(i,j)X(i',j'))^2}{2\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right) = \exp\left(t\|X\|_F^2 + \frac{\Gamma_q(X)\,t^2}{2M\left(1 - \frac{\Lambda_q(X)}{M}|t|\right)}\right). \qquad (43)$$

Finally, since $\left(\sum_{m=1}^{M}a_m\right)^2 \le M\sum_{m=1}^{M}a_m^2$ by the Cauchy-Schwarz inequality, we have

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)}(qS[X] - \langle G(X)\rangle)^2\right)\right] = \mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M^2}\left(\sum_{m=1}^{M}\sum_{i=1}^{H}\sum_{j=1}^{W}X(i,j)(B_m(i,j) - q)\right)^2\right)\right] \le \mathbb{E}\left[\exp\left(\frac{t/M}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\right] \le \exp\left(\frac{t}{M}\|X\|_F^2 + \frac{\Gamma_q(X)\,t^2}{2M^3\left(1 - \frac{\Lambda_q(X)}{M^2}|t|\right)}\right). \qquad (44)$$
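As a sanity check (not part of the proof), the moment identities (37)-(40) can be verified by Monte Carlo simulation; the value of $q$ and the sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
q, n = 0.1, 2_000_000

b1 = rng.binomial(1, q, n) - q
b2 = rng.binomial(1, q, n) - q   # independent copy: plays (i',j') != (i,j)

print(np.mean(b1 ** 2), q * (1 - q))                      # Eq. (37)
print(np.mean(b1 * b2))                                   # Eq. (38), approximately 0
print(np.var(b1 ** 2), q * (1 - q) * (1 - 2 * q) ** 2)    # Eq. (39)
print(np.var(b1 * b2), q ** 2 * (1 - q) ** 2)             # Eq. (40)
```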
C Proof of Lemma 3.9

By the linearity of $G_m$ and $\langle G(\cdot)\rangle$,

$$G_m(X - Y) = \sum_{i,j}B_m(i,j)(X(i,j) - Y(i,j)) = \sum_{i,j}B_m(i,j)X(i,j) - \sum_{i,j}B_m(i,j)Y(i,j) = G_m(X) - G_m(Y), \qquad (45)$$

$$\langle G(X - Y)\rangle = \frac{1}{M}\sum_{m=1}^{M}G_m(X - Y) = \frac{1}{M}\sum_{m=1}^{M}\left(G_m(X) - G_m(Y)\right) = \langle G(X)\rangle - \langle G(Y)\rangle, \qquad (46)$$

we have

$$g_m(X - Y) = G_m(X - Y) - \langle G(X - Y)\rangle = G_m(X) - G_m(Y) - \left(\langle G(X)\rangle - \langle G(Y)\rangle\right) = \left(G_m(X) - \langle G(X)\rangle\right) - \left(G_m(Y) - \langle G(Y)\rangle\right) = g_m(X) - g_m(Y). \qquad (47)$$

D Proof of Theorem 3.11

Note that

$$\sum_{m=1}^{M}(G_m(X) - \langle G(X)\rangle)^2 = \sum_{m=1}^{M}\left((G_m(X) - qS[X]) + (qS[X] - \langle G(X)\rangle)\right)^2 = \sum_{m=1}^{M}\left[(G_m(X) - qS[X])^2 + (qS[X] - \langle G(X)\rangle)^2 + 2(G_m(X) - qS[X])(qS[X] - \langle G(X)\rangle)\right] = \sum_{m=1}^{M}(G_m(X) - qS[X])^2 + M(qS[X] - \langle G(X)\rangle)^2 + 2\sum_{m=1}^{M}(G_m(X) - qS[X])(qS[X] - \langle G(X)\rangle). \qquad (48)$$

We now have

$$\sum_{m=1}^{M}(G_m(X) - qS[X])(qS[X] - \langle G(X)\rangle) = \sum_{m=1}^{M}\left(G_m(X)qS[X] - (qS[X])^2 - G_m(X)\langle G(X)\rangle + qS[X]\langle G(X)\rangle\right) = M\langle G(X)\rangle qS[X] - M(qS[X])^2 - M\langle G(X)\rangle^2 + MqS[X]\langle G(X)\rangle = -M\left((qS[X])^2 + \langle G(X)\rangle^2 - 2qS[X]\langle G(X)\rangle\right) = -M(qS[X] - \langle G(X)\rangle)^2. \qquad (49)$$

Thus, we obtain

$$\sum_{m=1}^{M}(G_m(X) - \langle G(X)\rangle)^2 = \sum_{m=1}^{M}(G_m(X) - qS[X])^2 - M(qS[X] - \langle G(X)\rangle)^2. \qquad (50)$$

On the basis of Hölder's inequality with conjugate exponents $\rho_1, \rho_2$ ($1/\rho_1 + 1/\rho_2 = 1$),

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}g_m(X)^2\right)\right] = \mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\exp\left(-\frac{t}{q(1-q)}(qS[X] - \langle G(X)\rangle)^2\right)\right] \le \mathbb{E}\left[\exp\left(\frac{t\rho_1}{q(1-q)M}\sum_{m=1}^{M}(G_m(X) - qS[X])^2\right)\right]^{\frac{1}{\rho_1}}\mathbb{E}\left[\exp\left(-\frac{t\rho_2}{q(1-q)}(qS[X] - \langle G(X)\rangle)^2\right)\right]^{\frac{1}{\rho_2}}. \qquad (51)$$

By using Lemma 3.10 with $\rho_1 = \rho_2 = 2$,

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}g_m(X)^2\right)\right] \le \exp\left(t\|X\|_F^2 + \frac{\Gamma_q(X)\,\rho_1 t^2}{2M\left(1 - \frac{\Lambda_q(X)}{M}\rho_1|t|\right)}\right)\exp\left(-\frac{t}{M}\|X\|_F^2 + \frac{\Gamma_q(X)\,\rho_2 t^2}{2M^3\left(1 - \frac{\Lambda_q(X)}{M^2}\rho_2|t|\right)}\right) \le \exp\left(t\left(1 - \frac{1}{M}\right)\|X\|_F^2 + \left(1 + \frac{1}{M^2}\right)\frac{\Gamma_q(X)}{M}\frac{t^2}{1 - \frac{2\Lambda_q(X)}{M}|t|}\right). \qquad (52)$$

The last inequality follows from $\left(1 - \frac{2\Lambda_q(X)}{M^2}|t|\right)^{-1} < \left(1 - \frac{2\Lambda_q(X)}{M}|t|\right)^{-1}$. Thus, we have

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{m=1}^{M}g_m(X)^2\right)\right] \le \exp\left(t\left(1 - \frac{1}{M}\right)\|X\|_F^2 + \left(1 + \frac{1}{M^2}\right)\frac{\Gamma_q(X)}{M}\frac{t^2}{1 - \frac{2\Lambda_q(X)}{M}|t|}\right), \qquad (53)$$

and by using Theorem 3.3,

$$\mathbb{P}\left[\frac{1}{q(1-q)M}\sum_{m=1}^{M}g_m(X)^2 - \left(1 - \frac{1}{M}\right)\|X\|_F^2 \ge \epsilon\|X\|_F^2\right] \le \exp\left(-\frac{(\epsilon\|X\|_F^2)^2}{2\left(1 + \frac{1}{M^2}\right)\frac{2\Gamma_q(X)}{M} + \frac{2\Lambda_q(X)}{M}\epsilon\|X\|_F^2}\right) \le \exp\left(-\frac{\epsilon^2 M}{2\left(1 + \frac{1}{M^2}\right)\left(\frac{2\Gamma_q(X)}{\|X\|_F^4} + \frac{\Lambda_q(X)\,\epsilon}{\|X\|_F^2}\right)}\right). \qquad (54)$$

In the same way, we have

$$\mathbb{P}\left[\frac{1}{q(1-q)M}\sum_{m=1}^{M}g_m(X)^2 - \left(1 - \frac{1}{M}\right)\|X\|_F^2 \le -\epsilon\|X\|_F^2\right] \le \exp\left(-\frac{\epsilon^2 M}{2\left(1 + \frac{1}{M^2}\right)\left(\frac{2\Gamma_q(X)}{\|X\|_F^4} + \frac{\Lambda_q(X)\,\epsilon}{\|X\|_F^2}\right)}\right). \qquad (55)$$

Therefore, for every real matrix $X$, with probability at least $1 - \delta$,

$$\left(1 - \frac{1}{M} - \epsilon\right)\|X\|_F^2 \le \frac{1}{Mq(1-q)}\|g(X)\|_2^2 \le \left(1 - \frac{1}{M} + \epsilon\right)\|X\|_F^2. \qquad (56)$$

On the basis of the linearity of $g(X)$ (Lemma 3.9), substituting $X - Y$ for $X$ in Eq. (56) completes the proof.

E Proof of Theorem 3.13

Deriving the exponential inequality is similar to the ghost imaging case in Eq. (43). With $i' = [k' \,\%\, H]$, $m' = \lfloor k'/H\rfloor$, and $V_{(m-1)H+i} = B(i,m) - q$ denoting a centered mask entry,

$$\sum_{m=1}^{M+W-1}(G_m(X) - qS[X])^2 = \underbrace{\sum_{m=1}^{M}\sum_{i=1}^{H}\sum_{j=1}^{W}X(i,j)^2(B(i,m) - q)^2}_{\text{Part (A)}} + 2\underbrace{\sum_{m=1}^{M}\sum_{i=1}^{H}\sum_{k' > (m-1)H+i}^{(m-1+W)H}\sum_{j=1}^{W-(m'-m)}X(i,j)X(i', j + m' - m)\,V_{(m-1)H+i}V_{k'}}_{\text{Part (B)}}. \qquad (57)$$
That is, for any $|t|/M < 1/\Phi_q(X)$,

$$\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)(M+W-1)}\sum_{m=1}^{M+W-1}(G_m(X) - qS[X])^2\right)\right] \le \underbrace{\prod_{m=1}^{M}\prod_{i=1}^{H}\mathbb{E}\left[\exp\left(\frac{t}{q(1-q)M}\sum_{j=1}^{W}X(i,j)^2(B(i,m) - q)^2\right)\right]}_{\text{Part (A')}} \times \underbrace{\prod_{m=1}^{M}\prod_{i=1}^{H}\prod_{k' > (m-1)H+i}^{(m-1+W)H}\mathbb{E}\left[\exp\left(\frac{2t}{q(1-q)M}\sum_{j=1}^{W-(m'-m)}X(i,j)X(i', j + m' - m)\,V_{(m-1)H+i}V_{k'}\right)\right]}_{\text{Part (B')}}$$

$$\le \prod_{m=1}^{M}\prod_{i=1}^{H}\exp\left(\frac{t}{M}\sum_{j=1}^{W}X(i,j)^2 + \frac{\left(\frac{t\sum_{j=1}^{W}X(i,j)^2}{q(1-q)M}\right)^2 q(1-q)(1-2q)^2}{2\left(1 - \frac{\Phi_q(X)}{M}|t|\right)}\right) \times \prod_{m=1}^{M}\prod_{i=1}^{H}\prod_{k' > (m-1)H+i}^{(m-1+W)H}\exp\left(\frac{\left(\frac{2t\sum_{j=1}^{W-(m'-m)}X(i,j)X(i', j + m' - m)}{q(1-q)M}\right)^2(q(1-q))^2}{2\left(1 - \frac{\Phi_q(X)}{M}|t|\right)}\right)$$

$$= \exp\left(t\|X\|_F^2 + \frac{\frac{t^2}{M}\frac{(1-2q)^2}{q(1-q)}\sum_{i=1}^{H}\left(\sum_{j=1}^{W}X(i,j)^2\right)^2}{2\left(1 - \frac{\Phi_q(X)}{M}|t|\right)}\right)\exp\left(\frac{\frac{4t^2}{M}\sum_{i=1}^{H}\sum_{k' > (m-1)H+i}^{(m-1+W)H}\left(\sum_{j=1}^{W-(m'-m)}X(i,j)X(i', j + m' - m)\right)^2}{2\left(1 - \frac{\Phi_q(X)}{M}|t|\right)}\right) \le \exp\left(t\|X\|_F^2 + \frac{\Psi_q(X)\,t^2}{2M\left(1 - \frac{\Phi_q(X)}{M}|t|\right)}\right). \qquad (58)$$

Theorem 3.13 then holds as a consequence of Theorem 3.3 and the linearity of $G(\cdot)$, i.e., $G(X - Y) = G(X) - G(Y)$.

F Proof of Corollary 3.14

By using the Markov inequality, for any $\lambda > 0$,

$$\mathbb{P}[Z - \mathbb{E}[Z] \ge \epsilon] = \mathbb{P}[\exp(\lambda(Z - \mathbb{E}[Z])) \ge \exp(\lambda\epsilon)] \le \frac{\mathbb{E}[\exp(\lambda(Z - \mathbb{E}[Z]))]}{\exp(\lambda\epsilon)} \overset{(30)}{\le} \exp\left(\mathbb{V}[Z]\lambda^2 - \lambda\epsilon\right) = \exp\left(\mathbb{V}[Z]\left(\lambda - \frac{\epsilon}{2\mathbb{V}[Z]}\right)^2 - \frac{\epsilon^2}{4\mathbb{V}[Z]}\right). \qquad (59)$$

Thus, taking $\lambda = \frac{\epsilon}{2\mathbb{V}[Z]}$, we have $\mathbb{P}[Z - \mathbb{E}[Z] \ge \epsilon] \le \exp\left(-\frac{\epsilon^2}{4\mathbb{V}[Z]}\right)$, valid for $\frac{\epsilon}{2\mathbb{V}[Z]} < \frac{1}{2C}$.