Matrix optimization on universal unitary photonic devices

Matrix optimization on univ ersal unitary photonic devices Sunil P ai, 1 , ∗ Ben Bartlett, 2 Ola v Solgaard, 1 and Da vid A. B. Miller 1 , † 1 Dep artment of Ele ctric al Engine ering, Stanfor d University, Stanfor d, CA 94305, USA 2 Dep artment of Applie d Physics, Stanfor d University, Stanfor d, CA 94305, USA Univ ersal unitary photonic devices can apply arbitrary unitary transformations to a v ector of input mo des and provide a promising hardware platform for fast and energy-eﬃcient machine learning using ligh t. W e sim ulate the gradient-based optimization of random unitary matrices on univ ersal photonic devices comp osed of imp erfect tu nable in terferometers. If device comp onen ts are initialized uniform-randomly , the lo cally-in teracting nature of the mesh comp onen ts biases the optimization searc h space tow ards banded unitary matrices, limiting conv ergence to random unitary matrices. W e detail a procedure for initializing the device b y sampling from the distribution of random unitary matrices and sho w that this greatly improv es conv ergence sp eed. W e also explore mesh architecture impro vemen ts such as adding extra tunable b eamsplitters or p erm uting wa v eguide lay ers to further impro ve the training sp eed and scalabilit y of these devices. I. INTR ODUCTION Univ ersal m ultip ort in terferometers are optical net- w orks that perform arbitrary unitary transformations on input v ectors of coherent light mo des. Suc h devices can b e used in applications including quan tum computing (e.g. b oson sampling, photon walks) [1 – 4], mode un- scram blers [5], photonic neural net works [6 – 8], and ﬁnd- ing optimal channels through lossy scatterers [9]. While univ ersal photonic devices ha v e b een exp erimen tally real- ized at a relativ ely small scale [5, 6], commercial applica- tions suc h as hardw are for energy-eﬃcient machine learn- ing and signal pro cessing can beneﬁt from scaling the devices to up to N = 1000 mo des. At this scale, fabrica- tion imp erfections and components with scale-dep enden t sensitivities can negatively aﬀect performance. One canonical univ ersal photonic device is the rectan- gular m ultip ort in terferometer mesh [10] sho wn in Figure 1 in terfering N = 8 mo des. In m ultip ort interferome- ters, an N -dimensional v ector is represented by an ar- ra y of mo des arranged in N single-mo de w av eguides. A unitary operation is applied to the input vector by tun- ing Mac h-Zehnder in terferometers (MZIs) represented b y the red dots of Figure 1. Eac h MZI is a t wo-port op- tical comp onen t made of tw o 50:50 b eamsplitters and t wo tunable single-mo de phase shifters. Other mesh ar- c hitectures hav e b een prop osed, such as the triangular mesh [11] (sho wn in Appendix C), the univ ersal cascaded binary tree architecture [12], and lattice arc hitectures where light do es not mo v e in a forward-only direction [13 – 15]. The scalability of optimizing mesh architectures, es- p ecially using gradient-based metho ds, is limited by the abilit y of the lo cally interacting architecture to control the output p o w ers in the mesh. If phase shifts in the mesh are initialized uniform-randomly , light propagates through the device in a manner similar to a random w alk. ∗ sunilpai@stanford.edu † dabm@stanford.edu 1 2 2 2 3 4 4 1 1 5 6 3 3 7 5 1 1 5 6 3 3 4 4 1 1 2 2 2 1 2 3 4 5 6 7 8 U 1 U 2 U 3 U 4 U 5 U 6 U 7 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 = R θ R φ [0 , π ] [0 , 2 π ) B B FIG. 1. Mesh diagram representing the lo cally interacting rectangular mesh for N = 8. The inputs (and single-mo de phase shifts at the inputs) are represen ted by blue triangles. Outputs are represented by purple squares. The MZI no des are represen ted by red dots lab elled with sensitivit y index α n` (e.g., α 44 = 7 is the most sensitiv e no de). The no des represen t the Giv ens rotation U n (in orange) at vertical lay er ` (in green). Eac h photonic MZI no de can be represented with 50:50 b eamsplitters B (red) and phase shifters R θ , R φ (orange) with required ranges 0 ≤ θ ≤ π and 0 ≤ φ < 2 π . The oﬀ-diagonal, nonlo cal elements of the implemented unitary matrix tend to b e close to zero b ecause transi- tions b et w een inputs and outputs that are far apart ha v e few er paths (e.g., input 1 and output 8 in Figure 1 hav e a single path). The resulting mesh therefore implemen ts a unitary matrix with a banded structure that is increas- ingly pronounced as the matrix size increases. In man y applications suc h as machine learning [6] and quan tum computing [2, 16], w e a v oid this banded unitary 2 matrix b eha vior in fav or of random unitary matrices. A random unitary matrix is ac hiev ed when the device phase shifts follow a distribution derived from random matrix theory [16 – 20]. In the random matrix theory model, we assign a sensitivity index to eac h component that in- creases to wards the center of the mesh, as shown in Fig- ure 1. The more sensitiv e comp onen ts to ward the cen ter of the mesh require higher transmissivities and tighter optimization tolerances. If the required tolerances are not met, the implemen ted unitary matrix begins to sho w the undesired banded b eha vior. In Section I I, w e introduce the photonic mesh arc hitec- ture and sources of error that can exacerbate the banded unitary matrix problem. In Section I II, w e explicitly mo del the comp onen t settings to implemen t a random unitary matrix and ultimately av oid the banded unitary matrix problem. W e prop ose a “Haar initialization” pro- cedure that allows ligh t to propagate uniformly to all out- puts from any input. W e use this pro cedure to initialize the gradien t-based optimization of a photonic mesh to learn unknown random unitary matrices given training data. W e sho w that this optimization conv erges ev en in the presence of signiﬁcant sim ulated fabrication errors. In Sections IV and V, we prop ose and simulate tw o al- terations to the mesh architecture that further impro ve gradien t-based optimization p erformance. First, we add redundan t MZIs in the mesh to reduce conv ergence er- ror b y up to ﬁve orders of magnitude. Second, we p er- m ute the mesh interactions while maintaining the same n umber of tunable comp onen ts, whic h increases allo w- able tolerances of phase shifters, decreases oﬀ-diagonal errors, and improv es con v ergence time. I I. PHOTONIC MESH W e deﬁne the photonic mesh when op erated perfectly and then discuss how beam splitter or phase shift errors can aﬀect device p erformance. A. Photonic unitary implementation A single-mo de phase shifter can p erform an arbitrary U(1) transformation e iφ on its input. A phase-modulated Mac h-Zehnder in terferometer (MZI) with p erfect (50 : 50) b eamsplitters can apply to its inputs a unitary trans- formation U of the form: U ( θ, φ ) := R φ B R θ B =  e iφ 0 0 1  1 √ 2  1 i i 1   e iθ 0 0 1  1 √ 2  1 i i 1  = ie iθ 2  e iφ sin θ 2 e iφ cos θ 2 cos θ 2 − sin θ 2  , (1) where B is the b eamsplitter op erator, R θ , R φ are upper phase shift op erators. Equation 1 is represen ted diagram- matically b y the conﬁguration in Figure 1. 1 If one or t wo single-mo de phase shifters are added at the inputs, we can apply an arbitrary SU(2) or U(2) transformation to the inputs, resp ectiv ely . W e deﬁne the transmissivity and reﬂectivit y of the MZI as: t := cos 2  θ 2  = | U 12 | 2 = | U 21 | 2 r := sin 2  θ 2  = 1 − t = | U 11 | 2 = | U 22 | 2 . (2) In this conv en tion, when θ = π , w e hav e r = 1 , t = 0 (the MZI “bar state”), and when θ = 0, w e ha ve r = 0 , t = 1 (the MZI “cross state”). If there are N input mo des and the interferometer is connected to w av eguides n and n + 1 then we can embed the 2 × 2 unitary U from Equation 1 in N -dimensional space with a locally-interacting unitary “Giv ens rota- tion” U n deﬁned as: U n := n n + 1                     1 · · · 0 0 · · · 0 . . . . . . . . . . . . . . . 0 · · · U 11 U 12 · · · 0 n 0 · · · U 21 U 22 · · · 0 n + 1 . . . . . . . . . . . . . . . 0 · · · 0 0 · · · 1 . (3) All diagonal elemen ts are 1 except those lab eled U 11 and U 22 , whic h ha v e magnitudes of √ r = √ 1 − t , and all oﬀ- diagonal elements are 0 except those lab eled U 12 and U 21 , whic h hav e magnitudes of √ t . Arbitrary unitary transformations can be implemen ted on a photonic c hip using only lo cally in teracting MZIs [11]. In this pap er, we focus on optimizing a rectangular mesh [10] of MZIs; how ever, our ideas can b e extended to other universal sc hemes, such as the triangular mesh [22], as well. In the rectangular mesh sc heme [10] of Figure 1, we represen t ˆ U R ∈ U( N ) in terms of N ( N − 1) / 2 lo- cally interacting Givens rotations U n and N single-mo de phase shifts at the inputs represen ted by diagonal unitary D ( γ 1 , γ 2 , . . . γ N ): ˆ U R := N Y ` =1 Y n ∈S `,N U n ( θ n` , φ n` ) · D ( γ 1 , γ 2 , . . . γ N ) , (4) where our lay er-wise product left-multiplies from ` = N to 1, 2 the single-mode phase shifts are γ n ∈ [0 , 2 π ), and 1 Other conﬁgurations with tw o indep enden t phase shifters b e- tw een the beamsplitters B are ultimately equiv alent for photonic meshes [21]. 2 In general, for matrix products for a sequence { M ` } , we deﬁne the multiplication order Q N ` =1 M ` = M N M N − 1 · · · M 1 . 3 where the Giv ens rotations are parameterized by θ n` ∈ [0 , π ] , φ n` ∈ [0 , 2 π ). 3 W e deﬁne the top indices of each in teracting mo de for each vertical la y er as the set S `,N = { n ∈ [1 , 2 , . . . N − 1] | n (mo d 2) ≡ ` (mo d 2) } . This v ertical lay er deﬁnition follows the conv en tion of Refs. [7, 23] and is depicted in Figure 1, where ` represents the index of the vertical la y er. B. Beamsplitter error tolerances The expressions in Equations 1 and 4 assume p erfect fabrication. In practice, how ev er, we would lik e to sim- ulate how practical devices with errors in each transfer matrix B , R φ , R θ in Equation 1 impact optimization p er- formance. In fabricated c hip technologies, imp erfect b eamsplit- ters B can hav e a split ratio error  that c hange the b e- ha vior of the red 50:50 coupling regions in Figure 1 or B in Equation 1. The resultant scattering matrix U  with imp erfect beamsplitters B  can b e written as: B  := 1 √ 2  √ 1 +  i √ 1 −  i √ 1 −  √ 1 +   U  := R φ B  R θ B  . (5) As sho wn in App endix B, if we assume b oth b eamsplit- ters hav e identical  , w e ﬁnd t  := t (1 −  2 ) ∈ [0 , 1 −  2 ] is the realistic transmissivity , r  := r + t ·  2 ∈ [  2 , 1] is the realistic reﬂectivity , and t, r are the ideal transmissivity and reﬂectivit y deﬁned in Equation 2. The unitary matrices in Equation 5 cannot express the full transmissivity range of the MZI, with errors of up to  2 in the transmissivity , p oten tially limiting the p erfor- mance of greedy progressiv e photonic algorithms [24 – 26]. Our Haar phase theory , whic h we develop in the follo wing section, determines acceptable interferometer tolerances for calibration of a “p erfect mesh” consisting of imperfect b eamsplitters [21] giv en large N . W e will additionally sho w that simulated photonic backpropagation [7] with adaptiv e learning can adjust to nearly matc h the p erfor- mance of p erfect meshes with errors as high as  = 0 . 1 for meshes of size N = 128. C. Phase shift tolerances Another source of uncertain ty in photonic meshes is the phase shift tolerances of the mesh whic h aﬀect the matrices R θ , R φ of Equation 1, shown in orange in Fig- ure 1. Error sources such as thermal crosstalk or envi- ronmen tal drift may result in slight deviance of phase 3 Since γ n , φ n` are p eriodic phase parameters, they are in half- open in terv als [0 , 2 π ). In con trast, any θ n` ∈ [0 , π ] must be in a closed interv al to achiev e all transmissivities t n` ∈ [0 , 1]. shifts in the mesh from intended op eration. Such errors primarily aﬀect the control parameters θ n` that con trol ligh t propagation in the mesh by aﬀecting the MZI split ratios. This nontrivial problem warran ts a discussion of mean behavior and sensitivities (i.e., the distribution) of θ n` needed to optimize a random unitary matrix. I II. HAAR INITIALIZA TION A. Cross state bias and sensitivit y index The conv ergence of global optimization dep ends crit- ically on the sensitivity of eac h phase shift. The gra- dien t descent optimization w e study in this paper con- v erges when the phase shifts are correct to within some acceptable range. This acceptable range can b e rigor- ously deﬁned in terms of av erage v alue and v ariance of phase shifts in the mesh that together deﬁne an unbiased (“Haar random”) unitary matrix. 4 T o implemen t a Haar random unitary , some MZIs in the mesh need to b e biased to wards cross state ( t n` near 1, θ n` near 0) [16, 24]. This cross state bias correspondingly “pinc hes” the acceptable range for transmissivity and phase shift near the limiting cross state conﬁguration, resulting in higher sensitivity , as can b e seen in Figure 3(b). F or an implemen ted Haar random unitary matrix, lo w- tolerance, transmissiv e MZIs are located to w ards the cen- ter of a rectangular mesh [16, 24] and the ap ex of a tri- angular mesh as prov en in App endix C. F or both the triangular and rectangular meshes, the cross state bias and corresp onding sensitivity for each MZI dep ends only on the total n umber of reachable wa v eguides p orts, as pro ven in Appendix I. Based on this pro of, w e deﬁne the sensitivit y index α n` := | I n` | + | O n` | − N − 1, 5 where I n` and O n` are the subsets of input and output wa veg- uides reachable by ligh t exiting or en tering the MZI, re- sp ectiv ely , and |·| denotes set size. Figure 1 and Figure 2(a) show the sensitivit y index for the rectangular mesh, whic h clearly increases tow ards the cen ter MZI. B. Phase shift distributions and Haar phase The external φ n` , γ n phase shifts do not aﬀect the the transmissivit y t n` and therefore ob ey uniform random distributions [16]. In contrast, the θ n` phase shifts hav e a probability density function (PDF) that dep ends on 4 A Haar random unitary is deﬁned as Gram-Sc hmidt orthogonal- ization of N standard normal complex vectors [16, 20]. 5 Note that 1 ≤ α n` ≤ N − 1, and there are alwa ys N − α n` MZIs that hav e a sensitivity index of α n` . 4 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( a ) Sensitivity index: α n` 0 8 16 24 32 40 48 56 64 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( b ) R eﬂectivity: h r n` i = ( α n` + 1) − 1 0 0 . 25 0 . 5 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( c ) Phase: θ n` / 2 0 π / 4 π / 2 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( d ) Haar phase: ξ n` = t α n` n` 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( e ) Haar random θ n` init 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 0 8 16 24 32 40 48 56 64 Layer ( ` ) 0 8 16 24 32 40 48 56 64 Input ( n ) ( f ) Uniform random θ n` init 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 FIG. 2. (a) The sensitivit y index α n` for N = 64. (b) Chec kerboard plot for the a v erage reﬂectivit y h r n` i in a rect- angular mesh. (c) Haar-random matrix and run the decom- p osition in Ref. [10] to ﬁnd phases approaching cross state in the middle of the mesh. (d) The Haar phase ξ n` for the rectangular mesh b etter displa ys the randomness. (e, f ) Field measuremen ts (absolute v alue) from propagation at input 32 in (e) Haar and (f ) uniform random initialized rectangular meshes with N = 64. α n` [16]: P α n`  θ n` 2  = α n` sin  θ n` 2   cos  θ n` 2  2 α n` − 1 . (6) The general shap e of this distribution is presen ted in Figure 3(b), showing how an increase in α n` biases θ n` to wards the cross state with higher sensitivit y . W e deﬁne the Haar phase ξ n` as the cum ulative dis- tribution function (CDF) of θ n` / 2 starting from θ n` / 2 = π / 2: ξ n` := Z θ n` / 2 π / 2 P α n` ( θ )d θ. (7) Using Equations 6 and 7, w e can deﬁne ξ n` ( θ n` ) ∈ [0 , 1] that yields a Haar random matrix: ξ n` =  cos 2  θ n` 2  α n` = t α n` n` , (8) where t n` represen ts the transmissivity of the MZI, which is a function of θ n` as deﬁned in Equation 2. C. Haar initialization In the physical setting, it is useful to ﬁnd the inv erse of Equation 8 to directly set the measurable transmis- sivit y t n` of eac h MZI using a uniformly v arying Haar phase ξ n` ∼ U (0 , 1), a pro cess we call “Haar initializa- tion” sho wn in Figure 2(c, d): t n` = α n` p ξ n` θ n` = 2 arccos √ t n` = 2 arccos 2 α n` p ξ n` , (9) where the expression for θ n` is just a rearrangemen t of Equation 2. Haar initialization can be ac hiev ed progressiv ely using a pro cedure similar to that in Ref. [25]. If the phase shifters in the mesh are all well-c haracterized, the trans- missivities can b e directly set [16]. W e will sho w in Sec- tion V that Haar initialization improv es the conv ergence sp eed of gradien t descen t optimization signiﬁcantly . W e can also use Equation 9 to ﬁnd the a verage trans- missivit y and reﬂectivit y for an MZI parameterized by α n` as is found through simulation in Ref. [24]: h t n` i = Z 1 0 dξ n` α n` p ξ n` = α n` α n` + 1 h r n` i = 1 α n` + 1 = 1 | I n` | + | O n` | − N . (10) The a v erage reﬂectivit y h r n` i sho wn in Figure 2(b) gives a simple in terpretation for the sensitivit y index sho wn in Figure 2(a). The av erage reﬂectivity is equal to the in- v erse of the total n umber of inputs and outputs reachable b y the MZI minus the num b er of ports on either side of the device, N . This is true regardless of whether α n` is assigned for a triangular or rectangular mesh. T o see what the Haar initialization has accomplished, w e can compare the ﬁeld propagation through the rect- angular mesh from a single input when Haar initialized v ersus uniform initialized in Figure 2(e). Physically , this corresp onds to light in the mesh spreading out quickly from the input of the mesh and “interacting” more near the b oundaries of the mesh (inputs, outputs, top, and b ottom), as compared to the center of the mesh which has high transmissivity . In contrast, when phases are randomly set, the ligh t eﬀectively follows a random w alk through the mesh, resulting in the ﬁeld propagation pat- tern sho wn in Figure 2(f ). D. T olerance dep endence on N While Haar initialization is based on how the a verage comp onen t reﬂectivit y scales with N , optimization con- v ergence and device robustness ultimately dep end on ho w 5 0 π / 4 π / 2 Phase ( θ / 2 ) 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 Haar Phase ( ξ α ) ( a ) α = 1 α = 5 α = 25 α = 125 α = 500 0 π / 4 π / 2 Phase ( θ / 2 ) P α ( θ / 2) (A.U .) ( b ) α = 1 α = 5 α = 25 α = 125 α = 500 0 200 400 600 800 1000 α 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 Standard Deviation ( σ θ ; α ) ( c ) − 2 − 1 0 1 2 P eriodic Haar Phase ( e ξ α ) 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 T ransmissivity ( t = α q ξ ( e ξ α ) ) ( d ) α = 1 α = 5 α = 25 α = 125 α = 500 0 1 Haar Phase ( ξ α ) FIG. 3. (a) Plot of the relationship b et ween ξ α and θ . (b) W e sho w that phase shift standard deviation σ θ ; α decreases as α increases. (c) A plot of σ θ ; α as α increases. (d) The transmis- sivit y of an MZI comp onen t as a function of a p eriodic Haar phase has a p o w er law relationship. The p eriodic Haar phase e ξ α is mapp ed to the Haar phase by a function ξ : R → [0 , 1] as discussed in Appendix G. phase shift tolerances scale with N . The a verage sensitiv- it y index in the mesh is h α n` i = ( N + 1) / 3. As sho wn in Figure 3(b, c), the standard deviation σ θ ; α o ver the PDF P α decreases as α increases. Therefore, a phase shifter’s allo wable tolerance, whic h roughly correlates with σ θ ; α , decreases as the total num b er of input and output p orts aﬀected by that comp onen t increases. Since h α n` i in- creases linearly with N , the required tolerance gets more restrictiv e at large N , as shown in Figure 3(c). W e ﬁnd that the standard deviation is on the order 10 − 2 radi- ans for most v alues of N in the sp eciﬁed range. Thus, if thermal crosstalk is ignored [6], it is p ossible to imple- men t a known random unitary matrix in a photonic mesh assuming p erfect op eration. How ever, we concern our- selv es with on-c hip optimization given just input/output data, in which case the unitary matrix is unknown. In suc h a case, the decreasing tolerances do p ose a challenge in conv erging to a global optimum as N increases. W e demonstrate this problem for N = 128 in Section V. T o account for the scalability problem in global opti- mization, one strategy ma y b e to design a comp onen t in suc h a wa y that the mesh MZIs can b e controlled b y Haar phase voltages as in Figure 3(d) and Equation 9. The transmissivity dep endence on a p eriodic Haar phase (sho wn in Figure 3(d) and discussed in App endix G), is mark edly diﬀerent from the usual sinusoidal dependence on p eriodic θ n` . The MZIs near the boundary v ary in transmissivit y ov er a larger v oltage region than the MZIs near the cen ter, where only small voltages are needed get to full transmissivit y . This results in an eﬀectively small con trol tolerance near small voltages. This motiv ates the mo diﬁcations to the mesh architecture which w e discuss in the next section. IV. AR CHITECTURE MODIFICA TIONS W e prop ose t wo architecture modiﬁcations that can relax the transmissivit y tolerances in the mesh discussed in Section I II and result in signiﬁcan t impro vemen t in optimization. A. Redundan t rectangular mesh (RRM) By adding extra tunable MZIs, it is p ossible to greatly accelerate the optimization of a rectangular mesh to an unkno wn unitary matrix. The addition of redundan t tun- able lay ers to a redundant rectangular mesh is depicted in green in Figure 4(a). The authors in Ref. [24] p oin t out that using suc h “underdetermined meshes” (n um b er of inputs less than the n umber of tunable lay ers in the mesh) can ov ercome photonic errors and restore ﬁdelit y in unitary construction algorithms. Adding lay ers to the mesh increases the ov erall optical depth of the device, but em b edding smaller meshes with extra b eamsplitter lay- ers in a rectangular mesh of an acceptable optical depth do es not pose intrinsic w a veguide loss-related problems. B. P ermuting rectangular mesh (PRM) Another method to accelerate the optimization of a rectangular mesh is to shuﬄe outputs at regular inter- v als within the rectangular mesh. This sh uﬄing relaxes comp onen t tolerances and uniformity of the n um b er of paths for each input-output transition. W e use this in- tuition to formally deﬁne a p erm uting rectangular mesh. F or simplicity , 6 assume N = 2 K for some p ositiv e inte- ger K . Deﬁne “rectangular p erm utation” op erations P k that allow inputs to interact with w av eguides at most 2 k a wa y for k < K . These rectangular p erm utation blo c ks can be implemented using a rectangular mesh comp osed of MZIs with ﬁxed cross state phase shifts, as shown in Figure 4(b), or using low-loss w a v eguide crossings. W e now add p erm utation matrices P 1 , P 2 , . . . P K − 1 6 If N is not a p o w er of 2, then one might consider the follo wing approximate design: K = d log 2 N e . Deﬁne b ( K ) = K √ N , and let each P k hav e d b k e lay ers. 6 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 U 1 U 2 U 3 U 4 U 5 U 6 U 7 U 8 U 9 U 10 U 11 U 12 U 13 U 14 U 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ( a ) R edundant rectangular mesh ( RRM ) 1 2 3 4 P 1 5 6 7 8 P 2 9 10 11 12 P 3 13 14 15 16 U 1 U 2 U 3 U 4 U 5 U 6 U 7 U 8 U 9 U 10 U 11 U 12 U 13 U 14 U 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ( b ) P ermuting rectangular mesh ( PRM ) FIG. 4. (a) A 16 × 16 rectangular mesh (red). Extra tunable la yers (green) may b e added to signiﬁcan tly reduce conv er- gence time. (b) A 16-input, 30-lay er permuting rectangular mesh. The rectangular p erm utation lay er is implemen ted us- ing either wa v eguide crossings or cross state MZIs (gray). in to the middle of the rectangular mesh as follo ws ˆ U PR := M K K − 1 Y k =1 P k M k ! M k := min ( k d N K e ,N ) Y ` =( k − 1) d N K e Y n ∈S `,N U n ( θ n` , φ n` ) , (11) where d x e represents the nearest in teger larger than x . There are tw o op erations p er blo c k k : an d N K e -la yer rectangular mesh which we abbreviate as M k , and the rectangular permutation mesh P k where block index k ∈ [1 · · · K − 1]. This is lab elled in Figure 4(b). V. SIMULA TIONS No w that we hav e discussed the mesh modiﬁcations and Haar initialization, w e simulate global optimization to show ho w our framework can improv e con vergence p er- formance by up to ﬁve orders of magnitude, even in the presence of fabrication error. A. Mesh initialization W e b egin by discussing the imp ortance of initializing the mesh to resp ect the cross state bias and sensitivit y of U R (8 , 8) U R (32 , 32) U R (256 , 256) U R (1024 , 1024) U R (8 , 16) U R (32 , 64) U R (256 , 512) U R (1024 , 2048) U PR (8) U PR (32) U PR (256) U PR (1024) FIG. 5. Element wise absolute v alues of unitary matrices re- sulting from rectangular ( U ∼ U R ) and p erm uting rectangular ( U ∼ U PR ) meshes where meshes are initialized with uniform- random phases. eac h comp onen t for Haar random unitary matrices dis- cussed in Section I II. Uniform random phase initializa- tion is problematic because it is agnostic of the sensitivit y and a v erage behavior of each comp onen t. W e deﬁne this distribution of matrices as U R ( N , L ) for a rectangular mesh for N inputs and L la y ers. As shown previously in Figure 2(f ), an y giv en input follows a random walk-lik e propagation if phases are initialized uniform-randomly , so there will only b e non-zero matrix elements within a “bandsize” about the diagonal. This bandsize decreases as circuit size N increases as shown in Figure 5. W e compare the bandsizes of banded unitary matrices in sim ulations qualitatively as w e do in Figure 5 or quan- titativ ely as w e do in App endix D. W e randomly gener- ate U ∼ U R ( N , N ), U ∼ U PR ( N ) (p erm uting rectangular mesh with N tunable lay ers), and U ∼ U R ( N , N + δ N ) (redundan t rectangular mesh with δ N extra tunable lay- ers). Figure 5 shows a signiﬁcant reduction in bandsize as N grows larger for rectangular meshes. This phenomenon is not observ ed with permuting rectangular meshes whic h generally hav e the same bandsize as Haar random ma- trices (indep enden t of N ) as sho wn in in Figure 5 and App endix D. This correlates with p erm uting rectangular meshes having faster optimization and less dep endence on initialization. Instead of initializing the mesh using uniform ran- dom phases, we use Haar initialization as in Equation 9 to av oid starting with a banded unitary conﬁguration. This initialization, which we recommend for any pho- tonic mesh-based neural net w ork application, dramati- cally impro ves conv ergence b ecause it primes the opti- mization with the right av erage b eha vior for each com- p onen t. W e ﬁnd in our simulations that as long as the initialization is calibrated to wards higher transmissivity 7 ( θ n` near 0), larger mesh net w orks can also ha v e reason- able con vergence times similar to when the phases are Haar-initialized. The prop er initialization of permuting rectangular meshes is less clear b ecause the tolerances and av erage b eha vior of eac h comp onen t ha v e not yet been mo deled. Our prop osal is to initialize eac h tunable blo c k M k as an indep enden t mesh using the same deﬁnition for α n` , except replacing N with the num b er of lay ers in M k , d N /K e . This is what we use as the Haar initialization equiv alent in the p erm uting rectangular mesh case, al- though it is p ossible there may b e b etter initialization strategies for the nonlo cal mesh structure. B. Optimization problem and synthetic data After initializing the photonic mesh, we pro ceed to op- timize the mean square error cost function for an un- kno wn Haar random unitary U : minimize θ n` ,φ n` ,γ n 1 2 N    ˆ U ( θ n` , φ n` , γ n ) − U    2 F , (12) where the estimated unitary matrix function ˆ U maps N 2 phase shift parameters θ n` , φ n` , γ n to U( N ) via Equa- tions 4 or 11, and k · k F denotes the F rob enius norm. Since trigonometric functions parameterizing ˆ U are non- con vex, we know that Equation 12 is a non-con v ex prob- lem. The non-conv exit y of Equation 12 suggests learning a single unitary transformation in a deep neural net work migh t hav e signiﬁcan t dep endence on initialization. T o train the netw ork, we generate random unit-norm complex input vectors of size N and generate corre- sp onding lab els b y multiplying them by the target ma- trix U . W e use a training batc h size of 2 N . The syn thetic training data of unit-norm complex vectors is therefore represen ted by X ∈ C N × 2 N . The minibatc h training cost function is similar to the test cost function, L train = k ˆ U X − U X k 2 F . The test set is the iden tity ma- trix I of size N × N . The test cost function, in accordance with the training cost function deﬁnition, thus matches Equation 12. C. T raining algorithm W e sim ulate the global optimization of a unitary mesh using automatic diﬀerentiation in tensorflow , whic h can b e ph ysically realized using the in situ backpropagation pro cedure in Ref. [7]. This optical bac kpropagation pro- cedure physically measures ∂ L train /∂ θ n` using interfero- metric techniques, whic h can b e extended to any of the arc hitectures we discuss in this paper. The on-chip backpropagation approach is also likely faster for gradient computation than other training ap- proac hes suc h as the ﬁnite diﬀerence metho d mentioned in past on-chip training prop osals [6]. W e ﬁnd empir- ically that the Adam up date rule (a p opular ﬁrst-order adaptiv e up date rule [27]) outperforms standard stochas- tic gradien t descent for the training of unitary netw orks. If gradient measurements for the phase shifts are stored during training, adaptive update rules can b e applied us- ing successive gradient measurements for eac h tunable comp onen t in the mesh. Such a procedure requires min- imal computation (i.e., lo cally storing the previous gra- dien t step) and can act as a ph ysical test of the simula- tions we will no w discuss. F urthermore, w e av oid quasi- Newton optimization metho ds such as L-BF GS used in Ref. [24] that cannot be implemented physically as straigh tforwardly as ﬁrst-order methods. The models w ere trained using our op en source sim ula- tion framework neurophox 7 using a more general v ersion of the v ertical lay er deﬁnition proposed in Refs. [7, 23]. The models were programmed in tensorflow [28] and run on an NVIDIA GeF orce GTX1080 GPU to improv e optimization p erformance. D. Results W e now compare training results for rectangular, re- dundan t rectangular, and permuting rectangular meshes giv en N = 128. In our comparison of p erm uting rectan- gular meshes and rectangular meshes, w e analyze p erfor- mance when beamsplitter errors are distributed through- out the mesh as either  = 0 or  ∼ N (0 , 0 . 01) and when the θ n` are randomly or Haar-initialized (according to the PDF in Equation 6). W e also analyze optimization p erforamnces of redundant rectangular meshes where w e v ary the n um b er of v ertical MZI lay ers. F rom our results, w e rep ort ﬁv e k ey ﬁndings: 1. Optimization of N = 128 rectangular meshes re- sults in signiﬁcant oﬀ-diagonal errors due to bias to wards the banded matrix space of U R (128), as sho wn in Figure 6. 2. Rectangular meshes conv erge faster when Haar- initialized than when uniformly random initialized, as in Figure 6, in whic h case the estimated matrix con verges to w ards a banded conﬁguration shown in App endix H. 3. Perm uting rectangular meshes conv erge faster than rectangular meshes despite having the same n um- b er of total parameters as sho wn in Figure 6. 4. Redundant rectangular meshes, due to increase in the num b er of parameters, hav e up to ﬁve orders of magnitude b etter conv ergence when the n umber of vertical la yers are doubled compared to rectan- gular and p erm uting rectangular meshes, as sho wn in Figure 7. 7 See https://gith ub.com/solgaardlab/neuropho x. 8 5. Beamsplitter imp erfections sligh tly reduce the o v er- all optimization p erformance of permuting and re- dundan t rectangular meshes, but reduce the p erfor- mance of the rectangular mesh signiﬁcantly . (See Figure 6 and App endix E.) 0 2500 5000 7500 10000 12500 15000 17500 20000 Iteration 10 − 2 10 − 1 10 0 Mean square error eost: 1 2 N    ˆ U − U    2 F ( a ) RM PRM RM ( σ  = 0 . 1 ) PRM ( σ  = 0 . 1 ) RM (uniform θ n` init) PRM (uniform θ n` init) ( b ) Error magnitude : | ˆ U R − U | ( c ) Error magnitude : | ˆ U PR − U | Layer ( ` ) Input ( n ) ( d ) θ n` / 2 Layer ( ` ) Input ( n ) ( e ) θ n` / 2 0 π / 4 π / 2 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 0 π / 4 π / 2 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 FIG. 6. W e implemen t six diﬀerent optimizations for N = 128 where w e v ary the choice of p erm uting rectangular mesh (PRM) or rectangular mesh (RM); the initialization (random θ n` or Haar-initialized θ n` ); and photonic transmissivity error displacemen ts (  = 0 or  ∼ N (0 , 0 . 01), where σ 2  = 0 . 01 is the v ariance of the b eamsplitter errors). Conditions: 20000 iterations, Adam update, learning rate of 0 . 0025, batc h size of 256, simulated in tensorflow . (a) Comparison of optimiza- tion p erformance (defaults are Haar initialization and  n` = 0 unless otherwise indicated). Optimized error magnitude spa- tial map for (b) rectangular mesh shows higher oﬀ-diagonal errors and than (c) permuting rectangular. The optimized θ n` phase shifts (see Appendix G) for (d) rectangular meshes are close to zero (cross state) near the center as opp osed to (e) p erm uting rectangular meshes which hav e a strip ed pattern (lik ely due to initialization). NOTE: by |·| , we refer to the elemen twise norm. The singular v alue decomp osition (SVD) architecture 0 2500 5000 7500 10000 12500 15000 17500 20000 Iteration 10 − 7 10 − 6 10 − 5 10 − 4 10 − 3 10 − 2 10 − 1 10 0 Mean square error cost: 1 2 N    ˆ U − U    2 F RM PRM RRM : δ N = 16 RRM : δ N = 32 RRM : δ N = 64 RRM : δ N = 128 FIG. 7. A comparison of test error in tensorflow for N = 128 b et w een rectangular (RM), permuting rectangular (PRM), and redundant rectangular (RRM) meshes for: 20000 itera- tions, Adam up date, learning rate of 0 . 0025, batc h size of 256. Ideal = Haar random initialized θ n` with  = 0. δ N is the additional lay ers added in the redundant mesh. W e stopp ed the δ N = 128 run within 4000 iterations when it reached con- v ergence within machine precision. Redundant meshes with 32 additional lay ers con verge b etter than p erm uting rectangu- lar meshes, and with just 16 additional lay ers, we get almost iden tical performance. discussed in Refs. [6, 22] consists of optical lossy comp o- nen ts ﬂanked on b oth sides b y rectangular meshes and are capable of implementing any linear op eration with rea- sonable device input pow er. Note that with some mo diﬁ- cations (e.g. treating loss and gain elements lik e nonlin- earities in the pro cedure of Ref. [7]), SVD arc hitectures can also b e trained physically using in situ backpropa- gation. W e simulate the gradient-based optimization of SVD arc hitectures using automatic diﬀerentiation in Ap- p endix F. VI. DISCUSSION A. Haar initialization F or global optimization and robustness of universal photonic meshes, it is imp ortan t to consider the required biases and sensitivities for each mesh comp onen t. Im- plemen ting an y Haar random matrix requires that eac h comp onen t indep enden tly follo ws an av erage reﬂectivit y within some tolerance. This requirement b ecomes more restrictiv e with the n um b er of input and output ports accessible b y eac h mesh component. F or the rectangular mesh, this means the center mesh comp onen ts are close to cross state and the most sensitive. In a Haar-initialized mesh, as shown in Figure 2, the ligh t injected into a single input p ort spreads out to all w av eguides in the device uniformly regardless of N . This is a preferable initialization for global optimization be- cause Haar random matrices require this b eha vior. In 9 con trast, when randomly initializing phases, the ligh t only spreads out o ver a limited band of outputs. This band gets relatively small compared to the mesh gets larger as shown in Figure 9. The av erage reﬂectivities given by Haar initialization ma y b e useful for inv erse design approaches [29] for compact tunable or passive m ultiport interferometers. The comp onen t tolerances ma y inform ho w robust phase shifters need to b e given error sources such as thermal crosstalk [6]. The thermal crosstalk might make it diﬃ- cult to achiev e required tolerances for devices in terfering up to N = 1000 mo des that generally hav e phase shift tolerances just ab o v e 10 − 2 radians. 8 In our simulations in Section V, w e assume that the con trol parameter for photonic meshes is linearly related to the phase shift. How ev er, in man y current phase shifter implementations, such as thermal phase shifters [6], the phase is a nonlinear function of the control pa- rameter (i.e., the voltage) and has minim um and max- im um v alues, unlik e the unbounded phase used in our optimization. In addition, like the Haar phase in our theory , the voltage acts as the CDF for transmissivities in the physical device, up to a normalization factor. P ar- ticular atten tion needs to be given to phase uncertaint y as a function of v oltage, since the Haar random distribu- tion of internal MZI phases has small v ariance for large N , as we show in Figure 3(c). As mentioned in Section I II, the ideal transmissivit y-voltage dep endence with this consideration w ould b e iden tical to the transmissivit y vs Haar phase dep endence in Figure 3(d). B. Applications of mesh optimization Meshes can be tuned using either self-conﬁguration [11, 22] or global optimizations (gradien t-based [7] or deriv ative-free [30]). The algorithmic optimizations pro- p osed in Refs. [11, 22] assume that each component in the mesh can co ver the entire split ratio range, whic h is not the case in presence of 50:50 b eamsplitter errors. This ultimately leads to low er ﬁdelit y in the implemented uni- tary operation, which can b e a voided using a double-MZI arc hitecture [21, 31] or a vertical lay er-wise progressive algorithm [25]. W e explore a third alternative to ov er- come photonic errors; gradient-based global optimization is model-free and, unlik e algorithmic approac hes, can ef- ﬁcien tly tune photonic neural netw orks [7]. This model- free prop ert y makes gradien t-based optimization robust to fabrication error; we sho w in Figure 6(a) that meshes with split ratio error v ariances of up to σ  = 0 . 1 can b e optimized nearly as well as a p erfect mesh, particularly for p erm uting rectangular meshes. 8 Ref. [6] propose a standard deviation of ∼ 10 − 3 might b e p ossi- ble with further circuit c haracterization, whic h might be scalable based on Figure 3(c). In the regime of globally optimized meshes, w e pro- p ose tw o strategies to mo dify the rectangular arc hitec- ture: adding wa v eguide p erm utation la yers and adding extra tunable vertical MZI lay ers. Both approaches re- lax the cross state requiremen ts on the MZIs and acceler- ate the mesh optimization pro cess. Nonlo cal in terference w orks by allo wing inputs that are far aw a y physically in the mesh to interact. These approaches are inspired by sev eral recent prop osals in machine learning and coher- en t photonics to design more error toleran t and eﬃcient meshes, man y of which use single lay ers of MZIs and non- lo cal w a veguide in teractions [23, 26, 32, 33]; such designs can also b e considered to be in the same class of permut- ing arc hitectures as our proposed p erm uting rectangular mesh. Adding extra tunable vertical la yers, as prop osed in Ref. [24], simply adds more tunable paths for the ligh t to ac hieve a desired output. As sho wn in Figure 6, w e ac hieve up to ﬁve orders of magnitude impro vemen t in conv ergence at the exp ense of doubling the mesh size and parameter space. Lik e p erm uting rectangular meshes, multi-plane ligh t con version successfully applies the non-lo cal interference idea for eﬃcient spatial mo de multiplexing [34, 35]. In this proto col, alternating lay ers of transv erse phase pro- ﬁles and optical F ourier transforms (analogous to what our rectangular p erm utations accomplish) are applied to reshap e input mo des of light [34, 35]. A similar con- cept is used in unitary spatial mo de manipulation, where sto c hastic optimization of deformable mirror settings al- lo w for eﬃcien t mo de conv ersion [36]. Th us, the idea of eﬃcien t unitary learning via a F ourier-inspired p erm ut- ing approac h has precedent in con texts outside of pho- tonic MZI meshes. An on-chip optimization for multi-plane ligh t conv er- sion has b een accomplished exp erimen tally in the past using sim ulated annealing [30]. The success of simulated annealing in exp erimen tally training small unitary pho- tonic devices [30] (rather than gradien t descen t as is used in this w ork) suggests there are other algorithms aside from gradien t descen t that ma y eﬀectiv ely enable on-chip training. W e propose that similar simulated annealing ap- proac hes might b e made more eﬃcient by sampling Haar phases from uniform distributions and ﬂashing up dates on to the device. Similar deriv ative-free optimizations ma y also b e useful for quantum machine learning [37– 39]. Whether such approaches can comp ete with back- propagation for classical applications remains to b e in- v estigated. F or exp erimen tal on-chip tuning, simulated annealing has the attractive prop ert y of only requiring output detectors. F or practical mac hine learning ap- plications, how ev er, there is curren tly more literature for bac kpropagation-based optimization. F urthermore, gradien t-based approaches allow for con tinuous control of phase shifters during the optimization. Our tensorflow simulations may be useful in the de- sign of optical recurren t neural net works (RNNs) that use unitary op erators parameterized by photonic meshes. 10 Suc h “unitary RNNs” (URNNs) hav e already b een sim u- lated on con v entional computers and sho w some promise in synthetic long-term memory tasks [23, 40]. Unitary RNNs are physically implementable using a single mesh with optical nonlinearities and recurrent opto electronic feedbac k, suggesting that the arc hitecture discussed in this work is a scalable, energy-eﬃcient option for ma- c hine learning applications. It is p ossible that some tun- able features such as the “bandedness” of unitaries im- plemen ted by rectangular MZI meshes can be useful (e.g. as an attention mec hanism in sequence data) for certain deep learning tasks that use URNNs. VI I. CONCLUSION The scalabilit y of gradien t-based optimization of Haar random unitary matrices on univ ersal photonic meshes is limited b y small reﬂectivities and MZI phase shifter sen- sitivities arising from the constraint of lo cally interact- ing comp onen ts. As shown in Section I I I, the required a verage reﬂectivity and sensitivit y for each MZI is in- v ersely related to the total num ber of inputs and out- puts aﬀected by the MZI. If the tolerance requirements are not met b y the physical comp onen ts, optimization algorithms will hav e diﬃcult y conv erging to a target uni- tary op erator. As sho wn in Section V for the case of N = 128, conv ergence via in situ backpropagation is generally not ac hiev ed if phase shifters are initialized ran- domly . How ev er, Haar initialization can suﬃciently bias the optimization for conv ergence to a desired random uni- tary matrix, ev en in the presence of signiﬁcan t simulated b eamsplitter fabrication errors. In Section IV, we prop ose adding extra tunable beam- splitters or mesh nonlo calities to accelerate mesh opti- mization. Naive (uniform random) initialization on a standard photonic mesh has diﬃculty learning random unitary matrices via gradien t descent. By introducing non-lo calities in the mesh, we can improv e optimization p erformance without the need for extra parameters. A Haar-initialized redundant architecture can achiev e ﬁve orders of magnitude less mean square error for a Haar random unitary matrix and decrease optimization time to suc h a matrix by at least tw o orders of magnitude, as shown in Figure 7. Our ﬁndings suggest that archi- tecture choice and initialization of photonic mesh com- p onen ts ma y prov e imp ortan t for increasing the scala- bilit y and stabilit y of reconﬁgurable univ ersal photonic devices and their man y classical and quantum applica- tions [3, 5, 6, 12, 22, 37 – 39, 41]. A CKNOWLEDGEMENTS This work was funded by the Air F orce Oﬃce of Sci- en tiﬁc Research, sp eciﬁcally for the Center for Energy- Eﬃcien t 3D Neuromorphic Nano computing (CEE3N 2 ), Gran t No. F A9550-181-1-0186. W e would lik e to thank T yler Hughes, Momchil Mink ov, Nate Ab ebe, Dylan Blac k, and Ian Williamson for illuminating discussions. [1] Markus Gr¨ afe, Ren Heilmann, Maxime Lebugle, Diego Guzman-Silv a, Armando P erez-Leija, and Alexander Szameit, “In tegrated photonic quan tum w alks,” Journal of Optics (2016), 10.1088/2040-8978/18/10/103002. [2] Jacques Carolan, Christopher Harrold, Chris Sparro w, Enrique Mart ´ ın-L´ opez, Nic holas J. Russell, Joshua W. Silv erstone, P eter J. Shadb olt, Nobuyuki Matsuda, Manabu Oguma, Mikitak a Itoh, Graham D. Marshall, Mark G. Thompson, Jonathan C.F. Matthews, T oshik azu Hashimoto, Jerem y L. O’Brien, and Anthon y Laing, “Univ ersal linear optics,” Science (2015), 10.1126/sci- ence.aab3642. [3] Justin B. Spring, Benjamin J. Metcalf, Peter C. Humphreys, W. Steven Kolthammer, Xian Min Jin, Marco Barbieri, Animesh Datta, Nic holas Thomas-P eter, Nathan K. Langford, Dmytro Kundys, James C. Gates, Brian J. Smith, P eter G.R. Smith, and Ian A. W almsley , “Boson sampling on a photonic chip,” Science (2013), 10.1126/science.1231692. [4] Nicholas C. Harris, Gregory R. Steinbrec her, Mihik a Prabh u, Y oav Lahini, Jacob Mow er, Darius Bunandar, Changc hen Chen, F ranco N.C. W ong, T om Baehr-Jones, Mic hael Ho c h b erg, Seth Lloyd, and Dirk Englund, “Quan tum transport sim ulations in a programmable nanophotonic pro cessor,” Nature Photonics 11 , 447–452 (2017). [5] Andrea Annoni, Emanuele Guglielmi, Marco Carminati, Giorgio F errari, Marco Sampietro, Da vid Ab Miller, An- drea Melloni, and F rancesco Morichetti, “Unscram- bling ligh t - Automatically undoing strong mixing b e- t ween modes,” Light: Science and Applications 6 (2017), 10.1038/lsa.2017.110. [6] Yichen Shen, Nic holas C. Harris, Scott Skirlo, Mihik a Prabh u, T om Baehr-Jones, Michael Hoch berg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Solja ˇ ci ´ c, “Deep learning with coheren t nanophotonic cir- cuits,” Nature Photonics 11 , 441–446 (2017). [7] Tyler W. Hughes, Momchil Mink o v, Y u Shi, and Shan- h ui F an, “T raining of photonic neural netw orks through in situ bac kpropagation and gradien t measuremen t,” Op- tica 5 , 864 (2018). [8] Ian A. D. Williamson, T yler W. Hughes, Momchil Mink ov, Ben Bartlett, Sunil P ai, and Shanhui F an, “Reprogrammable Electro-Optic Nonlinear Activ ation F unctions for Optical Neural Netw orks,” arXiv preprint (2019). [9] David A. B. Miller, “Establishing Optimal W a ve Com- m unication Channels Automatically ,” Journal of Ligh t- w av e T echnology , V ol. 31, Issue 24, pp. 3987-3994 31 , 3987–3994 (2013). [10] William R. Clemen ts, Peter C. Humphreys, Benjamin J. Metcalf, W. Stev en Kolthammer, and Ian A. W almsley , “An Optimal Design for Universal Multiport Interferom- 11 eters,” Optica , 1–8 (2016). [11] Michael Reck, Anton Zeilinger, Herb ert J. Bernstein, and Philip Bertani, “Exp erimen tal realization of any dis- crete unitary op erator,” Physical Review Letters 73 , 58– 61 (1994). [12] David A. B. Miller, “Self-aligning universal b eam cou- pler,” Optics Express 21 , 6360 (2013). [13] Daniel P erez, Iv ana Gasulla, Jose Capman y , and Ric hard A. Soref, “Hexagonal wa veguide mesh design for univ ersal multiport interferometers,” in 2016 IEEE Pho- tonics Confer enc e, IPC 2016 (2017) pp. 285–286. [14] Daniel P´ erez, Iv ana Gasulla, Lee Crudgington, Da vid J. Thomson, Ali Z. Khokhar, Ke Li, W ei Cao, Goran Z. Mashano vich, and Jos Capmany , “Silicon RF-Photonics Pro cessor Reconﬁgurable Core,” in Eur op e an Confer enc e on Optic al Communic ation, ECOC (2018). [15] Daniel P´ erez, Iv ana Gasulla, Lee Crudgington, Da vid J. Thomson, Ali Z. Khokhar, Ke Li, W ei Cao, Goran Z. Mashano vich, and Jos Capmany , “Multipurp ose silicon photonics signal pro cessor core,” Nature Comm unica- tions (2017), 10.1038/s41467-017-00714-1. [16] Nicholas J. Russell, Lev on Chakhmakhc hy an, Jerem y L. O’Brien, and Anthon y Laing, “Direct dialling of Haar random unitary matrices,” New Journal of Ph ysics (2017), 10.1088/1367-2630/aa60ed. [17] Adolf Hurwitz, “ ¨ ub er die Erzeugung der In v arian ten durc h Integration,” Nac hrich ten von der Gesellschaft der Wissensc haften zu G¨ ottingen, Mathematisch- Ph ysik alisc he Klasse , 71–72 (1897). [18] K. Zyczko wski and M. Kus, “Random unitary matri- ces,” Journal of Ph ysics A: General Ph ysics (1994), 10.1088/0305-4470/27/12/028. [19] Persi Diaconis and Peter J. F orrester, “Hurwitz and the origins of random matrix theory in mathematics,” Ran- dom Matrices: Theory and Applications 06 , 1730001 (2017). [20] Christoph Spengler, Marcus Huber, and Beatrix C. Hies- ma yr, “Comp osite parameterization and Haar measure for all unitary and sp ecial unitary groups,” Journal of Mathematical Physics (2012), 10.1063/1.3672064. [21] David A. B. Miller, “Perfect optics with imp erfect com- p onen ts,” Optica 2 , 747 (2015). [22] David A. B. Miller, “Self-conﬁguring universal linear optical comp onen t [Invited],” Photonics Research 1 , 1 (2013). [23] Li Jing, Yichen Shen, T ena Dub cek, John P eurifo y , Scott Skirlo, Y ann LeCun, Max T egmark, and Marin Soljaˇ ci´ c, “T unable Eﬃcient Unitary Neural Netw orks (EUNN) and their application to RNNs,” in Pr o c e e dings of Ma- chine L e arning R ese ar ch (2017) pp. 1733–1741. [24] Ro el Burgwal, William R. Clemen ts, Devin H. Smith, James C. Gates, W. Steven Kolthammer, Jelmer J. Ren- ema, and Ian A. W almsley , “Using an imp erfect photonic net work to implemen t random unitaries,” Optics Express 25 , 28236 (2017). [25] David A. B. Miller, “Setting up meshes of interferometers rev ersed lo cal ligh t in terference metho d,” Optics Express 25 , 29233 (2017). [26] F ulvio Flamini, Nicol Spagnolo, Niko Viggianiello, An- drea Crespi, Rob erto Osellame, and F abio Sciarrino, “Benc hmarking in tegrated linear-optical architectures for quan tum information pro cessing,” Scientiﬁc Rep orts 7 , 15133 (2017). [27] Diederik P Kingma and Jimmy Lei Ba, “Adam: A Metho d for Stochastic Optimization,” International Con- ference on Learning Represen tations (2015). [28] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeﬀrey Dean, Matthieu Devin, Sanja y Ghema wat, Geoﬀrey Irving, Mic hael Isard, Manjunath Kudlur, Josh Leven berg, Ra jat Monga, Sherry Mo ore, Derek G. Murray , Benoit Steiner, Paul T uck er, Vijay V a- sudev an, P ete W arden, Martin Wick e, Y uan Y u, and Xi- ao qiang Zheng, “T ensorFlow: A System for Large-Scale Mac hine Learning,” in Op er ating Systems Design and Im- plementation (Sav annah, GA, 2016) pp. 265–283. [29] Alexander Y. Piggott, Jan P et ykiewicz, Logan Su, and Jelena V uˇ cko vi ´ c, “F abrication-constrained nanopho- tonic inv erse design,” Scientiﬁc Rep orts (2017), 10.1038/s41598-017-01939-2. [30] Rui T ang, T akuo T anemura, Samir Ghosh, Kei- jiro Suzuki, Ken T anizaw a, Kazuhiro Ikeda, Hitoshi Ka washima, and Y oshiaki Nak ano, “Reconﬁgurable all- optical on-c hip MIMO three-mode demultiplexing based on m ulti-plane light con v ersion,” Optics Letters 43 , 1798 (2018). [31] C. M. Wilk es, X. Qiang, J. W ang, R. Santagati, S. P ae- sani, X. Zhou, D. A. B. Miller, G. D. Marshall, M. G. Thompson, and J. L. OBrien, “60dB high-extinction auto-conﬁgured MachZehnder in terferometer,” Optics Letters 41 , 5318 (2016). [32] Alan Genz, “Metho ds for Generating Random Orthog- onal Matrices,” Mon te Carlo and Quasi-Monte Carlo Metho ds , 199–213 (1998). [33] Michael Mathieu and Y ann LeCun, “F ast Approxima- tion of Rotations and Hessians matrices,” arXiv preprint (2014). [34] Guillaume Labroille, Bertrand Denolle, Pu Jian, Jean F ranois Morizur, Philipp e Genev aux, and Nicolas T reps, “Eﬃcient and mo de selective spatial mo de mul- tiplexer based on m ulti-plane light conv ersion,” in 2014 IEEE Photonics Confer enc e, IPC 2014 (2014). [35] Guillaume Labroille, Pu Jian, Nicolas Barr´ e, Bertrand Denolle, and Jean-F ranois Morizur, “Mo de Selective 10- Mo de Multiplexer based on Multi-Plane Light Conv er- sion,” Optical Fiber Communication Conference (2016), 10.1364/OF C.2016.Th3E.5. [36] Jean-F ranois Morizur, Lachlan Nicholls, Pu Jian, Seiji Armstrong, Nicolas T reps, Boris Hage, Magn us Hsu, W arwic k Bow en, Jiri Janousek, and Hans-A. Bac hor, “Programmable unitary spatial mode manipulation,” Journal of the Optical Society of America A 27 , 2524 (2010). [37] Maria Sch uld and Nathan Killoran, “Quan tum Mac hine Learning in F eature Hilb ert Spaces,” Physical Review Letters 122 , 040504 (2019). [38] Maria Sch uld, Alex Bo c haro v, Krysta Svore, and Nathan Wiebe, “Circuit-centric quan tum classiﬁers,” arXiv preprin t (2018). [39] Nathan Killoran, Thomas R. Bromley , Juan Miguel Ar- razola, Maria Sch uld, Nicols Quesada, and Seth Lloyd, “Con tinuous-v ariable quantum neural netw orks,” arXiv preprin t (2018), 10.1016/j.p epi.2014.06.011. [40] Rumen Dangovski, Li Jing, and Marin Soljacic, “Rota- tional Unit of Memory ,” in International Conferenc e on L e arning R epr esentations (2017). [41] Juan Miguel Arrazola, Thomas R Bromley , Josh Izaac, Casey R Myers, Kamil Br´ adler, and Nathan Killoran, 12 “Mac hine learning method for state preparation and gate syn thesis on photonic quan tum computers,” Quantum Science and T echnology 4 , 024004 (2019). App endix A: Soft w are T o repro duce the results of this pap er, the reader can b e directed to neurophox , an op en-source Python pac k age that implements the optimizations and simula- tions of this pap er in numpy and tensorflow . The ex- act co de used to generate the results is provided in the neurophox-notebooks rep ository . App endix B: Deriv ation of beamsplitter errors Unitary matrices generated by lossless MZIs are prone to errors in b eamsplitter fabrication. W e introduce the error  to our expression deriv ed in Equation 1, whic h is t wice the displacement in b eamsplitter split ratio from 50 : 50. Beamsplitter gates with error  are deﬁned as B  =  ρ iτ iτ ρ  where ρ = q 1+  2 , τ = q 1 −  2 are trans- missivit y and reﬂectivity amplitudes that result in slight v ariations from a 50 : 50 beamsplitter. W e use this error deﬁnition since it is a measurable quantit y in the chip; in fact, there are strategies to minimize  directly [21]. The unitary matrix that w e implement in presence of b eam- splitter errors b ecomes U  := R φ B  2 R θ B  1 t  := | U , 12 | 2 = | U , 21 | 2 r  := | U , 11 | 2 = | U , 22 | 2 . (B1) If  1 =  2 =  , whic h is a reasonable practical assumption for nearby fabricated structures, then solving for t  in terms of t : t  = 4 | ρ | 2 | τ | 2 t = 4 t  1 2 +  2   1 2 −  2  = t (1 −  2 ) . (B2) Similarly , w e can solve for r  : r  = 1 − t  = r + t ·  2 . (B3) As we hav e discussed in this pap er (and as we later sho w in Figure 12), photonic errors  (standard devia- tion of 0.1) can aﬀect the optimized phase shifts for uni- tary matrices. The ab o ve constraints on r  and t  sug- gest that limited transmissivity is likely in the presence of fabrication errors, which can inhibit progressive setup of unitary meshes [21, 24]. How ev er, we will later show through tensorflow simulation that in situ bac kprop- agation up dates can to some exten t address this issue using a more sophisticated experimental proto col in volv- ing phase conjugation and in terferometric measurements [7]. App endix C: Haar measure ( a ) 7 6 6 5 5 5 4 4 4 4 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 1 1 1 α n` 1 2 3 4 5 6 7 8 U 1 U 2 U 3 U 4 U 5 U 6 U 7 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 ( b ) t 1 x 1 1 − t 1 t 2 x 2 1 − t 2 t 3 x 3 1 − t 3 t 4 x 4 1 − t 4 t 5 x 5 1 − t 5 t 6 x 6 1 − t 6 t 7 x 7 1 − t 7 1 2 3 4 5 6 7 8 8 7 6 5 4 3 2 1 FIG. 8. T riangular mesh for N = 8 using (a) 2 N − 3 v ertical la yers ` sho wing the sensitivit y index α n` and (b) N diagonal la yers m showing the transmissivity basis ( t n in red) and the measuremen t basis ( x n in purple). In this section, we outline a pro of for the Haar measure of a unitary matrix in terms of the physical parameters of a photonic mesh to supplemen t our discussion of Haar phase and the proof in Ref. [16]. The Haar measure for U( N ) can b e deﬁned in t wo ph ysical basis represen ta- tions: the measurement basis represents measurements after eac h MZI and the transmissivity basis represe n ts the transmissivit y of each MZI. T o make our explanation simpler, we will adopt the orthogonalization proto col used by Ref. [11]. In this represen tation, we deﬁne the triangular mesh U T as U T = N − 1 Y m =0 U ( N − m ) U ( m ) = m − 1 Y n =1 U N − n ( θ ( m ) N − n , φ ( m ) N − n ) · D m ( γ N − m +1 ) , (C1) where D m is a diagonal matrix representing a single mo de phase shift at index N − m + 1. The N op erators U ( m ) represen t the diagonal lay ers of the triangular mesh and their role is to pro ject inputs from Hilb ert space dimension from m to m − 1 recur- siv ely until we reac h a single mode phase shift in U (1) = D 1 ( γ N ). Our pro of mov es the same direction as Reck’s orthogonalization pro cedure; starting from m = N , w e progressiv ely ﬁnd the for eac h U ( m ) in decreasing order. 13 F or each la y er m , there are 2 m − 1 complex h yp erspher- ical co ordinates ( m − 1 “amplitude” co ordinates and m “phase” co ordinates). The ﬁrst column v ector of U can b e recov ered by shining ligh t (using a unit pow er P = 1) through the top port of the lay er (given b y n = N − m + 1) and measuring the output ﬁelds in the triangular mesh generated by U ( m ) , as shown in Figure 8(b). As men- tioned in Refs. [11, 22], progressive optimization mov es in the opp osite direction; the desired output ﬁelds are shined bac k in to the device and the transmissivities t ( m ) n and phases φ ( m ) n for each la yer m (moving from N to 1) can b e progressively tuned un til all the p o wer lies in the top input p ort for that la y er. The measurement basis is an unbiased Haar measure (as shown in Ref. [16] using Gaussian random v ectors) and can be physically represented b y the pow er x n mea- sured at wa v eguides n ≤ m − 1 due to shining ligh t through the top input p ort for that la y er. Unlike the pro of in Ref. [16], w e c ho ose our constrain t that the in- put p o wer P = 1 rather than P ∈ R + , which introduces a normalization prefactor in our Haar measure by inte- gration ov er all p ossible P . 9 This allows us to ignore the p o w er in the ﬁnal output port x N b ecause energy conser- v ation ensures w e hav e the constraint x N = 1 − P N − 1 n =1 x n . Therefore, our simpliﬁed Cartesian basis for eac h m is (ignoring the normalization prefactor): d µ ( U ( m ) ) ∝ d γ N − m m − 1 Y n =1 d x n m Y n =1 d φ n . (C2) No w we represen t the Cartesian p o w er quantities x n explicitly in terms of the comp onen t transmissivities, whic h we hav e deﬁned already to b e t n := cos 2 ( θ n / 2). Using the same conv en tion as h yperspherical coordinates, w e get the follo wing recursive relation for x n as sho wn di- agrammatically by following the path of light from the top input p ort in Figure 8(b): x n = (1 − t n ) n − 1 Y k =1 t k . (C3) In tuitively , Equation C3 implies that the pow er x n measured at p ort n is given by light that is transmitted b y the ﬁrst n − 1 comp onen ts along the path of ligh t and then reﬂected b y the n th comp onen t. In other w ords, x n follo ws a geometric distribution. W e can use Equation C3 to ﬁnd the Jacobian J ∈ R N − 1 × N − 1 relating the x n and the t n . W e ﬁnd that we ha ve a low er triangular matrix J with diagonal elements for n ≤ N − 1 J nn = ∂ x n ∂ t n = − n − 1 Y k =1 t k . (C4) 9 This prefactor is exactly R ∞ 0 dP e − P P m − 1 = ( m − 1)!. W e kno w J is low er triangular since for all n 0 > n , J nn 0 = ∂ x n ∂ t n 0 = 0 from Equation C3. Since the determinant of a low er triangular matrix is the same as the pro duct of the diagonal, we can directly ev aluate the un biased measure (oﬀ b y a normalization constan t) as d µ ( U ( m ) ) ∝ d γ N − m +1 det J m − 1 Y n =1 d t n m Y n =1 d φ n = d γ N − m +1 m − 1 Y n =1 J nn m − 1 Y n =1 d t n m Y n =1 d φ n ∝ d γ N − m +1 m − 1 Y n =2 t m − n n − 1 m − 1 Y n =1 d t n m Y n =1 d φ n (C5) T o get the total Haar measure, w e m ultiply the volume elemen ts for the orthogonal comp onen ts d µ ( U ( m ) ). W e get from this pro cedure that the s ensitivit y index α n` = N − n for a triangular mesh in Equation C5 (indep enden t of ` ), which can b e seen using Figure 8. W e can express this Haar measure in terms of Q α n` ( t n` ), the probability distribution for the transmissivity , and P α n` ( θ n` / 2), the probabilit y distribution for the phase shift corresponding to that same transmissivity , assuming appropriate choice n, ` for the triangular mesh: d µ ( U ) = N Y n =1 d µ ( U ( n ) ) = Y n d γ n Y n,` Q α n` ( t n` ) dt n` d φ n` = Y n d γ n Y n,` P α n`  θ n` 2  d θ n` d φ n` (C6) W e can no w normalize Equation C5 using the normal- ization factor for P to get Q α n` ( t n` ) and then substi- tute t n` = cos 2 ( θ n` / 2) to get our desired expression for P α n` ( θ n` / 2): Q α n` ( t n` ) = α n` t α n` − 1 n` P α n`  θ n` 2  = α n` sin  θ n` 2   cos  θ n` 2  2 α n` − 1 . (C7) Finally , w e can recov er the Haar phase parameter ξ n` ∈ [0 , 1] (i.e. the cumulativ e density function) in terms of either t n` or θ n` : ξ n` =  cos  θ n` 2  2 α n` = t α n` n` . (C8) Finally , as explained in Ref. [16], we can use the Clemen ts decomp osition [10] to ﬁnd another lab elling for α n` in a rectangular mesh that gives probability distri- butions and Haar phases in the same form as Equations C7 and C8 resp ectiv ely . 14 App endix D: Unitary bandsizes W e would lik e to quan tify the bandedness of matri- ces implemented b y the meshes with randomly initial- ized phases. W e deﬁne the η -bandsize as the minimum n umber of matrix elements whose absolute v alue squared sums to (1 − η ) N . Note that our η -bandsize measurement is agnostic of the ordering of the inputs and outputs, and is therefore agnostic to any p erm utations that may b e applied at the end of the decomp osition. In photonics terms, if η = 0 . 001, let r i measure the fraction of output w av eguides o v er whic h 99 . 9% of the p o w er is distributed when ligh t is input in to w a veguide i . The η -bandsize is r i a veraged ov er all i . Sampling from our matrix distribu- tions, we observ e the relationship b et w een the bandsize (giv en η = 0 . 001) and the dimension N in Figure 9. 0 200 400 600 800 1000 N 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 Matrix bandsize ( η = 0 . 001 ) RM RRM PRM Haar random FIG. 9. Giv en η = 0 . 001, we compare bandsizes for rectangu- lar ( U ∼ U R ( N , N )), p erm uting rectangular ( U ∼ U PR ( N )), and redundan t meshes ( U ∼ U R ( N , 2 N )). P ermuting rectan- gular meshes match the bandsize of Haar random matrices. App endix E: In tro ducing photonic errors in a redundan t mesh When photonic errors are added to the redundant mesh, sp eciﬁcally the 256-lay er mesh, we observ e a sligh t decrease in optimization p erformance in Figure 10, sim- ilar to what w e observ ed for the rectangular and p er- m uting rectangular meshes in Figure 7. This decrease in p erformance, how ev er, is less concerning considering that we still achiev e a mean square error of around 10 − 5 , suggesting that RRM might be more robust to photonic errors ev en during on-chip optimization. App endix F: Photonic singular v alue decomp osition sim ulations W e compare the simulated p erformance of suc h rect- angular and p erm uting rectangular architectures in the 0 500 1000 1500 2000 2500 Iteration 10 − 7 10 − 6 10 − 5 10 − 4 10 − 3 10 − 2 10 − 1 10 0 Mean square error cost: 1 2 N    ˆ U − U    2 F RRM ( σ  = 0 ) RRM ( σ  = 0 . 1 ) FIG. 10. A comparison of test mean square error for N = 128 b et w een redundant rectangular meshes with error  for 256- la yer mesh for: 20000 iterations, Adam update, learning rate of 0 . 0025, batch size of 256, sim ulated in tensorflow . 0 2500 5000 7500 10000 12500 15000 17500 20000 Iteration 10 0 10 1 Mean square error cost: 1 2 k A k 2 F    ˆ A − A    2 F SVD -RM SVD -PRM SVD -RM ( σ  = 0 . 1 ) SVD -PRM ( σ  = 0 . 1 ) SVD -RM (uniform θ n` init) SVD -PRM (uniform θ n` init) FIG. 11. A comparison of test mean square error for N = 64 b et w een SVD devices using rectangular (SVD-RM) and p er- m uting rectangular (SVD-PRM) meshes for: 20000 iterations, Adam up date, learning rate of 0 . 005, batch size of 128, sim- ulated in tensorflow . Unless otherwise noted, the default setting is Haar random initialized θ n` with σ  = 0. singular v alue decomp osition (SVD) conﬁguration dis- cussed in Refs. [6, 22]. Suc h architectures would allow one to p erform arbitary linear op erations with a rela- tiv ely small fo otprin t, and may hav e some other useful dimensionalit y-reduction prop erties in machine learning con texts. In SVD, we represent complex matrix ˆ A ∈ C M × C N as ˆ A = ˆ U ˆ Σ ˆ V † , where ˆ Σ is a diagonal matrix implemented on-c hip with min( M , N ) single-mo de gain or atten uating elemen ts and ˆ U , ˆ V † are unitary matrices implemented in a photonic mesh. While ˆ A has 2 M N free parame- ters, any global optimization for a photonic SVD imple- men tation using rectangular meshes can ha v e at most D = N ( N − 1) + M ( M − 1) + 2 min( N , M ) ≥ 2 M N free parameters, with equality when M = N . In the trian- gular architecture discussed in Ref. [22], the total com- 15 plexit y of parameters can b e exactly D = 2 M N when setting a subset of the beamsplitters to bar state. In the case where the total num ber of singular v alues for ˆ A is S < min( M , N ), w e get D = 2 S ( M + N − S ) tunable el- emen ts. Additionally , there is an “eﬀectiv e redundancy” in that some v ectors in U, V are more imp ortan t than others due to the singular v alues. In our sim ulations, we in vestigate an SVD arc hitecture for A = U Σ V † for A ∈ C M × C N comp osed of the uni- taries U ∈ C M × C M and V ∈ C N × C N . Note that suc h an arc hitecture is redundant when M 6 = N , so w e fo cus on the simple case of M = N = 64. W e deﬁne our train and test cost functions analogous to the unitary mean-squared error cost functions as L test = N k ˆ A − A k 2 F 2 k A k 2 F L train = k ˆ AX − AX k 2 F , (F1) where ˆ A = ˆ U ˆ Σ ˆ V † is deﬁned in Section V. W e randomly generate A ∈ C N × C M b y expressing A j k = a + ib , where a, b ∼ N (0 , 1). The syn thetic training batc hes of unit-norm complex vectors are represented by X ∈ C N × 2 N . Assuming a pro cedure similar to [7] can be used in presence of gains and optimization, the p erm uting rect- angular mesh con verges sligh tly faster but is signiﬁcan tly more resilient to uniform random phase initialization compared to the rectangular mesh as sho wn in Figure 11. Both optimizations are minimally aﬀected b y b eamsplit- ter error, unlike what is seen in the unitary optimization case. App endix G: P erio dic parameters W e comment on our rep orted v alues of θ n` in the c heck erboard plots in Figures 3 (of the main text) and 12. Since our simulated optimization do es not hav e the explicit constraint that θ n` ∈ [0 , π ), we rep ort the “ab- solute θ n` ,” where we map all v alues of θ n` / 2 to some v alue in [0 , π / 2]. This corresponds to the transformation (assuming θ n` is originally b et w een 0 and 2 π ): θ n` → ( θ n` θ n` ≤ π 2 π − θ n` θ n` > π . (G1) Note a similar treatment as Equation G1 can b e used to represent the Haar phase ξ ∈ [0 , 1] in terms of a “p e- rio dic” Haar phase e ξ ∈ [0 , 2] with p eriod 2: ξ ( e ξ ) = ( e ξ e ξ ≤ 1 2 − e ξ e ξ > 1 . (G2) Note both e ξ and e θ can therefore b e made to v ary con- tin uously from ( −∞ , ∞ ) with e ξ having a perio d of 2 and e θ ha ving a p eriod of 2 π . W e map these perio dic parame- ters to their half-p eriods according to Equations G1 and G2 based on symmetry arguments. App endix H: T raining sim ulation comparisons In Figure 12, w e compare the performance for our uni- tary net w ork experiment ov er our aforemen tioned condi- tions in Section V. F or each plot, we also hav e an asso- ciated video, sho wing how the parameter distributions, estimates, and errors v ary during the course of the opti- mization, a v ailable online. 10 There are sev eral takea w ays from these plots. First, the reﬂectivity of the MZIs near the center of the mesh are muc h smaller in the optimized rectangular meshes than in the p erm uting rectangular meshes. Second, the gradien t descent algorithm has a hard time ﬁnding the regime of Haar random matrices after a uniform random phase initialization. The v alues of θ n` are muc h larger than they need to b e even 100 iterations into the op- timization. This is likely evidence of a “v anishing gra- dien t” problem when the mesh is not Haar-initialized. Finally , an imp ortan t observ ation for the meshes with b eamsplitter error is that the θ n` / 2 distribution shifts sligh tly to wards 0 in the rectangular mesh. This is a consequence of the limits in reﬂectivit y and transmissiv- it y in each MZI due to b eamsplitter fabrication error as discussed in Section I I. Our simulated p erm uting rectangular implementation uses the same lay er deﬁnitions as deﬁned in Equation 11 except the P k with the most lay ers are in the center of the mesh, and the P k with the fewest lay ers are near the inputs and outputs of the mesh. In Figure 4, P 2 and P 3 w ould b e switched, and for N = 128, the order is [ P 2 , P 4 , P 6 , P 5 , P 3 , P 1 ]. W e ﬁnd this conﬁguration to b e the b est p erm uting rectangular mesh so far in our exp erimen ts, although the architecture in Equation 11 giv es improv emen ts ov er the rectangular mesh. App endix I: An equiv alen t deﬁnition for α n` Let α n` b e the sensitivity index for an MZI (“no de”) at (wa v eguide, la y er) coordinates ( n, ` ) in a lo cal decom- p osition for an N × N unitary op erator. W e deﬁne the “ro w co ordinate” or wa veguide index n from the MZI’s op erator U n coupling wa v eguides n and n + 1, and we deﬁne the “column co ordinate” or lay er index m to be ` = k + 1, where k is the maximum num ber of op era- tors applied to a reac hable input (This is equiv alen t to the vertical lay ers deﬁnition in Figure 1.). The reachable inputs I n` are the subset of input modes aﬀecting the immediate inputs of the MZI at ( n, ` ), and the reac hable outputs O n` are the subset of output mo des aﬀected by the immediate outputs of the MZI. F ollowing the deﬁnitions in Ref. [16], in the trian- gular sc heme, α n` := N − n , and in the rectangular sc heme, α n` := d ( n, ` ) + 1 − s n` [ ` ] where d ( n, ` ) is the 10 See https://av.tib.eu/series/520/photonic+optimization . 16 Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U R | Error magnitude : | ˆ U R − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 000 0 . 025 0 . 050 0 . 075 0 . 100 0 . 125 0 . 150 0 . 175 0 . 200 RM ( σ  = 0 , Haar θ n` init) Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U PR | Error magnitude : | ˆ U PR − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 000 0 . 025 0 . 050 0 . 075 0 . 100 0 . 125 0 . 150 0 . 175 0 . 200 PRM ( σ  = 0 , Haar θ n` init) Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U R | Error magnitude : | ˆ U R − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 000 0 . 025 0 . 050 0 . 075 0 . 100 0 . 125 0 . 150 0 . 175 0 . 200 RM ( σ  = 0 . 1 , Haar θ n` init) Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U PR | Error magnitude : | ˆ U PR − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 000 0 . 025 0 . 050 0 . 075 0 . 100 0 . 125 0 . 150 0 . 175 0 . 200 PRM ( σ  = 0 . 1 , Haar θ n` init) Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U R | Error magnitude : | ˆ U R − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 0 . 35 0 . 05 0 . 10 0 . 15 0 . 20 RM ( σ  = 0 , uniform-random θ n` init) Layer ( ` ) Input ( n ) θ n` / 2 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 Estimate magnitude : | ˆ U PR | Error magnitude : | ˆ U PR − U | 0 π / 4 π / 2 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 000 0 . 025 0 . 050 0 . 075 0 . 100 0 . 125 0 . 150 0 . 175 0 . 200 PRM ( σ  = 0 , uniform-random θ n` init) FIG. 12. Comparison of learned matrix errors and learned θ n` w eights after 20000 iterations for the Adam update at learning rate 0 . 0025 and batch size 256 for the simple unitary netw ork. W e consider tw o meshes: (1) rectangular mesh (RM), and (2) p erm uting rectangular mesh (PRM). W e consider three conditions for eac h mesh: (1) ideal (with Haar random unitary initialization), (2) photonic beamsplitter error displacement  ∼ N (0 , 0 . 01), (3) random initialization. 17 0 π / 4 π / 2 θ / 2 Histogram : θ n` / 2 P 9 ( θ n` / 2) P 43 ( θ n` / 2) PRM ( N = 128 ) RM ( N = 128 ) FIG. 13. Comparison of learned, normalized θ n` distributions for N = 128 rectangular (RM) and permuting rectangular (PRM) meshes with P α ( θ / 2) PDFs for α = N +1 3 = 43 (the a verage sensitivit y index) and α = b N 2 log N c = 9 respectively . Note that p erm uting meshes hav e a larger tolerance, which ev entually results in faster mesh optimization. n umber of no des on the diagonal (measured along paths of constant n + ` ) con taining a rotation parameterized b y θ n` , and s n` is a sequence of decreasing o dd integers d ( n, ` ) ≥ k odd ≥ 1, follow ed by increasing even integers 2 ≤ k even ≤ d ( n, ` ), as deﬁned in [16]. W e prov e b e- lo w that for both the triangular and rectangular meshes, α n` = | I n` | + | O n` | − N − 1. Lemma 1. In the triangular mesh, α n` = | I n` | + | O n` | − N − 1 . Pr o of. In the triangular mesh (shown for N = 8 in Figure 8) α n` := N − n , so w e wish to sho w that N − n = | I n` | + | O n` | − N − 1, or: 2 N + 1 = | I n` | + | O n` | + n. (I1) Supp ose Equation I1 holds for some arbitrary n 0 , ` 0 in the mesh, such that 2 N + 1 = | I n 0 ` 0 | + | O n 0 ` 0 | + n 0 . First, induct on n : if w e take n = n 0 + 2 and ` = ` 0 , then | I n` | = | I n 0 ` 0 | − 1 and | O n` | = | O n 0 ` 0 | − 1. Next, induct on ` : if we take n = n 0 and ` = ` 0 + 2, then | I n` | = | I n 0 ` 0 | + 1 and | O n` | = | O n 0 ` 0 | − 1. In b oth cases, Equation I1 holds. T rav ersals b y 2 along n or ` from a starting no de can reac h all no des with the same parit y of n and ` , so we need tw o base cases. Consider the ap ex no de at n = 1, ` = N − 1 and one of its neigh b ors at n = 2, ` = N . The former has | I n` | = | O n` | = N and the latter has | I n` | = N and | O n` | = N − 1. In b oth cases, Equation I1 is satisﬁed, so the lemma holds by induction. Lemma 2. In the r e ctangular mesh, α n` = | I n` | + | O n` | − N − 1 . Pr o of. In the rectangular mesh, α n` := d ( n, ` ) +1 − s n` [ ` ], as deﬁned in Ref. [16]. Deﬁne orthogonal axes x and y on the lattice such that for a no de at ( n, ` ), trav el- ing in the + x direction giv es the neighboring no de at ( n + 1 , ` + 1) and trav eling in the + y direction gives the neigh b oring node at ( n − 1 , ` + 1), as depicted in Figure 14. F or even { o dd } N , let the no de at ( n, ` ) = (1 , 1) ha ve x = 1 and the no de at ( n, ` ) = ( N − 1 , 1 { 2 } ) hav e y = 1. Then there is a one-to-one mapping such that ( x, y ) =  n + ` 2 , ` − n 2 + b N 2 c  , as sho wn in Figure 14, and it suﬃces to pro ve the lemma b y induction in this diagonal basis. Since d ( n, ` ) is deﬁned to b e the length of a diagonal along paths of constant n + ` , it dep ends only on x , so w e rewrite d ( n, ` ) 7→ d ( x ) explicitly: d ( x ) = ( 2 x − 1 x ≤ b N 2 c 2( N − x ) x > b N 2 c . (I2) Similarly , since s n` [ ` ] is enumerated along a diagonal, it dep ends only on y , and w e conv ert s n` [ ` ] → s x [ y ] from the sequence deﬁnition of Ref. [16] to an explicit lattice form: s x [ y ] = ( 2  b N 2 c − y  + 1 y ≤ b N 2 c 2  y − b N 2 c  y > b N 2 c . (I3) In this diagonal basis, we w an t to show that d ( x ) + 1 − s x [ y ] = | I xy | + | O xy | − N − 1 . (I4) There are t wo b oundaries at x, y = b N 2 c whic h separate four quadran ts that m ust b e considered, depicted b y gra y lines in Figure 14. W e will induct on x and y within each quadran t, then induct on x or y across eac h of the tw o b oundaries. Supp ose that Equation I4 holds for some arbitrary x 0 y 0 in the mesh, such that d ( x 0 ) + 1 − s x 0 [ y 0 ] = | I x 0 y 0 | + | O x 0 y 0 | − N − 1. First, w e induct on x and y within eac h quadran t; the results are tabulated in T able I. In every case, d ( x ) − s x [ y ] − | I xy | − | O xy | = d ( n, ` ) − s x 0 [ y 0 ] − | I x 0 y 0 | − | O x 0 y 0 | , so Equation I4 remains satisﬁed. Next, w e induct across the x, y = b N 2 c b oundaries, sho wn in T able I. Again, in every case, d ( x ) − s x [ y ] − | I xy | − | O xy | = d ( n, ` ) − s x 0 [ y 0 ] − | I x 0 y 0 | − | O x 0 y 0 | , satisfying Equation I4. Finally , note that the base case of the top left MZI at ( n, ` ) = (1 , 1), ( x, y ) =  1 , b N 2 c  holds, with d ( x ) + 1 − s x [ y ] = 1 = 2 + N − N − 1 = | I xy | + | O xy | − N − 1. This completes the pro of in the ( x, y ) basis, and since there is a one-to-one mapping b et ween ( x, y ) ↔ ( n, ` ), α n` = | I n` | + | O n` | − N − 1 holds b y induction. 18 1 2 2 2 3 4 4 1 1 5 6 3 3 7 5 1 1 5 6 3 3 4 4 1 1 2 2 2 1 2 4 6 1 2 4 6 3 1 2 4 3 1 2 4 5 3 1 2 5 3 1 2 7 5 3 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 y = 1 2 3 4 5 6 7 x = α n` s x [ y ] 1 2 2 2 3 4 4 1 1 5 6 3 3 7 5 1 1 5 6 3 3 4 4 1 1 2 2 2 1 2 4 6 1 2 4 6 3 1 2 4 3 1 2 4 5 3 1 2 5 3 1 2 7 5 3 1 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 y = 1 2 3 4 5 6 7 x = α n` s x [ y ] FIG. 14. Rectangular decomp osition for even ( N = 8) and o dd ( N = 7) meshes, showing the diagonal x, y basis. V alues for α n` are sho wn in red ab o ve each MZI, with v alues for s x [ y ] shown in blue b elo w. The critical b oundaries of x, y = b N 2 c separating the diﬀeren t quadrants are dra wn in green. (Boundaries are oﬀset for visual clarity .) Quadran t Induction d ( x ) = · · · s x [ y ] = · · · | I xy | = · · · | O xy | = · · · x 0 ≤ b N 2 c , y 0 ≤ b N 2 c x = x 0 − 1 d ( n, ` ) − 2 s x 0 [ y 0 ] | I x 0 y 0 | − 2 | O x 0 y 0 | y = y 0 − 1 d ( n, ` ) s x 0 [ y 0 ] + 2 | I x 0 y 0 | − 2 | O x 0 y 0 | x 0 ≤ b N 2 c , y 0 > b N 2 c x = x 0 − 1 d ( n, ` ) − 2 s x 0 [ y 0 ] | I x 0 y 0 | − 2 | O x 0 y 0 | y = y 0 + 1 d ( n, ` ) s x 0 [ y 0 ] + 2 | I x 0 y 0 | | O x 0 y 0 | − 2 x 0 > b N 2 c , y 0 ≤ b N 2 c x = x 0 + 1 d ( n, ` ) − 2 s x 0 [ y 0 ] | I x 0 y 0 | | O x 0 y 0 | − 2 y = y 0 − 1 d ( n, ` ) s x 0 [ y 0 ] + 2 | I x 0 y 0 | − 2 | O x 0 y 0 | x 0 > b N 2 c , y 0 > b N 2 c x = x 0 + 1 d ( n, ` ) − 2 s x 0 [ y 0 ] | I x 0 y 0 | | O x 0 y 0 | − 2 y = y 0 + 1 d ( n, ` ) s x 0 [ y 0 ] + 2 | I x 0 y 0 | | O x 0 y 0 | − 2 T ABLE I. Induction on x and y within eac h of the quadran ts in the mesh. x 0 y 0 Induction d ( x ) = · · · s x [ y ] = · · · | I xy | = · · · | O xy | = · · · x 0 = b N 2 c an y x = x 0 + 1 d ( n, ` ) − { + } 1 s x 0 [ y 0 ] | I x 0 y 0 | + 0 { 1 } | O x 0 y 0 | − 1 { 0 } an y y 0 = b N 2 c y = y 0 + 1 d ( n, ` ) s x 0 [ y 0 ] + 1 | I x 0 y 0 | | O x 0 y 0 | − 1 T ABLE II. Induction on x or y across each of the b orders of x, y = b N 2 c .

Matrix optimization on universal unitary photonic devices

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment