The continued scaling of artificial intelligence workloads is increasingly constrained by data movement, interconnect bandwidth, and energy efficiency in conventional electronic systems. Integrated photonics offers a promising pathway to address these challenges through high-bandwidth optical interconnects and energy-efficient photonic computing primitives. However, translating device-level photonic advances into large-scale, deployable AI systems remains difficult due to strong coupling between physical implementation, system architecture, and learning algorithms. In this work, we identify three considerations that are essential for realizing practical photonic AI systems at scale: (1) dynamic tensor operation support for modern models rather than only weight-static kernels, especially for attention/Transformer-style workloads [1]; (2) systematic management of conversion, control, and data-movement overheads, where multiplexing and dataflow must amortize electronic costs instead of letting ADC/DAC and I/O dominate [2]; and (3) robustness under hardware non-idealities that become more severe as integration density grows [3]. To study these coupled tradeoffs quantitatively, and to ensure they remain meaningful under real implementation constraints, we build a cross-layer toolchain that supports photonic AI design from early exploration to physical realization. SimPhony [4] provides implementation-aware modeling and rapid cross-layer evaluation, translating physical costs into system-level metrics so architectural decisions are grounded in realistic assumptions. ADEPT [5] and ADEPT-Z [6] enable end-to-end circuit and topology exploration, connecting system objectives to feasible photonic fabrics under practical device and circuit constraints. Finally, Apollo [7] and LiDAR [8,9] provide scalable photonic physical design automation, turning candidate circuits into manufacturable layouts while accounting for routing, thermal, and crosstalk constraints. Together, these capabilities make our co-design loop both quantitative and physically grounded, bridging architectural intent and deployable photonic hardware.
The rapid scaling of artificial intelligence (AI) workloads has exposed fundamental limitations in conventional electronic computing systems [10-13]. While transistor scaling continues to deliver incremental improvements in logic density, system-level performance and energy efficiency are increasingly constrained by data movement, memory bandwidth, and interconnect power consumption. These challenges become especially acute in large-scale AI systems, where communication and I/O frequently dominate both latency and energy budgets.
Photonics has emerged as a promising technology to relieve these bottlenecks, offering high bandwidth density, low propagation loss, and natural support for broadcast and wavelength-division multiplexing (WDM). In parallel with advances in optical interconnects and co-packaged/heterogeneous integration [14,15], recent demonstrations have shown photonic computing primitives that accelerate core AI operators (e.g., tensor/matrix computations) with impressive throughput and parallelism [1-3,10,16-18]. Despite these device- and chip-level successes, a clear path toward large-scale photonics-empowered AI systems remains elusive.
A central challenge is that scaling beyond isolated accelerators requires system-level integration across devices, circuits, architectures, interconnect fabrics, and learning algorithms, under constraints that are qualitatively different from electronics. Photonic integrated circuit (PIC) / electronic photonic integrated circuit (EPIC) implementations must obey curvilinear geometries, limited routing resources, strict fabrication rules, and strong sensitivity to process variation and thermal effects, all of which directly impact loss, crosstalk, tuning power, and yield [19,20]. Meanwhile, packaging and electronic-photonic interfacing introduce additional constraints and costs that can dominate deployment feasibility and module economics [14,15]. As a result, manual photonic design does not scale to the complexity demanded by system-class AI, and architecture-only abstractions can be misleading unless they explicitly model physical and packaging realities.
In this work, we argue that realizing large-scale photonics-empowered AI systems requires two tightly coupled capabilities:
• Photonic and electronic-photonic physical design automation (EPDA) to enable scalable, manufacturable implementation of complex PICs/EPICs; and
• System-algorithm co-exploration that incorporates physical non-idealities, control/calibration limits, and packaging/interface costs into architectural design and learning optimization (a minimal illustration of hardware-aware training follows this list).
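To make the second capability concrete, the sketch below shows one common form of physics-in-the-loop learning: injecting hardware-style perturbations into the forward pass during training so the learned weights tolerate them at deployment. This is a minimal PyTorch illustration under assumed noise models (multiplicative weight noise standing in for phase/thermal variation, additive output noise standing in for detector and ADC error); the class name and noise magnitudes are hypothetical, not our released implementation.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer with hardware-style noise injected during training.

    Noise models and magnitudes are illustrative assumptions for an
    analog/photonic core, not measured device parameters.
    """

    def __init__(self, in_f, out_f, weight_noise=0.02, output_noise=0.01):
        super().__init__(in_f, out_f)
        self.weight_noise = weight_noise   # e.g., phase-shifter drift / process variation
        self.output_noise = output_noise   # e.g., detector shot noise + ADC error

    def forward(self, x):
        w = self.weight
        if self.training:  # sample a fresh hardware perturbation every step
            w = w * (1 + self.weight_noise * torch.randn_like(w))
        y = nn.functional.linear(x, w, self.bias)
        if self.training:
            y = y + self.output_noise * y.detach().abs().mean() * torch.randn_like(y)
        return y

# Drop-in replacement for nn.Linear in any model:
model = nn.Sequential(NoisyLinear(784, 256), nn.ReLU(), NoisyLinear(256, 10))
```

Training with such layers is one ingredient of co-exploration; the architectural half is choosing multiplexing, precision, and calibration budgets so that the injected noise model actually reflects the deployed hardware.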
Drawing on our recent progress, we connect EPDA with cross-layer hardware/algorithm co-design and illustrate how these techniques together enable scalable photonic AI systems.
As argued in Sec. 1, scaling photonics beyond isolated accelerators requires co-optimization across devices, circuits, architectures, and learning algorithms under realistic physical constraints.
In this section, we ground these requirements through three photonic tensor-core (PTC) designs, Lightening-Transformer [1], TeMPO [2], and SCATTER [3], that each target a different bottleneck regime in cloud/edge deployment and demonstrate how cross-layer co-design translates system constraints into implementable architectures.
To support modern LLMs, particularly attention-based Transformer architectures, photonic computing cores must move beyond weight-static matrix units and enable dynamic tensor operations, while jointly optimizing signal conversion and data movement. Our prior architecture, Lightening-Transformer [1], was the first photonic accelerator designed to efficiently execute high-throughput, dynamic optical matrix-matrix multiplications for self-attention. It replaces weight-static photonic matrix units with a Dynamically-operated Photonic Tensor Core (DPTC). At its heart is the Dynamically-operated Dot-product (DDot) engine, a coherent dot-product unit that enables picosecond-level operand switching and supports full-range (signed) matrix inputs without hardware duplication or multiple inference passes. Lightening-Transformer further integrates these computing cores with photonic interconnects for inter-core data broadcast. By exploiting WDM for spectral parallelism and optical broadcast for operand sharing, the cross-layer-optimized architecture achieves over a 12× latency reduction compared to prior photonic accelerators.
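To convey the behavioral contract of such a dynamically-operated core, the sketch below models a DDot-style dot product in which both operands are set at runtime through input DACs, per-wavelength products accumulate in the analog domain before a single ADC read, and signed values need no hardware duplication because the quantizer is symmetric. This is a simplified numerical stand-in under assumed bit widths and a lumped Gaussian noise term, not the coherent circuit model of the actual chip.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits, full_scale):
    """Uniform symmetric quantizer standing in for a DAC or ADC of `bits` resolution."""
    levels = 2 ** (bits - 1) - 1
    return np.clip(np.round(np.asarray(x) / full_scale * levels),
                   -levels, levels) * full_scale / levels

def ddot(x, w, dac_bits=4, adc_bits=8, rel_noise=0.01):
    """Behavioral model of a dynamically-operated dot product (DDot-style).

    Both operands are runtime-dynamic (quantized by input DACs each cycle),
    the products sum across WDM channels in the analog domain (modeled as an
    ideal sum plus lumped noise), and one ADC digitizes the accumulated result.
    """
    fs = max(np.abs(x).max(), np.abs(w).max(), 1e-12)
    xq = quantize(x, dac_bits, fs)       # dynamic operand A (e.g., a query row)
    wq = quantize(w, dac_bits, fs)       # dynamic operand B (e.g., a key row)
    y = float(np.sum(xq * wq))           # analog accumulation across wavelengths
    y += rel_noise * abs(y) * rng.standard_normal()
    return float(quantize(y, adc_bits, full_scale=fs * fs * len(x)))

# One attention-style score with both operands produced at runtime:
q, k = rng.standard_normal(16), rng.standard_normal(16)
print(ddot(q, k), "vs ideal", float(q @ k))
```

Because neither operand is baked into the hardware, attention products such as QK^T, where both factors change with every input, map directly onto the core.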
While Lightening-Transformer targets cloud-scale throughput, edge AI faces a different constraint regime in which area and energy budgets are highly restricted and electronic interfaces can dominate total cost. To address this setting, we extend the dynamic tensor-core concept to TeMPO [2], an efficient, time-multiplexed dynamic photonic tensor core that improves utilization and amortizes electronic overheads. At the device level, TeMPO employs customized, foundry-fabricated slow-light Mach-Zehnder modulators (SL-MZMs) that leverage enhanced light-matter interaction to achieve compact, low-drive-power modulation, directly reducing the area and energy cost of operand encoding at the edge.
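The amortization argument behind time multiplexing can be made explicit with a first-order energy model: conversion energy per MAC falls with the number of MACs each DAC conversion and each ADC read serves. The constants below are placeholder values chosen for illustration, not measured TeMPO numbers.

```python
# First-order energy-per-MAC model for a time-multiplexed photonic tensor core.
# All constants are illustrative assumptions, not measured TeMPO values.
E_DAC = 1.0e-12   # J per input DAC conversion (assumed)
E_ADC = 2.0e-12   # J per output ADC read (assumed)
E_OPT = 0.1e-12   # J of optical + tuning energy per MAC (assumed)

def energy_per_mac(fanout, dot_len):
    """fanout: MACs that reuse one DAC'd operand (broadcast / temporal reuse);
    dot_len: products accumulated in the analog domain before one ADC read."""
    return 2 * E_DAC / fanout + E_ADC / dot_len + E_OPT  # two dynamic operands per MAC

for fanout, dot_len in [(1, 1), (8, 8), (32, 64)]:
    print(f"fanout={fanout:2d}, dot_len={dot_len:2d}: "
          f"{energy_per_mac(fanout, dot_len) * 1e12:.3f} pJ/MAC")
```

With no reuse (fanout = dot_len = 1), conversion dominates at roughly 4 pJ/MAC in this model; with realistic broadcast and accumulation factors, the cost approaches the optical floor, which is precisely the regime a time-multiplexed core is designed to reach.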