ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation


Authors: Nij Dorairaj¹, Debabrata Chatterjee¹, Hong Wang¹, Hong Jiang¹, Alankar Saxena¹, Altug Koker¹, Thiam Ern Lim¹, Cathrane Teoh¹, Chuan Yin Loo¹, Bishara Shomar², Anthony Lester³
¹Intel Corporation, ²Intel Corporation/Nvidia Corp, ³Synopsys Inc

Abstract

Integration of CPU and GPU technology is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU–GPU subsystems presents significant challenges due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level. This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems.

1 Introduction

CPU–GPU integration has become foundational for System-on-Chip (SoC) designs targeting AI, media, and high-performance computing.
While CPUs excel at control and latency-sensitive tasks, GPUs deliver high throughput through massive parallelism. Integrating these subsystems introduces validation challenges due to execution infrastructure requirements, multiple levels of memory hierarchies, complex interface protocols, large address maps, and non-deterministic behavior. In addition, traditional directed testing is insufficient to uncover integration-level issues that only manifest under realistic workloads. Replay-driven validation bridges the gap between simulation visibility and emulation performance.

2 ODIN SoC Overview and Chiplet Context

Modern high-performance SoC designs increasingly adopt a chiplet-based integration model to improve scalability, modularity, and design reuse. In this approach, complex compute and memory subsystems are composed as semi-independent chiplets and integrated using well-defined interconnect and protocol boundaries. This work is based on an ODIN integrated chiplet, which brings together CPU, GPU, memory controllers, and system interconnect into a single composable unit. The ODIN integrated chiplet combines a Xeon CPU subsystem, a GPU compute complex, and a Network-on-Chip (NoC) fabric, along with high-bandwidth and standard memory interfaces. As shown in Figure 1, the CPU and GPU subsystems connect to the NoC fabric, which serves as the primary communication backbone for coherent and non-coherent traffic. The NoC also interfaces with external memory controllers, PCIe, and other SoC IPs, enabling system-level integration while preserving modularity at the chiplet boundary. A key aspect of this architecture is the separation of compute chiplets from memory technologies. High-bandwidth memory (HBM) is accessed through a dedicated HBM controller within the chiplet, while traditional DDR memory is accessed via a DDR controller connected to the NoC.
This separation allows the chiplet to support heterogeneous memory systems and enables flexible deployment across different platform configurations. By adopting a chiplet-based architecture, the ODIN design enables independent development and validation of CPU and GPU subsystems, while relying on the NoC fabric to provide scalable connectivity, ordering, and protocol translation. This modularity, while beneficial for integration and reuse, also introduces validation challenges, particularly at the interfaces between chiplets, which motivate the replay-driven validation methodology described in the following sections.

Figure 1: ODIN chiplet overview.

The SoC design shown in Figure 2 represents a foundational building block toward the ODIN integrated chiplet architecture, rather than the final ODIN state. Individual compute and memory subsystems are composed and validated in this configuration to enable scalable integration into the broader ODIN chiplet framework. This approach allows CPU, GPU, and interconnect components to be developed and verified incrementally, while preserving architectural alignment with the eventual ODIN system composition.

Figure 2: Top-level SoC integration showing CPU, GPU subsystem, NoC, and system memory. GPU memory and control traffic is routed through the NoC fabric to system DDR.

2.1 CPU Subsystem

The CPU subsystem is integrated into the NoC using coherent Intra-Die Interface (IDI) links. Supporting logic includes a control and management block for semaphores and interrupt handling, a system controller for MMIO decode, and power management using sideband and industry-standard Q-Channel flows. As shown in Figure 3, the subsystem connects to the NoC through coherent interfaces.

Figure 3: CPU subsystem architecture highlighting coherent IDI connectivity.
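The MMIO decode performed by the system controller can be pictured with a short sketch; the address ranges and block names below are hypothetical placeholders, not the actual ODIN address map.

```python
# Hypothetical MMIO decode sketch: map an address to a target block.
# Ranges are illustrative placeholders, not the ODIN address map.
MMIO_MAP = [
    (0x0000_0000, 0x0000_FFFF, "semaphore_block"),
    (0x0001_0000, 0x0001_FFFF, "interrupt_ctrl"),
    (0x0002_0000, 0x0002_FFFF, "power_mgmt"),
]

def decode(addr: int) -> str:
    """Return the block targeted by an MMIO address, or 'unmapped'."""
    for lo, hi, block in MMIO_MAP:
        if lo <= addr <= hi:
            return block
    return "unmapped"
```

In the real design this decode sits behind the coherent IDI links and routes MMIO traffic to the appropriate supporting block.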
2.2 GPU Architecture

Conceptually, the GPU operates as a memory-input, memory-output processor, where workloads are expressed primarily as streams of memory transactions rather than tightly coupled control flows. This execution model enables high throughput and parallelism but places significant emphasis on the correctness, ordering, and performance of the memory subsystem. The GPU IP used in this work is a synthesizable RTL design composed of multiple Xe cores, each containing several execution units (EUs). The EUs execute a mix of SIMD arithmetic operations, SIMD load/store instructions, and systolic operations, enabling efficient processing of highly parallel workloads. Work is issued to the GPU as waves of threads, with execution distributed across Xe cores and EUs to maximize utilization.

GPU execution is inherently memory-driven, with frequent accesses to system memory for instruction fetch, data reads, and result write-back. As a result, the GPU relies heavily on the surrounding SoC infrastructure, including the Network-on-Chip (NoC), memory controllers, and cache hierarchy, to sustain throughput and maintain correctness. Interactions such as memory ordering, coherency, back-pressure, and response timing become critical validation points at the system level. This tight coupling between GPU execution and system memory behavior makes GPU validation particularly sensitive to integration issues at the NoC and memory interfaces. Consequently, accurate modeling of memory traffic and deterministic reproduction of interface behavior are essential for effective pre-silicon validation. Figure 4 provides an overview of the GPU architecture and its integration context.

Figure 4: GPU architecture showing Xe cores, execution units (EUs), and memory hierarchy.
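To make the execution model concrete, the following toy model (all core counts, EU counts, and addresses are hypothetical, not Xe-specific values) distributes a wave of threads across Xe cores and EUs and expresses each thread's work as a stream of memory transactions, mirroring the fetch/read/write-back pattern described above.

```python
# Toy model (not Intel's implementation): GPU work issued as "waves" of
# threads, distributed round-robin across Xe cores and their execution
# units (EUs), with each thread's work expressed as memory transactions.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class MemTxn:
    kind: str      # "fetch", "read", or "write"
    addr: int
    thread: int

def issue_wave(num_threads: int, num_cores: int = 4, eus_per_core: int = 8):
    """Assign a wave of threads to (core, EU) slots and return each
    thread's slot plus the memory transactions it would generate."""
    slots = cycle((c, e) for c in range(num_cores) for e in range(eus_per_core))
    schedule = {}
    for t in range(num_threads):
        core, eu = next(slots)
        # Each thread: instruction fetch, data read, result write-back.
        schedule[t] = ((core, eu), [
            MemTxn("fetch", 0x1000 + 4 * t, t),
            MemTxn("read",  0x8000 + 8 * t, t),
            MemTxn("write", 0xC000 + 8 * t, t),
        ])
    return schedule

sched = issue_wave(num_threads=64)
# 64 threads over 4*8 = 32 slots -> each (core, EU) slot serves 2 threads.
```

The point of the sketch is the shape of the traffic: even this trivial wave turns into interleaved memory transactions from many (core, EU) sources, which is exactly what the NoC and memory subsystem must order and service correctly.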
3 Validation Challenges

Several challenges emerged during CPU–GPU–NoC integration. In addition to functional complexity, the GPU execution flow relies on complex boot, power management, timing, and clocking protocols, many of which use proprietary interfaces. These protocols must be followed precisely to achieve successful GPU bring-up and workload execution. From a full-SoC perspective, modeling and understanding these detailed sequences introduces significant overhead and complexity.

In traditional bus functional model (BFM) based validation flows, transitioning from simulation to emulation often requires significant changes to model compilation, configuration, and execution infrastructure. These differences frequently result in maintaining separate design databases or platform-specific collateral for simulation and emulation, increasing integration overhead and making consistency across validation environments difficult to maintain. This fragmentation complicates debug correlation, slows iteration, and increases the risk of divergence between simulation and emulation results during system-level validation.

Another significant challenge during SoC-level validation is root-causing IP-level issues from full-system failures. When failures are observed at the SoC level, identifying whether the root cause lies in the IP implementation, the integration logic, or system-level interactions often requires extensive debug effort. This process typically involves back-tracking through multiple abstraction layers, from SoC-level behavior to specific IP interfaces and internal states, making debug time-consuming and resource-intensive, particularly for complex CPU–GPU interactions. Figure 5 highlights the Replay Engine concept used to abstract GPU boot and protocol complexity at the SoC level.
Additional challenges arose from randomized modeling used for metastability and pipeline staging within the GPU to improve verification coverage. While effective for coverage, this randomization complicates reproducibility and makes deterministic debug difficult. Full-chip RTL simulation is prohibitively slow for realistic workloads, while emulation limits internal visibility.

4 Replay Engine Architecture

To address these challenges, a Replay Engine was introduced to enable deterministic capture and reproduction of GPU-driven traffic across simulation and emulation. The Replay Engine interfaces at well-defined subsystem boundaries, avoiding invasive instrumentation while preserving functional fidelity.

4.1 Replay Engine Components

The Replay Engine captures timing-accurate waveforms at the GPU IP periphery, focusing on architecturally visible interface signals, including data, control, and response information. By preserving cycle-level behavior observed during standalone GPU IP validation, the captured trace provides a deterministic representation of the protocol interactions required for boot and workload execution. Unlike traditional approaches that depend on a bus functional model (BFM) to consume GPU outputs and generate corresponding responses, the replay methodology does not require a live BFM during SoC replay. The captured waveform inherently contains both the request signals driven by the GPU and the corresponding responses observed at the interface boundary. During replay, these responses are re-generated by the Replay Engine in the same clock cycles in which they were originally observed, effectively emulating the presence of a responding agent without explicitly instantiating BFM collateral. This design choice addresses multiple system-level validation challenges.
First, in conventional BFM-based flows, transitioning from simulation to emulation often requires significant model and compilation changes and can result in maintaining separate simulation and emulation databases, which are difficult to keep consistent. By embedding the required stimulus and response behavior directly in the replay artifact and reusing it across platforms, the replay methodology enables a uniform validation path between simulation and emulation while avoiding duplicated integration collateral. Second, SoC-level failures are often time-consuming to root-cause back into the originating IP. Deterministic replay improves debug efficiency by enabling consistent reproduction at well-defined IP boundaries, reducing the search space when back-tracking from system behavior to specific interface regions. Finally, because replay artifacts are decoupled from dynamic BFM infrastructure, incremental design changes, such as localized IP fixes or integration updates, can be validated with minimal disruption, allowing targeted updates without reworking or re-qualifying extensive testbench collateral. In combination, replay preserves protocol correctness and determinism across simulation and emulation while reducing integration overhead and accelerating both root-cause analysis and iterative validation.

Figure 5: Replay Engine architecture capturing timing-accurate waveforms at the GPU IP periphery and converting them into ROM-initialized replay data for deterministic execution.

4.2 Replay Capture and ROM Initialization Flow

Replay capture is performed during standalone GPU IP validation, where the Replay Engine records waveform activity at the GPU IP periphery. The captured waveform serves as the source artifact for replay and is not consumed directly at runtime.
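A minimal sketch may help fix the capture idea: both the request signals driven by the GPU and the responses observed at the boundary are recorded with the clock cycle in which they occurred. The signal names, cycle numbers, and values below are invented for illustration, not taken from the actual GPU interface.

```python
# Sketch (hypothetical names) of the capture side: record requests driven
# by the GPU and responses observed at the interface boundary, each
# stamped with the clock cycle in which it occurred.
captured = []  # list of (cycle, direction, signal, value)

def capture(cycle: int, direction: str, signal: str, value: int) -> None:
    assert direction in ("req", "rsp")
    captured.append((cycle, direction, signal, value))

# Illustrative trace fragment: the GPU issues a read at cycle 10 and the
# responding agent returns data at cycle 14. Replay later re-drives the
# "rsp" entries in exactly these cycles, emulating the responding agent
# without instantiating a live BFM.
capture(10, "req", "rd_addr", 0x8000)
capture(10, "req", "rd_valid", 1)
capture(14, "rsp", "rd_data", 0xDEADBEEF)
capture(14, "rsp", "rd_data_valid", 1)

# The response portion is what the Replay Engine must reproduce.
responses = {(c, s): v for c, d, s, v in captured if d == "rsp"}
```

Because the trace contains the responses as well as the requests, the "responding agent" needed at the SoC level is fully described by data, which is what removes the BFM dependency.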
Figure 6 depicts the replay capture, conversion, and ROM initialization flow used to prepare replay data. Following capture, an offline post-processing step extracts the relevant interface signals and converts the waveform activity into a cycle-ordered bit representation. This conversion encodes both stimulus and corresponding response information into a compact, replay-specific format suitable for storage. The resulting data represents the exact interface behavior observed during capture, preserved on a per-cycle basis. The encoded replay data is then used to initialize ROM-based storage structures within the Replay Engine as part of the SoC design initialization process. During simulation or emulation, the Replay Engine reads this pre-initialized ROM content to drive replay execution, eliminating any dependency on dynamic waveform files or external runtime infrastructure. By separating waveform capture, conversion, and ROM initialization from runtime replay execution, this approach enables repeatable, self-contained replay across platforms while maintaining consistency with the originally captured GPU protocol behavior.

Figure 6: Replay capture and ROM initialization flow used to prepare replay data.

5 Simulation and Emulation Flow

5.1 Methodology: Replay-Driven System-Level Validation

The validation methodology employs both simulation and emulation as complementary execution platforms, built from a common design database and YAML-based configuration. Replay artifacts serve as the shared stimulus mechanism across these platforms, enabling consistent execution of GPU-driven system scenarios while allowing each environment to be used where it is most effective. Simulation is leveraged for detailed debug and waveform visibility, while emulation is used for scalable, high-speed system-level execution.
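As a concrete picture of how a single replay artifact can serve both platforms, the Section 4.2 encode/replay cycle might look like the following minimal sketch. The field layout, signal names, and values are invented for illustration; the actual ODIN replay format is not public in this paper.

```python
# Sketch (illustrative field layout, not the actual ODIN format) of the
# Section 4.2 flow: encode per-cycle response activity into cycle-ordered
# ROM words, then decode and re-drive them cycle by cycle at replay time.
FIELDS = [("rd_data_valid", 1), ("rd_data", 32)]  # (name, width), LSB-first

def encode_cycle(values: dict) -> int:
    """Pack one cycle's response fields into a single ROM word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word |= v << shift
        shift += width
    return word

def build_rom(trace: dict, num_cycles: int) -> list:
    """trace maps cycle -> {field: value}; cycles with no entry are idle."""
    return [encode_cycle(trace.get(c, {})) for c in range(num_cycles)]

def replay(rom: list, drive) -> None:
    """Each cycle, decode the pre-initialized ROM word and drive the
    response signals via the platform's signal-drive hook."""
    for cycle, word in enumerate(rom):
        shift = 0
        for name, width in FIELDS:
            drive(cycle, name, (word >> shift) & ((1 << width) - 1))
            shift += width

# A response captured at cycle 14 reappears in exactly that cycle.
rom = build_rom({14: {"rd_data_valid": 1, "rd_data": 0xDEADBEEF}}, 16)
driven = []
replay(rom, lambda c, n, v: driven.append((c, n, v)))
```

Because the ROM image is just initialized state, the same artifact compiles into both the simulation and emulation builds with no runtime file I/O, which is the property the methodology relies on.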
In a conventional SoC integration flow, IP subsystems typically provide testbench collateral or bus functional models (BFMs) to enable early bring-up and stimulus generation at the SoC level. Developing, validating, and maintaining this collateral often adds significant integration overhead and can delay execution of the first meaningful system-level tests.

In contrast, the replay-driven methodology presented in this work eliminates the need for dedicated BFM or testbench collateral for GPU integration. The required protocol behavior already exists in the timing-accurate waveform capture performed during standalone GPU IP validation. This captured waveform, including both stimulus and corresponding responses, is converted into replay data and reused directly at the SoC level to drive boot and workload execution. By reusing validated protocol behavior rather than re-implementing it in separate integration collateral, the methodology significantly accelerates SoC bring-up. This approach enabled the system to successfully boot and execute the first GPU-driven testcases earlier in the integration cycle, while maintaining deterministic and reproducible behavior across both simulation and emulation.

By combining detailed simulation debug with high-throughput emulation execution around a single replay mechanism, the methodology reduces debug turnaround time, avoids duplication of integration effort, and improves pre-silicon integration confidence. This replay-driven approach establishes a scalable validation framework that naturally extends to increasingly complex SoC and chiplet-based systems.

5.2 Simulation Flow

Simulation serves as the primary environment for functional debug and root-cause analysis, providing detailed waveform visibility across the CPU, GPU, and NoC interfaces.
The Replay Engine enables deterministic execution of complex GPU protocol behavior, allowing system-level scenarios such as boot and workload execution to be analyzed with cycle-accurate signal visibility. This capability is particularly valuable for diagnosing protocol ordering, timing dependencies, and integration issues that are difficult to observe through directed testing alone. Figure 7 shows the simulation test flow, and Figure 8 shows successful simulation results with output comparison against a golden reference.

Figure 7: Simulation test flow.

Figure 8: Successful simulation results using replay-driven stimulus with output comparison against a golden reference.

5.3 Emulation Flow

In the emulation environment, the replay artifacts generated during simulation are reused to enable high-speed execution of GPU-driven system scenarios. The Replay Engine deterministically replays captured GPU protocol behavior at the IP periphery, allowing the SoC to progress through complex sequences such as boot and workload execution without dependence on full software stacks or externally generated stimulus. This approach simplifies emulation setup while maintaining functional equivalence with scenarios validated in simulation.

Emulation complements simulation by enabling execution of long-running, system-level workloads that would be impractical to run repeatedly in RTL simulation. Although internal signal visibility is more limited than in simulation, the deterministic nature of replay allows issues observed in emulation to be reliably correlated back to simulation for detailed root-cause analysis. This combination enables rapid reproduction of integration issues at full-chip scale while preserving confidence in functional correctness.
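The golden-reference output check used in both flows can be sketched as a simple memory-image comparison; `compare_to_golden` is a hypothetical helper written for illustration, not part of the actual flow.

```python
# Sketch of the output check: compare memory contents produced by a
# replay-driven run against a golden reference image, reporting the
# first few mismatching addresses. Hypothetical helper, not the actual flow.
def compare_to_golden(result: dict, golden: dict, max_report: int = 5):
    """Both dicts map address -> value; return a list of
    (addr, expected, got) tuples, empty on a clean pass."""
    mismatches = []
    for addr in sorted(golden):
        got = result.get(addr)
        if got != golden[addr]:
            mismatches.append((addr, golden[addr], got))
            if len(mismatches) >= max_report:
                break
    return mismatches
```

Because replay is deterministic, a mismatch reported here reproduces identically in simulation, which is what allows emulation failures to be correlated back for cycle-accurate debug.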
Figure 9 shows representative emulation waveforms demonstrating successful system boot and correct memory output, illustrating how deterministic replay drives system-level execution at significantly higher performance. To demonstrate feasibility at scale, the full CPU–GPU–NoC SoC was mapped onto the emulation platform and evaluated for resource utilization. Table 1 summarizes the estimated and configured resource usage across different board configurations, showing that the design fits within available capacity while leaving margin for additional debug and infrastructure logic.

Figure 9: Emulation flow using replay artifacts to enable high-speed system-level validation.

Table 1: Estimated resource utilization for full-chip emulation on the EP1 platform

  Configuration          LUT     RAM     URAM   REG
  Estimated Size         87M     25K     503    36M
  15 Boards              88.8%   40.2%   3.7%   24.6%
  16 Boards (64 FPGAs)   83.3%   37.7%   3.5%   13.7%

5.4 Emulation Hardware Platform (EP1)

The full-chip emulation described in this work was deployed on the EP1 emulation hardware platform, which provides the capacity and interconnect required to host the integrated CPU–GPU–NoC SoC design. EP1 serves as the execution target for the compiled design image, enabling validation of system-scale integration under realistic hardware constraints. Figure 10 presents the EP1 hardware context used in this study, including the EP1 module, the target EP1 hardware platform, and the 16-board EP1 configuration used to accommodate the full SoC design. Together, these views illustrate how the design scales across the emulation infrastructure to support full-chip execution while preserving consistency with the replay-driven validation flow described earlier.

Figure 10: EP1 emulation hardware platform used for full-chip CPU–GPU–NoC system validation.
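A quick arithmetic cross-check of Table 1 (illustrative only): for a fixed design size, per-board utilization should scale by 15/16 between the two board configurations, and the LUT, RAM, and URAM columns do follow that ratio.

```python
# Illustrative consistency check on Table 1: utilization at 16 boards
# should be (15/16) of the 15-board figure when design size is fixed.
scale = 15 / 16
for name, u15, u16 in [("LUT", 88.8, 83.3), ("RAM", 40.2, 37.7), ("URAM", 3.7, 3.5)]:
    assert abs(u15 * scale - u16) < 0.1, name
# The REG column (24.6% -> 13.7%) does not follow this simple scaling,
# suggesting the 16-board (64-FPGA) configuration maps registers
# differently; the table itself does not say why.
```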
6 Results and Key Learnings

Using replay-driven validation, end-to-end system boot and GPU workload execution were achieved within a quarter. By reusing deterministic, protocol-accurate replay artifacts captured during standalone IP validation, the methodology eliminated the need for dedicated SoC-level testbench or BFM collateral, significantly accelerating SoC integration and enabling earlier execution of the first meaningful system-level tests.

6.1 Key Learnings

• Deterministic waveform capture was essential for enabling reliable replay across both simulation and emulation environments.
• Replay-based stimulus significantly reduced debug turnaround time and enabled repeatable reproduction of system-level issues.
• A reduced ROM footprint was achieved by leveraging clock-generation-based replay instead of full capture-replay of internal clocking logic.
• Automation of capture and replay logic insertion minimized manual errors and improved repeatability across validation runs.
• For emulation, clock-simulation-based clocking required preprocessing through the ZEMI3 flow using Synopsys BC, introducing additional setup considerations.
• The clock-simulation-based clocking model does not support dynamic frequency changes, requiring fixed-frequency assumptions during replay-driven emulation.
• Disabling flop randomization used for metastability validation was necessary to achieve deterministic behavior suitable for waveform replay.

7 Conclusion

Replay-driven validation provides a scalable, deterministic foundation for validating complex heterogeneous SoCs, where tight CPU–GPU coupling and system-level protocols make traditional validation approaches insufficient.
By unifying simulation and emulation around a single, reusable replay artifact, this methodology avoids the need for separate simulation and emulation databases commonly associated with BFM-based flows, reducing integration overhead and maintaining consistency across validation platforms. In addition, deterministic replay improves debug efficiency by enabling reliable reproduction at well-defined IP boundaries, significantly reducing the effort required to root-cause SoC-level failures back to specific IP interfaces. As systems evolve toward ODIN-class chiplet architectures, replay-based validation enables repeatable, interface-accurate verification at subsystem and chiplet boundaries, establishing a practical validation framework that scales with increasing system complexity.
