Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?
Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions.
Authors: Yakov Pyotr Shkolnikov
Yakov Pyotr Shkolnikov (yshkolni@gmail.com), March 2026

Abstract

Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields 10.61% word error rate (WER) on LibriSpeech test-clean, +7.24 percentage points (pp) over the 3.37% baseline, with a 3.27× speedup at 120 s of audio on Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows that all 16 intra-chunk attention layers can be replaced without collapse, suggesting the depth wall arises from linguistic computation rather than an LPA limitation. LPA's near-binary gates at inference enable dense GPU computation with no CPU–GPU synchronization, and all operations map to mobile neural accelerators.

Keywords: speech recognition, efficient inference, attention replacement, on-device ASR, linear complexity

1 Introduction

Self-attention Vaswani et al. (2017) computes pairwise interactions across all sequence positions, requiring O(n^2 d) operations for sequence length n and hidden dimension d. In speech, 50 Hz frame rates produce thousands of frames per utterance. At these lengths, attention dominates memory at all durations and latency beyond ~20 s of audio. On mobile hardware, the problem is compounded.
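To make the quadratic gap concrete, a back-of-the-envelope script (assuming fp32 buffers and the 50 Hz frame rate above; these are the same per-layer quantities reported later in Sec. 3.3 and Fig. 1):

```python
# Per-layer memory for the n x n attention score matrix vs. LPA's n x P gates,
# at a 50 Hz frame rate and fp32 (4 bytes per element).
BYTES = 4          # fp32
FRAME_RATE = 50    # wav2vec2 output frames per second

def attn_score_bytes(seconds: int) -> int:
    """Memory for one n x n attention score matrix."""
    n = seconds * FRAME_RATE
    return n * n * BYTES

def lpa_gate_bytes(seconds: int, pulses: int = 12) -> int:
    """Memory for one layer's n x P gate tensor."""
    n = seconds * FRAME_RATE
    return n * pulses * BYTES

attn = attn_score_bytes(120)   # 120 s of audio -> n = 6,000 frames
lpa = lpa_gate_bytes(120)
print(attn / 2**20, lpa / 2**10, attn // lpa)
# roughly 137 MiB vs. 281 KiB per layer, a ~500x reduction
```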
Neural accelerators such as Apple's ANE and Qualcomm Hexagon achieve peak throughput on element-wise and convolution operations but lack efficient support for the dynamic n × n matrix multiplications that attention requires. This forces GPU fallback and negates the accelerator's power efficiency. An ASR model whose mixing operations are entirely accelerator-compatible could achieve an order-of-magnitude speedup on device.

We propose the Learnable Pulse Accumulator (LPA), which replaces key-query matching with learned gating functions that define soft windows over the sequence. Three gate types capture complementary patterns: aperiodic pulses handle content-dependent segmentation, periodic pulses capture multi-scale rhythmic structure, and positional pulses encode fixed structural biases. The computation uses only depthwise convolution, sigmoid gating, and weighted summation, all of which are accelerator-compatible operations (Fig. 1).

LPA is architecturally distinct from prior O(n) alternatives. Linear attention Katharopoulos et al. (2020) approximates the softmax kernel. State space models (SSMs) Gu and Dao (2023) use recurrent state evolution. SummaryMixing Parcollet et al. (2024) uses mean pooling. LPA instead uses explicit learned windows where each pulse defines where to aggregate information, how wide to look, and at what periodicity to repeat.

We convert a pretrained wav2vec2-base model to LPA via progressive layer-by-layer replacement, developing the techniques essential for preserving quality. Our contributions:

1. An O(n) primitive, gated pulse accumulation, with three complementary gate types for sequence mixing.

2. An MSE diagnostic sweep that measures per-layer replacement difficulty and determines optimal replacement order, providing a practical recipe for converting any pretrained transformer.

3.
A roofline analysis demonstrating that pulse count has negligible impact on single-sample inference, because the projections dominate and the gate accumulation is memory-bandwidth-bound, so additional pulses on hard layers are effectively free.

4. Inference benchmarks on consumer hardware showing a 3.27× speedup at 120 s via a fused MLX/Metal path, with linear scaling and full mobile accelerator compatibility.

2 Related work

Efficient attention and attention-free models. Linear attention Katharopoulos et al. (2020) replaces the softmax kernel with a feature map (O(nd^2)). Gated Linear Attention Yang et al. (2024) adds data-dependent forget gates with hardware-efficient chunkwise training. FlashAttention Dao et al. (2022) optimizes IO but retains O(n^2) complexity. The Attention Free Transformer (AFT) Zhai et al. (2021) replaces dot-product attention with element-wise sigmoid gating and learned position biases, but its full formulation retains T × T position parameters and does not use explicit windowed accumulation. Mega Ma et al. (2023) combines exponential moving average with single-head gated attention, achieving linear complexity on long-range benchmarks. The Learnable Multi-Scale Wavelet Transformer Kiruluta et al. (2025) replaces dot-product attention with a learned Haar wavelet decomposition, achieving linear scaling on machine translation. These primarily target language modeling or general sequence tasks.

Efficient speech models. SummaryMixing Parcollet et al. (2024) replaces attention in wav2vec 2.0 with mean-pooling branches (18% faster, no quality drop on downstream tasks). The Polynomial Mixer Feillet et al. (2026) uses polynomial representations, outperforming SummaryMixing on LibriSpeech Panayotov et al. (2015). Fast Conformer Rekesh et al. (2023) uses 8× downsampling for linear scaling (4.99% WER, 2.8× speedup). Zipformer Yao et al. (2024) uses a U-Net encoder with temporal downsampling, achieving faster inference than Conformer. LiteASR Kamahori et al.
(2025) compresses Whisper via low-rank approximation of activations (50% encoder size reduction). Mamba-based approaches (ConMamba Jiang et al. (2025), Samba-ASR Shakhadri et al. (2025)) replace attention with SSMs but train from scratch on 10k+ hours.

Progressive layer replacement. LoLCATs Zhang et al. (2025) linearizes LLM attention via MSE distillation and LoRA. Mamba-in-the-Llama Wang et al. (2024) replaces attention with Mamba2 stepwise, finding end-to-end fine-tuning most impactful. GatedDeltaNet Yang et al. (2025) reports a 3:1 linear-to-attention ratio as optimal in production (Qwen3-Next), consistent with our finding that 8/12 (2:1) LPA layers is the practical limit.

LPA differs from AFT and Mega in its use of explicit square pulse windows (aperiodic, periodic, positional) rather than global element-wise gating or exponential decay. Square pulses produce sparse, near-binary gate activations after temperature annealing, which both reduce arithmetic intensity and enable kernel fusion across consecutive LPA layers (the sparsity pattern is determined by a few gate parameters rather than a dense T × T bias). LPA differs from SummaryMixing and LiteASR in replacing the mixing mechanism itself rather than using pooling or low-rank compression.

Figure 1: (a) Standard self-attention computes an n × n matrix via QK^T at O(n^2 d) cost. (b) LPA replaces this with three learned gate types (aperiodic rectangular pulses, periodic Haar-like windows, and learned positional bases) that define soft windows over the sequence. Gated accumulation produces per-pulse summaries v̄_p = (Σ_t g_{p,t} v_t)/(Σ_t g_{p,t}) at O(nPd) cost. With P = 12 pulses and n = 6,000 (120 s audio): attention 137 MB/layer vs. LPA 281 KB/layer (488×).
All operations are accelerator-compatible.

The combination of gated pulse accumulation, the speech/connectionist temporal classification (CTC) domain, and progressive replacement of pretrained attention is, to our knowledge, unexplored.

3 Learnable Pulse Accumulator

Given input X ∈ R^{n×d}, LPA maintains P pulses, each with a gate g_p ∈ [0,1]^n assigning soft membership to each position. The output at position t is

    LPA(X)_t = W_O · ( Σ_p w_p g_{p,t} a_p v̄_p ) / ( Σ_p w_p g_{p,t} )    (1)

where w_p are softmax-normalized pulse weights, a_p are per-pulse amplitudes, W_O ∈ R^{d×d} is the output projection, and v̄_p is the gated mean of value-projected hidden states within pulse p:

    v̄_p = ( Σ_t g_{p,t} · W_V x_t ) / ( Σ_t g_{p,t} ),   W_V ∈ R^{d×d}    (2)

An active mask m_t = 1 − exp(−Σ_p g_{p,t}) suppresses output at positions with no gate coverage, applied as a multiplicative factor on the output.

3.1 Gate types

Aperiodic gates predict center c_p and half-width δ_p from the input via a learned query attention mechanism:

    g_p^a(t) = σ((t − c_p + δ_p)/τ) · σ((c_p + δ_p − t)/τ)    (3)

where σ(·) denotes the sigmoid function, c_p = Σ_t t · softmax(h q_p / τ)_t with h = MLP(DWConv(X)) ∈ R^{n×d/2} (MLP = multi-layer perceptron, DWConv = depthwise convolution), q_p ∈ R^{d/2} is a learned query vector for pulse p, and δ_p = softplus(f(h̄_p)) > 0 ensures a valid interval, where h̄_p is the attention-weighted mean of h using the same softmax weights as c_p. Temperature τ anneals from soft to hard during training.

Periodic gates predict period T_p, phase φ_p, and duty cycle d_p:

    g_p^per(t) = σ( (cos(2π t/T_p − φ_p) − cos(π d_p)) / τ )    (4)

The period is parameterized as T_p = 2 softplus(·) + 2 (minimum 4 frames), initialized across scales from ~10 to 512 frames to capture phoneme-to-word-level repetitions.

Positional gates are content-independent.
Each gate is a learned linear combination of K sin/cos basis functions over normalized position t̂ = t/(n−1), passed through a sigmoid:

    g_p^pos(t) = σ( ( Σ_{k=1}^{K} α_{p,k} sin(2πk t̂) + β_{p,k} cos(2πk t̂) + b_p ) / τ )

where α_{p,k}, β_{p,k} ∈ R are learned coefficients and b_p is a bias. With K = 16 bases, these can represent arbitrary positional patterns, including telegraph-like bi-level signals, capturing fixed structural biases that complement the content-dependent gates.

3.2 Architecture details

The aperiodic gate predictor uses a causal depthwise convolution (kernel 5) followed by a 2-layer MLP with GELU activation. Periodic and positional predictors use a single linear projection. Cross-layer coordination adds a projected summary of the previous LPA layer's mean gate pattern as an additive bias. The value projection W_V and output projection W_O are initialized from the original attention weights.

3.3 Complexity

Per-layer cost is O(nPd + nd^2) vs. attention's O(n^2 d + nd^2). With P = 12 and n = 6000 (120 s audio), gate memory is 281 KB vs. attention's 137 MB, a ~500× reduction. At batch size 1, the nd^2 projections dominate and the nPd accumulation is memory-bandwidth-bound on the input tensor, making P effectively free (Appendix E).

4 Progressive replacement recipe

We replace attention layers in a pretrained wav2vec2-base-960h Baevski et al. (2020) (3.34% WER on dev-clean) one at a time, progressing from easiest to hardest. The full recipe (Algorithm 1) requires only two inputs: the pretrained model and a training set. The sweep (Step 1) produces both a difficulty ordering and a per-layer pulse allocation, since elastic net regularization prunes unneeded pulses on easy layers while retaining capacity on hard ones.

Three techniques in Step 2 proved critical. Selective initialization from attention weights and layer norm unfreezing together account for −31.8 pp (Table 1), the single largest gain. FFN co-adaptation (unfreezing at 0.1× lr) adds −1.5 pp, since each layer's FFN was co-trained with its original attention.

Algorithm 1: Progressive LPA replacement
  Input: Pretrained model M with L attention layers, training set D, WER budget B
  Output: Partially-replaced model with k LPA layers
  Step 1: MSE diagnostic sweep
    for each layer ℓ = 0, ..., L−1 do
      Replace layer ℓ with overprovisioned LPA (P_sweep pulses)
      Train 2 epochs: L = MSE(LPA(X), Attn(X)) + λ_1 ||a||_1 + λ_2 ||a||_2^2
      Record MSE (difficulty) and surviving pulse count
      Restore original attention
    Sort layers by increasing MSE → replacement order
  Step 2: Progressive replacement
    for i = 1, ..., L in sweep order do
      Initialize LPA from attention weights (W_O, W_V, partial W_Q)
      MSE warm-start: 2 epochs against original attention output
      CTC training: 8 epochs with temperature annealing τ: 3.0 → 0.5
      Unfreeze FFN at 0.1× lr, unfreeze layer norm
      Alignment: jointly fine-tune all LPA layers (≤ 5 epochs, 0.5× lr)
      Re-anneal all LPA temperatures globally
      Auto-revert if WER increases
      if WER > B then stop
  Final joint fine-tuning: 8 epochs at 0.2× lr

Temperature curriculum contributes −2.9 pp: during per-stage training only the current layer anneals, while during alignment all LPA layers re-anneal globally, providing smoother gradients for cross-layer adaptation.

5 Experiments

Model. facebook/wav2vec2-base-960h (12 layers, 768 hidden, 95M params, 3.34% WER on dev-clean), the standard pretrained release fine-tuned with CTC on 960 h of LibriSpeech Panayotov et al. (2015).

Data. LibriSpeech train-clean-100 (100 h) for early ablations, train-clean-360 (360 h) for final configurations. Evaluation on dev-clean (2,703 utterances), greedy CTC decoding without language model.

LPA config.
Base: 4 aperiodic + 4 periodic + 4 positional = 12 pulses per layer, kernel size 5, deep gate predictor (2-layer MLP for aperiodic, single projection for periodic/positional), cross-layer coordination, value projection, output gate, 4 heads, position modulation, dynamic content-dependent pulse weights, and skip connections from 4 layers back. In the +Order configuration, sweep-derived allocations override pulse counts (up to 36 for hard layers). The +Architecture configuration uses a fixed override of 8+8+8 = 24 pulses for deep stages (≥ 9).

Training. 8 epochs/stage at lr 5 × 10^-4 with 10% linear warmup (FFN at 5 × 10^-5), up to 5 alignment epochs at 2.5 × 10^-4 with auto-revert, 8 final epochs at 10^-4. CTC loss uses mean reduction and infinite-loss zeroing. MSE sweep: 2 epochs/layer with λ_1 = 0.01, λ_2 = 0.001, θ = 0.1, floor f = 4. AdamW, BF16 mixed precision, batch 48 on NVIDIA H100. Total training cost for the full 10-stage progressive replacement (including MSE sweep, per-stage training, alignment, and final joint tuning) is ~4 GPU-hours on H100.

Table 1: Cumulative ablation at 8/12 layers replaced (dev-clean WER %). Each row adds one technique to the previous configuration. Rows marked † are estimated by correcting an early evaluation bug (−9.3 pp uniform offset, Appendix B). The bottom three rows use the corrected evaluation directly.

  Configuration                                WER     ∆
  Naïve replacement †                          58.33   —
  + Selective init, unfreeze LN †              26.50   −31.8
  + Inter-stage alignment †                    16.18   −10.3
  + FFN co-adapt (0.1× lr) †                   14.64   −1.5
  + Temperature curriculum †                   11.72   −2.9
  + Periodic + positional gates                10.25   −1.5
  + Skip connections, multi-head, 360h data     9.77   −0.5
  + MSE-ordered replacement                     9.35   −0.4

5.1 Ablation

Table 1 isolates each technique at 8/12 layers, where differences are most pronounced.

Negative results.
Doubling pulse count (8 → 16) and adding per-pulse width predictors provided no improvement, because gate capacity is not the bottleneck. Restricting temperature annealing to only the newly-replaced layer (rather than globally) degraded performance by 3–5 pp, confirming the curriculum effect.

5.2 Replacement order matters more than method

The MSE sweep shows large variation in replacement difficulty: early layers (0–2) have MSE 50–60× lower than the final layer, with a general acoustic-to-linguistic trend (Table 6 in Appendix). Fig. 2 shows the progressive WER trajectory across four configurations that cumulatively add training recipe improvements, architectural changes, and difficulty-informed ordering. The most controlled comparison ("+Architecture" vs. "+Order") uses identical architectures, data (360 h), and hyperparameters, differing only in layer replacement order. Deferring the hard layer 8 from position 3 to position 7 reduces 6/12 WER from 7.32% to 5.64% (−1.68 pp, 23% relative). At 8/12, WER drops from 58.33% (naïve) to 9.35% (full recipe) on dev-clean, with 10.61% on test-clean and 27.10% on test-other (Table 8, Appendix). Test-clean tracks dev-clean within ~1 pp at all depths.

5.3 The depth wall

All configurations degrade sharply beyond 8–9 replaced layers. The transition from 9/12 to 10/12 adds +13–20 pp depending on configuration. The MSE sweep explains this directly: the final layers are far harder to approximate. Layer 11 has MSE 60× higher than layer 0 (0.370 vs. 0.006), and layer 10 is 15× harder. We hypothesize this reflects a transition from acoustic to linguistic computation across encoder depth (Sec. 7). The MSE difficulty trend broadly tracks this transition, though non-monotonically (layer 4 is harder than layers 7–9). For deployment, the wall defines a quality–speed tradeoff curve: the MSE sweep tells the practitioner which layers to replace and when to stop.
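The ordering rule in Step 1 of Algorithm 1 reduces to a sort on the diagnostic MSE. A minimal sketch; the values for layers 0, 10, and 11 are the ones quoted above, while the remaining values are illustrative placeholders, not measured results:

```python
# Step 1 of Algorithm 1, reduced to its decision rule: per-layer diagnostic MSE
# against the original attention output is treated as replacement difficulty,
# and layers are replaced easiest-first until the WER budget is exceeded.
def replacement_order(layer_mse):
    """Sort layer indices by increasing diagnostic MSE (easiest first)."""
    return sorted(layer_mse, key=layer_mse.get)

# Layers 0 (0.006), 10 (0.090), and 11 (0.370) match the ratios quoted in
# Sec. 5.3; the other entries are illustrative only.
mse = {0: 0.006, 1: 0.007, 2: 0.008, 3: 0.020, 4: 0.065, 5: 0.030,
       6: 0.040, 7: 0.050, 8: 0.060, 9: 0.055, 10: 0.090, 11: 0.370}

order = replacement_order(mse)
print(order[:8])   # the eight layers replaced before hitting the depth wall
```

With these values, the hard layers 4, 10, and 11 are deferred to the end, mirroring how the sweep pushed wav2vec2's linguistic layers past the 8/12 operating point.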
Figure 2: WER (%) vs. number of replaced layers across four cumulative configurations: Naïve, +Recipe (100 h), +Architecture (360 h), and +MSE order (full data in Table 7, Appendix). MSE-ordered replacement yields lower WER at every stage.

Table 2: SepFormer SI-SNRi (dB) on held-out test-clean during progressive LPA replacement of intra-chunk attention. All 16 layers are replaced without collapse.

  Layers         SI-SNRi   ∆ from baseline
  0 (baseline)   5.69      —
  2/16           8.89      +3.20
  4/16           8.92      +3.23
  8/16           8.14      +2.45
  11/16          7.68      +1.99
  16/16          6.82      +1.13

5.4 Cross-domain validation: speech enhancement

To test whether the depth wall reflects linguistic computation requirements (as hypothesized) rather than an LPA architectural limitation, we apply the same progressive replacement recipe to SepFormer Subakan et al. (2021), a purely acoustic model for speech enhancement.

Setup. SepFormer-WHAM16k (~26M params) uses a dual-path architecture with 2 × 8 = 16 intra-chunk and 2 × 8 = 16 inter-chunk attention layers (embed_dim=256, 8 heads, chunk size K = 250). We replace the 16 intra-chunk layers progressively, using MSE-ordered replacement. The 16 inter-chunk (global) attention layers are retained. LPA uses 12 pulses per layer. Training and evaluation use LibriSpeech audio mixed with Gaussian noise at 5 dB SNR, with separate train (dev-clean, 2,703 files) and held-out evaluation (test-clean, 2,620 files) sets.

Results. Table 2 shows scale-invariant signal-to-noise ratio improvement (SI-SNRi, in dB) on the held-out set during progressive replacement. No depth wall appears. All 16 layers are replaced, with every stage remaining above baseline. Performance peaks at 8.92 dB (4/16) and declines gently to 6.82 dB (16/16, still +1.13 dB above baseline),

Table 3: Inference time (ms) on Apple M4 Pro, batch size 1. PyTorch: MPS backend.
MLX: Metal GPU, FP16 with FP32 accumulation. FP16 attention is slower than FP32 on MPS (the n × n matmul does not benefit from half precision on this hardware), confirming that the MLX speedup reflects LPA's algorithmic advantage.

  Audio   Base (fp32)   Base (fp16)   8/12 (fp32)   12/12 (fp32)   8/12 (MLX)   Speedup vs fp16
  10 s    47            45            59            65             42           1.07×
  30 s    183           186           157           135            125          1.49×
  60 s    500           522           350           246            244          2.14×
  120 s   1668          1767          895           479            540          3.27×

whereas wav2vec2 collapses sharply at 8–9 layers. The MSE difficulty ratio across SepFormer layers is 15× (vs. 60× in wav2vec2), and difficulty does not correlate with depth, consistent with all layers performing acoustic rather than linguistic computation.

Fine-tuning confound. The improvement above baseline (5.69 → 8.92 dB) is unexpected, since replacing attention with a lower-capacity mechanism should not improve performance. A control experiment that fine-tunes all 16 layers with attention intact, using the same training budget, reaches 11.34 dB, confirming that the improvement is primarily a fine-tuning effect (the pretrained checkpoint was trained on WHAM! noise, not Gaussian noise). The relevant comparison is therefore LPA (6.82 dB at 16/16) vs. fine-tuned attention (11.34 dB). LPA underperforms attention but does not collapse, which is the depth-wall finding.

Caveats. The evaluation uses Gaussian noise (simpler than WHAM!). SepFormer's smaller embedding (256 vs. 768) means FFN layers dominate compute, so overall model speedup is only ~1.23×. The value of the SepFormer experiment is the depth-wall validation, not raw speedup.

5.5 Inference speed

In the PyTorch/MPS reference implementation, the crossover where LPA becomes faster than attention occurs at approximately 20 s, because per-kernel dispatch overhead exceeds the quadratic cost saved at shorter durations. A dedicated MLX inference path (Sec.
A), which computes all gate operations as dense GPU kernels in half precision (fp16) and avoids CPU–GPU synchronization, shifts this crossover to below 10 s. At 10 s, MLX LPA achieves a 1.07× speedup over the FP16 attention baseline. Speedup scales linearly with input length (Fig. 3) because the O(n) vs. O(n^2) gap widens indefinitely. At 120 s, the real-time factor is 0.0045. At batch size 1, linear projections dominate LPA time, making pulse count effectively free: 12 → 36 pulses adds only 0.6% (Appendix E). All LPA operations (depthwise convolution, sigmoid, element-wise multiply, linear projection) are compatible with mobile neural accelerators. FlashAttention is unavailable on Apple MPS/Metal, so the O(n^2) baseline reflects the actual on-device cost.

Figure 3: Speedup vs. FP16 attention baseline on Apple M4 Pro, batch 1, for FP32/FP16 attention, 8/12 and 12/12 LPA (FP32), and 8/12 LPA MLX (FP16). Top: inference time (ms, log scale). Bottom: speedup ratio with breakeven line. FP16 attention is slower than FP32 on MPS (Table 3).

6 Comparisons

Comparison to SummaryMixing. SummaryMixing Parcollet et al. (2024) replaces attention with mean-pooled summaries: s = Linear(mean(X)), broadcast and added to a local per-position projection. Both LPA and SummaryMixing are O(n·d^2), but LPA learns temporal selectivity (which positions contribute to each output) while SummaryMixing uses uniform averaging.

Table 4 compares progressive replacement. On SepFormer, both methods use identical training budgets. On wav2vec2, the SummaryMixing implementation processes 2k samples per epoch (vs. LPA's full ~28k), giving LPA a training advantage that prevents direct comparison. Despite this caveat, the directional pattern is informative.
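The contrast between the two aggregation rules is small enough to show directly. A minimal NumPy sketch with a hand-set rectangular gate standing in for the learned predictor, and all linear projections omitted:

```python
import numpy as np

n, d = 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

# SummaryMixing: every position receives the same uniform summary of X.
summary = X.mean(axis=0)                             # shape (d,)

# LPA: a rectangular gate (here fixed, normally predicted) selects
# positions 2..5 only; this is the gated mean of Eq. (2), sans W_V.
g = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
pulse_mean = (g[:, None] * X).sum(axis=0) / g.sum()

print(summary.round(3))
print(pulse_mean.round(3))
```

The gated mean equals the plain average over the gated span, so positions outside the window contribute nothing, which is exactly the temporal selectivity SummaryMixing lacks.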
LPA outperforms SummaryMixing at every wav2vec2 depth, with the gap widening from +1.2 pp at 1/12 to +9.1 pp at 4/12, while on SepFormer (where budgets are matched) SummaryMixing matches or exceeds LPA at all depths. This asymmetry is consistent with intra-chunk attention performing only local acoustic smoothing that mean pooling captures well, while wav2vec2's linguistic layers require temporal selectivity.

Comparison to Mamba-based ASR. ConMamba Jiang et al. (2025) achieves 22.65% WER with 80 h of data. LPA at 8/12 achieves 10.61% with 360 h of fine-tuning, though the comparison is confounded by our pretrained wav2vec2 initialization vs. training from scratch. CALD He and Garner (2025) converts Wav2Vec2-large (317M params) attention to Mamba2 via layerwise distillation, achieving +0.32 pp WER on TED-LIUM. The comparison is not direct because CALD uses a 3× larger model with more redundancy, a different test set, and requires running the full attention model as a teacher during conversion. LPA operates on the smaller wav2vec2-base (95M params) without a teacher model, trading quality for a simpler, teacher-free conversion pipeline. Both approaches confirm that post-hoc attention conversion is viable for speech models.

Quality gap in context. At 8/12 replacement, LPA adds +7.24 pp over the 3.37% baseline, a meaningful degradation. Three factors suggest this gap is reducible rather than fundamental. First, the current recipe uses no distillation: each LPA layer is trained with CTC loss alone, while

Table 4: LPA vs. SummaryMixing (SM) progressive replacement. wav2vec2: dev-clean WER (%, ↓). SepFormer: held-out test-clean SI-SNRi (dB, ↑, 16 intra-chunk layers). SepFormer uses identical training budgets; wav2vec2 SM uses fewer samples per epoch (see text).
             wav2vec2 WER (%)    SepFormer SI-SNRi (dB)
  Layers     LPA      SM         LPA     SM
  0 (base)   3.18     3.18       5.69    5.69
  1          4.26     5.44       7.27    8.38
  2          4.64     7.31       8.89    9.60
  4          6.38     15.47      8.92    9.62
  8          9.92     29.04      8.14    9.12
  16         —        —          6.82    7.88

CALD's near-lossless result (+0.32 pp) and LoLCATs Zhang et al. (2025) both rely on per-layer MSE supervision from the original attention output as an auxiliary signal. Preliminary experiments with an MSE auxiliary loss showed directional improvement but were confounded by simultaneous changes to replacement order. Second, the 3.27× speedup comes with 4 attention layers intact. GatedDeltaNet Yang et al. (2025) independently finds that a 3:1 linear-to-attention ratio is optimal in production (Qwen3-Next), suggesting that retaining a minority of attention layers for hard linguistic computation may be the right operating point rather than a limitation. Third, our hyperparameter search was limited: a single temperature schedule (τ = 3 → 0.5), fixed learning rates, and no per-layer tuning of pulse allocation. Cross-layer skip connections (which proved important for depth ≥ 9) and per-head gate specialization were tested only in the final configuration. Systematic tuning of these choices remains unexplored.

7 Discussion: a new building block

Attention computes a data-dependent weighted average: the softmax of key-query dot products selects which positions contribute and how much. This is a matched filter, optimal when the target signal is buried in competing signals at unknown positions. But when the signal is spatially localized and spectrally distinct from noise, a simpler estimator suffices. Consider a signal x(t) = s(t) + η(t), where s(t) is a target signal nonzero only on some interval [a, b] and η(t) is noise. The sufficient statistic for estimating s is the integral ∫_a^b x(t) dt: uniform averaging over the support, with no weighting needed.
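A quick numerical check of this claim, using a synthetic boxcar signal in unit-variance Gaussian noise; the support endpoints are assumed known here, which is exactly the information the gate predictor supplies:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, b, amp = 4000, 1000, 2000, 0.7
s = np.zeros(n)
s[a:b] = amp                             # target supported on [a, b) only
x = s + rng.normal(scale=1.0, size=n)    # buried in unit-variance noise

# Uniform averaging over the known support: no per-position weighting.
est = x[a:b].mean()
print(round(est, 3))  # near 0.7; the standard error is 1/sqrt(1000) ~ 0.03
```

The unweighted window average recovers the amplitude to within its standard error, while averaging outside the support returns only noise.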
The only information the estimator requires is the support [a, b] itself. This is what LPA's gate predictor learns: where the relevant signal lives (center and half-width), not what it looks like (key-query match scores). The gate predictor replaces content matching with localization, and the rectangular window replaces softmax weighting with uniform integration.

Speech satisfies the localization premise. Phonemes occupy contiguous time intervals. Coarticulation effects span at most a few hundred milliseconds. The acoustic features at layers 0–7 of wav2vec2 are dominated by local spectro-temporal structure, the regime where gated integration extracts a sufficient statistic and the matched-filter machinery of attention is unnecessary. The MSE difficulty gradient quantifies this. Easy layers (MSE ~0.006) have attention patterns that are already near-rectangular, while hard layers (MSE ~0.37) have the content-dependent routing that rectangular windows cannot capture.

The three gate types cover complementary points on the Gabor time-frequency tradeoff Bao et al. (2022); Wang et al. (2025): aperiodic gates provide temporal resolution, periodic gates track rhythmic structure, and positional gates supply fixed structural priors. The concurrent LMWT Kiruluta et al. (2025) applies a similar multi-resolution principle via learned Haar wavelets; LPA differs in using explicit rectangular pulses with learned centers and widths.

LPA returns to local processing but makes the receptive field data-dependent: the gate predictor uses a small depthwise convolution to predict where to look and how wide to integrate. The depth wall marks where this argument breaks down. wav2vec2's deep layers perform linguistic computation that requires the content-dependent global routing that attention provides.
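To make the three gate shapes concrete, here is a minimal NumPy sketch of Eqs. (3)–(4) and the positional gate, with fixed illustrative parameters in place of the learned ones (center, half-width, period, phase, duty cycle, and Fourier coefficients are all predicted or learned in the actual model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = np.arange(300, dtype=float)
tau = 0.5                                  # temperature after annealing

# Aperiodic gate, Eq. (3): soft rectangle with center c and half-width delta.
c, delta = 150.0, 40.0
g_ap = sigmoid((t - c + delta) / tau) * sigmoid((c + delta - t) / tau)

# Periodic gate, Eq. (4): period T, phase phi, duty cycle in (0, 1).
T, phi, duty = 50.0, 0.0, 0.3
g_per = sigmoid((np.cos(2 * np.pi * t / T - phi) - np.cos(np.pi * duty)) / tau)

# Positional gate: sigmoid of a K-term Fourier series over normalized position.
K = 16
t_hat = t / (t.size - 1)
alpha = np.zeros(K)
beta = np.zeros(K)
alpha[0] = 4.0                             # one active sin basis (k = 1)
series = sum(alpha[k] * np.sin(2 * np.pi * (k + 1) * t_hat)
             + beta[k] * np.cos(2 * np.pi * (k + 1) * t_hat) for k in range(K))
g_pos = sigmoid(series / tau)

print(g_ap[150].round(3), g_ap[0].round(3))   # near 1 inside the pulse, near 0 outside
```

At this temperature the aperiodic gate is already near-binary, the periodic gate is open for roughly a `duty` fraction of each period, and the single-basis positional gate switches between the two halves of the sequence.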
The SepFormer cross-validation, where all 16 intra-chunk layers are acoustic and all are replaceable, is the strongest evidence that the wall reflects the nature of the computation, not an architectural ceiling. Further discussion of theoretical perspectives and open questions appears in Appendix F.

8 Limitations

The depth wall at 8–9/12 layers is the primary limitation. We have shown that it correlates with the acoustic–linguistic transition, but have not proven that LPA cannot perform linguistic computation. Different gate shapes or training from scratch might push the wall deeper. The SepFormer cross-validation supports the hypothesis but retains 16 inter-chunk attention layers as a confound. All wav2vec2 results are single-run without confidence intervals, though the consistent depth-wall pattern across four independent configurations (Table 7) provides implicit replication. All results use the same 95M-parameter model; wav2vec2-large and Whisper are untested. The MLX speedup numbers (Table 3) combine LPA's algorithmic O(n) advantage with the framework switch from PyTorch/MPS to MLX; all speedups use the FP16 attention baseline (see caption). LPA is complementary to distillation (which reduces model size) and quantization (which reduces precision). The three axes compose, and LPA's element-wise operations are particularly amenable to low-bit quantization.

9 Conclusion

We introduced the Learnable Pulse Accumulator, a new O(n) building block for sequence mixing. LPA replaces key-query matching with learned gating functions (content-dependent rectangular pulses, periodic Haar-like windows, and position-dependent basis functions) that define where to aggregate information, how wide to look, and at what periodicity to repeat. We developed a practical conversion recipe.
An MSE diagnostic sweep measures per-layer difficulty and determines replacement order, requiring no architecture search. At 8/12 replacement, LPA achieves 10.61% WER on test-clean with a 3.27× measured speedup at 120 s of audio on M4 Pro vs. an FP16 attention baseline, via a dedicated MLX inference path (1.97× in the PyTorch/MPS path). The near-binary gate structure at inference enables dense GPU computation with no CPU–GPU synchronization, and all operations are compatible with mobile neural accelerators. Cross-domain validation on SepFormer speech enhancement confirms the depth wall is domain-specific: all 16 intra-chunk attention layers are replaced without collapse in a purely acoustic model, while wav2vec2's linguistic layers resist replacement.

LPA is not a replacement for attention. It is a lower-cost primitive that works where attention is unnecessary, together with a recipe for determining where that boundary lies. Combining it with teacher distillation and hybrid architectures that retain attention only where needed could close the quality gap and bring efficient long-form ASR to mobile neural accelerators.

Code availability. Code and pretrained checkpoints will be released upon publication.

References

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Chun Bao, Jie Cao, Yaqian Ning, Yang Cheng, and Qun Hao. Rega-net: Retina Gabor attention for deep convolutional neural networks. arXiv preprint arXiv:2211.12698, 2022.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

Eva Feillet, Ryan Whetten, David Picard, and Alexandre Allauzen.
Polynomial mixing for efficient self-supervised speech encoders. arXiv preprint arXiv:2603.00683, 2026. Accepted at ICASSP 2026.

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Mutian He and Philip N. Garner. Joint fine-tuning and conversion of pretrained speech and language models towards linear complexity. In International Conference on Learning Representations (ICLR), 2025.

Xilin Jiang, Yinghao Aaron Li, Adrian N. Florea, Cong Han, and Nima Mesgarani. Speech Slytherin: Examining the performance and efficiency of Mamba for speech separation, recognition, and synthesis. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025. doi: 10.1109/ICASSP49660.2025.10889391.

Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, and Baris Kasikci. LiteASR: Efficient automatic speech recognition with low-rank approximation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020.

Andrew Kiruluta, Priscilla Burity, and Samantha Williams. Learnable multi-scale wavelet transformer: A novel alternative to self-attention. arXiv preprint arXiv:2504.08801, 2025.

Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In International Conference on Learning Representations (ICLR), 2023.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In Proc.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964.

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, and Sourav Bhattacharya. SummaryMixing: A linear-complexity alternative to self-attention for speech recognition and understanding. In Proc. Interspeech, pages 3460–3464, 2024. doi: 10.21437/Interspeech.2024-40.

Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg. Fast Conformer with linearly scalable attention for efficient speech recognition. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8, 2023. doi: 10.1109/ASRU57964.2023.10389701.

Syed Abdul Gaffar Shakhadri, Kruthika KR, and Kartik Basavaraj Angadi. Samba-ASR: State-of-the-art speech recognition leveraging structured state-space models. arXiv preprint arXiv:2501.02832, 2025.

Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is all you need in speech separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 21–25, 2021. doi: 10.1109/ICASSP39728.2021.9413901.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Zhen Wang, Shanshan Fu, Shuang Fu, Debao Li, Dandan Liu, Yexiang Yao, Haobo Yin, and Li Bai.
Hybrid Gabor attention convolution and transformer interaction network with hierarchical monitoring mechanism for liver and tumor segmentation. Scientific Reports, 15(7454), 2025. doi: 10.1038/s41598-025-90151-8.

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning (ICML), 2024.

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with delta rule. In International Conference on Learning Representations (ICLR), 2025.

Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey. Zipformer: A faster and better encoder for automatic speech recognition. In International Conference on Learning Representations (ICLR), 2024.

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. An attention free transformer. arXiv preprint arXiv:2105.14103, 2021.

Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. LoLCATs: On low-rank linearizing of large language models. In International Conference on Learning Representations (ICLR), 2025.

A Edge Deployment: Fused Inference Kernels

A primary motivation for LPA is on-device speech recognition, where quadratic attention is prohibitive. We describe an optimized inference path exploiting the structural properties of trained LPA gates.

A.1 Hard gate inference

During training, the gate temperature anneals from τ = 3 (soft) to τ = 0.5 (near-binary). At inference we set τ → 0, converting all gate computations to hard thresholds:

• Aperiodic gates. The softmax over T positions that finds each pulse center collapses to argmax.
The sigmoid product σ((t − s_p)/τ) · σ((e_p − t)/τ), where s_p = c_p − δ_p and e_p = c_p + δ_p are the pulse start and end, becomes an indicator 1[s_p ≤ t ≤ e_p]. Each pulse selects a contiguous frame range.

• Periodic gates. The cosine-threshold gate σ((cos θ − cos(πd))/τ) becomes a step function. On-regions are determined analytically from the zero-crossings of cos(2πt/p − ϕ) − cos(πd), yielding ⌈T/p⌉ contiguous segments per pulse.

• Learned-basis gates. The sinusoidal basis σ(w⊤f(t)/τ) thresholds to a binary mask, precomputable for any sequence length T since the basis weights are content-independent.

This transformation is exact in the limit τ → 0 and introduces no additional approximation beyond the training curriculum that already drives gates toward binary.

A.2 Prefix-sum accumulation

With hard gates, the dense accumulation a_p = Σ_t g_tp x_t / Σ_t g_tp reduces to a range-sum. For each contiguous segment [s_i, e_i]:

a_p = (1/|S_p|) Σ_i (C[e_i] − C[s_i − 1]),   C[t] = Σ_{t′=0}^{t} x_{t′}    (5)

where C is the prefix-sum of input features (computed once in O(TD)), and |S_p| is the total frame count across all segments. Pulse averages then require only O(P) indexed reads regardless of segment length. The broadcast step is similarly sparse: at each position t, only pulses whose segments contain t contribute to the output. In practice, 3–5 of 12 pulses are active at any given frame.

A.3 MLX implementation

We port the full wav2vec2 + LPA model to Apple's MLX framework, computing all gate operations (convolution, MLP, argmax, thresholding) and accumulation/broadcast as dense GPU operations in FP16 with FP32 accumulation. This eliminates the CPU–GPU synchronization overhead inherent in PyTorch/MPS's kernel dispatch model, where each LPA layer requires ∼15 separate kernel launches.
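As an illustrative sketch of the prefix-sum accumulation in Eq. (5) — our own toy NumPy code, not the released implementation; `pulse_average` and the example segments are hypothetical:

```python
import numpy as np

def pulse_average(x, segments):
    """Average features over a pulse's on-segments via a prefix sum.

    x        : [T, D] input features.
    segments : list of (s, e) inclusive frame ranges for one pulse.

    Each range-sum in Eq. (5) is two indexed reads into the prefix-sum C,
    so the cost is O(#segments), independent of segment length.
    """
    # Zero-padded prefix sum: C[t] = sum of x[0..t-1], so the inclusive
    # range-sum x[s..e] is C[e+1] - C[s] (equivalent to C[e] - C[s-1]
    # under the paper's indexing).
    C = np.concatenate([np.zeros((1, x.shape[1])), np.cumsum(x, axis=0)])
    total = np.zeros(x.shape[1])
    count = 0
    for s, e in segments:
        total += C[e + 1] - C[s]
        count += e - s + 1  # |S_p| accumulates the total on-frame count
    return total / count

# Usage: average frames 2..4 and 8..9 of a toy sequence where frame t
# holds the value t; mean of {2, 3, 4, 8, 9} = 26 / 5 = 5.2.
x = np.arange(10, dtype=np.float64).reshape(10, 1)
a = pulse_average(x, [(2, 4), (8, 9)])
```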
The main optimization is computing binary gates as dense tensors entirely on GPU, then using matrix multiplication for the accumulate step, S = G⊤X, where G is the [T × P] binary gate matrix and X is the [T × D] input. Despite G being binary, the dense matmul is faster than sparse alternatives at these dimensions (T ≤ 6000, P = 12) because it avoids CPU–GPU round-trips for segment extraction. For non-LPA encoder layers, we use MLX's fused scaled_dot_product_attention kernel. The full model runs in FP16 with FP32 accumulation for numerical stability.

A.4 Results

Table 5: Inference latency on Apple M4 Pro (36 GB, 273 GB/s), batch = 1, with 8/12 LPA layers replaced. Speedup is relative to the attention baseline. FP16 attention is slower than FP32 on MPS, so the precision-fair comparison favors LPA even more.

Audio | Attn (fp32) | Attn (fp16) | LPA PT (fp32) | LPA MLX (fp16) | Speedup vs fp16
10 s  | 47.1 ms   | 45.3 ms   | 58.8 ms  | 42.4 ms  | 1.07×
30 s  | 182.7 ms  | 186.1 ms  | 156.9 ms | 124.9 ms | 1.49×
60 s  | 499.7 ms  | 521.6 ms  | 349.5 ms | 244.2 ms | 2.14×
120 s | 1668.0 ms | 1767.1 ms | 894.8 ms | 540.1 ms | 3.27×

The MLX LPA model achieves a 3.27× speedup over the FP16 attention baseline at 120 s and is 1.66× faster than the PyTorch/MPS LPA path, confirming that LPA's gate structure is amenable to framework-level optimization. At 120 s, the real-time factor is 0.0045, well within real-time constraints for streaming speech recognition. The hard-gate path introduces no measurable WER degradation: evaluation at τ = 0.01 vs. τ → 0 produces identical transcriptions on LibriSpeech dev-clean.

A.5 Discussion

The speedup derives primarily from LPA's O(n) gate structure replacing O(n²) attention, providing an increasing advantage at longer sequences (1.07× at 10 s vs. 3.27× at 120 s).
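A minimal NumPy sketch of the dense accumulate step S = G⊤X described in A.3, with per-pulse normalization to recover pulse averages. The gate layout here is synthetic (random contiguous on-ranges), purely to illustrate the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, P = 6000, 768, 12           # frames (120 s), feature dim, pulses

X = rng.standard_normal((T, D)).astype(np.float32)

# Hypothetical hard gates: each pulse is on over one contiguous 500-frame
# range (real gates come from the trained pulse parameters).
G = np.zeros((T, P), dtype=np.float32)
for p in range(P):
    s = int(rng.integers(0, T - 500))
    G[s : s + 500, p] = 1.0

# Accumulate step as one dense matmul: no gather/scatter, no CPU
# round-trip for segment extraction.
S = G.T @ X                             # [P, D] per-pulse sums
counts = G.sum(axis=0, keepdims=True).T  # [P, 1] on-frames per pulse
A = S / counts                          # [P, D] per-pulse averages
```

Even though G is binary and sparse, the [P × T] @ [T × D] product is a single dense GEMM at these dimensions, which is the design choice the paper reports as faster than sparse alternatives on MPS/Metal.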
Running attention itself in FP16 on MPS is actually slower than FP32 (Table 5), because the n × n attention matrix does not benefit from half precision on Apple Silicon at these sequence lengths. The framework switch from PyTorch/MPS to MLX contributes fused kernel dispatch, but the precision change does not inflate the reported speedup. The structural sparsity of LPA gates (contiguous on-regions separated by off-regions) is not an incidental property but a direct consequence of the rectangular pulse parameterization. At inference, this binary gate structure enables the accumulation to be expressed as a simple matrix multiplication G⊤X with a sparse binary G, requiring no special sparse kernels at the dimensions typical of speech (T ≤ 7500, P = 12). For longer sequences where the dense matmul becomes expensive, a prefix-sum approach (Eq. 5) with a custom Metal kernel can reduce accumulation to O(P) indexed reads, though at these dimensions the CPU–GPU synchronization overhead of segment extraction outweighs the compute savings.

B Evaluation Methodology

Early experiments used batched evaluation with naïve padding. The CTC decoder hallucinated predictions for padded regions, inflating all WER values by approximately 9 pp. The corrected evaluation uses per-sample output length computation to mask padding before CTC decoding. Training-time evaluation (10 s audio filter) yields 3.34% on dev-clean. Full evaluation (30 s filter) yields 3.18% on dev-clean and 3.37% on test-clean, matching the published 3.4%. Later experiments use the corrected evaluation natively. For the Naïve column in Table 7 and the top rows (†) of Table 1, we apply a uniform −9.3 pp offset to the inflated values from the early runs. This offset is approximate: the baseline correction is 12.67% → 3.34% (−9.33 pp), and we verified that the corrected 8/12 result (10.25%) is within 0.6 pp of the offset-corrected estimate.
These values should be treated as approximate.

C Supplementary Tables

Table 6: Per-layer MSE difficulty from the elastic-net diagnostic sweep, layers sorted by increasing difficulty. The sweep overprovisions each layer with 144 pulses (12 × 3 types × 4 heads) and uses L1/L2 regularization to prune; the surviving count indicates how much capacity each layer demands. The MSE ranking determines replacement order (Sec. 5.2). The per-layer allocation suggested by the sweep has not been validated end-to-end (Appendix F).

Layer | MSE   | Surviving / 144 | Difficulty
L0    | 0.006 | 48  | easy
L2    | 0.007 | 48  | easy
L1    | 0.007 | 51  | easy
L3    | 0.021 | 57  | moderate
L6    | 0.030 | 74  | moderate
L5    | 0.034 | 75  | moderate
L7    | 0.043 | 87  | hard
L8    | 0.046 | 100 | hard
L9    | 0.052 | 87  | hard
L4    | 0.065 | 89  | hard
L10   | 0.096 | 96  | very hard
L11   | 0.370 | 107 | very hard

Table 7: Best WER (%) during progressive replacement. Columns show cumulative recipe improvements: training techniques (100 h), architectural additions with 360 h data, and deferred replacement order with MSE auxiliary loss. Naïve column estimated via uniform offset correction (Appendix B).

Layers | Naïve | +Recipe | +Architecture | +Order
0  | 3.34  | 3.34  | 3.34  | 3.34
1  | 4.59  | 4.88  | 4.38  | 4.25
2  | 6.21  | 5.17  | 4.71  | 4.67
3  | 9.94  | 7.17  | 6.55  | 4.77
4  | 10.43 | 6.93  | 6.35  | 5.31
5  | 13.20 | 7.73  | 7.24  | 5.26
6  | 14.42 | 7.78  | 7.32  | 5.64
7  | 19.52 | 8.05  | 7.74  | 7.64
8  | 58.33 | 10.25 | 9.77  | 9.35
9  | 69.57 | 17.35 | 14.88 | 14.19
10 | —     | 37.51 | 28.13 | 28.02

Table 8: WER (%) across evaluation splits for the +Architecture configuration. All results use greedy CTC decoding without a language model. The dev-clean baseline differs from Table 7 (3.18% vs. 3.34%) because this evaluation includes utterances up to 30 s while training-time evaluation filters at 10 s.
Layers | dev-clean | test-clean | test-other
0  | 3.18  | 3.37  | 8.67
1  | 4.26  | 4.63  | 11.83
2  | 4.64  | 4.78  | 12.67
4  | 6.38  | 6.62  | 16.52
6  | 7.35  | 7.90  | 18.98
7  | 7.90  | 8.50  | 20.84
8  | 9.92  | 10.61 | 27.10
9  | 15.82 | 16.56 | 37.08
10 | 30.39 | 31.62 | 54.51

D Inference Benchmark Details

Full benchmark results on Apple M4 Pro, including per-configuration timing and memory estimates. PyTorch columns use the MPS backend in FP32; MLX uses Metal GPU in FP16 with FP32 accumulation.

Table 9: Inference time (ms) across configurations and durations on Apple M4 Pro, batch 1. PT = PyTorch/MPS. MLX = MLX/Metal (FP16). Speedup is relative to the attention baseline (FP16).

Audio | Base (fp32) | Base (fp16) | 8/12 PT (fp32) | 12/12 PT (fp32) | 8/12 MLX (fp16) | Speedup 12/12 PT | Speedup 8/12 MLX
10 s  | 47   | 45   | 59  | 65  | 42  | 0.69× | 1.07×
30 s  | 183  | 186  | 157 | 135 | 125 | 1.38× | 1.49×
60 s  | 500  | 522  | 350 | 246 | 244 | 2.12× | 2.14×
120 s | 1668 | 1767 | 895 | 479 | 540 | 3.69× | 3.27×

Table 10: Per-layer peak memory for gates/attention weights (float32).

Audio | Frames | Attention | LPA (12p) | Ratio
10 s  | 500    | 0.95 MB   | 23 KB  | 42×
30 s  | 1,500  | 8.6 MB    | 70 KB  | 125×
60 s  | 3,000  | 34.3 MB   | 140 KB | 250×
120 s | 6,000  | 137.3 MB  | 281 KB | 500×

E Roofline Analysis: Pulse Count at Batch Size 1

The MSE sweep allocates up to 36 pulses per layer for hard layers (vs. 12 base). A naïve FLOP analysis would suggest 3× slower LPA inference. A roofline analysis on M4 Pro (273 GB/s bandwidth, ∼16.7 TFLOPS fp16) reveals this is wrong. At B = 1, the LPA accumulation step is memory-bandwidth-bound, and the linear projections are near the compute–bandwidth crossover.

• Attention must read/write the n × n score matrix per head. At T = 6000 with 12 heads, total score storage is 864 MB in FP16, the dominant cost.

• LPA accumulation reads the input tensor X ∈ R^{T×D} once (18.4 MB at T = 6000 in FP32). The gate tensor G ∈ R^{T×P} adds only P × T × 4 bytes (864 KB at P = 36), just 4.7% of the input read.
Table 11: Roofline analysis: per-layer inference time (µs) on M4 Pro, B = 1, T = 6000 (120 s audio). Four 768 × 768 linear projections dominate LPA time regardless of pulse count.

Component                 | Attention | LPA (P=12) | LPA (P=36)
Linear projections        | 1,696     | 1,696      | 1,696
QK⊤ / gate prediction     | 3,311     | 797        | 807
Softmax / element-wise    | 2,110     | 201        | 201
Scores × V / accumulation | 3,311     | 136        | 142
Total per layer           | 10,428    | 2,835      | 2,851
× 12 layers               | 125 ms    | 34.0 ms    | 34.2 ms

Going from 12 to 36 pulses costs +16 µs per layer (+0.2 ms across 12 layers), a 0.6% increase. The four 768 × 768 linear projections account for 60% of LPA time and are completely pulse-count-independent.

Figure 4: Per-layer memory for gates (LPA, P = 12) vs. the attention score matrix across audio durations (FP32). LPA memory grows linearly while attention grows quadratically, giving ratios of 42×, 125×, 250×, and 500× at 10 s, 30 s, 60 s, and 120 s.

F Theoretical Perspectives

F.1 The local–global–learned-local circle

LPA's intellectual lineage traces a circle in speech recognition architecture.

Local (1989): Time-Delay Neural Networks used shared weights across time with fixed local receptive fields.

Global (2017–present): Self-attention provided pairwise interactions across all positions, enabling arbitrary-range dependencies, but at O(n²) cost.

Learned-local (this work): LPA returns to local processing but makes the receptive field data-dependent. The gate predictor uses a small depthwise convolution to predict where to look and how wide to integrate.

F.2 Open questions

Is the depth wall fundamental? We have shown that the wall correlates with the acoustic–linguistic transition, but have not proven that LPA cannot perform linguistic computation. Different gate shapes or training from scratch might push the wall deeper.

Fine-tuning confound.
The SepFormer control experiment (fine-tuning with attention intact: 11.34 dB vs. LPA's 6.82 dB at 16/16) shows that improvement over the pretrained baseline is partly driven by continued training. Disentangling these effects requires matched-budget attention fine-tuning at each wav2vec2 stage.

Training from scratch. All experiments convert a pretrained attention model. The pretrained representations may be biased toward patterns attention can compute but LPA cannot. Samba-ASR's success training Mamba from scratch (1.17% WER, 10K+ hours) suggests this is viable.

Per-layer allocation. The sweep suggests variable pulse allocation (Table 6), but all progressive results use fixed allocation. The sweep-derived allocation collapsed at 8/12 due to alignment instability, so its value is unvalidated.

Generalization. wav2vec2-large (317M, 24 layers) and Whisper are untested. Larger models may have more redundancy or more distributed representations.