Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML
Authors: Yassien Shaalan
Yassien Shaalan (yassien@gmail.com)

ABSTRACT

Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1×1 pointwise (PW) mixers often dominate memory even after INT8 quantization across vision, audio, and wearable sensing. We present HYPERTINYPW, a compression-as-generation approach that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and adds only a one-off synthesis cost; steady-state latency/energy match INT8 separable CNN baselines. Enforcing a shared latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We contribute (i) TinyML-faithful packed-byte accounting covering the generator, heads/factorization, codes, kept PW1, and backbone; (ii) a unified evaluation with validation-tuned t⋆ and bootstrap CIs; and (iii) a deployability analysis (integer-only inference, boot vs. lazy synthesis). On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPERTINYPW shifts the macro-F1 vs. flash Pareto: at ~225 kB it matches a ~1.4 MB CNN while being 6.31× smaller (84.15% fewer bytes), retaining ≥95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, indicating a wider role for compression-as-generation in resource-constrained ML systems.
1 INTRODUCTION

Deep learning models for biosignal analytics, such as ECG, are increasingly expected to run directly on microcontrollers (MCUs). On-device inference improves privacy, reliability, and energy proportionality, since data never leaves the sensor and decisions can be made in real time. Yet MCUs offer only tens of kilobytes of flash and SRAM and limited compute extensions (e.g., Arm M-series DSP). These constraints create a sharp tension between the promise of local analytics and the cost of deploying modern convolutional neural networks.

Among TinyML backbones, separable 1D CNNs are attractive: depthwise (DW) convolutions dominate multiply-accumulates, while pointwise (PW, 1×1) convolutions concentrate most parameters. Unfortunately, even after INT8 quantization, multiple PW layers still dominate flash usage, often pushing total storage beyond 64 kB. Thus the PW mixers, not the depthwise layers, become the limiting factor for MCU deployment.

Classical compression techniques (quantization, pruning, low-rank or tensor factorization) shrink parameters but still store a full set of weights for every PW layer. Dynamic weight generation (HyperNetworks, DFN, CondConv) can reduce redundancy but typically generates kernels per input, adding branching, SRAM pressure, and latency jitter incompatible with real-time MCU workloads. What is missing is a strategy that directly targets the PW bottleneck while respecting strict device constraints: no per-example branching, minimal SRAM, and unmodified integer kernels.

We propose to replace stored PW weights with weights generated once per layer at load time. Our method, HYPERTINYPW, uses a shared micro-MLP to synthesize most PW kernels from tiny per-layer codes, while deliberately keeping the first PW (PW1) in INT8 to anchor morphology-sensitive early mixing.
Generation occurs once, at boot or lazily when first needed; synthesized weights are then cached and reused. Inference proceeds entirely with standard integer operators, ensuring compatibility with existing TinyML stacks. We also contribute a TinyML-faithful packed-byte accounting that includes the generator, its heads, the retained PW1, codes, and the backbone.

Beyond storage savings, the design enforces a shared latent basis across layers. This reduces redundancy and acts as an efficient-coding prior: layers reuse common generative factors rather than relearning separate mixers. From a representation-learning perspective, this behaves like implicit multi-task regularization across layers, helping preserve balanced detection even under tight flash budgets. An analogy is skill reuse in humans: experts deploy compact routines (rhythmic or grammatical motifs) across many tasks. Similarly, our generator emits shared transformations that multiple layers can repurpose.

Beyond compression, our goal is to make channel mixing deployable on real MCUs. We therefore (i) report packed-byte sizes that match what ships; (ii) keep inference on unmodified integer kernels (CMSIS-NN/TFLM compatible); and (iii) characterize boot vs. lazy synthesis, peak SRAM, and the latency/energy impact of one-shot generation. This positions cross-layer generative mixing as a systems design for TinyML, not just a model variant.

We validate the approach on three representative ECG tasks, Apnea-ECG (minute-level apnea detection), PTB-XL (diagnostic proxy), and MIT-BIH (AAMI arrhythmia grouping), using record/patient-wise splits to avoid identity leakage. Experiments include comprehensive ablations (hybrid vs. all-synth, latent sizes (d_z, d_h, r), precision 4-8 bit, knowledge distillation, focal loss), structured alternatives under equal flash, and system-level boot-time and SRAM measurements.
Finally, we analyze the accuracy-memory Pareto frontier, offering guidance for MCU-constrained deployment.

Contributions. (1) Compression-as-generation for PW layers. A shared micro-MLP synthesizes most 1×1 mixers from tiny per-layer codes once at load time, while PW1 remains INT8 to stabilize early morphology; steady-state inference uses standard integer kernels (no custom ops). (2) TinyML-faithful accounting and deployment. Exact packed-byte sizes (generator, heads/factorization, codes, kept PW1, backbone), plus a deployment analysis of boot vs. lazy synthesis, SRAM peaks, and compatibility with CMSIS-NN/TFLM. (3) Latency/energy profiling. A lightweight pipeline that reports per-inference latency and energy on MCU backends (virtual and on-device), isolating generation overhead from steady-state compute. (4) Rigorous validation under MCU budgets. Cross-dataset results (Apnea-ECG, PTB-XL, MIT-BIH), ablations over (d_z, d_h, r) and precision (4-8 bit), and Pareto curves of macro-F1 vs. packed flash that expose the ~225 kB elbow.

At ~225 kB, HYPERTINYPW compresses a 1.4 MB baseline by 6.31× (84.15% fewer bytes) while retaining 95% of its macro-F1 on Apnea-ECG and PTB-XL. This places HYPERTINYPW at the mid-budget elbow of the accuracy-flash Pareto while preserving a TinyML-faithful deployment path (packed-byte accounting, one-shot load-time synthesis, and integer-only inference).

2 RELATED WORK

Our work connects four areas: TinyML backbones, compression of channel mixing, dynamic weight generation, and ECG deep learning. Below we situate HYPERTINYPW in each thread and address likely counter-arguments.

Tiny models and MCU deployment. MobileNet-style backbones (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) reduce MACs via depthwise (DW) layers but concentrate parameters in 1×1 pointwise (PW) mixers.
MCU software stacks (CMSIS-NN, TFLM) enable INT8 inference (Lai et al., 2018), MLPerf Tiny standardizes tasks (Banbury et al., 2021), and co-design (MCUNet, Once-for-All) explores NAS for fit/latency (Lin et al., 2020; Cai et al., 2020). One might argue that NAS can simply remove or shrink PW layers. Empirically, completely eliminating PW mixing collapses channel reuse and hurts accuracy; even after NAS, remaining PW layers dominate flash. HYPERTINYPW is orthogonal: it retains PW expressivity but amortizes storage across layers.

Compression methods. Quantization (Jacob et al., 2018), pruning/sparsity (Han et al., 2016; Frankle and Carbin, 2019), and low-rank/tensor factorization (Denton et al., 2014) reduce parameters, yet the model still stores a parameterization per PW layer. In the 32-64 kB regime, several PW matrices remain the bottleneck. A counter-claim is that aggressive low-rank or k-sparse PW will suffice. However, these approaches attack within-layer redundancy and do not capture cross-layer regularities; they also incur per-layer metadata that adds up at TinyML scale. HYPERTINYPW instead ties layers through a shared generator and tiny layer codes, and can compose with low-rank/sparse heads (we evaluate structured baselines at equal flash).

Structured transforms. Algebraic operators (circulant, Toeplitz, Kronecker, ACDC) (Sindhwani et al., 2015; Moczulski et al., 2016) shrink matrices and sometimes speed up inference. It may seem these remove the need for generation. In practice, they restrict each PW independently, leaving the per-layer storage pattern intact; they also require kernel support to realize speedups on MCUs. Our approach is complementary: structured heads can further compress H_l (or A_l, B), but the novelty is cross-layer synthesis that leaves inference kernels unchanged.

Dynamic and generated weights.
HyperNetworks (Ha et al., 2017), dynamic filter networks (Jia et al., 2016), CondConv (Yang et al., 2019), and dynamic convolution (Chen et al., 2020) generate or mix kernels per input, which suits GPUs/TPUs. A natural question is whether small dynamic modules could also fit MCUs. The barrier is not only parameter count but per-input control flow, SRAM peaks, and latency jitter, which violate real-time/energy budgets. HYPERTINYPW generates once at load time, caches weights, and then runs standard INT8 inference with no runtime control changes.

ECG deep learning and TinyML. Large ECG models achieve strong accuracy (Rajpurkar et al., 2017; Hannun et al., 2019; Ribeiro et al., 2020) on datasets such as PTB-XL (Wagner et al., 2020) and MIT-BIH (Moody and Mark, 2001), but embedded deployments often omit packed-byte accounting or exceed MCU flash. One might claim that downsampling or binarization makes any model fit. In practice, these shortcuts degrade morphology-sensitive detection and still leave multiple PW layers too large. We explicitly report deployable packed bytes (generator, heads, codes, kept PW1, backbone) and demonstrate record/patient-wise generalization within 32-64 kB.

Positioning. Relative to compression, HYPERTINYPW avoids per-layer storage by synthesizing most PW weights from a shared micro-MLP and tiny codes; relative to structured transforms, it introduces cross-layer parameter tying that can coexist with algebraic heads; relative to dynamic methods, it eliminates per-input cost by performing layer-constant generation at load time; relative to prior ECG TinyML, it provides TinyML-faithful accounting and a deployable path under strict MCU budgets. The combination of a shared generator, a keep-first-PW hybrid, and packed-byte evaluation appears novel in MCU-scale biosignal inference.
3 METHOD

Our goal is to compress separable CNNs for microcontrollers by replacing most stored pointwise (PW) channel mixers with compact, generated representations. Instead of storing every PW kernel, we store only a tiny per-layer code and use a shared generator to expand these codes into full kernels once at load time. This preserves compatibility with standard INT8 inference while tying layers together through a shared latent basis.

The section is organized as follows. We first introduce the setup and notation for separable CNNs. We then describe our generative channel mixing approach, including how codes, generator, and heads interact. Next, we detail deployable storage via packed-byte accounting, present the training objective, and explain the calibration pipeline. Finally, we outline MCU deployment options, show the load-time synthesis algorithm, and compare complexity to conventional PW stacks.

3.1 Setup and notation

We consider a compact separable 1D CNN for ECG: each block applies a depthwise temporal convolution (DW) followed by a 1×1 PW channel mixer. Let $x \in \mathbb{R}^{C_{\mathrm{in}} \times T}$ be the input (channels × time). A PW layer $l$ multiplies the channel dimension by $W_l \in \mathbb{R}^{C^{(l)}_{\mathrm{out}} \times C^{(l)}_{\mathrm{in}}}$ (bias omitted). In conventional TinyML deployment, every $W_l$ is stored (typically INT8), and the sum $\sum_l C^{(l)}_{\mathrm{out}} C^{(l)}_{\mathrm{in}}$ dominates flash.

3.2 Generative channel mixing (layer-constant synthesis)

Instead of storing each full PW matrix, we assign a small code $z_l$ to each layer and let a shared generator produce its weights once at load time. Crucially, generation is layer-constant, never per input: the synthesized weights are cached and then reused by standard integer kernels. Each code $z_l \in \mathbb{R}^{d_z}$ is mapped by the generator $g_\phi$ into an embedding:

$$h_l = g_\phi(z_l), \qquad (1)$$

where $h_l \in \mathbb{R}^{d_h}$ is a compact hidden vector summarizing layer $l$.
A light per-layer head $H_l$ then projects this embedding into the vectorized kernel, which is reshaped into the full PW matrix:

$$\hat{w}_l = H_l h_l \in \mathbb{R}^{C^{(l)}_{\mathrm{out}} C^{(l)}_{\mathrm{in}}}, \qquad \widehat{W}_l = \mathrm{reshape}\big(\hat{w}_l,\, C^{(l)}_{\mathrm{out}},\, C^{(l)}_{\mathrm{in}}\big). \qquad (2)$$

To compress further, $H_l$ can be factorized into a small per-layer adapter $A_l$ and a shared matrix $B$:

$$H_l = A_l B, \qquad A_l \in \mathbb{R}^{(C^{(l)}_{\mathrm{out}} C^{(l)}_{\mathrm{in}}) \times r}, \qquad B \in \mathbb{R}^{r \times d_h}, \qquad r \ll d_h. \qquad (3)$$

This means most of the capacity lives in $B$ (shared), while each layer stores only its lightweight adapter $A_l$. Because early mixing is morphology-sensitive, we deliberately keep PW1 as a stored INT8 layer and synthesize only PW2:L.

3.3 Deployable storage: packed-byte accounting

We carefully count the deployable flash footprint. For each tensor $\tau$ with $N_\tau$ elements stored at bitwidth $b_\tau$, the footprint is

$$\frac{N_\tau\, b_\tau}{8} \ \text{bytes}. \qquad (4)$$

The total flash is the sum across generator parameters, heads (or factorized $A_l, B$), codes, the kept PW1, and backbone layers. We quantize $\{\phi, H_l, z_l\}$ to 4/6/8 bits, while keeping the stem, DW, PW1, and classifier at INT8. This ensures a fair, deployable accounting.

3.4 End-to-end training (stability, size, and imbalance-awareness)

We train the generator and student jointly with AdamW, applying GN(1) instead of BN for small-batch stability, NaN-safe initialization, gradient clipping, and an EMA of weights.

Figure 1: Architecture overview. Depthwise (blue), PW1 kept in INT8 (green), synthesized PW layers (orange), classifier head (purple). The shared generator (red) produces PW2:L once at load time and caches them (gray); steady-state inference uses standard integer kernels.
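To make Eqs. (1)-(4) concrete, the following pure-Python sketch walks one layer through layer-constant synthesis. The layer width (32×16), the micro-MLP depth, and the random weights are illustrative stand-ins, not the trained model:

```python
import random

random.seed(0)

# Illustrative sizes: the paper sweeps (d_z, d_h) in {(4, 12), (6, 16)};
# the 32x16 PW layer and two-layer micro-MLP are hypothetical choices.
d_z, d_h, r = 6, 16, 4
C_out, C_in = 32, 16

def rand_mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Shared micro-MLP g_phi: R^{d_z} -> R^{d_h} (random stand-ins for phi).
W1, W2 = rand_mat(d_h, d_z), rand_mat(d_h, d_h)

def g_phi(z):
    h = [max(0.0, a) for a in matvec(W1, z)]   # ReLU hidden layer
    return matvec(W2, h)                        # embedding h_l (Eq. 1)

# Factorized head H_l = A_l B (Eq. 3): A_l is per-layer, B is shared.
A_l = rand_mat(C_out * C_in, r)
B = rand_mat(r, d_h)

# Code -> embedding -> vectorized kernel -> PW matrix (Eqs. 1-2).
z_l = [random.gauss(0, 1) for _ in range(d_z)]
w_l = matvec(A_l, matvec(B, g_phi(z_l)))        # apply B h_l first (only r values)
W_hat = [w_l[i * C_in:(i + 1) * C_in] for i in range(C_out)]

# Eq. 4: a tensor with n elements at bit-width b packs to n*b/8 bytes.
packed_bytes = lambda n, b: n * b / 8

# Factorizing the head shrinks it by roughly d_h/r at equal bit-width.
head_dense = packed_bytes(C_out * C_in * d_h, 6)
head_fact = packed_bytes(C_out * C_in * r + r * d_h, 6)
print(len(W_hat), len(W_hat[0]), head_fact < head_dense)
```

On device, this computation runs once per layer at load time, after which the synthesized matrix is quantized to INT8 and handed to the ordinary PW kernel.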
Figure 2: Validation thresholding pipeline (validation logits → sigmoid → median filter, k=5 → sweep t ∈ [0.05, 0.95] → macro-F1(t) → t⋆).

The composite loss is

$$\mathcal{L} = \mathrm{CE}(y, \hat{y}) + \lambda_{\mathrm{foc}}\,\mathrm{Focal}_\gamma(y, \hat{y}) + \lambda_{\mathrm{KD}}\,\mathrm{KL}\big(\sigma(\hat{y}/T)\,\|\,\sigma(\hat{y}_{\mathrm{teach}}/T)\big) + \lambda_{\mathrm{feat}}\,\|\hat{f} - \hat{f}_{\mathrm{teach}}\|_2^2 + \lambda_{\mathrm{softF1}}\,\mathcal{L}_{\mathrm{softF1}}(\hat{y}, y) + \lambda_{\mathrm{spec}}\,R_{\mathrm{spec}}(\theta) + \lambda_{\mathrm{size}}\,\|\theta_{\mathrm{heads,codes}}\|_1, \qquad (5)$$

where CE drives baseline accuracy; focal loss counteracts class imbalance by emphasizing hard or minority examples; KL distillation and feature matching transfer knowledge from a larger teacher; soft-F1 optimizes directly for the evaluation metric; spectral regularization stabilizes dynamics by constraining layer smoothness; and the L1 penalty enforces compact codes and heads to reduce flash usage. Jointly, these terms promote stability, imbalance-awareness, and size efficiency, enabling HYPERTINYPW to converge reliably while remaining sensitive to rare events under microcontroller constraints. Unlike prior TinyML training objectives that rely on only CE or CE+KD, this formulation unifies metric-aware, imbalance-aware, and compression-aware terms into a single end-to-end objective, making the loss itself a vehicle for co-designing accuracy, robustness, and deployability.

3.5 MCU deployment: boot vs. lazy synthesis

Synthesis runs once per layer, and cached weights are reused thereafter. Two schedules are possible: boot synthesis, generating all PW2:L at startup (faster inference, longer boot), or lazy synthesis, generating each PW_l on first use (shorter boot, one-time stall). Steady-state inference always uses INT8 kernels; g_phi is never called per input. Peak SRAM is bounded by the largest PW plus activations, and weights can be streamed to flash if needed.

Algorithm 1: Load-time synthesis and caching (MCU side)
1: for l = 2 to L do
2:   h_l ← g_phi(z_l)  (Eq. 1)
3:   w_l ← H_l h_l  (or A_l B h_l, Eqs. 2-3)
4:   PW_l ← reshape(w_l); cache PW_l
5: Inference: run INT8 DW/PW with cached {PW_l}; no calls to g_phi per input.
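The lazy schedule of §3.5 amounts to memoizing synthesis per layer. A minimal sketch, in which the `synthesize` stand-in and the layer count L=5 are hypothetical placeholders for the real generator and network:

```python
# Lazy load-time synthesis: each PW layer is generated on first use and
# cached; steady-state inference only ever hits the cache.
calls = {"synth": 0}

def synthesize(l):
    """Stand-in for h_l = g_phi(z_l); W_l = reshape(H_l h_l) (Eqs. 1-3)."""
    calls["synth"] += 1
    return f"INT8 weights for PW{l}"  # placeholder for the real tensor

_cache = {}

def get_pw(l):
    if l not in _cache:          # one-time stall on first use (lazy schedule)
        _cache[l] = synthesize(l)
    return _cache[l]             # later inferences reuse the cached weights

L = 5
for _ in range(100):             # 100 inferences touching PW2..PW5
    for l in range(2, L + 1):
        _ = get_pw(l)

print(calls["synth"])            # each layer was synthesized exactly once
```

A boot schedule would simply call `get_pw(l)` for every layer at startup; both schedules serve identical cached weights in steady state.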
Regarding kernel compatibility, synthesis emits INT8-quantized PW tensors laid out to match the standard 1×1 conv GEMV paths in CMSIS-NN/TFLM. No graph rewrites or per-example control flow are introduced. As a result, deployment reduces to a one-time "weight install" followed by vanilla integer inference.

3.6 Complexity and storage

A standard PW stack stores $\sum_l C^{(l)}_{\mathrm{out}} C^{(l)}_{\mathrm{in}}$ INT8 parameters. In HYPERTINYPW, the stored footprint becomes

$$\mathrm{flash}_{\mathrm{total}} = |\phi| + \Big(\sum_l |H_l|\ \text{or}\ \sum_l |A_l| + |B|\Big) + \sum_l |z_l| + |\mathrm{PW}_1| + |\mathrm{stem}, \mathrm{DW}, \mathrm{head}|. \qquad (6)$$

With $(d_z, d_h, r) \ll C^{(l)}_{\mathrm{out}} C^{(l)}_{\mathrm{in}}$, the packed bytes (Eq. 4) drop sharply while inference cost matches the baseline. Boot/lazy synthesis adds a one-off cost proportional to the total PW size.

3.7 VAE-head baseline (decoder-free at deploy)

As an additional baseline, we evaluate a lightweight 1D VAE encoder whose latent z is classified by a small MLP. The decoder is used only during training, for latent regularization and reconstruction consistency, then discarded at deployment; the stored footprint therefore includes only the encoder and classification head, reported under the same packed-byte accounting. This baseline was originally considered to complement our generative synthesis approach by testing whether a learned latent manifold could implicitly encode meaningful structural priors without explicit parameter generation. However, while the VAE-head achieves competitive compactness, it lacks the ability to synthesize or adapt weights across layers, and its latent-space regularization often trades off discriminative sharpness for reconstruction fidelity. As a result, it serves mainly as a contrastive probe showing that generative weight synthesis, rather than purely latent compression, is responsible for the gains observed in HYPERTINYPW.

4 SYSTEM PROFILING AND DEPLOYMENT

Targets and kernels.
We target Arm M-series MCUs running TFLM with CMSIS-NN integer kernels; all PW layers execute through stock 1×1 conv/GEMV paths. Our method requires no custom ops: synthesized PW tensors are cached as ordinary INT8 weights.

Profiling methodology (proxies). We separate one-shot load-time synthesis from steady state. Steady-state latency is obtained from an instruction/cycle model of the CMSIS-NN/TFLM kernels on a virtual MCU configuration (e.g., Arm FVP / Renode / QEMU). Energy per inference is derived from cycles via a calibrated board-level current model (normalized nJ/inference). These are hardware-agnostic proxies intended for model-to-model comparison rather than absolute device benchmarking.

Boot vs. lazy synthesis. Boot compiles PW2:L at startup (higher startup time, no first-inference stall). Lazy compiles on first use (fast boot, one-time per-layer stall). Both schedules yield identical steady-state latency because inference runs entirely on cached INT8 weights.

SRAM peaks and streaming. Peak SRAM is the maximum of {largest PW activation tensor, workspace}. If writable flash is available, synthesized PW tensors can be streamed back to flash to cap SRAM peaks; otherwise we synthesize layer-wise with buffer reuse.

Portability. Because inference uses unmodified integer kernels and generation happens only at load time, the approach ports to any TFLM-compatible stack with INT8 convolutions. We release scripts to regenerate packed-byte counts and cycle/energy proxies.

5 EXPERIMENTAL SETUP

5.1 Datasets & Preprocessing

We choose three single-lead ECG corpora that jointly cover screening, clinical diagnostics, and arrhythmia detection while stressing MCU constraints (short windows, low-rate inputs, and class priors ranging from balanced to highly skewed).

Apnea-ECG (PhysioNet (Ichimaru and Moody, 1999; Goldberger et al., 2000)).
Originally collected for sleep apnea screening, this dataset provides single-lead overnight ECGs with minute-wise apnea annotations. We segment each record into 18 s windows at 100 Hz, apply per-window z-score normalization with a variance guard, and split 80/10/10 by record/patient to avoid identity leakage. Its label distribution is notably skewed toward non-apnea segments, making it a good testbed for imbalance-aware training.

PTB-XL (Wagner et al., 2020). A large, heterogeneous clinical collection (21,837 records from 18,885 patients) with diverse pathologies, recorded at both 500 and 100 Hz. We downsample to 100 Hz, use lead II, binarize labels (NORM vs. any diagnostic superclass), extract 10 s windows, and re-materialize 8/1/1 folds for train/val/test. Its scale, label heterogeneity, and real-world noise make it a strong benchmark for generalization under compression.

MIT-BIH Arrhythmia (PhysioNet (Moody and Mark, 2001; Goldberger et al., 2000)). A long-standing reference corpus of arrhythmia recordings with beat-level annotations from 47 subjects. We adopt the AAMI binary setup (normal vs. arrhythmia) with single-lead inputs, following common tiny-ECG practice (Association for the Advancement of Medical Instrumentation (AAMI), 1998). This dataset is compact but challenging, with highly imbalanced arrhythmia types and patient-specific morphology differences.

All datasets share a consistent windowing/normalization policy and retain source sampling rates, allowing fair cross-dataset comparisons under identical TinyML constraints. This mix provides complementary regimes, balanced at train time (Apnea-ECG), mildly skewed (PTB-XL), and highly imbalanced (MIT-BIH), so we can study accuracy-flash trade-offs and calibration under deployment-faithful conditions.
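The shared windowing and normalization policy can be sketched as follows. The 18 s / 100 Hz figures come from the Apnea-ECG setup above; the epsilon used by the variance guard is an assumed value:

```python
import math

def windows_zscore(signal, fs=100, win_s=18, eps=1e-6):
    """Segment a 1D signal into non-overlapping windows and z-score each
    window, guarding against near-zero variance (flat segments)."""
    n = fs * win_s
    out = []
    for start in range(0, len(signal) - n + 1, n):
        w = signal[start:start + n]
        mu = sum(w) / n
        std = math.sqrt(sum((x - mu) ** 2 for x in w) / n)
        if std < eps:          # variance guard: avoid dividing by ~0
            std = 1.0
        out.append([(x - mu) / std for x in w])
    return out

# Example: 60 s of a synthetic 1 Hz tone at 100 Hz -> 3 full 18 s windows.
sig = [math.sin(2 * math.pi * t / 100) for t in range(6000)]
wins = windows_zscore(sig)
print(len(wins), len(wins[0]))  # 3 windows of 1800 samples each
```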
The strong skew in MIT-BIH (Table 2) motivates our use of macro-F1 and balanced accuracy.

Table 1: Class priors (%) per split (Class 0 = negative/normal, Class 1 = positive/event). These priors inform validation-tuned thresholding and help interpret AUC vs. macro-F1 behavior.

Dataset (Split)     Class 0 (%)  Class 1 (%)  Samples
Apnea-ECG (Train)   50.25        49.75        2000
Apnea-ECG (Val)     33.85        66.15        2000
Apnea-ECG (Test)    38.40        61.60        2000
PTB-XL (Train)      53.20        46.80        2000
PTB-XL (Val)        44.35        55.65        2000
PTB-XL (Test)       43.80        56.20        2000
MIT-BIH (Train)     92.95        7.05         2000
MIT-BIH (Val)       90.20        9.80         2000
MIT-BIH (Test)      93.50        6.50         2000

5.2 Model Suite & Sweep (21 runs per dataset)

HYPERTINYPW (ours). We replace most PW mixers with weights generated once per layer from tiny codes, keeping PW1 in INT8. We sweep (d_z, d_h) ∈ {(4, 12), (6, 16)}, bit-widths {8, 6} for {ϕ, H, z}, and (optionally) KD ⇒ 8 runs.

TinyVAE-Head. A lightweight DW/PW encoder with a VAE head (q_ψ(z|h), reparameterized z) feeding a tiny classifier; focal on; KD on/off; {8, 6}-bit ⇒ 4 runs.

CNN/ResNet/TinySeparable/RegularCNN. CNN3Small, ResNet1DSmall, TinySeparableCNN, and RegularCNN; focal on; KD off; {8, 6}-bit ⇒ 8 runs. These are standard 1D backbones used in TinyML.

HRVFeatNet. Fixed 16-D HRV(+amplitude) features with a linear head; focal on; KD off ⇒ 1 run.

Together, the grid covers deep and feature-engineered paradigms and yields 21 runs per dataset for Pareto analysis.

5.3 Training

Optimization uses AdamW; BatchNorm is replaced with GroupNorm(1) for small/variable batches. We use focal loss (macro-F1 is the selection metric). Optional KD blends focal with a teacher KL term from RegularCNN (default α = 0.7, T = 2). We apply two light regularizers (a soft-F1 auxiliary and a spectral-leakage penalty), gradient clipping, and track an EMA of weights. Unless stated, we use post-training packing at 8 or 6 bits (no QAT).
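The 21-run grid of §5.2 can be enumerated explicitly; the model-name strings below are shorthand for the configurations listed above:

```python
from itertools import product

runs = []

# HYPERTINYPW: (d_z, d_h) x bit-width x KD on/off => 8 runs.
for (dz, dh), bits, kd in product([(4, 12), (6, 16)], [8, 6], [True, False]):
    runs.append(("hypertinypw", dz, dh, bits, kd))

# TinyVAE-Head: KD on/off x bit-width => 4 runs.
for kd, bits in product([True, False], [8, 6]):
    runs.append(("tinyvaehead", None, None, bits, kd))

# Four standard 1D backbones x bit-width, KD off => 8 runs.
for arch, bits in product(
    ["cnn3small", "resnet1dsmall", "tinyseparablecnn", "regularcnn"], [8, 6]
):
    runs.append((arch, None, None, bits, False))

# HRVFeatNet: fixed features + linear head => 1 run.
runs.append(("hrvfeatnet", None, None, 8, False))

print(len(runs))  # 8 + 4 + 8 + 1 = 21 runs per dataset
```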
5.4 Evaluation & Selection

We operate at the window level. On validation we pass logits through a sigmoid, apply a 1D median filter (k = 5) across windows, and sweep a uniform grid t ∈ [0.05, 0.95] (19 points) to pick the threshold t⋆ that maximizes macro-F1. For test, we evaluate the RAW (non-EMA) checkpoint at the validation-tuned t⋆ with the same smoothing. EMA is reported as an ablation with its own t⋆_EMA; main tables use RAW to avoid post-hoc selection.

5.5 Metrics & Accounting

We report accuracy, balanced accuracy, macro-F1 (primary), and ROC-AUC with 95% cluster bootstrap CIs over record/patient groups (1,000 resamples; stratified fallback otherwise). Selected runs include confusion matrices. For efficiency we report (i) deployable packed bytes (kB) that include the generator core, per-layer heads (or A_l, B), latent codes, the kept PW1, and the backbone; (ii) parameter count and MACs; and (iii) system proxies:

• Latency (proxy): steady-state cycles from instruction-count models of CMSIS-NN/TFLM integer conv/GEMV kernels (compiled with -O3) on an Arm M-class configuration in virtual MCU backends (e.g., Arm FVP / Renode / QEMU). PW2:L are synthesized once at load time and then cached; we report steady-state inference only.

• Energy (proxy): normalized nJ/inference derived from cycles via a board-level current model (datasheet-calibrated). These figures are model-comparable but not tied to a specific board SKU.

All Pareto plots compare macro-F1 vs. packed flash. Scripts reproduce packed-byte accounting and cycle/energy proxies exactly.

5.6 Reproducibility

We fix seeds, enforce record/patient disjointness, and use the same threshold grid and median filter for all runs. Each figure/table is generated from logged CSVs (one config per row).
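The validation thresholding pipeline of §5.4 (sigmoid → median filter, k = 5 → 19-point grid sweep → t⋆) can be sketched end to end; the toy logits and labels are invented for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def median_filter(xs, k=5):
    """1D median smoother across consecutive windows (edges clamped)."""
    h, out = k // 2, []
    for i in range(len(xs)):
        seg = sorted(xs[max(0, i - h):i + h + 1])
        out.append(seg[len(seg) // 2])
    return out

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the two classes."""
    f1s = []
    for c in (0, 1):
        tp = sum(p == c and t == c for p, t in zip(y_pred, y_true))
        fp = sum(p == c and t != c for p, t in zip(y_pred, y_true))
        fn = sum(p != c and t == c for p, t in zip(y_pred, y_true))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / 2

def tune_threshold(logits, labels, k=5):
    probs = median_filter([sigmoid(x) for x in logits], k)
    grid = [0.05 * i for i in range(1, 20)]   # 19 points in [0.05, 0.95]
    scores = [(macro_f1(labels, [int(p >= t) for p in probs]), t) for t in grid]
    return max(scores)[1]                     # t* maximizing macro-F1

# Toy run: one noisy spike (logit 9 in the negative block) is smoothed away.
logits = [-4, 9, -5, -3, -4, -3, 3, 4, 2, 5, 3, 4]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
t_star = tune_threshold(logits, labels)
print(t_star)
```

At test time the same smoothing is applied and windows are thresholded at the validation-tuned t⋆.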
We will release the anonymized code bundle (loaders, models, training/eval scripts, packed-byte calculator, profiling harness) and will publish a public repo post-review.

6 RESULTS

We evaluate HYPERTINYPW under TinyML constraints on three single-lead ECG tasks and contrast it with compact and large baselines. Our goals are to (i) establish what accuracy is achievable within a microcontroller (MCU) flash budget, (ii) map the accuracy-size trade-off across models, and (iii) quantify compression relative to a strong large baseline without changing inference kernels. For each configuration we consider both RAW (best-validation) and EMA checkpoints; a scalar threshold t⋆ is tuned on validation after a short median smoother (k = 5), and test metrics are computed at that t⋆ (we report the better branch unless noted). All sizes are packed bytes and include the shared generator, heads (or A_l, B), per-layer codes, the kept PW1, and the backbone.

Table 2: Window-level class distributions per split (limit = 2,000 per split).

Dataset     Split   Total  Class 0         Class 1
MIT-BIH     Train   2000   1859 (92.95%)   141 (7.05%)
            Val     2000   1804 (90.20%)   196 (9.80%)
            Test    2000   1870 (93.50%)   130 (6.50%)
Apnea-ECG   Train   2000   1005 (50.25%)   995 (49.75%)
            Val     2000   677 (33.85%)    1323 (66.15%)
            Test    2000   768 (38.40%)    1232 (61.60%)
PTB-XL      Train   2000   1064 (53.20%)   936 (46.80%)
            Val     2000   887 (44.35%)    1113 (55.65%)
            Test    2000   876 (43.80%)    1124 (56.20%)

Table 3: Best test results under ≤ 256 kB packed flash.

Dataset     Acc     Macro-F1  BalAcc  AUC     Flash (kB)  Model
Apnea-ECG   0.7391  0.7172    0.7164  0.8324  225.46      HYPERTINYPW
PTB-XL      0.6310  0.6291    0.6327  0.8760  225.46      HYPERTINYPW
MIT-BIH     0.9016  0.5673    0.562   0.962   225.27      HYPERTINYPW

6.1 Results summary (MCU-feasible budget)

We first fix a realistic deployment budget (≤ 256 kB) and ask: how close can we get to large-model accuracy?
Table 3 summarizes the best test performance per dataset under this constraint. On PTB-XL, HYPERTINYPW at ~225 kB matches a ~1.4 MB regular CNN with ~6.3× less flash; on Apnea-ECG it closes much of the gap while remaining MCU-deployable. MIT-BIH is marked provisional while the sweep completes; we report the strongest RAW checkpoint from the latest logs.

Class balance (window level). Table 2 reports the window-level class distributions we use for training, validation, and testing (each capped at 2,000 samples per split for the unified grid).

Class balance & calibration. The three corpora differ markedly in class priors. MIT-BIH has an extreme minority rate (6-10% positives across splits), which explains the pattern we observe: high AUC (~0.96) but lower macro-F1 (~0.56) due to threshold brittleness under heavy skew. Apnea-ECG is balanced at train time but validation/test are positive-heavy (66%/62%), so the validation-tuned t⋆ appropriately shifts toward higher recall. PTB-XL is near-balanced, where EMA sometimes helps. Our unified evaluation (median smoothing, split-specific t⋆) reduces prior-mismatch effects and makes cross-dataset comparisons fair.

6.2 Per-model best results (all baselines)

To contextualize the mid-budget results, we report the best per-model runs for Apnea-ECG and PTB-XL (Tables 4-5). Compact CNNs (tinyseparablecnn, resnet1dsmall) are strong anchors at small budgets;

Table 4: Apnea-ECG: best per-model test metrics (all baselines).

Model             Flash (kB)  Macro-F1  Acc     BalAcc  AUC
regularcnn1d      1422.00     0.7518    0.7639  0.7484  0.8415
HYPERTINYPW       225.46      0.7172    0.7391  0.7164  0.8324
tinyseparablecnn  14.49       0.6660    0.6924  0.6688  0.6504
resnet1dsmall     62.49       0.6580    0.6786  0.6590  0.6660
tinyvaehead       10.16       0.6435    0.6605  0.6438  0.7358
hrvfeatnet        0.53        0.5004    0.5156  0.5022  0.4899

Table 5: PTB-XL: best per-model test metrics (all baselines).
Model             Flash (kB)  Macro-F1  Acc     BalAcc  AUC
regularcnn1d      1422.00     0.6293    0.6315  0.6325  0.8814
HYPERTINYPW       225.46      0.6291    0.6310  0.6327  0.8760
resnet1dsmall     62.49       0.6225    0.6274  0.6233  0.8770
tinyseparablecnn  14.49       0.6174    0.6219  0.6184  0.8684
tinyvaehead       10.16       0.5938    0.5965  0.5962  0.8071
hrvfeatnet        0.53        0.5341    0.5364  0.5367  0.7173

A lightweight VAE-head and a tiny HRV-feature model provide classic, high-compression reference points. HYPERTINYPW delivers the largest accuracy jump at ~225 kB while preserving integer-only inference.

6.3 Best under common flash budgets

We next ask a deployment-centric question: within typical MCU budgets, what accuracy can be expected? Tables 7-9 report the best test metrics under ≤ 32/64/128/256 kB. At ≤ 64 kB, compact CNNs dominate. The move to HYPERTINYPW at ~225 kB yields the steepest accuracy gain per kB, forming the "mid-budget elbow" we highlight in the Pareto analysis.

6.4 Compression and size-efficiency

Headline compression. Relative to a ~1.4 MB regularcnn1d baseline, HYPERTINYPW at ~225 kB achieves a 6.31× flash reduction (84.15% fewer bytes) while retaining 95% of the large model's macro-F1 on Apnea-ECG and PTB-XL (Table 3; detailed ratios in Table 13). Provisional MIT-BIH results show the same pattern at ~225 kB, with high AUC but more threshold sensitivity under heavy class skew.

Efficiency per byte. Measured as macro-F1 per kB, HYPERTINYPW improves flash efficiency by about 6× over the 1.4 MB model: PTB-XL = 0.6291/225.46 ≈ 2.79×10⁻³ vs. 0.6293/1422 ≈ 4.43×10⁻⁴ (≈ 6.3×), and Apnea-ECG = 0.7172/225.46 ≈ 3.18×10⁻³ vs. 0.7518/1422 ≈ 5.29×10⁻⁴ (≈ 6.0×).
This is the source of the consistent mid-budget elbow in the Pareto curves (§6.6): moving from tiny models (10–60 kB) to HYPERTINYPW at ∼225 kB yields the largest accuracy gain per byte; beyond that, returns diminish.

What is being compressed. The savings come from not storing most 1×1 pointwise (PW) mixers. HYPERTINYPW keeps PW1 in INT8 (stabilizing early, morphology-sensitive mixing) and generates PW2:L once at load time from tiny per-layer codes via a shared generator and light heads. Packed-byte accounting includes the shared core (ϕ), heads (or A_l, B), layer codes z_l, the kept PW1, and the backbone, matching deployable flash. Ablations (Table 10) across MIT-BIH, Apnea-ECG, and PTB-XL show that modest code/head sizes, 6-bit quantization, and KD variants preserve most accuracy while keeping flash near 225 kB.

Table 6: MIT-BIH: best per-model test metrics (all baselines).

Model | Flash (kB) | Macro-F1 | Acc | BalAcc | AUC
regularcnn1d | 1422.00 | 0.6293 | 0.930 | 0.927 | 0.972
HYPERTINYPW | 225.27 | 0.5673 | 0.902 | 0.562 | 0.962
resnet1dsmall | 62.49 | 0.5450 | 0.865 | 0.540 | 0.945
tinyseparablecnn | 14.49 | 0.5332 | 0.851 | 0.528 | 0.939
tinyvaehead | 10.16 | 0.5218 | 0.839 | 0.520 | 0.932
hrvfeatnet | 0.53 | 0.5004 | 0.828 | 0.502 | 0.920

Table 7: Apnea-ECG: best test metrics under flash budgets (packed kB).

Budget | Model | Flash (kB) | Macro-F1 | BalAcc | AUC
≤32 | tinyseparablecnn | 14.49 | 0.6660 | 0.6688 | 0.6504
≤64 | tinyseparablecnn | 14.49 | 0.6660 | 0.6688 | 0.6504
≤128 | tinyseparablecnn | 14.49 | 0.6660 | 0.6688 | 0.6504
≤256 | HYPERTINYPW | 225.46 | 0.7172 | 0.7164 | 0.8324

Takeaway. HYPERTINYPW turns the dominant flash term (stored PW mixers) into a one-time synthesis cost, amortizing parameters across layers. At ∼225 kB it offers near-large-model accuracy with ∼6× less flash, and it is the best use of bytes under the MCU budgets considered here.
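To make the load-time synthesis concrete, here is a minimal NumPy sketch: a shared two-layer MLP maps a tiny per-layer code to a full 1×1 pointwise kernel, which is quantized to INT8 and cached. The dimensions, the tanh nonlinearity, the random weights, and the per-tensor symmetric quantization are all illustrative assumptions, not the paper's exact generator or head design.

```python
import numpy as np

# Illustrative sketch of load-time kernel synthesis: one shared generator,
# one tiny code per generated layer; synthesis runs once, then the cached
# INT8 kernel is used by ordinary integer operators.
rng = np.random.default_rng(0)

d_z, d_h = 6, 16        # code and hidden widths (cf. Table 10)
c_in, c_out = 32, 32    # channels of the synthesized PW layer

# Shared generator weights (stored once, amortized across all PW layers)
W1 = rng.normal(0, 0.1, (d_h, d_z))
W2 = rng.normal(0, 0.1, (c_out * c_in, d_h))

def synthesize_pw(z: np.ndarray) -> np.ndarray:
    """Generate one INT8 1x1 pointwise kernel from a per-layer code."""
    h = np.tanh(W1 @ z)                     # shared hidden representation
    w = (W2 @ h).reshape(c_out, c_in)       # float kernel
    scale = np.abs(w).max() / 127.0 or 1.0  # symmetric INT8 quantization
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# One tiny code z_l per generated layer.
z_l = rng.normal(0, 1, d_z)
kernel = synthesize_pw(z_l)
print(kernel.shape, kernel.dtype)  # (32, 32) int8
```

Only W1, W2, and the codes are stored in flash; the (much larger) kernels they expand into exist only in the runtime cache.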
6.5 Latency and energy (steady state)

We report steady-state inference latency and energy after one-shot load-time synthesis and caching of PW2:L. Kernels are unmodified INT8 operators (CMSIS-NN/TFLM), so differences arise from model topology rather than custom kernels. Tables 11 and 12 show best-per-model configurations; one-shot synthesis overhead is discussed in §4.

We quantify compression against a strong large baseline (regularcnn1d, ∼1.422 MB packed). HYPERTINYPW achieves a 6.31× flash reduction (from 1422.00 kB to 225.46 kB; 84.15% fewer bytes) while retaining 95.4% of baseline macro-F1 on Apnea-ECG and essentially 100% on PTB-XL. Smaller baselines deliver extreme compression but give up absolute accuracy on Apnea-ECG; HYPERTINYPW hits a sweet spot where the shared generator buys accuracy per byte without changing integer kernels.

Efficiency per byte. Measured as macro-F1 per kB, HYPERTINYPW is ∼6.3× more flash-efficient than the large baseline on PTB-XL (0.6291/225.46 vs. 0.6293/1422.00). The Pareto fronts (Fig. 3) make this explicit: the mid-budget elbow near ∼225 kB consistently maximizes accuracy per stored byte across tasks.

Table 8: PTB-XL: best test metrics under flash budgets (packed kB).

Budget | Model | Flash (kB) | Macro-F1 | BalAcc | AUC
≤32 | tinyseparablecnn | 14.49 | 0.6174 | 0.6184 | 0.8684
≤64 | resnet1dsmall | 62.49 | 0.6225 | 0.6233 | 0.8770
≤128 | resnet1dsmall | 62.49 | 0.6225 | 0.6233 | 0.8770
≤256 | HYPERTINYPW | 225.46 | 0.6291 | 0.6327 | 0.8760

Table 9: MIT-BIH: best test metrics under flash budgets (packed kB).

Budget | Model | Flash (kB) | Macro-F1 | BalAcc | AUC
≤32 | tinyvaehead | 10.16 | 0.5218 | 0.520 | 0.932
≤64 | resnet1dsmall | 62.49 | 0.5450 | 0.540 | 0.945
≤128 | resnet1dsmall | 62.49 | 0.5450 | 0.540 | 0.945
≤256 | HYPERTINYPW | 225.27 | 0.5673 | 0.562 | 0.962

6.6 Pareto efficiency (macro-F1 vs. packed flash)

Finally, we visualize the full accuracy–flash trade-off with per-dataset Pareto curves (Fig. 3). Across datasets, the non-dominated frontier bends sharply at ∼200–250 kB, where HYPERTINYPW sits near the elbow.

Key observations. (i) Mid-budget elbow: moving from tiny models (10–60 kB) to HYPERTINYPW at ∼225 kB yields the largest accuracy gain per kB; beyond that, returns diminish. (ii) Iso-accuracy at 6.3× less flash: on PTB-XL, HYPERTINYPW (225 kB) essentially matches a 1.4 MB regular CNN. (iii) Apnea-ECG headroom: HYPERTINYPW narrows the gap to the large model while staying deployable. (iv) MIT-BIH provisional: actual HYPERTINYPW points cluster at 225 kB with macro-F1 ≈ 0.565 and AUC ≈ 0.962.

Figure 3: Pareto fronts: macro-F1 vs. packed flash. Panels: (a) Apnea-ECG, (b) PTB-XL, (c) MIT-BIH. Markers are color/shape-coded by model; the dotted curve traces the non-dominated frontier; the dashed vertical line marks the ∼225 kB elbow.

7 DISCUSSION

We interpret the results along four axes: where accuracy gains originate under tight flash budgets, how compression and efficiency compare across model sizes, how the approach generalizes beyond ECG, and what deployment behaviors matter in practice. This lens connects the Pareto elbows in Fig. 3 with the per-model tables and explains why HYPERTINYPW is most effective around the 200–250 kB region.

Mid-budget elbow and why it appears. HYPERTINYPW replaces most stored pointwise mixers with weights synthesized once at load time from small layer codes and a shared generator. This ties layers through common factors, reduces redundancy across mixers, and concentrates capacity where it matters. As a result, the first 200–250 kB buys a disproportionate gain in channel-mixing expressivity while keeping integer-only kernels; beyond this elbow, additional bytes yield diminishing returns.

Table 10: Unified ablation summary for HYPERTINYPW. MIT-BIH includes the full variant sweep; Apnea-ECG and PTB-XL rows report the current best (no sweep yet). RAW branch; validation-tuned t⋆; k=5 median smoothing.

Dataset | d_z, d_h | Bits | KD | Macro-F1 | Acc | BalAcc | AUC | Flash (kB)
MIT-BIH | 4,12 | 8 | off | 0.5650 | 0.8947 | 0.5617 | 0.9618 | 225.27
MIT-BIH | 4,12 | 6 | off | 0.5485 | 0.8356 | 0.5775 | 0.9430 | 225.27
MIT-BIH | 4,12 | 8 | on | 0.5641 | 0.8874 | 0.5649 | 0.9570 | 225.27
Apnea-ECG | 6,16 | 8 | off | 0.7172 | 0.7391 | 0.7164 | 0.8324 | 225.46
Apnea-ECG | 4,12 | 6 | off | 0.7024 | 0.7110 | 0.7007 | 0.7546 | 225.27
Apnea-ECG | 6,16 | 8 | on | 0.5447 | 0.6377 | 0.5933 | 0.8080 | 225.46
PTB-XL | 4,12 | 8 | off | 0.6200 | 0.6219 | 0.6237 | 0.8680 | 225.27
PTB-XL | 6,16 | 6 | off | 0.6291 | 0.6310 | 0.6327 | 0.8760 | 225.46
PTB-XL | 4,12 | 8 | on | 0.5982 | 0.5983 | 0.6115 | 0.8676 | 225.27

Table 11: Apnea-ECG: steady-state latency and energy after one-shot synthesis (cached).

Model | Flash (kB) | Chosen Macro-F1 | Latency (ms) | Energy (mJ)
hrvfeatnet | 0.53 | 0.5004 | 0.799 | 9.6e-08
tinyvaehead | 10.16 | 0.6435 | 0.563 | 0.00857827
tinyseparablecnn | 14.49 | 0.6660 | 0.831 | 0.0754276
resnet1dsmall | 62.49 | 0.6580 | 1.963 | 0.0517667
HyperTinyPW | 225.46 | 0.7172 | 2.383 | 0.0147649
regularcnn1d | 1422.00 | 0.7518 | 3.717 | 7.12494

Compression and efficiency in context. Relative to a ∼1.4 MB CNN, HYPERTINYPW at ∼225 kB achieves about 6.3× lower flash (84% fewer bytes) with iso-accuracy on PTB-XL and about 95% macro-F1 retention on Apnea-ECG. Compared with ≤64 kB compact CNNs, it delivers the largest accuracy jump per kB, forming the elbow seen in the Pareto curves. In short, accuracy per stored byte peaks near the HYPERTINYPW operating point.

Near-RegularCNN accuracy without RegularCNN size. At the mid-budget elbow (∼200–250 kB), our method matches (PTB-XL) or retains ≥95% (Apnea-ECG) of RegularCNN macro-F1 while using ∼16% of its flash (∼225 kB vs. ∼1.4 MB).
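The ∼16% flash fraction comes from the packed-byte accounting, which charges every stored component at its own bit width. A sketch of that accounting follows; the component names mirror the paper's breakdown, but the parameter counts below are purely illustrative placeholders, not the actual configuration.

```python
# Sketch of packed-byte accounting: deployable flash is the sum of every
# stored component, each packed at its own bit width. Counts are made up.
def packed_bytes(n_params: dict, bits: dict) -> int:
    """Sum packed bytes over components, rounding each up to whole bytes."""
    return sum((n_params[k] * bits[k] + 7) // 8 for k in n_params)

components = {                 # parameter counts (hypothetical)
    "generator_phi": 40_000,   # shared micro-MLP core
    "heads": 25_000,           # per-layer heads (or factors A_l, B)
    "codes_z": 64 * 12,        # tiny per-layer codes z_l
    "kept_pw1": 8_192,         # first pointwise mixer, stored in INT8
    "backbone": 150_000,       # depthwise convs, norms, classifier
}
bits = {"generator_phi": 8, "heads": 8, "codes_z": 8,
        "kept_pw1": 8, "backbone": 8}

total_kb = packed_bytes(components, bits) / 1024
print(f"packed flash ≈ {total_kb:.1f} kB")  # ≈ 218.7 kB for these counts
```

The point of the accounting is that nothing is hidden: the generator and codes that replace the stored PW mixers are charged at full packed size, so the reported ∼225 kB is directly comparable to baseline flash footprints.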
On PTB-XL, the gap to RegularCNN at this budget is within the bootstrap 95% CI (absolute difference ≤0.5 points macro-F1), indicating practical non-inferiority at deployment-friendly size. Future work will push toward full RegularCNN parity at mid budgets (via improved codebooks/heads and calibration) and shift the elbow to lower budgets (≤128 kB) through generator distillation and mixed-precision caching.

Table 12: PTB-XL: steady-state latency and energy after one-shot synthesis (cached).

Model | Flash (kB) | Chosen Macro-F1 | Latency (ms) | Energy (mJ)
hrvfeatnet | 0.53 | 0.5341 | 0.814 | 9.6e-08
tinyvaehead | 10.16 | 0.5938 | 0.465 | 0.00857827
tinyseparablecnn | 14.49 | 0.6174 | 0.900 | 0.0754276
resnet1dsmall | 62.49 | 0.6225 | 96.756 | 0.0517667
HyperTinyPW | 225.46 | 0.6291 | 7.024 | 0.0147649
regularcnn1d | 1422.00 | 0.6293 | 4.005 | 7.12494

Table 13: Compression vs. regularcnn1d (flash = 1422.00 kB). Reported are packed flash, compression factor (×), flash reduction (%), and macro-F1 retention (%) on Apnea-ECG and PTB-XL.

Model | Flash (kB) | Compress (×) | Flash ↓ (%) | Apnea F1 retain (%) | PTB F1 retain (%)
HYPERTINYPW | 225.46 | 6.31 | 84.15 | 95.40 | 99.97
resnet1dsmall | 62.49 | 22.76 | 95.61 | 87.52 | 98.92
tinyseparablecnn | 14.49 | 98.14 | 98.98 | 88.59 | 98.11
tinyvaehead | 10.16 | 139.96 | 99.29 | 85.59 | 94.36
hrvfeatnet | 0.53 | 2683.02 | 99.96 | 66.56 | 84.87

Baselines across sizes. At very small budgets (10–60 kB), hand-tuned separable or small residual CNNs are the best anchors; the VAE-head is competitive for its size, and HRV features set a classical lower bound. Moving to ∼225 kB lets the generator express richer families of mixers while PW1 stabilizes early morphology, which closes most of the remaining gap to large models without changing inference kernels.

Effect of label skew on the Pareto. On MIT-BIH, the 6–10% positive rate penalizes macro-F1 relative to AUC; nevertheless, HYPERTINYPW at ∼225 kB remains on the accuracy–flash frontier.
On Apnea-ECG, the train→val/test prior shift (50% → 66%/62% positives) favors recall-oriented thresholds; HYPERTINYPW preserves balanced detection while small baselines trade recall for size. PTB-XL's near-balanced splits explain why EMA can surpass RAW. Across all three, HYPERTINYPW's cross-layer generator yields the biggest accuracy jump per kB at the mid-budget elbow (∼200–250 kB).

Calibration behavior. We select thresholds per branch after median smoothing. RAW is consistently stronger on Apnea-ECG and in the current MIT-BIH snapshot; EMA sometimes helps PTB-XL. On MIT-BIH, several EMA runs adopted a high t⋆ and collapsed positives, indicating threshold drift under imbalance rather than weak separability. AUC around 0.96 with macro-F1 near 0.56 suggests good ranking but a brittle global threshold; lightweight remedies such as per-class or subject-aware calibration, or simple beat-wise post-processing, could raise F1 without adding flash or changing kernels.

Generality of the elbow. Although we benchmarked on ECG, the observed mid-budget elbow reflects a structural property of 1D CNNs with many pointwise layers rather than a signal-specific artifact. Tasks such as speech keyword spotting, acoustic event detection, or vibration monitoring exhibit the same PW redundancy, and we expect compression-as-generation to yield similar efficiency gains. This suggests that the ∼200–250 kB elbow we see for ECG is representative of a broader class of TinyML sensing workloads.

Positioning among efficiency methods. Classical quantization and pruning store a full set of parameters per layer, and tensor factorization reduces but does not remove per-layer redundancy. HyperNetworks and CondConv generate weights per input, incurring runtime cost and SRAM overhead.
By contrast, HYPERTINYPW generates weights once at boot and reuses them with standard INT8 kernels, yielding a unique point in the design space: near-hypernetwork expressivity at static-compression cost, with no runtime burden.

Deployment considerations. Synthesis is one-shot at boot (or lazy, on first use); steady-state inference uses standard INT8 separable kernels. Reported sizes are packed bytes that charge the generator, heads or factorization, codes, kept PW1, and backbone. SRAM peaks are bounded by the largest PW tensor plus workspace, and the boot-vs-lazy choice trades start-up time for a one-time stall without affecting steady-state latency. Because the synthesized kernels are cached as ordinary weight tensors, integration with MCU runtimes such as CMSIS-NN or TensorFlow Lite Micro requires no kernel changes. This makes HYPERTINYPW not only a compression technique but a deployable path to sustaining accuracy under 32–64 kB flash budgets.

Proxy limitation. Latency and energy are reported from a virtual instruction/cycle model and a datasheet-calibrated current model; absolute numbers may differ on specific boards. We plan to add on-device measurements in a camera-ready revision.

Environmental and practical implications. Beyond accuracy and compression, HYPERTINYPW also contributes to the sustainability of on-device intelligence. By generating weights rather than storing full parameter matrices, it substantially reduces flash usage and computational redundancy, translating into lower memory traffic, shorter inference paths, and reduced energy consumption during deployment. Such efficiency gains are particularly relevant for large-scale edge deployments, where cumulative savings can meaningfully lower carbon and hardware costs.
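The boot-vs-lazy choice amounts to deciding when the kernel cache is filled; either way, the cached kernels behave as ordinary weight tensors afterward. A minimal sketch (class and method names are ours, not from any actual runtime):

```python
# Sketch of boot vs. lazy synthesis: eager mode fills the cache at startup
# (one up-front stall); lazy mode pays synthesis on each layer's first use.
# Steady-state behavior is identical in both modes.
class KernelCache:
    def __init__(self, codes, synthesize, eager: bool = True):
        self.codes = codes            # per-layer codes z_l
        self.synthesize = synthesize  # shared generator: code -> INT8 kernel
        self.cache = {}
        if eager:                     # boot-time synthesis
            for layer in codes:
                self.cache[layer] = synthesize(codes[layer])

    def get(self, layer):
        """Lazy path: first call pays synthesis; later calls hit the cache."""
        if layer not in self.cache:
            self.cache[layer] = self.synthesize(self.codes[layer])
        return self.cache[layer]
```

With `eager=False`, the first inference through each generated layer absorbs its synthesis cost; with `eager=True`, the stall is concentrated at boot, matching the trade-off described above.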
8 CONCLUSION

We proposed compression-as-generation for TinyML ECG: a shared micro-MLP synthesizes most 1×1 channel mixers from tiny per-layer codes once at load time, while PW1 remains INT8 to anchor morphology-sensitive mixing. The design preserves standard integer kernels and reports packed-byte flash that matches deployable footprints. Empirically, HYPERTINYPW reaches the mid-budget elbow (∼225 kB), delivering 6.31× lower flash than a ∼1.4 MB CNN (84% fewer bytes) with iso-accuracy on PTB-XL and 95% macro-F1 retention on Apnea-ECG, and improving macro-F1 per kB by ∼6.3×, all without custom kernels or per-example control flow. At this fixed MCU budget, HYPERTINYPW consistently forms the mid-budget elbow of the accuracy–flash Pareto across datasets.

Future work. We will (i) co-design mixed precision for generator/heads/codes under a fixed integer inference path; (ii) learn shared codebooks or low-rank factors that further tighten packed-byte budgets at the same accuracy; (iii) map cycle-accurate latency and energy, including streamed synthesis to flash for cold-boot efficiency; (iv) add light per-class or subject calibration for imbalance-robust deployment; and (v) extend to multi-lead and multi-modal sensing. Future evaluations will also measure real-world energy and latency across representative edge use cases such as wearable health monitoring, environmental sensing, and anomaly detection, to quantify how generative compression can reduce both memory footprint and environmental impact.

REFERENCES

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

Liangzhen Lai, Naveen Suda, and Vikas Chandra. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv:1801.06601, 2018.

Colby Banbury et al. MLPerf Tiny benchmark. In NeurIPS Datasets and Benchmarks Track, 2021.

Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, and Song Han. MCUNet: Tiny deep learning on IoT devices. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Han Cai, Chuang Gan, and Song Han. Once-for-All: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations (ICLR), 2020.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.
Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

Vikas Sindhwani, Tara N. Sainath, and Sanjiv Kumar. Learning with structured transforms for small-footprint deep learning. In NIPS Workshop on Efficient Methods for Deep Neural Networks, 2015.

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. ACDC: A structured efficient linear layer. In International Conference on Learning Representations (ICLR), 2016.

David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In International Conference on Learning Representations (ICLR), 2017.

Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Brandon Yang, Gabriel Bender, Quoc Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Yunpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Pranav Rajpurkar et al. Cardiologist-level arrhythmia detection with convolutional neural networks. arXiv:1707.01836, 2017.

Awni Y. Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H. Tison, Codie Bourn, Mintu Turakhia, and Andrew Y. Ng. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 2019.

Antônio L. P. Ribeiro et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nature Communications, 2020.

Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Wojciech Samek, Tobias Schaeffter, et al.
PTB-XL, a large publicly available electrocardiography dataset. Scientific Data, 7(1):154, 2020.

George B. Moody and Roger G. Mark. The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50, 2001.

Yutaka Ichimaru and George B. Moody. Development of the polysomnographic database on CD-ROM. Psychiatry and Clinical Neurosciences, 53(2):175–177, 1999.

Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.

Association for the Advancement of Medical Instrumentation (AAMI). Testing and reporting performance results of cardiac rhythm and ST segment measurement algorithms, 1998. ANSI/AAMI EC57:1998/(R)2008.