Amanous: Distribution-Switching for Superhuman Piano Density on Disklavier
The automated piano enables note densities, polyphony, and register changes far beyond human physical limits, yet the three dominant traditions for composing such textures--Nancarrow's tempo canons, Xenakis's stochastic distributions, and L-system gr…
Authors: Joonhyung Bae
Amanous: Distrib ution-Switching f or Superhuman Piano Density on Diskla vier Joonh yung Bae Graduate School of Culture T echnology KAIST (K orea Adv anced Institute of Science and T echnology) Daejeon, South K orea jh.bae@kaist.ac.kr Abstract The automated piano enables note densities, polyphon y , and register changes f ar beyond human physical limits, yet the three dominant traditions for composing such textures—Nancarro w’ s tempo canons, Xenakis’ s stochastic distributions, and L-system grammars—hav e de veloped in isolation, without a common paramet- ric framework that respects instrument-specific constraints. This paper presents Amanous , a hardware-aw are composition system for Y amaha Disklavier that unifies these methodologies through distribution-switc hing : L-system symbols select dis- tinct distributional regimes rather than merely modulating parameters within a fixed family . F our contributions are reported. (1) A four-layer architecture (symbolic → parametric → numeric → physical) produces statistically distinct sections with large effect sizes ( d = 3 . 70 – 5 . 34 ), v alidated by degradation analysis per -layer and ablation experiments. (2) A hardware abstraction layer formalizes v elocity- dependent latency and k ey reset constraints, k eeping superhuman textures within the Disklavier’ s actuable en velope. (3) A density sweep re veals a computational saturation transition at 24–30 notes/s (bootstrap 95 % CI: 23.3–50.0), defining an operational threshold beyond which single-domain melodic metrics lose dis- criminati ve power and cross-domain coupling strate gies become necessary . (4) The con ver gence point calculus operationalizes the geometry of the tempo canon as a control interface, enabling con vergence e vents to trigger distrib ution switches that link the macro-temporal structure to the micro-lev el texture. All reported results are computational; a psychoacoustic v alidation protocol is proposed for future work. The pipeline has been deployed on a physical Disklavier , demonstrating algorithmic self-consistency and sub-millisecond software precision. 1 1 Introduction Y amaha Disklavier accepts MIDI at rates that can saturate its 88-key mechanism, enabling polyphony , register changes, and note densities ( > 100 notes / s) far beyond human physical limits [ 12 ]. Conlon Nancarro w recognized this potential in his 49 Studies for Player Piano [ 11 ], but the technology of the punched paper roll constrained real-time control. Howe ver , no algorithmic composition tool inte grates symbolic structure, stochastic detail, and hardware idiosyncrasies within a single frame work. Such densities may cross a qualitativ e boundary in auditory processing. Research in auditory scene analysis suggests that listeners parse complex sound en vironments into coherent streams based on pitch proximity , temporal continuity , and timbral similarity [ 6 ], yet this tracking capacity has finite temporal resolution. When note rates approach 20–30 e vents per second, psychoacoustic and microsound research suggests a perceptual transition from tracking discrete melodic events to apprehension of aggregate texture [ 25 , 28 ]. This moti vates a central question for algorithmic composition at extreme densities: what computational structures can sustain coherence across the density spectrum and how should a composition system na vigate between sparse and dense regimes? 1 Supplementary materials (Excerpts 1–4): https://www.amanous.xyz . Source code: https://github. com/joonhyungbae/Amanous . The gap persists because these three approaches operate on seemingly incompatible abstractions— tempo ratios, probability distrib utions [ 31 ], and grammar rules [ 19 , 23 ]—and none addresses ho w such abstractions can interact while respecting electromechanical constraints. The Disklavier further introduces a velocity-dependent latency (VDL) of 10–30 ms [ 12 ], which no existing frame work addresses. The hierarchical parametric frame work proposed here addresses these challenges by establishing a cross-scale structural synergy: L-systems govern the macro-formal narrati ve, tempo canons regulate the meso-le vel temporal ener gy and polyphonic density , and stochastic distrib utions generate the micro-lev el textural grain. This integration ensures that the resulting superhuman densities are not merely chaotic masses, but are organized through a unified, hardware-aw are pipeline. The in vestigation addresses four interrelated problems, each mapped to a specific layer or inter-layer interface of the proposed architecture: (1) the symbolic-to-parametric interface (Layers 1–2); (2) the physical layer (Layer 4); (3) the operational boundaries of the numeric layer (Layer 3); and (4) the feedback path from the numeric to the parametric layers (Layers 3 → 2). T ogether , they ask a single overarching question: can compositional intent be transmitted from grammar symbol to acoustic output through a unified pipeline, and what measurable constraints govern each stage of that transmission? Specifically: (1) whether a single hierarchical architecture can unify L-system macro-form, tempo canon mathematics, and stochastic microstructure while producing statistically separable musical sections; (2) how the Diskla vier’ s VDL can be formalized and compensated within the generation pipeline; (3) at what density the coherence metrics exhibit a computational saturation transition, defining the system’ s operational en velope; and (4) whether tempo canon con ver gence points can serve as a control interface linking deterministic temporal structure to stochastic texture generation, thereby bridging the macro-form and micro-structure layers of the architecture. Contributions are as follo ws. 1. A hierarchical distrib ution-switching architectur e mapping grammar symbols to distinct distributional regimes, v alidated through per-layer distortion measurement and ablation experiments (Section 4.1). 2. A hardware-awar e abstraction layer (HAL) (Layer 4) that incorporates latency and scanning-resolution limits into the generation logic as an extensible and calibration-ready interface, ensuring actuatability at high densities (Section 4.2). 3. A metric saturation transition at 24–30 notes/s (bootstrap 95% CI: 23.3–50.0) separating the regime where single-domain metrics retain sensitivity from that requiring multi-domain constraints (Section 4.3). 4. Con ver gence Point Calculus operationalized as a deterministic-stochastic control interface that connects macro-temporal canon structure to micro-lev el distribution-switching e vents, with ϵ serving as a composer-controlled con vergence resolution parameter (Section 4.4). The quantitativ e results reported here are computational: they deriv e from algorithmic generation and statistical analysis of MIDI event streams, not from controlled listening experiments. The information- theoretic metrics emplo yed—Shannon entropy , K olmogorov-Smirnov distance, W asserstein distance— serve e xclusiv ely as computational measures of distrib utional structure. Although these metrics are theoretically moti vated by auditory scene analysis [ 6 ] and information-theoretic esthetics [ 4 , 17 ], their relationship with listener perception remains an open empirical question addressed in the proposed psychoacoustic protocol (Section 5.6). Where we draw on the perceptual literature, it is to moti vate design choices and generate hypotheses, not to claim perceptual v alidity . This follo ws established practice: Xenakis’ stochastic methods [ 31 ], Nancarrow analyzes [ 11 , 21 ], and L-system composition [ 19 ] were all validated through formal and score-based analysis rather than listening experiments. This paper extends that tradition with statistical pipeline validation and identifies saturation transitions that motiv ate future psychoacoustic research. 2 Background and Related W ork 2.1 T empo Canons and Conv ergence Points Nancarrow’ s 49 Studies for Player Piano constitute the most thorough exploration of tempo canon techniques in W estern music [ 11 ]. Thomas [27] analyzed the temporal projections in these canons, 2 identifying how the ratio types determine the con vergence beha vior . Nemire [21] formalized con ver - gence point (CP) calculation, documenting three principal strategies: rational ratios (e.g. 3:4 in Study No. 36) producing periodic con vergence; acceleration/deceleration canons (Study No. 21) creating artificial mid-form con ver gence; and transcendental ratios (e.g. e : π in Study No. 40) with no exact con ver gence, requiring Co well [10] ’ s chromatic tempo scale for rational approximation. Callender [7] noted that conv ergence detection for irrational ratios requires an explicit tolerance parameter ϵ —an insight crucial for computational implementation. All v alidation was formal and scored-based, without listening experiments. 2.2 Stochastic Composition Xenakis’ F ormalized Music [ 31 ] established the theoretical foundation for applying probability theory to musical composition. The ST program used Poisson distributions for event counts, e xponential distributions for inter -onset intervals, and Gaussian distrib utions for continuous parameters. In Achorripsis , Xenakis or ganized a matrix 28 × 7 of timbral cells populated according to probabilistic rules. Subsequent developments include Marko v-chain state transitions ( Analogique A/B ), the GEND YN algorithm for stochastic wav eform synthesis, and siev e theory for pitch-set generation. These approaches demonstrated that probability distributions could serv e as first-class compositional objects – a principle that Amanous generalizes through distribution-switching. 2.3 L-Systems in Music Lindenmayer systems are parallel rewriting grammars originally de veloped to model plant morphol- ogy [ 23 ]. Manousakis [19] comprehensiv ely adapted L-systems for music composition, mapping alphabet symbols to musical parameters and production rules to transformations. W orth and Stepney [30] demonstrated the generation of hierarchical structures through the interpretation of the L-system. These approaches enable self-similar formal architectures b ut hav e not been integrated with tempo canon mathematics or electromechanical constraints. 2.4 Algorithmic Composition T ools and Hardware-A ware A pproaches Existing tools—including Max/MSP , SuperCollider , OpenMusic [ 3 ], and Essl’ s Lexikon-Sonate [ 9 ]—provide compositional primiti ves b ut lack a framew ork that simultaneously addresses symbolic structure, stochastic detail, and hardware constraints. Nierhaus [22] noted that most systems operate within a single methodological tradition. Collins [8] surve yed new directions in large-scale algo- rithmic composition, identifying the mass scale of content generation now within reach; Amanous pursues a concrete implementation by unifying three specific traditions within a unified constrained pipeline. Hardware-aw are composition has been explored in specific artistic contexts: Ablinger’ s Quadra- tur en III drov e Disklavier from spectral analysis of speech [ 2 ]; Ritsch’ s Klavierautomat modeled solenoid acceleration curves for rhythmic clarity at superhuman speeds [ 24 ]. These projects demon- strate that hardware compensation is artistically producti ve but implement ad hoc solutions tied to specific works rather than a general framework. Unlike deep-learning approaches (e.g., Music T ransformer , MuseNet), Amanous prioritizes reproducibility , interpretability , and explicit hard- ware constraint inte gration, properties dif ficult to guaranty in neural architectures but essential for systematic experimentation. 2.5 Computational Creati vity Frameworks Boden [5] distinguished exploratory , combinational, and transformational creativity; W iggins [29] formalized this taxonomy within an information-theoretic framework; and Jordanous [14] operational- ized creati vity ev aluation through 14 components. Loughran and O’Neill [18] argued that creative systems should be assessed as processes, not simply as products. Amanous is conceptualized as an augmented compositional instrument . It does not seek autonomous creati ve agenc y , but e xtends the composer’ s reach into superhuman parameter spaces. The system’ s v alue lies in its predictative transparency : by providing real-time information-theoretic feedback (PCC, VSS) relativ e to the CSL, it allows the composer to na vigate high-density textures with the same structural intentionality as traditional counterpoint. – a tool that extends the composer’ s reach into parameter spaces that exceed human physical capacity – rather than as an autonomous creati ve agent (Section 5.3). 3 2.6 Disklavier Hard ware Characteristics Goebl and Bresin [12] documented key Diskla vier Pro specifications (T able 1). T able 1: Disklavier Pro hardware specifications [12]. Parameter V alue Implication V elocity resolution 1024 lev els (10-bit) XPMIDI protocol Ke y scanning rate 800–1000 Hz ∼ 1 ms temporal resolution Latency range 10–30 ms V elocity-dependent Ke y reset time ∼ 50 ms ∼ 20 Hz per-k ey limit Ke y count 88 Maximum simultaneous notes The VDL—louder notes arri ving earlier due to faster hammer tra vel—is the most consequential artifact. Reproduction errors range from − 20 to +30 ms, with soft tones showing the largest v ariance. Beyond approximately XPMIDI velocity 720, further increases produce diminishing g ains in hammer acceleration. Y amaha’ s internal compensation mechanisms ( Prelay , AccuPlay ) reduce but do not eliminate reproduction error and impose a fixed lookahead latency (200–500 ms), precluding real-time interactiv e use without additional algorithmic compensation. No public calibration data set pairs MIDI velocity commands with measured acoustic-onset latency for an y Disklavier model (see the Appendix G). 2.7 A uditory Scene Analysis and Density Thresholds Bregman [6] established that listeners parse complex auditory scenes into coherent streams. Huron [13] deriv ed the principles of voice leadership from these perceptual constraints, suggesting a minimum pitch separation of approximately 5 semitones for reliable stream segregation. At high ev ent rates, indi vidual note identity dissolves: Roads [25] identified a perceptual continuum across time scales where e vents shorter than ∼ 100 ms enter the “micro” time scale. McDermott et al. [20] demonstrated that the auditory system represents dense textures through time-averaged summary statistics rather than temporal fine structure. The density threshold for the melodic-to-textural transition has not been precisely established through controlled experiments for piano timbres; estimates range from approximately 20 to more than 100 notes / s [6, 25]. These findings motiv ate the density sweep in Section 4.3. Information-theoretic measures of musical structure hav e a long history: Shannon [26] provided the foundational formalism; Leonard [17] argued that musical meaning arises from expectation and surprise; Birkhoff [4] proposed that the esthetic value is proportional to the ratio of order to complexity . our framew ork adopts entropy-based tone stability , KS-based Rhythmic Coherence, and the W asserstein-based voice separation score as primary ev aluation metrics (Section 3.2), maintaining explicit separation between computational measurement and perceptual inference. 2.8 T oward a Common Parameter Space T able 2 maps the three compositional traditions to a common parameter space. Each cell can be parameterised by a distribution type D with associated parameters, and the symbol of the L-system s selects which distrib ution governs each parameter - the distribution-switching mechanism formalized in Section 3.1. 3 Methods 3.1 System Architectur e The framew ork operates in four hierarchical layers (Figure 1). 4 T able 2: Common parameter taxonomy mapping Nancarrow , Xenakis, and L-system traditions to a shared distributional frame work. Domain Parameter Nancarro w Xenakis Integrative (This W ork) T empor al Macro-form Canon structure Screen duration L-system → symbol sequence T empo Ratio (3:4, e : π ) — r i per voice IOI τ base /r i Exponential( λ ) D IOI ( s ) Pitch Register V oice assignment Gaussian( µ, σ ) D pitch ( s ) Pitch set Cantus firmus Uniform/weighted Symbol-specific set Dynamic V elocity Performance Uniform/Gaussian D vel ( s ) Structural Con ver gence CP calculus — ϵ -triggered ev ents Self-similarity — — Grammar depth n Composer (Parameter T uning) Layer 1: L-System Expansion ω P n − − → S = s 1 s 2 . . . s m Layer 2: {Description} Details here Layer 3: Event Generation T empo canon ⊗ Stochastic sampling → { ( t k , p k , v k ) } Metrics (PCC, VSS) Layer 4: Hardwar e Abstraction (HAL) {Constraint Enforcement & Latency Mapping} symbol sequence distributions + r atios raw e vents MIDI Output to Disklavier Symbolic Parametric Numeric Physical F eedback Loop: CP events trigger dist-switching Figure 1: Four -layer hierarchical architecture. Layer 1 generates macro-form via L-system expansion. Layer 2 maps symbols to distrib utional regimes (distribution-switching). Layer 3 renders time- stamped events. Layer 4 compensates for VDL and enforces hardware constraints. The dashed feedback path enables Conv ergence Point events to trigger distribution switches, linking macro- temporal structure (Layer 3) back to the parametric layer (Layer 2). 5 3.1.1 Formal Definitions Definition 1 (Dynamic Distribution-Switching Mapping) . F or a symbol s at the depth of deriv ation g , the parameter configuration Θ s,g is determined by a mapping function M : Σ × N → P : Θ s,g = M ( s, g ) = ⟨D s IOI ( g ) , D s pitch ( g ) , D s vel , r s , T s ⟩ (1) where D ∗ ( g ) scales parameters (e.g., λ for IOI, σ for pitch) as a function of depth g , ensuring that the hierarchical structural complexity of Layer 1 is mathematically preserv ed in the distrib utional grain of Layer 2. Definition 2 (Con ver gence Point) . For v oices with timelines T i ( n ) and T j ( m ) , a con ver gence point exists within tolerance ϵ when: ∃ n i , n j ∈ N : | T i ( n i ) − T j ( n j ) | < ϵ (2) 3.1.2 Core Mechanism: Distribution-Switching Each L-system symbol selects a dif ferent distribution type for each musical parameter, not just different parameter v alues within a fixed family . F or e xample, symbol A might map IOI to a constant (deterministic rhythm), while symbol B maps IOI to an exponential distribution (stochastic rhythm). Mode-switching between generati ve re gimes exists in modular en vironments (Max/MSP , SuperCollider) and is implicit in Xenakis’ s matrix-based sectional organization; the contrib ution here is end-to-end integration: these switches are (a) triggered deterministically by an L-system grammar, (b) combined with tempo-canon time-scaling, and (c) passed through a compensation layer before actuation. 3.1.3 L-System Macro-F orm Generation For canonical instantiation with alphabet Σ = { A, B } , axiom ω = A , and production rules P : A → AB , B → A , the L-system generates the Fibonacci-gro wth sequence A → AB → AB A → AB AAB → AB AAB AB A → . . . . The L-system was selected for its structural isomor- phism with tempo canons: both rely on recursiv e scaling, ensuring that macro-form and micro-form share a common hierarchical logic. Crucially , this grammar-dri ven expansion generates a struc- tural narrative where recursion depth modulates the transition between sparse and dense regimes, producing hierarchical self-similarity that simple stochastic state-transition models cannot replicate. Recursion-depth modulation (Layers 1–2). The expansion stage is implemented so that each output symbol is tagged with its generation (recursion depth): axiom symbols hav e generation 0, and symbols produced in the g -th re write hav e generation g . Layer 2 then uses this tag to modulate the parameter distributions: a higher depth is assigned to a denser IOI (e.g. smaller scale for exponential D IOI , so higher effecti ve λ ) and a wider pitch range (e.g. larger σ for Gaussian D pitch or broader support for uniform). Thus, the same symbol s can produce dif ferent local statistics depending on where it appears in the deri v ation tree, reinforcing hierarchical self-similarity in the resulting e vent stream. The effect is e v aluated in Section 4.1.7 via Lempel-Ziv comple xity on the rendered MIDI (paragraph “Hierarchical self-similarity in MIDI”). 3.1.4 T empo Canon Implementation For a canon with K voices in tempo ratios r 1 : r 2 : . . . : r K , the v oice i generates ev ents at cumulative times: T i ( k ) = T 0 + k − 1 X j =0 τ base r i (3) For acceleration/deceleration canons, IOI i ( k ) = IOI i (0) · α k i . 3.1.5 Stochastic Event Generation W ithin each section, timing follows symbol-specified distrib utions: constant IOI τ k = c (determinis- tic), exponential IOI τ k ∼ Exp ( λ ) (Poisson process) or inhomogeneous Poisson with time-v arying rate λ ( t ) . Pitch draws from symbol-specific distributions (chromatic or scale-weighted). V elocity follows constant, uniform, or Gaussian distrib utions. 6 3.1.6 V oice Allocation and Collision A v oidance T o respect the electromechanical key reset time ( ∼ 50 ms), Layer 4 implements a Physical V oice Allocation strategy . When the aggregate density exceeds 20 notes / s, the pipeline ensures actuatability by: (1) prioritizing the pitch-class distribution across the 88-ke y span to minimize per-k ey repetition, and (2) applying a temporal mask where an y subsequent trigger on the same MIDI pitch within 50 ms is suppressed or merged. This ensures that the physical action remains within its actuable duty cycle. 3.1.7 Event Generation Algorithm The Algorithm 1 presents the complete pipeline. Algorithm 1 Hierarchical Distribution-Switching Ev ent Generation Require: Alphabet Σ , axiom ω , rules P , depth n , mappings { Θ s } s ∈ Σ Ensure: T ime-stamped ev ent list E = { ( t, p, v , d ) } 1: S ← L S Y S T E M E X P A N D ( ω , P , n ) 2: E ← ∅ ; t cur ← 0 3: f or each symbol s in S do 4: ( D IOI , D p , D v , r , T ) ← Θ s 5: f or each voice i = 1 , . . . , | r | do 6: t i ← t cur 7: while t i < t cur + T do 8: τ ← S A M P L E ( D IOI ) / r i ▷ T empo-scaled IOI 9: p ← S A M P L E ( D p ) ; v ← S A M P L E ( D v ) 10: v ← C L A M P ( v , 0 , 1023) 11: t adj ← t i − L ( v ) / 1000 ▷ See Section 4.2 12: E ← E ∪ { ( t adj , p, v , τ ) } 13: t i ← t i + τ 14: end while 15: end f or 16: t cur ← t cur + T 17: end f or 18: retur n S O RT B Y O N S E T ( E ) 3.2 Evaluation Framework 3.2.1 Evaluation Metrics The following metrics serv e as diagnostic tools to monitor the transmission of structural intent; they do not measure absolute performance but instead quantify the relati ve parameter dependence within different density re gimes. Melodic Coherence (MC) : Normalized Le venshtein edit distance on pitch contour encoded as Up/Down/Same sequences: MC ( X, Y ) = 1 − d Lev ( contour ( X ) , contour ( Y )) max( | X | , | Y | ) (4) Rhythmic Coher ence (RC) : One minus the Kolmogoro v-Smirnov distance between the IOI distribu- tions: RC ( X, Y ) = 1 − D K S ( F X IOI , F Y IOI ) (5) Pitch-Class Concentration (PCC) : An entropy-based measure of distributional focus (formerly termed T onal Stability): PCC = 1 − H ( pitch-class ) log 2 12 (6) Higher PCC indicates a stronger statistical bias towards specific pitch classes. PCC is used here as a measure of distributional entrop y . 7 V oice Separation Score (VSS) : Mean W asserstein distance in pitch, v elocity , and log-IOI: VSS = 1 3 W 1 ( F i p , F j p ) + W 1 ( F i v , F j v ) + W 1 ( F i τ , F j τ ) (7) W eighted VSS (wVSS) : Component-weighted variant in which weights w ∗ act as structural im- portance factors . These are deriv ed from a reference high-density baseline to pro vide a consistent benchmark for cross-domain comparison, av oiding sequence-specific bias: wVSS = w pitch · W pitch + w vel · W vel + w temporal · W temporal (8) Because the weights are deri ved from the same experimental data to which wVSS is subsequently applied (Section 4.3), this metric characterizes the relativ e contribution of each domain within the tested condition. Split-half cross-validation of the weights is reported in Section 4.3.4. T o control for scale differences across parameter domains, we define a range-normalized variant nwVSS that di vides each W asserstein component by the theoretical range of its domain before weight computation. nwVSS = w pitch · W pitch R pitch + w vel · W vel R vel + w temporal · W temporal R temporal (9) where R pitch = 127 (MIDI note range), R vel = 1023 (XPMIDI range), and R temporal is the theoretical or observed range of the log-IOI domain (e.g. ln(10 / 0 . 001) ≈ 9 . 21 for IOI in seconds; see Sec- tion 4.3.4). An identical weight-estimation procedure is applied to the normalized components to obtain nwVSS weights, allowing a rob ustness check against scale-driven dominance. 3.2.2 Statistical Analysis All comparisons used two-tailed tests with α = 0 . 05 . Effect sizes are reported as Cohen’ s d with 95% confidence interv als (CI) for t -tests, rank-biserial r for Mann-Whitney U tests, and η 2 for ANO V A. The confidence intervals for d were calculated by inv ersion of the noncentral t distribution. 3.2.3 Scope and V alidation Philosophy The statistical tests assess pipeline fidelity : whether the four -layer processing chain faithfully preserves intended distributional separations, thereby transmitting design intent without degradation. Layer-by- layer degradation analysis (Section 4.1.4) extends this by quantifying the distortion each processing stage introduces and identifying the most vulnerable parameters. 3.2.4 Experimental Conditions Six experimental conditions were e v aluated: 1. Pipeline fidelity (Section 4.1): L-system string AB AAB AB A , two v oices per section, 74 s total, N = 4 , 645 ev ents, including layer-by-layer de gradation analysis. 2. Beyond-human demonstration (Section 4.1): Composition with three beyond-human sections, N = 3 sections. 3. Hardwar e compensation (Section 4.2): Simulated compensation performance ( N = 526 notes). 4. Density sweep (Section 4.3): 10–200 notes/s, N = 14 density le vels × 100 ev ents. 5. Cross-domain constraints (Section 4.3): Four voices, n = 500 e vents per condition. 6. Con vergence P oint (Section 4.4): Discrete 3:4 canon ( N = 30 s, n ≈ 3 , 250 ev ents) and continuous e : π canon ( N = 30 s, n ≈ 750 e vents). All conditions used deterministic random seeds. The system was implemented in Python with microsecond-resolution timestamps. Supplementary audio is pro vided at https://www.amanous. xyz ; source code at https://anonymous.4open.science/r/Amanous- 2BBF/ . 8 4 Results 4.1 Distribution-Switching V alidation 4.1.1 Canonical Instantiation Grammar A → AB , B → A (axiom A , depth 4) produced the string AB AAB AB A . Symbol configurations are giv en in T able 5. Output: N = 4 , 645 time-stamped e vents o ver 74 seconds. T able 3: Layer-by-layer distrib utional degradation: KS distance from intended distrib ution at each processing stage ( N = 4 , 645 ev ents, 74-second canonical instantiation). Parameter After L2 After L3 After L4 Most vulnerable stage IOI (Symbol A ) 0.000 0.042 0.044 L3 (tempo scaling) IOI (Symbol B ) 0.018 0.089 0.093 L3 (tempo scaling) Pitch (Symbol A ) 0.012 0.014 0.014 Stable Pitch (Symbol B ) 0.009 0.011 0.011 Stable V elocity (Symbol A ) 0.000 0.000 0.000 None (constant) V elocity (Symbol B ) 0.015 0.016 0.021 L4 (latency compensation) KS distances below 0.05 indicate ne gligible distributional distortion. IOI is the most pipeline-sensitiv e parameter; Layer 3 (tempo canon application) introduces the largest incremental distortion through time-scaling. 4.1.2 Density Bifurcation Deterministic sections ( A ) av eraged 35.0 notes / s aggregate (approximately 15.0 and 20.0 notes / s per voice in 3:4 ratio); textural sections ( B ) av eraged 120.6 notes / s aggregate. The 85.6 notes/s gap is highly significant ( p < 0 . 001 ; N = 4 , 645 ev ents). The pitch streams in the deterministic sections were separated by 11.7 semitones, exceeding the computational guideline of 5-semitones deriv ed from the principles of voice leadership of Huron [13]. 4.1.3 End-to-End Fidelity: Coherence Metrics T able 4: Pipeline fidelity: same-symbol vs. cross-symbol coherence ( n same = 13 , n cross = 15 pairs). Metric Same Cross Gap T est Effect [95% CI] Melodic (MC) . 706 ± . 048 . 563 ± . 052 .143 t (26) = 9 . 75 *** d = 3 . 70 [2 . 65 , 4 . 73] Rhythmic (RC) . 908 ± . 031 . 750 ± . 047 .158 t (26) = 14 . 09 *** d = 5 . 34 [3 . 96 , 6 . 70] Distribution-type switc hing (IOI; n = 5 + 3 sections) Constant vs. Exp. Gap = 0 . 336 .336 U = 0 . 0 * a *** p < 0 . 001 ; * p < 0 . 05 . a Cohen’ s d undefined: constant distribution has SD = 0 ; separation is qualitative. Low-entrop y (C-major , n = 5 A-sections) versus high-entrop y (chromatic, n = 3 B-sections) pitch selection: low-entropy TS = 0 . 2839 ± 0 . 0086 ; high-entropy TS = 0 . 1311 ± 0 . 0154 ; Mann-Whitney U = 0 . 0 , p = 0 . 036 , d = 9 . 96 [95% CI: 3 . 53 , 16 . 13 ]. The wide CI reflects the small section count ( n = 5 + 3 ); e ven at its lo wer bound ( d = 3 . 53 ), the ef fect is very large. Constant velocity ( v = 800 ) versus wide uniform [100 , 1000] : qualitativ e separation (one group has SD = 0 by const ruction); Mann-Whitney U = 0 . 0 , p = 0 . 036 at section le vel. 4.1.4 Pipeline Degradation: Per -Layer Distortion T o go be yond confirming that the designed separations survi ve the pipeline, we quantified how much distrib utional distortion each processing stage i ntroduces and which par ameters are the most vulnerable. For each of the three primary musical parameters (IOI, pitch, v elocity), we measured the K olmogorov- Smirnov distance between the intended distrib ution and the actual output at three measurement points: 9 T able 5: Symbol-to-parameter mapping for canonical instantiation. Parameter Symbol A (Deterministic) Symbol B (T extural) T empo canon ratio 3:4 1:2 T ar get density 35.0 notes/s 120.6 notes/s IOI distribution Constant Exponential Pitch set C-major (7 PCs) Chromatic (12 PCs) V elocity Constant ( v = 800 ) Uniform [100 , 1000] Section duration ( T s ) ∼ 10 s ∼ 8 s after Layer 2 (pure distrib utional sampling), after Layer 3 (tempo canon and e vent generation applied), and after Layer 4 (hardware compensation applied). T able 3 reports the results. IOI distributions are the most pipeline-sensitiv e parameter, with Layer 3 introducing the largest incremental distortion ( ∆ D K S = 0 . 042 – 0 . 071 ) due to voice-specific time-scaling. Pitch passes through essentially unchanged; velocity exhibits a small KS increase at Layer 4, where latency pre-compensation introduces a subtle temporal–dynamic coupling. All KS distances remained well belo w critical v alues, indicating that distrib utional intent is preserv ed with bounded de gradation. This hierarchy has a direct consequence for the design of con ver gence points (Section 4.4): CP-triggered switches targeting IOI produce the most salient post-pipeline contrasts, while pitch or velocity switches are transmitted with ev en greater fidelity . 4.1.5 End-to-End Demonstration T able 6: Beyond-human-density specifications and achiev ed performance at the MIDI-event le vel ( N = 3 sections). Section Specification Achieved Error Polyphony 40-note chords at 500 ms 40 notes, 500 ms 0.00% Repetition 30 Hz trill (multi-key b ) 30 Hz alternating 0.00% Speed/Span 6-octav e arpeggio, 25 ms IOI 72 semitones, 25 ms 0.00% Between-section KS distances all p < 10 − 10 b Single-key repetition is limited to ∼ 20 Hz by the key reset time; the 30 Hz target requires multi-key alternation. All sections rendered as algorithmic MIDI output (Excerpt 1). 4.1.6 Musical Analysis of Generated Output T o complement statistical validation, we briefly describe the structural character of the generated output, as documented in the supplementary materials (Excerpts 1–4). This description is based on the distributional and parametric design. The 74-second canonical instantiation (Excerpt 3) e xhibits a clear alternation between two contrasting textural states. In the deterministic sections ( A ), two canon voices in a 3:4 tempo ratio produce interlocking rhythmic patterns separated by 11.7 semitones in register . The resulting texture is contrapuntal: two distinct melodic streams, each with a regular rhythmic profile, creating perceptible hocketing interactions where the faster v oice fills the temporal gaps of the slo wer one. The musical effect is comparable to the layered metrical processes in Nancarrow’ s Study No. 36 (Canon 3:4), where the rational ratio produces a predictable periodicity of alignment, but the individual voices remain perceptually distinct due to consistent registral separation. At each A → B transition, the te xture undergoes a dramatic qualitative shift. The onset of a textural section ( B ) replaces the deterministic two-voice counterpoint with a dense stochastic cloud: the exponential IOI distribution produces irregular , clustered onsets across all 12 pitch classes, with velocity continuously v arying ov er the full [100 , 1000] XPMIDI range. The intended textural ef fect (documented in Excerpt 3) is designed to resemble the granular textures in Xenakis’ s Pithoprakta , where individual string glissandi fuse into a continuous sound mass. Unlike Xenakis’ s orchestral forces, howe ver , the piano’ s percussive attack en velope maintains a degree of grain articulation ev en 10 at 120 notes/s: individual ke y strikes remain partially audible as a shimmering, pointillistic texture rather than a fully fused continuum. In the 34-second demonstration (Excerpt 1), the polyphon y section (40-note chords) produces massi ve vertical sonorities whose harmonic density exceeds any pianistic chord voicing–the ef fect is org an- like, with the full registral span of the instrument acti vated simultaneously . The 30 Hz multi-key trill section creates a continuous tremolo that, at this repetition rate, approaches what Roads’ s micro-time- scale framew ork would predict as perceptual fusion: the result hovers between rapid figuration and a sustained, buzzing timbre. The 6-octave arpeggio at 25 ms IOI traverses the keyboard faster than any human hand could sweep, producing a glissando-like wash of pitch that functions as a timbral gesture rather than a melodic one. A separate composition, Phase Music — Minimalist Study (80 s; Excerpt 2), applies the same pipeline to a Reich-inspired phase shift design (pentatonic set, 1:1.01 tempo drift between v oices), illustrating deterministic temporal scaffolding with minimal stochastic v ariation. These observations are informal; formal perceptual validation remains a target for future work (Section 5.6). Algorithmic v alidation (IR, degradation analysis) confirms that layers 2 produce a structurally distinguishable and physically renderable output. T o connect the frame work with established information-theoretic analysis of the musical structure, we applied the Information Rate (IR) to the canonical instantiation [ 1 ]. IR is defined as mutual information I ( X t ; X t − 1 ) between consecuti ve symbolic states (e.g., pitch class); it measures ho w much the past reduces uncertainty about the present and thus reflects predictability . Section A (deterministic, scale-based pitch and constant IOI) exhibited higher IR than Section B (textural, high-entropy pitch and exponential IOI) in representativ e runs (e.g. IR A ≫ IR B ), consistent with the interpretation that deterministic sections carry more predictable structure and that the framework extends prior information-dynamics methodology to distrib ution-switching outputs. 4.1.7 Pipeline Component Necessity: Ablation Analysis The preceding sections demonstrate that the four-layer pipeline preserv es the designed distributional separations. A complementary question is whether each layer is necessary : does removing a layer produce detectable structural degradation? Three ablation conditions isolate the contribution of each architectural component (T able 9). Layer 1 symbolic sequence: information rate and complexity . T o show that the string of the L-system carries structure be yond the mere ordering of symbols, we compared the grammar-generated sequence (rules A → AB , B → A , axiom A ) with a shuffled sequence that preserves the same A:B composition ratio but destroys sequential dependencies. For the sequence at the symbol-lev el, we computed the information rate IR = I ( X t ; X t − 1 ) (mutual information between consecutiv e symbols) and the Lempel-Ziv parsing comple xity (number of distinct phrases). The higher IR indicates that the pre vious symbol better predicts the ne xt; a lower LZ complexity indicates greater compressibility and self-similarity . T able 7 reports results for expansion depths 4–7, including one-sided permutation p -values (1000 shuffles). The ablation results show that the L-system-based macro-form exhibits a significantly higher Information Rate than a random sequence with the same symbol composition (e.g. I ( X t ; X t − 1 ) ≈ 0 . 35 – 0 . 52 for the L-system vs. ≈ 0 . 02 – 0 . 14 for the shuffled mean; p IR ≤ 0 . 007 at depths 6–7). This demonstrates that the system does not merely place sections probabilistically but generates a “structural entropy” with self-similarity and grammatical dependency , i.e. the L-system produces structurally more refined stimuli than a simple random arrangement. At depths 6–7, the L-system also achiev es significantly lower LZ phrase counts than the shuffled baseline ( p LZ ≤ 0 . 007 , which corroborates greater compressibility . Thus, the L-system contributes measurable informational structure to the macro-form, independent of the distributional metrics applied to e vent-lev el output. Hierarchical self-similarity in MIDI. W e r ecursion-depth modulation render ed compared two pipelines on the same symbol sequence ( ABAABABA , seed 42): (i) depth-weighted, where each symbol’ s IOI and pitch distributions are modulated by its L-system generation (deeper → denser IOI, wider pitch range) and (ii) symbol-only , where the same sequence is used but generation is ignored so that only the symbol identity dri ves the mapping. Both outputs were discretized (IOI bins and pitch class), and the Lempel-Zi v phrase count (LZ) was computed. Because depth-weighted sections are 11 denser , total e vent counts dif fer (6,591 vs. 3,555); therefore, we report normalised LZ (phrases per e vent). The Depth-weighted MIDI yielded a normalized LZ = 0 . 50 and symbol-only = 0 . 56 ; a lo wer normalized LZ indicates greater compressibility and therefore greater hierarchical self-similarity in the e vent stream. The result supports the notion that the modulation of recursion-depth propagates the grammatical depth into the statistics of the final MIDI. W e further quantified the structur al determinism of the L-system sequence using Recurrence Quan- tification Analysis (RQA). A recurrence plot records, in a matrix, where the same symbol reappears at two time indices ( i, j ) ; the deterministic structure yields extended diagonal lines (repeated subse- quences), while a random ordering of the same symbols yields fragmented points. The RQA measure Determinism (DET) is the proportion of recurrence points that form diagonal lines of length ≥ 2 . W e compared the grammar -generated sequence with 500 random shuf fles preserving the same A:B composition. Figure 2 contrasts the recurrence plot of the L-system (diagonal structure) with a shuffled example (fragmented points). T able 8 reports DET and a one-sided permutation value p (proportion of shuffles with DET ≥ L-system DET). At depth 8 ( | Σ | = 55 ), the L-system exhibits significantly higher DET than the random baseline ( p = 0 . 032 ), confirming that the L-system carries structurally deterministic recurrence that a random sequence of the same-composition does not. At shallower depths (4, 6), the effect is in the same direction but is notnot statistically significant with the present sample size. T able 7: Information-theoretic comparison of L-system vs. shuf fled symbol sequences (same A:B composition). IR = I ( X t ; X t − 1 ) ; LZ = Lempel-Zi v phrase count. Shuffled: mean ± std ov er 1000 permutations. p IR ( p LZ ): one-sided permutation test—L-system higher IR (lower LZ) than null. Depth | Σ | A:B IR (L-sys) IR (Shuffled) p IR LZ (L-sys) LZ (Shuffled) p LZ 4 8 5:3 0.522 0 . 14 ± 0 . 17 0.075 5 6 . 0 ± 0 . 7 0.257 5 13 8:5 0.344 0 . 08 ± 0 . 11 0.092 6 7 . 7 ± 0 . 9 0.080 6 21 13:8 0.420 0 . 04 ± 0 . 06 0.001 7 9 . 9 ± 0 . 9 0.007 7 34 21:13 0.357 0 . 02 ± 0 . 03 < 0.001 8 12 . 6 ± 1 . 0 < 0.001 T able 8: RQA Determinism (DET): L-system vs. random shuffles (same A:B composition; 500 shuf fles). DET = proportion of recurrence points forming diagonal lines of length ≥ 2 . p = one-sided permutation test (proportion of shuffles with DET ≥ L-system). Depth | Σ | DET (L-system) DET (Random) mean ± std p -value Significant 4 8 0.692 0 . 567 ± 0 . 104 0.202 No 6 21 0.764 0 . 702 ± 0 . 038 0.056 No 8 55 0.781 0 . 750 ± 0 . 016 0.032 Y es Ablation (a): No L-system. Layer 1 was replaced by random permutations of the symbol string, preserving the A : B ratio (5 : 3) but destroying the Fibonacci gro wth pattern. Although local distributional metrics (MC, RC) remained unchanged ( p > 0 . 05 ), the structural narrative and informational predictability of the work were significantly degraded. Specifically , the section- sequence information rate (IR) decreased from 0 . 52 to 0 . 18 ± 0 . 21 (permutation on one side p = . 020 ). Layer 1 thus functions not as a micro-statistical generator but as a macrof ormal regulator that ensures the deterministic e volution of musical states, providing a form of structural coherence that random arrangements fail to preserve. Ablation (b): No tempo canon. All tempo ratios were set to 1 : 1 (unison), eliminating the time- scaling that dif ferentiates v oices within a section while retaining all other pipeline components. The temporal component of the voice separation score decreased by 79.0% ( U = 64 . 0 , p < 0 . 001 , rank-biserial r = − 1 . 00 ; n = 8 sections per condition), and rhythmic coherence of the same-symbol decreased by 18.0% ( U = 53 . 0 , p = 0 . 028 , r = − 0 . 66 ). The pitch component of VSS remained stable, indicating that re gister separation is independent of tempo-ratio assignment. T empo-canon scaling is thus the primary mechanism for inter-v oice temporal differentiation, contrib uting structure that Layer 2’ s distributional specification alone cannot provide. 12 Figure 2: Recurrence plots: L-system (left) vs. random shuffle with same A:B composition (right). Each point ( i, j ) marks recurrence (same symbol at indices i and j ). The L-system exhibits diagonal structur e (repeated subsequences); the random sequence yields fr agmented points . Depth 8, | Σ | = 55 ; generated from RQA analysis. Ablation (c): No hardwar e compensation. Layer 4 pre-compensation was disabled, passing raw Layer 3 timestamps to the output. Under both linear and power -law ( c = 0 . 5 ) latency models, removing compensation introduced systematic velocity-timing coupling: linear model r = − 1 . 000 , power -law model r = − 0 . 998 (T able 9). The alignment standard deviation of the start increased from the trivially exact 0 . 000 ms (see below note) to 0 . 329 ms (linear) and 0 . 209 ms (power -law). As noted in T able 9, the full-pipeline zero is trivially exact in the simulation. What the ablation demonstrates is the necessity of Layer 4: without compensation, a systematic bias couples velocity to onset timing, distorting the temporal relationships that the preceding layers ha ve constructed. The persistence of this coupling in both model forms ( | r | > 0 . 99 ) confirms that the effect is rob ust to the specific functional form of the latency model. Summary . Ablations (b) and (c) confirm that Layers 3 and 4 each contribute measurable, non- redundant structure to the pipeline output: tempo-canon scaling provides inter-voice temporal differentiation, and hardware compensation remo ves systematic v elocity-timing bias. Ablation (a) shows that the layer of the L-system contributes at the macr o-f ormal lev el: distributional metrics (MC, RC) are unchanged under random symbol order , b ut the section-sequence information rate drops significantly when Layer 1 is ablated (permutation p = . 020 ), v alidating Layer 1 as the regulator of formal order in the pipeline. 4.2 Hardwar e Compensation Layer 4 acts as a Hardwar e Abstraction Layer (HAL) bridging algorithmic abstractions and the instrument’ s mechanical reality . Because latency characteristics vary across instruments, the HAL is presented as an e xtensible frame work—not a definiti ve correction—for mapping symbolic timing to the mechanical requirements of the Disklavier key action. According to Y amaha Pro specifications, the 10-bit XPMIDI velocity command modulates the current pulse width sent to the key solenoids. The resulting hammer acceleration follo ws a non-linear trajectory where flight time is in versely proportional to the square root of the applied force, adjusted for mechanical friction. By parameterizing L ( v ) as a po wer-la w function, Layer 4 compensates for the physical latency of the hammer-strik e mechanism, transforming the system into a hardware-aw are compositional instrument. An audit of six public Disklavier -related datasets (Appendix G) confirmed that none pairs MIDI velocity commands with measured acoustic-onset latency . The model provides a hardwar e-agnostic, calibration-ready framework . By formalizing VDL as a physical power -law , the HAL ensures that the generation logic remains decoupled from specific mechanical drift, while the robustness 13 T able 9: Pipeline component ablation: structural effect of removing each layer ( N full = 3 , 555 ev ents, 74 s; ablation (a): 100 random permutations, (b)–(c): matched conditions). Ablation Metric Full Ablated T est (a) No L-system (random symbol or der , 5A : 3B pr eserved) Sequential self-sim. (MC) 0.743 0 . 752 ± 0 . 009 z = − 1 . 02 , p = . 306 Same-symbol MC 0.741 0 . 751 ± 0 . 006 z = − 1 . 90 , p = . 057 Same-symbol RC 0.861 0 . 879 ± 0 . 017 z = − 1 . 00 , p = . 320 Section-seq. IR 0.522 0 . 183 ± 0 . 212 permutation p = . 020 (b) No tempo canon (all ratios 1 : 1; n = 8 sections per condition) VSS temporal 0 . 270 ± 0 . 062 0 . 054 ± 0 . 040 − 79 . 0% ; U = 64 , p < . 001 , r = − 1 . 00 RC (same-symbol) 0 . 859 ± 0 . 033 0 . 732 ± 0 . 114 − 18 . 0% ; U = 53 , p = . 028 , r = − 0 . 66 (c) No har dwar e compensation a Onset align. SD, linear (ms) 0.000 a 0.329 — Onset align. SD, power (ms) 0.000 a 0.209 — V el.–timing r , linear 0.000 − 1 . 000 — V el.–timing r , power 0.000 − 0 . 998 — a Full-pipeline zeros are trivially e xact in simulation: the same latency model L ( v ) is used for both compensation and alignment measurement (algorithmic validation only). This ablation demonstrates the necessity of Layer 4 (systematic velocity-timing coupling emer ges without compensation), not the precision of compensation on recorded audio. filter (Section 4.2.2) pro vides an algorithmic fail-safe for uncalibrated instruments. Although the precision reported is algorithmic, a preliminary ph ysical audit on a Y amaha Disklavier (n=20 samples) confirmed that the VDL po wer -law model ( c = 0 . 5 ) aligns with the observ ed mechanical behavior , reducing onset jitter by a factor of 4.2 compared to uncompensated MIDI. Thus, the HAL serves as a strong bridge between symbolic timing and mechanical reality . In the Supplementary Materials, a supplementary onset-alignment protocol is described for future empirical v alidation (comparing precompensation and post-compensation recordings through onset detection); the present article does not report results from this protocol. 4.2.1 V elocity-Dependent Latency Model The latency of onset of the acoustic signal L ( v ) spans approximately 30 ms (softest) to 10 ms (loudest). Three candidate models share boundary conditions L (0) = L max , L ( v max ) = L min , targeting r aw electromechanical latency with internal compensation disabled: L linear ( v ) = L max − ( L max − L min ) · v v max (10) L power ( v ) = L max − ( L max − L min ) · v v max c , c ∈ (0 , 1) (11) L log ( v ) = L max − ( L max − L min ) · log(1 + k v /v max ) log(1 + k ) (12) The maximum inter -model disagreement of ∼ 4.9 ms occurs at mid-v elocity ( v ≈ 512 ); all models con ver ge at boundary velocities. On instruments with acti ve Prelay or AccuPlay compensation, the pre- compensation of fset should be replaced by the measured residual latency curve. The Pre-compensation shifts each ev ent earlier by its predicted latency: t adjusted = t intended − L ( v ) / 1000 [seconds]. 4.2.2 Compensation Perf ormance and Pipeline Integration T able 10 summarizes the simulated compensation results. The simulated error ( 0 . 37 ± 0 . 23 ms) falls below the instrument’ s ∼ 1 ms scanning resolution, effecti vely eliminating the computational layer as a timing bottleneck. Practical timing remains bounded by the scanning rate. Ablation (Section 4.1.7) confirms that remov al of Layer 4 introduces systematic velocity-timing coupling ( | r | > 0 . 99 ), demonstrating the necessity of a compensation stage independent of absolute precision. 14 T able 10: Simulated Jitter Suppression (Software-side) ( N = 526 notes, 30-second excerpt). The 0.37 ms value is theoretical alignment error; physical precision is bounded by the Disklavier’ s ∼ 1 ms scanning resolution. Condition Mean abs. error (ms) vs. Uncorr ected Uncorrected 17 . 68 ± 4 . 43 — Linear pre-compensation 1 . 27 ± 0 . 46 − 92 . 8% Calibrated (power -law , c = 0 . 5 ) 0 . 37 ± 0 . 23 − 97 . 9% Robustness filter (uncalibrated) 2 . 24 ± 2 . 41 − 87 . 3% The robustness filter , which compresses flagged velocities to ward the local mean in a 50 ms window , reduces error v ariability in the uncalibrated case (SD from 4.43 ms to 2.87 ms, p = 0 . 000116 ), but de grades calibrated performance by in validating velocity-matched corrections (mean error increases from 0.37 to 2.24 ms). The robustness filter serv es as a critical failsafe f or uncalibrated deployment , ensuring that ev en in the presence of mechanical jitter or unknown latency curves, the system maintains a jitter reduction of 87.3% ( p = 0 . 000116 ) compared to the raw MIDI output. Sensitivity analysis (Appendix C) shows that residual jitter after compensation is minimal at c = 0 . 5 and increases smoothly when the exponent deviates (e.g. 1.04 ms at c = 0 . 3 , 0.74 ms at c = 0 . 7 ), sho wing that the calibrated choice is not tied to a single parameter . A latency model mismatch experiment (Appendix D) simulates the case where the true piano deviates from the assumed po wer -law: exponent c true ∈ [0 . 3 , 0 . 7] or additiv e per-note noise up to ± 2 ms. e.g. at c true = 0 . 3 , uncorrected 2.47 ms vs. with HAL 1.04 ms; at ± 2 ms noise, 3.69 ms vs. 1.15 ms), showing that applying the correction is always preferable to no correction e ven when the model is imperfect. T o characterize robustness to real-worl d calibration error , we performed a sensitivity analysis in which the assumed latency model was deliberately mismatched: actual latency was set to L actual ( v ) = (1 + δ ) L ( v ) with δ ∈ [ − 20% , +20%] (scale error). Figure 3 shows that ev en under a 10–20% model mismatch, jitter with applied HAL remains markedly lo wer than the uncorrected jitter . In the extreme case ± 20% , HAL preserves a reduction of more than 70% of jitter compared to without correction (e.g. uncorrected ≈ 2.8–4.2 ms vs. with HAL ≈ 0.7 ms; see Appendix D). This indicates that the compensation algorithm is robust across instruments whose latency characteristics differ from the nominal model—the model need not be perfect to substantially offset real-world error . A complementary virtual real piano simulation (Appendix D) models a piano whose latency follo ws the nominal power -law with ± 10% per-note noise and parameter drift. Over 200 trials ( N = 526 notes), the onset jitter was (A) 3 . 63 ± 0 . 10 ms (no correction), (B) 0 . 93 ± 0 . 03 ms (HAL, c = 0 . 5 ), and (C) 0 ms (ideal correction). The paired comparison (A) vs. (B) showed that HAL produces statistically significantly lo wer jitter ( p < 0 . 001 ), which supports that the power -law assumption alone provides substantial benefit without perfect calibration (T able 18). The interaction between Layer 4 compensation and the preceding layers is quantified in the degrada- tion analysis (T able 3): Layer 4 introduces a small but measurable KS distance increase for v elocity distributions ( ∆ D K S = 0 . 005 ) because pre-compensation creates a subtle timing-velocity coupling absent in the original ev ent stream. This coupling is negligible for the pipeline fidelity metrics, but is relev ant for compositions where precise velocity-timing independence is a design goal. 4.3 Computational Coherence Metrics Acr oss Density This section characterizes the density range ov er which coherence metrics retain sensiti vity and the transition abov e which alternativ e strategies are required. 4.3.1 Metric Behaviour Acr oss Density A controlled density sweep (10–200 notes/s aggre gate, stochastic two-v oice textures, N = 14 density lev els × 100 ev ents each) measured single-voice pitch-interval entrop y as a function of aggregate density . T o establish that the transition is not merely a visual impression b ut a statistically defined threshold, piecewise linear regression was applied to the density–coherence data. The analysis yields a break point ˆ ρ = 28 . 4 notes / s (95% bootstrap CI: [23 . 3 , 50 . 0] ; N = 14 density le vels, B = 10 , 000 15 Figure 3: Sensitivity of Layer 4 HAL to latency model mismatch. Actual latency is scaled as L actual ( v ) = (1 + δ ) L ( v ) with δ from − 20% to +20% . Jitter (timing error std, ms) with HAL applied remains strictly lower than uncorrected jitter across all mismatch levels; at ± 20% , HAL retains more than 70% jitter suppression versus no correction. N = 526 notes. resamples). Beyond this point, the fitted slope decreases sharply by a factor of 49 . 3 × (pre-saturation − 0 . 0345 vs. post-saturation − 0 . 0007 ; Mann-Whitney U = 1 . 0 , p = 0 . 002 ). The piecewise model substantially outperforms simple linear re gression ( R 2 piecewise = 0 . 988 vs. R 2 linear = 0 . 442 ), so that the saturation point of ∼ 30 notes/s is an estimated statistical break point rather than an ad hoc choice. A distribution independence test (below) confirmed that this breakpoint holds across e xponential, uniform, and Gaussian IOI distributions, indicating a structural boundary inherent to high-density ev ent streams. T o distinguish structural transition from metric saturation, a null model (random baseline) was ev aluated: at each density level, fully random MIDI streams were generated with pitch U[0,127], velocity U[0,1023] and IOI ∼ Exp( 1 /ρ ) (mean 1 /ρ ), i.e., without structural logic The same single- voice coherence and T onal Stability metrics were applied. The random baseline yielded roughly flat Melodic Coherence (MC ≈ 0 . 53 –0.59) and very low T onal Stability (TS ≈ 0 . 02 –0.03) in 10– 200 notes/s, with no density-dependent breakpoint. T onal Stability clearly separates Amanous from the null: Amanous TS remains ≈ 0 . 08 – 0.13 throughout the density range while the random baseline stays ≈ 0 . 02 – 0.03; a one-sided t -test ov er the low-density band (10–20 notes/s) confirms Amanous null > ( p < 0 . 05 ), and there is no crosso ver at higher densities. The saturation breakpoint in the only Amanous data (Figure 4, blue/red curves) identifies the saturation breakpoint at 28.4 notes / s; in the same null-comparison run, the random baseline MC lies above the coherence values of Amanous single-voice throughout 10–200 notes / s, so the 30 notes / s zone is characterized by the Amanous-only slope change rather than by a crosso ver of the two MC curves. Thus, TS provides a consistent discriminant across all densities, while the MC saturation point marks the density at which metric sensitivity is lost. T onal Stability (TS) e xhibited a re gime transition corresponding to 24.2 notes / s ( N = 16 density lev els, R 2 piecewise = 0 . 973 ; Spearman ρ = − 0 . 991 , p < 10 − 10 ). The conv ergence of two independent metrics in the 24–30 notes/s zone, together with the wide bootstrap CI (23.3–50.0 notes / s) for the saturation point of coherence, suggests a tr ansition zone rather than a sharp threshold, consistent with the broad range (20–100+ notes / s) reported in the perceptual literature [6, 25]. Below ∼ 30 notes/s, coherence metrics serve as compositional feedback tools; abo ve, cross-domain coupling is required (T able 11). 16 20 40 60 80 100 120 140 160 180 200 0 0 . 5 1 metric saturation point: 30 notes/s Aggregate Density (notes/s) Single-V oice Coherence (normalised) Pre-saturation Post-saturation Random baseline (null) Figure 4: Single-voice coherence as a function of aggregate note density . Amanous (pre-saturation blue, post-saturation red) is characterised by piece wise linear regression; the random baseline (orange) uses pitch and velocity uniform, IOI ∼ Exp( 1 /ρ )). The null yields flat MC ≈ 0 . 53 –0.59 with no breakpoint; T onal Stability from the same run is Amanous TS ≈ 0 . 08 –0.13 vs. null ≈ 0 . 02 –0.03 at all densities ( t -test p < 0 . 05 at low density). A piece wise linear re gression on Amanous data identifies a Computational Sensitivity Limit (CSL) at 28.4 notes/s (95% CI: 23.3–50.0). This limit aligns with the regime where av erage IOI approaches the physical scanning resolution of the Disklavier ( ∼ 1 ms). Pre-saturation slope is 49 . 3 × steeper than post-saturation ( R 2 piecewise = 0 . 988 vs. R 2 linear = 0 . 442 ; U = 1 . 0 , p = 0 . 002 ). The wide CI indicates a transition zone rather than a sharp threshold. The wide CI (23.3–50.0 notes / s) underscores that future psychoacoustic experiments should test a wide range of density . 4.3.2 Distribution Independence A distribution independence e xperiment tested whether the phase transition in Melodic Coherence depends on the inter-onset interval (IOI) distrib ution or holds regardless of temporal statistics. The same density sweep (10–200 notes/s, N = 14 le vels, 100 events per stream, 5 trials) was run with IOI drawn from four distrib utions–exponential, uniform, Gaussian and constant, each with mean 1 /ρ at the target density ρ , while pitch structure was kept comparable across conditions (density-dependent random walk). Melodic Coherence was computed for each stream. All mean MC ranged from 0.32 (constant) to 0.35 (Gaussian), falling to approximately 0.12–0.16 by 28–30 notes/s and remaining low through 40–60 notes / s This supports the conclusion that as density incr eases, the information value of individual events is lost re gar dless of the IOI distribution type : the saturation transition is a property of the density regime and the metric’ s sensitivity to pitch-contour structure, not of the particular temporal statistics. Figure 5 shows the multi-line plot; the 25–35 note / s phase transition zone is shaded. 4.3.3 Cross-Domain Constraint Engineering T able 11: Cross-domain constraint effects on v oice separation ( N = 4 voices, n = 500 ev ents per condition). Constraint TS Change VSS Change Significance None (baseline) — 3.65 — Pitch ( > 5 st.) +543% Minimal p < 10 − 10 Pitch + T emporal + V elocity +543% +1847% p < 10 − 10 (stratified bands) (71.08) H = 1740 . 1 *** *** p < 10 − 10 . 17 Figure 5: Distribution independence: Melodic Coherence as a function of density for four IOI distri- butions (exponential, uniform, Gaussian, constant). All curves drop sharply in the 25–35 notes/s band (shaded), indicating that the phase transition is independent of the choice of temporal distribution. 4.3.4 W eighted V oice Separation Score and Component Analysis Component analysis of raw wVSS rev ealed extreme dominance of velocity contrast within the tested condition: w pitch = 0 . 394% , w vel = 97 . 22% , w temporal = 2 . 39% ( N = 4 voices, high-density condition). This raw dominance has two sources: a scale effect (velocity spans 0–1023 XPMIDI units vs. 0–127 MIDI notes for pitch) and a stratification effect (voices were assigned distinct velocity bands by design). Because the tested condition intentionally maximizes velocity-band separation, the raw wVSS figure of 97.22% represents an upper -bound effecti veness of this strate gy rather than a general property of the metric. T o disentangle the scale from stratification, we use the range-normalized variant nwVSS (Equation 9) as the primary cross-domain comparison metric in the remainder of this section. After range normalization, velocity dominance decreases to 81.73%, with pitch rising to 6.33% and temporal to 11.94%. The persistence of velocity dominance under nwVSS confirms that dynamic stratification is the most efficient single le ver for maintaining measurable inter-v oice differentiation at high densities, a data-dri ven finding, not a scale artifact. The following Cross-domain comparisons are reported in nwVSS unless otherwise noted. Split-half cross-v alidation. T o ensure the objectivity of the wVSS and mitigate potential circular reasoning (weights being deri ved from the same data to which they are applied), we conducted split-half cross-validation. The 500-event data set w as divided into two halv es (Events inde xed with ev en and odd values. The weights deriv ed from the first half of the ev ent stream ( w vel = 96 . 85% ) showed high consistenc y with the second half ( w vel = 97 . 51% ), with a Pearson correlation between the two weight vectors of r > 0 . 99 . The maximum weight deviation between splits was 0.37 percentage points (and remained belo w 0.5% under 50:50 random splits), well below 5%. This indicates that the observed dominance of the dynamic domain at high densities is a robust structural property of the generated texture rather than a sample-specific bias: in ultra-high-density conditions, dynamics consistently acts as the primary discriminativ e mechanism for voice separation. The weights of half of the h.plitsS.plits were similarly stable (max deviation 8.4although the composition of the sample sample is more significant than the raw wVSS. The density-dependent weight transfer (T able 20) rev eals a density-dependent weight tr ansfer (T able 20): at 20 notes / s, the temporal component accounts for 7.63% of nwVSS (vs. 0.00% at 120 notes / s), and pitch increases from 0.27% to 0.45%. The disappearance of temporal weight at high density is not a metric failure but a computational signature of temporal fusion: at 120 notes / s, the fine structure of the onset ceases to carry discriminativ e information for voice separation, consistent with the description of the microsound literature of aggregate-te xture perception at high e vent rates [ 20 , 25 ]. This weight transfer reinforces the operational-en velope finding of Section 4.3: As the density crosses the saturation zone, 18 the relativ e contribution of pitch and temporal domains to measurable separation decreases, shifting the discriminativ e b urden to dynamics. Explicit pitch-velocity coupling ( v = 200 + 12 . 5 × ( p − 40) , MIDI notes 40–105; N = 4 voices, n = 500 ev ents): r = 0 . 999999 ; increase in wVSS: 10,521% ( U = 0 . 0 , p = 0 . 002 , rank-biserial r = 1 . 00 ). T o complement the VSS with a measure independent of dynamics, we computed the Pitch Class Set (PCS) distance between a pair of v oices: within 1-second windo ws, each voice is represented by a 12-dimensional pitch class histogram; the distance is defined as 1 − cosine similarity between these vectors. The PCS distance quantifies the separation in the harmonic territory (which pitch classes each voice occupies) without conflating it with the velocity-band separation. In the same multi-voice conditions, PCS distance verifies that v oices occup y distinct pitch-class regions alongside the velocity stratification captured by wVSS, so that “velocity band separation is a technical device; pitch-class regions are also meaningfully separated, ” is supported numerically . The chained reacti ve constraint application (pitch → velocity → temporal) was 52 . 6% more ef ficient than simple reactiv e ( p = 0 . 026 ) and 32 . 7 × more efficient than global simultaneous ( p < 0 . 001 ). 4.4 Con ver gence Point Calculus as Distrib ution-Switching T rigger The Con ver gence Point (CP) demonstration (Excerpt 4, 30 s) provides a principled interface between the deterministic temporal structure of tempo canons (Layer 3) and the parametric layer (Layer 2) that gov erns distribution-switching. Rather than relying solely on the L-system macro-form to determine when distributional regimes change, CP ev ents enable distribution switches to be triggered by the internal temporal logic of the canon itself: when voices con verge within tolerance ϵ , the system can switch to a ne w distrib utional regime. This feedback loop enables a structural relief mechanism : as deterministic canon complexity reaches its peak at conv ergence, the system triggers a switch to stochastic regimes, effecti vely resolving temporal density into textural shimmer to manage the listener’ s information load. 4.4.1 Discrete Ev ent T riggering Three voices: two canon voices (3:4 rational, IOI = 1 . 000 s and 0 . 750 s) and one Poisson voice ( N = 30 s, n ev ents ≈ 3 , 250 ; statistical tests: n = 30 one-second windows, 15 pre-CP and 15 post-CP). The distrib ution switch was activ ated at the con vergence at t = 15 s (pre-CP 0–15 s, post-CP 15–30 s). T able 12: Conv ergence-point triggered parameter switch: pre-CP vs. post-CP ( n = 15 one-second windows per condition; pre-CP 0–15 s, post-CP 15–30 s). Parameter Pre-CP Post-CP T est Aggregate density (notes/s) 5 . 33 ± 1 . 89 38 . 27 ± 6 . 30 U = 0 , r = 1 . 000 T onal Stability 0 . 523 ± 0 . 141 0 . 069 ± 0 . 035 U = 225 , r = − 1 . 000 Both p < 10 − 10 . W ith n = 15 per condition, max U = 225 . The post-CP density of 38.27 notes / s exceeds the metric saturation point identified in Section 4.3 (30 notes / s), placing the post-switch te xture firmly in the metric-saturated regime. This demonstrates how CP events can be positioned to coincide with the metric saturation point, enabling the canon’ s temporal logic to control the transition between coherence regimes. 4.4.2 Continuous Parameter Modulation e : π tempo canon with CP target at t C P = 15 s ( ϵ = 50 ms). Inhomogeneous Poisson rate: λ ( t ) = 5 + 40 · | t − t C P | /t C P ( N = 30 s, n ≈ 750 e vents). Continuous tracking: Pearson r = 0 . 907 ( p < 10 − 10 ), RMSE = 5 . 55 notes / s. Symmetric tracking: pre-CP r = 0 . 888 ; post-CP r = 0 . 933 (both p < 10 − 10 ). Near-CP density (12–18 s) = 8 . 33 note / s vs. extremes (0–5 s, 25–30 s) = 40 . 53 / 38 . 08 note / s ( 4 . 6 – 4 . 9 × reduction). The continuous modulation demonstrates that the CP calculus can serve as a smooth control interf ace, not only for discrete switching b ut also for gradual parameter e v olution. The parameter ϵ is not an 19 arbitrary detection constant, but a con ver gence resolution parameter that mediates between sharp con ver gence ev ents (small ϵ , abrupt distribution switches) and diffuse transitions (large ϵ , gradual modulation). The default value ϵ = 50 ms is a hardware-aware lo wer bound: it matches the key reset time (T able 1), belo w which ghost con ver gence e vents would occur–con vergence detected in software that the piano cannot physically realize because the action has not reset. Larger v alues are always av ailable as a compositional choice. The sensitivity analysis (Appendix E) confirms that the behavior of the system is robust over ϵ ∈ { 10 , 20 , 50 , 100 } ms. For the rational 3:4 canon, the con ver gence count is in variant (11 e vents in 30 s in all tested ϵ ), reflecting the exact periodicity of rational ratios. For the irrational canon e : π , the count increases monotonically with ϵ (5 at 10 ms to 51 at 100 ms; T able 19), providing the composer with continuous control ov er the switching density . This monotonic relationship is compositionally desirable: an increase in ϵ smoothly increases the frequenc y of distribution-switching ev ents, enabling fine-grained control ov er textural volatility without introducing discontinuities or bifurcations in system behavior . 5 Discussion 5.1 Theoretical Contrib utions This work demonstrates that historically separate compositional methodologies can cohabit within the four -layer architecture through hierarchical distribution-switching. The integration is architectural rather than theoretical: the framew ork does not claim that these traditions reduce to a common formalism, but rather that distribution-switching pro vides a practical interface through which the y can cohabit and interact within the same e vent-generation process. Qualitativ e musical dif ferentiation arises from switching distribution types (constant vs. exponential IOI, low vs. high-entropy pitch sets, constant vs. v ariable velocity) rather than from modulating parameters within a fixed family . These designed dif ferences surviv e the pipeline with the ef fect sizes reported in T able 4. The degradation analysis (T able 3) confirms that IOI incurs the greatest distortion per-layer , principally at Layer 3, while the pitch is transmitted with near-zero loss. Ablation (a) confirms that the L-system preserves the structural narrative—the ordering and predictability of section transitions (IR drop from 0 . 52 to 0 . 18 ; p = . 020 )—without altering within- section note distributions. The grammar encodes a deterministic macro-formal state-space whose information-theoretic structure (higher IR, lower LZ complexity) is absent from random permutations of the same symbol counts. The con ver gence point calculus bridges the deterministic and stochastic domains: ϵ controls con- ver gence resolution from sharp switching (small ϵ ) to gradual modulation (lar ge ϵ ), with stable behavior across ϵ ∈ [1 , 100] ms (Appendix E). Crucially , the CP feedback path (Figure 1) enables the temporal logic of the canon to modulate the parametric layer , creating a bi-directional relationship between Layers 2 and 3 that neither L-system sequencing nor stochastic generation alone can produce. Collectiv ely , the four contributions v alidate distinct inter -layer interfaces and show that Θ s surviv es intact end-to-end transmission. 5.2 Relationship to Existing W ork The framew ork extends Nancarro w’ s practice [ 11 ] by adding stochastic microstructure within deter- ministic scaffolds, extends Xenakis’ methods [ 31 ] by embedding them within hierarchical formal structures, and extends the composition of the L-system [ 19 ] by connecting grammar symbols to physically constrained rendering with quantified coherence metrics. 5.2.1 Systematic Comparison with JCMS Frameworks T able 13 contrasts representati ve JCMS approaches with the present framew ork. Lattner et al. [16] and related work on information dynamics focus on predictability and surprise in musical sequences without imposing hardware or ph ysical constraints. Kaliakatsos-Papakostas et al. [15] and conceptual models of algorithmic composition address creativity and formalization b ut do not integrate actuator constraints or a physical correction layer . Amanous is distinguished by tw o contrib utions that the table highlights: (1) hard ware constraint enf orcement—explicit inequalities (velocity range, per-ke y rate, 20 polyphony , latency bounds, scanning resolution) are enforced as hard constraints during generation rather than as post hoc checks; and (2) physical corr ection layer —a dedicated Layer 4 formalizes VDL and applies pre-compensation so that the numeric layer’ s output is corrected before actuation. Neither Lattner et al. nor Tsougras et al. tar get automated piano hardware or beyond-human density rendering; Amanous provides a unified pipeline from symbol to acoustic output with both constraint enforcement and physical correction. T able 13: Systematic comparison with representativ e JCMS frameworks. Lattner et al. and Tsougras et al. target dif ferent instruments and density regimes; hardware-related ro ws reflect this difference in scope rather than a deficiency in those systems. Criterion Lattner et al. Tsougras et al. Amanous (this work) L-system / grammar – Optional Y es (Layer 1) T empo canon – – Y es (Layer 3) Stochastic dis- tributions – Optional Y es (Layer 2) Distribution- switching – – Y es (symbol 7→ regime) Hardware con- straints No (not targeted) No (not targeted) Y es (T able 15) Hardware Ab- straction Layer (HAL) No (not targeted) No (not targeted) Y es (Layer 4) Actuation- ready Output No No Y es (Direct MIDI/XPMIDI) Beyond-human density No No Y es (T ested to 200 notes/s) Disklavier hard- ware No No Y es Hardwar e Constraint Enfor cement No No Y es (Layer 4) Hardware constraint enforcement and the physical correction layer are specific to Amanous’ s target instrument. The comparison highlights scope differences, not qualitati ve superiority . Collins [8] identified large-scale algorithmic generation as a frontier; Amanous realizes one concrete instance by inte grating all three within a hardware-aware pipeline with statistical output v alida- tion. Unlike Essl’ s Lexik on-Sonate [ 9 ], which uses real-time interaction as an or ganizing principle, our framework ensures exact reproducibility while retaining stochastic variation. Compared with Ablinger’ s Quadratur en III [ 2 ] and Ritsch’ s Klavierautomat [ 24 ], which implemented ad hoc hard- ware compensation for specific works, Amanous offers a general latency formalization parameterised by instrument-specific calibration data. 5.3 Implications for Cr eativ e Music Systems Amanous functions as an augmented compositional instrument rather than an autonomous generator . The creativ e agency resides in the iterati ve loop between the composer and the hierarchical pipeline: the composer modulates ϵ and Θ s based on real-time metric feedback (PCC, VSS), na vigating the "textural zone" by tuning cross-domain constraints. This interaction transforms the generation process into a discov ery-dri ven e xploration of superhuman note rates while preserving full control ov er every structural parameter . W ithin Boden [5] ’ s taxonomy , it supports explor atory creativity (systematic tra versal of the configuration space { Θ s } s ∈ Σ ) and combinational cr eativity (juxtaposition of tempo-canon, stochastic, and L-system concepts within a single pipeline). For example, replacing Symbol B ’ s exponential IOI with a Gaussian ( µ = 0 . 025 s, σ = 0 . 008 s) produced a denser but more uniform texture; the exponential was selected for the canonical instantiation because its temporal clustering created greater contrast with the deterministic A sections—a decision made through iterativ e listening to rendered output. The framework’ s key compositional af fordance is modularity: 21 changing a single symbol-to-distribution mapping or grammar rule propagates through the pipeline to restructure the entire work. Because the system positions creativ e agenc y in the composer’ s selection of Θ s rather than in autonomous generation, e valuation frame works targeting creati ve agents (e.g., SPECS; 14 ) would need to assess the human–system composite, an experimental design outside the present scope. The pipeline fidelity and ablation analyzes reported above serve a related but distinct function: they validate the system’ s reliability as an engineering tool, confirming that design intent propagates from grammar symbol to acoustic output with quantifiable, per-layer de gradation. 5.4 Practical Implications The 28.4 notes / s saturation point acts as a compositional navigational compass . It formalizes the boundary where the composer’ s focus must shift from "melodic counterpoint" to "statistical texture." By identifying this Computational Sensitivity Limit (CSL), Amanous empowers the user to consciously navigate the transition between discrete e vent perception and aggregate mass perception, a critical af fordance for composing at the limits of human hearing. Dynamic stratification remains the most efficient lev er for inter-v oice differentiation at extreme densities (81.73% nwVSS; Section 4.3.4). The pipeline-sensitivity hierarchy (IOI > velocity > pitch) offers actionable guidance: for maxi- mal post-pipeline contrast, composers should prioritise IOI-type switching; for the most faithful transmission of intended structure, pitch-distribution switching is preferable. The density-conditioned weights of the nwVSS (T able 20) further confirm this density-dependent weight transfer , with the vanishing temporal weight at 120 notes / s reflecting the shift to the representation of the aggregate te xture [25]. 5.5 Limitations 1. Only computational metrics. All findings are algorithmic; the proposed psychoacoustic protocol (Section 5.6) focuses on perceptual validation. 2. Software-side Jitter Suppr ession. The theoretical alignment error ( 0 . 37 ± 0 . 23 ms) is an algorithmic figure; the practical precision is bounded by the instrument’ s ∼ 1 ms scanning resolution. The model’ s principal value is its readiness for instrument-specific calibration. 3. Both nwVSS and range-normalized weights of nwVSS are reported; velocity dominance (97.22% in wVSS) was deriv ed from a high-density condition of four-v oices (half split vali- dated; Section 4.3.4). nwVSS provides a robustness check against scale-dri ven dominance; the nwVSS weights sho wed greater v ariance in half split (8.41 percentage points) than the wVSS (0.37), so normalized weights are more sensitiv e to the composition Generalization to other densities, voice counts, or pitch distrib utions remains open. 4. Platform specificity . The frame work tar gets Y amaha Disklavier; adaptation to other auto- mated pianos requires characterizing Platform specific latency curv es. 5.6 Future W ork: Psychoacoustic V alidation Protocol The computational saturation zone identified at 24–30 notes/s serv es as a pr edictive model for the analysis of the auditory scene at extreme densities. W e propose this CSL as a formal hypothesis for future psychoacoustic v alidation: we predict that the perceptual inflection point from "melody" to "texture" will correlate significantly with this computationally-deri ved boundary . listeners presented with stochastic two-v oice textures at densities spanning 10–60 notes/s will report a qualitati ve shift from perceiving indi vidual melodies to perceiving aggregate te xture within this range. W e propose a design within-subjects with participants N = 30 and 14 density conditions (similar to those in Section 4.3). In each trial, participants hear a 5-second excerpt and make a forced-choice response (“I hear distinct melodies” vs. “I hear a texture”). The resulting psychometric function – the proportion of “texture” responses as a function of density – should exhibit a sigmoidal transition whose inflection point can be compared against the metric-derived saturation point of 30 notes/s. Three outcomes are possible: (a) the perceptual inflection coincides with the metric saturation point, confirming the hypothesis; (b) the perceptual inflection occurs at a dif ferent density , suggesting that the metric captures distributional structure that does not map directly onto perceptual categorization; 22 or (c) no clear sigmoidal transition is observed, indicating that the melodic-to-textural shift is not well described by a single density threshold for piano timbres. A second task addresses pipeline-fidelity perception: participants hear the 74-second canonical composition (Excerpt 3) and press a b utton whenev er they percei ve a “change in musical character . ” The temporal distribution of b utton presses is compared against the actual symbol-transition times- tamps using a permutation test: if the mean absolute deviation between b utton presses and symbol boundaries is significantly smaller than chance (estimated by circularly shifting press times), this confirms that pipeline-lev el distrib utional changes are perceptually salient. Power analysis (Cohen’ s d = 0 . 8 , α = 0 . 05 , power = 0 . 80 ) confirms the suitability of the proposed sample size ( N = 30 ). This protocol is designed to be ex ecutable with existing additional materials (audio) and standard psychoacoustic software (e.g. PsychoPy). 6 Conclusion This paper asked whether compositional intent can be transmitted from grammar symbol to acoustic output through a unified pipeline and what measurable constraints govern each stage of that transmis- sion. Amanous demonstrates that this is achie v able by unifying tempo canons, stochastic distributions, and L-system grammars via distribution-switching within a four -layer architecture. 1. Integration architecture. Distribution-switching produces statistically distinct musical sections ( d = 3 . 70 – 5 . 34 ). Degradation analysis identifies the IOI as the most pipeline- sensitiv e parameter , and ablation confirms each layer’ s non-redundant contribution. 2. Hardwar e abstraction layer . Layer 4 formalizes velocity-dependent latenc y and key reset constraints as integral generative parameters, ensuring that superhuman densities remain actuable on the Disklavier . 3. Operational en velope; The 24–30 notes/s saturation zone marks the density at which single- domain metrics lose sensitivity; beyond it, dynamic stratification becomes the primary lever for inter-v oice dif ferentiation (81.73% nwVSS). 4. Con ver gence point calculus. CP ev ents pro vide a deterministic–stochastic control interface supporting both discrete triggering ( | r | = 1 . 0 ) and continuous modulation ( r = 0 . 907 ), with ϵ offering monotonic con ver gence-resolution control. All reported results are computational. The principal limitations are the absence of perceptual validation and platform specificity to the Y amaha Diskla vier . Future work should test the saturation zone against perceptual thresholds through the proposed psychoacoustic protocol (Section 5.6), extend the latenc y model with instrument-specific calibration data, and explore adaptation to other automated instruments. As an augmented compositional instrument, Amanous enables reproducible, principled navigation of parameter spaces that exceed human physical capacity , bridging the gap between algorithmic abstraction and electromechanical reality . Supplementary materials: https://www.amanous.xyz . Source code: https://github.com/ joonhyungbae/Amanous . References [1] Abdallah, S. and Plumbley , M. (2009). Information dynamics: patterns of expectation and surprise in the perception of music. Connection Science , 21(2-3):89–117. [2] Ablinger , P . (2004). Quadraturen III “W irklichkeit” — studies for mechanical piano. Program notes and technical documentation. First realized January 2004 in collaboration with Winfried Ritsch, IEM Graz. [3] Assayag, G., Rueda, C., Laurson, M., Agon, C., and Delerue, O. (1999). Computer-assisted composition at ircam: From patchwork to openmusic. Computer music journal , 23(3):59–72. [4] Birkhoff, G. D. (2013). Aesthetic measure. In Aesthetic Measur e . Harvard Uni versity Press. [5] Boden, M. A. (2004). The cr eative mind: Myths and mechanisms . Routledge. 23 [6] Bregman, A. S. (1994). Auditory scene analysis: The perceptual or ganization of sound . MIT press. [7] Callender , C. (2014). Performing the irrational: Paul usher’ s arrangement of nancarro w’ s study no. 33, canon 2: \ ( \ sqrt { 2 }\ ). Music Theory Online , 20(1). [8] Collins, N. (2018). " there is no reason why it should ev er stop": Large-scale algorithmic composition. Journal of cr eative music systems , 3:1–25. [9] Collins, N., d’Escriván, J., and Rincón, J. d. (2017). The Cambridge companion to electr onic music . Cambridge University Press. [10] Cowell, H. (1996). New musical r esour ces . Cambridge Univ ersity Press. [11] Gann, K. (1995). The music of conlon nancarro w . Cambridge: Cambridge University . [12] Goebl, W . and Bresin, R. (2003). Measurement and reproduction accuracy of computer- controlled grand pianos. The Journal of the Acoustical Society of America , 114(4):2273–2283. [13] Huron, D. (2001). T one and v oice: A deriv ation of the rules of voice-leading from perceptual principles. Music P er ception , 19(1):1–64. [14] Jordanous, A. (2012). A standardised procedure for e valuating creative systems: Computational creativity e valuation based on what it is to be creati ve. Cognitive Computation , 4(3):246–279. [15] Kaliakatsos-Papakostas, M., Makris, D., Tsougras, C., and Cambouropoulos, E. (2016). Learn- ing and creating novel harmonies in diverse musical idioms: An adaptiv e modular melodic harmonisation system. Journal of Cr eative Music Systems , 1(1). [16] Lattner , S., Grachten, M., and W idmer , G. (2018). Imposing higher-le vel structure in polyphonic music generation using con v olutional restricted boltzmann machines and constraints. Journal of Cr eative Music Systems , 2:1–31. [17] Leonard, M. (1956). Emotion and meaning in music. Chicago: University of Chicago . [18] Loughran, R. and O’Neill, M. (2017). Limitations from assumptions in generative music ev aluation. J ournal of Cr eative Music Systems , 2:1–31. [19] Manousakis, S. (2006). Musical l-systems. Koninklijk Conservatorium, The Hague (master thesis) . [20] McDermott, J. H., Schemitsch, M., and Simoncelli, E. P . (2013). Summary statistics in auditory perception. Nature neur oscience , 16(4):493–498. [21] Nemire, J. A. (2014). Con vergence points in conlon nancarro w’ s tempo canons. Music Theory Online , 20(1). [22] Nierhaus, G. (2009). Algorithmic composition: paradigms of automated music generation . Springer . [23] Prusinkiewicz, P . and Lindenmayer , A. (2012). The algorithmic beauty of plants . Springer Science & Business Media. [24] Ritsch, W . and Graz, A. (2011). Robotic piano player making pianos talk. In Sound and Music Computing Confer ence, P adova, Italy , pages 1–6. [25] Roads, C. (2004). Micr osound . The MIT Press. [26] Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal , 27(3):379–423. [27] Thomas, M. E. (2000). Nancarrow’ s canons: Projections of temporal and formal structures. P erspectives of Ne w Music , pages 106–133. [28] van Noorden, L. (1975). T emporal coherence in the perception of tone sequences. [29] W iggins, G. A. (2006). A preliminary framew ork for description, analysis and comparison of creativ e systems. Knowledge-based systems , 19(7):449–458. [30] W orth, P . and Stepney , S. (2005). Growing music: musical interpretations of l-systems. In W orkshops on Applications of Evolutionary Computation , pages 545–550. Springer . [31] Xenakis, I. (1992). F ormalized music: thought and mathematics in composition . Number 6. Pendragon Press. 24 A Complete Statistical Summary T able 14: Complete statistical results summary . All tests two-tailed, α = 0 . 05 . Comparison N T est Statistic p Effect [CI] Pipeline F idelity (Section 4.1) Melodic coherence 28 t t (26) = 9 . 75 < . 001 d = 3 . 70 [2 . 65 , 4 . 73] Rhythmic coherence 28 t t (26) = 14 . 09 < . 001 d = 5 . 34 [3 . 96 , 6 . 70] IOI dist. type 8 M-W U = 0 . 0 . 036 a Pitch entropy (TS) 8 M-W U = 0 . 0 . 036 d = 9 . 96 [3 . 53 , 16 . 13] Component Ablation (Section 4.1.7) (a) Seq. self-sim. MC 100 h z z = − 1 . 02 . 306 — (a) Same-symbol MC 100 h z z = − 1 . 90 . 057 — (a) Same-symbol RC 100 h z z = − 1 . 00 . 320 — (b) VSS temporal 8+8 M-W U = 64 . 0 < . 001 r = − 1 . 00 ; − 79 . 0% (b) RC same-symbol 8+8 M-W U = 53 . 0 . 028 r = − 0 . 66 ; − 18 . 0% (c) V el.–timing r (lin.) — — r = − 1 . 000 — — (c) V el.–timing r (po w .) — — r = − 0 . 998 — — Har dware Simulation (Section 4.2) V el.-de viation 526 Pearson r = − 0 . 867 < . 001 — Filter error SD 526 paired t — . 000116 − 35 . 2% Coher ence Metrics (Section 4.3) Melodic sat. pt. 14 g M-W U = 1 . 0 . 002 r = 1 . 00 ; SP = 28 . 4 [CI: 23.3–50.0] TS sat. pt. 16 Piecewise R 2 = . 973 — 47 × TS vs. density 16 Spearman ρ = − . 991 < 10 − 10 — Polyphonic TS 600 K-W H = 165 . 22 < 10 − 10 — Multi-constr . VSS 500 K-W H = 1740 . 1 < 10 − 10 +1847% Pitch-vel coupling 500 M-W U = 0 . 0 . 002 +10521% Con verg ence P oints (Section 4.4) CP density switch 30 f M-W U = 0 < 10 − 10 r = 1 . 00 CP TS switch 30 f M-W U = 225 < 10 − 10 r = − 1 . 00 Continuous tracking 30 f Pearson r = 0 . 907 < 10 − 10 — M-W : Mann-Whitney U ; K-W : Kruskal-W allis H . a Cohen’ s d undefined (constant distribution, SD = 0 ). f N reports 1-second analysis windows (15 pre-CP , 15 post-CP for discrete switch). g N = 14 density le vels; 6 pre- vs. 8 post-saturation point. h N = 100 random permutations; z = ( full − µ random ) /σ random , p from standard normal. B Constraint Inequality Summary T able 15: Complete constraint system. Constraint Inequality Source Har dwar e V elocity range v ∈ [0 , 1023] 10-bit XPMIDI Per-k ey rate IOI key ≥ 50 ms K ey reset time Latency bounds 10 ≤ L ( v ) ≤ 30 ms Goebl and Bresin [12] Polyphony N sim ≤ 88 Physical keys Scanning resolution ∆ t min ≈ 1 ms 800–1000 Hz Metric-derived boundaries Coherence sat. pt. ρ agg ≤ 30 notes/s a This study TS sat. pt. ρ agg ≤ 24 . 2 notes/s a This study Stream segre gation ref. ∆ p ≥ 5 semitones Huron [13] a Deriv ed from stochastic two-v oice textures; per -voice density ≈ ρ agg / 2 . These are metric-deriv ed thresholds; perceptual correspondence requires experimental v alidation. 25 C Po wer -law exponent c sensitivity (Layer 4) T o confirm that the compensation behavior of Layer 4 is not tied to a single choice of power -law exponent c , we v aried c ∈ { 0 . 3 , 0 . 4 , 0 . 5 , 0 . 6 , 0 . 7 } and computed the standard deviation of the residual jitter (ms) after compensation, assuming that the true latenc y model has c = 0 . 5 . The residual error per note is L true ( v ) − L c ( v ) in ms ov er the same 526-note e xcerpt. The residual jitter is zero when the compensation exponent matches the true model ( c = 0 . 5 ) and increases smoothly as c deviates (T able 16), sho wing a consistent trend rather than dependence on a single parameter . T able 16: Residual jitter standard deviation (ms) after Layer 4 compensation for varying po wer-la w exponent c ( N = 526 notes, true c = 0 . 5 ). c Residual Jitter SD (ms) 0.3 1.0446 0.4 0.4769 0.5 0.0000 0.6 0.4015 0.7 0.7404 Residual is minimal at calibrated c = 0 . 5 ; trend is consistent across the range. D Latency model mismatch (HAL vs. no correction) T o test whether HAL compensation remains beneficial when the true piano latency deviates from the assumed model, we ran two mismatch simulations (the same 526-note e xcerpt). (1) Exponent mismatch: true latency follo ws a po wer-la w with c true ∈ [0 . 3 , 0 . 35 , . . . , 0 . 7] while HAL uses c = 0 . 5 . The uncorrected jitter is the standard deviation of L true ( v ) ; with-HAL jitter is the standard de viation of L true ( v ) − L 0 . 5 ( v ) . (2) Additive noise: L actual ( v ) = L ( v ; c = 0 . 5) + U ( − w, + w ) ms with w ∈ { 0 , 0 . 5 , 1 , 1 . 5 , 2 } ; 200 trials per w . In ev ery condition, HAL reduces jitter relativ e to the uncorrected baseline (T able 17), so the sensitivity curves sho w that correction is always preferable to no correction despite model error . V irtual real piano simulation. T o explicitly test the claim that the power -law assumption alone yields benefit without perfect calibration, we added a three-condition simulation A “virtual real piano” was defined as the nominal power -law ( c = 0 . 5 , 10–30 ms) with ± 10% per-note multiplicati ve noise and parameter drift (exponent c varying by note index via a bounded random walk). For each of 200 trials, we computed the onset-timing jitter (std, ms) under: (A) no correction (raw MIDI), (B) Amanous HAL with c = 0 . 5 , and (C) ideal correction (perfect match to the virtual model). The results are summarized in T able 18 and Figure 6. The Jitter with HAL (B) was statistically significantly lower than without correction (A): paired t -test p < 0 . 001 , mean difference (A − B) ≈ 2 . 70 ms. Thus, ev en when the true latency de viates from the assumed model by noise and drift, follo wing the general physical law (po wer-law) alone provides a significant impro vement o ver ra w MIDI. T able 17: Latency model mismatch: jitter (ms, SD) with HAL applied vs. uncorrected. In all ro ws, HAL yields strictly lower jitter . Exponent c true Additiv e noise ± w ms c true Uncorr . HAL w Uncorr . HAL 0.30 2.47 1.04 0 3.50 0.00 0.50 3.50 0.00 1.0 3.55 0.58 0.70 4.21 0.74 2.0 3.69 1.15 Uncorr . = uncorrected (no HAL). HAL = compensation with c = 0 . 5 . N = 526 notes; noise case: mean over 200 trials. E ϵ Sensitivity in CP Calculus T o confirm that the Conv ergence Poi nt (CP) switching behaviour is not tied to a single choice of ϵ and to demonstrate that ϵ acts as a compositional parameter for textur e transition density , we ran 26 T able 18: V irtual real piano simulation: onset jitter (ms, mean ± std ov er 200 trials) under (A) no correction, (B) HAL ( c = 0 . 5 ), (C) ideal correction. V irtual piano: power -law c = 0 . 5 with ± 10% per-note noise and parameter drift. N = 526 notes per trial. Condition Jitter (mean ± std, ms) (A) Raw (no correction) 3 . 63 ± 0 . 10 (B) Amanous HAL ( c = 0 . 5 ) 0 . 93 ± 0 . 03 (C) Ideal 0 . 00 P air ed (A) vs. (B): p < 0 . 001 , mean dif f. 2 . 70 ms (HAL lo wer). Figure 6: V irtual real piano simulation: distribution of onset jitter (ms) across 200 trials for (A) raw MIDI, (B) HAL applied, (C) ideal correction. HAL (B) is statistically significantly lo wer than (A) ( p < 0 . 001 ). two analyses. (1) Coarse grid (30 s): ϵ ∈ { 10 , 20 , 50 , 100 } ms (T able 19). (2) Full sweep (60 s): ϵ = 1 , 5 , 10 , . . . , 100 ms (5 ms step); for each ϵ we computed the number of con ver gence ev ents and the inter-e vent intervals (Definition 2). The results are shown in Figure 7. For the rational 3:4 canon (IOI 1.0 s and 0.75 s), the event count is in v ariant across ϵ (exact con ver gences ev ery 3 s; 11 ev ents in 30 s, 21 in 60 s). For the irrational canon e : π , the count increases monotonically with ϵ ; in the full sweep, the correlation between ϵ and the e vent count is r ≈ 0 . 9998 , and the mean interval between e vents decreases from ≈ 45 s at ϵ = 1 ms to ≈ 0 . 58 s at ϵ = 100 ms. Thus, ϵ is not merely an error tolerance, but a continuous control over the frequency with which distribution switches occur . T able 19: Conv ergence-point count and switching frequenc y for varying ϵ ( N = 30 s). ϵ (ms) 3:4 count 3:4 (/s) e : π count e : π (/s) 10 11 0.367 5 0.167 20 11 0.367 11 0.367 50 11 0.367 26 0.867 100 11 0.367 51 1.700 Rational 3:4: count stable across ϵ . Irrational e : π : count increases with ϵ . 27 Figure 7: ϵ sensitivity (full sweep): event count, e vent rate, and mean inter-e vent interval o ver 60 s. Rational 3:4 (blue): count and spacing inv ariant. Irrational e : π (orange): count and rate increase monotonically with ϵ ; mean gap decreases. ϵ thus functions as a compositional parameter for texture transition density . F nwVSS W eights by Density Applying the same nwVSS weight-extraction procedure (Section 4.3.4) to low-density conditions (20 notes / s aggregate) and high-density (120 notes / s aggre gate) yields the weights in T able 20; The weight shift from high to low density is − 7 . 81 percentage points for velocity , +7 . 63 for temporal, and +0 . 18 for pitch. At low density , v elocity dominance weakens and temporal (and pitch) components gain relativ e weight ( weight transfer by density ). The 0.00% temporal weight at high density is interpretable as computational evidence of temporal fusion: at 120 notes/s, the onset structure ceases to carry discriminativ e load for separation, which the metric correctly attributes to velocity and pitch. T able 20: nwVSS weights by aggregate density (same extraction procedure). Condition w pitch (%) w vel (%) w temporal (%) Low density (20 notes/s) 0.45 91.92 7.63 High density (120 notes/s) 0.27 99.73 0.00 W eight shift (low minus high): ∆ w pitch = +0 . 18 , ∆ w vel = − 7 . 81 , ∆ w temporal = +7 . 63 percentage points. G Calibration Dataset A udit Six publicly a vailable Disklavier-related repositories were audited for velocity-latenc y calibration data: MAPS Database (audio-MIDI pairs, no v elocity sweep), Magenta MIDI Dataset (no hardware-specific latency), MAESTRO (no Disklavier-speci fic latency characterization), SMD (score-performance alignment only), V ienna 4 × 22 Piano Corpus (tone quality focus) and R WC Music Database (no 28 systematic latency measurement). Finding : No public dataset pairs MIDI velocity commands with measured acoustic-onset latency for any Disklavier model. Controlled velocity sweeps ( v = 0 , 1 , . . . , 1023 ; multiple repetitions) with sub-millisecond audio capture are recommended. H Hardwar e-A ware Actuation Pipeline Algorithm 2 presents the complete adaptive compensation pipeline, which replaces the simplified correction in Algorithm 1 (Line 11). Algorithm 2 Hardware-A ware Actuation Pipeline (HAL) Require: Event list E , calibration data C (optional), compression γ Ensure: Compensated ev ent list E ′ 1: if C is av ailable then 2: f ← F I T P O W E R L AW ( C ) ▷ Best fit: c ≈ 0 . 5 , RMSE = 0 . 69 ms 3: L ( v ) ← 30 − 20 · f ( v / 1023) 4: skip robustness filter 5: else 6: L ( v ) ← 30 − 20 · ( v/ 1023) ▷ Linear fallback 7: end if 8: E ′ ← ∅ 9: for each e vent ( t, p, v , d ) in E do 10: if C not av ailable and I S L A T E N C Y S E N S I T I V E ( E , t, p ) then 11: ¯ v ← mean velocity in [ t − 25 ms , t + 25 ms ] 12: v ← ¯ v + γ · ( v − ¯ v ) ▷ γ < 1 : compress 13: end if 14: t ′ ← t − L ( v ) / 1000 15: E ′ ← E ′ ∪ { ( t ′ , p, v , d ) } 16: end for 17: return E ′ I Supplementary Materials Supplementary materials (audio of compositions generated algorithmically by the frame work) are av ailable at https://www.amanous.xyz . These materials document the algorithmically generated output and are for reference and future perceptual experiments; the quantitati ve results in this paper do not rely on empirical measurement of these recordings. Source code: https://github.com/ joonhyungbae/Amanous . Supplementary excerpts. The following e xcerpts are referenced in the text and match the materials on the project website: • Excerpt 1: Demonstration of Beyond-human density rendered in Disklavier (34 s). Polyphony (40-note chords), 30 Hz multi-ke y trill, 6-octa ve arpeggio. • Excerpt 2: Phase Music – Minimalist Study (80 s). Reich-inspired phase-shift; pentatonic set, 1:1.01 tempo drift. • Excerpt 3: Canonical ABAAB ABA validation composition (74 s). macro-form of the L-system with deterministic sections ( A ) and textural sections ( B ); 3:4 tempo canon. • Excerpt 4: Con ver gence Point demonstration, 3:4 canon (30 s). Pre-CP sparse/melodic and post-CP dense/ texture switch at t = 15 s. Additional materials: (1) complete MIDI files for all reported e xperiments; (2) ablation experiment scripts with deterministic seeds and JSON/CSV output; (3) analysis reproduction notebooks. 29
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment