A Comparison of Random Forests and Ferns on Recognition of Instruments in Jazz Recordings


In this paper, we first apply random ferns for classification of real music recordings of a jazz band. No initial segmentation of audio data is assumed, i.e., no onset, offset, nor pitch data are needed. The notion of random ferns is described in the paper, to familiarize the reader with this classification algorithm, which was introduced quite recently and applied so far in image recognition tasks. The performance of random ferns is compared with random forests for the same data. The results of experiments are presented in the paper, and conclusions are drawn.


💡 Research Summary

The paper investigates instrument recognition in jazz band recordings without relying on any prior segmentation such as onset, offset, or pitch extraction. Audio is processed in short, overlapping frames of 40 ms with a 10 ms hop size, and each frame is described by a 91‑dimensional feature vector. The feature set consists mainly of MPEG‑7 low‑level descriptors: spectral flatness (25 dimensions across selected frequency bands), spectral centroid, spectral spread, overall energy, 13 mel‑frequency cepstral coefficients (MFCCs), zero‑crossing rate, roll‑off frequency, and a set of “difference” features that capture the change between two 30 ms sub‑frames within the main frame. This rich representation allows the authors to avoid any higher‑level preprocessing such as pitch tracking or source separation.
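The framing scheme above (40 ms windows, 10 ms hop) and one of the simpler descriptors, zero-crossing rate, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and the 16 kHz test tone are my own choices.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=40, hop_ms=10):
    """Split a mono signal into overlapping frames (40 ms window, 10 ms hop)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop:i * hop + frame] for i in range(n)])

def zero_crossing_rate(frames):
    """Fraction of consecutive-sample sign changes within each frame."""
    signs = np.sign(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Hypothetical test signal: a 1 s, 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x, sr)   # each row is one 640-sample analysis frame
zcr = zero_crossing_rate(frames)
```

The other descriptors (spectral flatness, centroid, spread, MFCCs, roll-off) would be computed per frame in the same loop-free fashion to assemble the 91-dimensional vector.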

Two ensemble classifiers are compared: Random Forests (RFo) and Random Ferns (RFe). RFo follows the classic Breiman approach: each tree is grown on a bootstrap sample of the training set, K ≈ √P attributes are randomly selected at each node, and the best split is chosen using the Gini impurity criterion. Trees are grown to maximal depth without pruning, and classification is performed by majority voting. The training cost of a forest is O(Nₜ·N₀·log N₀·K) and the classification cost is O(Nₜ·hₜ), where Nₜ is the number of trees, N₀ the number of training objects, and hₜ the average tree height.
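A forest with exactly these properties (bootstrap sampling, K = √P attributes per node, Gini splits, unpruned trees, majority voting) maps directly onto scikit-learn's `RandomForestClassifier`. The sketch below uses synthetic stand-in data rather than the paper's audio features, and fewer trees than the 1 000 the authors train, purely for speed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for 91-dimensional frame features (not real audio data):
# only the first two attributes carry the class signal.
X = rng.normal(size=(600, 91))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Breiman-style forest as described: bootstrap bags, K = sqrt(P) ~ 9 of 91
# attributes per node, Gini criterion, fully grown trees, majority voting.
forest = RandomForestClassifier(
    n_estimators=100,      # the paper uses 1000 trees; fewer here for speed
    max_features="sqrt",
    criterion="gini",
    bootstrap=True,
    random_state=0,
).fit(X[:500], y[:500])

acc = forest.score(X[500:], y[500:])
```

Growing trees to maximal depth is scikit-learn's default (`max_depth=None`), so no extra parameter is needed for the no-pruning behaviour.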

RFe, originally introduced for image object detection, consists of a collection of “ferns”. A fern is a simplified binary decision tree of fixed depth D in which all nodes at the same depth share the same splitting criterion, so a fern can be viewed as a D‑dimensional binary array of class‑conditional leaf distributions. Training proceeds by drawing N_f bootstrap bags, randomly selecting D attributes and thresholds for each fern, and populating the leaf histograms with a Dirichlet prior (one count added per class). The training complexity is O(2·D·N_f·N₀) and the classification cost O(D·N_f), both linear in the number of ferns and in the depth, and typically much lower than for Random Forests.
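The structure described above can be made concrete in a compact sketch: each fern draws D random (attribute, threshold) pairs, a sample's D binary test outcomes index one of 2^D leaves, and leaves hold add-one-smoothed class histograms whose log-probabilities are summed across ferns. This is my own minimal reconstruction under those assumptions, not the authors' implementation (which, per the paper, uses 1 000 ferns of depth 10):

```python
import numpy as np

class RandomFerns:
    """Minimal random-ferns sketch: D shared binary tests per fern,
    2**D leaves of add-one-smoothed class histograms, log-sum voting."""

    def __init__(self, n_ferns=50, depth=10, seed=0):
        self.n_ferns, self.depth = n_ferns, depth
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n, p = X.shape
        # One random (attribute, threshold) pair per level of each fern.
        self.feats = self.rng.integers(0, p, (self.n_ferns, self.depth))
        lo, hi = X.min(axis=0), X.max(axis=0)
        self.thr = self.rng.uniform(lo[self.feats], hi[self.feats])
        # Leaf histograms start at 1 per class: the Dirichlet (add-one) prior.
        self.hist = np.ones((self.n_ferns, 2 ** self.depth, len(self.classes_)))
        idx = self._leaf_index(X)
        cls = np.searchsorted(self.classes_, y)
        for f in range(self.n_ferns):
            bag = self.rng.integers(0, n, n)      # bootstrap bag per fern
            np.add.at(self.hist[f], (idx[f, bag], cls[bag]), 1)
        self.hist = np.log(self.hist / self.hist.sum(axis=2, keepdims=True))
        return self

    def _leaf_index(self, X):
        bits = (X[:, self.feats] > self.thr).astype(int)       # (n, F, D)
        weights = 1 << np.arange(self.depth)                   # bit weights
        return np.tensordot(bits, weights, axes=([2], [0])).T  # (F, n)

    def predict(self, X):
        idx = self._leaf_index(X)
        scores = sum(self.hist[f, idx[f]] for f in range(self.n_ferns))
        return self.classes_[np.argmax(scores, axis=1)]

# Usage on toy data (labels depend only on the first attribute):
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)
model = RandomFerns(n_ferns=100, depth=5, seed=1).fit(X[:300], y[:300])
acc = np.mean(model.predict(X[300:]) == y[300:])
```

Because classification is just D threshold comparisons and one table lookup per fern, the cost structure matches the O(D·N_f) figure quoted above.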

For the experimental setup, the authors built a binary‑classifier “battery” for each of four target instruments (clarinet, trombone, trumpet, sousaphone). Training data were generated by mixing isolated instrument recordings (from the McGill, Iowa, and RWC databases) in random combinations of 1–4 instruments, normalizing each mix to unit RMS, and creating 3 000 positive and 3 000 negative examples per instrument. RFo used 1 000 trees with the default K = √P ≈ 9; RFe used 1 000 ferns of depth D = 10.
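The mix-and-normalize step of the training-data generation reduces to a few lines. A hedged sketch, with random noise standing in for the isolated instrument recordings (the real data come from the McGill, Iowa, and RWC databases):

```python
import numpy as np

def mix_to_unit_rms(tracks):
    """Sum isolated instrument tracks and scale the mix to unit RMS,
    mirroring the training-data normalization described above."""
    mix = np.sum(tracks, axis=0)
    rms = np.sqrt(np.mean(mix ** 2))
    return mix / rms if rms > 0 else mix

rng = np.random.default_rng(0)
n_sources = rng.integers(1, 5)       # random combination of 1-4 instruments
# Hypothetical stand-ins for one second of isolated recordings at 16 kHz.
tracks = rng.normal(scale=0.3, size=(n_sources, 16000))
mix = mix_to_unit_rms(tracks)
```

Normalizing every mix to the same RMS keeps the binary classifiers from keying on overall loudness rather than instrument timbre.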

Testing employed real jazz band recordings: “Mandeville” (Paul Motian), “Washington Post” (John Philip Sousa, arranged by Matthew Postle), and two movements of “Stars & Stripes Forever”. Ground truth was obtained by manually labeling each instrument track. Evaluation metrics were precision, recall, and F‑score, all weighted by the RMS energy of each frame to reduce the influence of low‑energy (noisy) frames. Ten independent repetitions of the whole training‑testing pipeline were performed to assess stability.
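The RMS-weighted evaluation amounts to computing weighted true/false positive and negative masses before forming the usual ratios. A small sketch of that idea (the function name and toy inputs are mine, not the paper's):

```python
import numpy as np

def rms_weighted_prf(y_true, y_pred, frame_rms):
    """Precision, recall, and F-score with each frame weighted by its RMS
    energy, so low-energy (noisy) frames contribute less to the score."""
    w = np.asarray(frame_rms, dtype=float)
    tp = np.sum(w * (y_true & y_pred))
    fp = np.sum(w * (~y_true & y_pred))
    fn = np.sum(w * (y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy example: the one false negative and one false positive both fall on
# low-energy frames, so the weighted scores stay high.
y_true = np.array([1, 1, 0, 0], dtype=bool)
y_pred = np.array([1, 0, 1, 0], dtype=bool)
rms = np.array([1.0, 0.1, 0.1, 1.0])
p, r, f = rms_weighted_prf(y_true, y_pred, rms)
```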

Results show that Random Forests achieve higher precision (e.g., 92.7 % vs 88.4 % on “Mandeville”), while Random Ferns obtain slightly better recall (e.g., 73 % vs 67 % on the same piece). Overall F‑scores are comparable across all pieces, with RFe slightly ahead on “Washington Post” (77 % vs 75 %) and RFo marginally better on “Stars & Stripes 2” (78 % vs 76 %). Instrument‑specific analysis indicates that sousaphone and trumpet are consistently recognized with high accuracy, whereas trombone yields lower precision across recordings.

From a computational perspective, RFe trains considerably faster and requires fewer memory accesses during classification, making it attractive for deployment on low‑power mobile or embedded devices. The authors argue that despite the modest differences in accuracy, the efficiency gains of Random Ferns justify their use in real‑time or large‑scale indexing applications where latency and power consumption are critical.

In summary, the paper successfully adapts Random Ferns—originally an image‑recognition tool—to the audio domain, demonstrates that they can match Random Forests in recognition performance while offering superior computational efficiency, and provides a solid baseline for future work on multi‑instrument, real‑time music information retrieval in resource‑constrained environments.
