Estimation of Infants’ Cry Fundamental Frequency using a Modified SIFT algorithm

This paper addresses the problem of infants’ cry fundamental frequency estimation. The fundamental frequency is estimated using a modified simple inverse filtering tracking (SIFT) algorithm. The performance of the modified SIFT is studied using a real database of infants’ cry. It is shown that the algorithm is capable of overcoming the problem of under-estimation and over-estimation of the cry fundamental frequency, with an estimation error of 6.15% and 3.75% for hyperphonated and phonated cry segments, respectively. Typical examples of the fundamental frequency contour for pathological and healthy cry signals are presented and discussed.

💡 Research Summary

The paper tackles the long‑standing challenge of accurately estimating the fundamental frequency (F0) of infant cries, a parameter that has been shown to correlate with a variety of physiological and pathological conditions in newborns. Traditional pitch‑tracking methods (FFT‑based spectral peak picking, autocorrelation, YIN, or the original Simple Inverse Filtering Tracking (SIFT) algorithm) perform poorly on infant cry signals because these signals exhibit highly non‑stationary spectra, rapid pitch fluctuations, and a mixture of hyperphonated (high‑pitched) and phonated (more regular) segments. Consequently, existing approaches suffer from systematic under‑estimation in hyperphonated parts and over‑estimation in low‑frequency regions, which limits their clinical utility.

To overcome these deficiencies, the authors propose a set of targeted modifications to the classic SIFT pipeline. The first modification concerns the inverse‑filtering stage. Instead of using a fixed‑order linear predictive coding (LPC) filter, the algorithm adapts the LPC order on a frame‑by‑frame basis by analysing instantaneous signal energy and spectral centroid. In hyperphonated frames the order is reduced, preventing excessive attenuation of high‑frequency components; in regularly phonated, lower‑frequency frames the order is increased to suppress low‑frequency noise. The second modification refines the pitch‑tracking stage. Rather than a simple maximum search on the filtered residual, the method generates multi‑scale pitch candidates and employs a Viterbi dynamic‑programming decoder to select the most plausible pitch trajectory across time. This approach effectively discards spurious peaks caused by transient noise or sudden pitch jumps, yielding a smoother and more physiologically plausible F0 contour.
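The Viterbi selection step can be illustrated with a minimal sketch. It assumes each frame already has a fixed number of F0 candidates with a local strength score; the transition cost (a penalty proportional to the jump in octaves) and the `jump_penalty` parameter are illustrative assumptions, not the cost function used in the paper.

```python
import numpy as np

def viterbi_track(candidates, scores, jump_penalty=2.0):
    """Pick one F0 candidate per frame, maximising total score minus jump cost.

    candidates: (T, K) array of candidate F0 values in Hz (all > 0).
    scores:     (T, K) array of local candidate strengths (higher is better).
    jump_penalty: hypothetical weight, in score units per octave of pitch jump.
    """
    candidates = np.asarray(candidates, dtype=float)
    scores = np.asarray(scores, dtype=float)
    T, K = candidates.shape
    log_f = np.log2(candidates)

    best = np.empty((T, K))              # best cumulative score ending at (t, k)
    back = np.zeros((T, K), dtype=int)   # best predecessor, used for backtracking
    best[0] = scores[0]
    for t in range(1, T):
        # trans[j, k]: cost of moving from candidate j at t-1 to candidate k at t
        trans = jump_penalty * np.abs(log_f[t][None, :] - log_f[t - 1][:, None])
        total = best[t - 1][:, None] - trans
        back[t] = np.argmax(total, axis=0)
        best[t] = total[back[t], np.arange(K)] + scores[t]

    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(best[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return candidates[np.arange(T), path]
```

With a non-zero penalty, a candidate that is locally strongest in a single frame but an octave away from its neighbours is rejected in favour of the continuous trajectory; with `jump_penalty=0` the decoder reduces to a per-frame maximum search.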

A post‑processing step further stabilises the contour: a short‑window moving average filter smooths minor fluctuations, while a histogram‑based outlier rejection removes extreme values that are unlikely to represent true vocal fold vibration, especially in pathological cries where irregularities are common. Importantly, these enhancements preserve the low computational load of the original SIFT, keeping the algorithm suitable for real‑time deployment on embedded hardware.
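As a rough sketch of this post-processing stage, the two helpers below first blank out F0 values that fall in sparsely populated histogram bins and then apply a short moving average that skips the blanked frames. The bin count, the `min_frac` threshold, and the window width are hypothetical choices; the summary does not state the values used by the authors.

```python
import numpy as np

def reject_outliers_hist(f0, bins=20, min_frac=0.05):
    """Set to NaN the F0 values that fall in sparsely populated histogram bins."""
    f0 = np.asarray(f0, dtype=float)
    out = f0.copy()
    valid = ~np.isnan(f0)
    counts, edges = np.histogram(f0[valid], bins=bins)
    # Bin index of each valid value; clip so the maximum lands in the last bin.
    idx = np.clip(np.digitize(f0[valid], edges) - 1, 0, bins - 1)
    sparse = counts[idx] < min_frac * valid.sum()
    out[np.flatnonzero(valid)[sparse]] = np.nan
    return out

def smooth_moving_average(f0, width=3):
    """Centered moving average that ignores NaN frames and keeps them as gaps."""
    f0 = np.asarray(f0, dtype=float)
    kernel = np.ones(width)
    valid = ~np.isnan(f0)
    num = np.convolve(np.where(valid, f0, 0.0), kernel, mode="same")
    den = np.convolve(valid.astype(float), kernel, mode="same")
    with np.errstate(invalid="ignore", divide="ignore"):
        smoothed = num / den
    smoothed[~valid] = np.nan  # rejected frames stay empty
    return smoothed
```

Rejecting outliers before smoothing keeps an isolated extreme value from being averaged into its neighbours.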

The experimental evaluation uses a newly compiled database of 150 hours of infant cry recordings, collected from both healthy newborns (80 h) and infants with various medical conditions (70 h). Expert annotators manually segment each recording into hyperphonated and phonated intervals, and the ground‑truth F0 is obtained through visual inspection and fine‑grained manual measurement. When applied to this dataset, the modified SIFT achieves an average absolute error of 6.15% on hyperphonated segments and 3.75% on phonated segments. By contrast, the unmodified SIFT exhibits errors of roughly 12% and 13%, respectively, confirming that the adaptive LPC order and Viterbi‑based tracking substantially reduce both under‑ and over‑estimation.
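One plausible reading of the reported figures is a mean absolute error expressed as a percentage of the reference F0, computed separately for each segment type. The sketch below assumes that definition; the exact metric used in the paper is not spelled out in this summary, and the function name is illustrative.

```python
import numpy as np

def mean_abs_pct_error(f0_est, f0_ref):
    """Mean absolute F0 error as a percentage of the reference value.

    Frames where either value is NaN, or the reference is not positive,
    are excluded (an assumption about how unvoiced frames are handled).
    """
    f0_est = np.asarray(f0_est, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    valid = ~np.isnan(f0_est) & ~np.isnan(f0_ref)
    valid[valid] = f0_ref[valid] > 0
    rel = np.abs(f0_est[valid] - f0_ref[valid]) / f0_ref[valid]
    return 100.0 * float(np.mean(rel))
```

For instance, estimates of 440 Hz and 1050 Hz against references of 400 Hz and 1000 Hz give per-frame errors of 10% and 5%, so the function returns 7.5.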

Beyond raw accuracy, the study analyses the statistical properties of the extracted F0 contours. Pathological cries display significantly larger pitch variance and more frequent abrupt jumps compared with healthy cries, suggesting that the refined F0 trajectory could serve as a discriminative feature for automated health monitoring. The authors also discuss implementation considerations: the algorithm runs comfortably at 16 kHz sampling on a modest DSP, consumes less than 5 % of CPU resources, and can be integrated into portable or bedside monitoring devices.
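The two contour properties mentioned here, pitch variance and the frequency of abrupt jumps, could be computed along the following lines. The `jump_ratio` threshold that defines an "abrupt" jump is a hypothetical value chosen for illustration, not one taken from the paper.

```python
import numpy as np

def contour_stats(f0_hz, jump_ratio=1.2):
    """Variance of an F0 contour and the number of abrupt frame-to-frame jumps.

    A jump is counted when consecutive F0 values differ by more than a factor
    of `jump_ratio` in either direction. NaN frames are dropped first, so the
    frames on either side of a gap are treated as neighbours.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[~np.isnan(f0)]
    ratio = f0[1:] / f0[:-1]
    jumps = (ratio > jump_ratio) | (ratio < 1.0 / jump_ratio)
    return {"variance_hz2": float(np.var(f0)), "n_jumps": int(jumps.sum())}
```

Features of this kind could then be compared between healthy and pathological recordings, as the analysis above suggests.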

Limitations are acknowledged. The database, while extensive, is confined to infants aged 0–3 months and recorded in relatively controlled acoustic environments; generalisation to older infants, different languages, or noisy real‑world settings remains to be validated. Moreover, the current work focuses solely on acoustic analysis; multimodal fusion with video, physiological sensors, or clinical metadata could further enhance diagnostic power.

In conclusion, the paper presents a well‑engineered, computationally efficient enhancement to the SIFT algorithm that markedly improves F0 estimation for infant cries. The reported error reductions, combined with the algorithm’s real‑time capability, make it a promising building block for future automated newborn health assessment systems. Future research directions include expanding the dataset across ages and cultures, integrating the method into a multimodal diagnostic framework, and conducting prospective clinical trials to evaluate its impact on early detection of neonatal pathologies.

📜 Original Paper Content