Automatic recognition of element classes and boundaries in the birdsong with variable sequences

February 23, 2026


📝 Abstract

Research on sequential vocalization often requires analyzing vocalizations in long continuous recordings. In developmental studies or studies across generations, where days or months of vocalizations must be analyzed, methods for automatic recognition are highly desirable. Although automatic speech recognition has been studied intensively for applied purposes, blindly applying those methods for biological purposes may not be optimal. This is because, unlike human speech recognition, the analysis of sequential vocalizations often requires accurate extraction of timing information. In the present study we propose automated systems suited to recognizing birdsong, one of the most intensively investigated sequential vocalizations, focusing on three properties of birdsong. First, a song is a sequence of vocal elements, called notes, which can be grouped into categories. Second, the temporal structure of birdsong is precisely controlled, meaning that temporal information is important in song analysis. Finally, notes are produced according to certain probabilistic rules, which may facilitate accurate song recognition. We divided the procedure of song recognition into three sub-steps: local classification, boundary detection, and global sequencing, each corresponding to one of the three properties of birdsong. We compared the performance of several different arrangements of these three steps. As a result, we demonstrated that a hybrid model of a deep neural network and a hidden Markov model is effective in recognizing birdsong with variable note sequences. We propose suitable arrangements of methods according to whether accurate boundary detection is needed. We also designed a new measure to jointly evaluate the accuracy of note classification and boundary detection. Our methods should be applicable, with small modifications and tuning, to the songs of other species that share these three properties of sequential vocalization.
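
To make the global sequencing step more concrete, the sketch below shows Viterbi decoding, the standard way to find the most probable state path in a hidden Markov model. It is an illustration under stated assumptions, not the authors' implementation: the per-frame class scores stand in for the output of a local classifier such as a neural network, and the function names (`viterbi`, `segments`), the two-class setup (silence and one note class), and all numbers are hypothetical.

```python
import math


def viterbi(frame_probs, trans, init):
    """Return the most probable class index for every frame.

    frame_probs[t][k]: local score of class k at frame t (assumed classifier output)
    trans[i][j]:       probability of moving from class i to class j between frames
    init[k]:           probability of starting in class k
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(init)
    score = [log(init[k]) + log(frame_probs[0][k]) for k in range(n_states)]
    back = []  # back[t][j] = best predecessor of class j at frame t + 1
    for probs in frame_probs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + log(trans[best_i][j]) + log(probs[j]))
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segments(path):
    """Collapse a frame-label path into (label, start_frame, end_frame_exclusive) runs."""
    runs = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            runs.append((path[start], start, t))
            start = t
    return runs


# Hypothetical example: class 0 = silence, class 1 = a note class.
init = [0.9, 0.1]
trans = [[0.9, 0.1], [0.1, 0.9]]  # "sticky" transitions discourage one-frame flips
frame_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.6, 0.4],
    [0.1, 0.9], [0.2, 0.8], [0.95, 0.05],
]
path = viterbi(frame_probs, trans, init)
print(path)
print(segments(path))
```

Taking the best class frame by frame would label frame 3 as silence and split the note in two. The transition probabilities let the decoder keep that frame inside the note, so the note boundaries fall out of the decoded path as the points where the label changes.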


📄 Content

Title: Automatic recognition of element classes and boundaries in the birdsong with variable sequences.

Short title: Automatic recognition of birdsong.

Authors: Takuya Koumura (1, 2) and Kazuo Okanoya (1, 3, *)

Affiliations:
(1) Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
(2) Research Fellow of Japan Society for the Promotion of Science
(3) Cognition and Behavior Joint Laboratory, RIKEN Brain Science Institute, Saitama, Japan

(*) Corresponding author. E-mail: cokanoya@mail.ecc.u-tokyo.ac.jp


Author summary Many animal species communicate with sequential vocalizations. The clearest example is human speech, in which various meanings are conveyed from speakers to listeners. Other animals also show interesting behaviors that use sequential vocalizations for attracting mates, protecting territories, and recognizing individuals. Studying such behaviors leads not only to an understanding of spoken language but also to elucidation of the mechanisms for precise control of muscle movements and perception of auditory information. In studying animal vocalization, it is not rare that sound data spanning days or months must be analyzed, which creates a need for automatic recognition of vocalizations. In analyzing sequential vocalizations it is often necessary to accurately extract temporal information as well as content. Another consideration is the rule for sequencing vocal elements, according to which variable sequences of vocal elements are produced. In the present study we propose methods suitable for automatic recognition of birdsong, one of the most intensively studied sequential vocalizations. We demonstrated the effectiveness of machine learning in automatically recognizing birdsong with temporal accuracy. We also designed a new method to evaluate the temporal accuracy of the recognition results.
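
As a rough illustration of what it means to score note classes and timing together, the sketch below computes the fraction of recording time on which a predicted labeling disagrees with a reference labeling. This is a simple stand-in chosen for illustration and is not the measure designed in the paper; the function names (`label_at`, `mismatch_rate`), the `(label, onset, offset)` tuple format, and the example values are all assumptions.

```python
def label_at(notes, t):
    """Return the label of the note covering time t, or None for silence."""
    for label, onset, offset in notes:
        if onset <= t < offset:
            return label
    return None


def mismatch_rate(reference, predicted, duration):
    """Fraction of [0, duration] where the two labelings disagree.

    Both a wrong class and a misplaced boundary add mismatched time,
    so one number reflects classification and timing errors jointly.
    """
    # Between two consecutive cut points, both labelings are constant.
    cuts = {0.0, duration}
    for notes in (reference, predicted):
        for _, onset, offset in notes:
            cuts.update((onset, offset))
    cuts = sorted(c for c in cuts if 0.0 <= c <= duration)

    wrong = 0.0
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        if label_at(reference, mid) != label_at(predicted, mid):
            wrong += b - a
    return wrong / duration


# Hypothetical annotations in seconds: (note class, onset, offset).
reference = [("a", 0.10, 0.20), ("b", 0.30, 0.45)]
predicted = [("a", 0.10, 0.22), ("c", 0.30, 0.45)]  # late offset, then a wrong class
print(mismatch_rate(reference, predicted, duration=0.5))
```

In this example the 0.02 s boundary error and the 0.15 s misclassified note both count as mismatched time, giving a rate of about 0.34 over the 0.5 s recording.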

Introduction Sequential vocalizations, in which vocal sounds are produced in sequence, have been the target of a wide variety of research. This is not only because they include human spoken language, but also because they serve as excellent models for precise motor control, learning, and auditory perception.
Birdsong is one of the most complex and precisely controlled sequential vocalizations, and has been widely and intensively studied [1-3]. Birdsong, like most other sequential vocalizations, has several distinct properties. First, a song is usually a sequence of discrete vocal elements (called notes) [4]. Thus, by grouping similar notes into a single class, it is possible to convert songs into symbol sequences of note classes. Notes in a single class are considered to be generated by the same set of commands in motor neurons, which lead to similar patterns of muscle activation and, in turn, to similar acoustic outputs [5-8]. It is also known that auditory stimuli of notes i


This content is AI-processed based on ArXiv data.
