spINAch: A Diachronic Corpus of French Broadcast Speech Controlled for Speakers' Age and Gender

Simon Devauchelle*†, David Doukhan†, Rémi Uro‡, Lucas Ondel Yang*, Valentin Pelloin†, Olympia Imbert-Brégégère†, Véronique Lefort†, Kévin Picard†, Emeline Seignobos†, Albert Rilliard*

*Université Paris-Saclay, CNRS, LISN - Orsay, France; †Institut National de l'Audiovisuel - Paris, France; ‡LIASD, Université Paris 8 - Saint-Denis, France.
simon.devauchelle@universite-paris-saclay.fr, r.uro@iut.univ-paris8.fr, lucas.ondel@cnrs.fr, {ddoukhan,vpelloin,olimbertbregegere,vlefort,kpicard,eseignobos}@ina.fr, albert.rilliard@lisn.fr

Abstract

We present spINAch, a large diachronic corpus of French speech from radio and television archives, balanced by speakers' gender and age (20-95 years old), and spanning 60 years from 1955 to 2015. The dataset includes over 320 hours of recordings from more than two thousand speakers. The methodology for building the corpus is described, focusing on the acoustic quality of the collected samples. The data were automatically transcribed and phonetically aligned to allow studies at the phonemic level. More than 3 million oral vowels have been analyzed to provide their fundamental frequency and formants. The corpus, available to the community for research purposes, is valuable for describing the evolution of Parisian French through its representation of gender and age. The presented analyses also demonstrate that the diachronic nature of the corpus allows the observation of various phonetic phenomena, such as the evolution of voice pitch over time (which does not differ by gender in our data) and the neutralization of the /a/ - /A/ opposition in Parisian French during this period.

Keywords: Speech Corpus, Diachrony, Broadcast News, Parisian French, Gender and Age Bias Evaluation

1.
Introduction

Diachronic changes in speech may be studied longitudinally, focusing on a specific speaker who may represent a given population category, as in Harrington et al. (2000). Other longitudinal studies focus more on individual changes in relation to specific, documented life events (Riverin-Coutlée and Harrington, 2022). Cross-sectional corpora, like the one described by Stuart-Smith (2020), allow for investigating changes at the population level. Stuart-Smith (2020) stratified their speakers according to social gender and age (middle-aged vs. younger) with two dates of recordings, allowing them to study language changes over four levels of date of birth, and introducing a distinction between real- and apparent-time differences. Real-time differences are found with the two dates of recording, and apparent-time differences are obtained by factoring in these recording dates with the actual ages of the speakers to consider their dates of birth.

Studying speech variation across time raises the notable challenge of finding old recordings that can be compared to more recent ones. The scarcity of such resources explains why some studies compare only two groups of speakers, for example one from the 1940s and one from the 1990s in Pemberton et al. (1998). Another important question when selecting a speech dataset relates to the speech style(s) it represents, with read vs. spontaneous speech being known to induce differences at various levels (Hollien et al., 1997).

The way we speak and its evolution over time is influenced by many factors, including contacts between populations (Mufwene, 2007), or the identification of social groups with shared references from media outlets that can shape identity display (Stuart-Smith, 2006; Vigouroux, 2015). Gathering resources capable of some generalization over a population requires a large and complex dataset.
While corpora such as VoxCeleb (Nagrani et al., 2017) or CommonVoice (Ardila et al., 2020) feature thousands of speakers and very large acoustic datasets (over 100k hours), they are mostly synchronic resources and thus not well suited for studying language evolution. Large resources in terms of speaker diversity are a rare feature among the diachronic corpora described in the literature (e.g., Zou et al., 2012, is a relatively large resource, but features very few speakers). Typical diachronic datasets feature dozens of speakers (e.g., Pemberton et al., 1998; Stuart-Smith, 2020), and are generally based on read speech, with notable exceptions (Hollien et al., 1994; Barras et al., 2002). Two diachronic datasets based on broadcast archives were recently described for French (Suire and Barkat-Defradas, 2020; Uro et al., 2022), and feature hundreds of speakers with some gender balance, but have not been made available to the community due to authorship considerations, copyright restrictions, privacy concerns, etc.

In this paper, we present and analyze a new large-scale cross-sectional corpus of French, spINAch, which is made freely available to the research community. The complete corpus is freely available for research purposes at https://www.ina.fr/institut-national-audiovisuel/research/dataset-project#spINAch. The acoustic estimates presented in Section 2.5 are directly available at https://doi.org/10.5281/zenodo.18714702. This corpus comprises audio recordings of more than 2,000 speakers, recorded over a sixty-year-long time span (from the 1950s to the 2010s). The data is composed of excerpts from French radio and television archives from the Institut National de l'Audiovisuel (INA). Archivists prepared a list of potential speakers participating in broadcast shows (focusing on interviews and talk shows) in order to target known individuals.
This allowed a stratification in terms of speakers' Age (between 20 and 95 years old) and Gender, for seven time Periods (a 10-year time span was selected between 1955 and 2015). The recordings, totaling more than 320 hours of speech, were automatically transcribed and force-aligned to allow the extraction of acoustic analyses (formants and fundamental frequency, f0). This dataset (including the audio recordings, their automatic and manual transcriptions, acoustic analyses, and anonymous demographic information about the speakers) is made available to the research community. We present in section 2 the methods used to gather and analyze this large dataset, with details on its composition, the acoustic measurements made, and quality evaluations. Section 3 proposes some preliminary diachronic analyses of the changes that can be observed across this time span for French spoken in national media outlets.

2. Corpus description

The corpus was collected in two iterations (the first being described in Uro et al., 2022) with an identical methodology, except for improvements in state-of-the-art diarization and music detection methods. An evaluation of the extraction methods of the two phases, applied on a subsample, is given in section 2.6.2 to ascertain that the resulting acoustic segments provide comparable linguistic information even though they were obtained through different processes.

2.1. Archive selection

An essential step in building a gender- and age-balanced cross-sectional corpus over sixty years was to spot in the archives' metadata potential target speakers matching the Age and Gender criteria: female and male speakers spread across four age groups (20-34, 35-49, 50-64, and over 64 years old), in equivalent numbers at seven time Periods separated by 10-year time steps (1955-1956, 1965-1966, 1975-1976, 1985-1986, 1995-1996, 2005-2006, 2015-2016).
A target of 30 speakers per Age, Gender, and Period category was set, without duplicates across categories. The construction of such a balanced corpus was only possible thanks to the expertise of INA's archivists, who parsed the television and radio databases to identify potential target speakers. For each period, they selected media with reasonable acoustic quality, such as studio-recorded talk shows free of background noise featuring interactive conversations, and verified that participants had enough speaking time. Achieving this required archivists to review or listen to the collections. By cross-referencing the speaker's birth date with the date of the first broadcast, they estimated the participant's age. The compilation of this corpus involved a back-and-forth process with the archivists' identification work, which helped us fill in the missing speaker categories. Based on INA's documentation databases, archivists identified about 10,000 individuals who matched these characteristics. Difficulties in completing some profiles could not always be overcome, especially for women, and for younger or older persons, from the earliest periods (see Table 2). This under-representation of women in the media is known and well documented (Coulomb-Gully, 2011; Doukhan et al., 2018b).

2.2. Sound-Track extraction

Automated signal-processing routines were applied to obtain single-track WAV files sampled at 16 kHz from the archives, and to discard recordings having undesirable properties. A first decompression step was performed using ffmpeg, leading to up to 5 uncompressed tracks (stereo, mono, or Dolby) from heterogeneous archives encoded with various codecs. Audio tracks were inspected for the presence of a speaking clock: a method used from 1970 to 1990 to embed time-code information in archives using a dedicated audio track (Vallet and Carrive, 2014).
The speaking clock was spotted using a 1000 Hz beep detector, along with hard-coded rules describing the characteristics of its temporal patterns, leading to the exclusion of the corresponding audio tracks. A signal bandwidth estimator, based on the cumulative sum of the long-term spectrogram, was used to discard recordings with a bandwidth below 8 kHz, often corresponding to undesirable archive encoding or transcoding strategies that may result in biased acoustic parameter extraction. Lastly, we used autocorrelation to detect time delays between the remaining audio tracks. When a time delay was detected, we kept only the first track; otherwise, we mixed the remaining tracks.

2.3. Manual speaker identification

The next step, and the most time-consuming one, was to identify whether and when each targeted speaker actually speaks within the raw audio archives. This was a manual process, supported by the voice activity detection (VAD) and the diarization of each archive. Speaker identification was carried out by four authors of this study, resulting in a total involvement of about 40 days. Pre-processed audiovisual archives were presented to annotators using ELAN (Sloetjes and Wittenburg, 2008), displaying the cluster identifiers obtained during the diarization and cleaning process (see section 2.4), synchronized with the archives' audio and video tracks. Annotators were provided with a shared spreadsheet containing a list of archive identifiers and target speakers. The list was enriched with details about the target speaker (gender, age, occupation) to support the identification process. Annotators reported the target speaker's cluster identifier in the spreadsheet.
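The paper gives no implementation details for the bandwidth estimator of section 2.2; the following is a minimal sketch, assuming the effective bandwidth is read off the normalized cumulative long-term spectrum (the Welch-based spectrum and the `energy_ratio` threshold are our assumptions, not the authors' code):

```python
import numpy as np
from scipy.signal import welch

def effective_bandwidth(signal, sr, energy_ratio=0.99):
    """Estimate the effective bandwidth as the frequency below which
    `energy_ratio` of the long-term spectral energy lies."""
    freqs, psd = welch(signal, fs=sr, nperseg=1024)
    cum = np.cumsum(psd)
    cum /= cum[-1]                      # normalized cumulative spectrum
    return freqs[np.searchsorted(cum, energy_ratio)]

# Example: a band-limited signal (tone mixture below 3 kHz) at 16 kHz
sr = 16000
t = np.arange(sr * 2) / sr
narrow = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
bw = effective_bandwidth(narrow, sr)
keep = bw >= 8000  # the paper discards recordings below 8 kHz bandwidth
```

A full-band recording would place almost all of its long-term energy up to the Nyquist frequency, whereas a transcoded, band-limited archive concentrates it much lower, which is what the threshold captures.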
Several criteria were defined along the identification process to reject speakers having undesirable properties, resulting in a manual rejection rate of about 11%: bad acoustic quality of speaker utterances (telephone speech, outdoor recordings, large amounts of background noise or music, strong audio effects), diarization under-segmentation errors resulting in several speakers sharing the same cluster identifier, use of a foreign language or a strong non-French accent, dubbed speaker, homonym speaker having a different age and occupation than the target, retrospective shows broadcasting the voice of the target speaker over several decades resulting in incorrect speaker age estimation, etc.

2.4. Data extraction, cleaning, and transcription

Once a target speaker is identified, all segments obtained from pyannote [v3.1] (Bredin, 2023) diarization corresponding to their voice are extracted, excluding overlapping voice segments. The audio excerpts are submitted to a cleaning procedure because the primary objective of the corpus is studying speech's acoustic characteristics, a process sensitive to the presence of noise or background music. On top of pyannote predictions, we also applied inaSpeechSegmenter (ISS v0.8; Doukhan et al., 2018a) as a voice activity detector and merged its outputs with those from the diarization in order to better identify spoken segments. Outputs from pyannote overlapping (even marginally) non-speech events detected by ISS were discarded. To avoid any acoustic bias from telephone-quality speech segments, we used LIUM SpkDiarization [v8.4.1] (Meignier and Merlin, 2010) to detect and remove them. Some cleaning was already done when applying pyannote and ISS, as they remove what is considered noise or music to keep only speech, but background music may still remain.
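The merging rule above (discarding any diarization output that overlaps, even marginally, a non-speech event) reduces to simple interval logic; a minimal sketch with illustrative data (the segment times are invented, not from the corpus):

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two segments share any time span, even marginally."""
    return a[0] < b[1] and b[0] < a[1]

def clean_segments(speaker_segs: List[Segment],
                   non_speech: List[Segment]) -> List[Segment]:
    """Drop diarization segments overlapping any non-speech event,
    mirroring the pyannote/ISS merging rule described above."""
    return [s for s in speaker_segs
            if not any(overlaps(s, n) for n in non_speech)]

segs = [(0.0, 2.0), (3.0, 5.0), (6.0, 7.5)]   # diarized speaker segments
music = [(4.8, 5.5)]                          # ISS flags music here
print(clean_segments(segs, music))            # (3.0, 5.0) is discarded
```

Note that a whole segment is rejected as soon as it touches a non-speech event; the segment is not trimmed, which is the conservative choice when the goal is unbiased acoustic measurement.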
We use a music detection model and apply a threshold of 0.8 to the ratio of the detected music duration to the segment's total duration. The music segmentation model¹ described in Pelloin et al. (2026) uses music2vec (Li et al., 2022) embeddings and classifies each frame. It obtains a frame-level F1-score of 89.7% on OpenBMAT (Meléndez-Catalán et al., 2019) and 92.0% on Seyerlehner (Seyerlehner et al., 2007), two music detection datasets of TV broadcast content. At the end of the cleaning process, after removing all parts containing noise, music, or overlapping speech, we obtained the diarized speech segments for each target speaker.

These speech excerpts were then fed to Whisper [large-v3] (Radford et al., 2022) in order to obtain a lexical transcription. This transcription was used to force-align its phonetic transcription to the speech signal using the Montreal Forced Aligner (MFA, version 3.0; McAuliffe et al., 2017). An evaluation of the transcription accuracy, both in terms of word error rate and phone error rate, is given in section 2.6.1. Without applying any cleaning (i.e., using raw segments from pyannote and ISS), the total number of phones from the MFA output exceeded 4.6M oral vowels. Vowels with a predicted duration over 200 milliseconds were removed (about 5.1%). After removing unvoiced vowels using the method described in section 2.5, the music detection model further reduces this vowel set by 8.2%. This leads to a set of 3,016,134 vowels available in the clean version of the corpus.

2.5. Acoustic measurements

Several acoustic parameters have been extracted and are provided with the corpus to support phonetic analyses of the speech productions it contains. First, the speech's f0 was estimated for each segment with a 10 ms time step, using two different pitch detection algorithms for robustness (Vaysse et al.
, 2022): the autocorrelation algorithm implemented in Praat (Boersma, 1993; Boersma and Weenink, 2025), and the REAPER estimator (Talkin, 2015). Frames that either of the two algorithms annotated as unvoiced were deemed unvoiced, and frames where the two estimators differ by more than a 20% gross-error difference in their f0 prediction were also marked as unvoiced, because they are possibly unreliable. For the remaining frames (about 79% of the total number of frames), the value estimated by Praat was retained.

Then, the first five formants were estimated along the signal using Praat's implementation of the Burg algorithm, with the same 10 ms time step, and adapting the ceiling parameter for each speaker and vowel category following the strategy proposed by Escudero et al. (2009). The strategy consists of estimating formants for a set of ceilings, and using the one that minimizes the formants' variance for a given vowel category and a specific speaker. In our case, we used a set of twenty ceilings above and below the reference ceiling recommended for female and male speakers in the Praat documentation (respectively 5.5 kHz and 5 kHz), spaced by steps of approximately 50 Hz (for details, see Praat's documentation²). The best ceiling is the one that minimizes the sum of variances observed for the first three formants of all the vowels of a given category for a specific speaker. We used the first three formants, in place of the first two in Escudero et al. (2009), because the third formant is relevant for rounding, which is an important feature of the French vocalic system (e.g., Ménard et al., 2009).

For each formant, the median of all values observed along the middle third of each vowel was kept. Formants are expressed in Hertz and converted to a Bark scale using the equation in Traunmüller (1990).

¹ https://hf.co/ina-foss/ssl-music-detection-music2vec
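The ceiling search and the Bark conversion can be sketched as follows. The per-ceiling formant measurements are assumed to have been produced beforehand (e.g., by repeated Praat Burg analyses); the function names and data layout here are illustrative, not the authors' code, while `hz_to_bark` implements the standard Traunmüller (1990) formula:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Traunmüller (1990): z = 26.81 * f / (1960 + f) - 0.53."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def best_ceiling(per_ceiling_formants):
    """Select the ceiling minimizing the summed variance of F1-F3.

    `per_ceiling_formants` maps a ceiling (Hz) to an array of shape
    (n_tokens, 3) holding F1-F3 for every token of one vowel
    category of one speaker."""
    return min(per_ceiling_formants,
               key=lambda c: np.var(per_ceiling_formants[c], axis=0).sum())

# Illustrative data: the 5000 Hz ceiling yields more stable estimates,
# so the variance criterion should select it.
rng = np.random.default_rng(0)
target = np.array([700.0, 1200.0, 2500.0])          # nominal F1-F3 of one vowel
stable = target + rng.normal(0, 10, (50, 3))
noisy = target + rng.normal(0, 80, (50, 3))
ceiling = best_ceiling({5000: stable, 5500: noisy})  # -> 5000
```

The variance criterion exploits the fact that a mismatched ceiling makes the Burg analysis jump between candidate poles, inflating the spread of the estimates for an otherwise homogeneous vowel category.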
For f0, the median of all valid values observed along the vowel was considered. The f0 values are expressed in Hertz and in semitones (relative to 1 Hz).

2.6. Quality evaluations

2.6.1. Evaluation of automatic transcription and forced alignment

A crucial aspect of this corpus construction lies in its fully automatic transcription and subsequent phonetic alignment, which enable the study of the acoustic characteristics of vowels over time. To evaluate this process, one hour of speech was randomly sampled from the total corpus, representing 245 speakers spread across the seven periods. These samples were manually transcribed by four L1 French speakers to reflect their full speech content. These human-made transcriptions were then submitted to the same MFA-based forced alignment to obtain a phonetic version. The primary concern regarding transcription quality was the possibility of Whisper adding lexical items that were not actually pronounced, a risk that may be increased in older recordings, which are acoustically uncommon and therefore less likely to have been included in Whisper's training corpus. The automatic transcription quality is evaluated using word error rate (WER) and phone error rate (PER).

After text normalization, the WER is 11.7 using Whisper large-v3. This score is one point higher than that reported for CommonVoice 15 (Ardila et al., 2020) on the French subset. The phone-level evaluation is a crucial component for the acoustic analysis of the vowels detailed in section 3, given the nature of our dataset (recorded speech from interviews in which speakers may repeat themselves and exhibit disfluencies or hesitations).

² https://www.fon.hum.uva.nl/praat/manual/FormantPath.html
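Both metrics are standard Levenshtein-based error rates over token sequences (words for WER, phones for PER); a minimal pure-Python sketch, not the evaluation code actually used:

```python
def edit_ops(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    turning `ref` into `hyp` (dynamic-programming edit distance)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]

def error_rate(ref, hyp):
    """WER/PER in %: (S + D + I) / number of reference tokens."""
    return 100.0 * edit_ops(ref, hyp) / len(ref)

# One substitution out of three words, i.e. roughly 33.3% WER
print(error_rate("le chat dort".split(), "le chien dort".split()))
```

Applied to phone sequences instead of word sequences, the same function yields the PER reported in the next paragraphs.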
The results indicate that the PER (substitutions, deletions, and insertions summed and divided by the total number of phones) reached 7.74, with a phone-level precision of 93.7%, discarding the risk of major insertions. When focusing only on vowels, the PER degraded by about three points, reaching 10.26. The most common errors made by Whisper were deletions, which represent 64.39% of phone-level errors. On the evaluation subset, 5.45% of vowels were deleted during the automatic alignment process, compared to the manual transcriptions. The three most deleted vowels are /œ/ (58%) and /ø/ (22%), followed by /@/ (9%). Hesitations, expressed in French by words like "heu" or "euh" (phonetized as /œ/ or /ø/), represent 28% of the deleted vowels. The next 5% derive from the conjunction "et" (/e/), followed by the words "de" and "que" (4.9% and 3.5%); these words are likely used in repetitions and hesitations and were only transcribed once by Whisper. Among all vowel classes, the vowel /@/ is the most frequently inserted and mostly derives from the negation word "ne" (23% of /@/ insertions), which is generally not produced in spoken French (Abeillé and Godard, 2021). Whisper tends to add it, resulting in more formal lexical predictions.

2.6.2. Comparison of extraction methods

Given the size of the corpus, it was produced in two phases (in 2021 and 2025), separated by four years. The processing for the first version prioritized speech quality when identifying speakers, at the risk of missing potential targets and thereby lowering recall. In addition, more up-to-date software has since been released, so the automated processing of archives differs slightly between the two phases, mainly in the algorithms employed (for details on the first version, see Uro et al., 2022). This notably explains why there are more speakers in the years 1965, 1985, and 2005, as the newer version was more efficient.
In order to evaluate whether these two processing methods introduced a qualitative difference in the distribution of the acoustic characteristics of phonemes, an evaluation was necessary, as both processing methods end up with comparable, but not identical, speech segments out of the same original recording. The main differences introduced by the two processings are (i) different diarization of the archive, and (ii) possibly different transcription and alignment, as the cleaning process (removing background music, noises, etc.) was performed with different algorithms. Our evaluation thus focuses on verifying that the phonetic characteristics of a given speaker (in our case, the distribution of their formants for oral vowels) are comparable across extraction methods. That is, the /a/s (and other vowels) extracted with the two methods should have comparable acoustic characteristics for the same speaker. The question here is to verify, for a subset of speakers from the first phase (Uro et al., 2022), whether the formants measured on vowels detected using the new processing are equivalent to those measured on vowels detected by the initial method. Even if the two methods give slightly different sets of vowels (in terms of the number of vowels observed for a given speaker), on a sufficiently long dataset, the formants of each vowel category should have comparable distributions within each speaker, across processing methods, provided the extraction method in itself does not bias the phonetic content.

To evaluate the possibility of a bias linked to the processing chain, eight speakers were randomly drawn from each of the eight categories formed by Period (1955-1956, 1975-1976, 1995-1996, 2015-2016) and Gender (Female, Male) processed in the first phase, for a total of 64 individuals (8 speakers from 8 categories).
The new extraction method was applied to the 64 archives of the initial corpus to extract and transcribe the target speakers' voices a second time. To these two sets of extracted vowel segments, obtained with the initial software (hereafter Method [1]) and the newer software (hereafter Method [2]), the same phoneticization pipeline and the same f0 and formant detection process (see section 2.5) were applied. An equivalent number of vowels, approximately 90,000, is obtained by both Methods. Details regarding the number of occurrences obtained for each oral vowel are presented in Table 1.

Phone   Method [1]   Method [2]
[i]         13,338       14,194
[e]         13,132       13,814
[E]         13,483       14,317
[a]         18,140       19,313
[A]            769          828
[O]          5,930        6,332
[o]          2,960        3,172
[u]          4,146        4,511
[y]          4,721        4,961
[ø]          2,652        2,799
[@]          8,705        9,167
[œ]          1,165        1,257
Total       89,141       94,665

Table 1: Number of occurrences for each vowel category for the two segment sets obtained by the two extraction Methods on the same 64 speakers.

A linear mixed model was then fit to the values of each of the first three formants, with the extraction Method ([1]/[2]) as the main predictor, controlled for Gender and Vowel category, and using Periods and Speakers (nested within gender and period) as random variables. This is formalized in formula (1), which follows R's lme4 syntax (R Core Team, 2024; Bates et al., 2015), where z_i are the standardized values of one of the first three formants in Bark, M is the Method ([1] or [2]) used for processing the archives, G the Gender of a given speaker, V the Vowel class (12 levels, see Table 1), P the Period (4 levels), and S the index of the Speaker producing a given phone:

z_i ~ M + G + V + (1 | P/G/S)    (1)

The tables summarizing the factors (fixed and random) of the three regression models (one for each formant) are given in Appendix A (Section 8.1).
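The models were fitted in R with lme4; for readers working in Python, the nested term (1 | P/G/S) expands to (1|P) + (1|P:G) + (1|P:G:S) and can be approximated with statsmodels variance components. The sketch below only builds such a model on simulated data; all values and the component names "PG" and "PGS" are illustrative, not the authors' code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "z": rng.standard_normal(n),                 # standardized formant (Bark)
    "M": rng.choice(["1", "2"], n),              # extraction Method
    "G": rng.choice(["F", "M"], n),              # Gender
    "V": rng.choice(["a", "i", "u"], n),         # Vowel class
    "P": rng.choice(["1955", "1975", "1995", "2015"], n),  # Period
    "S": rng.integers(0, 16, n).astype(str),     # Speaker index
})
# groups="P" gives the (1|P) intercept; the variance components then
# approximate the nested (1|P:G) and (1|P:G:S) intercepts of formula (1).
model = smf.mixedlm("z ~ M + G + V", df, groups="P", re_formula="1",
                    vc_formula={"PG": "0 + C(G)", "PGS": "0 + C(S)"})
print(model.exog_names)  # fixed effects: Intercept plus M, G, V dummies
# model.fit() would then estimate the model, analogous to lmer in R
```

The key point carried over from the paper is the structure: Method, Gender, and Vowel as fixed effects, with random intercepts nested from Period down to Speaker.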
The models showed there were no significant differences introduced in the distribution of these formants by the extraction Method (for F1: χ²(1) = 0.084, p = 0.772; for F2: χ²(1) = 0.96, p = 0.327; for F3: χ²(1) = 0.6147, p = 0.433), while the other parameters had, as expected, major effects on the formant values: primarily the Vowel categories, but also the Gender, along with a large inter-speaker variability (cf. Tables 4 & 5). This result confirms that the extraction method employed here is robust to the evolution of diarization, speech-to-text, forced alignment, and music detection algorithms. This is fortunate given the surge in the use of comparable frameworks in phonetic studies (e.g., Ballier and Méli, 2024; Coats, 2025; Christodoulidou et al., 2025). We believe this comparison of the two methods supports that the extracted vowels from all Periods of the corpus should give coherent information, regardless of how they were segmented. We consider the data extracted using these two methods to be comparable in terms of phonetic distribution, which justifies prioritizing improvements to the processing pipeline by incorporating state-of-the-art advances in the second phase.

2.7. Summary of corpus features

Table 2 presents the distribution of the data in terms of speaker count and recording duration across the seven periods, with columns presenting gender and rows presenting age categories. A sample of oral vowel formants estimated using the procedure described above is shown in Figure 1, resulting in well-known vocalic triangles. The distribution of cleaned oral vowels is reported in the table in Appendix B (Section 6).³

Figure 1: Plot of vowels' F1 and F2 (in Bark) values obtained from a sample of 240 speakers in the spINAch corpus, balanced in terms of period and gender. Median values by vowel category (encircled in white) and for each vowel (color points) are shown for both genders.
Details about the whole phone dataset and its categories (including nasal vowels and consonants) are given in Appendix C (Section 7).

3. Data Analysis

This section presents a series of preliminary analyses of the corpus data, highlighting its versatility and showing how diachronic data reveal new insights into language description.

³ Acoustic estimates from the aligned oral vowels are hosted at Zenodo at this URL: https://doi.org/10.5281/zenodo.18714702.

3.1. Does voice pitch change with time?

Our voices are important parts of our personalities, as they index aspects of each individual, from their gender to their health (Laver, 1968; Eckert, 2019; Podesva and Callier, 2015). An important component of voice quality relates to its perceived pitch: a lower or higher voice is a central component of gender perception through vocal cues (Leung et al., 2018; Simpson and Weirich, 2020), but also serves a series of interactive functions related to Ohala's Frequency Code (Ohala, 1994). The construction of an individual's voice pitch is mediated by cultural components, notably those related to culturally variable representations of gender (van Bezooijen, 1995; Ohara, 2001). As cultural values evolve with time, the role of women in society has undergone important changes since the end of the Second World War. As voice pitch is (negatively) related to social power in interaction (Ohala, 1994; Spencer-Oatey, 1996; Goudbeek and Scherer, 2010), one may hypothesize that female voice pitch decreases over time. A pitch decrease was claimed by some publications (e.g., Berg et al., 2017, albeit without diachronic evidence), but it is controversial in the literature (e.g., Hollien et al., 1997; Pemberton et al., 1998).
The corpus presented here provides insight into potential changes in vocal characteristics for both genders among the subset of French personalities who are invited to radio and television shows. Thanks to the acoustic measurements detailed in section 2.5, we have a set of f0 measurements, one for each oral vowel annotated as voiced (when both pitch detection algorithms returned coherent measures). There are 3,016,134 f0 measurements, which are the median values observed on the voiced frames of each corresponding vowel. The vowels come from 2,109 speakers, female or male, distributed uniquely across seven Periods as shown in Table 2. The speakers' ages range from 20 to 95.

3.1.1. Methods

We evaluated a potential evolution of f0 across Periods, possibly linked to the speaker's Gender, as a potentially different effect could be expected for female and male speakers, while controlling for changes linked to the speaker's Age (Berg et al., 2017; Gisladottir et al., 2023). Linear mixed-effect regression models were fitted (following Gries, 2021; Crawley, 2013, and using R's lmer library; R Core Team, 2024; Bates et al., 2015) to the median f0 values of each vowel, expressed in semitones and standardized to avoid numerical problems.
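The unit conversion is straightforward; a minimal numpy sketch, where the standardization is plain z-scoring (our assumption for the paper's "standardized" values):

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Semitones relative to `ref_hz` (1 Hz here): 12 * log2(f / ref)."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def standardize(x):
    """Z-score: zero mean, unit variance, to avoid numerical problems."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

st = hz_to_semitones([100.0, 200.0])  # an octave apart -> 12 st difference
z = standardize(st)
```

Working in semitones makes pitch differences comparable across genders: for instance, a 7.7-semitone gap corresponds to a frequency ratio of 2^(7.7/12), roughly 1.56, regardless of the baseline f0.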
The models took as predictors three fixed factors: the speaker's Age (in years, centered around 50 years and divided by 30), their Gender (two levels: "F" or "M"), and the Period of time corresponding to their recording (seven levels: 1955-56, 1965-66, 1975-76, 1985-86, 1995-96, 2005-06, 2015-16). The models also controlled for variation associated with two random factors: the Speaker (2,109 were considered) and the Vowel category (12 levels). As vowel production is speaker-specific, the factor Vowel was nested in Speaker, itself nested in Gender and in Period. The spINAch corpus is cross-sectional, so each speaker belongs to a specific Gender and a specific Period. Gender was nested in Period as gender representation in society may evolve across time. Double interactions between each pair of the three fixed factors were also kept in the model, while the three-way interaction was discarded during a model simplification process (Crawley, 2013), as it was not significant.

Age     1955/6           1965/6            1975/6           1985/6            1995/6            2005/6            2015/6           Total
        F       M        F        M        F       M        F        M        F        M        F        M        F        M
20-34   16/0.32 34/1.42  40/2.85  29/9.81  16/0.44 16/1.41  29/3.96  52/7.88  28/3.59  27/2.94  52/9.97  53/10.59 25/3.71  30/3.76  447/62.65
35-49   21/0.85 70/2.94  37/3.41  37/8.07  24/0.92 37/1.87  48/8.15  63/12.24 33/4.32  44/6.43  57/11.28 58/11.22 36/4.02  53/7.47  618/83.19
50-64   18/0.62 52/2.78  33/4.91  34/11.25 27/2.76 41/2.33  45/6.69  60/13.98 29/5.96  47/6.94  56/11.73 62/12.02 26/3.22  50/5.53  580/90.72
≥65     21/1.82 17/1.11  29/3.49  32/9.25  18/2.28 26/4.55  21/3.97  62/15.21 31/7.82  37/6.13  48/13.88 59/13.20 33/5.14  30/5.55  464/93.40
Total   76/3.61 173/8.25 139/14.66 132/38.38 85/6.40 120/10.16 143/22.77 237/49.31 121/21.69 155/22.44 213/46.86 232/47.03 120/16.09 163/22.31 2109/329.96

Table 2: Number of speakers and duration of recordings in hours (cells show speakers/hours) for each category of Age (rows) by Period and Gender (columns) in the spINAch corpus.
The model considered here follows equation 2 (based on lmer's syntax), where z_f0 stands for the standardized f0, A for Age, G for Gender, P for Period, S for Speaker, and V for Vowel category. The square in equation 2 encodes the two-way interactions among the A, G, and P factors. The ANOVA table for this model is presented in Table 3.

z_f0 ~ (A + G + P)^2 + (1 | P/G/S/V)    (2)

Factors   χ²        Df   Pr(>χ²)
A         6.9508    1    0.0083780  **
G         14.9094   1    0.0001128  ***
P         0.0246    6    0.9999997
A:G       57.6536   1    3.126e-14  ***
A:P       22.8561   6    0.0008461  ***
G:P       0.0593    6    0.9999958

Table 3: Analysis of Deviance Table (Type II Wald χ² tests) obtained for the model fitted to the f0 measurements; the A, G, and P factors correspond to the Age, Gender, and Period factors described in the text.

Figure 2: f0 values (y-axis) estimated by the model for Age (x-axis) across Genders (colors), plotted over points that represent each speaker's mean f0.

3.1.2. Results

Results showed that the speaker's Gender has an (expected) major effect on f0 values, with a mean f0 difference of 7.7 semitones across genders, while, as main factors, Age showed only a limited effect and Period had none. Age is nonetheless, as described in the literature, fundamental to explain f0 changes, but conditioned on the speaker's Gender (cf. the A:G line). Female voices have their pitch lowered with Age (a mean lowering of about 2.8 semitones in 60 years), while the reverse tendency is observed for male voices (a more modest rise of about 0.8 semitones in 60 years; see Figure 2). As for Period, this factor has a limited impact on f0 values. Its effect as a main factor is not significant, but it exhibits a significant interaction with Age. This interaction (not plotted for space reasons) is linked with the evolution of f0 across speakers of different Ages at a given Period.
This effect (controlled for Gender, as discussed above) consists of an increase of f0 with age for the 1955-56 and 1965-66 Periods, while the tendency is reversed starting with the 1985-86 Period, when pitch tends to diminish with age. This means that in the older Periods, as people got older, they tended to raise their pitch (controlling for their gender tendencies), whereas in newer Periods, older individuals tend to lower their pitch. The effect size is comparatively limited with respect to the Age:Gender interaction, but is still interesting, as potentially linked to varying social conditions in the population. A possible explanation may relate to improved life expectancy in more recent periods, linked to better health conditions.

3.2. Evolution of the French vocalic system

One documented change in Parisian French during the twentieth century was an evolution of its vocalic system. Among other variations, the opposition between /a/ and /A/ is no longer productive in the main variant of French spoken in France (for an overview of French diachrony, see Abeillé and Godard, 2021). This vowel shift has already been observed on a series of speech corpora from different periods over a century (Cęcelewski et al., 2024), but on a dataset featuring only male speakers (because of the difficulty in finding female speakers in the archives). We focus here on these two vowels (/a/ and /A/), and not on the complete vocalic system of French, in order to give a simple example of diachronic changes and of the importance of controlling for gender – one of the key features of the spINAch corpus.

3.2.1. Methods

These two vowels were annotated by the MFA algorithm, which distinguishes between the two possible variants in its phonetic dictionary, based on its acoustic model for French.
In the corpus, we get a sample of 623,003 and 28,202 occurrences of /a/ and /A/ respectively, which first shows that they are clearly used unequally in French. We fit two linear mixed-effect regression models, one on each of the first two formants (F1 and F2), expressed in bark and standardized. For F1, after a simplification procedure, the model kept to describe the variation of this formant, controlling for variations linked to individual Speaker (S), Gender (G), Period (P), Age (A), and Vowel category (V), corresponds to Equation 3.

z_F1 ~ (A + G + P + V)^2 + A:P:V + (1 | P/G/S/V)    (3)

For F2, a complex interaction between the speaker's age and the period was found. Following Stuart-Smith (2006), we thus estimated each speaker's birth date, and evaluated the same model but using apparent time in place of Period and Age. After simplification, the model corresponds to Equation 4, where AT refers to the Apparent Time when a given speaker starts learning their phonological system.

z_F2 ~ AT * (G + V) + (1 | G/S/V)    (4)

3.2.2. Results

The model fitted to the first formant (see Table 8 in Appendix D) shows effects of the speaker's gender (χ²(1) = 18.9, p < 1.0e-4) and of the vowel category (χ²(1) = 2618.8, p < 1.0e-5). It also shows interactions between Gender and the vowel category (χ²(1) = 30.6, p < 1.0e-5), between the time Period and the vowel category (χ²(6) = 15.9, p < 0.05), and a triple interaction between Period, vowel, and the speaker's age (χ²(6) = 13.8, p < 0.05). The largest diachronic changes along F1 are dependent on the vowel category, and consist of an F1 decrease from 1955 to 1985, but the differences across the two vowel categories were mostly kept unchanged over time. An increase of the first formant is known to follow jaw aperture (Erickson, 2002).
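Two ingredients of the models above can be sketched briefly: the Hz-to-bark conversion, using Traunmüller's (1990) analytical formula (cited in the references), and the estimation of a speaker's birth year from recording year and age. The exact offset between birth year and the Apparent Time of phonological acquisition is not specified in this sketch, so it returns the birth year only; both function names are illustrative.

```python
def hz_to_bark(f_hz: float) -> float:
    """Traunmueller (1990) analytical approximation of the bark scale:
    z = 26.81 * f / (1960 + f) - 0.53."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def estimated_birth_year(recording_year: int, age_years: int) -> int:
    """Apparent-time analyses index speakers by when they acquired their
    phonology; the birth year is the recording year minus the age."""
    return recording_year - age_years

# 1000 Hz sits near 8.5 bark on this scale
assert abs(hz_to_bark(1000.0) - 8.527) < 0.01
# Consistent with the corpus's oldest birth dates (1870, per the conclusion)
assert estimated_birth_year(1955, 85) == 1870
```

The bark transform compresses high frequencies, so formant differences are weighted closer to auditory sensitivity than raw Hz differences would be.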
Apart from F1's obvious relation to vowel aperture, increased jaw aperture is also related to vocal effort (e.g., Rilliard et al., 2018). The observed diachronic changes in vocalic characteristics along this dimension may thus be linked to an evolution in recording practices in media outlets over this period, with a more declamatory style and microphones placed farther from the mouth in older recordings (see e.g. Boula de Mareüil et al., 2012; Devauchelle et al., 2024). In terms of the two vocalic categories, the differences across /a/ and /A/ along F1 do not change with time: it seems the vowels tagged as /A/ always had larger jaw openings. For the model fitted on the second formant (see Table 9 in Appendix D), we observed an expected effect of vowel category (χ²(1) = 5377.1, p < 1.0e-4), as both phones differ along the antero-posterior dimension. There is also an interaction between the Apparent Time and the Vowel category (χ²(1) = 204.4, p < 1.0e-4) that is represented in Figure 3. It shows that the two vowels' F2 values converge over the twentieth century, so they are statistically comparable for people born around the middle of the century. We also observe a significant interaction between Apparent Time and the speaker's gender (χ²(1) = 11.6, p < 1.0e-3), which is independent of vowel category. Because of a longer vocal tract, males tend to have lower F2 values than females, but male speakers born more recently tend to amplify this difference, once controlled for vowel: they display a lowering F2 trend over Apparent Time that is not observed for female speakers (whose mean F2 values are almost flat with time). This continued lowering of F2 for these two vowel categories along the twentieth century is visible in the analysis proposed by Cęcelewski et al.
(2024), who worked on male-only datasets.

Figure 3: F2 values (y-axis) estimated by the model for Apparent Time (x-axis) for both vowel categories (/a/ and /A/; colors), plotted over points that represent each speaker's mean F2.

The interpretation of a gender-specific tendency supported by our dataset may thus differ from theirs. Lowering F2 is a correlate of a posteriorized articulation, an articulatory setting that tends to lower a voice's pitch, which may be a marker of masculinity (van Bezooijen, 1995) – while the contrary, an anteriorization strategy, was linked by other works to a phonostyle of seduction used by French females (Léon, 1993; Rilliard et al., 2018). So, during the first half of the century, we observe in our dataset a convergence of the two phonetic categories, with the /A/ vowel progressively anteriorized until it merges with /a/; this evolution was not gender-dependent. Once the vocalic system has only one category of open vowel, it has more space for sociophonetic variations – and during the second half of the century, Parisian French-speaking males appear to use this available space to reinforce their voice's masculinity by posteriorizing their articulation of the now single open vowel. Such a gender-specific tendency was not observed in our diachronic data for female speakers (who kept a more anterior articulation, compared to males). Working on a subset of the spINAch corpus, Elie et al. (2024) had already observed gender-exaggerating articulatory tendencies in female and male speakers, who use lip and larynx positions to respectively shorten or lengthen their vocal tract. However, in Elie et al. (2024)'s work, the tendency did not evolve diachronically. Here, the extra vocalic space left by the fusion of /a/ and /A/ appears to have been used for indexical means by male speakers.

4.
Conclusion

In this paper, we present the spINAch corpus, featuring more than 320 hours of speech extracted from radio and television broadcasts over a 60-year period, from speakers selected to balance gender and age distributions. This dataset comprises more than two thousand speakers, represented by speech segments of varying duration, with a median speaker duration of over six minutes (390 s). The speakers' birth dates range from 1870 to 1990. The corpus, which is made available for research purposes [4], contains the audio samples, their automatic transcription, and the phonetic alignment. The speakers' identities are not disclosed, with only non-identifying demographic information retained (age at the time of recording, with a five-year precision, and gender). All audio segments of a given speaker resulting from the diarization process have been assigned a random ID to make it difficult to reconstruct the speaker's argument for potential authorship purposes. The acoustic analyses (f0 and formants) presented in this paper focused on oral vowels. More than 3 million vowels have been analyzed, and these measurements are presented in a separate file to allow interested researchers to study this aspect of the corpus directly [5]. The two rapid analyses of this dataset presented in section 3, beyond their specific findings, showed that the spINAch corpus contains reliable data, as we were able to reproduce several known characteristics of f0 changes with age or of vocalic variation in French. These data enable a variety of phonetic investigations of changes in French as spoken in France's national media outlets over this 60-year period. We hope our efforts to gather this dataset will enable the community to conduct more research on the diachrony of the French language.

5. Acknowledgments

This work was partly funded by ANR "Gender Equality Monitor" (GEM) grant ANR-19-CE38-0012.
We are especially grateful to Pascal Flard at INA for his valuable assistance in recovering part of the archives needed for this work.

[4] Available at https://www.ina.fr/institut-national-audiovisuel/research/dataset-project#spINAch
[5] Available at https://doi.org/10.5281/zenodo.18714702

6. Bibliographical References

Anne Abeillé and Danièle Godard. 2021. La grande grammaire du français. Actes Sud / Imprimerie nationale éditions, Arles [Paris].
Nicolas Ballier and Adrien Méli. 2024. Investigating acoustic correlates of whisper scoring for L2 speech using forced alignment with the Italian component of the ISLE corpus. Pages 20–32.
Claude Barras, Alexandre Allauzen, Lori Lamel, and Jean-Luc Gauvain. 2002. Transcribing audio-video archives. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages I-13–I-16, Orlando, FL, USA. IEEE.
Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.
Martin Berg, Michael Fuchs, Kerstin Wirkner, Markus Loeffler, Christoph Engel, and Thomas Berger. 2017. The speaking voice in the general population: Normative data and associations to sociodemographic and lifestyle factors. Journal of Voice, 31(2):257.e13–257.e24.
Paul Boersma. 1993. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences, 17:97–110.
Paul Boersma and David Weenink. 2025. Praat: doing phonetics by computer [computer program]. Version 6.4.45.
Philippe Boula de Mareüil, Albert Rilliard, and Alexandre Allauzen. 2012. A diachronic study of initial stress and other prosodic features in the French news announcer style: Corpus-based measurements and perceptual experiments. Language and Speech, 55(2):263–293.
Hervé Bredin. 2023.
pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Interspeech 2023, pages 1983–1987.
Polychronia Christodoulidou, James Tanner, Jane Stuart-Smith, Michael McAuliffe, Mridhula Murali, Amy Smith, Lauren Taylor, Joanne Cleland, and Anja Kuschmann. 2025. A semi-automatic pipeline for transcribing and segmenting child speech. In Interspeech 2025, pages 4278–4282.
Steven Coats. 2025. An automatic pipeline for processing streamed content: New horizons for corpus linguistics and phonetics, pages 257–274. De Gruyter.
Marlène Coulomb-Gully. 2011. Genre et médias : vers un état des lieux. Sciences de la société, 83:3–13.
Michael J. Crawley. 2013. The R Book, second edition. Wiley, Chichester, West Sussex, UK.
Juliusz Cęcelewski, Cédric Gendrot, Martine Adda-Decker, and Philippe Boula de Mareüil. 2024. Étude en temps réel de la fusion des /a/ ~ /A/ en français depuis 1925. In Actes des 35èmes Journées d'Études sur la Parole, pages 71–81, Toulouse, France. ATALA and AFPC.
Simon Devauchelle, Albert Rilliard, David Doukhan, and Lucas Ondel Yang. 2024. Variation of Perceived Voice Pitch Across Time Periods, Gender, and Age in French Media Archives. In Valentina De Iacovo, Bianca Maria De Paolis, and Daniela Mereu, editors, The Voice in the Media and New Technologies, volume 12 of Studi Associazione Italiana Scienze della Voce, pages 47–71. Officinaventuno.
David Doukhan, Jean Carrive, Félicien Vallet, Anthony Larcher, and Sylvain Meignier. 2018a. An open-source speaker gender detection framework for monitoring gender equality. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE.
David Doukhan, Géraldine Poels, Zohra Rezgui, and Jean Carrive. 2018b. Describing gender equality in French audiovisual streams with a deep learning approach. VIEW Journal of European Television History and Culture, 7(14):103–122.
Penelope Eckert. 2019. The limits of meaning: Social indexicality, variation, and the cline of interiority. Language, 95(4):751–776.
Benjamin Elie, David Doukhan, Rémi Uro, Lucas Ondel-Yang, Albert Rilliard, and Simon Devauchelle. 2024. Articulatory Configurations across Genders and Periods in French Radio and TV Archives. In Interspeech 2024, pages 3085–3089.
Donna Erickson. 2002. Articulation of extreme formant patterns for emphasized vowels. Phonetica, 59(2–3):134–149.
Paola Escudero, Paul Boersma, Andréia Schurt Rauber, and Ricardo A. H. Bion. 2009. A cross-dialect acoustic description of vowels: Brazilian and European Portuguese. The Journal of the Acoustical Society of America, 126(3):1379–1393.
Rosa S. Gisladottir, Agnar Helgason, Bjarni V. Halldorsson, Hannes Helgason, Michal Borsky, Yu-Ren Chien, Jon Gudnason, Sigurjon A. Gudjonsson, Scott Moisik, Dan Dediu, Gudmar Thorleifsson, Vinicius Tragante, Mariana Bustamante, Gudrun A. Jonsdottir, Lilja Stefansdottir, Gudrun Rutsdottir, Sigurdur H. Magnusson, Marteinn Hardarson, Egil Ferkingstad, Gisli H. Halldorsson, Solvi Rognvaldsson, Astros Skuladottir, Erna V. Ivarsdottir, Gudmundur Norddahl, Gudmundur Thorgeirsson, Ingileif Jonsdottir, Magnus O. Ulfarsson, Hilma Holm, Hreinn Stefansson, Unnur Thorsteinsdottir, Daniel F. Gudbjartsson, Patrick Sulem, and Kari Stefansson. 2023. Sequence variants affecting voice pitch in humans. Science Advances, 9(23):eabq2969.
Martijn Goudbeek and Klaus Scherer. 2010. Beyond arousal: Valence and potency/control cues in the vocal expression of emotion. The Journal of the Acoustical Society of America, 128(3):1322–1336.
Stefan Thomas Gries. 2021. Statistics for Linguistics with R: A Practical Introduction, 3rd revised edition. De Gruyter Mouton, Berlin/Boston.
Jonathan Harrington, Sallyanne Palethorpe, and Catherine Watson. 2000.
Monophthongal vowel changes in Received Pronunciation: an acoustic analysis of the Queen's Christmas broadcasts. Journal of the International Phonetic Association, 30(1–2):63–78.
Harry Hollien, Rachel Green, and Karen Massey. 1994. Longitudinal research on adolescent voice change in males. The Journal of the Acoustical Society of America, 96(5):2646–2654.
Harry Hollien, Patricia A. Hollien, and Gea De Jong. 1997. Effects of three parameters on speaking fundamental frequency. The Journal of the Acoustical Society of America, 102(5):2984–2992.
Alexandra Kuznetsova, Per B. Brockhoff, and Rune H. B. Christensen. 2017. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13):1–26.
John D. M. Laver. 1968. Voice quality and indexical information. British Journal of Disorders of Communication, 3(1):43–54.
Yeptain Leung, Jennifer Oates, and Siew Pang Chan. 2018. Voice, articulation, and prosody contribute to listener perceptions of speaker gender: A systematic review and meta-analysis. Journal of Speech, Language, and Hearing Research, 61(2):266–297.
Y. Li, R. Yuan, G. Zhang, Y. Ma, C. Lin, X. Chen, A. Ragni, H. Yin, Z. Hu, H. He, E. Benetos, N. Gyenge, R. Liu, and J. Fu. 2022. LV-49: MAP-Music2Vec: A simple and effective baseline for self-supervised music audio representation learning. In 23rd International Society for Music Information Retrieval Conference (ISMIR 2022).
Pierre Léon. 1993. Précis de phonostylistique. Parole et expressivité. Nathan Université, Paris.
Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech 2017, pages 498–502. ISCA.
Sylvain Meignier and Teva Merlin. 2010. LIUM SpkDiarization: an open source toolkit for diarization. In CMU SPUD Workshop.
Salikoko S. Mufwene. 2007.
Population movements and contacts in language evolution. Journal of Language Contact, 1(1):63–92.
Lucie Ménard, Sophie Dupont, Shari R. Baum, and Jérôme Aubin. 2009. Production and perception of French vowels by congenitally blind adults and sighted adults. The Journal of the Acoustical Society of America, 126(3):1406–1414.
John J. Ohala. 1994. The frequency code underlies the sound-symbolic use of voice pitch, 1st edition, pages 325–347. Cambridge University Press.
Yumiko Ohara. 2001. Finding one's voice in Japanese: A study of the pitch levels of L2 users. De Gruyter Mouton, Berlin, New York.
Valentin Pelloin, Lina Bekkali, Reda Dehak, and David Doukhan. 2026. Data selection effects on self-supervised learning of audio representations for French audiovisual broadcasts. In Fifteenth International Conference on Language Resources and Evaluation (LREC 2026), Palma, Mallorca, Spain. European Language Resources Association.
Cecilia Pemberton, Paul McCormack, and Alison Russell. 1998. Have women's voices lowered across time? A cross sectional study of Australian women's voices. Journal of Voice, 12(2):208–213.
Robert J. Podesva and Patrick Callier. 2015. Voice quality and identity. Annual Review of Applied Linguistics, 35:173–194.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision.
Albert Rilliard, Christophe d'Alessandro, and Marc Evrard. 2018. Paradigmatic variation of vowels in expressive speech: Acoustic description and dimensional analysis. The Journal of the Acoustical Society of America, 143(1):109–122.
Josiane Riverin-Coutlée and Jonathan Harrington. 2022. Phonetic change over the career: a case study. Linguistics Vanguard, 8(1):41–52.
Adrian P.
Simpson and Melanie Weirich. 2020. Phonetic Correlates of Sex, Gender and Sexual Orientation. Oxford University Press.
Han Sloetjes and Peter Wittenburg. 2008. Annotation by category – ELAN and ISO DCR. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
Helen Spencer-Oatey. 1996. Reconsidering power and distance. Journal of Pragmatics, 26(1):1–24.
Jane Stuart-Smith. 2006. The Influence of the Media, pages 140–148. Routledge.
Jane Stuart-Smith. 2020. Changing perspectives on /s/ and gender over time in Glasgow. Linguistics Vanguard, 6(s1):20180064.
Alexandre Suire and Melissa Barkat-Defradas. 2020. Evolution of human pitch: Preliminary analyses in the French population using INA audiovisual archives of vox pops. In 2020 IASA-FIAT/IFTA Joint Conference.
David Talkin. 2015. REAPER: Robust epoch and pitch estimator.
Hartmut Traunmüller. 1990. Analytical expressions for the tonotopic sensory scale. The Journal of the Acoustical Society of America, 88:97–100.
Rémi Uro, David Doukhan, Albert Rilliard, Laetitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, and Antoine Laurent. 2022. A semi-automatic approach to create large gender- and age-balanced speaker corpora: Usefulness of speaker diarization & identification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3271–3280, Marseille, France. European Language Resources Association.
Félicien Vallet and Jean Carrive. 2014. Quand l'horloge parlante a beaucoup à raconter sur l'évolution des techniques d'archivage audiovisuel. In Journées d'étude sur la parole.
Reneé van Bezooijen. 1995. Sociocultural aspects of pitch differences between Japanese and Dutch women. Language and Speech, 38(3):253–265.
Robin Vaysse, Corine Astésano, and Jérôme Farinas. 2022.
Performance analysis of various fundamental frequency estimation algorithms in the context of pathological speech. The Journal of the Acoustical Society of America, 152(5):3091–3101.
Cécile B. Vigouroux. 2015. Genre, heteroglossic performances, and new identity: Stand-up comedy in modern French society. Language in Society, 44(2):243–272.
Yu Zou, Yan Wang, and Wei He. 2012. Diachronic contrastive analysis on read speech in broadcast news: Evidence from pitch and duration. In 2012 8th International Symposium on Chinese Spoken Language Processing, pages 291–295, Kowloon Tong, China. IEEE.

7. Language Resource References

Ardila, Rosana and Branson, Megan and Davis, Kelly and Kohler, Michael and Meyer, Josh and Henretty, Michael and Morais, Reuben and Saunders, Lindsay and Tyers, Francis and Weber, Gregor. 2020. Common Voice: A Massively-Multilingual Speech Corpus. European Language Resources Association.
Meléndez-Catalán, Blai and Molina, Emilio and Gómez, Emilia. 2019. Open Broadcast Media Audio from TV: A Dataset of TV Broadcast Audio with Relative Music Loudness Annotations. Ubiquity Press, Ltd.
Nagrani, Arsha and Chung, Joon Son and Zisserman, Andrew. 2017. VoxCeleb: A Large-Scale Speaker Identification Dataset. ISCA.
Seyerlehner, Klaus and Pohle, Tim and Schedl, Markus and Widmer, Gerhard. 2007. Automatic music detection in television productions. SCRIME/LaBRI Bordeaux.

8. Supplementary Materials

8.1. A: ANOVA Tables

ANOVA tables for the models presented in section 2.6.2, fitted for the first three formants (expressed in bark and standardized): Table 4 presents the results for the three fixed factors (Method, Gender, and Vowel), while Table 5 presents the effect of the random factors, obtained using single-term deletion with the lmerTest library (Kuznetsova et al., 2017).

8.2. B: Cleaned Vowels Summary Table

See the caption of Table 6.

8.3.
C: Phone Summary Table

See the caption of Table 7.

8.4. D: ANOVA Tables

ANOVA tables for the models presented in section 3.2 for the analysis of diachronic changes in formants F1 (Table 8) and F2 (Table 9).

Model  Factors  χ²      Df  Pr(>χ²)
F1     Method   0.08    1   0.772
       Gender   17.37   1   0.002  **
       Vowel    202660  11  0.000  ***
F2     Method   0.96    1   0.327
       Gender   68.24   1   0.000  ***
       Vowel    287610  11  0.000  ***
F3     Method   0.61    1   0.433
       Gender   49.72   1   0.000  ***
       Vowel    69210   11  0.000  ***

Table 4: Analysis of Deviance Table (Type II Wald χ² tests) obtained for the fixed factors of the models fitted respectively to formants F1, F2, and F3 obtained on vowels segmented with two different processing Methods, controlled for Gender and Vowel category (see section 3).

Model  Deletion      npar  loglik   AIC     LRT      Df  Pr(>χ²)
F1     <none>        18    -182288  364613  -        -   -
       (1|S:(G:P))   17    -194498  389030  24419.6  1   0.00000  ***
       (1|G:P)       17    -182288  364611  0.0      1   0.90799
       (1|P)         17    -182290  364614  2.8      1   0.09284
F2     <none>        18    -166592  333220  -        -   -
       (1|S:(G:P))   17    -173107  346248  13030.3  1   0.00000  ***
       (1|G:P)       17    -166592  333218  0.0      1   0.8749
       (1|P)         17    -166592  333219  0.1      1   0.3931
F3     <none>        18    -187973  375982  -        -   -
       (1|S:(G:P))   17    -209152  418338  42358.0  1   0.0000  ***
       (1|G:P)       17    -187973  375980  1.0      1   0.4542
       (1|P)         17    -187973  375980  0.0      1   1.0000

Table 5: ANOVA-like table for random effects obtained by single-term deletions of the Speaker (S), Gender (G), and Period (P) factors, for the models fitted on each formant (F1, F2, F3). Values of the likelihood ratio test (LRT) are reported for each deleted term, compared to the full model (see section 3).
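The numbers in Table 5 are internally consistent and easy to recompute: AIC = -2·loglik + 2·npar, the LRT statistic is twice the log-likelihood difference between the full and reduced models, and the χ² p-value for 1 degree of freedom follows from the complementary error function. A small sketch checking the F1 rows of Table 5 (small discrepancies come from the rounded log-likelihoods printed in the table):

```python
import math

def aic(loglik: float, npar: int) -> float:
    """Akaike Information Criterion: -2*loglik + 2*npar."""
    return -2.0 * loglik + 2.0 * npar

def lrt(loglik_full: float, loglik_reduced: float) -> float:
    """Likelihood ratio test statistic: 2*(loglik_full - loglik_reduced)."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_sf_df1(x: float) -> float:
    """Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

# F1 model, deletion of the speaker random effect (1|S:(G:P)), from Table 5
stat = lrt(-182288.0, -194498.0)
assert abs(stat - 24420.0) < 1.0            # Table 5 reports 24419.6
assert round(aic(-182288.0, 18)) == 364612  # Table 5 reports 364613 (rounding)
assert chi2_sf_df1(2.8) > 0.05              # the (1|P) deletion is not significant
```

Only the speaker-level random effect carries a large likelihood penalty when deleted, which is why it is the sole starred row for each formant.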
Phoneme   Cleaned Vowels
Oral
[i]       443,795
[y]       148,757
[e]       443,329
[ø]       89,547
[@]       285,534
[E]       473,529
[œ]       36,875
[a]       623,003
[u]       149,453
[O]       191,682
[o]       102,608
[A]       28,202

Table 6: Summary table of cleaned oral vowels returned by MFA after phonetic transcription and forced alignment of the automatic transcriptions (Whisper [large-v3]) (see section 2).

Phoneme   Automatic  Manual
Vowels – Oral
[i]       706,217    2,047
[y]       238,829    766
[e]       675,676    2,118
[ø]       127,524    435
[@]       512,990    1,581
[E]       710,354    2,169
[œ]       52,340     375
[a]       923,020    2,754
[u]       232,540    672
[o]       147,186    416
[O]       262,739    786
[A]       45,190     140
Vowels – Nasal
[ã]       421,404    1,267
[Õ]       256,564    770
[Ẽ]       184,927    551
Consonants – Plosive
[p]       462,424    1,361
[t]       627,664    1,847
[k]       450,392    1,350
[b]       122,218    349
[d]       539,279    1,738
[g]       58,156     172
[c]       122,142    344
[é]       4,274      19
Consonants – Fricative
[f]       175,399    562
[s]       767,359    2,358
[S]       60,638     149
[v]       272,070    804
[z]       115,033    332
[Z]       216,210    672
[K]       912,787    2,781
Consonants – Approximant
[4]       59,972     171
[w]       124,831    419
[j]       175,202    503
[l]       657,357    1,895
[L]       56,563     168
Consonants – Nasal
[m]       429,348    1,279
[n]       275,343    825
[ñ]       39,496     120
[N]       849        3
[mj]      6,990      21
Affricates
[tS]      800        5
[dZ]      439        0
[ts]      302        1

Table 7: Summary table of the number of phones returned by MFA after phonetic transcription and forced alignment of the automatic (Whisper [large-v3]) and manual transcriptions, without any cleaning at this stage (see section 2). Note that MFA uses a fine-grained phonetic transcription that includes some phonetic phenomena such as palatalization and non-standard French transcriptions; we kept its default choices.
Factors  χ²         Df  Pr(>χ²)
A        0.3767     1   0.53938
G        18.9247    1   1.360e-05  ***
P        2.1955     6   0.90087
V        2618.7972  1   < 2.2e-16  ***
A:G      6.5716     1   0.01036    *
A:P      9.4499     6   0.14981
A:V      0.3280     1   0.56685
G:P      0.2253     6   0.99978
G:V      30.6383    1   3.109e-08  ***
P:V      15.8983    6   0.01431    *
A:P:V    13.8443    6   0.03142    *

Table 8: Analysis of Deviance Table (Type II Wald χ² tests) obtained for the fixed factors of the model fitted to F1 obtained on /a/ and /A/ Vowels (V), across Age (A), Gender (G), and Period (P) (see section 3.2 for details).

Factors  χ²         Df  Pr(>χ²)
AT       5.3243     1   0.021030   *
G        0.5006     1   0.479247
V        5377.1017  1   < 2.2e-16  ***
AT:G     11.5903    1   0.000663   ***
AT:V     204.4467   1   < 2.2e-16  ***

Table 9: Analysis of Deviance Table (Type II Wald χ² tests) obtained for the fixed factors of the model fitted to F2 obtained on /a/ and /A/ Vowels (V), along Apparent Time (AT), controlled for Gender (G) (see section 3.2 for details).
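For even degrees of freedom, the χ² survival function has a closed form, which makes the p-values in Tables 8 and 9 easy to recompute: for df = 2k, P(X > x) = exp(-x/2) · Σ_{i<k} (x/2)^i / i!. A sketch checking two df = 6 entries of Table 8:

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Chi-square survival function for even df:
    P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    assert df % 2 == 0 and df > 0
    h = x / 2.0
    return math.exp(-h) * sum(h**i / math.factorial(i) for i in range(df // 2))

# Two fixed-factor entries from Table 8 (F1 model, df = 6)
assert abs(chi2_sf_even_df(2.1955, 6) - 0.90087) < 1e-4   # Period main effect
assert abs(chi2_sf_even_df(13.8443, 6) - 0.03142) < 1e-4  # A:P:V interaction
```

Both recomputed values match the table to the printed precision, a useful sanity check when transcribing statistical tables from PDF sources.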