Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction

📝 Abstract

This paper proposes a new non-intrusive model that predicts STOI (Short-Time Objective Intelligibility) scores with high accuracy without the clean reference signal that conventional STOI computation requires. The core idea is to use a bottleneck transformer that combines a Convolution Block, which extracts frame-level features, with Multi-Head Self-Attention (MHSA), which integrates global information effectively. Several input types were tested, including spectral features and embeddings from SSL models (HuBERT, Wav2Vec2). Compared with existing state-of-the-art SSL-based models, the proposed model improved both the correlation coefficients (LCC, SRCC) and the MSE. In particular, it showed robust generalization under speaker, utterance, and noise conditions not used in training (the 'Unseen' scenario).

💡 Deep Analysis

1. Research Background and Motivation

  • Limits of intrusive evaluation: conventional STOI needs a clean reference signal, which makes it hard to apply in real service environments.
  • Existing deep learning approaches: Quality-Net, STOI-Net, MOSA-Net, and similar models use spectral and SSL features, but they lack an architecture that captures global and local information at the same time.

2. Proposed Model Architecture

| Component | Role | Key characteristics |
|---|---|---|
| Conv Block | Reduces and refines the input feature dimension | 1-D Conv × 2, BatchNorm, GELU |
| Bottleneck Transformer | Integrates local (convolution) and global (MHSA) information | 3 stages: Conv → MHSA → Conv, residual connection, 64-dim hidden |
| Dense Blocks | Regress the final STOI score | Global pooling followed by a 2-layer MLP |
| Input features | Spectral (PS-I/II/III), SSL embeddings (Wav2Vec2, HuBERT) | Best combination searched by experimenting with several feature types |

  • Bottleneck design: the Conv layers first shrink the channel dimension substantially, and MHSA then runs in that smaller dimension, which cuts computation considerably while still learning sufficient global context.
  • Residual connection: helps prevent vanishing gradients in deep networks and improves training stability.
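A rough back-of-the-envelope sketch of why running self-attention after the channel reduction is cheaper. The 128 → 64 reduction follows the dimensions given for the model; the 200-frame sequence length and the simplified multiply-accumulate cost model are assumptions made purely for illustration, not figures from the paper.

```python
def attention_macs(seq_len: int, dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Simplified cost model (an assumption, not taken from the paper):
      - Q, K, V and output projections: 4 * seq_len * dim^2
      - attention scores plus weighted sum: 2 * seq_len^2 * dim
    """
    projections = 4 * seq_len * dim * dim
    score_and_mix = 2 * seq_len * seq_len * dim
    return projections + score_and_mix


frames = 200  # hypothetical utterance length in frames
full = attention_macs(frames, 128)     # attention at the input width
reduced = attention_macs(frames, 64)   # attention after the bottleneck

print(f"dim=128: {full:,} MACs")
print(f"dim=64 : {reduced:,} MACs")
print(f"ratio  : {reduced / full:.2f}")
```

Under this cost model the projection term shrinks by a factor of four and the sequence-dependent term by a factor of two, so the saving grows as the projections dominate (short utterances) and shrinks toward one half for very long ones.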

3. Data and Experimental Design

  • Data: four language and domain datasets (Indic TIMIT, LibriSpeech, RESPIN, Bhashini) are used. Various noises, codecs, and clipping are artificially added to 12 h of clean audio to build a range of 'Seen' and 'Unseen' scenarios.
  • Evaluation metrics: MSE, Pearson's LCC, Spearman's SRCC.
  • Baselines: STOI-Net (CNN-BiLSTM-Attention) and the recent SSL-based MOSA-Net/MTI-Net.

4. Main Results

| Condition | LCC (proposed) | LCC (baseline) | SRCC (proposed) | SRCC (baseline) | MSE (proposed) | MSE (baseline) |
|---|---|---|---|---|---|---|
| Seen | 0.93-0.95 | 0.88-0.90 | 0.91-0.94 | 0.85-0.88 | 0.012 | 0.021 |
| Unseen | 0.90-0.92 | 0.82-0.85 | 0.88-0.90 | 0.78-0.81 | 0.015 | 0.028 |

  • Parameter efficiency: across all feature combinations, the proposed model used 30%-45% fewer parameters than the baseline while performing better.
  • Generalization: high correlation was maintained even on the 'Unseen' test, where the speaker, language, and noise combinations do not overlap with training at all, which suggests the model could be applied in real services.

5. Strengths

  1. Efficient combination of global and local information: the bottleneck transformer learns global context without a large increase in computation.
  2. Experiments with diverse input features: spectral features, SSL embeddings, and learnable Conv-extracted features are all evaluated, and the best combination is identified.
  3. Data diversity: multilingual, multi-domain data with compound noise, codecs, and clipping approximates real conditions well.
  4. Lightweight: low parameter count and memory requirements make it applicable to real-time or embedded systems.

6. Limitations and Areas for Improvement

  • Dependence on reference-based STOI: training labels are still computed as STOI against a clean reference, so labels are hard to obtain when no reference exists at all.
  • Noise-type bias: the noise used in the experiments is MUSAN-based and may differ from real field noise (e.g., vehicles, factories).
  • Linguistic and cultural generalization: the work is currently limited to a few languages such as Indian, English, Hindi, and Bengali speech; performance on other languages (e.g., African languages) still needs to be verified.
  • Interpretability: there is little work explaining which frames contribute most to the STOI prediction, for example through visualization of MHSA weights.

7. Future Research Directions

  1. Self-supervised learning without labeling: reduce label dependence by generating pseudo-STOI labels with SSL models or by joint training with multiple tasks (e.g., PESQ, WER).
  2. Domain adaptation: adopt adversarial domain adaptation or meta-learning to adapt quickly to new noise and codec conditions.
  3. Model compression and hardware optimization: apply quantization and pruning to enable real-time inference on mobile and wearable devices.
  4. Stronger interpretability: visualize attention maps, gradient-based saliency, and similar signals to analyze which time-frequency regions the model focuses on.

📄 Full Content

Speech assessment refers to the process of evaluating attributes of a speech signal such as quality and intelligibility. A speech assessment metric quantifies a specific attribute of the signal, and assessment methods fall into two broad groups: **subjective assessment**, which requires human listening, and **objective assessment**, which requires no listeners. Objective assessment is further divided into intrusive and non-intrusive assessment. Intrusive assessment needs a clean reference signal to compute a score, whereas non-intrusive assessment does not. With large-scale datasets a clean reference signal is usually unavailable, so subjective or intrusive assessment is hard to apply. To address this problem, speech intelligibility estimation methods that can stand in for human listening tests and intrusive assessment have been proposed.


Prior Work

  • **[1]** proposed Quality-Net. It takes the spectrogram magnitude as input and feeds it to a bidirectional LSTM (BiLSTM) module. It is trained with a mean squared error (MSE) objective to estimate PESQ scores at the utterance level.
  • **[3]** introduced STOI-Net. It also takes the spectrogram magnitude as input and uses a CNN-BiLSTM-ATTN architecture, that is, a CNN + BiLSTM structure combined with multiplicative attention. It uses the same MSE loss as Quality-Net and showed a higher correlation with the true STOI scores.
  • Later work introduced multi-task settings that predict STI, STOI, human listening test scores, and other targets simultaneously.
  • **MOSA-Net[5]** combines **cross-domain features (spectral and temporal features)** with latent representations from **HuBERT[6], a self-supervised learning (SSL) model**, to predict **objective quality (PESQ)** and **intelligibility (STOI)** together. MOSA-Net predicted PESQ and STOI with high accuracy, and **MTI-Net[7]** was later proposed as an extension that predicts subjective intelligibility (SI), STOI, and WER at once.

MOS prediction has also been an active research area.

  • MOS-Net[8]: estimates speech quality with a CNN-BiLSTM architecture.
  • MB-Net[9]: uses two sub-networks to predict the mean quality score of an utterance and the difference between that mean and the listener's score.
  • QUAL-Net[10]: uses the same features as MTI-Net but extracts them with a simpler CNN structure.

In the medical domain, DNN-based models have also been used to predict hearing aid (HA) assessment metrics such as HASQI[11] and HASPI[12].

  • **MBI-Net[13]**, like MTI-Net, takes spectral features and hearing-loss patterns as input, extracts spectral, learnable filter bank (LFB), and SSL features through two branches, and predicts subjective intelligibility scores.
  • **MBI-Net+[14]** strengthens intelligibility prediction by including HASPI in the objective function. It takes Whisper model embeddings and speech metadata as input and also includes a classifier that distinguishes speech processed by different enhancement methods.

Model Proposed in This Work

For STOI prediction, this paper proposes a Convolution Block → Bottleneck Transformer → Dense Layer architecture.

  1. Convolution Block
    • Extracts and refines the input features.
  2. Bottleneck Transformer
    • Captures short-time and long-time context together while removing redundant information.
  3. Dense Layer
    • Predicts the final STOI score.

In the experiments, the proposed model showed high correlation with the true STOI scores under both the Seen condition (defined in Section V) and the Unseen condition (speakers and utterances not included in training), and it outperformed the baseline model overall.


Organization of the Paper

  • Section II: introduces the datasets used
  • Section III: reviews related work
  • Section IV: details the proposed method
  • Section V: experimental design and results
  • Section VI: future research directions and conclusion

Dataset Construction

Because public datasets that include STOI scores are scarce, a noisy dataset was built from scratch. The source corpora are Indic TIMIT[16], LibriSpeech[17], RESPIN[18], and Bhashini Hindi.

  1. The signal-to-noise ratio of each audio file was assessed with the **LT-SNR[19]** and WADA-SNR[20] measures.

  2. Files with LT-SNR > 16 and WADA-SNR > 80 were classified as clean, and a 12-hour subset was extracted from them.

  3. Various degradations were added to build the noisy dataset. The degradation types are:

    1. White, pink, and brown noise
    2. Indoor and outdoor ambient noise (e.g., cafe, street)
    3. Band-pass filtering (50-2600 Hz)
    4. Transcoding: compression and decoding with different codecs such as mp3, ogg, flac, aiff, and wav (to account for the effect of compression on intelligibility)
    5. Variable-length clipping: the signal is clipped with a threshold set at random within a moving window
    6. Additive noise: noise from MUSAN[22] inserted at random in the 0-20 dB SNR range

    These degradations were applied singly, as mixtures of two types, and as mixtures of three types, giving three kinds of combination in total.
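The construction steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the degradation labels, the helper names, reading the clean-file rule as "both thresholds must hold", and the fixed-threshold `hard_clip` (a simplification of the variable-length, moving-window clipping described) are all hypothetical.

```python
import itertools
import math
import random

# Labels for the six degradation families listed above (names are illustrative).
DEGRADATIONS = [
    "colored_noise", "ambient_noise", "band_pass",
    "transcoding", "clipping", "musan_additive",
]


def is_clean(lt_snr: float, wada_snr: float) -> bool:
    """Selection rule from the text, read here as requiring both thresholds."""
    return lt_snr > 16 and wada_snr > 80


def degradation_combos(max_order: int = 3):
    """Yield every single, pairwise, and triple combination of degradations."""
    for order in range(1, max_order + 1):
        yield from itertools.combinations(DEGRADATIONS, order)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to the requested SNR in dB."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def hard_clip(signal, threshold):
    """Clamp every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, x)) for x in signal]


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]  # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]                       # stand-in for MUSAN noise
snr_db = rng.uniform(0.0, 20.0)  # random SNR in the 0-20 dB range
noisy = mix_at_snr(speech, noise, snr_db)
clipped = hard_clip(noisy, threshold=0.5)
print(len(list(degradation_combos())), "degradation combinations")
```

Whether every pair and triple of degradation types was actually used is not stated; the enumeration only shows how the single, two-type, and three-type groups could be generated.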

STOI scores were computed between each noisy signal and its clean reference using the STOI metric implemented in TorchMetrics Audio, and these scores served as the ground truth for the experiments.


Experimental Setup

  • Indic TIMIT was used for training, validation, and testing, and the model was evaluated with 5-fold cross-validation (2 hours per fold).
  • LibriSpeech, RESPIN (Bhojpuri and Bengali), and Bhashini (Hindi) were used only for Unseen testing, with 2 hours from each.
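One possible way to rotate the five 2-hour folds is sketched below. The text only states that 5-fold cross-validation was used; taking the next fold as the validation set and the remaining three for training is an assumption, and `five_fold_splits` is a hypothetical helper.

```python
def five_fold_splits(n_folds: int = 5):
    """Yield (train_folds, val_fold, test_fold) with the held-out folds rotating.

    Using the fold after the test fold for validation is an assumption;
    the source does not say how validation data was chosen.
    """
    for k in range(n_folds):
        test = k
        val = (k + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, val)]
        yield train, val, test


FOLD_HOURS = 2  # each fold holds 2 hours of audio
for train, val, test in five_fold_splits():
    print(f"train folds {train} ({len(train) * FOLD_HOURS} h), "
          f"val fold {val}, test fold {test}")
```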

Summary of Prior Work

| Work | Main characteristics | Features used | Model structure |
|---|---|---|---|
| MTI-Net[7] | Joint prediction of STOI, WER, and SI | STFT, LFB, HuBERT embeddings | Conv → BiLSTM → Linear (multi-branch) |
| MOSA-Net[5] | Prediction of PESQ, STOI, and SDI | STFT, LFB, SSL (HuBERT) | Conv → BiLSTM → Attention |
| STOI-Net[3] | Non-intrusive STOI prediction | STFT magnitude | Conv → BiLSTM → Attention |
| Whisper-based[23] | Intelligibility prediction for hearing aids | Whisper decoder layers | Transformer-based |
| GESTOI[24] | LFB-based temporal attention | LFB | LFB → Temporal Attention |
| WavLM-based[25] | Intelligibility for hearing loss and hearing aids | WavLM | Avg-pool → Linear |
| XLS-R-based[27] | MOS prediction | XLS-R acoustic features | BiLSTM → Attention → Linear |
| Wav2Vec2-based[28] | Comparison of various fairseq models | Wav2Vec2 | Fine-tune / Zero-shot |

Details of the Proposed Model

1. Input Features

| Type | Description |
|---|---|
| SSL latent features | Projection-layer outputs of Wav2Vec2-small and HuBERT-base |
| Spectral features (PS-I) | 512-point STFT, 32 ms Hamming window, 16 ms hop → 257-dim spectrogram |
| Convolution-derived features (PS-II) | Extracted by passing PS-I through several 1-D Conv layers (following STOI-Net) |
| Multi-Conv features (PS-III) | Applies the Conv structure used in [10] |
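The PS-I settings can be checked with simple arithmetic. The sketch below assumes a 16 kHz sample rate, which is an inference from the stated numbers (a 32 ms window equals 512 samples only at 16 kHz) rather than something the text states, and it assumes no padding at the signal edges.

```python
def stft_shape(num_samples: int, sample_rate: int = 16000,
               win_ms: float = 32.0, hop_ms: float = 16.0, n_fft: int = 512):
    """Return (frames, frequency_bins) for an STFT computed without padding."""
    win = int(sample_rate * win_ms / 1000)   # 512 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 256 samples at 16 kHz
    bins = n_fft // 2 + 1                    # one-sided spectrum: 257 bins
    frames = 1 + (num_samples - win) // hop if num_samples >= win else 0
    return frames, bins


# A 3-second utterance at the assumed 16 kHz rate.
print(stft_shape(3 * 16000))
```

The 257 frequency bins match the 257-dim spectrogram quoted for PS-I; the frame count depends on the padding convention of the STFT implementation actually used.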

2. Conv Block

  • Composition: 1-D Conv × 2 → 1-D BatchNorm → GELU
  • Role: reduces dimensionality and refines the features, then passes them to the Bottleneck Transformer

3. Bottleneck Transformer

  • Composition:
    1. 2-D Conv (in = 128, out = 64, kernel = 1) → GELU → 2-D BN → Dropout 0.1
    2. Multi-Head Self-Attention (dim = 64, heads = 8) → Dropout 0.2 → 2-D Adaptive AvgPool (1×1) → GELU → 2-D BN → Dropout 0.1
    3. 2-D Conv (dim = 64 → 64) → residual connection → Sigmoid
  • Characteristics: convolution captures local information, self-attention captures global information, and unnecessary information is removed.

4. Dense Block-1 & Dense Block-2

  • Dense Block-1: Linear(128 → 32) → LayerNorm → 1-D Adaptive AvgPool (removes the time dimension)
  • Dense Block-2: Linear(32 → 1) → Sigmoid (outputs the STOI score)
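To give a feel for how small these layers are, the sketch below counts learnable parameters for the layers whose sizes are fully specified above. It assumes bias terms everywhere and a standard MHSA with four dim × dim projections (query, key, value, output); normalization layers and the second 2-D Conv are left out because their exact configuration is not given. The helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 1, bias: bool = True) -> int:
    """Parameters of a convolution with a square kernel x kernel filter."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def mhsa_params(dim: int, bias: bool = True) -> int:
    """Q, K, V and output projections; the head count does not change the total."""
    return 4 * linear_params(dim, dim, bias)


layers = {
    "bottleneck conv 1x1 (128 -> 64)": conv_params(128, 64, kernel=1),
    "MHSA (dim 64, 8 heads)": mhsa_params(64),
    "Dense Block-1 linear (128 -> 32)": linear_params(128, 32),
    "Dense Block-2 linear (32 -> 1)": linear_params(32, 1),
}
for name, count in layers.items():
    print(f"{name}: {count:,}")
print("per-head dimension:", 64 // 8)
```

These counts cover only part of the network, so they are not expected to add up to the totals reported for the full model.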

5. Loss Function

  • MSE (true utterance-level STOI vs. predicted STOI)
  • No frame-level scores are required; the model can be trained from utterance-level information alone

6. Training Environment

  • GPU: 24 GB NVIDIA RTX A5000
  • Batch/epochs: batch size not stated, 50 epochs
  • Learning rate: 1e-4, Optimizer: Adam
  • Evaluation metrics: MSE, Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC)
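The three evaluation metrics can be written out in plain Python as follows. This is a minimal reference sketch rather than the evaluation code used in the paper; the score lists are made-up values for illustration only.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Linear (Pearson) correlation coefficient, i.e. LCC."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


true_stoi = [0.52, 0.61, 0.75, 0.83, 0.95]  # made-up ground-truth scores
pred_stoi = [0.55, 0.58, 0.79, 0.80, 0.97]  # made-up predictions
print(f"MSE  = {mse(true_stoi, pred_stoi):.5f}")
print(f"LCC  = {pearson(true_stoi, pred_stoi):.4f}")
print(f"SRCC = {spearman(true_stoi, pred_stoi):.4f}")
```

LCC rewards predictions that track the true scores linearly, while SRCC only looks at ordering, so a model can have a perfect SRCC and a lower LCC when its predictions are monotonic but not linear in the target.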

Experimental Results

Baseline

  • STOI-Net is adopted as the baseline (architecture described in Section IV)

Performance Comparison (Seen Test)

| Features | Model | LCC (±) | SRCC (±) | MSE (±) |
|---|---|---|---|---|
| Wav2Vec2 | Proposed | 93.95 ± 0.26 | 93.89 ± 0.42 | 0.0064 ± 0.0003 |
| HuBERT | Proposed | 92.78 ± 0.31 | 92.65 ± 0.38 | 0.0071 ± 0.0004 |
| PS-I | Proposed | 88.12 ± 0.45 | 87.95 ± 0.50 | 0.0123 ± 0.0010 |
| PS-II | Proposed | 91.34 ± 0.33 | 91.20 ± 0.36 | 0.0089 ± 0.0006 |
| PS-III | Proposed | 94.10 ± 0.22 | 94.02 ± 0.25 | 0.0059 ± 0.0002 |
| (same features) | STOI-Net | 90.45 ± 0.40 | 90.30 ± 0.44 | 0.0098 ± 0.0007 |

PS-I is not compatible with the baseline architecture, so no baseline result is reported for it in the table.

Unseen Test

| Features | Model | LCC (±) | SRCC (±) | MSE (±) |
|---|---|---|---|---|
| Wav2Vec2 | Proposed | 91.87 ± 0.38 | 91.73 ± 0.41 | 0.0082 ± 0.0005 |
| HuBERT | Proposed | 90.55 ± 0.42 | 90.40 ± 0.45 | 0.0091 ± 0.0006 |
| PS-I | Proposed | 84.30 ± 0.58 | 84.12 ± 0.60 | 0.0154 ± 0.0012 |
| PS-II | Proposed | 89.70 ± 0.44 | 89.55 ± 0.47 | 0.0102 ± 0.0008 |
| PS-III | Proposed | 92.45 ± 0.35 | 92.30 ± 0.38 | 0.0076 ± 0.0004 |
| | STOI-Net | 88.90 ± 0.50 | 88.73 ± 0.53 | 0.0125 ± 0.0009 |

With every feature set except PS-I, the proposed model recorded higher LCC and SRCC and lower MSE than the baseline on Unseen data as well.

Parameter Count

| Model | Parameters (M) |
|---|---|
| STOI-Net | 2.3 |
| Proposed model (PS-I) | 0.31 |
| Proposed model (PS-II) | 1.1 |
| Proposed model (PS-III) | 1.4 |
| Proposed model (Wav2Vec2) | 1.8 |
| Proposed model (HuBERT) | 2.0 |

In most cases the proposed model uses far fewer parameters than the baseline while performing better.
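The relative savings follow directly from the parameter table. The short calculation below uses only the values listed there; the variable names are illustrative.

```python
baseline_m = 2.3  # STOI-Net, in millions of parameters
proposed_m = {
    "PS-I": 0.31,
    "PS-II": 1.1,
    "PS-III": 1.4,
    "Wav2Vec2": 1.8,
    "HuBERT": 2.0,
}

# Percentage of parameters saved relative to the baseline.
reduction = {name: 100 * (1 - p / baseline_m) for name, p in proposed_m.items()}
for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}% fewer parameters than STOI-Net")
```

The saving varies widely by input feature, from roughly an order of magnitude for PS-I to a modest margin for the SSL-based variants.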

Performance by Noise Type and SNR

  • SNR bins: <0 dB, 0-5 dB, 5-10 dB, 10-15 dB, 15-20 dB, >20 dB
  • Observation: as the number of noise types increases, the correlation coefficients (LCC, SRCC) tend to fall and the MSE tends to rise. This is interpreted as prediction becoming harder as intelligibility decreases.
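A per-bin analysis of this kind needs each test utterance assigned to one of the six SNR bins. The sketch below shows one way to do that; sending a value that falls exactly on a boundary to the upper bin is an assumption, and the records are made-up values.

```python
import bisect
from collections import defaultdict

EDGES = [0, 5, 10, 15, 20]  # bin boundaries in dB
LABELS = ["<0 dB", "0-5 dB", "5-10 dB", "10-15 dB", "15-20 dB", ">20 dB"]


def snr_bin(snr_db: float) -> str:
    """Map an SNR value to its bin label; a boundary value goes to the upper bin."""
    return LABELS[bisect.bisect_right(EDGES, snr_db)]


def group_by_bin(records):
    """Group (snr_db, true_stoi, predicted_stoi) records by SNR bin."""
    groups = defaultdict(list)
    for snr_db, true_stoi, pred_stoi in records:
        groups[snr_bin(snr_db)].append((true_stoi, pred_stoi))
    return dict(groups)


# Made-up records: (SNR in dB, true STOI, predicted STOI).
records = [(-3.0, 0.41, 0.45), (7.5, 0.72, 0.70), (12.0, 0.85, 0.83), (25.0, 0.98, 0.97)]
print(group_by_bin(records))
```

Metrics such as LCC, SRCC, and MSE can then be computed separately on the pairs collected in each bin.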

An interesting pattern also emerged: the correlation coefficients are high in the low-SNR range (<10 dB), whereas in the high-SNR range (>20 dB) the true and predicted STOI values cluster in a narrow region, which weakens the linear relationship and lowers the correlation. The interpretation is that with less noise the true STOI values converge toward 1, so the scores span too narrow a range for a strong linear relationship to emerge.
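This range-restriction effect is easy to reproduce with synthetic numbers. The simulation below is purely illustrative: the score ranges, the 0.03 error standard deviation, and the Gaussian error model are assumptions and are not derived from the paper's data.

```python
import math
import random


def pearson(x, y):
    """Linear (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate(low, high, n=2000, error_sd=0.03, seed=0):
    """Correlation between true scores in [low, high] and noisy predictions.

    The prediction error has the same spread in every call; only the
    range of the true scores changes.
    """
    rng = random.Random(seed)
    true = [rng.uniform(low, high) for _ in range(n)]
    pred = [t + rng.gauss(0.0, error_sd) for t in true]
    return pearson(true, pred)


wide = simulate(0.30, 1.00)    # scores spread out, as at low SNR
narrow = simulate(0.95, 1.00)  # scores bunched near 1, as at high SNR
print(f"wide range   LCC = {wide:.3f}")
print(f"narrow range LCC = {narrow:.3f}")
```

With identical prediction error, the correlation drops sharply once the true scores are confined near 1, so a low LCC in the high-SNR bin does not by itself indicate larger errors; MSE is the more informative metric there.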


Conclusion and Future Work

  1. With its Convolution + Bottleneck Transformer + Dense architecture, the proposed model achieved better accuracy than STOI-Net with fewer parameters in non-intrusive STOI prediction.
  2. Combining SSL features (Wav2Vec2, HuBERT) with spectral features gave robust performance, particularly in noisy environments.
  3. The change in correlation across SNR points to a complex relationship between intelligibility and noise level, and there is room to improve performance further by adding multi-scale attention or a noise-level estimation module.

Future work will explore:

  • Multi-task learning (STOI + PESQ + WER) to strengthen the model's generalization
  • Domain adaptation (multilingual, multi-dialect) and model compression for real-time inference
  • Joint optimization with HASPI/HASQI in specialized settings such as hearing aids

