
A High-Efficiency Speech Representation and Compression Framework Built on Two-Stage Self-Supervised Learning
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in lat