
A Dynamic Zoom-In Technique for Fine-Grained Image Understanding
Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introd
















































