ViT Registers and Fractal ViT
Drawing inspiration from two recent findings: the surprisingly strong performance of transformers without positional encoding (NoPE) in language models, and evidence that registers (additional throwaway tokens not tied to the input) may improve the performance of large vision transformers (ViTs), we propose and test a ViT variant called fractal ViT. It breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, used in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, suggesting that such findings may be scale-, domain-, or application-specific.
💡 Research Summary
The paper investigates whether two ideas that have recently improved large language and vision models, namely the ability of transformer decoders to learn positional information without explicit positional encodings (the "NoPE" phenomenon) and the introduction of "register" tokens that are not tied to the input, can be combined to create a more powerful Vision Transformer (ViT). The authors propose a variant called Fractal ViT, which adds a set of "summary tokens" that, like registers, are initialized to zero and have no direct correspondence to image patches. Each summary token is assigned a k × k block of regular patch tokens. An attention mask is applied so that (1) all regular tokens attend to each other (full pairwise attention); (2) each summary token attends to all other summary tokens but, among the regular tokens, only to its assigned block, and the block's tokens attend back to their summary token; and (3) the global
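The masking rules above can be sketched as a boolean attention mask. This is a minimal illustration, not the authors' code: the function name, the square `grid × grid` patch layout, and the summary-token ordering are assumptions, and only rules (1) and (2) are covered since the summary cuts off before rule (3).

```python
import numpy as np

def fractal_attention_mask(grid: int, k: int) -> np.ndarray:
    """Boolean mask (True = may attend) for a hypothetical Fractal ViT
    layer: grid x grid patch tokens, plus one summary token per
    k x k block of patches (grid must be divisible by k)."""
    assert grid % k == 0
    n_patch = grid * grid
    blocks = grid // k                  # summary tokens per side
    n_summary = blocks * blocks
    n = n_patch + n_summary
    mask = np.zeros((n, n), dtype=bool)

    # (1) full pairwise attention among regular patch tokens
    mask[:n_patch, :n_patch] = True

    # (2a) summary tokens attend to each other
    mask[n_patch:, n_patch:] = True

    # (2b) each summary token <-> its assigned k x k patch block
    for s in range(n_summary):
        br, bc = divmod(s, blocks)      # block position on the grid
        s_idx = n_patch + s
        for r in range(br * k, (br + 1) * k):
            for c in range(bc * k, (bc + 1) * k):
                p_idx = r * grid + c
                mask[s_idx, p_idx] = True   # summary attends to its block
                mask[p_idx, s_idx] = True   # block attends back
    return mask
```

For a 4 × 4 patch grid with k = 2, this yields 16 patch tokens plus 4 summary tokens; summary token 0 attends to patches (0, 0), (0, 1), (1, 0), (1, 1) but to no other patches.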