Longterm Wiki

Detection accuracy drops with newer generators

paper

Authors

Nam Hyeon-Woo·Kim Yu-Ji·Byeongho Heo·Dongyoon Han·Seong Joon Oh·Tae-Hyun Oh

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Technical study examining Vision Transformers' attention mechanisms and attention density, relevant to understanding model capacity, generalization, and potential failure modes in deep learning systems.

Paper Details

Citations
38
3 influential
Year
2022

Metadata

arXiv preprint · primary source

Abstract

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA). The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional Neural Networks (CNNs) that gradually increase the range of interaction across multiple layers. We study the role of the density of the attention. Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones. This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them. We interpret this as a strong preference for ViT models to include dense interaction. We thus manually insert the uniform attention to each layer of ViT models to supply the much needed dense interactions. We call this method Context Broadcasting, CB. We observe that the inclusion of CB reduces the degree of density in the original attention maps and increases both the capacity and generalizability of the ViT models. CB incurs negligible costs: 1 line in your model code, no additional parameters, and minimal extra operations.

Summary

This paper investigates why Vision Transformers (ViTs) perform well, focusing on the role of attention density in multi-head self-attention (MSA). The authors find that ViTs naturally develop dense attention maps despite the learning difficulty this entails, suggesting a strong preference for dense interactions. They propose Context Broadcasting (CB), a simple method that explicitly injects uniform attention into each ViT layer to provide dense interactions. The approach reduces attention density in original maps while improving model capacity and generalizability, with minimal computational overhead and no additional parameters.
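The abstract describes CB as a one-line addition that injects uniform attention, i.e., the mean over all tokens, into each layer. A minimal numpy sketch of that idea follows; the exact placement and scaling in the authors' official code may differ, and the function name here is illustrative only.

```python
import numpy as np

def context_broadcasting(x):
    """Sketch of Context Broadcasting (CB) as described in the abstract:
    add the uniform-attention output (the mean over all tokens) to every
    token, supplying dense interactions explicitly.

    x: array of shape (num_tokens, dim), token representations of one image.
    """
    # Uniform attention assigns equal weight 1/N to every token, so its
    # output is simply the mean token; broadcasting adds it to each token.
    return x + x.mean(axis=0, keepdims=True)

# Toy usage: 4 tokens with 3-dimensional features.
x = np.arange(12, dtype=float).reshape(4, 3)
y = context_broadcasting(x)
```

In a ViT implementation this amounts to a single line after an attention block (e.g., `x = x + x.mean(dim=1, keepdim=True)` in PyTorch-style code), which matches the paper's claim of no additional parameters and negligible extra operations.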

Cited by 1 page

Page                              Type  Quality
AI-Driven Legal Evidence Crisis   Risk  43.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 73 KB
[2210.08457] Scratching Visual Transformer’s Back with Uniform Attention
Scratching Visual Transformer’s Back with Uniform Attention
Nam Hyeon-Woo², Kim Yu-Ji², Byeongho Heo¹, Dongyoon Han¹, Seong Joon Oh³, Tae-Hyun Oh²
¹NAVER AI Lab  ²POSTECH  ³University of Tübingen

This work was done during an internship at NAVER AI Lab.
 

 
 Abstract

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA).
The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional Neural Networks (CNNs) that gradually increase the range of interaction across multiple layers.
We study the role of the density of the attention.
Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones.
This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them.
We interpret this as a strong preference for ViT models to include dense interaction.
We thus manually insert the uniform attention to each layer of ViT models to supply the much needed dense interactions.
We call this method Context Broadcasting, CB.
We observe that the inclusion of CB reduces the degree of density in the original attention maps and increases both the capacity and generalizability of the ViT models.
CB incurs negligible costs: 1 line in your model code, no additional parameters, and minimal extra operations.

 
 
 
 1 Introduction

 
After the success of Transformers (Vaswani et al., 2017) in language domains, Dosovitskiy et al. (2021) extended them to Vision Transformers (ViTs), which operate almost identically to the original Transformers but target computer vision tasks.
Recent studies  (Dosovitskiy et al., 2021 ; Touvron et al., 2021b ) have shown that ViTs achieve superior performance on image classification tasks.

 
 
The favorable performance is often attributed to the multi-head self-attention (MSA) in ViTs (Dosovitskiy et al., 2021; Touvron et al., 2021b; Wang et al., 2018; Carion et al., 2020; Strudel et al., 2021; Raghu et al., 2021), which facilitates long-range dependency.¹
Specifically, MSA is designed for global interactions of spatial information in all layers.
¹ Long-range dependency is described in the literature with various terminologies: non-local, global, large receptive fields, etc.
This is a structurally contrasting feature with a large body of successful predecessors, convolutional neural networks (CNNs), which gradually increase the range of interactions by stacking many fixed and hard-coded local operations, i.e., convolutional layers.
Raghu et al. (2021) and Naseer et al. (2021) have shown the effectiv

... (truncated, 73 KB total)
Resource ID: 48213457fb9308c2 | Stable ID: sid_9Ynl9MCkEr