Abstract

Human vision possesses a special type of visual processing system called peripheral vision. By partitioning the entire visual field into multiple contour regions based on the distance from the center of our gaze, peripheral vision provides us with the ability to perceive different visual features in each region. In this work, we take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition. We propose to incorporate peripheral position encoding into the multi-head self-attention layers so that the network learns to partition the visual field into diverse peripheral regions given training data. We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model, showing that the network learns to perceive visual data similarly to the way human vision does. State-of-the-art performance in image classification across various model sizes demonstrates the efficacy of the proposed method.

Peripheral Vision

Peripheral vision gives us the ability to see things away from where we are directly looking, i.e., the center of our gaze. As seen in Figure 1, we have high-resolution processing near the gaze (central and para-central regions) to identify highly detailed visual elements such as geometric shapes and low-level details. In the regions more distant from the gaze (mid- and far-peripheral regions), the resolution decreases and we recognize more abstract visual features such as motion and high-level context. This systematic strategy enables us to effectively perceive important details within a small fraction (1%) of the visual field while minimizing unnecessary processing of background clutter in the rest (99%), thus facilitating efficient visual processing in the human brain. Video 1 experimentally demonstrates how human vision recognizes different visual features at different peripheral regions.

Figure 1. Peripheral vision.

Video 1. Human vision perceives distinct visual features across different peripheral regions.
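As a concrete illustration of the distance-based partitioning described above, the following is a minimal sketch (not taken from the paper): it assigns each spatial position to one of four peripheral regions according to its Euclidean distance from a gaze position. The region radii are arbitrary illustrative thresholds; PerViT learns its partitions from data rather than hard-coding them.

```python
# A minimal sketch of distance-based region partitioning (illustrative only;
# the radii are hand-picked here, whereas PerViT learns the partition).
import torch

def peripheral_regions(height, width, gaze_yx, radii=(2.0, 5.0, 9.0)):
    """Assign each position a region index: 0 = central, 1 = para-central,
    2 = mid-peripheral, 3 = far-peripheral."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    dist = torch.sqrt((ys - gaze_yx[0]) ** 2 + (xs - gaze_yx[1]) ** 2)
    return torch.bucketize(dist, torch.tensor(radii))  # (height, width) indices

# Example: a 14x14 token grid with the gaze at its center.
print(peripheral_regions(14, 14, gaze_yx=(6.5, 6.5)))
```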

Peripheral Vision Transformer

Figure 2. The overall architecture of Peripheral Vision Transformer (PerViT), which is based on the DeiT [A] architecture.

Figure 3. In the paper, we explore different model designs for the position-based attention function Φp (top) and propose peripheral projections (middle) and peripheral initialization (bottom).

Experimental analyses

1. Learned attentions of PerViT

Figure 4. Learned position-based attention Φp of PerViT-Tiny. The query position is given at the center. Without any special supervision, the four attended regions (one per head) in most layers learn to complement each other to cover the entire visual field, capturing different visual aspects in each peripheral region, similarly to human peripheral vision as illustrated in Figure 1.

Figure 5. Learned position-based Φp and mixed Φa attentions of PerViT-Tiny for Layers 3, 4, and 8. The mixed attentions Φa at Layer 4 are formed dynamically (Φc) within a statically formed region (Φp), while the attentions Φa at Layer 8 only weakly exploit position information (Φp) to form dynamic attentions. The results reveal that Φp plays two different roles: it imposes semi-dynamic attention when the attended region is focused on a small area, whereas it serves as position bias injection when the attended region is relatively broad. In the paper, we constructively prove that an MPA layer in the extreme case of semi-dynamic attention or position bias injection becomes, respectively, a convolution or a multi-head self-attention layer, naturally generalizing both transformations.
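For context on how a mixed attention of this kind can be computed, here is a minimal sketch of multi-head attention whose content-based scores Φc are modulated by a learned per-head position-based map Φp. The multiplicative mixing, the sigmoid on the position logits, and the names (MixedAttentionSketch, pos_logits) are assumptions for illustration, not the exact MPA formulation from the paper.

```python
# A minimal sketch of mixing a learned position-based attention map with
# content-based attention (assumed multiplicative mixing + re-normalization;
# see the paper for the exact MPA layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads, num_tokens):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned, input-independent attention logits per head: a stand-in
        # for the paper's peripheral position encoding.
        self.pos_logits = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):                               # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        phi_c = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        phi_p = torch.sigmoid(self.pos_logits)          # static, per-head map
        phi_a = phi_p.unsqueeze(0) * phi_c              # mixed attention
        phi_a = phi_a / phi_a.sum(dim=-1, keepdim=True).clamp(min=1e-6)

        out = (phi_a @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

In this sketch the two extremes discussed above are easy to see: a sharply peaked Φp confines the dynamic scores to a fixed local window (convolution-like), while a flat Φp leaves Φc essentially unchanged and acts only as a position bias (self-attention-like).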

2. The inner workings of PerViT

Figure 6. The measure of impact (x-axis: layer index, y-axis: the impact metric). Each bar shows the measure for a single head (four heads per layer), and the solid lines are trendlines following the per-layer averages (left: results of PerViT-T; right: results of T, S, and M). The impact of position-based attention is significantly higher in early processing, transforming features semi-dynamically, while the later layers require less position information, treating Φp as a minor position bias. Note that the impact measures of the four heads (bars) within each layer show high variance, implying that the network utilizes both position and content information simultaneously within each MPA layer (Layer 3 in Figure 5), performing both static/local and dynamic/global transformations in a single shot.
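To make the notion of "impact" concrete, here is a hypothetical proxy rather than the paper's metric: it measures how far the mixed attention Φa deviates from the content-only attention Φc, so it is zero whenever Φp acts as a flat position bias and grows as Φp reshapes the attention.

```python
# A hypothetical proxy for the impact of position-based attention
# (not the paper's exact definition).
import torch

def position_impact(phi_p, phi_c):
    """phi_p, phi_c: (heads, N, N); rows of phi_c are assumed to sum to one."""
    phi_a = phi_p * phi_c
    phi_a = phi_a / phi_a.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    # Total-variation-style distance per query, averaged over queries;
    # zero whenever phi_p is uniform (pure position-bias behavior).
    return 0.5 * (phi_a - phi_c).abs().sum(dim=-1).mean(dim=-1)  # one value per head
```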

Figure 7. The measure of nonlocality (x-axis: layer index, y-axis: the nonlocality metric). We observe a similar trend of locality between Φp and Φa, which reveals that position information plays a more dominant role than content information in forming the spatial attentions (Φa) for feature transformation. We also observe that the content- and position-based attentions behave oppositely: Φc attends globally in early layers, i.e., large scores are distributed over the whole spatial region, while being relatively local in deeper layers.
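A common way to quantify such nonlocality is the attention-weighted average distance between query and key positions; the sketch below assumes this definition, which may differ from the paper's exact metric.

```python
# A minimal sketch of a nonlocality measure (assumed: attention-weighted
# average query-key distance over a token grid).
import torch

def nonlocality(attn, height, width):
    """attn: (heads, height*width, height*width) attention maps over a
    height x width token grid, rows summing to one. Returns one value per head."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (N, 2)
    dist = torch.cdist(coords, coords)                          # (N, N) pairwise distances
    # Expected query-key distance under the attention distribution, per query,
    # then averaged over all query positions.
    return (attn * dist).sum(dim=-1).mean(dim=-1)
```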

3. Evaluation on ImageNet-1K

Table 1. Model performance on ImageNet [B].

Table 2. Study on the effect of each component in PerViT.

For additional results and analyses, please refer to our paper available on [arXiv].

Acknowledgements

This work was supported by the IITP grants (IITP-2021-0-01696: High-Potential Individuals Global Training Program, IITP-2021-0-00537: Visual Commonsense, IITP-2019-0-01906: AI Graduate School Program - POSTECH) funded by the Ministry of Science and ICT, Korea. This work was done while Juhong Min was working as an intern at Microsoft Research Asia.

Paper

Peripheral Vision Transformer
Juhong Min, Yucheng Zhao, Chong Luo, and Minsu Cho
arXiv preprint, 2022
[arXiv] [Bibtex]

Code

Check our GitHub repository: [github]

References

[A] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou. Training data-efficient image transformers & distillation through attention. In Proc. International Conference on Machine Learning (ICML), 2021.

[B] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.