Human vision possesses a special type of visual processing system called peripheral vision. By partitioning the entire visual field into multiple contour regions based on the distance to the center of our gaze, peripheral vision provides us with the ability to perceive different visual features in each region. In this work, we take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition. We propose incorporating peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data. We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception, showing that the network learns to perceive visual data similarly to the way human vision does. Its state-of-the-art performance in image classification across various model sizes demonstrates the efficacy of the proposed method.

Peripheral Vision

Peripheral vision gives us the ability to see things where we are not directly looking, i.e., away from the center of our gaze. As seen in Figure 1, we have high-resolution processing near the gaze (central and para-central regions) to identify highly detailed visual elements such as geometric shapes and low-level details. In the regions more distant from the gaze (mid- and far-peripheral regions), the resolution decreases and we instead recognize abstract visual features such as motion and high-level context. This systematic strategy enables us to effectively perceive important details within a small fraction (1%) of the visual field while minimizing unnecessary processing of background clutter in the rest (99%), thus facilitating efficient visual processing in the human brain. Video 1 experimentally demonstrates how human vision recognizes different visual features in different peripheral regions.
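The distance-based partitioning described above can be sketched in a few lines of numpy. The region names follow Figure 1, while the radii are illustrative choices for this sketch, not values from the paper:

```python
import numpy as np

def peripheral_regions(height, width, gaze, radii=(0.05, 0.15, 0.40)):
    """Partition an image grid into peripheral regions by distance from the gaze.

    Regions (by increasing distance): 0 = central, 1 = para-central,
    2 = mid-peripheral, 3 = far-peripheral. The radii are fractions of the
    image diagonal and are illustrative, not taken from the paper.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.sqrt((ys - gaze[0]) ** 2 + (xs - gaze[1]) ** 2)
    diag = np.sqrt(height ** 2 + width ** 2)
    # Assign each pixel the index of the first distance band it falls into.
    return np.digitize(dist / diag, radii)

regions = peripheral_regions(224, 224, gaze=(112, 112))
print(np.unique(regions))  # four concentric regions: [0 1 2 3]
```

PerViT does not hard-code such a partition; the point of the paper is that the network learns these regions from data, but this fixed version conveys the geometry.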

Figure 1. Peripheral vision.

Video 1. Human vision perceives distinct visual features across different peripheral regions.

Peripheral Vision Transformer

Figure 2. The overall architecture of Peripheral Vision Transformer (PerViT), which is based on the DeiT [A] architecture.

Figure 3. In the paper, we explore different model designs for the position-based attention function Φp (top) and propose peripheral projections (middle) and peripheral initialization (bottom).
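To make the idea concrete, here is a minimal numpy sketch of a mixed attention score in the spirit of PerViT: a content-based attention Φc modulated by a position-based term Φp computed from pairwise spatial distances. The tiny distance network and the weights `w1`, `w2` are illustrative stand-ins for the learned peripheral projections, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_distances(n):
    """Pairwise Euclidean distances between tokens on a sqrt(n) x sqrt(n) grid."""
    side = int(np.sqrt(n))
    coords = np.stack(np.meshgrid(np.arange(side), np.arange(side),
                                  indexing="ij"), -1).reshape(-1, 2)
    return np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # (n, n)

def mixed_attention(q, k, dist, w1, w2):
    """Hedged sketch: content attention Φc modulated by a position term Φp
    that is a learned function of query-key distance."""
    phi_c = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # content-based attention
    hidden = np.maximum(dist[..., None] * w1, 0.0)      # (n, n, h) ReLU distance features
    phi_p = 1.0 / (1.0 + np.exp(-(hidden @ w2)))        # sigmoid: position-based attention
    phi_a = phi_c * phi_p                               # mixed attention
    return phi_a / phi_a.sum(-1, keepdims=True)         # renormalize rows

n, d, h = 16, 8, 4
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))
w1, w2 = rng.standard_normal(h), rng.standard_normal(h)
attn = mixed_attention(q, k, relative_distances(n), w1, w2)
print(attn.shape)  # (16, 16)
```

Because Φp depends only on positions, it can learn ring-shaped "peripheral" bands like those in Figure 4, while Φc injects the content-dependent part.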

Experimental analyses

1. Learned attentions of PerViT

Figure 4. Learned position-based attention Φp of PerViT-Tiny. The query position is given at the center. Without any special supervision, the four attended regions (heads) in most layers learn to complement each other to cover the entire visual field, capturing different visual aspects in each peripheral region, similarly to human peripheral vision as illustrated in Figure 1.

Figure 5. Learned position-based (Φp) and mixed (Φa) attentions of PerViT-Tiny for layers 3, 4, and 8. The mixed attentions Φa at layer 4 are formed dynamically (Φc) within a statically formed region (Φp), while the attentions Φa at layer 8 only weakly exploit position information (Φp) to form dynamic attentions. The results reveal that Φp plays two different roles: it imposes semi-dynamic attention if the attended region is focused on a small area, whereas it serves as position-bias injection when the attended region is relatively broad. In the paper, we constructively prove that an MPA layer in the extreme case of semi-dynamic attention/position-bias injection is in turn convolution/multi-head self-attention, naturally generalizing both transformations.
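The two extreme cases can be illustrated with a toy numpy example (an illustration, not the paper's formal proof): a hard local window for Φp yields a convolution-like static/local attention pattern, while a uniform Φp cancels under row normalization and leaves pure content-based (MHSA-style) attention:

```python
import numpy as np

def normalize_rows(m):
    return m / m.sum(-1, keepdims=True)

n = 9  # tokens on a 3x3 grid
coords = np.stack(np.meshgrid(range(3), range(3), indexing="ij"), -1).reshape(-1, 2)
dist = np.abs(coords[:, None] - coords[None, :]).max(-1)  # Chebyshev distance

phi_c = normalize_rows(np.random.default_rng(1).random((n, n)))  # any content attention

# Extreme 1: Φp is a hard 3x3 local window -> attention is confined to a
# convolution-like static/local neighborhood regardless of content.
phi_p_local = (dist <= 1).astype(float)
phi_a_local = normalize_rows(phi_c * phi_p_local)
assert np.all(phi_a_local[dist > 1] == 0)  # nothing outside the window

# Extreme 2: Φp is uniform -> the position term cancels under normalization,
# leaving exactly the content-based (multi-head self-attention) scores.
phi_p_uniform = np.ones((n, n))
phi_a_uniform = normalize_rows(phi_c * phi_p_uniform)
assert np.allclose(phi_a_uniform, phi_c)
```

Everything the network learns in practice sits between these two extremes, which is what "generalizing both transformations" means here.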

2. The inner workings of PerViT

Figure 6. The measure of impact (x-axis: layer index, y-axis: the impact metric). Each bar shows the measure for a single head (4 heads at each layer), and the solid lines are trendlines following the per-layer averages (left: results of PerViT-T; right: results of T, S, and M). The impact of position-based attention is significantly higher in early processing, transforming features semi-dynamically, while the later layers require less position information, treating Φp as a minor position bias. Note that the impact measures of the four heads (bars) within each layer show high variance, implying that the network utilizes both position and content information simultaneously within each MPA layer (layer 3 in Figure 5), performing both static/local and dynamic/global transformations in a single shot.

Figure 7. The measure of nonlocality (x-axis: layer index, y-axis: the nonlocality metric). We observe a similar trend of locality between Φp and Φa, which reveals that position information plays a more dominant role than content information in forming spatial attentions (Φa) for feature transformation. We also observe that content- and position-based attentions behave conversely: Φc attends globally in early layers, i.e., large scores are distributed over the whole spatial region, while being relatively local in deeper layers.
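As a rough illustration, a nonlocality measure of this kind can be computed as the attention-weighted average distance between query and key positions. The exact definition in the paper may differ, and the attention maps below are synthetic:

```python
import numpy as np

def nonlocality(attn, coords):
    """Attention-weighted average query-key distance, averaged over queries.
    A sketch of a nonlocality metric; the paper's exact formula may differ."""
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # (n, n)
    return float((attn * dist).sum() / attn.shape[0])

side = 4
coords = np.stack(np.meshgrid(range(side), range(side), indexing="ij"),
                  -1).reshape(-1, 2).astype(float)
n = side * side

local = np.eye(n)                   # each query attends only to itself
uniform = np.full((n, n), 1.0 / n)  # each query attends everywhere equally
print(nonlocality(local, coords))   # 0.0: fully local attention
print(nonlocality(uniform, coords)) # mean pairwise distance: fully global
```

Identity-like attention scores 0, uniform attention scores the mean pairwise grid distance, and real heads fall in between, which is what the y-axis of Figure 7 tracks.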

3. Evaluation on ImageNet-1K

Table 1. Model performance on ImageNet [B].

Table 2. Study on the effect of each component in PerViT.

For additional results and analyses, please refer to our paper available on [arXiv].


This work was supported by the IITP grants (IITP-2021-0-01696: High-Potential Individuals Global Training Program, IITP-2021-0-00537: Visual Commonsense, IITP-2019-0-01906: AI Graduate School Program - POSTECH) funded by Ministry of Science and ICT, Korea. This work was done while Juhong Min was working as an intern at Microsoft Research Asia.


Peripheral Vision Transformer
Juhong Min, Yucheng Zhao, Chong Luo, and Minsu Cho
arXiv preprint, 2022
[arXiv] [Bibtex]


Check our GitHub repository: [github]


[A] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou. Training data-efficient image transformers & distillation through attention. In Proc. International Conference on Machine Learning (ICML), 2021.

[B] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.