Detecting robust keypoints from an image is an integral part of many computer vision problems, and the characteristic orientation and scale of keypoints play an important role in keypoint description and matching. Existing learning-based methods for keypoint detection rely on standard translation-equivariant CNNs and often fail to detect reliable keypoints under geometric variations. To learn to detect robust oriented keypoints, we introduce a self-supervised learning framework using rotation-equivariant CNNs. To train a histogram-based orientation map, we propose a dense orientation alignment loss computed over an image pair generated by synthetic transformations. Our method outperforms previous methods on an image matching benchmark and a camera pose estimation benchmark.
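As a minimal sketch of the idea behind rotation-equivariant convolutions (a numpy toy, not the paper's actual group-convolution implementation), the example below lifts an image to the cyclic group C4 by correlating it with four rotated copies of one filter. Rotating the input by 90° then rotates each response map and cyclically shifts the orientation channels, which is the equivariance property the framework builds on:

```python
import numpy as np

def conv2d_valid(img, kern):
    # Plain 'valid' cross-correlation (loops for clarity, not speed).
    kh, kw = kern.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

def c4_lifting_conv(img, kern):
    # Lifting convolution over C4: correlate with the filter rotated
    # by 0/90/180/270 degrees, giving 4 orientation channels.
    return np.stack([conv2d_valid(img, np.rot90(kern, k)) for k in range(4)])

# Equivariance check: rotating the input rotates each response map
# and cyclically shifts the orientation channels by one step.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
kern = rng.standard_normal((3, 3))
out = c4_lifting_conv(img, kern)
out_rot = c4_lifting_conv(np.rot90(img), kern)
expected = np.stack([np.rot90(out[(k - 1) % 4]) for k in range(4)])
assert np.allclose(out_rot, expected)
```

The function names here are illustrative; the paper's network stacks such group-equivariant layers at multiple scales rather than a single lifting layer.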

Overall architecture of REKD (Rotation-Equivariant Keypoint Detection)

Figure 1. The rotation-equivariant convolutional layer takes an input image and processes it at multiple scales. The multi-scale rotation-equivariant representation H_s passes through two separate branches that predict a keypoint map K and an orientation map O.
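A hedged sketch of how two such heads might consume a rotation-equivariant feature with G orientation channels (the function name `heads` and the exact pooling/softmax choices are illustrative assumptions, not the paper's architecture): pooling over the group yields an orientation-invariant keypoint score, while a softmax over the group dimension yields a per-pixel orientation histogram.

```python
import numpy as np

def heads(feat):
    # feat: (G, H, W) rotation-equivariant features, one channel per
    # discrete orientation of the group.
    # Keypoint branch: orientation-invariant score via max over the group.
    K = feat.max(axis=0)
    # Orientation branch: per-pixel histogram over the G orientations
    # (numerically stable softmax along the group dimension).
    e = np.exp(feat - feat.max(axis=0, keepdims=True))
    O = e / e.sum(axis=0, keepdims=True)
    return K, O
```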

Dense orientation alignment loss

(a) Illustration of dense orientation alignment loss. (b) Visualization of the color-coded orientation maps.

Figure 2. (a): The dense orientation histogram O_b is spatially aligned using T_g^-1. The equivariant histogram vectors of the feature points in O_a are shifted using T'_g. Out-of-plane regions are excluded when computing the loss. (b): The upper image is the source and the bottom is the target. For a better view, we apply T_g^-1 to the target image as a spatial alignment. We map the orientation range from [0,359) to [0,255) to visualize orientations as the hue channel of the HSV color representation.
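For a 90° synthetic rotation, the alignment idea can be sketched in numpy: the spatial alignment T_g^-1 is an inverse `rot90` on the target histogram map, and the histogram shift T'_g is a cyclic roll of the source's orientation bins. The bin count and the mean-squared-error form below are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

B = 36                   # orientation bins of 10 deg each (illustrative choice)
SHIFT = 90 * B // 360    # a 90 deg rotation shifts the histograms by this many bins

def alignment_loss(hist_a, hist_b):
    # hist_*: (H, W, B) per-pixel orientation histograms of the source (a)
    # and target (b) image of a synthetically rotated pair (90 deg here).
    # Spatial alignment T_g^-1: undo the rotation on the target map.
    hist_b_aligned = np.rot90(hist_b, k=-1, axes=(0, 1))
    # Histogram alignment T'_g: cyclically shift the source's bins.
    hist_a_shifted = np.roll(hist_a, SHIFT, axis=2)
    # Mean squared error as a stand-in for the paper's loss.
    return float(np.mean((hist_a_shifted - hist_b_aligned) ** 2))

# Toy check: a target built by rotating the source 90 deg incurs ~zero loss.
rng = np.random.default_rng(0)
hist_a = rng.random((6, 6, B))
hist_b = np.rot90(np.roll(hist_a, SHIFT, axis=2), k=1, axes=(0, 1))
assert alignment_loss(hist_a, hist_b) < 1e-12
```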

Experiments under synthetic rotations

(a) Repeatability (b) Orientation estimation accuracy

Figure 3. (a) Repeatability results evaluating rotation-invariant keypoint detection under synthetic rotations with Gaussian noise. For a better view, we smooth the chart with a moving average. (b) Orientation estimation accuracy under synthetic rotations with Gaussian noise. We use a 15° threshold for measuring the accuracy.
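Orientation accuracy under a 15° threshold requires measuring angular error on the circle, so that e.g. 355° and 5° differ by 10° rather than 350°. A small sketch of that metric (the function name is an assumption):

```python
import numpy as np

def orientation_accuracy(pred_deg, gt_deg, thresh_deg=15.0):
    # Smallest angular difference on the circle, mapped to [0, 180],
    # then the fraction of estimates within the threshold.
    diff = np.abs((np.asarray(pred_deg) - np.asarray(gt_deg) + 180.0) % 360.0 - 180.0)
    return float(np.mean(diff <= thresh_deg))

# 355 and 5 deg differ by 10 deg across the wrap-around, not 350.
assert orientation_accuracy([355.0], [5.0]) == 1.0
```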

Experiments of keypoint matching and relative camera pose estimation

(a) Results on HPatches (b) Results on IMC2021 [1]

Table 1. (a) We use 1,000 keypoints in this experiment. `Det.' denotes the keypoint detection method, `Desc.' denotes the descriptor extraction method, `Rep.' denotes the repeatability score, and `pred. match.' denotes the average number of predicted matches. Numbers in bold indicate the best scores. (b) Mean average accuracy (mAA; 5°, 10°) of 6-DoF pose estimation and the average number of inlier matches (Num. Inl.) on the IMC2021 validation set [1]. Column `K' denotes the number of keypoints. Numbers in bold indicate the best scores.
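The repeatability score used above counts keypoints from one image that reappear within a pixel threshold after warping by the ground-truth homography. A minimal sketch under assumed conventions (nearest-neighbour matching, a 3-pixel threshold):

```python
import numpy as np

def repeatability(kpts_a, kpts_b, H, eps=3.0):
    # kpts_*: (N, 2) arrays of (x, y) keypoints; H: 3x3 homography from a to b.
    ones = np.ones((len(kpts_a), 1))
    proj = np.hstack([np.asarray(kpts_a), ones]) @ H.T
    proj = proj[:, :2] / proj[:, 2:3]            # project a's keypoints into b
    d = np.linalg.norm(proj[:, None, :] - np.asarray(kpts_b)[None, :, :], axis=2)
    return float((d.min(axis=1) <= eps).mean())  # fraction re-detected within eps px
```

With a pure-translation homography and identically shifted keypoints, the score is 1.0; dropped detections lower it proportionally.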

Additional experiments

(a) Outlier filtering using the estimated orientation (b) Results according to the order of group

Table 2. (a) Comparison of estimated orientations used for outlier filtering on HPatches. We use 1,000 keypoints. `Det.+Des.' denotes the keypoint detector and descriptor, `Ori.' denotes the orientation estimation method, and `fltr.' denotes whether the outlier filtering is used. (b) Experiments varying the order of the group on HPatches. The subscript of G denotes the order of the group. `out. filter.' denotes results with outlier filtering. The last row shows results without the group representation, using conventional CNNs.
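One way such orientation-based outlier filtering can work is to compute the relative orientation of each tentative match, vote for the dominant global rotation, and reject matches inconsistent with it. The coarse histogram vote and tolerance below are hypothetical stand-ins for the paper's rule:

```python
import numpy as np

def orientation_inlier_mask(ori_a, ori_b, tol_deg=15.0):
    # Relative orientation of each tentative match, wrapped to [0, 360).
    rel = (np.asarray(ori_b) - np.asarray(ori_a)) % 360.0
    # Vote for the dominant relative rotation with a coarse 10-deg histogram
    # (a hypothetical stand-in for the paper's filtering rule).
    votes = np.histogram(rel, bins=36, range=(0.0, 360.0))[0]
    mode = (np.argmax(votes) + 0.5) * 10.0
    # Keep matches whose relative orientation agrees with the dominant one.
    diff = np.abs((rel - mode + 180.0) % 360.0 - 180.0)
    return diff <= tol_deg
```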


This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-TF2103-02.


Self-Supervised Equivariant Learning for Oriented Keypoint Detection
Jongmin Lee, Byungjin Kim, Minsu Cho
CVPR, 2022
[paper] [Bibtex]


Check our GitHub repository: [GitHub]