Symmetry plays a vital role in understanding structural patterns, aiding object recognition and scene interpretation. This paper focuses on rotation symmetry, where objects remain unchanged when rotated around a central axis; detecting it requires localizing rotation centers and their supporting vertices. Traditional methods rely on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks (CNNs) detect rotation centers but struggle with 3D geometric consistency due to viewpoint distortions. To overcome this, we propose a model that directly predicts rotation centers and vertices in 3D space and projects the results back to 2D while preserving structural integrity. By incorporating a vertex reconstruction stage that enforces 3D geometric priors, such as equal side lengths and interior angles, our model enhances robustness and accuracy. Experiments on the DENDI dataset show superior performance in rotation axis detection and validate the impact of 3D priors through ablation studies.
Figure 1. Rotation symmetry detection models and results. (a) 3D detection baseline model without geometric priors, and (b) its qualitative results. (c) Our 3D detection model with geometric priors, and (d) its corresponding qualitative results. The results highlight the benefits of incorporating 3D geometric constraints.
Figure 2. Overall pipeline. The input image is processed through a backbone and transformer encoder with camera queries. The detection head predicts the 3D rotation center, seed vertex, rotation axis, and symmetry group. The seed vertex is then duplicated according to the predicted symmetry group before the 3D coordinates are projected to 2D.
Figure 3. Camera Cross Attention. The 3D reference point grids in camera coordinates are projected onto image coordinates to query the backbone image features.
The rotation symmetry detector predicts rotation centers and vertices in 3D camera coordinates. To transform backbone features from image coordinates to camera coordinates, we introduce camera queries, a set of grid-shaped learnable parameters denoted as \( \mathbf{Q} \in \mathbb{R}^{C \times N_x \times N_y} \). Here, \( N_x \) and \( N_y \) represent the spatial dimensions along the \( x \)- and \( y \)-axes, while \( C \) is the embedding dimension. Each query \( \mathbf{Q}_q \in \mathbb{R}^{C} \), located at \( \mathbf{p}_q \), corresponds to a grid cell in the camera’s local coordinate space, covering a predefined range along the \( x \)- and \( y \)-axes. Given an input feature map \( \mathbf{F} \in \mathbb{R}^{C \times H \times W} \), the camera cross attention for a query \( \mathbf{Q}_q \) with 2D reference point \( \mathbf{p}_q \) is computed as:
\[ \text{CCA}(\mathbf{Q}, q, \mathbf{F}) = \sum^{N_\mathrm{ref}}_{i=1} {\mathrm{Deform}}(\mathbf{Q}_q, \mathcal{P}(\mathbf{p}_q, z_i), \mathbf{F}) \] \[ \mathcal{P}(\mathbf{p}_q, z_i) = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \end{pmatrix} \begin{pmatrix} \frac{\mathbf{p}_{q,x}}{z_i} \\ \frac{\mathbf{p}_{q,y}}{z_i} \\ 1 \end{pmatrix} \]
where \( \mathrm{Deform} \) denotes deformable attention [Zhu et al.], \( f \) is the focal length, and \( (c_x, c_y) \) is the principal point. For each \( x \)-\( y \) position, \( N_\mathrm{ref} \) depth values along the \( z \)-axis generate 3D reference points, which are projected to 2D to sample image features.
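To make the reference-point projection concrete, the following is a minimal PyTorch sketch. The function names and tensor shapes are illustrative assumptions, and plain bilinear sampling (`grid_sample`) stands in for the learned deformable-attention sampling of [Zhu et al.]; it is not our actual implementation.

```python
import torch
import torch.nn.functional as F

def project_reference_points(p_q, depths, f, cx, cy):
    """Project grid reference points, lifted to N_ref depth hypotheses, onto the image plane.

    p_q:    (N_q, 2) reference points (x, y) in camera coordinates
    depths: (N_ref,) depth hypotheses z_i along the camera z-axis
    Returns: (N_q, N_ref, 2) pixel coordinates, i.e. P(p_q, z_i) from the equation above.
    """
    x = p_q[:, 0:1] / depths[None, :]        # (N_q, N_ref)
    y = p_q[:, 1:2] / depths[None, :]
    u = f * x + cx
    v = f * y + cy
    return torch.stack((u, v), dim=-1)

def sample_image_features(feat, uv, H, W):
    """Bilinearly sample backbone features F at the projected locations.

    feat: (C, H, W) image feature map
    uv:   (N_q, N_ref, 2) pixel coordinates
    Returns: (N_q, N_ref, C) sampled features; summing over the depth axis gives
    a stand-in for the sum over the N_ref reference points in the CCA equation.
    """
    grid = uv.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    sampled = F.grid_sample(feat[None], grid[None], align_corners=True)  # (1, C, N_q, N_ref)
    return sampled[0].permute(1, 2, 0)
```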
Each query is processed by the transformer decoder and then passed to a classification branch and a regression branch. The classification branch predicts the rotation symmetry group \( g \), defining the order of symmetry \( N \). The regression branch outputs four parameters that define the 3D geometric structure as:
\[ \begin{bmatrix} \mathbf{c}^\top & \mathbf{s}^\top & \mathbf{a}^\top & \beta \end{bmatrix}^\top, \]
where \( \mathbf{c} \in \mathbb{R}^3 \) is the 3D center coordinate, \( \mathbf{s} \in \mathbb{R}^3 \) is the 3D seed coordinate (a vertex on the polygon boundary), \( \mathbf{a} \in \mathbb{R}^3 \) is the 3D axis vector defining the rotation axis, and \( \beta \) is an angle bias for initial alignment in certain shapes (e.g., rectangles). These parameters define the spatial structure and orientation needed to construct the polygon in 3D space. To position vertices according to the predicted rotation symmetry group in 3D space, each vertex \( \mathbf{v}_k \) is computed by rotating a seed point \( \mathbf{s} \) around a rotation axis vector \( \mathbf{a} \), centered at the rotation center \( \mathbf{c} \). The axis \( \mathbf{a} \) is normalized to a unit vector. Each rotation vertex is given by \( \mathbf{v}_k = \mathbf{r}_k + \mathbf{c} \), where the rotated vector \( \mathbf{r}_k \) is calculated using Rodrigues' rotation formula:
\[ \mathbf{r}_k = \mathbf{r} \cos \theta_k + (\mathbf{a} \times \mathbf{r}) \sin \theta_k + \mathbf{a} (\mathbf{a} \cdot \mathbf{r}) (1 - \cos \theta_k), \]
with the initial radial vector \( \mathbf{r} = \mathbf{s} - \mathbf{c} \), and rotation angle \( \theta_k = \frac{2\pi k}{N} \). Given the predicted symmetry group order \( N \), we generate \( N \) vertices by setting \( k = 1, 2, \ldots, N \). The predicted 3D points are then projected into 2D.
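The vertex reconstruction and re-projection can be sketched as follows. This is a minimal illustration under our own assumptions (the angle bias \( \beta \) added to \( \theta_k \), a simple pinhole projection with positive depth); names and shapes are illustrative rather than the released implementation.

```python
import math
import torch

def reconstruct_vertices(c, s, a, beta, N, f, cx, cy):
    """Rotate the seed vertex s about axis a around center c to obtain N vertices,
    then project them to 2D with a pinhole camera (focal f, principal point (cx, cy)).

    c, s, a: (3,) predicted 3D center, seed vertex, and rotation axis
    beta:    scalar angle bias (adding it to theta_k is our assumption)
    N:       predicted order of the rotation symmetry group
    Returns: (N, 2) projected 2D vertex coordinates.
    """
    a = a / a.norm()                                  # unit rotation axis
    r = s - c                                         # initial radial vector r = s - c
    k = torch.arange(1, N + 1, dtype=c.dtype, device=c.device)
    theta = 2.0 * math.pi * k / N + beta              # rotation angles theta_k
    cos_t = torch.cos(theta)[:, None]                 # (N, 1)
    sin_t = torch.sin(theta)[:, None]
    # Rodrigues' rotation formula, broadcast over the N angles.
    r_k = (r * cos_t
           + torch.cross(a.expand(N, 3), r.expand(N, 3), dim=-1) * sin_t
           + a * (a @ r) * (1.0 - cos_t))
    v_k = r_k + c                                     # 3D vertices v_k = r_k + c
    # Perspective projection, assuming positive depth (z > 0).
    u = f * v_k[:, 0] / v_k[:, 2] + cx
    v = f * v_k[:, 1] / v_k[:, 2] + cy
    return torch.stack((u, v), dim=-1)
```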
Method | 3D query/pred. | vertex recon. | mAP |
---|---|---|---|
2D baseline | | | 24.7 |
3D baseline | ✓ | | 23.5 |
Ours | ✓ | ✓ | 30.6 |
Method | Prediction | Max F1-score |
---|---|---|
EquiSym [Seo et al., 2022] | segmentation | 22.5 |
Ours | detection | 33.2 |
Figure 4. Qualitative comparison of rotation vertex detection results on the DENDI dataset. Each set of four columns displays the ground truth, the 2D baseline, the 3D baseline, and our method. Only polygon predictions whose vertices are all true positives are marked (green).
Figure 5. Qualitative results of rotation center detection on the DENDI dataset. Each set of three columns shows the ground truth, EquiSym [Seo et al., 2022], and our method. Our detection-based model allows for analysis of individual symmetries.
This work was supported by the Samsung Electronics AI Center and also by the IITP grants (RS-2022-II220290: Visual Intelligence for Space-Time Understanding and Generation (50%), RS-2021-II212068: AI Innovation Hub (45%), RS-2019-II191906: Artificial Intelligence Graduate School Program at POSTECH (5%)) funded by the Korea government (MSIT).