Abstract

Determining the 3D orientation of an object in an image, known as single-image pose estimation, is a crucial task in 3D vision applications. Existing methods typically learn 3D rotations parametrized in the spatial domain using Euler angles or quaternions, but these representations often introduce discontinuities and singularities. SO(3)-equivariant networks enable the structured capture of pose patterns with data-efficient learning, but spatial-domain parametrizations are incompatible with their architecture, particularly spherical CNNs, which operate in the frequency domain for computational efficiency. To overcome these issues, we propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression, aligning with the operations of spherical CNNs. Our SO(3)-equivariant pose harmonics predictor overcomes the limitations of spatial parametrizations, ensuring consistent pose estimation under arbitrary rotations. Trained with a frequency-domain regression loss, our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+, with significant improvements in accuracy, robustness, and data efficiency.

Motivation

Figure 1. Types of representations for 3D rotation prediction. Existing methods predict 3D rotations in the spatial domain, where the rotation representations suffer from issues such as discontinuities and singularities [1,2,3]. Our method instead predicts Wigner-D coefficients in the frequency domain, obtaining accurate poses in continuous space using an SO(3)-equivariant network.

Task: Single-View Object Pose Estimation

Figure 2. Task of single-view object pose estimation. Predicting the 3D pose of an object, i.e., its position and orientation in 3D space, from a single image is crucial for numerous applications, including augmented reality, robotics, autonomous vehicles, and cryo-electron microscopy. Estimating 3D orientation is particularly challenging due to rotational symmetries and the non-linear nature of rotations. Moreover, unlike translations, rotations introduce unique difficulties such as gimbal lock and the need for continuous, singularity-free representations.

Overall Architecture

Figure 3. Overall architecture. Our network for SO(3)-equivariant pose estimation consists of four parts: feature extraction, a spherical mapper, a Fourier transformer, and SO(3)-equivariant layers. First, we extract a feature map using a pre-trained ResNet. Next, the spherical mapper orthographically projects the extracted feature map onto a spherical surface. The Fourier transformer then converts this spatial information into the frequency domain. Finally, we apply spherical convolutions to obtain the Wigner-D harmonics coefficients $\Psi$, which represent SO(3) rotations of spherical harmonics, where $M$ denotes the total number of Wigner-D matrix coefficients.
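To make the four-stage pipeline concrete, below is a minimal PyTorch sketch. It is not the paper's implementation: the spherical mapper, Fourier transform, and SO(3)-equivariant layers are collapsed into crude placeholders, and the module names and hyperparameters (e.g., `lmax`, the spherical grid resolution) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PoseHarmonicsPredictor(nn.Module):
    """Minimal sketch of the four-stage pipeline in Figure 3.
    The spherical mapper, Fourier transform, and equivariant head
    below are placeholders, not the paper's actual modules."""

    def __init__(self, lmax=5, n_beta=16, n_alpha=32):
        super().__init__()
        # 1) Feature extraction with a pre-trained ResNet (classifier removed).
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.reduce = nn.Conv2d(2048, 64, kernel_size=1)
        self.n_beta, self.n_alpha = n_beta, n_alpha
        # M = sum_{l=0}^{lmax} (2l+1)^2: the total number of Wigner-D
        # matrix coefficients up to degree lmax.
        self.M = sum((2 * l + 1) ** 2 for l in range(lmax + 1))
        # 3+4) Placeholder for the spherical-harmonics transform and the
        # SO(3)-equivariant spherical convolutions that produce Psi.
        self.head = nn.Linear(64 * n_beta * n_alpha, self.M)

    def forward(self, image):                      # image: (B, 3, H, W)
        feat = self.reduce(self.backbone(image))   # (B, 64, h, w)
        # 2) Spherical mapper (placeholder): resample the planar feature
        # map onto an equiangular (beta, alpha) grid on the sphere.
        sphere = F.interpolate(feat, size=(self.n_beta, self.n_alpha),
                               mode="bilinear", align_corners=False)
        psi = self.head(sphere.flatten(1))         # (B, M) Wigner-D coeffs
        return psi

# Usage: psi = PoseHarmonicsPredictor()(torch.randn(2, 3, 224, 224))
```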

Spherical Mapper & Spherical Convolutions

Figure 4. Illustration of the spherical mapper and spherical convolution for SO(3)-equivariance. This design predicts 3D rotations while preserving the SO(3)-equivariance of the input structure. Predicting the Wigner-D harmonics $\Psi$ enables continuous 3D rotation modeling without discretizing the group actions.
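Below is a minimal sketch of what the orthographic spherical mapping could look like, assuming an equiangular $(\beta, \alpha)$ grid on the visible hemisphere and bilinear sampling; the grid construction and normalization details are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def spherical_mapper(feat, n_beta=16, n_alpha=32):
    """Sketch of an orthographic spherical mapper (details assumed).
    Each grid point (beta, alpha) on the visible hemisphere of S^2 is
    projected orthographically onto the image plane at
    (x, y) = (sin(beta)cos(alpha), sin(beta)sin(alpha)),
    and the feature map is sampled there."""
    B = feat.shape[0]
    beta = torch.linspace(0, torch.pi / 2, n_beta)   # visible hemisphere
    alpha = torch.linspace(0, 2 * torch.pi, n_alpha)
    beta, alpha = torch.meshgrid(beta, alpha, indexing="ij")
    # Orthographic projection of the sphere onto the plane, in [-1, 1],
    # matching grid_sample's normalized coordinate convention.
    x = torch.sin(beta) * torch.cos(alpha)
    y = torch.sin(beta) * torch.sin(alpha)
    grid = torch.stack([x, y], dim=-1)               # (n_beta, n_alpha, 2)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1).to(feat.device)
    # Bilinearly sample the planar features at the projected locations;
    # output is a (B, C, n_beta, n_alpha) signal on the sphere.
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=False)
```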

Loss Function: Frequency-Domain Regression Loss

Equation 1. The output Wigner-D representation $\Psi$ encodes the specific object orientation in an image. To measure prediction accuracy, we compute the Mean Squared Error (MSE) between predicted and ground-truth coefficients, normalizing each harmonic frequency level $l$ with a weight $w_l$. This regression loss yields continuous output values, enabling more precise predictions of unambiguous object poses than previous discretization-based methods.
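The equation itself is not reproduced on this page, so the following is a plausible reconstruction of the weighted frequency-domain MSE described in the caption; the specific normalization $w_l = 1/(2l+1)^2$ (one weight per degree-$l$ block) is our assumption, not the paper's stated choice:

$$\mathcal{L}_{\text{wigner}} \;=\; \sum_{l=0}^{L} w_l \sum_{m,n=-l}^{l} \bigl( \hat{\Psi}^{l}_{mn} - D^{l}_{mn}(R_{\text{gt}}) \bigr)^2,$$

where $\hat{\Psi}^{l}$ is the predicted degree-$l$ coefficient block and $D^{l}(R_{\text{gt}})$ is the Wigner-D matrix of the ground-truth rotation. Since $D^{l}(R_{\text{gt}})$ is computed directly from the continuous ground-truth rotation, no discretization of SO(3) is involved.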

Experiment: Comparison with Existing Pose Estimation Methods

Table 1. Results on ModelNet10-SO(3). Table 2. Results on PASCAL3D+.

Tables 1 and 2 show single-view pose estimation results on the ModelNet10-SO(3) and PASCAL3D+ datasets, respectively; our model outperforms all baselines across multiple evaluation metrics.

Experiment: Few-shot Training Views

Figure 5. Experiment on ModelNet10-SO(3) with few-shot training views. Solid lines for I-PDF [5], I2S [4], and RotLaplace [6] denote a ResNet-50 backbone, while dotted lines indicate a ResNet-101 backbone. Our method outperforms the baselines on all metrics while requiring fewer training views. Baseline results [4, 6] were obtained using the source code provided by the authors. This few-shot experiment verifies that our SO(3)-equivariant model contributes to superior data efficiency and generalization to unseen rotations.

Experiment: Ablation Studies & Design Choices

Table 3. Comparison of different rotation parametrizations and the effect of removing SO(3)-equivariant CNNs. To validate our Wigner-D representation in the frequency domain, we train with various output rotation representations. Table 4. Comparison of different loss functions. To validate our choice of MSE loss, we experiment with various distance functions between the predicted output and the ground truth.
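For reference, one common spatial-domain parametrization that such ablations compare against is the continuous 6D representation, which recovers a rotation matrix from two 3D vectors via Gram-Schmidt orthonormalization. The sketch below is illustrative only and is not tied to the paper's exact ablation code.

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map a 6D network output (..., 6) to a rotation matrix (..., 3, 3)
    by Gram-Schmidt orthonormalization of its two 3D halves. A standard
    spatial-domain baseline, shown only for illustration."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    # Subtract the component of a2 along b1, then normalize.
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)       # completes a right-handed frame
    return torch.stack([b1, b2, b3], dim=-1)  # columns b1, b2, b3
```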

Experiment: 3D Symmetric Objects

(a) Results on symmetric solids (SYMSOL I and II). (b) Rotation and reflection symmetry cases.

Figure 6. Joint training with the distribution loss $\mathcal{L}_{\text{dist}}$ [5] for symmetric object modeling. We visualize pose distributions following [5].
(a) We report the average log-likelihood on both parts of the SYMSOL dataset. $\mathcal{L}_{\text{wigner}}$ denotes the results obtained with our Wigner-D regression loss. $\mathcal{L}_{\text{dist}}$ denotes the results using the distribution loss from I-PDF [5], which match the results of I2S [4]. The third row presents the results of joint training with both our regression loss and the distribution loss (sketched after this caption).
(b) For example, in the case of rotational symmetry, such as the mouth of a bottle, our model can detect all correct orientations. Similarly, for the standing chair example below, the model successfully captures multiple correct poses corresponding to reflection symmetry along the four directions of the chair legs.
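A minimal sketch of the joint objective behind Figure 6, under the assumption that the two losses are combined with a scalar weight $\lambda$ (the value of $\lambda$ is not given here):

$$\mathcal{L}_{\text{joint}} \;=\; \mathcal{L}_{\text{wigner}} \;+\; \lambda\, \mathcal{L}_{\text{dist}},$$

where $\mathcal{L}_{\text{dist}}$ is the distribution loss of I-PDF [5]. Intuitively, the regression term pins down a single precise mode, while the distribution term lets the model spread probability mass over the multiple equivalent orientations of a symmetric object.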

Acknowledgements

This work was supported by IITP grants (RS-2022-II220959: Few-Shot Learning of Causal Inference in Vision and Language for Decision Making (50%), RS-2022-II220290: Visual Intelligence for Space-Time Understanding and Generation based on Multi-layered Visual Common Sense (45%), RS-2019-II191906: AI Graduate School Program at POSTECH (5%)) funded by the Ministry of Science and ICT, Korea.

Paper

3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction
Jongmin Lee, Minsu Cho
NeurIPS, 2024
[paper] [ArXiv] [Bibtex]

Code

Check our GitHub repository: [SO3_EquiPose GitHub]

References