Self-supervised 3D Hand Pose Estimation
Updated: Jan 27
This blog post is part of a series of five posts in which I summarize academic papers and answer a few questions pertinent to them. These papers and related questions were given to me by my PhD committee as part of my comprehensive exam.
The paper I will be writing about in this post can be accessed using the following link.
The paper presents a self-supervised method for hand pose estimation from depth maps that combines the advantages of the unsupervised model-based methods (no need for labelled data) and data-driven learning approaches (increased accuracy with more training data).
Network design and the amount of labelled data play a huge role in the accuracy of learning-based methods for hand pose estimation. However, accurate 3D hand pose annotations are difficult to obtain and require careful supervision and manual cleaning. One way to address this issue is to use synthetic data, but models trained on synthetic data suffer large losses in accuracy when applied to real data. Apart from learning-based methods, model-based methods, which treat pose estimation as a frame-wise fitting problem and require no training data, are also used for hand pose estimation.
The authors have proposed a self-supervised method for hand pose estimation from depth maps that combines the advantages of the unsupervised model-based methods and data-driven learning approaches.
In this approach, a neural network is initialized with synthetic data and later fine-tuned on real unlabelled depth maps. The network supervises itself with the help of carefully selected data-fitting terms.
The hand surface is approximated with 41 spheres. Each sphere has a center (x,y,z) and a radius r. The radii do not change during training, so the pose estimation network estimates only the 3D coordinates of the N sphere centers (N = 41 here).
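To make the representation concrete, here is a minimal sketch of how such a sphere model could be stored. The class name `SphereHandModel`, the method `with_centers`, and the placeholder radius value are my own assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

import numpy as np

NUM_SPHERES = 41  # number of spheres stated in the post


@dataclass
class SphereHandModel:
    """Hypothetical container for the sphere-based hand model."""

    centers: np.ndarray  # shape (41, 3): (x, y, z) per sphere, predicted by the network
    radii: np.ndarray    # shape (41,): fixed, never updated during training

    def __post_init__(self):
        assert self.centers.shape == (NUM_SPHERES, 3)
        assert self.radii.shape == (NUM_SPHERES,)

    def with_centers(self, new_centers):
        """Return a new model with updated centers and the same radii."""
        return SphereHandModel(np.asarray(new_centers, dtype=float), self.radii)


# Placeholder values only; real radii would come from a hand template.
model = SphereHandModel(
    centers=np.zeros((NUM_SPHERES, 3)),
    radii=np.full(NUM_SPHERES, 8.0),
)
updated = model.with_centers(np.ones((NUM_SPHERES, 3)))
```

Keeping the radii out of the network output means a prediction is just a 41 x 3 array of centers, which is what the rest of the pipeline operates on.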
Given a depth map as input, the FCN regresses two outputs for 3D point estimation: a heatmap h2D, which encodes the 2D projection of the 3D points, and a latent depth map hDepth, which encodes the depth information.
The network is first initialized by training on synthetic depth maps. A mean squared error training loss is computed between the 3D point estimates, obtained by regressing over h2D and hDepth, and the ground-truth coordinates that come with the synthetic depth maps.
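The post does not spell out how a 3D point is read off h2D and hDepth. One common way to do this, assumed here rather than taken from the paper, is a soft-argmax: the heatmap is turned into weights with a softmax, the 2D location is the weighted average of pixel coordinates, and the depth is the weighted average of the latent depth map. The function names `soft_argmax_3d` and `mse_loss` are hypothetical.

```python
import numpy as np


def soft_argmax_3d(h2d, h_depth):
    """Read one (u, v, z) estimate from a 2D heatmap and a latent depth map."""
    height, width = h2d.shape
    # Softmax over all pixels (shifted by the max for numerical stability).
    weights = np.exp(h2d - h2d.max())
    weights /= weights.sum()
    vs, us = np.mgrid[0:height, 0:width]  # row (v) and column (u) indices
    u = (weights * us).sum()
    v = (weights * vs).sum()
    z = (weights * h_depth).sum()
    return np.array([u, v, z])


def mse_loss(pred, target):
    """Mean squared error, as used for the synthetic-data initialization."""
    return float(np.mean((pred - target) ** 2))


# Toy input: a heatmap sharply peaked at row 2, column 5, constant latent depth.
h2d = np.full((8, 8), -50.0)
h2d[2, 5] = 50.0
h_depth = np.full((8, 8), 300.0)

point = soft_argmax_3d(h2d, h_depth)
loss = mse_loss(point, np.array([5.0, 2.0, 300.0]))
```

Because every step is a smooth function of the heatmap values, this read-out can sit inside a network and be trained end to end.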
An energy function measures how well the spheres placed at the estimated 3D points fit the input depth map. A differentiable rendering process is used to backpropagate the fitting errors and fine-tune the estimates.
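As a rough illustration of what rendering the sphere model into a depth map involves, here is a forward-pass sketch in NumPy. It assumes an orthographic camera looking along +z and a fixed background depth, both simplifications of my own. The function names are hypothetical. NumPy does not compute gradients, so this only shows the forward computation that an autodiff framework would differentiate.

```python
import numpy as np


def render_sphere_depth(centers, radii, height, width, background=1000.0):
    """Orthographic depth map of a set of spheres (nearest surface per pixel)."""
    vs, us = np.mgrid[0:height, 0:width].astype(float)
    depth = np.full((height, width), background)
    for (cx, cy, cz), r in zip(centers, radii):
        d2 = (us - cx) ** 2 + (vs - cy) ** 2      # squared pixel distance to center
        inside = d2 <= r ** 2                      # pixels covered by this sphere
        front = cz - np.sqrt(np.clip(r ** 2 - d2, 0.0, None))  # front surface depth
        depth = np.where(inside, np.minimum(depth, front), depth)
    return depth


def model_to_data_l1(rendered, observed):
    """Mean L1 distance between a rendered and an observed depth map."""
    return float(np.mean(np.abs(rendered - observed)))


radii = np.array([2.0])
observed = render_sphere_depth(np.array([[4.0, 4.0, 100.0]]), radii, 9, 9)
# The same sphere pushed 5 units further from the camera.
rendered_off = render_sphere_depth(np.array([[4.0, 4.0, 105.0]]), radii, 9, 9)

loss_perfect = model_to_data_l1(observed, observed)
loss_off = model_to_data_l1(rendered_off, observed)
```

The loss is zero when the spheres reproduce the observed depth and grows as they drift away, which is the signal that gets backpropagated.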
The self-supervised training loss combines the following terms:
Lm2d - The model-to-data term is the L1 distance between the input depth map and the rendered depth map.
Ld2m - The data-to-model term is a registration loss between the estimated model and the input depth map.
Lmultiview - The multi-view consistency term provides supervision from multiple viewpoints.
Lvae - The prior term aims to maximize a lower bound on the likelihood of the hand pose configuration.
Lbone - The bone length term ensures that the distance between two bone endpoints remains unchanged.
Lcollision - The collision term penalizes self-collision between two spheres.
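The terms above are combined into a single training objective. A minimal sketch, assuming the combination is a weighted sum, looks like this. The weight values are placeholders because the post does not give the ones used in the paper, and the function name is hypothetical.

```python
# Hypothetical weights; the values used in the paper are not given in this post.
LOSS_WEIGHTS = {
    "m2d": 1.0,
    "d2m": 1.0,
    "multiview": 1.0,
    "vae": 1.0,
    "bone": 1.0,
    "collision": 1.0,
}


def total_self_supervised_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Made-up per-term values for one batch, only to show the call.
example_terms = {
    "m2d": 0.5,
    "d2m": 0.25,
    "multiview": 0.125,
    "vae": 1.0,
    "bone": 2.0,
    "collision": 0.0,
}
total = total_self_supervised_loss(example_terms)
```

In practice the weights decide how strongly the priors can override the data terms, so they would need tuning.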
At first glance, the process resembles model-based tracking methods (no labels, pose fitting). The difference is that this approach optimizes the network parameters, rather than the pose parameters of each frame, to improve the fit.
The benefits of data-driven approaches are obtained by minimizing the model-fitting error over the entire set of unlabelled depth maps rather than fitting frames independently, as model-based methods do.
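The distinction can be shown with a deliberately tiny toy problem. Here a "frame" is a single number, the "pose" is a single number, and the fitting energy is their squared difference. All of this is my own simplification and not the paper's setup. The first function optimizes the pose of one frame, as a model-based method would. The second optimizes the weights of a linear predictor over all frames, without ever seeing a label.

```python
import numpy as np


def energy_grad(pose, frame):
    """Gradient of the toy fitting energy (pose - frame)**2 with respect to pose."""
    return 2.0 * (pose - frame)


def fit_frame(frame, pose=0.0, lr=0.25, steps=50):
    """Model-based style: optimize the pose of one frame directly."""
    for _ in range(steps):
        pose -= lr * energy_grad(pose, frame)
    return pose


def train_by_fitting(frames, lr=0.5, steps=200):
    """Training-by-fitting style: optimize predictor weights over all frames."""
    w, b = 0.0, 0.5  # toy "network": pose = w * frame + b
    for _ in range(steps):
        g = energy_grad(w * frames + b, frames)  # dE/dpose for every frame
        w -= lr * np.mean(g * frames)            # chain rule: dpose/dw = frame
        b -= lr * np.mean(g)                     # chain rule: dpose/db = 1
    return w, b


frames = np.linspace(-1.0, 1.0, 21)
pose_single = fit_frame(0.7)      # has to be repeated for every new frame
w, b = train_by_fitting(frames)   # learned once, then reused on any frame
```

After training, the predictor gives a good pose for an unseen frame in one forward pass, while the per-frame fit has to run its optimization again each time.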
The approach presented by the authors combines the benefits of learning-based approaches, in which accuracy increases with more training data, and model-based approaches, which require no supervision.
The authors also prevent the model trained on synthetic data from dipping in accuracy by using prior terms. For example, Lmultiview mitigates the ambiguities and errors that arise from self-occlusion of the hand.
As mentioned in the paper, the model is less robust when the accuracy requirements are more stringent. This may be attributed to the use of spheres for estimation, as they cannot capture smaller fitting errors.
Similarly, the prior terms do not place any strict constraints on the kinematic feasibility of the joints, resulting in small offsets.
What are different design choices that need to be made when specifying a self-supervised method for hand pose estimation?
Self-supervised methods are usually used to overcome the lack of sufficient labelled data or to reduce human supervision. In such a scenario, it is important to design the self-supervisory signal to cover as many aspects of human supervision as possible and provide as much context as possible to help track/identify the object of interest.
When it comes to hand pose estimation, this context can come in multiple forms.
Spatio-temporal context: This is context collected across space and time. For example, extracting spatio-temporal properties from a sequence of hand pose images can help with predicting joint angles and enforcing consistency between images.
Color: Colorization, which is the task of predicting the color of pixels in an image, may lead to better tracking/identification of hands.
Alignment: Optical flow is the apparent motion of individual pixels across a series of frames. Optical flow tracking can help with hand pose tracking and may be most useful for tracking hand pose in videos.
Information from other modalities: Input from other modalities such as sound and texture might provide context for hand pose estimation. For example, sounds such as clicking and skin texture might provide clues for hand pose.
Supervision from multiple views: Multiple views are used to overcome ambiguities in pose caused by self-occlusion. Multi-view observations from unknown poses can act as a supervisory signal. One way of implementing this is to enforce geometric consistency between independently predicted shape and pose from two views of the same instance.
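The multi-view idea in the last item can be sketched as follows. This simplified version assumes the relative rotation and translation between the two cameras are known, which is my assumption and differs from the unknown-pose setting mentioned above. The function name and the toy values are hypothetical.

```python
import numpy as np


def multiview_consistency(points_a, points_b, rot_a_to_b, trans_a_to_b):
    """Mean distance between view-A predictions mapped into view B and view-B predictions."""
    points_a_in_b = points_a @ rot_a_to_b.T + trans_a_to_b
    return float(np.mean(np.linalg.norm(points_a_in_b - points_b, axis=1)))


# Camera B is camera A rotated 90 degrees about the z-axis and shifted.
rot = np.array([[0.0, -1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]])
trans = np.array([10.0, 0.0, 0.0])

pred_a = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
pred_b_consistent = pred_a @ rot.T + trans
pred_b_shifted = pred_b_consistent + np.array([3.0, 0.0, 0.0])

loss_consistent = multiview_consistency(pred_a, pred_b_consistent, rot, trans)
loss_shifted = multiview_consistency(pred_a, pred_b_shifted, rot, trans)
```

The loss needs no labels: it only asks that two independent predictions describe the same 3D points.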
What specific design choices did they make in this paper?
While designing a self-supervised approach for hand pose estimation, the authors made some specific design choices. These choices are included in the self-supervised training loss as data and prior terms.
Lm2d - The model-to-data term is the L1 distance between the input depth map and the rendered depth map. It is used to align the spheres as closely as possible to the surface points in the depth maps.
Ld2m - The data-to-model term is a registration loss between the estimated model and the input depth map. It works to reduce the distance from every point in the depth map to its projection onto the estimated hand model surface.
Lmultiview - The multi-view consistency term provides supervision from multiple viewpoints. It helps resolve the ambiguity caused by self-occlusion of the hand.
Lvae - This term aims to maximize a lower bound on the likelihood of the hand pose configuration. It helps to penalize infeasible joint configurations.
Lbone - The bone length term ensures that the distance between two bone endpoints remains unchanged.
Lcollision - The collision term penalizes self-collision between two spheres.
Apart from these, the authors also use a differentiable depth renderer that enables a denser and more consistent error term by projecting hand poses from one viewpoint to another.
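Two of the prior terms listed above, Lbone and Lcollision, are simple enough to sketch directly on the sphere centers. The exact formulas are my own assumptions: a mean absolute deviation from reference bone lengths, and a sum of penetration depths over sphere pairs. A real hand model would likely skip neighbouring spheres that overlap by design.

```python
import numpy as np


def bone_length_loss(centers, bones, rest_lengths):
    """Mean absolute deviation of each bone's length from its reference length."""
    start, end = bones[:, 0], bones[:, 1]
    lengths = np.linalg.norm(centers[start] - centers[end], axis=1)
    return float(np.mean(np.abs(lengths - rest_lengths)))


def collision_loss(centers, radii):
    """Sum of penetration depths over all pairs of distinct spheres."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    overlap = np.maximum(radii[:, None] + radii[None, :] - dist, 0.0)
    upper = np.triu_indices(len(centers), k=1)  # count each pair once
    return float(overlap[upper].sum())


# Toy configuration with three spheres; spheres 1 and 2 overlap by 1 unit.
centers = np.array([[0.0, 0.0, 0.0],
                    [10.0, 0.0, 0.0],
                    [10.0, 3.0, 0.0]])
radii = np.array([2.0, 2.0, 2.0])
bones = np.array([[0, 1]])           # one bone connecting spheres 0 and 1
rest_lengths = np.array([10.0])

bone = bone_length_loss(centers, bones, rest_lengths)
collision = collision_loss(centers, radii)

stretched = centers.copy()
stretched[1, 0] = 12.0               # bone is now 2 units too long
bone_stretched = bone_length_loss(stretched, bones, rest_lengths)
```

Both terms depend only on the predicted centers, so they can be evaluated on unlabelled data.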
What is the main difference between self-supervised hand pose estimation and fully supervised hand pose estimation?
In fully supervised hand pose estimation, the training data, i.e., hand images, are annotated with ground-truth poses, and the task reduces to a standard supervised learning problem in which the model learns to predict the annotated pose from the image. The accuracy of the supervised model increases as more labelled data becomes available.
On the other hand, in self-supervised hand pose estimation, information from unlabelled inputs, such as depth maps or RGB images, is leveraged to predict hand poses. There is minimal to no human supervision or annotated data. The self-supervisory signal is supplied through design choices, such as those discussed in sections 3.2 and 3.3, which are incorporated as terms in the self-supervised training loss that is minimized. This feature of self-supervised learning is, I believe, the main difference between self-supervised and fully supervised hand pose estimation.
What are the key pros and cons of self-supervised methods for hand pose estimation, compared to fully supervised methods?
The key pros and cons of self-supervised and fully supervised methods for hand pose estimation broadly fall into the categories of data and supervision requirements. I have summarized them in the form of a table.