AETv2:
AutoEncoding Transformations for Self-Supervised
Representation Learning by Minimizing Geodesic Distances in Lie Groups
Guo-Jun Qi
Laboratory for MAchine Perception and LEarning (MAPLE)
Abstract: Self-supervised learning by predicting transformations has demonstrated outstanding performance in both unsupervised and (semi-)supervised tasks. Among the state-of-the-art methods is AutoEncoding Transformations (AET), which decodes transformations from the learned representations of original and transformed images. Both deterministic and probabilistic AETs rely on the Euclidean distance to measure the deviation of estimated transformations from their groundtruth counterparts. However, this assumption is questionable, as the group of underlying transformations often resides on a curved manifold rather than in flat Euclidean space. For this reason, we should use the geodesic to characterize how an image transforms along the manifold of a Lie group, and adopt its length to measure the loss between two transformations. In particular, we propose to autoencode the homography Lie group PG(2), which contains a rich family of spatial transformations, to learn image representations. Moreover, we approximate the intractable Riemannian logarithm by projecting PG(2) onto a subgroup of rotation transformations SO(3) that admits the matrix logarithm. Experiments demonstrate that the proposed AETv2 model greatly outperforms its previous version as well as the other state-of-the-art self-supervised models in multiple representation learning tasks.
Figure 1: The deviation between two transformations should be measured along the curved manifold (Lie group) of transformations rather than through the forbidden Euclidean space of transformations.
Unlike the first AET [1], AETv2 is a self-trained model that minimizes the geodesic distance between two transformations, since a transformation continuously evolves into another transformation along the manifold of transformations rather than through the forbidden Euclidean space of transformations, as shown in Figure 1. For this purpose, we propose to train AETv2 in the Lie group of transformations by minimizing the geodesic distance between the groundtruth transformation t and its estimate t̂, computed through the Riemannian logarithm at the identity transformation, i.e., d(t, t̂) = ||log(t⁻¹ t̂)|| for a matrix Lie group.
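The geodesic distance above can be sketched numerically for matrix Lie groups. The following is a minimal NumPy/SciPy illustration, not the authors' implementation; it uses the general matrix logarithm, which the paper notes is intractable for homographies and only tractable in special cases such as rotation groups.

```python
import numpy as np
from scipy.linalg import logm

def geodesic_distance(T_true, T_pred):
    """Length of the geodesic between two elements of a matrix Lie group,
    measured as the norm of the Riemannian logarithm of the relative
    transformation, taken at the identity."""
    rel = np.linalg.inv(T_true) @ T_pred
    return np.linalg.norm(logm(rel), ord="fro")

def rot2(theta):
    """A 2D rotation, an element of SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# For two planar rotations, the relative transformation is a rotation by
# the angle difference, and the Frobenius norm of its log is sqrt(2)
# times that angle.
d = geodesic_distance(rot2(0.3), rot2(0.8))
```

For rotations this recovers the intuitive notion of distance (proportional to the angle between them), whereas a Euclidean distance between the raw matrices would cut through matrices that are not valid rotations at all.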
However, directly computing the Riemannian logarithm can be intractable for many types of transformations. For example, the group of homography transformations does not have a closed-form Riemannian logarithm, so we instead project homographies onto the subgroup SO(3) of 3D rotations in the homogeneous coordinate space of images, as shown in Figure 2 below. This eventually results in an approximate loss defined over the projected SO(3) subgroup, which combines the geodesic distance between the projected rotations in the first term with the projection distance, i.e., the residual of projecting a homography onto SO(3), in the second term.
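The two-term loss can be sketched as follows. This is a hedged reconstruction, not the paper's exact formulation: the SVD-based nearest-rotation projection, the closed-form SO(3) logarithm, and the weight `lam` balancing the two terms are all assumptions made for illustration.

```python
import numpy as np

def project_to_so3(M):
    """Project a unit-determinant 3x3 matrix onto SO(3) using the
    SVD-based nearest-rotation (orthogonal Procrustes) solution."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # keep det = +1
    return U @ D @ Vt

def so3_log(R):
    """Closed-form matrix logarithm on SO(3) (Rodrigues' formula)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)

def approx_loss(H_true, H_pred, lam=1.0):
    """Geodesic distance between the SO(3) projections, plus a weighted
    projection residual (lam is an assumed hyperparameter)."""
    H_pred = H_pred / np.cbrt(np.linalg.det(H_pred))  # unit determinant
    R_true, R_pred = project_to_so3(H_true), project_to_so3(H_pred)
    geodesic = np.linalg.norm(so3_log(R_true.T @ R_pred), ord="fro")
    residual = np.linalg.norm(H_pred - R_pred, ord="fro")
    return geodesic + lam * residual

# Sanity check on a pure rotation: the projection leaves it unchanged
# and the loss between a transformation and itself vanishes.
c, s = np.cos(0.4), np.sin(0.4)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
loss = approx_loss(Rz, Rz)
```

The SO(3) projection is what makes the logarithm tractable: unlike general homographies, rotations admit the closed-form Rodrigues logarithm above.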
Figure 2: Illustration of how AETv2 is self-trained end-to-end. The output matrix from the transformation decoder is normalized to have a unit determinant, and this normalized matrix is then projected onto SO(3) to compute the geodesic distance and the projection distance that train the model.
Table 1 below reports the experimental results on CIFAR-10, with a three-layer nonlinear classifier trained on top of the first two blocks pretrained by AETv2.
We also train AlexNet with AETv2 and evaluate the performance of the learned representations based on the Conv4 and Conv5 output feature maps. The proposed AETv2 can significantly close the performance gap with the fully supervised network, as shown in Table 3 of the paper [2].
More experimental results on the ImageNet and Places datasets can be found in our paper [2]. The source code will be released later.
[1] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, June 16-20, 2019. [ pdf ]
[2] Feng Lin, Haohang Xu, Houqiang Li, and Guo-Jun Qi. AETv2: AutoEncoding Transformations for Self-Supervised Representation Learning by Minimizing Geodesic Distances in Lie Groups.
November 16, 2019
© MAPLE Research