Weakly Supervised 2D Human Pose Transfer

SCIENCE CHINA Information Sciences 2021

Qian Zheng1    Yajie Liu1    Zhizhao Lin1    Dani Lischinski2    Daniel Cohen-Or1,3    Hui Huang1*

1Shenzhen University    2The Hebrew University of Jerusalem    3Tel-Aviv University


We present a novel method for pose transfer between two 2D human skeletons. When the bone lengths and proportions between the two skeletons are significantly different, pose transfer becomes a challenging task, which cannot be accomplished by simply copying the joint positions or the bone directions. Our data-driven approach utilizes a deep neural network trained, in a weakly supervised fashion, to encode a skeleton into two separate latent codes, one representing its pose, and another representing the skeleton's proportions (skeleton-ID). The network is given two skeletons, and learns to combine the pose of one with the skeleton-ID of the other. Lacking supervision on the poses, we develop a novel loss that qualitatively compares poses of different skeletons. We evaluate the performance of our method on a large set of poses. The advantages of avoiding supervision are demonstrated by showing transfer of extreme poses, as well as between uncommon skeleton proportions.

Figure 2. Comparing different pose transfer alternatives; both pose and ID inputs are from the Mixamo test set. From left to right: pose input, ID input, pose resulting from globally scaling the pose input to the height of the bounding box of the ID input, pose resulting from combining the 2D bone angles of the pose input with the 2D bone lengths of the ID input, pose resulting from combining the 3D bone angles of the pose input with the 3D bone lengths of the ID input, our result, and the ground-truth pose taken from the same time step of the same motion performed by the target ID. The ground truth is overlaid in light gray on top of every output for better visual comparison.
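The two simplest 2D alternatives in Figure 2 can be sketched in a few lines. The following is only an illustrative sketch, not the authors' code: the skeleton representation (a list of (x, y) joints plus a `parents` list with -1 for the root, ordered so that every parent precedes its children) and the function names `scale_to_height` and `transfer_2d_bones` are assumptions made for this example.

```python
import math

def scale_to_height(pose_joints, id_joints):
    """Baseline 1: uniformly scale the pose so that its bounding-box
    height matches the bounding-box height of the ID skeleton."""
    pose_ys = [y for _, y in pose_joints]
    id_ys = [y for _, y in id_joints]
    pose_h = max(pose_ys) - min(pose_ys)
    id_h = max(id_ys) - min(id_ys)
    s = id_h / pose_h if pose_h > 0 else 1.0
    return [(x * s, y * s) for x, y in pose_joints]

def transfer_2d_bones(pose_joints, id_joints, parents):
    """Baseline 2: keep the 2D bone directions of the pose input and
    use the 2D bone lengths of the ID input.
    Assumes parents[j] < j for every non-root joint."""
    out = [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:
            # The root joint stays where the pose input places it.
            out[j] = (float(pose_joints[j][0]), float(pose_joints[j][1]))
            continue
        # Bone direction taken from the pose input.
        dx = pose_joints[j][0] - pose_joints[p][0]
        dy = pose_joints[j][1] - pose_joints[p][1]
        norm = math.hypot(dx, dy)
        ux, uy = (dx / norm, dy / norm) if norm > 0 else (0.0, 0.0)
        # Bone length taken from the ID input.
        length = math.hypot(id_joints[j][0] - id_joints[p][0],
                            id_joints[j][1] - id_joints[p][1])
        out[j] = (out[p][0] + length * ux, out[p][1] + length * uy)
    return out

# Toy 3-joint chain: root -> joint 1 -> joint 2.
parents = [-1, 0, 1]
pose = [(0, 0), (1, 0), (1, 1)]   # bone lengths 1 and 1
ident = [(0, 0), (0, 2), (0, 5)]  # bone lengths 2 and 3
retargeted = transfer_2d_bones(pose, ident, parents)
scaled = scale_to_height(pose, ident)
```

One reason the second baseline can fail is that a 2D bone length measured in a projection depends on how that bone is oriented in 3D, so the lengths read off the ID input are tied to the ID input's own pose rather than to its true proportions.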

Figure 3. Our network architecture.

Figure 4. The virtual link loss measures the pose similarity between the input pose (the yellow skeleton on the far left) and the output pose (the purple skeleton on the far right) by examining the spatial relations between all pairs of joints (before and after pose transfer) along multiple directions. The middle-left diagram shows the pair of joints i1 and j1 (indicated by rectangles in the end figures) projected onto three directions. The projections onto the red direction are flipped, and such flips are penalized by our loss. In contrast, in the middle-right diagram, the pair i2 and j2 (indicated by ellipses in the end figures) exhibits consistent projections onto all three directions. For clearer illustration, the positions of bones and joints in the two middle diagrams are slightly adjusted.
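The flip test described in the caption can be illustrated with a small counting routine. This is a hedged sketch of the idea only: the actual loss used for training is presumably a differentiable formulation that is not given here, and the choice of evenly spaced directions, the function names `unit_directions` and `flip_fraction`, and the tolerance `eps` are all assumptions made for this example.

```python
import math
from itertools import combinations

def unit_directions(k):
    """k evenly spaced unit directions over the half circle [0, pi)."""
    return [(math.cos(math.pi * t / k), math.sin(math.pi * t / k))
            for t in range(k)]

def flip_fraction(src, out, k=3, eps=1e-9):
    """Fraction of (joint pair, direction) combinations for which the
    projected order of the two joints differs between src and out."""
    dirs = unit_directions(k)
    flips = 0
    total = 0
    for i, j in combinations(range(len(src)), 2):
        for dx, dy in dirs:
            # Signed projection of the virtual link i -> j before transfer.
            a = (src[j][0] - src[i][0]) * dx + (src[j][1] - src[i][1]) * dy
            # Signed projection of the same link after transfer.
            b = (out[j][0] - out[i][0]) * dx + (out[j][1] - out[i][1]) * dy
            total += 1
            if a * b < -eps:  # opposite signs: the order has flipped
                flips += 1
    return flips / total if total else 0.0

src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Uniform scaling changes proportions but keeps every projected order.
scaled = [(2 * x, 2 * y) for x, y in src]
# Mirroring about the vertical axis flips several projected orders.
mirrored = [(-x, y) for x, y in src]

same = flip_fraction(src, scaled)
changed = flip_fraction(src, mirrored)
```

Because only the sign of each projection matters, a measure of this kind ignores bone lengths, which is what allows poses of skeletons with different proportions to be compared.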

Figure 5. More results generated by the model trained with the Mixamo dataset. The outputs are shown in purple, with the ground truths overlaid in gray.

Figure 6. More results using the model trained with the CMU Panoptic dataset. Both the 2D poses and the skeleton-IDs are extracted from web images, so there is no ground truth for pose transfer.

Figure 7. Pose transfer results using the model trained on the Mixamo dataset. The pose and ID inputs are 2D projections of 3D skeletons, or extracted from real person images; they were not seen during training. The ground truth, whenever available, is depicted in light gray on top of every output.

Figure 9. Two challenging poses extracted from images are transferred to different skeleton-IDs. Note the two touching hands in the Yoga pose on the left, and the unusual contact between the hand and the foot in the pose on the right.

Figure 13. Pose-guided image synthesis with our pose transfer. A pose-guided image synthesis model is trained individually for each of the two persons in the left column. The yellow boxes in the middle show a target pose extracted from a person with different proportions (a child) and the result of the transfer conditioned directly on that pose. Image synthesis after our pose transfer (purple box on the right) results in a body shape much more closely resembling the original one. It should be noted that the feet of the yellow poses had to be aligned with the blue input poses to ensure that the feet are on the ground, while our results needed no further alignment.

Figure 14. Given a source image, we extract its clothing layout, and synthesize a new layout conditioned on a target pose.

Figure 15. Failure cases. For challenging poses with extreme foreshortening, or poses in which two people interact closely, our method may fail to generate plausible results.


This work was supported in part by NSFC (U2001206), GD Talent Program (2019JC05X328), GD Science and Technology Program (2020A0505100064, 2015A030312015), DEGP Key Project (2018KZDXM058), Shenzhen Science and Technology Program (RCJC20200714114435012, JCYJ20180305125709986), National Engineering Laboratory for Big Data System Computing Technology, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).



title={Weakly Supervised 2D Human Pose Transfer},

author={Qian Zheng and Yajie Liu and Zhizhao Lin and Dani Lischinski and Daniel Cohen-Or and Hui Huang},

journal={SCIENCE CHINA Information Sciences},




