Stacked Hourglass Network with Additional Skip Connection for Human Pose Estimation

Human pose estimation, the problem of localizing human joints in a single image, remains a challenge in computer vision. The hourglass network has been used in many studies to achieve good performance on this problem. For human pose estimation, low-level features are as important as high-level features for understanding the whole human body. However, the vanilla hourglass network passes only high-level features to the next stack. We therefore propose a network structure that addresses this limitation of the vanilla hourglass with an additional skip connection. The proposed skip connection improves network performance by passing relatively low-level features to the next stack. Because the skip connection is a simple element-wise sum, it does not increase the number of parameters. We evaluate the proposed method on the well-known human pose estimation dataset MPII, and the experiments confirm that it improves the pose estimation performance of the vanilla hourglass network.


Introduction
Human pose estimation is one of the challenging problems in computer vision. The human pose is key information for extracting human behavior, which is used in intelligent CCTV, autonomous vehicles, and security systems. The goal of human pose estimation is to localize joints in a single 2D image. Traditional methods estimate the pose using additional equipment (e.g., a stereo camera or a depth sensor). Recently, the performance of human pose estimation has improved greatly with the development of Convolutional Neural Networks (CNNs) [1][2][3]. Nevertheless, estimating human pose remains difficult because of the diversity of joints, camera angles, lighting conditions, clothing, and partial occlusion. Fig. 1 shows examples of these difficulties. The stacked hourglass network [1] is one of the best-known methods for human pose estimation. It is a stack of hourglass modules composed of residual blocks [10]. Because the hourglass network performs well in human pose estimation, a number of studies have used it as a backbone [4][5][6][7][8][9].
In the stacked hourglass network, the output of the current stack is added to the input of the current stack, and the sum is used as the input of the next stack. Because of this structure, only relatively high-level features are passed to the next stack, which can degrade network performance.
We therefore propose a new stacked hourglass network structure that addresses the problem that only high-level features are delivered to the next stack. The proposed structure preserves relatively low-level features through additional skip connections, and it improves performance without increasing the number of network parameters. We used the well-known human pose estimation dataset MPII for an objective evaluation of the proposed method, and the experiments confirmed that it improves the performance of the stacked hourglass network.
The hourglass network takes an image of size 256 × 256 as input. The input is reduced to 64 × 64 by a residual block and then fed to the hourglass module, which is built from residual blocks with a bottleneck structure. In the encoder part of the hourglass module, the feature size is reduced by max-pooling; in the decoder, it is restored by nearest-neighbour upsampling. This structure is repeated in every stack so that progressively more accurate features are extracted. The hourglass network used in this paper is shown in Fig. 2.
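The encoder and decoder resizing steps described above can be illustrated with a minimal sketch. It uses plain Python lists in place of real feature tensors, assumes a 2 × 2 pooling window with stride 2 and a 2× upsampling factor (the text does not state these values), and the function names are hypothetical.

```python
def max_pool_2x2(fmap):
    """Halve height and width by taking the max over each 2x2 window.

    Assumes both dimensions of fmap are even.
    """
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def upsample_nearest_2x(fmap):
    """Double height and width by repeating each value in a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # copy so the two rows are independent
    return out


fmap = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]
pooled = max_pool_2x2(fmap)             # encoder step: 4x4 -> 2x2
restored = upsample_nearest_2x(pooled)  # decoder step: 2x2 -> 4x4
print(pooled)    # [[4, 1], [2, 8]]
print(restored)  # [[4, 4, 1, 1], [4, 4, 1, 1], [2, 2, 8, 8], [2, 2, 8, 8]]
```

The restored map has the original size but not the original values, which is why hourglass modules combine the decoder output with features kept from the encoder at the same resolution.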

Approach
The stacks of the hourglass network are connected by skip connections: the input and output of the previous stack ($n-1$) are added and passed as the input of the next stack ($n$). In this structure, only high-level features continue to be passed to the next stack. However, low-level features are also important for the network to understand the whole human body. We therefore propose a new hourglass network structure that preserves the information in low-level features. Fig. 3 shows the proposed structure in detail. The additional skip connection (red dashed arrow in Fig. 3) is located in front of the hourglass module encoder, so that it delivers the low-level features of the previous stack directly to the next stack. This structure helps the network understand the entire human body by maintaining low-level as well as high-level features throughout the network.
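The wiring described above can be sketched on toy data. This is an illustration under stated assumptions, not the authors' implementation: Fig. 3 is not reproduced here, so the exact tap point is assumed to be the feature entering the previous stack's encoder, which is summed into the input of the next stack. `ToyHourglass`, `add`, and both forward functions are hypothetical names, and a feature map is reduced to a short list of numbers.

```python
def add(*features):
    """Element-wise sum of equally sized feature vectors (no parameters)."""
    return [sum(vals) for vals in zip(*features)]


class ToyHourglass:
    """Hypothetical stand-in for an hourglass module with two weights."""

    def __init__(self, scale, shift):
        self.params = [scale, shift]

    def __call__(self, x):
        scale, shift = self.params
        return [scale * v + shift for v in x]


def forward_vanilla(stacks, x):
    """Each stack passes (input + output) to the next stack."""
    outputs = []
    for hourglass in stacks:
        y = hourglass(x)
        outputs.append(y)
        x = add(x, y)
    return outputs


def forward_with_extra_skip(stacks, x):
    """Assumed wiring: the previous stack's pre-encoder feature is summed in too."""
    outputs = []
    prev_low = None  # feature in front of the previous stack's encoder
    for hourglass in stacks:
        y = hourglass(x)
        outputs.append(y)
        nxt = add(x, y) if prev_low is None else add(x, y, prev_low)
        prev_low, x = x, nxt
    return outputs


stacks = [ToyHourglass(0.5, 0.0) for _ in range(3)]
vanilla = forward_vanilla(stacks, [1.0, 2.0])
proposed = forward_with_extra_skip(stacks, [1.0, 2.0])
n_params = sum(len(h.params) for h in stacks)  # identical for both variants
print(vanilla[2], proposed[2], n_params)
```

Both forward functions use the same modules, so the parameter count is unchanged; only the third stack's input differs, because the extra term first becomes available after the first stack.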

Results
We use the well-known MPII [11] dataset to evaluate the performance of the proposed hourglass network. The MPII dataset contains around 25,000 images collected in real-world contexts, covering over 40,000 people with annotated joints. For human pose estimation, the coordinates of 16 joints were labeled for each person.
To evaluate our method, we compared it with a state-of-the-art lightweight stacked hourglass network in several experiments. As the evaluation metric, we used the Percentage of Correct Keypoints normalized by head size (PCKh), as in [12]. PCKh@0.5 uses 50% of the ground-truth head segment length as the threshold: if the distance between a predicted joint and its ground-truth position is below the threshold, the prediction is counted as correct.
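The PCKh@0.5 rule above can be written as a short function. This is a minimal sketch rather than the official MPII evaluation code: it assumes every joint is annotated and visible, takes the head segment length as a given number per person, and the name `pckh` and its arguments are hypothetical.

```python
import math


def pckh(pred, gt, head_sizes, alpha=0.5):
    """Fraction of joints whose error is below alpha * head segment length.

    pred, gt:    per-person lists of (x, y) joint coordinates in pixels.
    head_sizes:  per-person ground-truth head segment length in pixels.
    """
    correct = 0
    total = 0
    for p_joints, g_joints, head in zip(pred, gt, head_sizes):
        threshold = alpha * head
        for (px, py), (gx, gy) in zip(p_joints, g_joints):
            total += 1
            if math.hypot(px - gx, py - gy) < threshold:
                correct += 1
    return correct / total


# One person, two joints, head segment of 20 px -> threshold of 10 px.
gt = [[(10, 10), (20, 20)]]
pred = [[(13, 14), (20, 31)]]  # errors: 5 px (correct) and 11 px (wrong)
score = pckh(pred, gt, head_sizes=[20])
print(score)  # 0.5
```

Normalizing by head size makes the threshold scale with how large the person appears in the image, so near and distant people are judged by the same relative tolerance.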
We followed the same training process as the original stacked hourglass network, with an input image size of 256 × 256. For data augmentation during training, we applied rotation (±30°), scaling (±0.25), and flipping. The models used in all experiments were implemented in PyTorch [13]. We trained with the Adam optimizer [14] and a batch size of 8. The number of training epochs was 300, and the initial learning rate was $2.5 \times 10^{-4}$, which was reduced to $2.5 \times 10^{-5}$ and $2.5 \times 10^{-6}$ at the 150th and 220th epochs, respectively. The network was initialized from a normal distribution $\mathcal{N}(m, \sigma^2)$ with mean $m = 0$ and standard deviation $\sigma = 0.001$. The ground-truth heat map $H = \{H_k\}_{k=1}^{K}$ was generated by applying a Gaussian around each of the $K$ body joints. The loss $\mathcal{L}$ between the ground-truth heat map and the heat map $\hat{H} = \{\hat{H}_k\}_{k=1}^{K}$ predicted by the network is the Mean Squared Error (MSE):
$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \lVert H_k - \hat{H}_k \rVert_2^2 \qquad (1)$$
The loss is computed on the heat maps predicted by each stack and summed, which provides intermediate supervision.
We trained a two-stack version of the proposed network and compared it with the two-stack vanilla hourglass network. This experiment confirmed that the proposed method improves on the vanilla hourglass network. The results are summarized in Table I.
We also compared the proposed stacked hourglass network (8 stacks) with other methods. The results of this experiment are summarized in Table II. Fig. 4 visualizes pose estimation results of the proposed 8-stack network on the MPII dataset. These experiments confirmed that the proposed method improves the performance of the stacked hourglass network.
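The heat map target and the intermediate-supervision loss from the training setup can be sketched as follows. This is an illustration under assumptions: the Gaussian width `sigma` is not given in the text and is set to 1 here, the MSE is averaged over all pixels (one common normalization), and all function names are hypothetical. Plain lists stand in for tensors.

```python
import math


def gaussian_heatmap(size, cx, cy, sigma=1.0):
    """size x size map with a Gaussian peak of value 1 at joint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def mse(pred, target):
    """Mean squared error over all pixels of all joint heat maps."""
    diffs = [
        (p - t) ** 2
        for p_map, t_map in zip(pred, target)
        for p_row, t_row in zip(p_map, t_map)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def intermediate_supervision_loss(stack_preds, target):
    """Sum the MSE of every stack's prediction against the same target."""
    return sum(mse(pred, target) for pred in stack_preds)


# One joint (K = 1) at x = 1, y = 2 on a 4x4 map.
target = [gaussian_heatmap(4, cx=1, cy=2)]
blank = [[[0.0] * 4 for _ in range(4)]]  # a prediction of all zeros

one_stack = intermediate_supervision_loss([blank], target)
two_stacks = intermediate_supervision_loss([blank, blank], target)
perfect = intermediate_supervision_loss([target, target], target)
print(one_stack, two_stacks, perfect)
```

Because every stack is compared with the same ground truth, early stacks receive a direct training signal instead of relying only on gradients propagated back from the final stack.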

Conclusion
In this paper, we have proposed a stacked hourglass network with an additional skip connection for human pose estimation. The vanilla stacked hourglass network delivers only relatively high-level features, the outputs of each hourglass module, to the next stack. To solve this problem, we added a skip connection to the hourglass module that carries low-level features to the next stack and thereby improves network performance. In addition, because the added skip connection is an element-wise sum, it has no significant effect on the computational cost, and it can improve the flow of the gradient. We conducted several experiments to evaluate the proposed method and confirmed that it improves on the existing hourglass network architecture.