Combined YOLOv5 and HRNet for High Accuracy 2D Keypoint and Human Pose Estimation

Hung-Cuong Nguyen; Thi-Hao Nguyen; Jakub Nowak; Aleksander Byrski; Agnieszka Siwocha; Van-Hung Le

doi:10.2478/jaiscr-2022-0019

Journal of Artificial Intelligence and Soft Computing Research

Volume 12 (2022): Issue 4 (October 2022)

Open Access

DOI: https://doi.org/10.2478/jaiscr-2022-0019

Language: English

Page range: 281 - 298

Submitted on: Jun 15, 2022

Accepted on: Oct 18, 2022

Published on: Oct 29, 2022

Published by: SAN University

In partnership with: Paradigm Publishing Services

Keywords: YOLOv5, HRNet, 2D key points estimation, 2D human pose estimation

Related subjects: Computer sciences, Databases and data mining, Artificial intelligence

© 2022 Hung-Cuong Nguyen, Thi-Hao Nguyen, Jakub Nowak, Aleksander Byrski, Agnieszka Siwocha, Van-Hung Le, published by SAN University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
