Performance Optimization System for Hadoop and Spark Frameworks

Hrachya Astsatryan; Aram Kocharyan; Daniel Hagimont; Arthur Lalayan

doi:10.2478/cait-2020-0056



Cybernetics and Information Technologies

Volume 20 (2020): Issue 6 (December 2020)

By: Hrachya Astsatryan, Aram Kocharyan, Daniel Hagimont and Arthur Lalayan

Open Access

Dec 2020

Abstract

The optimization of large-scale data set processing depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed across several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system that enables selecting the compression tools and tuning the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
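The CPU/IO tradeoff described in the abstract can be illustrated with a minimal Python sketch. It is not the system presented in the paper: `zlib` from the standard library stands in for the codecs a Hadoop or Spark deployment would use, and the function names, the sample data, the 100 MB/s throughput, and the additive cost model are all assumptions made for illustration.

```python
import time
import zlib


def measure_levels(data: bytes, levels=(1, 6, 9)):
    """Return a list of (level, compression_ratio, cpu_seconds) per zlib level."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Lossless check: decompression must restore the input exactly.
        assert zlib.decompress(compressed) == data
        ratio = len(data) / len(compressed)
        results.append((level, ratio, elapsed))
    return results


def estimated_total_time(size_bytes, ratio, cpu_seconds, io_bytes_per_second):
    """Toy cost model (an assumption): compression time plus transfer time
    of the compressed bytes at a fixed I/O throughput."""
    return cpu_seconds + (size_bytes / ratio) / io_bytes_per_second


if __name__ == "__main__":
    # Hypothetical, highly repetitive log-like input.
    sample = b"2020-12-31 INFO task finished in 42 ms\n" * 50_000
    io_rate = 100e6  # hypothetical 100 MB/s disk or network throughput
    for level, ratio, secs in measure_levels(sample):
        total = estimated_total_time(len(sample), ratio, secs, io_rate)
        print(f"level={level} ratio={ratio:.1f} cpu={secs:.4f}s est_total={total:.4f}s")
```

Under this toy model, a higher level pays off only when the transfer time it saves exceeds the extra CPU time it costs, so the preferred setting shifts with the assumed I/O throughput and with how compressible the data is.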



DOI: https://doi.org/10.2478/cait-2020-0056 | Journal eISSN: 1314-4081 | Journal ISSN: 1311-9702


Language: English

Page range: 5 - 17

Submitted on: Jul 6, 2020

Accepted on: Sep 25, 2020

Published on: Dec 31, 2020

Published by: Bulgarian Academy of Sciences, Institute of Information and Communication Technologies

In partnership with: Paradigm Publishing Services

Keywords: Hadoop, Spark, data compression, CPU/IO tradeoff, performance optimization

Related subjects: Computer sciences, Information technology

© 2020 Hrachya Astsatryan, Aram Kocharyan, Daniel Hagimont, Arthur Lalayan, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
