Mastering Apache Spark 2.x

Advanced techniques in complex Big Data processing, streaming analytics and machine learning

Publisher: Packt Publishing Limited

By: Romeo Kienzler

Paid access

Jan 2025

E-Book €34.99; Institutions €134.95

Description

Advanced analytics on your Big Data with the latest Apache Spark 2.x

Key Features

Master the art of real-time Big Data processing using Apache Spark 2.x
Perform machine learning, deep learning and streaming data analytics by extending the most up-to-date functionalities of Apache Spark
An advanced guide with a unique combination of tips, instructions and practical examples on using Apache Spark effectively

Book Description

Apache Spark is an in-memory, cluster-based Big Data processing system that provides a wide range of functionalities such as graph processing, machine learning, stream processing, and more. This book will take your knowledge of Apache Spark to the next level by teaching you how to expand Spark’s functionality and build your data flows and machine/deep learning programs on top of the platform.

The book starts with a quick overview of the Apache Spark ecosystem and introduces you to the new features and capabilities in Apache Spark 2.x. You will then work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets effectively, streaming analytics with Spark Streaming, and performing machine learning and deep learning on Spark using MLlib and external tools such as H2O and Deeplearning4j. The book also contains chapters on efficient graph processing, memory management and using Apache Spark on the cloud.

By the end of this book, you will have all the information you need to master Apache Spark and use it efficiently for Big Data processing and analytics.

What you will learn

Get to grips with the newly introduced features in Apache Spark 2.x
Perform highly optimised unified batch and real-time data processing using Spark SQL and Structured Streaming
Evaluate large-scale graph processing and analysis using GraphX and GraphFrames
Perform advanced machine learning and deep learning with Spark MLlib, Spark ML, SystemML, H2O and Deeplearning4j
Learn how specific parameter settings affect the overall performance of an Apache Spark cluster
Apply Apache Spark in elastic deployments using Jupyter and Zeppelin notebooks, Docker, Kubernetes and the IBM Cloud

Who this book is for

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use cases of Apache Spark 2.x, this book is for you. Big Data professionals who want to integrate and use the features of Apache Spark to build a strong Big Data pipeline will also find this book a useful resource. A fundamental knowledge of Apache Spark and the Scala programming language is assumed.