1. Spark Execution Model and Architecture

Hadoop

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system over and above a cluster of computers. It allows us to combine the storage capacity of hundreds and thousands of computers and use it as a unified storage system.

MapReduce

MapReduce is a distributed processing framework over and above the HDFS. It allows us to break the data processing job into smaller tasks and use the cluster of computers to finish tasks independently. This also allowed us to combine the outcome of those different tasks and produce an aggregated result.

Data Lake

Ingestion - Data collection and ingestion

Storage - Data storage and management

Processing - Data processing and transformation

Consumption - Data access and retrieval

Within the data processing stage, there are 2 parts: data processing framework and orchestration. The data processing framework is where Spark lies. The orchestration framework is responsible for managing resources and creating the cluster. The orchestration framework is where tools like Hadoop YARN and Kubernetes lie.

Why is Spark so popular?