Spark Architecture


Spark is built for large-scale data processing. Hadoop MapReduce handled these workloads previously, but it required writing a lot of low-level code and was painfully slow. Spark changed the game with:

  • Speed - In-memory computation + optimized execution (up to 100x faster than MapReduce for some workloads)
  • Simplicity - High-level APIs (map, filter, reduce) that read like normal programs but run across distributed nodes
  • Flexibility - Supports batch jobs, streaming, ML, SQL, and graph processing, all in one framework
  • Compatibility - Reads/writes almost any data source or format (S3, HDFS, SQL DBs, JSON, Parquet, Avro, etc.); see the short example below
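
A minimal sketch of that last point, assuming PySpark; the bucket and paths are placeholders, and reading from S3 assumes the hadoop-aws connector is on the classpath. The same DataFrame API reads one format and writes another:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

# Read JSON from object storage (placeholder path; needs the S3 connector configured)
events = spark.read.json("s3a://my-bucket/events/2024/*.json")

# Write the same data back out as Parquet
events.write.mode("overwrite").parquet("s3a://my-bucket/events_parquet/")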

Spark Architecture

Key players in Spark orchestration:
  1. Driver program
  2. Cluster manager
  3. Executors or Workers
  4. SparkSession
  5. SparkContext
Rule of thumb:
  • Use SparkSession for DataFrames and SQL
  • Use SparkContext for RDDs

Driver Program

The Driver builds a logical DAG of transformations. The Driver's DAG Scheduler splits that DAG into stages (based on shuffle boundaries) and then into tasks (one per partition). A stage is a set of tasks that can run in parallel.
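
As a rough sketch (assuming PySpark; the bucketing expression is made up), a wide transformation like groupBy introduces a shuffle, and the physical plan shows the Exchange node that marks the stage boundary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = spark.range(1_000_000)                                     # narrow: no shuffle yet
per_bucket = df.groupBy((df.id % 10).alias("bucket")).count()   # wide: needs a shuffle

# The printed physical plan contains an Exchange node where the shuffle
# (and hence the stage boundary) happens.
per_bucket.explain()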

The Driver works with the data source (Hadoop InputFormat, Spark SQL FileFormat, JDBC, Kafka, S3) to enumerate the splits, and this enumeration happens at the data source itself. Some common data sources the Driver works with are listed below; a short sketch of the relevant knobs follows the list.

  • RDD API sc.textFile(path, minPartitions)
Driver builds a HadoopRDD and calls FileInputFormat.getSplits(...). Splits are computed from file sizes, filesystem block sizes, and Hadoop configs like:

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
  • DataFrame API — CSV/JSON/Parquet/ORC
Driver lists files and groups bytes into FilePartitions using Spark SQL knobs:

spark.sql.files.maxPartitionBytes (default ~128 MB)
spark.sql.files.openCostInBytes
Parquet may align on row groups when feasible.
  • JDBC
Driver computes numeric ranges from partitionColumn, lowerBound, upperBound, numPartitions. If you don’t set these, you get one split.
  • Kafka
Splits are Kafka partitions + offset ranges provided by the source; driver maps them to tasks.

  • RDD API sc.parallelize(data, numSlices)
Driver slices your local collection into numSlices partitions.
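
A minimal sketch of these knobs in PySpark; the paths, table names, and bounds are placeholders, exact partition counts depend on file sizes and cluster defaults, and the JDBC read assumes the matching JDBC driver is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
sc = spark.sparkContext

# RDD API: ask for at least 8 input splits (placeholder path)
rdd = sc.textFile("hdfs:///data/logs/*.txt", minPartitions=8)
print(rdd.getNumPartitions())

# Local collection: explicit slice count
nums = sc.parallelize(range(1_000_000), numSlices=4)
print(nums.getNumPartitions())   # 4

# DataFrame file sources: cap partition size at ~64 MB instead of the ~128 MB default
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
df = spark.read.parquet("hdfs:///data/events/")
print(df.rdd.getNumPartitions())

# JDBC: without these four options you get a single partition
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")   # placeholder URL
    .option("dbtable", "orders")
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load())
print(jdbc_df.rdd.getNumPartitions())   # 8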

Using these, the number and type of partitions are decided by the Driver. One task is created per partition, and the Driver's Task Scheduler sends the serialized tasks to Executors. The Cluster Manager provides these executors and allocates their resources (CPU / memory).

All of this happens only when the first action triggers execution.

What is an action?

An action is any API call that forces Spark to actually do the work and produce an outcome. Until an action happens, transformations just build the plan (see the sketch after the list below).

What an “action” does

Triggers Spark to:
  • analyze/optimize the plan,
  • enumerate input splits,
  • create stages & tasks,
  • schedule tasks on executors,
  • move/shuffle data as needed,
  • return results or write output.
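
A small sketch of this laziness, assuming PySpark (the numbers are arbitrary): the transformations below only build a plan, and nothing runs until the action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("action-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))           # source data, nothing runs yet
squares = rdd.map(lambda x: x * x)            # transformation: still just a plan
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still just a plan

# Action: NOW Spark creates stages/tasks, ships them to executors, and runs them.
print(evens.count())   # 50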

Cluster Manager

Decides where your job runs and how many resources (CPU, memory) it gets.

Spark supports different managers:
  • Standalone (Spark’s own built-in manager).
  • YARN (Hadoop’s manager).
  • Mesos (like a distributed systems kernel).
  • Kubernetes (increasingly popular in cloud-native setups).
The cluster manager doesn’t do computation—it just hands out compute slots.
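
In practice you usually pick the cluster manager with spark-submit's --master flag; the sketch below shows the equivalent master URLs set in code, assuming PySpark (hosts and ports are placeholders):

from pyspark.sql import SparkSession

# Pick exactly one master URL; the others are shown for comparison.
spark = (SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")                            # no cluster manager: run locally on all cores
    # .master("spark://master-host:7077")          # Standalone
    # .master("yarn")                              # YARN (cluster details come from Hadoop config)
    # .master("k8s://https://k8s-apiserver:6443")  # Kubernetes
    # .master("mesos://mesos-master:5050")         # Mesos
    .getOrCreate())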

Executor

Executes the tasks received from the Driver. Each task is specific to one partition.

  • These are JVM processes launched on worker nodes.
  • Each executor:
    • Runs multiple tasks (units of work).
    • Holds cache/memory (RDDs/DataFrames can be cached in executors’ memory for speed).
    • Talks back to the driver with results/status.
  • Executors are “ephemeral”: once your Spark job ends, they go away.
  • If you rerun the job, new executors spin up again.
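
A sketch of how executor resources are typically requested and how caching lands in executor memory; the sizes below are arbitrary examples, and on YARN/Kubernetes these settings are usually passed to spark-submit rather than set in code:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("executor-demo")
    .config("spark.executor.instances", "3")   # ask the cluster manager for 3 executors
    .config("spark.executor.cores", "2")       # 2 concurrent tasks per executor
    .config("spark.executor.memory", "4g")     # heap per executor JVM
    .getOrCreate())

df = spark.range(10_000_000)
df.cache()      # mark it for caching in executor memory
df.count()      # action: materializes the cache across executors
df.count()      # second action reads from the executors' cache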

SparkSession vs SparkContext

  • SparkSession is the “unified entry point” (introduced in Spark 2.x). It wraps everything: SQL, DataFrames, config, catalogs.
  • SparkContext is the older, lower-level object (still alive). It’s mainly for RDDs and is accessible as spark.sparkContext.
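
A minimal sketch of the rule of thumb from earlier, assuming PySpark (the data is made up): SparkSession for DataFrames and SQL, its embedded SparkContext for RDDs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entry-points-demo").getOrCreate()

# DataFrame / SQL work goes through the SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) AS n FROM t").show()

# RDD work goes through the SparkContext it wraps
sc = spark.sparkContext
print(sc.parallelize([1, 2, 3]).sum())   # 6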
