Spark Architecture


Spark is built for large-scale data. Hadoop MapReduce handled this previously, but it required writing a lot of low-level code and was painfully slow. Spark changed the game with:

  • Speed - In-memory computation + optimized execution (up to 100x faster than MapReduce on some workloads)
  • Simplicity - High-level APIs (map, filter, reduce) that look like normal programming but run across distributed nodes
  • Flexibility - Supports batch jobs, streaming, ML, SQL, and graph processing, all in one framework
  • Compatibility - Reads / writes almost any data source or format (S3, HDFS, SQL DBs, JSON, Parquet, Avro, etc.)

When the data is too large to fit on a single box (e.g., terabytes in size), don't drag the data to your code; move your code (a tiny function) to where the data lives. This is called data-parallel processing, and it is exactly what Spark does via RDDs, DataFrames, and Spark SQL. Spark 3 also adds Adaptive Query Execution and Dynamic Partition Pruning for smarter, faster queries.
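
A minimal sketch of the idea in PySpark (the S3 path and column name here are hypothetical): the filter runs on the executors that hold the data partitions, and only the small final count travels back to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("code-to-data").getOrCreate()

events = spark.read.parquet("s3a://my-bucket/events/")     # hypothetical large dataset
# The filter and count execute next to the data, on the executors;
# only the final number is returned to the driver.
print(events.filter(events["status"] == "ERROR").count())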

Spark Architecture

Key players in Spark's orchestration:
  1. Driver program
  2. Cluster manager
  3. Executors or Workers
  4. Spark Session
  5. Spark Context
Rule of thumb:
  • Use SparkSession for DataFrames and SQL
  • Use SparkContext for RDDs

Driver Program

The Driver builds a logical DAG of transformations. Its DAG scheduler splits that DAG into stages (at shuffle boundaries) and each stage into tasks (one per partition). A stage is a set of tasks that can run in parallel.
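
For example (a rough sketch with a hypothetical input path), a wide transformation such as reduceByKey introduces a shuffle boundary, so the job below splits into two stages; nothing runs until the take action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/words.txt")              # hypothetical path
pairs = lines.flatMap(str.split).map(lambda w: (w, 1))     # narrow transformations: stay in one stage
counts = pairs.reduceByKey(lambda a, b: a + b)             # shuffle boundary: starts a new stage
print(counts.take(10))                                     # action: only now are stages and tasks created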

The Driver works with the data source (Hadoop InputFormat, Spark SQL FileFormat, JDBC, Kafka, S3) to enumerate input splits, and that split computation is specific to each source. Some common data sources the Driver works with are listed below; a code sketch after the list shows how these partition counts can be influenced from the application side.

  • RDD API sc.textFile(path, minPartitions)
Driver builds a HadoopRDD and calls FileInputFormat.getSplits(...). Splits are computed from file sizes, filesystem block sizes, and Hadoop configs like:

mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
  • DataFrame API — CSV/JSON/Parquet/ORC
Driver lists files and groups bytes into FilePartitions using Spark SQL knobs:

spark.sql.files.maxPartitionBytes (default ~128 MB)
spark.sql.files.openCostInBytes
Parquet may align on row groups when feasible.
  • JDBC
Driver computes numeric ranges from partitionColumn, lowerBound, upperBound, numPartitions. If you don’t set these, you get one split.
  • Kafka
Splits are Kafka partitions + offset ranges provided by the source; driver maps them to tasks.

  • sc.parallelize(data, numSlices)
Driver slices your local collection into numSlices partitions.
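
A hedged sketch tying the knobs above together (the paths, JDBC URL, table name, and bounds are hypothetical; the JDBC example also assumes the driver jar is on the classpath; Kafka is omitted because its splits come straight from topic partitions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD API: request at least 8 input splits (the source may still create more).
logs = sc.textFile("hdfs:///data/logs/*.log", minPartitions=8)

# DataFrame file sources: cap partition size at ~64 MB before reading.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
events = spark.read.parquet("s3a://my-bucket/events/")

# JDBC: without these four options you get a single partition.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/shop")
          .option("dbtable", "orders")
          .option("partitionColumn", "order_id")
          .option("lowerBound", 1)
          .option("upperBound", 1_000_000)
          .option("numPartitions", 16)
          .load())

# Local collection: an explicit number of slices.
nums = sc.parallelize(range(1_000_000), numSlices=32)

print(logs.getNumPartitions(), events.rdd.getNumPartitions(),
      orders.rdd.getNumPartitions(), nums.getNumPartitions())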

Using these, the Driver decides the number and type of partitions. One task is created per partition, and the Driver's task scheduler sends the serialized tasks to executors. The cluster manager provides these executors and allocates their resources (CPU / memory).

All of this happens only when the first action triggers execution.

What is an action?

An action is any API call that forces Spark to actually do the work and produce a result. Until an action happens, transformations just build the plan.

What an “action” does

Triggers Spark to:
  • analyze/optimize the plan,
  • enumerate input splits,
  • create stages & tasks,
  • schedule tasks on executors,
  • move/shuffle data as needed,
  • return results or write output.
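
A tiny sketch of that laziness (values here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(10_000), numSlices=4)
evens = nums.filter(lambda n: n % 2 == 0)   # transformation: returns immediately, nothing runs yet
total = evens.count()                       # action: the plan is analyzed and tasks are scheduled
print(total)                                # 5000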

Cluster Manager

Decides where your job runs and how many resources (CPU, memory) it gets.

Spark supports different managers:
  • Standalone (Spark’s own built-in manager).
  • YARN (Hadoop’s manager).
  • Mesos (like a distributed systems kernel).
  • Kubernetes (increasingly popular in cloud-native setups).
The cluster manager doesn’t do computation—it just hands out compute slots.
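
The manager itself is usually chosen at submit time (e.g., via --master); from application code you mostly describe how much you want rather than where it comes from. A hedged sketch, with arbitrary example sizes (which keys are honored depends on the manager):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("resource-request-sketch")
         .master("yarn")                            # or local[*], spark://..., k8s://...
         .config("spark.executor.instances", "4")   # how many executors to ask for
         .config("spark.executor.cores", "2")       # CPU slots per executor
         .config("spark.executor.memory", "4g")     # heap per executor
         .getOrCreate())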

Executor

Executes the tasks received from the Driver. Each task is specific to one partition.

  • These are JVM processes launched on worker nodes.
  • Each executor:
    • Runs multiple tasks (units of work).
    • Holds cache/memory (RDDs/DataFrames can be cached in executors’ memory for speed).
    • Talks back to the driver with results/status.
  • Executors are “ephemeral”: once your Spark job ends, they go away.
  • If you rerun the job, new executors spin up again.
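
Caching is the part of executor memory you control most directly (a small sketch; the path and columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.read.json("s3a://my-bucket/logs/")
logs.cache()                                              # ask executors to keep these partitions in memory

errors = logs.filter(logs["level"] == "ERROR").count()    # first action: reads the source and fills the cache
warnings = logs.filter(logs["level"] == "WARN").count()   # later actions are served from executor memory
print(errors, warnings)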

SparkSession vs SparkContext

  • SparkSession is the “unified entry point” (introduced in Spark 2.x). It wraps everything: SQL, DataFrames, config, catalogs.
  • SparkContext is the older, lower-level object (still alive). It’s mainly for RDDs and is accessible as spark.sparkContext.
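
A small sketch showing both entry points side by side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entry-points").getOrCreate()

# DataFrames and SQL go through the SparkSession.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

# RDD work goes through the SparkContext, reachable from the session.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())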

FAQs

1. Why is “move code to data” generally faster than “move data to code”?
Because data is huge, and code is tiny.

If you try to move terabytes of data across a network to where your code is… good luck, bring coffee. Networks are slow, disks are slower. But if you ship a small function to where the data already lives (i.e., on worker nodes), you avoid expensive I/O and network transfer.

This is the core of data-parallel computing, and why frameworks like Hadoop and Spark send computation (i.e., tasks/functions) to nodes that own the data blocks.


2. Name the three stages of MapReduce and which one causes the big network shuffle.

Map – processes raw input data and emits key-value pairs.
Shuffle – groups all values by key across the cluster. This is where the big network transfer happens—because keys are re-partitioned across nodes.
Reduce – aggregates values for each key (e.g., summing).

The shuffle is the performance killer. It’s why smart systems (like Spark’s reduceByKey) try to do local aggregation before shuffle to reduce the data transferred.
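
A hedged sketch of that idea: reduceByKey combines values on each node before the shuffle, while groupByKey ships every value across the network first (both give the same answer here).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)], numSlices=2)

# groupByKey: every value crosses the network, then gets summed.
grouped = pairs.groupByKey().mapValues(sum).collect()

# reduceByKey: partial sums happen locally (map-side combine), so far less data is shuffled.
reduced = pairs.reduceByKey(lambda a, b: a + b).collect()

print(sorted(grouped), sorted(reduced))   # both: [('a', 3), ('b', 2)]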

3. What makes RDDs “resilient”? (Hint: lineage.)

RDDs are Resilient Distributed Datasets because they can rebuild lost data through lineage.

Instead of storing every intermediate result, Spark keeps a record of the operations (transformations) used to create each RDD. If a partition is lost (e.g., a node crashes), Spark recomputes it from source data using the lineage DAG (Directed Acyclic Graph).

Resilience = don’t save everything, just save how to get it back.
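
You can inspect the lineage Spark keeps with toDebugString (a small sketch; the exact output format varies by version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(100), numSlices=4)
         .map(lambda x: (x % 3, x))
         .reduceByKey(lambda a, b: a + b))

# Prints the recipe for rebuilding each partition, not the data itself.
print(rdd.toDebugString().decode("utf-8"))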


4. Define “transformation” vs “action” in Spark. Which triggers execution?

Transformation = builds up a logical plan. Lazily evaluated. Examples: map(), filter(), groupBy().

Action = triggers execution. Actually runs the DAG. Examples: collect(), count(), saveAsTextFile().

Only actions trigger execution. Until then, Spark is building a recipe—it doesn’t start cooking.

5. One practical benefit of AQE (Adaptive Query Execution) in real workloads?

AQE lets Spark optimize a query at runtime, instead of only at compile time.

Example: When Spark doesn’t know the size of tables beforehand (e.g., due to missing stats), it might make a bad join choice. AQE can switch to a better join strategy (like broadcast join) mid-flight, based on actual partition sizes.

Real-world benefit? Faster joins, less shuffle, fewer out-of-memory errors, automatic optimization even when stats are outdated.
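
In Spark 3.x these behaviours are controlled by session-level SQL configs (AQE is enabled by default in recent releases); a sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-plan queries at runtime using real partition statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce many small shuffle partitions into fewer, larger ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed partitions so one huge partition doesn't stall a join.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")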
