Spark Architecture
Spark is built for large-scale data. Hadoop MapReduce handled these workloads before, but it required writing a lot of low-level code and was painfully slow. Spark changed the game with:
- Speed - In-memory computation + optimized execution (up to 100x faster than MapReduce on some workloads)
- Simplicity - High-level APIs (map, filter, reduce) that read like normal programming but run across distributed nodes.
- Flexibility - Supports batch jobs, streaming, ML, SQL and graphs all in one framework
- Compatibility - Reads / Writes almost any data format (S3, HDFS, SQL DBs, JSON, Parquet, Avro, etc.)
Spark Architecture
- Driver program
- Cluster manager
- Executors or Workers
- Spark Session
- Spark Context
- Use SparkSession for DataFrames and SQL
- Use SparkContext for RDDs
Driver Program
The Driver builds the logical DAG of transformations. The Driver's DAG Scheduler splits that DAG into stages (at shuffle boundaries) and each stage into tasks (one per partition). A stage is a set of tasks that can run in parallel.
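For example, here is a minimal PySpark sketch (the data is made up) where reduceByKey introduces a shuffle boundary, so the Driver splits the job into two stages when the action runs:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("stage-demo").getOrCreate().sparkContext

# Stage 1: narrow work, one task per input partition (3 here).
words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
pairs = words.map(lambda w: (w, 1))

# reduceByKey needs all values for a key on the same node, so it introduces a
# shuffle boundary; the DAG Scheduler cuts Stage 2 here.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Nothing has executed yet. collect() is the action that makes the Driver build
# the stages, create one task per partition, and schedule them on executors.
print(counts.collect())
```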
The Driver works with the data source (Hadoop InputFormat, Spark SQL FileFormat, JDBC, Kafka, S3) to enumerate the input splits, and that split calculation happens at the data source itself. Some common data sources, and the settings that control their splits, are listed below.
- RDD API — sc.textFile(path, minPartitions), controlled by:
  - mapreduce.input.fileinputformat.split.minsize
  - mapreduce.input.fileinputformat.split.maxsize
- DataFrame API — CSV/JSON/Parquet/ORC, controlled by:
  - spark.sql.files.maxPartitionBytes (default ~128 MB)
  - spark.sql.files.openCostInBytes
- JDBC (partitionColumn, lowerBound, upperBound, numPartitions)
- Kafka (one Spark partition per Kafka topic partition)
Using these, the Driver decides the number and type of partitions. One task is created per partition, and the Driver's Task Scheduler sends the serialized tasks to the Executors. The Cluster Manager provides these executors and allocates their resources (CPU/memory).
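A hedged configuration sketch of the knobs mentioned above; the file paths are placeholders and the values are arbitrary:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-demo")
         # DataFrame reader: aim for ~64 MB per file-based partition (default is ~128 MB).
         .config("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
         .getOrCreate())

# RDD API: ask for at least 8 partitions when reading a text file.
rdd = spark.sparkContext.textFile("/data/logs/events.txt", minPartitions=8)

# DataFrame API: split sizing follows spark.sql.files.maxPartitionBytes.
df = spark.read.parquet("/data/warehouse/events/")

# The Driver creates one task per partition when an action runs.
print(rdd.getNumPartitions(), df.rdd.getNumPartitions())
```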
All of this happens only when the first action triggers execution.
What is an action?
An action is any API call that forces Spark to actually do the work and produce an outcome. Until an action happens, transformations just build the plan.
What an “action” does (see the sketch after this list):
- analyze/optimize the plan,
- enumerate input splits,
- create stages & tasks,
- schedule tasks on executors,
- move/shuffle data as needed,
- return results or write output.
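To see the split between building the plan and running it, here is a small sketch (the bucket column is invented): explain() prints the optimized plan without launching any tasks, while count() is the action that sets off the steps above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("action-demo").getOrCreate()

df = spark.range(1_000_000)                      # lazy: just a plan
big = df.withColumn("bucket", F.col("id") % 10)  # still lazy
agg = big.groupBy("bucket").count()              # still lazy (adds a shuffle)

agg.explain()         # analyze/optimize the plan; no tasks run yet
result = agg.count()  # ACTION: stages and tasks are created and scheduled
print(result)         # 10 buckets
```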
Cluster Manager
Decides where your job runs and how many resources (CPU, memory) it gets. Spark supports different managers (a master-URL sketch follows the list):
- Standalone (Spark’s own built-in manager).
- YARN (Hadoop’s manager).
- Mesos (like a distributed systems kernel).
- Kubernetes (increasingly popular in cloud-native setups).
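In code (or, more commonly, via spark-submit) the manager is chosen through the master URL. A sketch with placeholder host names and ports:

```python
from pyspark.sql import SparkSession

# Pick ONE master URL depending on your cluster manager.
# Host names and ports below are placeholders.
master = "local[*]"                            # no cluster manager: run locally
# master = "spark://spark-master:7077"         # Standalone
# master = "yarn"                              # YARN (cluster details come from HADOOP_CONF_DIR)
# master = "k8s://https://k8s-apiserver:6443"  # Kubernetes

spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master(master)
         .getOrCreate())
```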
Executor
- These are JVM processes launched on worker nodes.
- Each executor:
- Runs multiple tasks (units of work).
- Holds cache/memory (RDDs/DataFrames can be cached in executors’ memory for speed; see the caching sketch after this list).
- Talks back to the driver with results/status.
- Executors are “ephemeral”: once your Spark job ends, they go away.
- If you rerun the job, new executors spin up again.
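A small caching sketch with a synthetic dataset: cache() only marks the DataFrame, and the executors keep its partitions in memory the first time an action computes them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS key")

df.cache()   # mark for caching; nothing is stored yet
df.count()   # first action: executors compute and keep partitions in memory

df.groupBy("key").count().show()  # reuses the cached partitions, no recompute

df.unpersist()  # free the executors' storage memory when done
```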
SparkSession vs SparkContext
- SparkSession is the “unified entry point” (introduced in Spark 2.x). It wraps everything: SQL, DataFrames, config, catalogs.
- SparkContext is the older, lower-level object (still alive). It’s mainly for RDDs and is accessible as spark.sparkContext.
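In code the relationship looks like this:

```python
from pyspark.sql import SparkSession

# Unified entry point (Spark 2.x+): SQL, DataFrames, config, catalog.
spark = SparkSession.builder.appName("entry-points").getOrCreate()

# The lower-level SparkContext still lives inside the session.
sc = spark.sparkContext

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])  # SparkSession for DataFrames
rdd = sc.parallelize([1, 2, 3, 4])                                  # SparkContext for RDDs

print(df.count(), rdd.sum())
```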
FAQs
1. Why is “move code to data” generally faster than “move data to code”?
Because data is huge, and code is tiny.
If you try to move terabytes of data across a network to where your code is… good luck, bring coffee. Networks are slow, disks are slower. But if you ship a small function to where the data already lives (i.e., on worker nodes), you avoid expensive I/O and network transfer.
This is the core of data-parallel computing, and why frameworks like Hadoop and Spark send computation (i.e., tasks/functions) to nodes that own the data blocks.
2. Name the three stages of MapReduce and which one causes the big network shuffle.
Map – processes raw input data and emits key-value pairs.
Shuffle – groups all values by key across the cluster. This is where the big network transfer happens—because keys are re-partitioned across nodes.
Reduce – aggregates values for each key (e.g., summing).
The shuffle is the performance killer. It’s why smart systems (like Spark’s reduceByKey) try to do local aggregation before shuffle to reduce the data transferred.
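A quick PySpark illustration (the word list is made up): both versions give the same counts, but reduceByKey pre-aggregates on each node before the shuffle, while groupByKey ships every raw (word, 1) pair across the network.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("shuffle-demo").getOrCreate().sparkContext

pairs = sc.parallelize(["spark", "hadoop", "spark", "spark", "hive"]).map(lambda w: (w, 1))

# Map-side combine: each partition pre-aggregates before shuffling.
good = pairs.reduceByKey(lambda a, b: a + b)

# No pre-aggregation: every raw pair is shuffled, then summed afterwards.
bad = pairs.groupByKey().mapValues(lambda vals: sum(vals))

print(sorted(good.collect()), sorted(bad.collect()))
```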
3. What makes RDDs “resilient”? (Hint: lineage.)
RDDs are Resilient Distributed Datasets because they can rebuild lost data through lineage.
Instead of storing every intermediate result, Spark keeps a record of the operations (transformations) used to create each RDD. If a partition is lost (e.g., a node crashes), Spark recomputes it from source data using the lineage DAG (Directed Acyclic Graph).
Resilience = don’t save everything, just save how to get it back.
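You can inspect that lineage yourself: toDebugString() prints the chain of transformations Spark would replay to rebuild a lost partition (in PySpark it comes back as bytes, and the exact output format varies by version).

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("lineage-demo").getOrCreate().sparkContext

rdd = (sc.parallelize(range(100), 4)
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# The lineage: the recipe for rebuilding any lost partition of `rdd` from its source.
print(rdd.toDebugString().decode("utf-8"))  # PySpark returns this as bytes
```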
4. Define “transformation” vs “action” in Spark. Which triggers execution?
Transformation = builds up a logical plan. Lazily evaluated. Examples: map(), filter(), groupBy().
Action = triggers execution. Actually runs the DAG. Examples: collect(), count(), saveAsTextFile().
Only actions trigger execution. Until then, Spark is building a recipe—it doesn’t start cooking.
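One way to feel the laziness (a sketch, assuming the RDD API; DataFrame readers may validate the path earlier): defining a transformation on a path that does not exist succeeds instantly, and the failure only surfaces at the action.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("lazy-demo").getOrCreate().sparkContext

# Transformations: return instantly, even though the path is bogus.
lines = sc.textFile("/no/such/path.txt")
upper = lines.map(lambda line: line.upper())

try:
    upper.count()  # ACTION: only now does Spark try to read the splits, so only now does it fail
except Exception as err:
    print("Failed only at the action:", type(err).__name__)
```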
5. One practical benefit of AQE (Adaptive Query Execution) in real workloads?
AQE lets Spark optimize a query at runtime, instead of only at compile time.
Example: When Spark doesn’t know the size of tables beforehand (e.g., due to missing stats), it might make a bad join choice. AQE can switch to a better join strategy (like broadcast join) mid-flight, based on actual partition sizes.
Real-world benefit? Faster joins, less shuffle, fewer out-of-memory errors, automatic optimization even when stats are outdated.
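AQE is configuration-driven. A hedged sketch of the main switches as they exist in Spark 3.x (where AQE is enabled by default in recent releases):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-demo")
         # Master switch for Adaptive Query Execution.
         .config("spark.sql.adaptive.enabled", "true")
         # Coalesce many tiny shuffle partitions into fewer, larger ones at runtime.
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Split skewed partitions so one huge key does not stall (or OOM) a single task.
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())

# With AQE on, Spark can also replace a planned sort-merge join with a broadcast
# join mid-flight once it sees the actual (small) size of one side of the join.
```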

