Spark is built for large-scale data. Hadoop MapReduce used to handle these workloads, but it required writing a lot of low-level code and was painfully slow. Spark changed the game with:
- Speed - In-memory computation + optimized execution (up to 100x faster than MapReduce for some workloads)
- Simplicity - High-level APIs (map, filter, reduce) that look like normal programming but run across distributed nodes (see the sketch after this list)
- Flexibility - Supports batch jobs, streaming, ML, SQL, and graph processing, all in one framework
- Compatibility - Reads and writes almost any data source and format (S3, HDFS, SQL databases, JSON, Parquet, Avro, etc.)
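A minimal PySpark sketch of what that high-level style looks like. The bucket path and column names here are made-up assumptions, not part of any real dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) JSON dataset, filter it, aggregate it, and write Parquet.
# Each step is a high-level transformation; Spark distributes the work across the cluster.
events = spark.read.json("s3a://my-bucket/events/")            # assumed path
errors = events.filter(F.col("level") == "ERROR")              # assumed column
counts = errors.groupBy("service").count()                     # assumed column
counts.write.mode("overwrite").parquet("s3a://my-bucket/error_counts/")
```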
Spark Architecture
- Driver program
- Cluster manager
- Executors or Workers
- Spark Session
- Spark Context
- Use SparkSession for DataFrames and SQL
- Use SparkContext for RDDs
Driver Program
The Driver builds a logical DAG of transformations. The Driver's DAG Scheduler splits that DAG into stages (based on shuffle boundaries) and then into tasks (one per partition). A stage is a set of tasks that can run in parallel.
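For intuition, here is a small word-count pipeline (the paths are hypothetical) with comments marking where the shuffle forces a new stage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-demo").getOrCreate()
sc = spark.sparkContext

# Narrow transformations (flatMap, map) stay in one stage;
# the shuffle required by reduceByKey starts a new stage.
rdd = sc.textFile("s3a://my-bucket/logs/")                     # assumed path
words = rdd.flatMap(lambda line: line.split())                 # stage 1
pairs = words.map(lambda w: (w, 1))                            # stage 1
counts = pairs.reduceByKey(lambda a, b: a + b)                 # shuffle -> stage 2
counts.saveAsTextFile("s3a://my-bucket/word_counts/")          # action triggers both stages
```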
The Driver works with the data source (Hadoop InputFormat, Spark SQL FileFormat, JDBC, Kafka, S3) to enumerate the input splits, and that enumeration happens at the data source itself. Some common data sources the Driver works with are listed below, with a configuration sketch after the list.
- RDD API — sc.textFile(path, minPartitions)
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
- DataFrame API — CSV/JSON/Parquet/ORC
spark.sql.files.maxPartitionBytes (default ~128 MB)
spark.sql.files.openCostInBytes
- JDBC — numPartitions, partitionColumn, lowerBound, upperBound options
- Kafka — one Spark partition per Kafka topic partition
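A hedged sketch of how these knobs are set in practice. The paths, connection details, and values below are placeholder assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-config-demo")
    # File-based DataFrame sources: target split size and per-file open cost.
    .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
    .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)
    .getOrCreate()
)

# RDD API: ask for a minimum number of partitions up front.
rdd = spark.sparkContext.textFile("s3a://my-bucket/logs/", minPartitions=16)  # assumed path

# JDBC: partitioning is driven by these four options.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")      # assumed connection
    .option("dbtable", "orders")                                # assumed table
    .option("partitionColumn", "order_id")                      # assumed numeric column
    .option("lowerBound", 1)
    .option("upperBound", 1_000_000)
    .option("numPartitions", 8)
    .load()
)
```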
Using these, the Driver decides the number and type of partitions. One task is created per partition, and the Driver's Task Scheduler sends the serialized tasks to the Executors. The Cluster Manager provides these executors and allocates their resources (CPU / memory).
All of this happens only when the first action triggers execution.
What is an action?
An action is any API call that forces Spark to actually do the work and produce an outcome. Until an action happens, transformations just build the plan (see the sketch after the list below).
What an “action” does
- analyze/optimize the plan,
- enumerate input splits,
- create stages & tasks,
- schedule tasks on executors,
- move/shuffle data as needed,
- return results or write output.
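A small illustration of lazy evaluation; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("action-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/orders/")          # assumed path; no job runs yet
big = df.filter(F.col("amount") > 100)                       # transformation: plan only
by_country = big.groupBy("country").count()                  # transformation: plan only

# Nothing has executed so far. These actions trigger the whole pipeline:
n = by_country.count()                                       # action: returns a number to the driver
by_country.write.mode("overwrite").parquet("s3a://my-bucket/stats/")  # action: writes output
```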
Cluster Manager
Decides where your job runs and how many resources (CPU, memory) it gets. Spark supports different managers (see the sketch after this list):
- Standalone (Spark’s own built-in manager).
- YARN (Hadoop’s manager).
- Mesos (like a distributed systems kernel).
- Kubernetes (increasingly popular in cloud-native setups).
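The cluster manager is usually selected through the master setting. A hedged sketch, where the host names and resource values are placeholders:

```python
from pyspark.sql import SparkSession

# Pick one master URL depending on the cluster manager in use:
#   "local[*]"                         - run locally on all cores (no cluster manager)
#   "spark://master-host:7077"         - Spark standalone
#   "yarn"                             - Hadoop YARN
#   "k8s://https://k8s-apiserver:6443" - Kubernetes
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("yarn")                               # assumed: running on a YARN cluster
    .config("spark.executor.instances", 4)        # how many executors to request
    .config("spark.executor.cores", 2)            # CPU per executor
    .config("spark.executor.memory", "4g")        # memory per executor
    .getOrCreate()
)
```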
Executor
- These are JVM processes launched on worker nodes.
- Each executor:
- Runs multiple tasks (units of work).
- Holds cache/memory (RDDs/DataFrames can be cached in executors’ memory for speed; see the sketch after this list).
- Talks back to the driver with results/status.
- Executors are “ephemeral”: once your Spark job ends, they go away.
- If you rerun the job, new executors spin up again.
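To illustrate the caching role of executors, a minimal sketch (the dataset path and column name are assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-cache-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/clicks/")       # assumed path

# cache() asks the executors to keep the partitions in their memory
# after the first action computes them, so later actions can reuse them.
df.cache()

total = df.count()                                        # first pass: reads storage, fills the cache
per_page = df.groupBy("page").count().collect()           # second pass: served from executor memory

df.unpersist()                                            # release executor memory when done
spark.stop()                                              # executors go away when the application ends
```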
SparkSession vs SparkContext
- SparkSession is the “unified entry point” (introduced in Spark 2.x). It wraps everything: SQL, DataFrames, config, catalogs.
- SparkContext is the older, lower-level object (still alive). It’s mainly for RDDs and is accessible as spark.sparkContext (see the sketch below).
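A short sketch of the two entry points side by side; the data is inlined, so it runs anywhere:

```python
from pyspark.sql import SparkSession

# SparkSession: the unified entry point for DataFrames and SQL.
spark = SparkSession.builder.appName("entry-points-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.createOrReplaceTempView("letters")
spark.sql("SELECT count(*) AS n FROM letters").show()

# SparkContext: the lower-level entry point for RDDs, reachable from the session.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).sum())   # 30
```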