1. Why is “move code to data” faster than “move data to code”?
Moving the data to the code breaks down at scale for a few reasons:
- Memory limits: A single compute node can’t hold all that data.
- Network bottlenecks: Shipping TBs across the network is painfully slow and expensive.
- Parallelism lost: If you centralize all data on one machine, you kill the whole point of distributed computing.
Instead, Spark ships the code—a small serialized function, often just a few kilobytes—to where the data partitions already live across the cluster nodes. Each worker executes the logic locally, and only the final aggregated results (often much smaller) flow back to the driver.
So the motto is: “Move tiny code to big data, not big data to tiny code.”
That principle is why Spark, Hadoop, and similar systems scale so well.
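To make this concrete, here is a minimal PySpark sketch (the dataset and partition count are made up for illustration): the lambda is the small serialized function that gets shipped to the workers, and only the per-partition results travel back to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("move-code-to-data").getOrCreate()
sc = spark.sparkContext

# The lambda below is the "tiny code" that Spark serializes and ships to the
# workers holding each partition; only small per-partition results come back.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * 2).sum()
print(total)  # 999999000000
```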
2. Name the three stages of MapReduce and identify which one causes the big network shuffle.
MapReduce has three stages:
- Map → Each worker processes its local chunk of data and emits key–value pairs.
- Shuffle → All values with the same key are moved across the cluster so they can be grouped together. This is where the network traffic explodes, because data has to be redistributed between nodes.
- Reduce → The grouped values are aggregated (sum, count, average, etc.) on the receiving node.
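A classic word count (a toy in-memory PySpark example) maps directly onto these three stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "to do is to be"])

# Map: each worker turns its local lines into (word, 1) pairs.
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Shuffle + Reduce: pairs are redistributed by key (the expensive network step),
# then the counts for each word are summed on the receiving node.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('to', 4), ('be', 3), ('or', 1), ...]
```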
3. Explain why reduceByKey() is better than groupByKey()
groupByKey()
- Each mapper sends all values for a key across the network.
- Example: If key "A" appears a million times, every single one of those values is shuffled.
- Then the reducer node groups them into a big list, and only then can you aggregate (sum, average, etc.).
- Problems:
- Massive network I/O.
- High memory use at the reducer (holding huge lists).
- Risk of out-of-memory errors.
reduceByKey()
- Each mapper combines values locally before shuffling.
- For key "A" with a million values, each partition computes a partial sum first.
- Only those partial results (say, a few dozen values instead of millions) are sent across the network.
- The reducer then merges the partials into the final result.
- Benefits:
- Far less data to shuffle.
- Faster network transfer.
- More scalable.
- Limitations:
- It can only merge values of the same type.
- The reduce function must have type (T, T) → T.
- It produces only one aggregated value per key (e.g., a sum, count, min, or max).
Say you have this dataset across 3 partitions:
("A", 1), ("A", 1), ("A", 1) … (1 million times)
- groupByKey:
- Shuffle sends 1 million values for "A".
- Reducer aggregates them.
- reduceByKey:
- Each partition computes "A" → (sum=333,333).
- Shuffle sends only 3 values (one per partition).
- Reducer adds them up.
- If you’re going to aggregate values per key, always prefer reduceByKey() or aggregateByKey() instead of groupByKey().
- Use groupByKey() only if you really need all raw values (e.g., to build a list or do some non-associative operation).
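Here is a minimal PySpark sketch of the million-value example above (the numbers are illustrative, but it shows where the shuffle savings come from):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The million ("A", 1) pairs from the example, spread over 3 partitions.
pairs = sc.parallelize([("A", 1)] * 1_000_000, numSlices=3)

# groupByKey: every single ("A", 1) pair is shuffled, then summed on the reducer.
grouped_sum = pairs.groupByKey().mapValues(sum)

# reduceByKey: each partition pre-sums locally, so only 3 partial sums shuffle.
reduced_sum = pairs.reduceByKey(lambda a, b: a + b)

print(reduced_sum.collect())  # [('A', 1000000)]
```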
4. What are the differences between the reduceByKey, aggregateByKey and combineByKey functions?
| Method | Input Type | Output Type | When to Use |
|---|---|---|---|
| reduceByKey() | (K, V) | (K, V) | When you just need a simple associative + commutative reduction (sum, max). |
| aggregateByKey() | (K, V) | (K, U) | When you need richer accumulators with an initial zeroValue. |
| combineByKey() | (K, V) | (K, C) | Most flexible: different input/output types, custom combining logic. |
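As a rough sketch on a tiny made-up dataset: reduceByKey for a plain sum, then aggregateByKey and combineByKey for a per-key average, where the accumulator type (sum, count) differs from the value type.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

scores = sc.parallelize([("A", 10), ("A", 30), ("B", 20)])

# reduceByKey: (V, V) -> V, input and output types stay the same.
totals = scores.reduceByKey(lambda a, b: a + b)

# aggregateByKey: accumulator type U differs from V; here U = (sum, count).
sum_count = scores.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge two accumulators

# combineByKey: same idea, but the initial accumulator is built from the
# first value seen for each key instead of a fixed zeroValue.
sum_count2 = scores.combineByKey(
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners

averages = sum_count.mapValues(lambda t: t[0] / t[1])
print(averages.collect())  # e.g. [('A', 20.0), ('B', 20.0)]
```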
5. What makes RDDs “resilient”?
The “R” in RDD is all about fault tolerance. Spark tracks the lineage of each RDD — basically a recipe of how it was created from other datasets. If a partition of an RDD is lost (say, a node crashes), Spark doesn’t need to checkpoint everything. Instead, it can recompute the lost partition using the lineage graph. That’s why it’s resilient: it can bounce back from data loss without needing explicit replication everywhere.
6. Difference between a transformation and an action in Spark
- Transformation: A recipe, not a meal. Operations like map, filter, flatMap, groupByKey, reduceByKey build new RDDs or DataFrames but don’t immediately run anything. They’re lazy. Spark just remembers the plan.
- Action: This is when Spark finally gets off the couch. Things like collect, count, saveAsTextFile, or show actually trigger execution of the DAG (Directed Acyclic Graph) of transformations.
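A small PySpark sketch (toy data, assumed local session) showing that transformations only build the plan and actions trigger it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(10))

evens = nums.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
doubled = evens.map(lambda x: x * 2)        # still lazy, just extends the DAG

print(doubled.count())    # action: triggers execution, prints 5
print(doubled.collect())  # action: prints [0, 4, 8, 12, 16]
```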
7. What’s one practical benefit of AQE (Adaptive Query Execution) in Spark SQL?
A very practical benefit: automatic handling of skewed joins. If one key has way more rows than others, a static plan might send all of it to one reducer (ouch). With AQE, Spark detects this skew during execution and can split the heavy partition into smaller chunks, balancing the workload and preventing one task from becoming the bottleneck.
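For reference, these are the relevant Spark 3.x configuration keys in case you need to toggle the behavior explicitly (AQE is already enabled by default in recent releases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution and its skew-join handling.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```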
8. What’s the difference between .withColumn() and .select() in DataFrames?
- .withColumn(colName, expr)
- Adds a new column or replaces an existing one.
- df.withColumn("isAdult", df.age >= 18)
- .select(exprs…)
- Creates a new DataFrame with only the specified columns.
- df.select("name", (df.age >= 18).alias("isAdult"))
Rule of thumb:
- Use .withColumn() when you’re adding/updating a column.
- Use .select() when you’re reshaping which columns you keep.
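A quick side-by-side sketch with a hypothetical two-column DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy DataFrame with name and age columns.
df = spark.createDataFrame([("Ana", 25), ("Bo", 12)], ["name", "age"])

# withColumn: keeps every existing column and adds (or replaces) "isAdult".
with_flag = df.withColumn("isAdult", df.age >= 18)

# select: keeps only the columns you list.
reshaped = df.select("name", (df.age >= 18).alias("isAdult"))

with_flag.show()   # name, age, isAdult
reshaped.show()    # name, isAdult
```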
9. How would you replace all NULL ages with 0?
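The simplest approach (a sketch; the DataFrame below is a hypothetical stand-in with a nullable age column) is fillna with a per-column dict:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a NULL age.
df = spark.createDataFrame([("Ana", 25), ("Bo", None)], "name string, age int")

# fillna with a dict only touches the listed column(s).
df_filled = df.fillna({"age": 0})
df_filled.show()
```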
or equivalently:
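```python
from pyspark.sql import functions as F

# Same result using coalesce (assumes the same df with an "age" column as above).
df_filled2 = df.withColumn("age", F.coalesce(F.col("age"), F.lit(0)))

# Or spelled out with when/otherwise.
df_filled3 = df.withColumn(
    "age", F.when(F.col("age").isNull(), 0).otherwise(F.col("age")))
```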