PySpark - Quick Checks (Draft)

 


1. Why is “move code to data” faster than “move data to code”?

When your data is small, dragging it all to your laptop or a single server to crunch is fine. But once you’re in the terabytes or petabytes range, that plan collapses because:

  • Memory limits: A single compute node can’t hold all that data.
  • Network bottlenecks: Shipping TBs across the network is painfully slow and expensive.
  • Parallelism lost: If you centralize all data on one machine, you kill the whole point of distributed computing.

Instead, Spark ships the code—a small serialized function, often just a few kilobytes—to where the data partitions already live across the cluster nodes. Each worker executes the logic locally, and only the final aggregated results (often much smaller) flow back to the driver.

So the motto is: “Move tiny code to big data, not big data to tiny code.”
That principle is why Spark, Hadoop, and similar systems scale so well.  
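
A minimal sketch of the idea in PySpark (the dataset path and the status/country columns are made up for illustration): the filter and aggregation run on the executors that already hold the partitions, and only a handful of aggregated rows travel back to the driver.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("move-code-to-data").getOrCreate()

# Hypothetical large dataset spread across the cluster.
events = spark.read.parquet("/data/events")

# The filter/groupBy/count logic is a tiny serialized plan shipped to the
# executors; only the per-country counts come back to the driver.
summary = (events
           .filter(F.col("status") == "ERROR")
           .groupBy("country")
           .count())

summary.show()        # small result moves to the driver
# events.collect()    # anti-pattern: moves the whole dataset to the driver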

2. Name the three stages of MapReduce and identify which one causes the big network shuffle.

MapReduce has three stages:
  1. Map → Each worker processes its local chunk of data and emits key–value pairs.
  2. Shuffle → All values with the same key are moved across the cluster so they can be grouped together. This is where the network traffic explodes, because data has to be redistributed between nodes.
  3. Reduce → The grouped values are aggregated (sum, count, average, etc.) on the receiving node.
The killer stage is Shuffle. Even if you “move code to data” during the map phase, the shuffle stage still forces data to move across the network so that matching keys land on the same node, and that redistribution can become the bottleneck. Spark spends a lot of engineering effort (e.g., reduceByKey, map-side combine, AQE) to minimize shuffles because they’re the most expensive step.
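
As a concrete illustration, the classic word count in the RDD API (just a sketch, assuming a SparkContext sc and a made-up input path): the flatMap/map steps run locally on each partition, and the shuffle happens inside reduceByKey, when all pairs for the same word have to be brought together on one node.

# Map phase: runs where the data already lives.
lines = sc.textFile("/data/logs.txt")          # hypothetical input path
pairs = lines.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1))

# reduceByKey triggers the shuffle: pairs with the same word are redistributed
# across the cluster so a single task can finish the per-word sum (reduce phase).
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.take(10))   # action: runs map -> shuffle -> reduce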

3. Explain why reduceByKey() is better than groupByKey()

Both groupByKey() and reduceByKey() aim to group values that share the same key, but they behave very differently during the shuffle stage.

groupByKey()

  • Each mapper sends all values for a key across the network.
  • Example: If key "A" appears a million times, every single one of those values is shuffled.
  • Then the reducer node groups them into a big list, and only then can you aggregate (sum, average, etc.).
  • Problems:
    • Massive network I/O.
    • High memory use at the reducer (holding huge lists).
    • Risk of out-of-memory errors.

reduceByKey()

  • Each mapper combines values locally before shuffling.
  • For key "A" with a million values, each partition computes a partial sum first.
  • Only those partial results (say, a few dozen values instead of millions) are sent across the network.
  • The reducer then merges the partials into the final result.
  • Benefits:
    • Far less data to shuffle.
    • Faster network transfer.
    • More scalable.
  • Limitations:
    • It can only merge values of the same type.
    • The reduce function must be of type (T, T) → T.
    • It can produce only one output value per key, e.g. a sum, count, min, or max.
Example
Say you have this dataset across 3 partitions:
("A", 1), ("A", 1), ("A", 1) … (1 million times)
  • groupByKey:
    • Shuffle sends 1 million values for "A".
    • Reducer aggregates them.
  • reduceByKey:
    • Each partition computes a local partial sum: "A" → (sum ≈ 333,333).
    • Shuffle sends only 3 values (one per partition).
    • Reducer adds them up.
Rule of Thumb
  • If you’re going to aggregate values per key, always prefer reduceByKey() or aggregateByKey() instead of groupByKey().
  • Use groupByKey() only if you really need all raw values (e.g., to build a list or do some non-associative operation).
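
To make the contrast concrete, here is a minimal sketch of the scenario above (assuming a SparkContext sc):

rdd = sc.parallelize([("A", 1)] * 1_000_000, numSlices=3)

# groupByKey: every single ("A", 1) pair crosses the network before the sum.
grouped = rdd.groupByKey().mapValues(sum)

# reduceByKey: each partition pre-sums its own pairs (map-side combine),
# so only one partial sum per partition is shuffled.
reduced = rdd.reduceByKey(lambda a, b: a + b)

print(reduced.collect())   # [('A', 1000000)]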

4. What are the differences between the reduceByKey, aggregateByKey and combineByKey functions?


  Method             Input Type   Output Type   When to Use
  reduceByKey()      (K, V)       (K, V)        When you just need a simple associative + commutative reduction (sum, max).
  aggregateByKey()   (K, V)       (K, U)        When you need richer accumulators with an initial zeroValue.
  combineByKey()     (K, V)       (K, C)        Most flexible: different input/output types, custom combining logic.
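
A sketch showing all three on the same made-up (key, score) pairs (assuming a SparkContext sc). Note how aggregateByKey and combineByKey let the accumulator type (here a (sum, count) tuple used for an average) differ from the value type:

scores = sc.parallelize([("a", 3), ("a", 5), ("b", 4)])

# reduceByKey: (V, V) -> V, output value type equals input value type.
totals = scores.reduceByKey(lambda a, b: a + b)

# aggregateByKey: explicit zeroValue plus a within-partition and a
# cross-partition function; the accumulator is a (sum, count) tuple.
averages = (scores
            .aggregateByKey((0, 0),
                            lambda acc, v: (acc[0] + v, acc[1] + 1),
                            lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda t: t[0] / t[1]))

# combineByKey: like aggregateByKey, but the initial combiner is built
# from the first value seen for each key instead of a zeroValue.
averages2 = (scores
             .combineByKey(lambda v: (v, 1),
                           lambda acc, v: (acc[0] + v, acc[1] + 1),
                           lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda t: t[0] / t[1]))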

5. What makes RDDs “resilient”?

The “R” in RDD is all about fault tolerance. Spark tracks the lineage of each RDD — basically a recipe of how it was created from other datasets. If a partition of an RDD is lost (say, a node crashes), Spark doesn’t need to checkpoint everything. Instead, it can recompute the lost partition using the lineage graph. That’s why it’s resilient: it can bounce back from data loss without needing explicit replication everywhere.
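
You can actually inspect that lineage “recipe” with toDebugString() (a small sketch, assuming a SparkContext sc and a made-up input path):

rdd = (sc.textFile("/data/logs.txt")                     # hypothetical path
         .filter(lambda line: "ERROR" in line)
         .map(lambda line: (line.split()[0], 1))
         .reduceByKey(lambda a, b: a + b))

# Prints the chain of dependencies Spark would replay to rebuild a lost
# partition (toDebugString() returns bytes in PySpark, hence the decode).
print(rdd.toDebugString().decode())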

6. Difference between a transformation and an action in Spark

  • Transformation: A recipe, not a meal. Operations like map, filter, flatMap, groupByKey, reduceByKey build new RDDs or DataFrames but don’t immediately run anything. They’re lazy. Spark just remembers the plan.
  • Action: This is when Spark finally gets off the couch. Things like collect, count, saveAsTextFile, or show actually trigger execution of the DAG (Directed Acyclic Graph) of transformations.
So: transformations = build plan, actions = execute plan.
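
A minimal illustration (assuming a SparkSession named spark): nothing runs until the action at the end.

from pyspark.sql import functions as F

df = spark.range(1_000_000)                        # transformation: lazy
doubled = df.withColumn("x2", F.col("id") * 2)     # transformation: lazy
filtered = doubled.filter(F.col("x2") > 100)       # transformation: still lazy

# Only now does Spark build and run the DAG for the three steps above.
print(filtered.count())                            # action: triggers the job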

7. What’s one practical benefit of AQE (Adaptive Query Execution) in Spark SQL?

AQE lets Spark adapt at runtime based on actual data stats, instead of blindly following the original logical plan.

A very practical benefit: automatic handling of skewed joins. If one key has way more rows than others, a static plan might send all of it to one reducer (ouch). With AQE, Spark detects this skew during execution and can split the heavy partition into smaller chunks, balancing the workload and preventing one task from becoming the bottleneck.
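
AQE and its skew-join handling are driven by runtime configs (these are the standard Spark 3.x settings; big_df, dim_df and the join key below are placeholders):

spark.conf.set("spark.sql.adaptive.enabled", "true")            # AQE (on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # split skewed partitions at runtime

# With AQE on, a join against a heavily skewed key no longer funnels all of
# that key's rows into one oversized task; Spark re-plans mid-query.
result = big_df.join(dim_df, "customer_id")
result.write.mode("overwrite").parquet("/tmp/joined")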

8. What’s the difference between .withColumn() and .select() in DataFrame?

  • .withColumn(colName, expr)
    • Adds a new column or replaces an existing one.
    • df.withColumn("isAdult", df.age >= 18)
    • Leaves all the other columns intact.
  • .select(exprs…)
    • Creates a new DataFrame with only the specified columns.
    • df.select("name", (df.age >= 18).alias("isAdult"))
    • Drops any columns you don’t explicitly include.

Rule of thumb:

  • Use .withColumn() when you’re adding/updating a column.

  • Use .select() when you’re reshaping which columns you keep.
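
Putting both on one tiny DataFrame (a sketch, assuming a SparkSession named spark):

from pyspark.sql import functions as F

df = spark.createDataFrame([("Ada", 36), ("Linus", 17)], ["name", "age"])

# withColumn: keeps name and age, adds isAdult.
with_flag = df.withColumn("isAdult", F.col("age") >= 18)

# select: the result has only the columns/expressions listed here.
reshaped = df.select("name", (F.col("age") >= 18).alias("isAdult"))

with_flag.show()   # columns: name, age, isAdult
reshaped.show()    # columns: name, isAdult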

9. How would you replace all NULL ages with 0?

df = df.na.fill({"age": 0})

or equivalently:

from pyspark.sql.functions import coalesce, lit

df = df.withColumn("age", coalesce(df.age, lit(0)))
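
Quick sanity check on a toy DataFrame (assuming a SparkSession named spark); note that filling with a dict only touches the listed column, so nulls in other columns are left alone.

df = spark.createDataFrame([("Ada", 36), ("Linus", None)], ["name", "age"])

# Only the "age" column is filled; "name" nulls (if any) would stay null.
df.na.fill({"age": 0}).show()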

