📘 Introduction

Every Spark application tells a story — a story of how your code travels from a high-level command in Python or Scala to a fully distributed computation running across dozens or even hundreds of executors. Behind the scenes, Spark organizes this work into jobs, stages, and tasks — the building blocks of its execution model.

Understanding these layers is essential if you want to interpret the Spark UI, tune performance, or debug slow applications. Each represents a level of granularity in Spark’s execution plan: from the overall job triggered by an action, down to the smallest unit of computation performed on a single partition. Let’s explore what each of these components means and how they interact to bring your Spark code to life.

🧩 Spark Execution Hierarchy

First, let’s consider how jobs, stages, and tasks relate to each other within Spark’s execution model.

⚙️ What Is a Spark Job?

A job in Spark is the top-level unit of execution. It’s created every time you perform an action — such as collect(), count(), or save(). Actions tell Spark that it’s time to actually compute a result, prompting the system to build an execution plan and submit a job to the cluster.

For example:

result = df.filter(df.age > 30)   # filter() is a lazy transformation, so no job runs yet
count = result.count()            # count() is an action, so Spark launches a job here

When you call count(), Spark launches a job that includes all the transformations required to compute that count. Each job is broken down into smaller execution units — stages and tasks — that the cluster executes in parallel.

💡
You can see each job listed separately in the Spark UI, where it shows details like duration, stages involved, and how efficiently resources were used.
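
For instance, here is a minimal sketch (assuming a local SparkSession; the DataFrame and column names are illustrative) showing that each action submits its own job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-demo").getOrCreate()

df = spark.range(1_000_000)         # lazy: builds a plan, no job yet
evens = df.filter(df.id % 2 == 0)   # still lazy: just another transformation

evens.count()    # action #1: Spark submits job #1
evens.collect()  # action #2: Spark submits job #2 (recomputed unless cached)

spark.stop()

In the Spark UI’s Jobs tab, these two actions show up as two separate jobs.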

🏗️ What Are Spark Stages?

Within each job, Spark divides the work into stages. A stage represents a collection of tasks that can be executed without requiring data movement between nodes.

Stages are determined by Spark’s lineage graph, which tracks how each dataset depends on previous ones. Whenever Spark encounters a dependency that can be computed locally (like a simple filter() or map()), it stays within the same stage. When it reaches a point where data must be redistributed — such as a repartition() or groupBy() — it creates a new stage.

💡
Each stage corresponds to a boundary of parallel execution. For a simple filter().count() job, there might be just one stage. For more complex pipelines, Spark may create multiple stages, each with its own set of tasks running concurrently across executors.
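
As a rough sketch of where a stage boundary appears (the city column here is hypothetical), a groupBy() forces a shuffle and therefore starts a new stage:

# Stage 1: narrow transformations (filter, select) stay within one stage
adults = df.filter(df.age > 30).select("city", "age")

# groupBy() needs data from all partitions, so Spark inserts a shuffle
# and runs the aggregation in a new stage
by_city = adults.groupBy("city").count()

by_city.show()   # the action triggers the job; the UI shows two stages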

⚙️ What Are Spark Tasks?

A task is the smallest unit of execution in Spark — the actual work assigned to one partition of data. Every stage is made up of many tasks, each operating on a different subset of the dataset.
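
As an illustrative sketch, the number of tasks Spark schedules for a stage matches the number of partitions it reads:

# Each partition becomes one task in the stage that scans this DataFrame
print(df.rdd.getNumPartitions())   # e.g. prints 8, so the stage runs 8 tasks

# Changing the partitioning changes the task count of later stages
df = df.repartition(16)            # the next stage will run 16 tasks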
