PySpark - Count Distinct Values of a DataFrame Column

Introduction

In this tutorial, we count the distinct values of a PySpark DataFrame column. To do this, we use the DataFrame methods distinct() and count(), chained as distinct().count(), and the countDistinct() function of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to Spark's DataFrame functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Count Distinct Values

Let's count the distinct values of the PySpark DataFrame column "language". We will explore two different possibilities to count the distinct values.

Option 1: `distinct().count()`

The first option we will explore is using the distinct() method combined with count().
