Introduction

When working with PySpark DataFrames, handling missing or empty values is a common task in data preprocessing. In many cases, empty strings ("") should be treated as null values for better compatibility with Spark operations, such as filtering, aggregations, and machine learning workflows. In this tutorial, we’ll explore how to replace empty strings with null values in a PySpark DataFrame.

Import Libraries

First, import the following Python module:

from pyspark.sql import SparkSession

Create SparkSession

Before working with PySpark, a SparkSession must be created. The SparkSession serves as the entry point to all Spark functionalities. To create a basic SparkSession programmatically, use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, create an example PySpark DataFrame based on a list of tuples. To do this, use the createDataFrame() method of the SparkSession.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("", "FastAPI", 9000),
    ("Java", "", 7000),
    ("", "", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+--------+---------+-----+
|language|framework|users|
+--------+---------+-----+
|  Python|   Django|20000|
|        |  FastAPI| 9000|
|    Java|         | 7000|
|        |         | 5000|
+--------+---------+-----+

Replace Empty Strings with Null Values

Let's replace empty strings with null values in the PySpark DataFrame.

To do this, use the replace() method of the PySpark DataFrame.
