Introduction

When working with data in PySpark, you will often encounter scenarios where a single column contains multiple pieces of information, such as a combination of names, categories, or attributes. In such cases, it is essential to split these values into separate columns for better data organization and analysis. In this tutorial, we’ll explore how to split a column of a PySpark DataFrame into multiple columns.

Import Libraries

First, import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

Create SparkSession

Before working with PySpark, a SparkSession must be created. The SparkSession serves as the entry point to all Spark functionalities. To create a basic SparkSession programmatically, use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, create an example PySpark DataFrame based on a list. To do this, use the createDataFrame() method of the SparkSession.

column_names = ["language_framework", "users"]
data = [
    ("Python - Django", 20000),
    ("Python - FastAPI", 9000),
    ("Java - Spring", 7000),
    ("JavaScript - ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+--------------------+-----+
|  language_framework|users|
+--------------------+-----+
|     Python - Django|20000|
|    Python - FastAPI| 9000|
|       Java - Spring| 7000|
|JavaScript - ReactJS| 5000|
+--------------------+-----+

Split Column into Multiple Columns

Let's split the language_framework column into two new columns: language and framework.

To do this, use the split() function of PySpark. This function breaks a column’s string values on a specified pattern; note that the pattern is interpreted as a regular expression. In our case, the delimiter is " - ", which contains no special regex characters and therefore matches literally.
