Introduction

When working with large datasets in PySpark, you often need to manipulate string data in DataFrame columns. One common operation is extracting a portion of a string, known as a substring, from a column. In this tutorial, we explore how to extract substrings from a DataFrame column in PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|   Django|20000|
|    Python|  FastAPI| 9000|
|      Java|   Spring| 7000|
|JavaScript|  ReactJS| 5000|
+----------+---------+-----+

Extract Substring

Now, let's say we want to extract the first 3 characters from the framework column. We can do this with PySpark's substring function.

The substring function takes three arguments:

  • The column (given as a column name or a Column object) from which you want to extract the substring.
  • The starting position (1-based index).
  • The length of the substring to extract.
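Because the starting position is 1-based, it differs by one from Python's 0-based slicing. The snippet below is a plain-Python analogy for that (position, length) convention, not Spark's implementation. The helper name sql_style_substring is hypothetical, and it only covers positions of 1 or greater.

```python
def sql_style_substring(text: str, pos: int, length: int) -> str:
    """Plain-Python analogy for a 1-based (pos, length) substring.

    Hypothetical helper for illustration only; handles pos >= 1.
    """
    if pos < 1 or length < 0:
        raise ValueError("this sketch only handles pos >= 1 and length >= 0")
    start = pos - 1  # convert the 1-based position to a 0-based slice index
    return text[start:start + length]


# Position 1 is the first character, so (1, 3) gives the first three characters.
print(sql_style_substring("Django", 1, 3))   # Dja
# Position 5 is the fifth character ("A" in "FastAPI").
print(sql_style_substring("FastAPI", 5, 3))  # API
```

In other words, a starting position of 1 with a length of 3 corresponds to the Python slice text[0:3].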

Let's extract the first 3 characters from the framework column:
