Introduction

When working with PySpark DataFrames, handling different data types correctly is essential for data preprocessing. Mismatched or incorrect data types can lead to errors in Spark operations such as filtering, aggregations, and machine learning workflows. In this tutorial, we’ll explore how to convert column data types in a PySpark DataFrame.

📥 Import Libraries

First, import the necessary Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

🔌 Create SparkSession

Before working with PySpark, a SparkSession must be created. The SparkSession serves as the entry point to all Spark functionality. To create a basic SparkSession, use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

📂 Create PySpark DataFrame

Next, create an example PySpark DataFrame from a list of tuples using the createDataFrame() method of the SparkSession. Note that the values in the users column are deliberately defined as strings, so that we can convert them to integers later on.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", "20000"),
    ("Python", "FastAPI", "9000"),
    ("Java", "Spring", "7000"),
    ("JavaScript", "ReactJS", "5000")
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|   Django|20000|
|    Python|  FastAPI| 9000|
|      Java|   Spring| 7000|
|JavaScript|  ReactJS| 5000|
+----------+---------+-----+

Now, let's inspect the schema of the DataFrame to verify the data types:

df.printSchema()

Output:

root
 |-- language: string (nullable = true)
 |-- framework: string (nullable = true)
 |-- users: string (nullable = true)

🔄 Convert Column Data Type

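The col() and IntegerType imports at the top of this tutorial point to the standard approach: reference the column with col(), convert it with cast(), and replace the original column using withColumn(). A minimal sketch, assuming the df created above:

df = df.withColumn("users", col("users").cast(IntegerType()))
df.printSchema()

Output:

root
 |-- language: string (nullable = true)
 |-- framework: string (nullable = true)
 |-- users: integer (nullable = true)

Alternatively, cast() also accepts the type name as a string, so col("users").cast("integer") produces the same result without importing IntegerType. Keep in mind that values which cannot be parsed as integers become null rather than raising an error.

With the users column converted, numeric comparisons work as expected. For example, filtering for frameworks with more than 8000 users:

df.filter(col("users") > 8000).show()

Output:

+--------+---------+-----+
|language|framework|users|
+--------+---------+-----+
|  Python|   Django|20000|
|  Python|  FastAPI| 9000|
+--------+---------+-----+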