PySpark - Create a DataFrame with Schema

Introduction

In this tutorial, we want to create a PySpark DataFrame with a specific schema. To do this, we use the createDataFrame() method of the SparkSession.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all of Spark's functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Define Data

Now, we define a list containing the data of the DataFrame:

data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]

Define Schema

Next, we would like to create a PySpark DataFrame with a specific schema. For the schema, we have to specify the column names along with their data types.

To do this, we use the classes StructType and StructField. A StructType holds a list of StructField objects, one per column. Each StructField defines the column name, the data type, and a flag indicating whether the column may contain null values.

schema = StructType([
    StructField("language", StringType(), True),
    StructField("framework", StringType(), True),
    StructField("users", IntegerType(), True),
])

Create PySpark DataFrame

Next, we create the PySpark DataFrame from the defined list.

To do this, we use the method createDataFrame() and pass the data and the schema as arguments. The method show() prints the contents of the DataFrame as a table.

df = spark.createDataFrame(data, schema)
df.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is very easy to create a PySpark DataFrame with a specific schema: we define the schema with StructType and StructField and pass it to the createDataFrame() method together with the data. Try it yourself!

