Introduction
In this tutorial, we create a PySpark DataFrame from a Python list. To do this, we use the createDataFrame() method of the SparkSession.
Import Libraries
First, we import the SparkSession class from the pyspark.sql module:
from pyspark.sql import SparkSession
Create SparkSession
Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.
To create a basic SparkSession programmatically, we use the builder. The getOrCreate() call returns an existing session if one is already running and creates a new one otherwise:
spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()
Define Data
Now, we define the data as a list of tuples. Each tuple is one row of the DataFrame:
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]
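Every row should have the same number of fields, in the same order. The following plain-Python check is an optional sketch, not part of the original steps, that confirms this before the data is handed to Spark. The names row_lengths and num_fields are illustrative.

```python
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000),
]

# Collect the distinct tuple lengths; a well-formed table has exactly one.
row_lengths = {len(row) for row in data}
if len(row_lengths) != 1:
    raise ValueError(f"Rows have inconsistent lengths: {sorted(row_lengths)}")

num_fields = next(iter(row_lengths))
print(f"{len(data)} rows with {num_fields} fields each")
```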
Create PySpark DataFrame
Next, we create the PySpark DataFrame from the defined list.
To do this, we call the createDataFrame() method on the SparkSession and pass the data and the column names as arguments. Spark infers the column types from the values. The show() method prints the DataFrame as a text table.
column_names = ["language", "framework", "users"]
df = spark.createDataFrame(data, column_names)
df.show()
Conclusion
Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is very easy to create a PySpark DataFrame: we simply call the createDataFrame() method of the SparkSession with the data and the column names. Try it yourself!