Introduction

In this tutorial, we show how to draw a random sample of rows from a PySpark DataFrame. To do this, we use the sample() function of PySpark.

What is the sample() Function?

The sample() function in PySpark is used to create a new DataFrame by randomly sampling a subset of the rows from an existing DataFrame. This can help in working with a smaller dataset that is representative of the original large dataset, making it easier to perform preliminary analysis or testing without processing the entire dataset.

The syntax of the sample() function of PySpark looks as follows:

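DataFrame.sample(withReplacement=None, fraction=None, seed=None)
  • withReplacement (optional): A boolean value that specifies whether to sample with replacement or not. If True, the same row can be returned more than once. The default is False.
  • fraction: A float value that specifies the fraction of rows to return. This is an expected fraction: the returned DataFrame is not guaranteed to contain exactly this share of the rows.
  • seed (optional): An integer value that acts as the seed for the random number generator. Setting it makes the sample reproducible; if it is omitted, a random seed is used.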
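
To see why fraction is an expected value rather than an exact one, here is a rough plain-Python sketch of sampling without replacement, in which each row is kept independently with probability fraction. The helper bernoulli_sample is hypothetical and only illustrates the idea; it is not Spark's implementation and does not use PySpark at all.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only, not part of PySpark.
    """
    rng = random.Random(seed)  # a dedicated generator, so the seed is local
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))

# Same seed, same input: the two samples are identical.
sample_a = bernoulli_sample(rows, fraction=0.5, seed=42)
sample_b = bernoulli_sample(rows, fraction=0.5, seed=42)

print(len(sample_a))         # near 500, but not guaranteed to be exactly 500
print(sample_a == sample_b)  # True
```

With only a handful of rows, the size of such a sample can deviate noticeably from fraction times the row count, which is worth keeping in mind for the small example DataFrame used in this tutorial.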

Import Libraries

First, we import SparkSession from the pyspark.sql module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000), 
    ("Python", "FastAPI", 9000), 
    ("Java", "Spring", 7000), 
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Get Sampled Subset with Replacement

First, we will sample 50% of the rows from the PySpark DataFrame with replacement.
