PySpark - Remove Duplicates from a DataFrame

Introduction

In this tutorial, we want to drop duplicates from a PySpark DataFrame. To do this, we use the dropDuplicates() method of PySpark.

Import Libraries

First, we import the following Python modules:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame "df" with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI", 9000),
    ("Python", "FastAPI", 9000),
    ("JavaScript", "ReactJS", 7000),
    ("Python", "Django", 20000),
]
df = spark.createDataFrame(data, column_names)
df.show()

Removing Duplicate Rows

Next, we would like to remove duplicate rows from the DataFrame "df".

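To do this, we use the dropDuplicates() method of PySpark. Called without arguments, it compares all columns, so only rows that match in every column count as duplicates: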

df_cleaned = df.dropDuplicates()
df_cleaned.show()

Removing Duplicate Rows Based on a Certain Column

Next, we would like to remove duplicate rows from the DataFrame "df" based on the column "language".

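To do this, we use the dropDuplicates() method of PySpark and pass the column name inside a list as an argument. Only one row per distinct value of "language" is kept, and which of the matching rows survives is not guaranteed: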

df_cleaned = df.dropDuplicates(["language"])
df_cleaned.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is very easy to drop duplicates from a PySpark DataFrame: we can simply use the dropDuplicates() method of PySpark. Try it yourself!
