Introduction

When working with PySpark DataFrames, understanding the statistical properties of your data is crucial for data exploration and preprocessing. PySpark provides the describe() and summary() functions to generate useful summary statistics. In this tutorial, we’ll explore how to use both functions to get insights into our dataset.

📥 Import Libraries

First, import the necessary Python modules:

from pyspark.sql import SparkSession

🔌 Create SparkSession

Before working with PySpark, a SparkSession must be created. The SparkSession serves as the entry point to all Spark functionalities. To create a basic SparkSession programmatically, use the following command:

spark = SparkSession \
    .builder \
    .appName("PySpark Statistical Summary") \
    .getOrCreate()

📂 Create PySpark DataFrame

Next, create an example PySpark DataFrame from a Python list. To do this, use the createDataFrame() method of the SparkSession.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|   Django|20000|
|    Python|  FastAPI| 9000|
|      Java|   Spring| 7000|
|JavaScript|  ReactJS| 5000|
+----------+---------+-----+

📊 Get Statistical Properties

To get statistical information about the PySpark DataFrame, we will use the following PySpark functions:

  • describe(): provides only count, mean, standard deviation, min, and max.
  • summary(): includes additional statistics such as quartiles (25%, 50%, 75%).

🔍 Using describe() Function

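Below is a minimal sketch of calling describe() on the df created above. Passing column names to describe() is optional, and the exact numeric formatting of the output can vary by Spark version.

# Compute count, mean, stddev, min, and max for all columns
df.describe().show()

# Restrict the statistics to a single numeric column
df.describe("users").show()

For numeric columns such as users, this returns the count, mean, standard deviation, minimum, and maximum. For string columns such as language and framework, mean and stddev are null, and min and max are the alphabetically first and last values.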
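📈 Using summary() Function

The summary() function works the same way but also reports approximate quartiles by default. The snippet below is a minimal sketch assuming the same df as above; the list of statistics passed to summary() is optional.

# Default statistics: count, mean, stddev, min, 25%, 50%, 75%, max
df.summary().show()

# Request only specific statistics
df.summary("count", "min", "25%", "75%", "max").show()

Note that the quartiles (25%, 50%, 75%) are computed approximately, so the reported values may differ slightly from the exact percentiles on larger datasets.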