Introduction

In data processing and analysis with PySpark, it is often important to know the structure of your data, such as the number of rows and columns in a DataFrame. This information is useful for data validation, transformations, and general exploration. In this tutorial, we explore how to count both the rows and the columns of a PySpark DataFrame using a simple example.

Import Libraries

First, we import the following Python module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all of Spark's functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create the PySpark DataFrame with some example data from a list. To do this, we use the createDataFrame() method and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "Django", 20000),
    ("Python", "FastAPI", 9000),
    ("Java", "Spring", 7000),
    ("JavaScript", "ReactJS", 5000)
]
df = spark.createDataFrame(data, column_names)
df.show()

Output:

+----------+---------+-----+
|  language|framework|users|
+----------+---------+-----+
|    Python|   Django|20000|
|    Python|  FastAPI| 9000|
|      Java|   Spring| 7000|
|JavaScript|  ReactJS| 5000|
+----------+---------+-----+

Count the Number of Rows

Let's count the number of rows in the PySpark DataFrame. To do this, we use the DataFrame's count() method, which returns the number of rows as an integer.
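
Applied to the example DataFrame above, a minimal sketch looks like this (the variable name row_count is just for illustration):

# count() returns the number of rows as an integer
row_count = df.count()
print(row_count)

Output:

4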

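Count the Number of Columns

PySpark DataFrames do not provide a dedicated method for counting columns. However, the columns attribute returns a list of all column names, so one way to get the number of columns is to take the length of that list. A minimal sketch (the variable name column_count is just for illustration):

# df.columns returns a list of column names; len() gives the column count
column_count = len(df.columns)
print(column_count)

Output:

3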