PySpark - Create Embedding Vectors with Sentence-Transformers

Introduction

Text data appears in almost every domain, but machine learning models operate on numbers, so text usually has to be converted into a numerical representation before a model can process it. Embedding vectors do this by mapping each piece of text to a fixed-length vector of numbers that reflects its meaning. In this tutorial, we will create embedding vectors in PySpark using the Sentence Transformers library.

What is Sentence Transformers?

Sentence Transformers is a Python library that provides pre-trained models for generating high-quality sentence and text embeddings. These embeddings capture the semantic meaning of text, which makes them useful for Natural Language Processing (NLP) tasks such as semantic search, text classification, and clustering.
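To make "capturing semantic meaning" more concrete: texts with similar meaning should receive vectors that point in similar directions, which is commonly measured with cosine similarity. The following is a minimal sketch, not part of the tutorial's pipeline. The helper names cosine_similarity and embed are hypothetical, and the model name "all-MiniLM-L6-v2" is an assumption; any Sentence Transformers model could be substituted.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode a list of texts into embedding vectors (one list of floats per text).

    The model name is an assumption chosen for illustration.
    The import is deferred so that cosine_similarity can be used
    even where sentence-transformers is not installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    return model.encode(texts).tolist()
```

Passing two texts to embed and the two resulting vectors to cosine_similarity gives a score between -1 and 1, where a higher score suggests the texts are closer in meaning.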

Import Libraries

First, we import the following Python modules:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

To create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame from a list of example rows. To do this, we call the createDataFrame() method on the SparkSession and pass the data and the column names as arguments.

column_names = ["role", "description"]
data = [
    ("Data Analyst", "Analyzes data to help businesses make informed decisions."),
    ("Data Engineer", "Designs, builds, and maintains data pipelines and infrastructure."),
    ("Data Scientist", "Uses statistical techniques to analyze and interpret complex data."),
    ("DevOps Engineer", "Manages and automates deployment and operations of software systems."),
    ("Data Architect", "Designs and oversees the implementation of data architecture solutions.")
]
df = spark.createDataFrame(data, column_names)
df.show()