This tutorial shows how to run PySpark jobs in a Jupyter notebook using Docker.
Apache Spark is one of the most popular distributed compute engines, but it is a pain to install and run locally. Docker lets us containerize the notebook with the right dependencies to run Spark jobs locally.
Before you start: download Docker Desktop
We use Docker to run Jupyter and Spark. Install Docker Desktop and, once it is running, open the Terminal (or PowerShell on Windows) to follow along.
Step 1: Create a Python virtual environment
Create a Python 3 virtual environment for data tasks and activate it:
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
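To double-check that the environment is active, you can start python and print the interpreter prefix; it should point at the dpenv directory (an optional sanity check, not part of the phidata setup):
import sys
# With the virtual environment active, this prints a path ending in .venvs/dpenv.
print(sys.prefix)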
Step 2: Install and initialize phidata
Phidata converts data tools like Jupyter and Spark into plug-n-play apps. Install phidata in your virtual environment and follow the steps to initialize it:
pip install phidata
phi init -l
Step 3: Create a Workspace
A workspace is a directory containing the source code for your data products. Create a new workspace using:
phi ws init
Input 4 to select the aws-spark-data-platform template.
Provide a workspace name or press Enter for the default: spark-data-platform
Step 4: Start the workspace
Your workspace comes pre-configured with a Jupyter notebook; start your workspace to run it:
phi ws up
Press Enter to confirm. Give it a few minutes for the image to download (this takes a while) and the container to run. Verify using the Docker dashboard or the `docker ps` command. Check the logs using `docker logs -f jupyter-container`.
Step 5: Open the Jupyter UI
Open localhost:8888 in a new tab to view the JupyterLab UI.
Password: admin
Open notebooks/examples/spark_test.ipynb and run all cells using Run → Run All Cells.
This will run sample PySpark code.
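If you are curious what that sample code looks like before opening the notebook, it is roughly along these lines: create a SparkSession and run a small DataFrame job. This is only a sketch; the actual cells in spark_test.ipynb may differ.
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession inside the Jupyter container.
spark = SparkSession.builder.appName("spark_test").getOrCreate()

# Build a tiny DataFrame and run a couple of actions to confirm Spark works.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()
print("rows:", df.count())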
The following steps are optional. Follow along if you want to run a Spark cluster locally.
Step 6: Run Spark Cluster
Open the workspace/settings.py file and uncomment dev_spark_enabled=True (line 21). Start the workspace again using:
phi ws up
Press Enter to confirm and give it a few minutes for the containers to run. Verify using the Docker dashboard or the `docker ps` command. Check the logs using `docker logs -f spark-driver-container`.
Step 7: Open the Spark Driver UI
Open localhost:9080 in a new tab to view the Spark Driver UI.
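If you prefer to verify from code instead of the browser, a quick check from your host machine is to request the page with Python's standard library (an optional check; it only confirms the UI responds on port 9080):
import urllib.request
# Expect HTTP 200 once the Spark driver container is up.
print(urllib.request.urlopen("http://localhost:9080").status)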
Step 8: Connect PySpark to cluster
Uncomment the second cell in the spark_test.ipynb notebook:
from pyspark.sql import SparkSession
from workspace.dev.spark import dev_spark_driver

spark = SparkSession.builder.master(dev_spark_driver.driver_url).getOrCreate()
This connects the SparkSession to the Spark driver.
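To confirm the session is really using the cluster rather than local mode, you can check the master URL and run a small job. This uses only standard PySpark APIs; dev_spark_driver comes from the workspace template as shown above.
# The master should match the driver URL from the workspace, not local[*].
print(spark.sparkContext.master)

# Run a trivial job to verify the cluster executes work.
print(spark.range(1000).count())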
Step 9: Shut down
Play around and then stop the workspace using:
phi ws down
Summary
This tutorial showed how to run PySpark jobs in a Jupyter notebook using Docker. Leave a comment and let me know if you finished this in under 30 minutes :)
For questions, come chat with us on Discord.