This tutorial shows how to run PySpark jobs in a Jupyter notebook using Docker.
Apache Spark is one of the most popular distributed compute engines, but it is a pain to install and run locally. Docker lets us containerize the notebook with the right dependencies to run Spark jobs locally.
Before you start: download Docker Desktop
We use Docker to run Jupyter and Spark. Install Docker Desktop and, once it is running, open the Terminal (or PowerShell on Windows) to follow along.
Step 1: Create a Python virtual environment
Create a Python 3 virtual environment for data tasks and activate it:
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
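To double-check that the environment is active, you can start python and print the interpreter prefix; it should point at the dpenv directory (an optional sanity check, not part of the phidata setup):
import sys
# With the virtual environment active, this prints a path ending in .venvs/dpenv.
print(sys.prefix)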
Step 2: Install and initialize phidata
Phidata converts data tools like Jupyter and Spark into plug-n-play apps. Install phidata in your virtual environment and follow the steps to initialize it:
pip install phidata
phi init -l
Step 3: Create a Workspace
A workspace is a directory containing the source code for your data products. Create a new workspace using:
phi ws init
Input 4 to select the aws-spark-data-platform template.
Provide a workspace name or press Enter for the default: spark-data-platform
Step 4: Start the workspace
Your workspace comes pre-configured with a Jupyter notebook; start your workspace to run it:
phi ws up
Press Enter to confirm. Give it a few minutes for the image to download (this takes a while) and the container to run. Verify using the Docker dashboard or the `docker ps` command. Check the logs using `docker logs -f jupyter-container`.
Step 5: Open the Jupyter UI
Open localhost:8888 in a new tab to view the JupyterLab UI.
Password: admin
Open notebooks/examples/spark_test.ipynb and run all cells using Run → Run All Cells.
This will run sample PySpark code.
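If you are curious what that sample code looks like before opening the notebook, it is roughly along these lines: create a SparkSession and run a small DataFrame job. This is only a sketch; the actual cells in spark_test.ipynb may differ.
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession inside the Jupyter container.
spark = SparkSession.builder.appName("spark_test").getOrCreate()

# Build a tiny DataFrame and run a couple of actions to confirm Spark works.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()
print("rows:", df.count())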
The following steps are optional. Follow along if you want to run a Spark cluster locally.
Step 6: Run Spark Cluster
Open the workspace/settings.py file and uncomment dev_spark_enabled=True (line 21). Start the workspace again using:
phi ws up
Press Enter to confirm and give it a few minutes for the containers to run. Verify using the Docker dashboard or the `docker ps` command. Check the logs using `docker logs -f spark-driver-container`.
Step 7: Open the Spark Driver UI
Open localhost:9080 in a new tab to view the Spark Driver UI.
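If you prefer to verify from code instead of the browser, a quick check from your host machine is to request the page with Python's standard library (an optional check; it only confirms the UI responds on port 9080):
import urllib.request
# Expect HTTP 200 once the Spark driver container is up.
print(urllib.request.urlopen("http://localhost:9080").status)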
Step 8: Connect PySpark to cluster
Uncomment the second cell in the spark_test.ipynb notebook:
from pyspark.sql import SparkSession
from workspace.dev.spark import dev_spark_driver

spark = SparkSession.builder.master(dev_spark_driver.driver_url).getOrCreate()
This connects the SparkSession to the Spark driver.
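To confirm the session is really using the cluster rather than local mode, you can check the master URL and run a small job. This uses only standard PySpark APIs; dev_spark_driver comes from the workspace template as shown above.
# The master should match the driver URL from the workspace, not local[*].
print(spark.sparkContext.master)

# Run a trivial job to verify the cluster executes work.
print(spark.range(1000).count())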
Step 9: Shut down
Play around and then stop the workspace using:
phi ws down
Summary
This tutorial showed how to run PySpark jobs in a Jupyter notebook using Docker. Leave a comment and let me know if you finished this in under 30 minutes :)
For questions, come chat with us on Discord.