Data Products come in many shapes and sizes - tables, metrics, dashboards, and ML models. But they all have two things in common:
They need to be created
They need to be updated on a schedule (hourly, daily, weekly, monthly)
This tutorial shows how to set up Jupyter and Airflow to build data products and run them on a schedule. Both tools are free, open-source, and market leaders in their domain - so you can’t go wrong.
Before you start: download Docker
Docker is a great tool for testing locally. Install Docker Desktop before you start.
Open the Terminal (or PowerShell on Windows) and follow along to build a sample data product using crypto data.
Step 1: Create a Python virtual environment
Create a python3 virtual environment for your data tasks and activate it:
python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate
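If you’re following along in PowerShell on Windows, note that virtual environments use a Scripts folder instead of bin. A sketch, assuming the same ~/.venvs/dpenv location:
python -m venv "$HOME\.venvs\dpenv"
& "$HOME\.venvs\dpenv\Scripts\Activate.ps1"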
Step 2: Install and initialize phidata
Phidata converts data tools into plug-n-play apps: enable an app and run it with one command. It’s the fastest way to run data tools locally.
Install phidata in your virtual env and follow the steps to initialize:
pip install phidata
phi init -l
Step 3: Create a Workspace
A workspace is a directory that contains the code for your data products. Create a new workspace using:
phi ws init
Press Enter to select the default workspace name and template.
Step 4: Start the workspace
Your workspace comes pre-configured with a Jupyter notebook. Start your workspace to run it:
phi ws up
Press Enter to confirm, then give it a few minutes for the image to download and the container to run.
Verify the container is running using the Docker dashboard or:
docker ps
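If you have other containers running, you can filter the list by name; assuming the Jupyter container’s name contains “jupyter”:
docker ps --filter "name=jupyter"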
Step 5: Open the Jupyter UI
Open localhost:8888 in a new tab to view the JupyterLab UI.
Password: admin
Open notebooks/examples/crypto_nb.ipynb and run all cells using Run → Run All Cells.
This will download crypto prices and store them as a CSV table at storage/tables/crypto_prices.
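To sanity-check the output outside the notebook, you can read the files back with pandas. A minimal sketch, assuming the notebook writes one or more CSV files inside that directory (the exact file names and columns depend on the notebook):

import glob

import pandas as pd

# Assumption: the notebook writes CSV files under storage/tables/crypto_prices.
for path in glob.glob("storage/tables/crypto_prices/*.csv"):
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head())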
Step 6: Run Airflow
Open the workspace/settings.py file and uncomment dev_airflow_enabled=True (line 19). Start the workspace again using:
phi ws up
Press Enter to confirm, then give it about 5 minutes for the containers to start and the database to initialize.
Check progress using:
docker logs -f airflow-scheduler-container
Step 7: Open the Airflow UI
Open localhost:8310 in a new tab to view the Airflow UI.
User: admin
Pass: admin
Step 8: Run workflow using Airflow
Switch ON the crypto_prices DAG, which contains the same task as the crypto_nb.ipynb notebook, but as a daily workflow.
Check out the workflows/crypto/prices.py file for the full code. The table is written to the storage/tables/crypto_prices directory.
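If you’re new to Airflow, here’s a minimal sketch of what a daily workflow like this looks like in plain Airflow terms. This is illustrative only - the real prices.py uses phidata’s workflow syntax, and fetch_crypto_prices here is a hypothetical placeholder:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_crypto_prices():
    # Hypothetical placeholder: download prices and write them
    # to storage/tables/crypto_prices as a CSV.
    ...


with DAG(
    dag_id="crypto_prices",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # run once a day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_crypto_prices",
        python_callable=fetch_crypto_prices,
    )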
Step 9: Play around
Play around: create notebooks and DAGs, and read more about phidata.
Step 10: Shut down
Stop the workspace using:
phi ws down
Summary
This tutorial showed how to run Jupyter and Airflow to set up a local data development environment. In the next tutorial, we’ll run this in production on AWS. Leave a comment to let me know if you finished this in under 30 minutes :)
Love to all,
Ashpreet