Getting Started with Apache Polaris and Apache Spark
ℹ️ Assets for this guide can be accessed from the Apache Polaris Git repository
This getting started guide provides a docker-compose file to set up Apache Spark with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark.
A Jupyter notebook is used to run PySpark.
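As a rough illustration of what "configured as an Iceberg REST Catalog" means on the Spark side, the sketch below assembles the kind of settings a PySpark session would need. The helper name `iceberg_rest_catalog_conf`, the catalog name `polaris`, the URI, the warehouse name, the credential placeholder, and the `PRINCIPAL_ROLE:ALL` scope are all assumptions made for illustration, not values taken from this guide. The notebook shipped with the guide remains the authoritative configuration.

```python
def iceberg_rest_catalog_conf(catalog, uri, warehouse, credential,
                              scope="PRINCIPAL_ROLE:ALL"):
    """Return Spark settings that register an Iceberg REST catalog.

    All argument values are supplied by the caller; nothing here is read
    from the guide's docker-compose file. The default scope is an assumption.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog and selects the REST implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        # OAuth2 client credentials in "client_id:client_secret" form.
        f"{prefix}.credential": credential,
        f"{prefix}.scope": scope,
    }


# Hypothetical values, for illustration only.
conf = iceberg_rest_catalog_conf(
    catalog="polaris",
    uri="http://polaris:8181/api/catalog",
    warehouse="quickstart_catalog",
    credential="<client_id>:<client_secret>",
)
for key in sorted(conf):
    print(f"{key}={conf[key]}")
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, provided the Iceberg Spark runtime is available to the session.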
Build the Polaris image
If a Polaris image is not already present locally, build one with the following command:
./gradlew \
:polaris-server:assemble \
:polaris-server:quarkusAppPartsBuild --rerun \
-Dquarkus.container-image.build=true
Run the docker-compose file
To start the services defined in the docker-compose file, along with the necessary dependencies, run these commands from the repo’s root directory:
make client-regenerate
docker compose -f site/content/guides/spark/docker-compose.yml up
This will spin up two container services:
- The polaris service for running Apache Polaris using an in-memory metastore
- The jupyter service for running Jupyter notebook with PySpark
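Containers can take a little while to become reachable after `docker compose up`. One way to wait for a service is to poll its TCP port, as in the minimal sketch below. The helper name `wait_for_port` is hypothetical, and the sketch assumes the service's port is published on the host (as the 127.0.0.1:8888 notebook URL later in this guide suggests for Jupyter).

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before `timeout` seconds elapse.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows the port is open,
            # not that the application behind it is fully initialized.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8888, timeout=120)` would block until the Jupyter port accepts connections or two minutes pass.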
Access the Jupyter notebook interface
In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL should have the format http://127.0.0.1:8888/lab?token=<token>.
Open the Jupyter notebook in a browser.
Navigate to notebooks/SparkPolaris.ipynb
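If the container log is long, the tokenized URL described in the steps above can be picked out with a regular expression. The sketch below is one way to do that. The helper name `find_notebook_url` is hypothetical, the sample log lines are invented for illustration, and the pattern assumes the token is alphanumeric.

```python
import re

# Matches the URL format given in this guide; the token character set is an assumption.
TOKEN_URL = re.compile(r"http://127\.0\.0\.1:8888/lab\?token=[0-9A-Za-z]+")


def find_notebook_url(log_text):
    """Return the last tokenized Jupyter URL found in log_text, or None."""
    matches = TOKEN_URL.findall(log_text)
    return matches[-1] if matches else None


# Invented sample log, for illustration only.
sample_log = """\
[I ServerApp] Jupyter Server is running at:
[I ServerApp]     http://127.0.0.1:8888/lab?token=0123abcd
"""
print(find_notebook_url(sample_log))  # prints http://127.0.0.1:8888/lab?token=0123abcd
```

The log text itself could be captured with `docker compose -f site/content/guides/spark/docker-compose.yml logs jupyter` and passed to the function as a string.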
Run the Jupyter notebook
You can now run all cells in the notebook or write your own code!
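As a starting point for your own code, the sketch below creates a namespace and an Iceberg table, inserts two rows, and reads them back. It assumes a `spark` session that is already configured against Polaris, as in the notebook. The function name `run_smoke_test`, the catalog name `polaris`, the namespace `quickstart`, and the table `greetings` are hypothetical and should be replaced with names that exist in your setup.

```python
def run_smoke_test(spark, catalog="polaris", namespace="quickstart"):
    """Create a namespace and table, insert two rows, and return them.

    `spark` is an existing SparkSession; the catalog name must match the
    one the session was configured with (the default here is an assumption).
    """
    table = f"{catalog}.{namespace}.greetings"
    spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}")
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {table} (id BIGINT, msg STRING) USING iceberg"
    )
    spark.sql(f"INSERT INTO {table} VALUES (1, 'hello'), (2, 'polaris')")
    return spark.sql(f"SELECT id, msg FROM {table} ORDER BY id").collect()
```

In a notebook cell, `run_smoke_test(spark)` would return the inserted rows, assuming the catalog has been created in Polaris and the principal has the privileges to create namespaces and tables.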