Getting Started with Apache Polaris and Apache Spark
This getting-started guide provides a docker-compose file to set up Apache Spark with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark.
A Jupyter notebook is used to run PySpark.
Build the Polaris image
If a Polaris image is not already present locally, build one with the following command:
```shell
./gradlew \
  :polaris-server:assemble \
  :polaris-server:quarkusAppPartsBuild --rerun \
  -Dquarkus.container-image.build=true
```
Run the docker-compose file
To start the services defined in the docker-compose file, along with their dependencies, run these commands from the repo’s root directory:
```shell
make client-regenerate
docker compose -f site/content/guides/spark/docker-compose.yml up
```
This will spin up two container services:

- The `polaris` service, which runs Apache Polaris with an in-memory metastore
- The `jupyter` service, which runs a Jupyter notebook with PySpark
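Behind the scenes, the notebook connects Spark to the `polaris` service through Iceberg's REST catalog settings. The sketch below shows the general shape of that Spark configuration; the catalog name `polaris`, the URI, and the credential placeholders are assumptions for illustration, so check the notebook for the actual values used by this setup:

```python
# Sketch of Spark settings that register an Iceberg REST catalog backed by
# Polaris. Catalog name, URI, and credential values are illustrative
# assumptions; the notebook contains the real ones for this compose file.
spark_conf = {
    # Enable Iceberg's SQL extensions in the Spark session.
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    # Register a catalog named "polaris" using Iceberg's Spark catalog class.
    "spark.sql.catalog.polaris": "org.apache.iceberg.spark.SparkCatalog",
    # Speak the Iceberg REST catalog protocol to the Polaris server.
    "spark.sql.catalog.polaris.type": "rest",
    "spark.sql.catalog.polaris.uri": "http://polaris:8181/api/catalog",
    # OAuth2 client credentials issued by Polaris (placeholders).
    "spark.sql.catalog.polaris.credential": "<client_id>:<client_secret>",
    "spark.sql.catalog.polaris.scope": "PRINCIPAL_ROLE:ALL",
    # Name of the catalog created in Polaris (placeholder).
    "spark.sql.catalog.polaris.warehouse": "<catalog_name>",
}

for key, value in spark_conf.items():
    print(f"{key}={value}")
```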
Access the Jupyter notebook interface
In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL has the format http://127.0.0.1:8888/lab?token=<token>.
Open the Jupyter notebook in a browser.
Navigate to notebooks/SparkPolaris.ipynb.
Run the Jupyter notebook
You can now run all cells in the notebook or write your own code!
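If you want to experiment beyond the provided cells, a first cell might exercise the catalog end to end with Iceberg SQL. The statements below are a sketch, assuming the catalog is registered under the name `polaris`; the namespace and table names are made up, and in the notebook each string would be passed to `spark.sql(...)`:

```python
# Illustrative Iceberg SQL against a catalog named "polaris". Namespace and
# table names are invented for this sketch; run each with spark.sql(...)
# inside the notebook.
statements = [
    "CREATE NAMESPACE IF NOT EXISTS polaris.demo",
    "CREATE TABLE IF NOT EXISTS polaris.demo.events (id BIGINT, name STRING) USING iceberg",
    "INSERT INTO polaris.demo.events VALUES (1, 'first'), (2, 'second')",
    "SELECT * FROM polaris.demo.events",
]

for sql in statements:
    print(sql)
```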