Getting Started with Apache Polaris and Apache Spark

This getting started guide provides a docker-compose file to set up Apache Spark with Apache Polaris. Apache Polaris is configured as an Iceberg REST Catalog in Spark, and a Jupyter notebook is used to run PySpark.

Build the Polaris image

If a Polaris image is not already present locally, build one with the following command:

./gradlew \
   :polaris-server:assemble \
   :polaris-server:quarkusAppPartsBuild --rerun \
   -Dquarkus.container-image.build=true

Run the docker-compose file

To start the services with their necessary dependencies, run these commands from the repo’s root directory:

make client-regenerate
docker compose -f site/content/guides/spark/docker-compose.yml up

This will spin up two container services:

  • The polaris service for running Apache Polaris using an in-memory metastore
  • The jupyter service for running Jupyter notebook with PySpark
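Inside the notebook, Spark reaches Polaris through Iceberg's REST catalog support. As a rough sketch of the kind of Spark configuration involved (the catalog name `polaris`, the endpoint, the runtime version, the warehouse, and the credential placeholders below are illustrative assumptions, not values taken from this guide), the relevant settings can be collected like this:

```python
# Hypothetical Spark settings for an Iceberg REST catalog backed by Polaris.
# Catalog name, endpoint, runtime version, warehouse, and credential are
# placeholders -- the notebook supplies the actual values for this setup.
catalog = "polaris"
spark_conf = {
    # Load the Iceberg Spark runtime (version shown is illustrative)
    "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2",
    # Register a Spark catalog backed by Iceberg's REST catalog client
    f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{catalog}.type": "rest",
    # Polaris's REST endpoint as seen from inside the compose network (assumed)
    f"spark.sql.catalog.{catalog}.uri": "http://polaris:8181/api/catalog",
    # OAuth2 client credentials in "<client_id>:<client_secret>" form
    f"spark.sql.catalog.{catalog}.credential": "<client_id>:<client_secret>",
    f"spark.sql.catalog.{catalog}.warehouse": "quickstart_catalog",
}
```

Each entry would be passed to `SparkSession.builder.config(...)` when the session is created in the notebook.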

Access the Jupyter notebook interface

In the Jupyter notebook container log, look for the URL to access the Jupyter notebook. The URL will have the format http://127.0.0.1:8888/lab?token=<token>.
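If you want to extract the token from that URL programmatically instead of copying it by hand, standard URL parsing works; the URL and token below are made-up examples:

```python
from urllib.parse import urlparse, parse_qs

# Example log URL (the token here is a stand-in, not a real one)
url = "http://127.0.0.1:8888/lab?token=abc123def456"
token = parse_qs(urlparse(url).query)["token"][0]
print(token)  # → abc123def456
```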

Open the URL in a browser and navigate to notebooks/SparkPolaris.ipynb.

Run the Jupyter notebook

You can now run all cells in the notebook or write your own code!
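To give a flavor of what notebook cells against a REST catalog look like, here is a hedged sketch of Spark SQL statements that create and query an Iceberg table (the catalog, namespace, and table names are placeholders, not necessarily those used in SparkPolaris.ipynb):

```python
# Hypothetical statements; in the notebook each would run as spark.sql(stmt).
# "polaris" is an assumed catalog name registered via the REST catalog config.
statements = [
    "CREATE NAMESPACE IF NOT EXISTS polaris.demo",
    "CREATE TABLE IF NOT EXISTS polaris.demo.quickstart "
    "(id BIGINT, data STRING) USING iceberg",
    "INSERT INTO polaris.demo.quickstart VALUES (1, 'hello'), (2, 'world')",
    "SELECT * FROM polaris.demo.quickstart",
]
for stmt in statements:
    print(stmt)
```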