Getting Started with Apache Polaris and Ceph

Overview

This guide describes how to spin up a single-node Ceph cluster with RADOS Gateway (RGW) for S3-compatible storage and configure it for use by Polaris.

This example cluster is configured for static access key authentication only. It does not include STS (Security Token Service) or temporary credentials. All access to Ceph RGW, including access by Polaris, uses static S3-style credentials (as created via radosgw-admin user create).

Spark is used as a query engine. This example assumes a local Spark installation. See the Spark Notebooks Example for a more advanced Spark setup.

Starting the Example

Before starting the Ceph + Polaris stack, you’ll need to configure environment variables that define network settings, credentials, and cluster IDs.

The services are started in sequence:

Monitor + Manager
OSD
RGW
Polaris

Note: this example pulls the apache/polaris:latest image, but assumes the image is 1.2.0-incubating or later.

1. Copy the example environment file

cp dot-env.example .env
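
If .env is edited later, it can be useful to confirm that it still defines every variable listed in dot-env.example. The following is a minimal sketch of such a check; the helper names env_keys and missing_env_keys are hypothetical and not part of the example, and the check only compares variable names, not values.

```shell
# Hypothetical helper: print the variable names defined on KEY=VALUE lines
# of a file, skipping comments and blank lines.
env_keys() {
    sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" | sort -u
}

# Hypothetical helper: print every variable name that appears in the first
# file but is not defined in the second file.
missing_env_keys() {
    env_keys "$1" | while read -r key; do
        grep -Eq "^[[:space:]]*${key}=" "$2" || printf '%s\n' "$key"
    done
}

# Usage (prints nothing when .env defines every expected variable):
#   missing_env_keys dot-env.example .env
```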

2. Start the docker compose group by running the following command:

docker compose up -d
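
Because the services start in sequence, a script that drives this stack may need to wait until a later service such as RGW is available. The sketch below shows one generic way to poll for that. The helper names wait_until and rgw_listed are hypothetical, the attempt count and delay are arbitrary, and the check assumes the ceph-mon1-1 container name used elsewhere in this guide.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, DELAY seconds
# apart, and succeed as soon as the command succeeds.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical check: does the cluster status list an rgw service yet?
rgw_listed() {
    docker exec ceph-mon1-1 ceph -s | grep -q 'rgw:'
}

# Usage: poll every 2 seconds, giving up after 30 attempts.
#   wait_until 30 2 rgw_listed
```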

Check status

docker exec ceph-mon1-1 ceph -s

You should see something like:

cluster:
  id:     b2f59c4b-5f14-4f8c-a9b7-3b7998c76a0e
  health: HEALTH_WARN
          mon is allowing insecure global_id reclaim
          1 monitors have not enabled msgr2
          6 pool(s) have no replicas configured

services:
  mon: 1 daemons, quorum mon1 (age 49m)
  mgr: mgr(active, since 94m)
  osd: 1 osds: 1 up (since 36m), 1 in (since 93m)
  rgw: 1 daemon active (1 hosts, 1 zones)
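
For scripted checks, the status text above can be reduced to the fields of interest. The following is a minimal sketch that assumes the plain-text layout shown above; ceph_health and ceph_service are hypothetical helper names, not Ceph commands.

```shell
# Hypothetical helper: print the health value (for example HEALTH_WARN)
# from `ceph -s` plain-text output read on stdin.
ceph_health() {
    awk '$1 == "health:" && !seen { print $2; seen = 1 }'
}

# Hypothetical helper: print the status text reported for one service
# (mon, mgr, osd or rgw) from `ceph -s` plain-text output read on stdin.
ceph_service() {
    awk -v svc="$1:" '$1 == svc { sub(/^[ \t]*[^ \t]+[ \t]+/, ""); print }'
}

# Usage:
#   docker exec ceph-mon1-1 ceph -s | ceph_health
#   docker exec ceph-mon1-1 ceph -s | ceph_service rgw
```

A HEALTH_WARN result is expected for this single-node example, as the sample output shows.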

3. Connecting From Spark

bin/spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.polaris.type=rest \
    --conf spark.sql.catalog.polaris.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
    --conf spark.sql.catalog.polaris.token-refresh-enabled=true \
    --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog \
    --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \
    --conf spark.sql.catalog.polaris.credential=root:s3cr3t \
    --conf spark.sql.catalog.polaris.client.region=irrelevant \
    --conf spark.sql.catalog.polaris.s3.access-key-id=POLARIS123ACCESS \
    --conf spark.sql.catalog.polaris.s3.secret-access-key=POLARIS456SECRET

Note: s3cr3t is defined as the password for the root user in the docker-compose.yml file.

Note: The client.region configuration is required for the AWS S3 client to work, but it is not used in this example since Ceph does not require a specific region.
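
The credential and scope options above correspond to an OAuth2 client-credentials request that the Iceberg REST client makes on Spark's behalf. As a rough sketch for checking the credential outside Spark, the request can be reproduced with curl. This assumes the Polaris token endpoint is /v1/oauth/tokens under the catalog URI used above; the function name polaris_token_request is hypothetical.

```shell
# Hypothetical helper: request an access token from Polaris using a
# "client_id:client_secret" credential, in the same form as the
# spark.sql.catalog.polaris.credential option.
polaris_token_request() {
    credential=$1
    client_id=${credential%%:*}      # text before the first colon
    client_secret=${credential#*:}   # text after the first colon
    # Assumption: token endpoint path under the catalog URI from this guide.
    curl -s "http://localhost:8181/api/catalog/v1/oauth/tokens" \
        -d grant_type=client_credentials \
        -d "client_id=$client_id" \
        -d "client_secret=$client_secret" \
        -d scope=PRINCIPAL_ROLE:ALL
}

# Usage:
#   polaris_token_request root:s3cr3t
```

If the credential is accepted, the response should be a JSON document containing an access token; an error response points to a credential or scope problem rather than a storage problem.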

4. Running Queries

Run inside the Spark SQL shell:

USE polaris;

CREATE NAMESPACE ns;

CREATE TABLE ns.t1 AS SELECT 'abc';

SELECT * FROM ns.t1;
-- abc

Lack of Credential Vending

Notice that the Spark configuration does not set an X-Iceberg-Access-Delegation header. This is because the example cluster does not include STS (Security Token Service) or temporary credentials, so Polaris has no credentials to vend and Spark uses the static S3 keys directly.

The absence of an STS API is represented in the catalog storage configuration by the stsUnavailable=true property.
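
To illustrate where that property sits, the sketch below prints a catalog definition of the kind that could be sent to the Polaris management API. Only the catalog name quickstart_catalog and the stsUnavailable property come from this guide. The other field names follow the author's understanding of the Polaris catalog schema and should be checked against the API specification, and the bucket location and RGW endpoint passed as arguments are placeholders, not values from this example.

```shell
# Hypothetical helper: print a catalog definition as JSON.
#   $1: base storage location (placeholder, e.g. an s3:// bucket URI)
#   $2: S3 endpoint of the RGW service (placeholder)
catalog_payload() {
    cat <<EOF
{
  "catalog": {
    "name": "quickstart_catalog",
    "type": "INTERNAL",
    "properties": { "default-base-location": "$1" },
    "storageConfigInfo": {
      "storageType": "S3",
      "allowedLocations": ["$1"],
      "endpoint": "$2",
      "stsUnavailable": true
    }
  }
}
EOF
}

# Usage (both arguments are placeholders):
#   catalog_payload s3://example-bucket http://example-rgw:7480
```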