Build a Local Open Data Lakehouse with k3d, Apache Ozone, Apache Polaris and Trino
TL;DR: Spin up a fully integrated, locally running open data lakehouse on your laptop in under 30 minutes using Kubernetes in Docker (k3d), Apache Ozone as S3-compatible object storage, Apache Polaris as the Iceberg REST catalog, and Trino as the SQL query engine. No cloud account required.
Why This Stack?
The modern open data lakehouse is built on open standards: Apache Iceberg as the table format, a REST catalog to manage metadata, object storage for the actual files and a decoupled compute engine for queries. This separation lets you swap any layer without rewriting the others.
But spinning up a realistic multi-component stack locally has historically meant juggling
docker-compose files, manual wiring and frustrating networking issues. Helm + k3d changes
that. You get a real Kubernetes environment (with proper service discovery, namespaces and
resource management) running entirely inside Docker on your laptop.
Here’s what each tool does in our stack:
| Tool | Role | Why |
|---|---|---|
| k3d | Local Kubernetes cluster inside Docker | Lightweight, fast to create/destroy, great for dev/test |
| Apache Ozone | S3-compatible distributed object store | Stores the actual Iceberg data and metadata files |
| Apache Polaris | Iceberg REST catalog (Apache Top-Level Project) | Manages table metadata; any Iceberg-compatible engine can use it |
| Trino | Distributed SQL query engine | Reads Iceberg tables via Polaris, files from Ozone |
The data flow looks like this:
```
┌──────────┐  1. catalog ops         ┌─────────────────┐
│  Trino   │ ──────────────────────▶ │ Apache Polaris  │
│ (Query)  │ ◀────────────────────── │   (Catalog)     │
└────┬─────┘  2. metadata location   └────────┬────────┘
     │                                        │ 3. write metadata
     │ 4. read/write data files (S3 API)      │    JSON to Ozone
     ▼                                        ▼
┌──────────────────────────────────────────────────────────┐
│                Apache Ozone (S3 Gateway)                 │
│                     (Object Storage)                     │
└──────────────────────────────────────────────────────────┘
```
When you run a query:
- Trino calls Polaris (via the Iceberg REST API) to get table metadata: schema, snapshot, and the location of data files in Ozone
- Polaris also handles commit orchestration: when Trino creates or writes a table, Polaris writes the Iceberg metadata JSON files directly to Ozone
- Trino reads and writes the actual Parquet data files directly to Ozone using static S3 credentials configured in its Helm values
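The first step above is a plain HTTP call. As a rough illustration, it resolves to the Iceberg REST spec's loadTable endpoint; the sketch below shows the URL an engine constructs (the helper and the table name `events` are hypothetical, but the path layout follows the Iceberg REST catalog spec, and in Polaris the path prefix is the catalog name):

```python
def load_table_url(base: str, prefix: str, namespace: str, table: str) -> str:
    """Build the Iceberg REST 'loadTable' endpoint an engine calls to
    fetch a table's schema, current snapshot and metadata file location."""
    return f"{base}/v1/{prefix}/namespaces/{namespace}/tables/{table}"

# With our setup, the prefix is the Polaris catalog name:
print(load_table_url(
    "http://polaris.polaris.svc.cluster.local:8181/api/catalog",
    "ozone_catalog", "demo", "events",
))
# http://polaris.polaris.svc.cluster.local:8181/api/catalog/v1/ozone_catalog/namespaces/demo/tables/events
```

The JSON response to a GET on this URL carries the metadata file location in Ozone, which is what lets Trino then fetch data files directly over the S3 API.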
Why does Trino go directly to Ozone instead of through Polaris? In a production cloud setup, Polaris would use AWS STS to vend short-lived, scoped credentials to Trino for each table access (credential vending). Trino would then use those temporary credentials to hit S3. However, Ozone currently has no STS endpoint, so credential vending doesn't work here. Instead, we configure `stsUnavailable: true` on the catalog and give Trino static Ozone credentials directly in `trino-values.yaml`. The architecture is otherwise identical to a production deployment.
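To make the fallback concrete, here is a tiny illustrative sketch of that decision. This is not Polaris's actual code; the function and dictionary keys are hypothetical:

```python
def resolve_storage_credentials(storage_config: dict, env: dict) -> dict:
    """Sketch of the choice a catalog makes between STS credential vending
    and static credentials. Hypothetical helper, for illustration only."""
    if storage_config.get("stsUnavailable"):
        # No STS endpoint (our Ozone case): hand back the static
        # credentials from the environment, with no session token.
        return {
            "access-key": env["AWS_ACCESS_KEY_ID"],
            "secret-key": env["AWS_SECRET_ACCESS_KEY"],
            "session-token": None,
        }
    # Cloud S3 case: vend short-lived, table-scoped credentials via STS.
    raise NotImplementedError("would call sts.assume_role(...) here")

creds = resolve_storage_credentials(
    {"stsUnavailable": True},
    {"AWS_ACCESS_KEY_ID": "testuser", "AWS_SECRET_ACCESS_KEY": "testpassword"},
)
print(creds["access-key"])  # testuser
```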
Prerequisites
Before starting, make sure you have the following installed:
- Docker Engine (e.g., via Docker Desktop or Colima): k3d runs Kubernetes inside Docker
- k3d ≥ v5.x: `brew install k3d` (see k3d.io)
- kubectl: `brew install kubectl`
- Helm ≥ v3.x: `brew install helm`
- curl + jq: for Polaris REST API calls
- AWS CLI (`aws`): for verifying Ozone S3 connectivity; `brew install awscli`
Minimum hardware: 8 GB RAM and 4 CPU cores recommended. Ozone is resource-hungry.
Step 1 – Create a k3d Cluster
We create a k3d cluster with Traefik (k3d’s built-in ingress controller) enabled and map
host port 8080 to the cluster’s load balancer port 80. This lets us use clean
host-based routing without touching /etc/hosts, thanks to nip.io, a
free public wildcard DNS that resolves anything like *.127.0.0.1.nip.io to 127.0.0.1
automatically, with zero configuration.
Offline / no internet? nip.io requires a DNS lookup. If you're working offline, add this to `/etc/hosts` instead and everything will work identically:

```shell
sudo tee -a /etc/hosts <<EOF
127.0.0.1 polaris.127.0.0.1.nip.io trino.127.0.0.1.nip.io
EOF
```
Our service URLs will be:
| Service | URL |
|---|---|
| Apache Polaris | http://polaris.127.0.0.1.nip.io:8080 |
| Trino Web UI | http://trino.127.0.0.1.nip.io:8080 |
```shell
k3d cluster create lakehouse \
  --servers 1 \
  --agents 2 \
  -p "8080:80@loadbalancer"
```
Verify the cluster is up:
```shell
kubectl cluster-info
kubectl get nodes
```
Step 2 – Install Apache Ozone
Apache Ozone is the storage foundation of our stack. It provides an S3-compatible API (via its S3 Gateway service) that both Polaris and Trino will use to read/write Iceberg table files.
Add the Helm repo and install
```shell
helm repo add ozone https://apache.github.io/ozone-helm-charts/
helm repo update
```
For local development, we use a minimal values.yaml to reduce resource usage and expose
the S3 Gateway as a ClusterIP service (we’ll use kubectl port-forward to access it locally):
```yaml
# ozone-values.yaml
scm:
  replicaCount: 1

om:
  replicaCount: 1

datanode:
  replicaCount: 3  # minimum for block placement

s3g:
  replicaCount: 1

# Disable TLS for local dev
tls:
  enabled: false
```
```shell
helm install ozone ozone/ozone \
  --namespace ozone \
  --create-namespace \
  --values ozone-values.yaml \
  --wait --timeout 5m
```
Watch the pods come up:
```shell
kubectl get pods -n ozone -w
```
You should see pods for `scm` (Storage Container Manager), `om` (Ozone Manager), `datanode-0/1/2`, and `s3g`.
Create a bucket via the S3 API
We create the bucket using the AWS CLI against the Ozone S3 Gateway. This ensures the bucket is owned by testuser (the same credentials Polaris and Trino will use), so path resolution is guaranteed to work.
Make sure the port-forward is running in a separate terminal:
```shell
kubectl port-forward -n ozone svc/ozone-s3g-rest 9878:9878
```
Then create the bucket:
```shell
AWS_ACCESS_KEY_ID=testuser AWS_SECRET_ACCESS_KEY=testpassword \
  aws s3 mb s3://warehouse --endpoint-url http://localhost:9878
```
Verify it was created:
```shell
AWS_ACCESS_KEY_ID=testuser AWS_SECRET_ACCESS_KEY=testpassword \
  aws s3 ls --endpoint-url http://localhost:9878
```
You should see:
```
YYYY-MM-DD HH:MM:SS warehouse
```
S3 credentials in non-secure mode
Because we're running Ozone without security (`ozone.security.enabled=false`, the default for the Helm chart in local dev), the S3 Gateway accepts any access key and secret key. There is no credential validation.
We’ll use these placeholder values consistently in both Polaris and Trino:
```
OZONE_ACCESS_KEY=testuser
OZONE_SECRET_KEY=testpassword
```
For production: enable Ozone security and use `ozone s3 getsecret -u <username>` to generate real per-user credentials backed by Kerberos.
Step 3 – Install Apache Polaris
Apache Polaris is an open-source Iceberg REST catalog and an Apache Top-Level Project. It stores and serves Iceberg table metadata and acts as the single source of truth for schema, partitioning, and snapshot history. Trino (and any other Iceberg engine) talks to Polaris using the standard Iceberg REST API.
Add the Helm repo and install
```shell
helm repo add polaris https://downloads.apache.org/polaris/helm-chart
helm repo update
```
For local dev we override a few key values:
```yaml
# polaris-values.yaml

# Use in-memory persistence (good enough for local dev; loses state on pod restart)
persistence:
  type: in-memory

extraEnv:
  - name: POLARIS_BOOTSTRAP_CREDENTIALS
    value: "POLARIS,root,polaris-secret"
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: polaris-ozone-secret
        key: access-key
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: polaris-ozone-secret
        key: secret-key
  - name: AWS_REGION
    value: "us-east-1"
```
How credentials work:
- `POLARIS_BOOTSTRAP_CREDENTIALS` sets the root principal on first boot (format: `realm,clientId,clientSecret`).
- The `AWS_*` env vars give Polaris the static S3 credentials it uses when writing Iceberg metadata files to Ozone.
- We use `stsUnavailable: true` in the catalog's `storageConfigInfo` (see the next step) to tell Polaris that STS is not available and to use the static credentials directly, while still propagating the custom S3 endpoint and path-style settings to the FileIO client.
Create the secret before installing Polaris:
```shell
kubectl create namespace polaris --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic polaris-ozone-secret \
  --namespace polaris \
  --from-literal=access-key=testuser \
  --from-literal=secret-key=testpassword
```
```shell
helm upgrade --install polaris polaris/polaris \
  --namespace polaris \
  --create-namespace \
  --values polaris-values.yaml \
  --version 1.3.0-incubating \
  --wait --timeout 3m
```
Note: The Apache Polaris project graduated from the Incubator in February 2026, but the Helm chart hasn't been republished under a non-incubating version yet. Helm skips pre-release versions by default, so `--version 1.3.0-incubating` is required for now. Once a post-graduation chart is released, the version string will drop the `-incubating` suffix (e.g. `--version 1.4.0`), or you can omit `--version` entirely to get the latest.
Verify:
```shell
kubectl get pods -n polaris
```
Apply an Ingress so Polaris is reachable at http://polaris.127.0.0.1.nip.io:8080:
```yaml
# polaris-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: polaris
  namespace: polaris
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
    - host: polaris.127.0.0.1.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: polaris
                port:
                  number: 8181
```
```shell
kubectl apply -f polaris-ingress.yaml
```
Verify it’s up:
```shell
curl http://polaris.127.0.0.1.nip.io:8080/api/catalog/v1/config
```
Configure Polaris via the REST API
Polaris manages everything through its REST API. We need to:
- Get an access token using the root credentials
- Create a principal (service account for Trino)
- Create a catalog (backed by Ozone storage)
- Create a namespace inside the catalog
1. Get an access token
```shell
# Capture the HTTP status separately from the JSON body
HTTP_CODE=$(curl -s -o /tmp/token.json -w "%{http_code}" \
  -X POST http://polaris.127.0.0.1.nip.io:8080/api/catalog/v1/oauth/tokens \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=root&client_secret=polaris-secret&scope=PRINCIPAL_ROLE:ALL")

echo "HTTP $HTTP_CODE"
TOKEN=$(jq -r '.access_token' /tmp/token.json)
echo "Token: $TOKEN"
```
2. Create a Trino principal and credentials
Polaris returns credentials only once, at creation time; capture them immediately:
```shell
HTTP_CODE=$(curl -s -X POST http://polaris.127.0.0.1.nip.io:8080/api/management/v1/principals \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -o /tmp/principal.json -w "%{http_code}" \
  -d '{
    "name": "trino-principal",
    "type": "SERVICE"
  }')

echo "HTTP $HTTP_CODE"
# $HTTP_CODE holds only the status code; the JSON body is in /tmp/principal.json
TRINO_CLIENT_ID=$(jq -r '.credentials.clientId' /tmp/principal.json)
TRINO_CLIENT_SECRET=$(jq -r '.credentials.clientSecret' /tmp/principal.json)

echo "Trino Client ID: $TRINO_CLIENT_ID"
echo "Trino Client Secret: $TRINO_CLIENT_SECRET"
```
If you see HTTP 409, the principal already exists (e.g. from a previous attempt) and the secret cannot be retrieved again. Delete it and recreate:
```shell
curl -s -X DELETE http://polaris.127.0.0.1.nip.io:8080/api/management/v1/principals/trino-principal \
  -H "Authorization: Bearer $TOKEN" \
  -w "\nHTTP %{http_code}"
```

Then re-run the block above.
Save these β you’ll use them in the Trino Helm values.
3. Create a catalog backed by Ozone
We create an internal Polaris catalog and configure its default storage to use our Ozone S3 bucket.
```shell
curl -X POST http://polaris.127.0.0.1.nip.io:8080/api/management/v1/catalogs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{
    "name": "ozone_catalog",
    "type": "INTERNAL",
    "properties": {
      "default-base-location": "s3://warehouse/iceberg"
    },
    "storageConfigInfo": {
      "storageType": "S3",
      "allowedLocations": ["s3://warehouse/"],
      "endpoint": "http://ozone-s3g-rest.ozone.svc.cluster.local:9878",
      "endpointInternal": "http://ozone-s3g-rest.ozone.svc.cluster.local:9878",
      "stsUnavailable": true,
      "pathStyleAccess": true
    }
  }'
```
Note: Use `s3://` (not `s3a://`) in the Polaris catalog config. Trino uses `s3a://` when reading/writing files, but Polaris stores and validates locations using `s3://` internally. `storageType: S3` is the correct type for any S3-compatible storage, including Ozone.
storageConfigInfoare:
stsUnavailable: trueβ tells Polaris not to call the AWS STS service for temporary credentials (Ozone has no STS endpoint). Polaris will use theAWS_*environment credentials directly instead.endpoint/endpointInternalβ the S3-compatible endpoint for Ozone, injected into theStorageAccessConfig.extraPropertiespassed toS3FileIOso file writes go to Ozone.pathStyleAccess: trueβ forces path-style requests (host/bucket/key) instead of virtual-hosted style (bucket.host/key), which Ozone requires.
4. Grant the principal access to the catalog
Polaris uses a three-tier RBAC model: principal → principal role → catalog role → privileges. We need to wire all of these together:
```shell
# 4a. Create a principal role
curl -X POST "http://polaris.127.0.0.1.nip.io:8080/api/management/v1/principal-roles" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{"principalRole": {"name": "trino-role"}}'

# 4b. Create a catalog role inside ozone_catalog
curl -X POST "http://polaris.127.0.0.1.nip.io:8080/api/management/v1/catalogs/ozone_catalog/catalog-roles" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{"catalogRole": {"name": "trino-catalog-role"}}'

# 4c. Grant CATALOG_MANAGE_CONTENT privilege to the catalog role
curl -X PUT "http://polaris.127.0.0.1.nip.io:8080/api/management/v1/catalogs/ozone_catalog/catalog-roles/trino-catalog-role/grants" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{"grant": {"type": "catalog", "privilege": "CATALOG_MANAGE_CONTENT"}}'

# 4d. Assign the catalog role to the principal role
curl -X PUT "http://polaris.127.0.0.1.nip.io:8080/api/management/v1/principal-roles/trino-role/catalog-roles/ozone_catalog" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{"catalogRole": {"name": "trino-catalog-role"}}'

# 4e. Assign the principal role to the trino principal
curl -X PUT "http://polaris.127.0.0.1.nip.io:8080/api/management/v1/principals/trino-principal/principal-roles" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{"principalRole": {"name": "trino-role"}}'
```
Each command should return HTTP 201. A 409 on steps 4a or 4b means the role already exists (e.g. from a previous attempt); that's fine, just continue to the next step.
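To make the three-tier model concrete, here is a minimal in-memory sketch of the chain these five calls create (the data structures are hypothetical, for illustration only, not Polaris internals):

```python
# Hypothetical in-memory model of the chain wired by steps 4a-4e.
principal_roles = {"trino-principal": {"trino-role"}}                                 # 4e
catalog_role_assignments = {("trino-role", "ozone_catalog"): {"trino-catalog-role"}}  # 4d
grants = {("trino-catalog-role", "ozone_catalog"): {"CATALOG_MANAGE_CONTENT"}}        # 4c

def has_privilege(principal: str, catalog: str, privilege: str) -> bool:
    """Walk principal -> principal role -> catalog role -> privilege."""
    for prole in principal_roles.get(principal, set()):
        for crole in catalog_role_assignments.get((prole, catalog), set()):
            if privilege in grants.get((crole, catalog), set()):
                return True
    return False

print(has_privilege("trino-principal", "ozone_catalog", "CATALOG_MANAGE_CONTENT"))  # True
print(has_privilege("someone-else", "ozone_catalog", "CATALOG_MANAGE_CONTENT"))     # False
```

The indirection through two role tiers is what lets you later add more principals (say, a Spark job) to `trino-role`, or more privileges to `trino-catalog-role`, without repeating the whole chain.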
5. Create a namespace
```shell
curl -X POST "http://polaris.127.0.0.1.nip.io:8080/api/catalog/v1/ozone_catalog/namespaces" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -w "\nHTTP %{http_code}" \
  -d '{
    "namespace": ["demo"],
    "properties": {
      "location": "s3://warehouse/iceberg/demo"
    }
  }'
```
Step 4 – Install Trino
Trino is our SQL query engine. We configure it with two things:
- An Iceberg connector catalog that points to Polaris as the REST catalog
- S3 file system settings pointing to Ozone
Prepare the Helm values
```yaml
# trino-values.yaml

image:
  tag: "480"  # latest stable version at time of writing

server:
  workers: 1  # single worker is enough for local dev

# Allow Traefik's X-Forwarded-For headers (required when running behind an ingress proxy)
additionalConfigProperties:
  - http-server.process-forwarded=true

coordinator:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1"

worker:
  resources:
    requests:
      memory: "2Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "1"

service:
  type: ClusterIP
  port: 8080

# Define the Iceberg catalog backed by Polaris
additionalCatalogs:
  lakehouse: |
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest-catalog.uri=http://polaris.polaris.svc.cluster.local:8181/api/catalog
    iceberg.rest-catalog.warehouse=ozone_catalog
    iceberg.rest-catalog.security=OAUTH2
    # For local dev: credential shorthand (clientId:clientSecret)
    iceberg.rest-catalog.oauth2.credential=<TRINO_CLIENT_ID>:<TRINO_CLIENT_SECRET>
    iceberg.rest-catalog.oauth2.server-uri=http://polaris.polaris.svc.cluster.local:8181/api/catalog/v1/oauth/tokens
    iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
    fs.native-s3.enabled=true
    s3.endpoint=http://ozone-s3g-rest.ozone.svc.cluster.local:9878
    s3.path-style-access=true
    s3.aws-access-key=testuser
    s3.aws-secret-key=testpassword
    s3.region=us-east-1
```
Replace the placeholders: `<TRINO_CLIENT_ID>` / `<TRINO_CLIENT_SECRET>` come from Step 3 (the Polaris principal credentials).
Add the repo and install
```shell
helm repo add trino https://trinodb.github.io/charts/
helm repo update

helm install trino trino/trino \
  --namespace trino \
  --create-namespace \
  --values trino-values.yaml \
  --wait --timeout 3m
```
Verify all pods are running:
```shell
kubectl get pods -n trino
```
You should see a `trino-coordinator-*` pod and a `trino-worker-*` pod.
Apply an Ingress so the Trino UI is reachable at http://trino.127.0.0.1.nip.io:8080:
```yaml
# trino-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: trino
  namespace: trino
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
    - host: trino.127.0.0.1.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: trino
                port:
                  number: 8080
```
```shell
kubectl apply -f trino-ingress.yaml
```
Open http://trino.127.0.0.1.nip.io:8080 in your browser to see the Trino Web UI.
Step 5 – End-to-End Test
Time to put it all together. We’ll exec into the Trino coordinator pod and use the built-in Trino CLI to create an Iceberg table, insert data, and query it back.
```shell
TRINO_POD=$(kubectl get pod -n trino \
  -l app.kubernetes.io/name=trino,app.kubernetes.io/component=coordinator \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -n trino -it $TRINO_POD -- trino \
  --server http://localhost:8080 \
  --catalog lakehouse \
  --schema demo
```
Inside the Trino CLI:
```sql
-- Create an Iceberg table in the 'demo' namespace
CREATE TABLE lakehouse.demo.events (
  event_id BIGINT,
  event_type VARCHAR,
  user_id BIGINT,
  created_at TIMESTAMP(6) WITH TIME ZONE
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['day(created_at)']
);

-- Insert some rows
INSERT INTO lakehouse.demo.events VALUES
  (1, 'page_view', 101, TIMESTAMP '2024-10-01 10:00:00 UTC'),
  (2, 'click', 102, TIMESTAMP '2024-10-01 11:30:00 UTC'),
  (3, 'purchase', 101, TIMESTAMP '2024-10-02 09:15:00 UTC');

-- Query the data
SELECT event_type, COUNT(*) AS cnt
FROM lakehouse.demo.events
GROUP BY event_type
ORDER BY cnt DESC;
```
Expected output:
```
 event_type | cnt
------------+-----
 page_view  |   1
 click      |   1
 purchase   |   1
```
You can also verify the files are physically present in Ozone:
```shell
kubectl exec -n ozone ozone-om-0 -- \
  ozone sh key list /s3v/warehouse
```
You'll see `.parquet` data files and a `metadata/` directory with Iceberg JSON metadata: exactly what you'd see in S3 with a real cloud deployment.
Tear Down
When you’re done:
```shell
k3d cluster delete lakehouse
```
This destroys everything: the Kubernetes cluster, all Helm releases, and all data. Since we used in-memory persistence for Polaris and ephemeral storage for Ozone, nothing leaks onto your filesystem.
Summary
In this tutorial we built a complete open data lakehouse locally using:
- k3d to create a throwaway Kubernetes cluster in Docker
- Apache Ozone as an S3-compatible object store (installed via Helm)
- Apache Polaris as the Iceberg REST catalog (installed via Helm)
- Trino as the SQL query engine (installed via Helm, configured to use Polaris + Ozone)
The entire stack runs on open standards (Iceberg REST API, S3 API) which means you can swap any layer for a compatible alternative without changing the others. That portability is the real value of this architecture.