Apache Polaris and Lance: Bringing AI-Native Storage to the Open Multimodal Lakehouse

Introduction

We are excited to announce the integration between Apache Polaris and the Lance ecosystem, enabling users to manage Lance tables through the Apache Polaris Generic Table API. This integration brings AI-native columnar storage to the open multimodal lakehouse, allowing organizations to leverage Apache Polaris as a unified catalog for both Iceberg and Lance tables.

What is Lance?

Lance is an open lakehouse format designed for multimodal AI workloads. It consists of a file format, a table format, and a catalog spec that together let you build a complete multimodal lakehouse on top of object storage to power your AI workflows. The key features of Lance include:

  • Expressive hybrid search: Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

  • Lightning-fast random access: 100x faster random access than Parquet or Iceberg, without sacrificing scan performance.

  • Native multimodal data support: Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

  • Data evolution: Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

  • Zero-copy versioning: ACID transactions, time travel, and automatic versioning without extra infrastructure (see the sketch after this list).

  • Rich ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).
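
To make the zero-copy versioning bullet concrete, here is a minimal local sketch using the lance Python package (installed as pylance). The path and sample data are placeholders for illustration:

import lance
import pyarrow as pa

# Every write to a Lance dataset creates a new version automatically
tbl = pa.table({"id": [1, 2, 3], "text": ["a", "b", "c"]})
lance.write_dataset(tbl, "/tmp/demo.lance")

# Appending creates version 2 without rewriting version 1's files
lance.write_dataset(pa.table({"id": [4], "text": ["d"]}), "/tmp/demo.lance", mode="append")

# Time travel: open the dataset as of version 1
v1 = lance.dataset("/tmp/demo.lance", version=1)
print(v1.count_rows())  # 3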

What is Lance Namespace?

Lance Namespace is the catalog spec layer for the Lance Open Lakehouse Format. While Lance tables can be stored directly on object storage, production AI/ML workflows require integration with enterprise metadata services for governance, access control, and discovery.

Lance Namespace addresses this need by defining both a native catalog spec and a standardized framework for accessing and operating on collections of Lance tables across different open catalog specs, including Apache Polaris.

Here are some example systems and how they are mapped in Lance Namespace:

| System         | Structure                                    | Lance Namespace Mapping                   |
|----------------|----------------------------------------------|-------------------------------------------|
| Directory      | /data/users.lance                            | Table ["users"]                           |
| Hive Metastore | default.orders                               | Table ["default", "orders"]               |
| Apache Polaris | /my-catalog/namespaces/team_a/tables/vectors | Table ["my-catalog", "team_a", "vectors"] |

For the Directory namespace, only .lance table directories are tracked as tables; the parent directory (e.g., /data) serves as the root path for the namespace.

Apache Polaris supports arbitrary namespace nesting, making it particularly flexible for organizing Lance tables in complex data architectures.
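
As a small illustration, here is how a nested Apache Polaris identifier is expressed through the lance_namespace client that appears later in this post; the connection config mirrors the Ray example below, and the token is a placeholder:

import lance_namespace as ln

# Connect to Apache Polaris as a Lance Namespace (same pattern as the Ray example below)
ns = ln.connect("polaris", {
    "endpoint": "http://localhost:8181",
    "auth_token": "<your-token>",
})

# A nested Polaris identifier flattens to a list of strings:
# catalog "my-catalog" / namespace "team_a" / table "vectors"
table_id = ["my-catalog", "team_a", "vectors"]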

What is the Generic Table API in Apache Polaris?

Apache Polaris is best known as an open-source catalog for Apache Iceberg, but it also offers the Generic Table API for managing non-Iceberg table formats such as Delta Lake, Apache Hudi, and Lance.

Generic Table Definition

A generic table in Apache Polaris is an entity with the following fields:

| Field         | Required | Description                                                          |
|---------------|----------|----------------------------------------------------------------------|
| name          | Yes      | Unique identifier for the table within a namespace                   |
| format        | Yes      | The table format (e.g., delta, csv, lance)                           |
| base-location | No       | Table base location in URI format (e.g., s3://bucket/path/to/table)  |
| properties    | No       | Key-value properties for the table                                   |
| doc           | No       | Comment or description for the table                                 |

Generic tables share the same namespace hierarchy as Iceberg tables, and table names must be unique within a namespace regardless of format.

Generic Table API vs. Iceberg Table API

Apache Polaris provides separate API endpoints for generic tables and Iceberg tables:

| Operation    | Iceberg Table API Endpoint                       | Generic Table API Endpoint                               |
|--------------|--------------------------------------------------|----------------------------------------------------------|
| Create Table | POST .../namespaces/{namespace}/tables           | POST .../namespaces/{namespace}/generic-tables           |
| Load Table   | GET .../namespaces/{namespace}/tables/{table}    | GET .../namespaces/{namespace}/generic-tables/{table}    |
| Drop Table   | DELETE .../namespaces/{namespace}/tables/{table} | DELETE .../namespaces/{namespace}/generic-tables/{table} |
| List Tables  | GET .../namespaces/{namespace}/tables            | GET .../namespaces/{namespace}/generic-tables            |

The Iceberg Table APIs handle the management of Iceberg tables, while the Generic Table APIs manage Generic (non-Iceberg) tables. This clear separation enforces well-defined boundaries between table formats, while still allowing them to coexist within the same catalog and namespace structure.
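
To make the endpoint shapes concrete, here is a hedged sketch of registering a Lance table through the Generic Table API with Python's requests library. The body fields follow the generic table definition above, while the URL prefix and response handling are assumptions about a typical local deployment rather than the authoritative API spec:

import requests

ENDPOINT = "http://localhost:8181"  # local Apache Polaris, as in the walkthrough below
TOKEN = "<your-token>"

# Assumed path layout: .../namespaces/{namespace}/generic-tables, per the table above;
# the "/api/catalog/polaris/v1/my-catalog" prefix is an assumption for illustration.
resp = requests.post(
    f"{ENDPOINT}/api/catalog/polaris/v1/my-catalog/namespaces/ml/generic-tables",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "quora_questions",
        "format": "lance",
        "base-location": "s3://bucket/ml/quora_questions.lance",
        "properties": {"table_type": "lance"},
        "doc": "Quora questions with embeddings",
    },
)
resp.raise_for_status()
print(resp.json())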

Lance Integration with Generic Table API

The Lance Namespace implementation for Apache Polaris maps Lance Namespace operations to the Generic Table API. Lance tables are registered as generic tables with the format field set to lance, and the base-location pointing to the Lance table root directory.

Table Identification

A table in Apache Polaris is identified as a Lance table when:

  • It is registered as a Generic Table
  • The format field is set to lance
  • The base-location points to a valid Lance table root directory
  • The properties contain table_type=lance for consistency with other Lance Namespace implementations (e.g., REST, Unity Catalog)

Supported Operations

The Apache Polaris implementation of Lance Namespace supports the following operations:

| Operation         | Description                                                                                                   |
|-------------------|---------------------------------------------------------------------------------------------------------------|
| CreateNamespace   | Create a new namespace hierarchy                                                                              |
| ListNamespaces    | List child namespaces                                                                                         |
| DescribeNamespace | Get namespace properties                                                                                      |
| DropNamespace     | Remove a namespace                                                                                            |
| DeclareTable      | Declare that a new table exists at a given location                                                           |
| ListTables        | List all Lance tables in a namespace                                                                          |
| DescribeTable     | Get table metadata and location                                                                               |
| DeregisterTable   | Deregister a table from the namespace without deleting the underlying data (similar to DROP TABLE without PURGE) |
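
Below is a rough sketch of what driving these operations from the lance_namespace Python client could look like. The ln.connect call mirrors the Ray example later in this post, but the per-operation method names are assumptions derived from the operation table above; the real client may take request objects instead of bare identifier lists:

import lance_namespace as ln

# Connect to Apache Polaris (same pattern as the Ray example below)
ns = ln.connect("polaris", {
    "endpoint": "http://localhost:8181",
    "auth_token": "<your-token>",
})

# Hypothetical calls named after the operations above
ns.create_namespace(["my_catalog", "ml"])                          # CreateNamespace
print(ns.list_tables(["my_catalog", "ml"]))                        # ListTables
print(ns.describe_table(["my_catalog", "ml", "quora_questions"]))  # DescribeTable
ns.deregister_table(["my_catalog", "ml", "quora_questions"])       # DeregisterTable: metadata only, data stays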

Using Lance with Apache Polaris

The power of the Lance and Apache Polaris integration is that you can now store Lance tables in Apache Polaris and access them from any engine that supports Lance. Whether you’re ingesting data with Spark, running feature engineering with Ray, building RAG applications with LanceDB, or analyzing with Trino, all these engines can work with the same Lance tables managed through Apache Polaris. For more information on getting started with Apache Polaris, see the Apache Polaris Getting Started Guide.

Let’s walk through a complete end-to-end workflow using the BeIR/quora dataset from Hugging Face to build a question-answering system. The examples below work against a locally deployed Apache Polaris instance with the endpoint set to http://localhost:8181. You can update the endpoint to match your deployed Apache Polaris service.

Step 1: Ingest Data with Apache Spark

First, use Spark to load the Quora dataset and write it to a Lance table in Apache Polaris:

from pyspark.sql import SparkSession
from datasets import load_dataset

# Create Spark session with Apache Polaris catalog
spark = SparkSession.builder \
    .appName("lance-polaris-ingest") \
    .config("spark.jars.packages", "org.lance:lance-spark-bundle-3.5_2.12:0.0.7") \
    .config("spark.sql.catalog.lance", "org.lance.spark.LanceNamespaceSparkCatalog") \
    .config("spark.sql.catalog.lance.impl", "polaris") \
    .config("spark.sql.catalog.lance.endpoint", "http://localhost:8181") \
    .config("spark.sql.catalog.lance.auth_token", "<your-token>") \
    .getOrCreate()

# Create namespace for ML workloads
spark.sql("CREATE NAMESPACE IF NOT EXISTS lance.my_catalog.ml")

# Load Quora dataset from Hugging Face
dataset = load_dataset("BeIR/quora", "corpus", split="corpus[:10000]", trust_remote_code=True)
pdf = dataset.to_pandas()
pdf = pdf.rename(columns={"_id": "id"})

# Convert to Spark DataFrame and write to Lance table
df = spark.createDataFrame(pdf)
df.writeTo("lance.my_catalog.ml.quora_questions").create()

# Verify the data
spark.sql("SELECT COUNT(*) FROM lance.my_catalog.ml.quora_questions").show()

Step 2: Feature Engineering with Ray

Next, use Ray’s distributed computing to generate embeddings for all documents:

import ray
import pyarrow as pa
import lance_namespace as ln
from lance_ray import add_columns

ray.init()

# Connect to Apache Polaris
namespace = ln.connect("polaris", {
    "endpoint": "http://localhost:8181",
    "auth_token": "<your-token>"
})

def generate_embeddings(batch: pa.RecordBatch) -> pa.RecordBatch:
    """Generate embeddings using sentence-transformers."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')

    texts = []
    for i in range(len(batch)):
        title = batch["title"][i].as_py() or ""
        text = batch["text"][i].as_py() or ""
        texts.append(f"{title}. {text}".strip())

    embeddings = model.encode(texts, normalize_embeddings=True)

    return pa.RecordBatch.from_arrays(
        [pa.array(embeddings.tolist(), type=pa.list_(pa.float32(), 384))],
        names=["vector"]
    )

# Add embeddings column using distributed processing
add_columns(
    uri=None,
    namespace=namespace,
    table_id=["my_catalog", "ml", "quora_questions"],
    transform=generate_embeddings,
    read_columns=["title", "text"],
    batch_size=100,
    concurrency=4,
)

print("Embeddings generated successfully!")

Step 3: SQL Analytics with Trino

Use Trino for SQL analytics on the same dataset. First, configure the Lance connector in a Trino catalog properties file:

# etc/catalog/lance.properties
connector.name=lance
lance.impl=polaris
lance.endpoint=http://localhost:8181
lance.auth_token=<your-token>

With the connector configured, query the same table with SQL:

-- Explore the dataset
SHOW SCHEMAS FROM lance;
SHOW TABLES FROM lance.my_catalog.ml;
DESCRIBE lance.my_catalog.ml.quora_questions;

-- Basic analytics
SELECT COUNT(*) as total_questions
FROM lance.my_catalog.ml.quora_questions;

-- Find questions by keyword
SELECT id, title, text
FROM lance.my_catalog.ml.quora_questions
WHERE text LIKE '%machine learning%'
LIMIT 10;

-- Aggregate statistics
SELECT
    LENGTH(text) as text_length,
    COUNT(*) as count
FROM lance.my_catalog.ml.quora_questions
GROUP BY LENGTH(text)
ORDER BY count DESC
LIMIT 10;

Step 4: Agentic Search with LanceDB

Finally, use LanceDB for AI-native semantic search and full-text search on the enriched dataset:

import lancedb
from sentence_transformers import SentenceTransformer

# Connect to Apache Polaris via LanceDB
db = lancedb.connect_namespace(
    "polaris",
    {
        "endpoint": "http://localhost:8181",
        "auth_token": "<your-token>"
    }
)

# Open the table with embeddings
table = db.open_table("quora_questions", namespace=["my_catalog", "ml"])

# Create vector index for fast similarity search
table.create_index(
    metric="cosine",
    vector_column_name="vector",
    index_type="IVF_PQ",
    num_partitions=32,
    num_sub_vectors=48,
)

# Create full-text search index
table.create_fts_index("text")

# Embed a query for semantic search
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
query_text = "How do I learn machine learning?"
query_embedding = model.encode([query_text], normalize_embeddings=True)[0]

# Vector (semantic) search over the embedding column
results = (
    table.search(query_embedding, vector_column_name="vector")
    .limit(5)
    .to_pandas()
)

print("=== Vector Search Results ===")
for idx, row in results.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   {row['text'][:150]}...")
results = (
    table.search("machine learning algorithms", query_type="fts")
    .limit(5)
    .to_pandas()
)

print("=== Full-Text Search Results ===")
for idx, row in results.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   {row['text'][:150]}...")
results = (
    table.search(query_embedding, vector_column_name="vector")
    .where("text LIKE '%python%'", prefilter=True)
    .limit(5)
    .to_pandas()
)

Try It Yourself

The lance-namespace-impls repository provides a Docker Compose setup that makes it easy to try out the Apache Polaris integration locally.

Quick Start

git clone https://github.com/lance-format/lance-namespace-impls.git
cd lance-namespace-impls/docker

# Start Apache Polaris
make setup
make up-polaris

# Get authentication token
make polaris-token

# Create a test catalog
make polaris-create-catalog

The setup includes:

  • Apache Polaris API on port 8181
  • Apache Polaris Management on port 8182
  • PostgreSQL backend for metadata storage

Configuration

Once Apache Polaris is running, configure your Lance Namespace client:

from lance_namespace_impls.polaris import PolarisNamespace

ns = PolarisNamespace({
    "endpoint": "http://localhost:8181",
    "auth_token": "<token-from-make-polaris-token>"
})

Next Steps

We are excited about this collaboration between the Lance and Apache Polaris communities. Our integration with the Generic Table API opens up new possibilities for managing AI-native workloads in the open multimodal lakehouse.

Looking ahead, we plan to continue improving the Generic Table API based on our learnings from the Lance Namespace specification:

  • Credentials Vending: Lance Namespace already supports credentials vending end-to-end and is integrated with any engine that uses the Lance Rust/Python/Java SDKs, allowing namespace servers to provide temporary credentials for accessing table data. We would like to collaborate with the Apache Polaris community to add credentials vending support to the Generic Table API, enabling secure, fine-grained access control for Lance tables stored in AWS S3, Azure Blob Storage, and Google Cloud Storage. This would allow Apache Polaris to fully manage credentials for Lance tables, just as it does for Iceberg tables today.

  • OAuth Integration: We plan to integrate with various OAuth workflows for better client connectivity and auth token refresh, making it easier for applications to maintain secure, long-lived connections to Apache Polaris.

  • Table Commit Path: Formats like Delta Lake and Lance do not go through the catalog for commits after initial table creation. This creates challenges for centralized governance over table update operations. We will continue to evolve the Generic Table API to integrate better with these formats at the commit code path, enabling finer-grained control and governance over table modifications.

We welcome contributions and feedback from the community. Join us in building the future of AI-native data management in the open multimodal lakehouse!

Resources