Apache Polaris and Lance: Bringing AI-Native Storage to the Open Multimodal Lakehouse

Introduction

We are excited to announce the integration between Apache Polaris and the Lance ecosystem, enabling users to manage Lance tables through the Apache Polaris Generic Table API. This integration brings AI-native columnar storage to the open multimodal lakehouse, allowing organizations to leverage Apache Polaris as a unified catalog for both Iceberg and Lance tables.

What is Lance?

Lance is an open lakehouse format designed for multimodal AI workloads. It contains a file format, table format, and catalog spec that allows you to build a complete multimodal lakehouse on top of object storage to power your AI workflows. The key features of Lance include:

  • Expressive hybrid search: Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

  • Lightning-fast random access: 100x faster than Parquet or Iceberg for point lookups, without sacrificing scan performance.

  • Native multimodal data support: Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

  • Data evolution: Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

  • Zero-copy versioning: ACID transactions, time travel, and automatic versioning without needing extra infrastructure.

  • Rich ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

What is Lance Namespace?

Lance Namespace is the catalog spec layer for the Lance Open Lakehouse Format. While Lance tables can be stored directly on object storage, production AI/ML workflows require integration with enterprise metadata services for governance, access control, and discovery.

Lance Namespace addresses this need by defining both a native catalog spec and a standardized framework for accessing and operating on a collection of Lance tables across different open catalog specs, including Apache Polaris.

Here are some example systems and how they are mapped in Lance Namespace:

| System | Structure | Lance Namespace Mapping |
|---|---|---|
| Directory | /data/users.lance | Table ["users"] |
| Hive Metastore | default.orders | Table ["default", "orders"] |
| Apache Polaris | /my-catalog/namespaces/team_a/tables/vectors | Table ["my-catalog", "team_a", "vectors"] |

For the Directory namespace, only the .lance table directories are tracked as tables, while the parent directory (e.g., /data) serves as the root path for the namespace.
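The Directory mapping above can be sketched as a small scan. This is a hypothetical helper for illustration, not the actual Lance Namespace implementation:

```python
from pathlib import Path

def list_directory_tables(root: str) -> list[list[str]]:
    """List Lance tables under a Directory namespace root.

    Only immediate child directories ending in .lance are treated as
    tables; the root itself (e.g., /data) is the namespace root path.
    Illustrative sketch only -- not the actual implementation.
    """
    tables = []
    for child in sorted(Path(root).iterdir()):
        if child.is_dir() and child.suffix == ".lance":
            tables.append([child.stem])  # users.lance -> ["users"]
    return tables
```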

Apache Polaris supports arbitrary namespace nesting, making it particularly flexible for organizing Lance tables in complex data architectures.

What is the Generic Table API in Apache Polaris?

Apache Polaris is best known as an open-source catalog for Apache Iceberg. It also offers the Generic Table API, which can be used to manage non-Iceberg table formats such as Delta Lake, Apache Hudi, Lance, and others.

Generic Table Definition

A generic table in Apache Polaris is an entity with the following fields:

| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier for the table within a namespace |
| format | Yes | The table format (e.g., delta, csv, lance) |
| base-location | No | Table base location in URI format (e.g., s3://bucket/path/to/table) |
| properties | No | Key-value properties for the table |
| doc | No | Comment or description for the table |

Generic tables share the same namespace hierarchy as Iceberg tables, and table names must be unique within a namespace regardless of format.
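As a concrete sketch, a request body for registering a Lance table as a generic table might look like the following. The table name, location, and description are illustrative placeholders; the exact wire format is defined by the Polaris Generic Table REST spec:

```python
import json

# Illustrative create-table request body following the field
# definitions above; name and base-location are placeholders.
create_request = {
    "name": "quora_questions",                       # required, unique within the namespace
    "format": "lance",                               # required table format
    "base-location": "s3://my-bucket/ml/quora_questions.lance",  # optional URI
    "properties": {"table_type": "lance"},           # optional key-value properties
    "doc": "Quora question corpus with embeddings",  # optional description
}
payload = json.dumps(create_request)
```

Such a body would be sent to the create endpoint of the Generic Table API described in the next section.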

Generic Table API vs. Iceberg Table API

Apache Polaris provides separate API endpoints for generic tables and Iceberg tables:

| Operation | Iceberg Table API Endpoint | Generic Table API Endpoint |
|---|---|---|
| Create Table | POST .../namespaces/{namespace}/tables | POST .../namespaces/{namespace}/generic-tables |
| Load Table | GET .../namespaces/{namespace}/tables/{table} | GET .../namespaces/{namespace}/generic-tables/{table} |
| Drop Table | DELETE .../namespaces/{namespace}/tables/{table} | DELETE .../namespaces/{namespace}/generic-tables/{table} |
| List Tables | GET .../namespaces/{namespace}/tables | GET .../namespaces/{namespace}/generic-tables |

The Iceberg Table APIs handle the management of Iceberg tables, while the Generic Table APIs manage Generic (non-Iceberg) tables. This clear separation enforces well-defined boundaries between table formats, while still allowing them to coexist within the same catalog and namespace structure.

Lance Integration with Generic Table API

The Lance Namespace implementation for Apache Polaris maps Lance Namespace operations to the Generic Table API. Lance tables are registered as generic tables with the format field set to lance, and the base-location pointing to the Lance table root directory.

Table Identification

A table in Apache Polaris is identified as a Lance table when:

  • It is registered as a Generic Table
  • The format field is set to lance
  • The base-location points to a valid Lance table root directory
  • The properties contain table_type=lance for consistency with other Lance Namespace implementations (e.g., REST, Unity Catalog)
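These rules can be sketched as a client-side check. This is a hypothetical helper for illustration; it inspects the metadata fields only and omits validating that the base-location actually contains a Lance table:

```python
def is_lance_table(generic_table: dict) -> bool:
    """Check whether a loaded generic-table entity describes a Lance table.

    Inspects only the metadata fields; validating that base-location
    points at a real Lance table root is left out of this sketch.
    """
    return (
        generic_table.get("format") == "lance"
        and generic_table.get("properties", {}).get("table_type") == "lance"
    )
```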

Supported Operations

The Lance Namespace Apache Polaris implementation supports the following operations:

| Operation | Description |
|---|---|
| CreateNamespace | Create a new namespace hierarchy |
| ListNamespaces | List child namespaces |
| DescribeNamespace | Get namespace properties |
| DropNamespace | Remove a namespace |
| DeclareTable | Declare that a new table exists at a given location |
| ListTables | List all Lance tables in a namespace |
| DescribeTable | Get table metadata and location |
| DeregisterTable | Deregister a table from the namespace without deleting the underlying data (similar to DROP TABLE without PURGE) |

Using Lance with Apache Polaris

The power of the Lance and Apache Polaris integration is that you can now store Lance tables in Apache Polaris and access them from any engine that supports Lance. Whether you’re ingesting data with Spark, running feature engineering with Ray, building RAG applications with LanceDB, or analyzing with Trino, all these engines can work with the same Lance tables managed through Apache Polaris. For more information on getting started with Apache Polaris, see the Apache Polaris Getting Started Guide.

Let’s walk through a complete end-to-end workflow using the BeIR/quora dataset from Hugging Face to build a question-answering system. The examples below work against a locally deployed Apache Polaris instance with the endpoint set to http://localhost:8181. You can update the endpoint to match your deployed Apache Polaris service.

Step 1: Ingest Data with Apache Spark

First, use Spark to load the Quora dataset and write it to a Lance table in Apache Polaris:

```python
from pyspark.sql import SparkSession
from datasets import load_dataset

# Create Spark session with Apache Polaris catalog
spark = SparkSession.builder \
    .appName("lance-polaris-ingest") \
    .config("spark.jars.packages", "org.lance:lance-spark-bundle-3.5_2.12:0.0.7") \
    .config("spark.sql.catalog.lance", "org.lance.spark.LanceNamespaceSparkCatalog") \
    .config("spark.sql.catalog.lance.impl", "polaris") \
    .config("spark.sql.catalog.lance.endpoint", "http://localhost:8181") \
    .config("spark.sql.catalog.lance.auth_token", "<your-token>") \
    .getOrCreate()

# Create namespace for ML workloads
spark.sql("CREATE NAMESPACE IF NOT EXISTS lance.my_catalog.ml")

# Load Quora dataset from Hugging Face
dataset = load_dataset("BeIR/quora", "corpus", split="corpus[:10000]", trust_remote_code=True)
pdf = dataset.to_pandas()
pdf = pdf.rename(columns={"_id": "id"})

# Convert to Spark DataFrame and write to Lance table
df = spark.createDataFrame(pdf)
df.writeTo("lance.my_catalog.ml.quora_questions").create()

# Verify the data
spark.sql("SELECT COUNT(*) FROM lance.my_catalog.ml.quora_questions").show()
```

Step 2: Feature Engineering with Ray

Next, use Ray’s distributed computing to generate embeddings for all documents:

```python
import ray
import pyarrow as pa
import lance_namespace as ln
from lance_ray import add_columns

ray.init()

# Connect to Apache Polaris
namespace = ln.connect("polaris", {
    "endpoint": "http://localhost:8181",
    "auth_token": "<your-token>"
})

def generate_embeddings(batch: pa.RecordBatch) -> pa.RecordBatch:
    """Generate embeddings using sentence-transformers."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')

    texts = []
    for i in range(len(batch)):
        title = batch["title"][i].as_py() or ""
        text = batch["text"][i].as_py() or ""
        texts.append(f"{title}. {text}".strip())

    embeddings = model.encode(texts, normalize_embeddings=True)

    return pa.RecordBatch.from_arrays(
        [pa.array(embeddings.tolist(), type=pa.list_(pa.float32(), 384))],
        names=["vector"]
    )

# Add embeddings column using distributed processing
add_columns(
    uri=None,
    namespace=namespace,
    table_id=["my_catalog", "ml", "quora_questions"],
    transform=generate_embeddings,
    read_columns=["title", "text"],
    batch_size=100,
    concurrency=4,
)

print("Embeddings generated successfully!")
```

Step 3: SQL Analytics with Trino

Use Trino for SQL analytics on the same dataset:

```properties
# etc/catalog/lance.properties
connector.name=lance
lance.impl=polaris
lance.endpoint=http://localhost:8181
lance.auth_token=<your-token>
```

Then query the table:

```sql
-- Explore the dataset
SHOW SCHEMAS FROM lance;
SHOW TABLES FROM lance.my_catalog.ml;
DESCRIBE lance.my_catalog.ml.quora_questions;

-- Basic analytics
SELECT COUNT(*) as total_questions
FROM lance.my_catalog.ml.quora_questions;

-- Find questions by keyword
SELECT id, title, text
FROM lance.my_catalog.ml.quora_questions
WHERE text LIKE '%machine learning%'
LIMIT 10;

-- Aggregate statistics
SELECT
    LENGTH(text) as text_length,
    COUNT(*) as count
FROM lance.my_catalog.ml.quora_questions
GROUP BY LENGTH(text)
ORDER BY count DESC
LIMIT 10;
```

Step 4: Agentic Search with LanceDB

Finally, use LanceDB for AI-native semantic search and full-text search on the enriched dataset:

```python
import lancedb
from sentence_transformers import SentenceTransformer

# Connect to Apache Polaris via LanceDB
db = lancedb.connect_namespace(
    "polaris",
    {
        "endpoint": "http://localhost:8181",
        "auth_token": "<your-token>"
    }
)

# Open the table with embeddings
table = db.open_table("quora_questions", namespace=["my_catalog", "ml"])

# Create vector index for fast similarity search
table.create_index(
    metric="cosine",
    vector_column_name="vector",
    index_type="IVF_PQ",
    num_partitions=32,
    num_sub_vectors=48,
)

# Create full-text search index
table.create_fts_index("text")
```

Run a vector similarity search with an embedded query:

```python
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
query_text = "How do I learn machine learning?"
query_embedding = model.encode([query_text], normalize_embeddings=True)[0]

results = (
    table.search(query_embedding, vector_column_name="vector")
    .limit(5)
    .to_pandas()
)

print("=== Vector Search Results ===")
for idx, row in results.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   {row['text'][:150]}...")
```

Run a BM25 full-text search on the same table:

```python
results = (
    table.search("machine learning algorithms", query_type="fts")
    .limit(5)
    .to_pandas()
)

print("=== Full-Text Search Results ===")
for idx, row in results.iterrows():
    print(f"{idx + 1}. {row['title']}")
    print(f"   {row['text'][:150]}...")
```

Or combine vector search with a SQL filter:

```python
results = (
    table.search(query_embedding, vector_column_name="vector")
    .where("text LIKE '%python%'", prefilter=True)
    .limit(5)
    .to_pandas()
)
```

Try It Yourself

The lance-namespace-impls repository provides a Docker Compose setup that makes it easy to try out the Apache Polaris integration locally.

Quick Start

```shell
git clone https://github.com/lance-format/lance-namespace-impls.git
cd lance-namespace-impls/docker

# Start Apache Polaris
make setup
make up-polaris

# Get authentication token
make polaris-token

# Create a test catalog
make polaris-create-catalog
```

The setup includes:

  • Apache Polaris API on port 8181
  • Apache Polaris Management on port 8182
  • PostgreSQL backend for metadata storage

Configuration

Once Apache Polaris is running, configure your Lance Namespace client:

```python
from lance_namespace_impls.polaris import PolarisNamespace

ns = PolarisNamespace({
    "endpoint": "http://localhost:8181",
    "auth_token": "<token-from-make-polaris-token>"
})
```

Next Steps

We are excited about this collaboration between the Lance and Apache Polaris communities. Our integration with the Generic Table API opens up new possibilities for managing AI-native workloads in the open multimodal lakehouse.

Looking ahead, we plan to continue improving the Generic Table API based on our learnings from the Lance Namespace specification:

  • Credentials Vending: Lance Namespace already supports credentials vending end-to-end and is integrated with any engine that uses the Lance Rust/Python/Java SDKs, allowing namespace servers to provide temporary credentials for accessing table data. We would like to collaborate with the Apache Polaris community to add credentials vending support to the Generic Table API, enabling secure, fine-grained access control for Lance tables stored in AWS S3, Azure Blob Storage, and Google Cloud Storage. This would allow Apache Polaris to fully manage credentials for Lance tables, just as it does for Iceberg tables today.

  • OAuth Integration: We plan to integrate with various OAuth workflows for better client connectivity and auth token refresh, making it easier for applications to maintain secure, long-lived connections to Apache Polaris.

  • Table Commit Path: Formats like Delta Lake and Lance do not go through the catalog for commits after initial table creation. This creates challenges for centralized governance over table update operations. We will continue to evolve the Generic Table API to integrate better with these formats at the commit code path, enabling finer-grained control and governance over table modifications.

We welcome contributions and feedback from the community. Join us in building the future of AI-native data management in the open multimodal lakehouse!

Resources