The Format War Is Over. The Catalog War Just Started.

Both format creators have publicly stated the choice should be irrelevant. The convergence is shipped code, not hand-waving. But the catalog layer above the format is where the next decade of lock-in is being constructed right now.
Nidhi Vichare · April 16, 2026 · 18 min read
Data Architecture · Apache Iceberg · Delta Lake · Data Strategy · CDO · Enterprise AI
The Catalog Wars · Part 1 of 4

TL;DR: The Catalog Is the Real Architecture Decision.

The format question is settled. The catalog question is the one that will define your next decade of optionality.

The creators of both Delta Lake and Apache Iceberg have publicly stated the format choice should be irrelevant to enterprises. The convergence is not hand-waving; it is shipped code. But governance, AI asset management, and semantic layers are diverging fast, because that is where the lock-in value lives. This is Part 1 of 2: the argument for why the catalog is the real decision. Part 2 covers the contenders and the bet.

78.6% exclusive Iceberg use
$2B Tabular acquisition
10K+ enterprises on Unity Catalog

Every Chief Data and AI Officer I talk to right now is preparing for Summit season with the wrong question on the table.

They want to know whether their team should standardize on Delta or Iceberg. They want a tiebreaker between Snowflake and Databricks. They want a clean answer they can put in the architecture review deck before June.

The answer is that the people who built both formats just publicly admitted the question does not matter anymore.

If your roadmap still treats Delta versus Iceberg as a strategic decision, you are reading from a 2023 playbook. The conversation moved. And the conversation that replaced it is a much harder one, because no vendor on a Summit stage in June is going to frame it for you honestly. Their job is to sell you a catalog. Your job is to pick one without locking yourself out of the next ten years of optionality.

This is the longer version of the conversation I think senior architects need to have before they walk into Snowflake Summit (June 2-5) or Databricks Data + AI Summit (June 15-18). It is structured around four claims:

  1. The format war is functionally over and the people who created it have said so.
  2. The convergence is not hand-waving. It is shipped code.
  3. The catalog war is the real architectural decision of 2026, and most organizations are sleepwalking into it.
  4. The catalog decision carries 10x the strategic weight of the format decision.

The remaining topics (the three contenders, the convergence boundary, the bet and its decision points, and the three-year timeline) are covered in Part 2: Picking Your Catalog.

Let me work through each.


Part One: The Admission You Probably Missed

In a recent on-record conversation between Ryan Blue, the original creator of Apache Iceberg at Netflix, and the Databricks team responsible for Delta Lake, both sides said something that should reorient your roadmap.

They said the Delta versus Iceberg choice was a mistake they inadvertently created and a problem they are now actively trying to make irrelevant.

Here is the substance of what they actually said, paraphrased:

  • The two formats started as parallel solutions to the same problem. Both teams hit the same wall with Hive metastore. Both built transactional, scalable storage layers on top of Parquet. The architectures look remarkably similar at the core because they were solving identical problems.
  • The fracturing of "which one do I use" became a bigger problem than the original problems either format was built to solve. The worst outcome of having two competing standards is that organizations stay paralyzed on Hive because they cannot pick a winner.
  • The technical convergence is already well underway. Field IDs from Iceberg were adopted into Delta to enable schema evolution. The variant type is being pushed upstream into Parquet so there is no Delta-variant versus Iceberg-variant distinction. Delta snapshots increasingly resemble Iceberg metadata. Iceberg's REST catalog API is moving toward change-based commits.
  • The honest recommendation from both creators: store your data in whatever format your existing pipeline already produces, and let the catalog handle the translation.

Ryan Blue's framing was the sharpest line in the conversation. He said there should be perhaps twenty people in the world who care about which underlying format is in use, and none of them should work in your organization.

Read that sentence twice. The person who created Apache Iceberg, who built one of the two formats your platform team has been arguing about for three years, is telling you the argument should not be happening inside your company at all.

If that is true, and I believe it is, then the architectural question shifts. The format is no longer where the lock-in lives. The format is no longer where governance lives. The format is no longer where engine compatibility lives. All three of those concerns moved up one layer. They moved into the catalog.

And the catalog is where the next decade of vendor lock-in is being constructed right now, while everyone is still staring at the format question.


Part Two: What Is Actually Converging, and the Numbers Behind It

Before I make the catalog argument, it is worth being precise about what "convergence" means at the format layer, because the convergence is not hand-waving. It is shipped code.

The 2025 State of the Apache Iceberg Ecosystem survey puts a number on this: 78.6% of respondents report exclusive Iceberg use. All major cloud providers have deeply integrated Iceberg: Google BigLake, Amazon S3 Tables, Snowflake Iceberg V3 support, Databricks native Iceberg read/write. Both the creator of Iceberg (Ryan Blue, now at Databricks after the $2 billion Tabular acquisition) and the Databricks team have publicly stated the format choice should be irrelevant to enterprises.

Here is what convergence looks like in shipped code.

The Great Convergence: Delta Lake + Apache Iceberg

Field IDs and schema evolution. One of the original differentiators Iceberg had over Delta was robust schema evolution: the ability to rename a column, add a column, drop a column, without breaking existing readers. Iceberg used field IDs to track columns independently of their names. Delta originally tracked by name. That was a real difference, and it mattered for organizations migrating from Teradata or Vertica where column rename is table stakes. Delta has since adopted field IDs from the Iceberg approach. The schema evolution gap is closed.
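
The mechanism is small enough to sketch. Here is a toy model (hypothetical names, not the actual Iceberg or Delta code): data files key values by an immutable field ID, the schema maps IDs to current names, and a rename therefore touches only the mapping, never the files already on disk.

```python
# Toy model of field-ID-based schema evolution (illustrative, not the real
# Iceberg or Delta implementation). Data files store values keyed by an
# immutable field ID; the schema maps IDs to current column names.

def write_row(schema, row):
    """Store a row keyed by field ID, using the schema's name -> id mapping."""
    return {schema[name]: value for name, value in row.items()}

def read_row(schema, stored):
    """Project a stored row back through the *current* schema."""
    id_to_name = {fid: name for name, fid in schema.items()}
    return {id_to_name[fid]: value for fid, value in stored.items()}

schema_v1 = {"cust_nm": 1, "revenue": 2}
stored = write_row(schema_v1, {"cust_nm": "Acme", "revenue": 100})

# Rename cust_nm -> customer_name: only the name -> id mapping changes.
schema_v2 = {"customer_name": 1, "revenue": 2}

assert read_row(schema_v2, stored) == {"customer_name": "Acme", "revenue": 100}
```

Under name-based tracking, the rename would have orphaned every existing file; tracking by ID is what makes the rename a pure metadata operation.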

Parquet variant type. Both communities are now contributing the variant type implementation upstream into Apache Parquet itself. This means that a "Delta variant" and an "Iceberg variant" will be the same object on disk. The two communities are explicitly working together in Spark to make sure there is one variant type that survives across both formats.

Deletion vectors. Both Iceberg v3 and Delta now use identical binary encodings for deletion vectors: the same on-disk representation. Row-level lineage, geospatial types, and nanosecond timestamps are all being standardized at the same layer.

Snapshot and metadata architecture. Iceberg historically encoded a tree of data files and metadata files. Delta historically encoded changes to that tree. The two approaches have grown remarkably close. Delta snapshots now look like Iceberg metadata when reconstructed at a point in time. Iceberg's new REST catalog API is moving toward change-based commits to enable lower-latency writes. Databricks engineers are driving Iceberg v4 proposals including the "adaptive metadata tree," a simplification of Iceberg's metadata model toward something closer to the Delta/Iceberg midpoint. The two architectures are converging from opposite directions on the same endpoint.
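
The two styles are close enough that the convergence can be sketched in a few lines. In this toy model (illustrative, not the real metadata code), a Delta-style log records changes, and replaying it up to a version yields exactly the Iceberg-style artifact: the full set of live data files for that snapshot.

```python
# Toy sketch of the two metadata styles (not the real Delta or Iceberg code).
# A change-based log records add/remove-file actions; replaying it to a
# version reconstructs the point-in-time file set a tree-based snapshot
# would have stored directly.

def snapshot_at(log, version):
    """Replay add/remove actions up to `version` into a set of live files."""
    files = set()
    for v, action, path in log:
        if v > version:
            break
        if action == "add":
            files.add(path)
        elif action == "remove":
            files.discard(path)
    return files

log = [
    (0, "add", "part-000.parquet"),
    (1, "add", "part-001.parquet"),
    (2, "remove", "part-000.parquet"),  # e.g. compaction rewrote it
    (2, "add", "part-002.parquet"),
]

assert snapshot_at(log, 1) == {"part-000.parquet", "part-001.parquet"}
assert snapshot_at(log, 2) == {"part-001.parquet", "part-002.parquet"}
```

The log form favors cheap commits (append one change); the tree form favors cheap reads (no replay). Converging means meeting somewhere between the two.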

The Supporting Cast: Engines, Formats, and Standards

Translation layers in production today. Databricks UniForm exposes Delta tables as Iceberg metadata so external engines can read them as Iceberg without copying data. Snowflake's expanded support for the Iceberg REST API means Snowflake can read and write to Polaris-managed tables. Apache XTable (formerly OneTable) translates between Delta, Iceberg, and Hudi in place. The "I can only use one format" constraint is operationally false in most modern stacks.

The Apache Polaris 1.3 release in January 2026 is instructive on this. It promoted Generic Tables to general availability, which means Polaris can now serve as a catalog for tables that are not Iceberg at all. The catalog is being explicitly designed to be format-neutral, because the catalog community already understands that betting on a single format would be a strategic mistake.

If the catalog is going to be format-neutral, then the question of which format you write today becomes a tactical decision, not a strategic one. The strategic decision is which catalog mediates access to all your data going forward.


Part Three: The Real Challenges That No One Puts on a Slide

Before we get to the catalog decision, the honest conversation requires naming the problems that exist in both formats today. Summit keynotes will not cover these.

Iceberg's Challenges

Catalog fragmentation is Iceberg's biggest operational problem. Iceberg was designed to be catalog-agnostic, which is a strength architecturally but a problem operationally. There are now at least 10 production-grade Iceberg catalog implementations: Polaris, Unity, Glue, Nessie, Gravitino, BigLake, Lakekeeper, Hive Metastore, S3 Tables, and vendor-specific variants. Each has different capabilities, different governance models, and different operational characteristics. The Iceberg REST Catalog spec provides interoperability for table operations but does not standardize security, credential vending patterns, semantic layers, or AI asset governance. "We use Iceberg" tells you almost nothing about the catalog experience.

Metadata overhead at extreme scale. Iceberg's hierarchical metadata tree is excellent for read performance but creates management complexity at extreme scale. Each commit creates new metadata files, manifest lists, and manifests. Without aggressive maintenance (snapshot expiry, orphan file cleanup, metadata compaction), the metadata layer itself can grow to consume significant storage and slow down operations. Organizations with thousands of tables and millions of daily commits report that metadata maintenance is a non-trivial operational burden.
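
A toy simulation makes the growth dynamic concrete (illustrative numbers and file names, not Iceberg's actual layout): commits only ever add metadata-layer files, and snapshot expiry is the only thing that bounds them.

```python
# Toy simulation of Iceberg-style metadata growth (illustrative, not
# Iceberg's actual file layout). Every commit writes a new table-metadata
# file and manifest list; nothing is reclaimed until snapshot expiry runs.

def commit(metadata_files, snapshot_id):
    """Each commit adds a fresh metadata.json and manifest list."""
    metadata_files.append(f"v{snapshot_id}.metadata.json")
    metadata_files.append(f"snap-{snapshot_id}.manifest-list.avro")

def expire_snapshots(metadata_files, keep_last):
    """Keep only the metadata-layer files of the newest `keep_last` snapshots."""
    return metadata_files[-2 * keep_last:]

files = []
for snapshot_id in range(10_000):   # 10k commits with no maintenance
    commit(files, snapshot_id)
assert len(files) == 20_000         # the metadata layer alone keeps growing

files = expire_snapshots(files, keep_last=100)
assert len(files) == 200            # maintenance is what bounds it
```

At millions of commits a day across thousands of tables, the question is not whether to run this maintenance but who runs it, which is one reason catalog-managed maintenance services matter.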

Write concurrency under high contention. Iceberg uses optimistic concurrency control. Under high write contention (many concurrent writers to the same table) conflict resolution can become a bottleneck. The retry-based approach works for moderate concurrency but degrades for workloads with hundreds of concurrent writers.
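
The contention mechanics can be sketched in a few lines (a toy model, not Iceberg's implementation): a commit is a compare-and-swap on the snapshot version, and a writer who loses the race must re-read and retry.

```python
# Toy sketch of optimistic concurrency control (not Iceberg's actual code):
# each writer reads the current snapshot version, prepares a commit against
# it, and atomically swaps it in only if the version is unchanged.

class Table:
    def __init__(self):
        self.version = 0

    def try_commit(self, expected_version):
        """Compare-and-swap: succeed only if no one committed in between."""
        if self.version != expected_version:
            return False  # conflict: another writer won the race
        self.version += 1
        return True

def commit_with_retry(table, max_retries=5):
    for attempt in range(max_retries):
        base = table.version          # read the current snapshot
        # ... prepare new metadata against `base` (elided) ...
        if table.try_commit(base):
            return attempt + 1        # number of tries it took
    raise RuntimeError("gave up after repeated conflicts")

table = Table()
assert commit_with_retry(table) == 1 and table.version == 1

# Simulate contention: a competing writer commits between our read and commit.
base = table.version
table.try_commit(table.version)       # the competitor lands first
assert table.try_commit(base) is False  # our commit conflicts and must retry
```

With hundreds of concurrent writers, every conflict means redoing the prepare step, which is why throughput degrades under high contention rather than failing outright.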

The spec is getting too wide. This is the challenge nobody on a Summit stage will name directly, but Firebolt CEO Benjamin Wagner put it plainly at Iceberg Summit 2026: "A lean spec is a durable spec." (Firebolt is a cloud-native analytics engine built for sub-second query performance on Iceberg tables, with native support for aggregating indexes, text search, and vector search.) Iceberg v4 is pushing to embed indexing (vector search, full-text, expression indexes) and security (row-level, column-level) into the spec itself. The argument for this is portability: indexes that travel with the data when you switch engines. The problem is that spec-mandated formats trail best-of-breed by years. Vector search is evolving in almost every engine release cycle. Writing a specific index layout into a spec today, for engines to adopt years later, does not solve problems. It freezes innovation.

Every engine with rich indexes today (ClickHouse, Apache Doris, Spark, Pinot, Firebolt) uses its own managed format, not Iceberg. The right model is that indexes are engine artifacts, rebuildable from Iceberg at any time, while the format stays portable. The alternative is a world of fragmented support across engines, which is already happening. DuckLake exists precisely because engineers looked at Iceberg's complexity and chose to start over.

The agentic readiness gap. Iceberg was built for human analysts querying one table at a time. Agents fan out across 20 tables, take action, and move on without human review. Three capabilities break immediately: fast retrieval (vector and text search at sub-second latency), semantic context (agents need to understand what CUST_NM and rev_adj_us_gaap actually mean), and security (row-level and column-level access control that travels with the data). A 2025 ACL study measured text-to-SQL accuracy at 95% on clean benchmarks but just 39% on real enterprise schemas with 4,000+ columns and ambiguous naming. Same models, same capability. The schema got real. The failure mode is particularly dangerous: syntactically correct SQL that returns semantically wrong answers. It runs. It returns numbers. You have no way to know if the agent is telling the truth. None of these are format problems. They are engine and catalog problems that the Iceberg community is currently trying to absorb into the spec, and the cure may be worse than the disease.

Delta Lake's Challenges

The Databricks dependency is still real. Despite the open-sourcing in 2022, there is a material gap between Delta-on-Databricks and Delta-on-open-source. Key features like liquid clustering and predictive optimization remain Databricks-specific. Organizations running Delta outside Databricks get a subset of the feature set and a subset of the performance.

Multi-engine support lags Iceberg. Delta was designed with Spark as the primary engine. Support for Trino, Flink, DuckDB, and other engines exists but is less mature than Iceberg's native multi-engine support. The Delta Kernel project is helping close this gap, but Iceberg still has a structural advantage in engine breadth.

UniForm adds complexity, not just simplicity. UniForm is a clever solution to format interoperability, but it adds an asynchronous metadata generation layer that must be monitored. When UniForm metadata generation fails or lags, external engines reading the table as Iceberg see stale data. Debugging UniForm issues requires understanding both Delta and Iceberg metadata internals.

The convergence narrative creates decision paralysis. Databricks is simultaneously telling customers "Delta is the best write format" and "Iceberg is fully supported in Databricks." Should you standardize on Delta and use UniForm for interop? Write Iceberg natively? Wait for further convergence? The "both are fine" message, while technically true, does not help organizations that need to make pipeline decisions today.

Where Format Performance Still Matters

Within 18 months, format performance differences will be negligible for 95% of enterprise workloads. But today, in the remaining 5%:

  • Iceberg wins on massive table scans with partition pruning (tables with thousands of partitions and millions of files), hidden partition evolution, and consistent multi-engine read performance.
  • Delta wins on update-heavy streaming and near-real-time workloads (high-frequency small commits, CDC), and on Databricks-optimized workloads where proprietary features like Photon and liquid clustering apply.

The prediction that matters: the real performance differentiator going forward is not the format. It is the catalog's maintenance layer. Automated compaction, statistics generation, storage optimization, and query planning hints from the catalog will have a larger impact than the underlying format. This is why AWS S3 Tables (3x query performance claim), Unity Catalog's predictive optimization, and Polaris's Table Maintenance Services matter more than Delta-vs-Iceberg benchmarks.


Part Four: Why the Catalog Is the New Lock-In

Here is what a catalog actually does in a modern lakehouse, stripped to essentials.

A catalog is the system of record that maps a table name to the location of the current metadata file for that table. When a query engine wants to read or write to a table, it asks the catalog two questions: where is this table, and am I allowed to do what I am about to do. Everything else (the storage format, the file layout, the engine) is downstream of those two answers.
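
Stripped to that essence, a catalog fits in a few lines (hypothetical names, not any real catalog's API), which is exactly why it is the chokepoint: every read and write passes through these two answers.

```python
# Minimal sketch of what a lakehouse catalog does (hypothetical names, not
# any real catalog's API): map a table name to its current metadata
# location, and decide whether the caller may do what it is about to do.

class Catalog:
    def __init__(self):
        self.tables = {}   # table name -> current metadata file location
        self.grants = {}   # (principal, table) -> set of allowed actions

    def load_table(self, principal, name, action):
        # Question 2: am I allowed to do what I am about to do?
        if action not in self.grants.get((principal, name), set()):
            raise PermissionError(f"{principal} may not {action} {name}")
        # Question 1: where is this table right now?
        return self.tables[name]

cat = Catalog()
cat.tables["sales.orders"] = "s3://lake/orders/metadata/00042.metadata.json"
cat.grants[("analyst", "sales.orders")] = {"read"}

assert cat.load_table("analyst", "sales.orders", "read").endswith(".metadata.json")
try:
    cat.load_table("analyst", "sales.orders", "write")
    assert False, "expected a PermissionError"
except PermissionError:
    pass
```

Everything a real catalog adds on top of this (credential vending, masking, semantics, AI asset governance) is elaboration of those same two answers, which is why migrating catalogs means re-implementing all of it.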

That makes the catalog the chokepoint. And the numbers show why this fight matters.

The data catalog market is valued at $1.3-2.5 billion in 2025, projected to reach $5-10 billion by 2032-2035. The broader data governance market is $5.4 billion, heading to $18 billion by 2032. Gartner predicts active metadata adoption will grow 70-75% by 2027. These are not abstract projections. They represent real enterprise spending decisions being made right now.

The Modern Lakehouse Stack

Whoever owns the catalog owns four things:

Governance. Who can read, who can write, what columns are masked, what row-level filters apply, what audit log gets written. Every modern catalog has its own RBAC model. Once your access policies are encoded in one catalog's model, migrating to another means re-implementing your entire governance layer. With 8 new US state privacy laws taking effect in 2025 and 3 more in 2026, on top of GDPR, CCPA/CPRA, SOX, and HIPAA, this is not optional infrastructure.

Engine compatibility. Which query engines can talk to your data. The Iceberg REST API is supposed to make this open, but in practice every catalog has its own extensions for things the spec does not yet cover, particularly around security and credential vending. Your catalog determines which of Spark, Trino, Snowflake, Databricks SQL, Flink, DuckDB, Presto, Dremio, and the long tail of new engines can actually read your tables in production.

Semantics and business logic. This is the new front. Databricks just made Unity Catalog Business Semantics generally available, with the explicit goal of making business definitions of revenue, customer, and risk live at the catalog layer rather than in BI tools. Snowflake is pushing semantic interoperability as part of its April 2026 governance portability announcement. Whoever owns the semantic layer owns where business logic lives.

AI and agent context. Unity Catalog's recent additions explicitly include Tool Catalogs for generative AI agents. The catalog is being positioned as the governance layer not just for tables but for AI assets: models, functions, prompts, agent tools. If your AI agents authenticate through and pull context from a particular catalog, that catalog becomes the foundation of your AI architecture, not just your data architecture.

Take those four together and the catalog decision carries 10x the strategic weight of the format decision. A format change requires a one-time rewrite. A catalog change requires re-implementing governance, re-certifying engines, re-encoding semantics, and re-anchoring your AI stack.


The format war is over. The catalog war just started. The question is no longer which format. It is which catalog mediates access to all your data going forward.

In Part 2, we compare the three contenders, name the convergence boundary where lock-in actually lives, and make a defensible bet.


Technology Reference

A quick reference for technologies, projects, and standards mentioned in this post.

Table Formats

  • Apache Iceberg: Open table format for huge analytic datasets. Engine-agnostic, ACID transactions, schema evolution, hidden partitioning. Created at Netflix by Ryan Blue. (iceberg.apache.org)
  • Delta Lake: Open table format from Databricks. Transaction log-based, optimized for streaming and high-frequency writes. Open-sourced in 2022. (delta.io)
  • Apache Hudi: Open table format originated at Uber. Focused on change data capture and incremental processing. Third format, declining relative share. (hudi.apache.org)
  • Apache Parquet: Columnar storage file format. The on-disk layer underneath Iceberg, Delta, and Hudi; all three store data as Parquet files. (parquet.apache.org)
  • DuckLake: Lightweight open table format by DuckDB Labs. Uses PostgreSQL for metadata, Parquet for data. Created as a simpler alternative to Iceberg's complexity. (ducklake.select)

Catalogs

  • Apache Polaris: Apache Top-Level Project (Feb 2026). Open-source Iceberg REST catalog with vendor-neutral governance, credential vending, and table maintenance. Reference implementation of the Iceberg REST spec. Originally contributed by Snowflake. (polaris.apache.org)
  • Unity Catalog: Databricks catalog. Governs tables, ML models, AI tools, and notebooks in a single namespace. Open-sourced under Apache 2.0 (June 2024). Over 10,000 enterprises in production. (unitycatalog.io)
  • AWS Glue Data Catalog: AWS-native metastore. Default catalog for Athena, EMR, and Redshift Spectrum. 39.3% adoption (market leader). AWS-only. (aws.amazon.com/glue)
  • Hive Metastore: Legacy catalog that both Iceberg and Delta were originally built to replace. Still widely deployed. (hive.apache.org)

Query Engines

  • Snowflake: Cloud data warehouse. Polaris sponsor and contributor. (snowflake.com)
  • Databricks: Unified analytics and AI platform. Creator of Delta Lake, acquirer of Tabular ($2B), and Unity Catalog steward. (databricks.com)
  • Firebolt: Cloud-native analytics engine for sub-second Iceberg queries. Native aggregating indexes, text search, vector search. (firebolt.io)
  • Apache Spark: Unified engine for large-scale data processing. Native Iceberg and Delta support. (spark.apache.org)
  • Trino / Presto: Distributed SQL query engines for federated analytics across data sources. (trino.io)
  • Apache Flink: Stream and batch processing engine. Iceberg connector for streaming writes. (flink.apache.org)
  • DuckDB: Embedded analytical database. In-process SQL, REST catalog support. (duckdb.org)

Interoperability Standards and Tools

  • Iceberg REST Catalog API: OpenAPI specification for catalog operations. The de facto universal interface for catalogs. (iceberg.apache.org/spec)
  • Delta UniForm: Databricks feature exposing Delta tables as Iceberg metadata, so external engines can read them as Iceberg without copying data. (docs.databricks.com)
  • Apache XTable: Cross-format translation (Iceberg, Delta, Hudi) without data copying. Formerly OneTable. (xtable.apache.org)