This Apache Iceberg review covers the open table format that has become the dominant choice for modern data lakehouses in 2026. Iceberg is licensed under Apache 2.0 and designed for huge analytic datasets, with schema evolution, time travel, and multi-engine querying across Spark, Trino, Flink, and Snowflake. Originally developed at Netflix and now a graduated top-level Apache project, Iceberg has emerged as the front-runner in the lakehouse-format battle against Delta Lake and Hudi. We evaluated it against the broader data-warehouse and lakehouse ecosystem to answer the real question: when is Iceberg the right foundation for your data lake, and when should you use an alternative?
Overview
Apache Iceberg is a table format — not a database, not an engine, not a storage system. It sits between your object storage (S3, GCS, Azure Blob) and your query engines (Spark, Trino, Flink, Snowflake, Databricks) as a metadata layer that makes raw Parquet files behave like database tables. This positioning is fundamental: Iceberg doesn't replace your data warehouse — it lets you query object-storage data with warehouse-like guarantees (ACID transactions, schema evolution, time travel) at object-storage economics.
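To make the layering concrete, here is a minimal sketch of pointing Spark at an Iceberg catalog. It assumes the Iceberg Spark runtime jar is on the classpath; the catalog name (`lake`), REST endpoint, bucket, and namespace are hypothetical placeholders, and a Glue or Hive catalog would swap in a different `type` setting.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-review-demo")
    # Iceberg's SQL extensions enable MERGE INTO, partition-evolution DDL,
    # and the CALL procedures used later in this review.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register a catalog named "lake" backed by an Iceberg REST catalog.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")  # hypothetical endpoint
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")  # hypothetical bucket
    .getOrCreate()
)

# From here, raw Parquet files in the bucket behave like database tables.
spark.sql("SHOW NAMESPACES IN lake").show()
```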
The format matters because it solves a problem that has dogged data lakes since the Hadoop era. Before Iceberg (and its peers Delta Lake and Hudi), querying object-storage data required choosing between cheap raw Parquet (fast to read, painful to update) and expensive proprietary warehouses. Iceberg gives you warehouse semantics on cheap storage. Target audience: data engineering teams running Spark-based pipelines, organizations consolidating multi-engine analytics (Snowflake + Spark + Trino reading the same tables), and teams leaving proprietary warehouses for cost reasons.
Key Features and Architecture
Iceberg's architecture is the defining feature. A table is a tree of metadata: a table metadata file points to manifest lists (one per snapshot), which point to manifest files, which track the data files (Parquet, ORC, or Avro). Every commit creates a new snapshot — the table's metadata at that moment. This design enables time travel (query any historical snapshot by ID or timestamp), incremental reads (scan only files added since snapshot X), and hidden partitioning (partition by month automatically, without users writing WHERE partition_date = ...).
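A sketch of what the snapshot model looks like from Spark SQL, continuing the hypothetical `lake` catalog above; the table name and snapshot IDs are placeholder values, not real ones.

```python
# Hidden partitioning: the table is partitioned by month of ts, but readers
# filter on ts directly and Iceberg prunes partitions for them.
spark.sql("""
    CREATE TABLE lake.analytics.events (
        id INT, ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (months(ts))
""")
spark.sql(
    "SELECT count(*) FROM lake.analytics.events "
    "WHERE ts >= TIMESTAMP '2026-02-01 00:00:00'"  # prunes to matching months
).show()

# Time travel: query a historical snapshot by ID or by timestamp.
spark.sql("SELECT * FROM lake.analytics.events VERSION AS OF 5897183625154550000")
spark.sql("SELECT * FROM lake.analytics.events TIMESTAMP AS OF '2026-01-01 00:00:00'")

# Incremental read: scan only the files committed between two snapshots.
increment = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5897183625154550000")  # exclusive
    .option("end-snapshot-id", "6032771509127224000")    # inclusive
    .load("lake.analytics.events")
)
```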
Schema evolution is genuinely robust: because Iceberg tracks columns by unique ID rather than by name or position, you can add, drop, rename, and reorder columns and change types (within safe conversions), all without rewriting existing data. This is table stakes for production data engineering, and Iceberg handles it far better than traditional Hive-style tables. Multi-engine querying means the same Iceberg table can be read by Spark, Trino, Flink, Snowflake, Athena, Databricks, and more — each engine implements the Iceberg spec directly rather than going through a vendor-specific format.
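In Spark SQL these are all plain DDL statements. A short sketch, continuing the hypothetical table above:

```python
# Each statement rewrites table metadata only; existing data files are untouched.
spark.sql("ALTER TABLE lake.analytics.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE lake.analytics.events RENAME COLUMN payload TO body")
spark.sql("ALTER TABLE lake.analytics.events ALTER COLUMN id TYPE BIGINT")  # safe widening: int -> bigint
spark.sql("ALTER TABLE lake.analytics.events DROP COLUMN country")
```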
Row-level operations (MERGE INTO, DELETE, UPDATE) are supported via two strategies: copy-on-write (rewrite affected files up front; slower writes, faster reads) or merge-on-read (write delete files; faster writes, but reads pay a merge cost until compaction). Partition evolution lets you change the partition strategy without rewriting existing data — historically one of the hardest operations in data engineering. Metadata scaling handles tables with millions of partitions and billions of rows without metadata operations becoming the bottleneck.
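A sketch of both ideas against the same hypothetical table; the write-mode properties and the one-row `updates` view are illustrative only:

```python
# Pick a write strategy per operation type via table properties:
# merge-on-read favors write latency, copy-on-write favors read latency.
spark.sql("""
    ALTER TABLE lake.analytics.events SET TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.delete.mode' = 'copy-on-write'
    )
""")

# A hypothetical one-row source of changes, staged as a temp view.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW updates AS
    SELECT CAST(1 AS BIGINT) AS id,
           TIMESTAMP '2026-02-01 00:00:00' AS ts,
           'updated payload' AS body
""")
spark.sql("""
    MERGE INTO lake.analytics.events t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Partition evolution: move from monthly to daily partitions. Old files keep
# the old layout; only new writes use the new spec, so nothing is rewritten.
spark.sql("ALTER TABLE lake.analytics.events REPLACE PARTITION FIELD months(ts) WITH days(ts)")
```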
Ideal Use Cases
Best for:
- Data engineering teams building modern lakehouses who want warehouse semantics on object-storage economics. Iceberg is the leading open format for this pattern.
- Multi-engine analytics organizations running Spark for ETL, Trino for ad-hoc queries, and Snowflake for BI on the same underlying data. Iceberg lets all three read the same tables.
- Teams migrating off proprietary warehouses (Snowflake, Redshift, BigQuery) for cost reasons — Iceberg on S3 plus a query engine can be dramatically cheaper at large scale.
- Streaming data pipelines needing ACID guarantees — Iceberg with Flink or Spark Structured Streaming handles exactly-once ingestion with row-level updates; see the sketch after this list.
- Organizations with strict data governance needs — Iceberg's snapshot model plus row-level operations plus schema evolution give you the audit trail enterprise governance requires.
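For the streaming case above, a minimal Spark Structured Streaming sketch, assuming the hypothetical Kafka brokers, topic, and checkpoint path below; the Iceberg sink commits each micro-batch as one atomic snapshot, and the checkpoint gives exactly-once appends.

```python
# Read a hypothetical Kafka topic and append to the Iceberg table from earlier.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .selectExpr(
        "CAST(CAST(key AS STRING) AS BIGINT) AS id",
        "timestamp AS ts",
        "CAST(value AS STRING) AS body",
    )
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")  # one snapshot per micro-batch
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")  # hypothetical path
    .toTable("lake.analytics.events")
)
```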
Not suitable for:
- Small teams without platform engineering depth — running Iceberg requires a query engine, a catalog (AWS Glue, Hive Metastore, or an Iceberg REST catalog), and ongoing optimization jobs (file compaction, orphan cleanup). Realistically, it needs dedicated data platform engineers.
- OLTP workloads or sub-second analytics — Iceberg is batch/streaming-oriented. For low-latency analytics, a purpose-built OLAP engine (ClickHouse, Apache Druid) is better.
- Organizations deeply committed to a single vendor — Delta Lake plus Databricks or BigQuery-native tables plus GCP make more sense if you've already standardized. Iceberg's multi-engine advantage doesn't pay off in single-engine shops.
- Teams needing mature commercial support without Databricks — Iceberg's commercial ecosystem is catching up but still behind Delta Lake's Databricks-backed tooling.
Pricing and Licensing
Apache Iceberg is free and open source under Apache 2.0 — there's no license cost for the table format itself. Your cost comes from three components: query engines (Spark, Trino, Flink, or managed services), object storage (S3, GCS, Azure Blob), and catalog infrastructure (AWS Glue Data Catalog, Hive Metastore, or an Iceberg REST catalog).
| Component | Pricing Shape | Typical Cost |
|---|---|---|
| Iceberg format | Free (Apache 2.0) | $0 |
| Query engines (self-hosted) | Free (Spark, Trino, Flink are all open source) | Infrastructure + ops time |
| Query engines (managed) | AWS Athena ($5/TB scanned), Snowflake (credit-based), Databricks (DBU-based) | Varies |
| Object storage | S3 ($0.023/GB/month), GCS (similar), Azure Blob (similar) | Scales with data volume |
| Catalog infrastructure | AWS Glue ($1 per million requests beyond the free tier), Hive Metastore (self-hosted), or Iceberg REST catalog | $10-$100/month typical |
The practical cost is dominated by query engines and storage. Tabular (acquired by Databricks in 2024) offers managed Iceberg services; Dremio offers a commercial managed lakehouse on Iceberg; cloud vendors (AWS, GCP, Azure) all have Iceberg-native query services. Total cost of ownership varies wildly based on query volume and storage scale — a small team can run Iceberg on $50/month (S3 plus Athena); a large enterprise can spend $100K+/month on Iceberg-based analytics.
Pros and Cons
Pros:
- Open format with broad engine support — Spark, Trino, Flink, Snowflake, Athena, Databricks all read Iceberg natively.
- Robust schema evolution handles real-world production change patterns cleanly.
- Time travel and snapshots enable reproducible analytics and rollback.
- Hidden partitioning removes a major source of user error versus Hive-style partitioning.
- Apache 2.0 license — no vendor lock-in at the format level.
- Row-level operations (MERGE, DELETE, UPDATE) handle GDPR and compliance requirements.
Cons:
- Operational complexity is real — file compaction, orphan cleanup, and snapshot expiration all need automation; see the maintenance sketch after this list.
- Commercial support ecosystem is less mature than Delta Lake's Databricks-backed tooling.
- Small-file problem isn't solved by default — streaming writes create many small files that need compaction.
- Catalog choices are confusing — AWS Glue vs Hive Metastore vs REST catalog each have trade-offs.
- Performance tuning requires expertise — partition design, sort order, and file size all affect query performance.
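The maintenance burden in the first con is concrete but automatable. A sketch using Iceberg's built-in Spark stored procedures against the hypothetical table from earlier; in production these run as scheduled jobs.

```python
# Compact small files into target-sized ones (the streaming small-file fix).
spark.sql("CALL lake.system.rewrite_data_files(table => 'analytics.events')")

# Expire old snapshots so the data files they pin can be reclaimed.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2026-01-01 00:00:00',
        retain_last => 10
    )
""")

# Remove files in the table location that no snapshot references.
spark.sql("CALL lake.system.remove_orphan_files(table => 'analytics.events')")
```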
Alternatives and How It Compares
Iceberg is the emerging standard, but it's not the only game in town.
- Delta Lake — the Databricks-originated alternative. Similar feature set; the main architectural difference is Delta's ordered transaction log versus Iceberg's snapshot-based metadata tree. Choose Delta Lake when you're committed to Databricks; choose Iceberg when you want multi-engine flexibility.
- Apache Hudi — the streaming-first lakehouse format. Stronger for CDC and real-time ingestion; weaker for ad-hoc analytics. Choose Hudi when streaming upserts dominate your workload.
- Snowflake — the proprietary warehouse competitor. Managed, polished, expensive. Choose Snowflake when you want zero operational overhead and budget matches; Iceberg is cheaper at large scale.
- Databricks — the commercial platform that owns Delta Lake and, as of 2024, Tabular (Iceberg). Works natively with both. Choose Databricks when you want a managed lakehouse and will commit to their ecosystem.
- Google BigQuery — the serverless warehouse alternative, now with native Iceberg support. A hybrid play: BigQuery can read Iceberg tables through BigLake alongside its own native table format.
- Trino — an SQL query engine often paired with Iceberg. Trino doesn't replace Iceberg; it queries it. Pair them for an open-source federated analytics stack.
Iceberg wins when multi-engine querying, open format, and enterprise governance matter. It loses to Delta Lake on Databricks-shop fit, to Snowflake on operational simplicity, and to Hudi on streaming-upsert-heavy workloads. For most greenfield data platforms in 2026, Iceberg is the right default — the ecosystem momentum is clearly in its favor.