Apache Iceberg

learn Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

concepts

  • data lakehouse - data architecture that blends a data lake and data warehouse together

data layer

  • stores the table's actual data; made up primarily of data files, plus delete files and Puffin files.
  • Iceberg is file-format agnostic and currently supports Apache Parquet, Apache ORC, and Apache Avro
    • Parquet is the most common underlying file format
  • delete files - track which records in the dataset have been deleted
    • there are two strategies for row-level updates and deletes: copy-on-write rewrites the affected data file as a new copy with the changes applied, so no delete file is written; merge-on-read writes a new file that contains only the changes (a delete file), which engines merge with the data files at read time.
    • two ways to identify a given row that needs to be removed from the logical dataset when an engine reads the dataset: either identify the row by its exact position in the dataset or identify the row by the values of one or more fields of the row.
      • positional delete files - identify logically deleted rows by the path of the data file and the row's position within that file
      • equality delete files - identify logically deleted rows by the values of one or more fields of the row
  • Puffin file format - stores statistics and indexes about the data in the table, which can speed up a broader range of queries than the statistics stored in the data files and metadata files alone.
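
To make the merge-on-read idea concrete, here is a minimal sketch in plain Python of how a reader might apply both kinds of delete files. It is an illustration only, not how any engine is implemented: `PositionalDelete`, `apply_deletes`, and the file paths are hypothetical names, rows are plain dicts, and details from the real format (such as sequence numbers that limit which data files a delete applies to) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionalDelete:
    """Hypothetical stand-in for one record of a positional delete file."""
    file_path: str  # data file that holds the deleted row
    pos: int        # zero-based position of the row inside that file


def apply_deletes(data_files, positional_deletes, equality_deletes):
    """Return the rows that are still live after applying delete files.

    data_files: dict mapping file path -> list of row dicts
    positional_deletes: list of PositionalDelete
    equality_deletes: list of dicts mapping column name -> deleted value
    """
    deleted_positions = {(d.file_path, d.pos) for d in positional_deletes}
    live = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            # Positional delete: match on (file, position).
            if (path, pos) in deleted_positions:
                continue
            # Equality delete: match when every listed column has the listed value.
            if any(all(row.get(col) == val for col, val in eq.items())
                   for eq in equality_deletes):
                continue
            live.append(row)
    return live


# Illustrative data only.
data_files = {
    "data/file-a.parquet": [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    "data/file-b.parquet": [{"id": 3, "name": "cy"}, {"id": 4, "name": "dee"}],
}
positional = [PositionalDelete("data/file-a.parquet", 1)]  # removes id 2
equality = [{"id": 3}]                                      # removes id 3

live_rows = apply_deletes(data_files, positional, equality)
print([row["id"] for row in live_rows])  # [1, 4]
```

The data files are never modified here; the deletes are only applied while reading, which is the trade-off merge-on-read makes: cheaper writes, more work per read.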

metadata layer

  • tree structure that tracks the data files, metadata about them, and the operations that produced them; this is what enables core features like time travel and schema evolution.
  • made up of three file types, all of which are stored in data lake storage: manifest files, manifest lists, and metadata files
  • manifest files - keep track of files in the data layer (data files and delete files) as well as additional details and statistics about each file. Puffin statistics files are referenced from the table's metadata file rather than from a manifest.
  • each manifest file keeps track of a subset of the data files. It records details such as partition membership, record count, and lower and upper bounds of columns, which engines use to skip files and read the remaining ones more efficiently.
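
As a rough sketch of why the per-file statistics matter, the Python below shows how a reader could use lower and upper column bounds to skip data files that cannot contain a value. `ManifestEntry` and `files_to_scan` are hypothetical names for illustration; a real manifest file is not a Python object and carries more fields than shown here.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    """Hypothetical, simplified view of one data file tracked by a manifest."""
    file_path: str
    partition: dict      # partition membership, e.g. {"day": "2024-01-01"}
    record_count: int
    lower_bounds: dict   # column name -> smallest value in the file
    upper_bounds: dict   # column name -> largest value in the file


def files_to_scan(entries, column, value):
    """Return paths of files that might contain rows where column == value."""
    keep = []
    for entry in entries:
        lo = entry.lower_bounds.get(column)
        hi = entry.upper_bounds.get(column)
        # Skip the file only when bounds exist and prove the value is absent.
        if lo is not None and hi is not None and not (lo <= value <= hi):
            continue
        keep.append(entry.file_path)
    return keep


# Illustrative entries only.
entries = [
    ManifestEntry("data/file-a.parquet", {"day": "2024-01-01"}, 100,
                  {"id": 1}, {"id": 100}),
    ManifestEntry("data/file-b.parquet", {"day": "2024-01-02"}, 100,
                  {"id": 101}, {"id": 200}),
]

print(files_to_scan(entries, "id", 150))  # ['data/file-b.parquet']
```

A file with no recorded bounds for the column is kept, since the statistics can only rule files out, never confirm a match.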

catalog

  • the Iceberg catalog is the central place to look up the current metadata pointer for a table
  • Within the catalog, there is a reference or pointer for each table to that table’s current metadata file.
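
The pointer idea can be sketched as a small in-memory mapping from table name to metadata file location. This is a toy model under stated assumptions: `InMemoryCatalog`, its methods, and the metadata paths are hypothetical, and the compare-and-swap in `commit` stands in for the atomic pointer update that lets writers replace the current metadata file safely. Real catalogs are backed by external systems, not a Python dict.

```python
import threading


class InMemoryCatalog:
    """Toy catalog: one pointer per table to its current metadata file."""

    def __init__(self):
        self._pointers = {}             # table name -> metadata file location
        self._lock = threading.Lock()   # makes the swap below atomic

    def current_metadata(self, table):
        """Return the location of the table's current metadata file, or None."""
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


# Illustrative table name and paths only.
catalog = InMemoryCatalog()
created = catalog.commit("db.orders", None, "metadata/v1.metadata.json")
updated = catalog.commit("db.orders", "metadata/v1.metadata.json",
                         "metadata/v2.metadata.json")
# A writer still holding v1 loses the race and must retry against v2.
stale = catalog.commit("db.orders", "metadata/v1.metadata.json",
                       "metadata/v3.metadata.json")

print(created, updated, stale)                # True True False
print(catalog.current_metadata("db.orders"))  # metadata/v2.metadata.json
```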

Diagrams

iceberg architecture

parquet file architecture

resources