Querying Apache Iceberg Tables
The SELECT command queries table data and table metadata, and can query by timestamp and snapshot ID.
Dremio supports only the copy-on-write storage mechanism and reads only the latest data files for each Iceberg v2 table that you run SQL commands against. Dremio does not support Iceberg v2 tables that have merge-on-read manifests.
Querying a Table's Data
{{< codeheader "Syntax" >}}
SELECT [ <column1_name> [ , <column2_name> ... ] | * ]
FROM <table_path>.<table_name>
[WHERE where_condition]
[constructs]
Parameters
{{< sql-section file="data/sql/apache-iceberg-tables.json" data="queryingATable" >}}
Metadata Queries
Iceberg includes helpful system-table references which provide an easy access to Iceberg-specific information on tables, including:
- The data files for a table
- The history of a table
- The manifest files for a table
- The snapshots for a table
Querying a Table's Data File Metadata
Queries use the table_files() function.
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT *
FROM TABLE( table_files('<table_path>.<table_name>') )
Dremio returns records that have these fields:
| Column | Data Type | Description |
|---|---|---|
| file_path | VARCHAR | Full file path and name |
| file_format | VARCHAR | Format, e.g. PARQUET |
| record_count | BIGINT | Number of rows |
| file_size_in_bytes | BIGINT | Size of file |
| column_sizes | VARCHAR | List of columns with size of each column |
| value_counts | VARCHAR | List of columns with number of records with a value |
| null_value_counts | VARCHAR | List of columns with number of records as NULL |
| nan_value_counts | VARCHAR | List of columns with number of records as NaN |
| lower_bounds | VARCHAR | List of columns with lower bound of each |
| upper_bounds | VARCHAR | List of columns with upper bound of each |
| key_metadata | VARCHAR | Key metrics |
| split_offsets | VARCHAR | Split offsets |
Parameters
{{< sql-section file="data/sql/apache-iceberg-tables.json" data="metadataQueries" >}}
Querying a Table's History Metadata
Queries use the table_history() function.
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT *
FROM TABLE( table_history('<table_path>.<table_name>') )
Dremio returns records that have these fields:
| Column | Data Type | Description |
|---|---|---|
| made_current_at | TIMESTAMP | The timestamp the Iceberg snapshot was made at |
| snapshot_id | VARCHAR | The Iceberg snapshot ID |
| parent_id | VARCHAR | The parent snapshot ID, null if not exists |
| is_current_ancestor | BOOLEAN | If the snapshot is part of the current history, shows abandoned snapshots |
Example
{{< codeheader "Example with table_history()" >}}
SELECT *
FROM TABLE( table_history('myTable'))
WHERE snapshot_id = 4593468819579153853
Querying a Table's Manifest File Metadata
Queries use the table_manifests() function.
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT *
FROM TABLE( table_manifests('<table_path>.<table_name>') )
Dremio returns records that have these fields:
| Column | Data Type | Description |
|---|---|---|
| path | VARCHAR | Full path and name of manifest file |
| length | BIGINT | Size in bytes |
| partition_spec_id | VARCHAR | |
| added_snapshot_id | VARCHAR | ID of snapshot added to manifest |
| added_data_files_count | BIGINT | Number of new data files added |
| existing_data_files_count | BIGINT | Number of existing data files |
| deleted_data_files_count | BIGINT | Number of files removed |
| partition_summaries | VARCHAR | Partition information |
Querying a Table's Snapshot Metadata
Queries use the table_snapshot() function.
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT *
FROM TABLE( table_snapshot('<table_path>.<table_name>') )
Dremio returns records that have these fields:
| Column | Data Type | Description |
|---|---|---|
| committed_at | TIMESTAMP | The timestamp the Iceberg snapshot was committed |
| snapshot_id | VARCHAR | The Iceberg snapshot ID |
| parent_id | VARCHAR | The parent snapshot ID, null if not exists |
| operation | VARCHAR | The Iceberg operation (e.g. append) |
| manifest_list | VARCHAR | List of manifest files for the snapshot |
| summary | VARCHAR | Additional attributes (records added, etc) |
Example
{{< codeheader "Finds the number of snapshots for a table" >}}
SELECT count(*)
FROM TABLE( table_snapshot('myTable'))
GROUP BY snapshot_id
Time Travel Queries
Iceberg tables support both TIMESTAMP-based references and Snapshot ID-based references to specify an earlier version of a table to read.
Time Travel by Timestamps
The query returns an error stating that the reference for this table was out of range if either of these two conditions is true:
- The value of the timestamp is some time in the future, even by one second.
- The value of the timestamp is older than the oldest valid snapshot for the table.
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT <column_name>
FROM <table_path>.<table_name> AT <timestamp>
Parameters
{{< sql-section file="data/sql/apache-iceberg-tables.json" data="timeTravelTimestamps" >}}
Example
{{< codeheader "Time travel query on an Iceberg table using a timestamp" >}}
SELECT count(*)
FROM my_table AT TIMESTAMP '2022-07-01 01:30:00.000'
Time Travel by Snapshot ID
To run a SELECT command, the user must have Dremio's SELECT privilege.
{{< codeheader "Syntax" >}}
SELECT <column_name>
FROM <table_path>.<table_name> AT SNAPSHOT '<snapshot-id>'
Parameters
{{< sql-section file="data/sql/apache-iceberg-tables.json" data="timeTravelSnapshots" >}}
Example
{{< codeheader "Time travel query on an Iceberg table using a snapshot ID" >}}
SELECT *
FROM myTable AT SNAPSHOT '5393090506354317772'