Overview
Apache Iceberg has revolutionized how we handle big data tables, enabling efficient table management with features like partition evolution, time travel, and atomic operations. Combining Iceberg with the AWS Glue Data Catalog and Amazon Athena simplifies data lake workflows in modern cloud environments.
In this blog, we will explore:
1. Registering Iceberg tables (partitioned and unpartitioned) in the AWS Glue Data Catalog using Amazon Athena and AWS Glue.
2. Performing UPSERT operations with AWS Glue and Amazon Athena.
3. Enabling and leveraging Time Travel in Iceberg tables.
Registering Iceberg Tables in the AWS Glue Catalog
Iceberg tables can be partitioned or unpartitioned, and registering them in the AWS Glue Data Catalog allows Amazon Athena and AWS Glue ETL jobs to query and manipulate them.
a. Registering Unpartitioned Iceberg Tables
To register an unpartitioned Iceberg table in the AWS Glue Catalog, follow these steps:
Step 1: Create the Table in Amazon Athena
Iceberg tables can be created using Amazon Athena’s SQL interface:
CREATE TABLE glue_catalog.database_name.unpartitioned_table (
  id BIGINT,
  name STRING,
  age INT
)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
This command:
- Creates an unpartitioned Iceberg table in the AWS Glue Catalog.
- Declares the table format as Iceberg via the table_type table property.
Step 2: Verify Registration
Confirm that the table appears under the appropriate database in the AWS Glue console.
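The registration can also be verified programmatically through the AWS Glue API; a minimal boto3 sketch, assuming default AWS credentials and the example database and table names used above:

import boto3

# Minimal sketch: confirm the table is registered in the AWS Glue Data Catalog.
glue = boto3.client("glue")

table = glue.get_table(DatabaseName="database_name", Name="unpartitioned_table")["Table"]

print(table["Name"])
print(table.get("Parameters", {}).get("table_type"))       # expected: ICEBERG
print(table.get("StorageDescriptor", {}).get("Location"))  # the S3 location set above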
Step 3: Query the Table with Amazon Athena
Test the table with simple queries in Athena:
SELECT * FROM glue_catalog.database_name.unpartitioned_table;
b. Registering Partitioned Iceberg Tables
Partitioned tables allow efficient queries by reducing data scanning. To register a partitioned Iceberg table:
Step 1: Create the Partitioned Table
CREATE TABLE glue_catalog.database_name.partitioned_table (
  id BIGINT,
  name STRING,
  age INT
)
PARTITIONED BY (age)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
- The PARTITIONED BY clause defines the partition key (age in this example).
Step 2: Load Data
Data can be inserted using Amazon Athena’s SQL:
INSERT INTO glue_catalog.database_name.partitioned_table
VALUES (1, 'Alice', 25), (2, 'Bob', 30);
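Bulk loads are more commonly done from Spark or an AWS Glue job than with INSERT statements; a minimal PySpark sketch of the same load, assuming a Spark session already configured with the glue_catalog catalog (as in the ETL example later in this post):

from pyspark.sql.types import StructType, StructField, LongType, StringType, IntegerType

# Minimal sketch: append rows to the partitioned Iceberg table from a DataFrame.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([(3, "Carol", 28), (4, "Dave", 30)], schema)

# DataFrameWriterV2 append; Iceberg routes the rows to the correct "age" partitions.
df.writeTo("glue_catalog.database_name.partitioned_table").append()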
Step 3: Verify Partitions
Note that SHOW PARTITIONS is not supported for Iceberg tables in Amazon Athena; list the partitions by querying the partitions metadata table instead:

SELECT * FROM "database_name"."partitioned_table$partitions";
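The same partition information is also exposed to Spark and AWS Glue ETL jobs through Iceberg's partitions metadata table; a minimal PySpark sketch, assuming the glue_catalog session from the ETL example below:

# Minimal sketch: inspect per-partition record and file counts via Iceberg metadata.
spark.sql("""
SELECT partition, record_count, file_count
FROM glue_catalog.database_name.partitioned_table.partitions
""").show(truncate=False)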
c. Registering Iceberg Tables via AWS Glue ETL Jobs
For AWS Glue ETL jobs to manage Iceberg tables:
- Use AWS Glue version 3.0 or later.
- Enable Iceberg support for the job, typically by setting the --datalake-formats job parameter to iceberg (or by supplying a specific Iceberg runtime JAR via --extra-jars if required); a sketch of the job configuration follows this list.
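As a rough illustration of that job-level configuration, the parameter can be set when creating the job through the AWS SDK; the job name, IAM role, and script location below are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Minimal sketch: create a Glue 4.0 job with native Iceberg support enabled.
glue.create_job(
    Name="iceberg-etl-job",                                   # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",        # hypothetical IAM role
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/iceberg_job.py",  # hypothetical script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Enables the Iceberg libraries that ship with AWS Glue 3.0+.
        "--datalake-formats": "iceberg",
    },
)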
Example PySpark Script
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
import sys from pyspark.sql import SparkSession spark = SparkSession.builder \ .appName("Glue-Iceberg-Table") \ .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \ .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/path/") \ .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \ .getOrCreate() #Create an Iceberg table spark.sql(""" CREATE TABLE glue_catalog.database_name.etl_table ( id BIGINT, name STRING, age INT ) USING iceberg PARTITIONED BY (age) LOCATION 's3://your-bucket/path/' TBLPROPERTIES ("format-version"="2") """) |
Performing UPSERT Operations
Iceberg tables support UPSERTs through the MERGE INTO command, which applies UPDATEs and INSERTs as a single atomic operation; format-version 2 tables can additionally use merge-on-read so that row-level changes are written as delete files instead of rewriting whole data files.
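Whether Iceberg uses copy-on-write or merge-on-read for these operations is controlled by table write properties; a minimal Spark SQL sketch of opting a format-version 2 table into merge-on-read, using the example table names from this post:

# Minimal sketch: switch a format-version 2 table to merge-on-read for MERGE/UPDATE/DELETE.
spark.sql("""
ALTER TABLE glue_catalog.database_name.target_table SET TBLPROPERTIES (
    'write.merge.mode'  = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.delete.mode' = 'merge-on-read'
)
""")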
Using Glue ETL for UPSERTs
Step 1: Load the Delta Data
Delta data (new or updated records) can be loaded into a Spark DataFrame.
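A minimal sketch of this step, assuming the incremental records arrive as Parquet files under a hypothetical S3 prefix and are staged under the delta_table name referenced by the MERGE below:

# Minimal sketch: read the incremental (delta) records and stage them as an Iceberg table.
# The S3 prefix is a hypothetical example; point it at your own landing location.
delta_df = spark.read.parquet("s3://your-bucket/incoming/")

# Materialise the delta as the source table used by the MERGE INTO statement below.
delta_df.writeTo("glue_catalog.database_name.delta_table").createOrReplace()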
Step 2: Perform the Merge
Iceberg uses the MERGE INTO SQL command for upserts.
spark.sql("""
MERGE INTO glue_catalog.database_name.target_table t
USING glue_catalog.database_name.delta_table d
ON t.id = d.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
- WHEN MATCHED: Updates existing records.
- WHEN NOT MATCHED: Inserts new records.
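Note that in many Spark/Iceberg versions, MERGE INTO (and row-level UPDATE/DELETE) requires the Iceberg SQL extensions to be enabled on the session; a minimal sketch of the extra setting added to the session builder shown earlier:

from pyspark.sql import SparkSession

# Minimal sketch: enable Iceberg's SQL extensions so MERGE INTO / UPDATE / DELETE are available.
spark = SparkSession.builder \
    .appName("Glue-Iceberg-Upsert") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/path/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .getOrCreate()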
Using Amazon Athena for UPSERTs
1. Create the Iceberg Table (if not already created):
CREATE TABLE glue_catalog.database_name.iceberg_table (
  id INT,
  name STRING,
  updated_at TIMESTAMP
)
LOCATION 's3://amzn-s3-demo-bucket/iceberg-table/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
2. Load the New Data into a Staging Table (Optional): If your new data comes from Amazon S3, create an external table:
CREATE EXTERNAL TABLE glue_catalog.database_name.staging_table (
  id INT,
  name STRING,
  updated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://your-bucket/new-data/';
3. Execute MERGE INTO to Perform UPSERT:
MERGE INTO glue_catalog.database_name.iceberg_table AS target
USING glue_catalog.database_name.staging_table AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at) VALUES (source.id, source.name, source.updated_at);
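To run this UPSERT on a schedule (for example from AWS Lambda or Amazon EventBridge), the statement can be submitted through the Athena API; a rough boto3 sketch, where the workgroup, database, and output location are hypothetical placeholders:

import time

import boto3

athena = boto3.client("athena")

# The MERGE INTO statement shown above.
merge_sql = """
MERGE INTO glue_catalog.database_name.iceberg_table AS target
USING glue_catalog.database_name.staging_table AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET name = source.name, updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (source.id, source.name, source.updated_at)
"""

# Submit the MERGE and poll until Athena finishes executing it.
query = athena.start_query_execution(
    QueryString=merge_sql,
    QueryExecutionContext={"Database": "database_name"},
    WorkGroup="primary",  # hypothetical workgroup
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)

query_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(f"MERGE finished with state: {state}")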
Using Time Travel
Time travel is a powerful Iceberg feature that allows historical snapshots of the table to be accessed.
a. Querying Historical Snapshots
1. Version-based queries
Find the snapshot ID using the Iceberg metadata table:
SELECT * FROM glue_catalog.database_name.target_table.snapshots;
Query a snapshot using its ID (the numeric snapshot_id value from the metadata table):

SELECT * FROM glue_catalog.database_name.target_table
FOR SYSTEM_VERSION AS OF <snapshot_id>;
2. Timestamp-based queries
SELECT * FROM glue_catalog.database_name.target_table
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-12-01 00:00:00';
b. Using AWS Glue ETL for Time Travel
In AWS Glue ETL jobs, the same time travel clauses can be used in Spark SQL, or equivalent snapshot options can be passed to the DataFrame reader (see the sketch after the example below).
Example PySpark script:
# Query a historical snapshot
spark.sql("""
SELECT * FROM glue_catalog.database_name.target_table
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-12-01 00:00:00'
""").show()
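The same result can be achieved without SQL by passing Iceberg read options to the DataFrame reader; a minimal sketch, reusing the timestamp from the example above (the snapshot ID variant is a placeholder):

from datetime import datetime, timezone

# Minimal sketch: time travel through DataFrame read options instead of SQL.
# "as-of-timestamp" expects milliseconds since the Unix epoch.
ts_millis = int(datetime(2024, 12, 1, tzinfo=timezone.utc).timestamp() * 1000)

historical_df = (
    spark.read
    .option("as-of-timestamp", str(ts_millis))
    .table("glue_catalog.database_name.target_table")
)
historical_df.show()

# Alternatively, pin the read to a specific snapshot ID (value is a placeholder):
# spark.read.option("snapshot-id", "1234567890123456789").table("glue_catalog.database_name.target_table")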
Conclusion
Apache Iceberg’s feature set, combined with AWS Glue and Amazon Athena, gives modern data lake workflows scalability, transactional consistency, and strong query performance.
Drop a query if you have any questions regarding Apache Iceberg, AWS Glue or Amazon Athena and we will get back to you quickly.
About CloudThat
CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.
CloudThat is the first Indian Company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 650k+ professionals in 500+ cloud certifications and completed 300+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partners, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront, Amazon OpenSearch, AWS DMS and many more.
FAQs
1. How do I register an Iceberg table in the AWS Glue Catalog?
ANS: – Use Amazon Athena’s CREATE TABLE command to register the table, then verify in the AWS Glue console.
2. Can I perform UPSERT operations on Iceberg tables?
ANS: – Yes. You can use the MERGE INTO command from Spark/AWS Glue, or run MERGE INTO directly in Amazon Athena against an Iceberg table, optionally staging the new data in an external table first.

WRITTEN BY Rishi Raj Saikia
Rishi Raj Saikia works as a Sr. Research Associate on the Data & AI IoT team at CloudThat. He is a seasoned Electronics & Instrumentation engineer with experience in the telecom and petroleum industries. He also has deep knowledge of electronics, control theory and controller design, and embedded systems, along with PCB design skills for the relevant domains. He is keen on learning about new advancements in IoT devices, IIoT technologies, and cloud-based technologies.