Apache Iceberg Table — v1 Format — copy-on-write

BigDataEnthusiast
4 min read · Jul 8, 2023


In a previous blog we have already seen the different Iceberg table format versions & the write modes they support. Please refer to the link below.

In this blog we will explore the behavior of update/delete on a v1 format table, i.e. copy-on-write.

Step 1: Create table & append data.

I have created a v1 format table (without partitioning) & appended some data from a Spark application.

import pyspark
from pyspark.sql import SparkSession

conf = (
    pyspark.SparkConf()
    .setAppName('iceberg')
    .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0')
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.iceberg', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.iceberg.type', 'hadoop')
    .set('spark.sql.catalog.iceberg.warehouse', 'iceberg-warehouse')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()

ddl = """CREATE TABLE iceberg.nyc_yellowtaxi_tripdata_v1 (
    vendorid bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    ratecodeid double,
    store_and_fwd_flag string,
    pulocationid bigint,
    dolocationid bigint,
    payment_type bigint,
    fare_amount double,
    extra double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
    congestion_surcharge double,
    airport_fee double
) USING iceberg"""

spark.sql(ddl)

df = spark.read.parquet("/home/docker/data/*")
df.writeTo("iceberg.nyc_yellowtaxi_tripdata_v1").append()

Table iceberg.nyc_yellowtaxi_tripdata_v1 is created; let's explore its metadata files.

Metadata Files

As we have performed an append operation, the 1st snapshot (7016469376893257879) got created. See the excerpt below from the metadata.json file.

Two data files were added via this append. The metadata file also provides details such as the count of data files added, the added record count, etc.

metadata.json
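The same details can be pulled out of metadata.json programmatically. Below is a minimal sketch in plain Python (no Spark needed); the snapshot structure and summary keys (`operation`, `added-data-files`) follow the Iceberg v1 spec, but the sample values here are illustrative, not the actual file contents:

```python
import json

def latest_snapshot_summary(metadata: dict) -> dict:
    """Return the summary of the snapshot referenced by current-snapshot-id."""
    current_id = metadata["current-snapshot-id"]
    for snap in metadata["snapshots"]:
        if snap["snapshot-id"] == current_id:
            return snap["summary"]
    raise ValueError("current snapshot not found")

# Illustrative excerpt mirroring the append snapshot above
metadata = json.loads("""
{
  "current-snapshot-id": 7016469376893257879,
  "snapshots": [
    {
      "snapshot-id": 7016469376893257879,
      "summary": {
        "operation": "append",
        "added-data-files": "2"
      }
    }
  ]
}
""")

summary = latest_snapshot_summary(metadata)
print(summary["operation"], summary["added-data-files"])  # append 2
```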

Manifest List & File

The manifest list provides us the path to the manifest file.

Manifest List

The manifest file provides the details of the actual data files. It clearly shows two entries, i.e. two data files added via the first snapshot.

Manifest File

So as of now we have two data files for the Iceberg table iceberg.nyc_yellowtaxi_tripdata_v1 in its data directory.
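Since we are using a Hadoop catalog, the table lives directly on the filesystem under the configured warehouse path, so the data files can be counted straight off disk. A small sketch, assuming the default layout `iceberg-warehouse/nyc_yellowtaxi_tripdata_v1/data/` (the exact path depends on your catalog configuration):

```python
from pathlib import Path

def count_data_files(table_dir: str) -> int:
    """Count the parquet data files under an Iceberg table's data/ directory."""
    return sum(1 for _ in Path(table_dir, "data").rglob("*.parquet"))

# Hypothetical usage for the Hadoop-catalog table created above:
# count_data_files("iceberg-warehouse/nyc_yellowtaxi_tripdata_v1")
```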

Step 2: Update/Delete

Let’s delete a record to see the copy-on-write behavior of a v1 table.

spark.sql("delete from iceberg.nyc_yellowtaxi_tripdata_v1 where vendorid=1 and tpep_pickup_datetime='2022-01-01 00:44:50' ")

As we have performed a delete operation, a 2nd snapshot (6886885402418486640) got created. See the excerpt below from the new metadata.json file. Notice the operation here is overwrite; it also shows the count of files added/removed and records added/removed.

Latest metadata.json

It is evident from the metadata above that Iceberg has rewritten the whole affected data file, even though only one record changed.

It has also discarded the old copy. Notice the status of the files in the manifest list.

Let’s explore the manifest list. It has two entries, one per manifest file:

  • 1st Manifest File — shows a data file being added.
  • 2nd Manifest File — shows 1 data file deleted & 1 data file existing.
Manifest list — latest snapshot (snap-6886885402418486640-…-.avro)

To read the manifest files, we need to check the status of each data file entry. Refer to the table below.

+--------+-----------------------------------------+
| status | meaning                                 |
+--------+-----------------------------------------+
| 0      | EXISTING (carried over from a previous  |
|        | snapshot)                               |
| 1      | ADDED                                   |
| 2      | DELETED                                 |
+--------+-----------------------------------------+
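The same status mapping can be applied in code to summarize a manifest's entries. A plain-Python sketch; the entry dicts below are illustrative stand-ins for rows read from the Avro manifest files, not real file names:

```python
# Manifest-entry status codes from the Iceberg spec
STATUS = {0: "EXISTING", 1: "ADDED", 2: "DELETED"}

def summarize_entries(entries):
    """Count manifest entries per status."""
    counts = {"EXISTING": 0, "ADDED": 0, "DELETED": 0}
    for entry in entries:
        counts[STATUS[entry["status"]]] += 1
    return counts

# The two manifests of the second snapshot, flattened (illustrative):
entries = [
    {"status": 1, "file": "rewritten-copy.parquet"},   # m1: rewritten copy added
    {"status": 2, "file": "old-affected.parquet"},     # m0: old copy deleted
    {"status": 0, "file": "untouched.parquet"},        # m0: unchanged file kept
]
print(summarize_entries(entries))  # {'EXISTING': 1, 'ADDED': 1, 'DELETED': 1}
```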
  • 1st Manifest File (35dd9eb5-5a27-415e-94bc-59b961e308f2-m1.avro) — shows the rewritten data file as added (status 1).
  • 2nd Manifest File (35dd9eb5-5a27-415e-94bc-59b961e308f2-m0.avro) — shows 1 data file deleted (status 2) & 1 data file existing (status 0).

So here, if you notice, one data file is reused from the previous snapshot (7016469376893257879), as it had no changes.

Now we have three data files in the Iceberg table iceberg.nyc_yellowtaxi_tripdata_v1, though the discarded data file has no use other than time travel.
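The copy-on-write behavior we just observed can be mimicked in a few lines. This is a toy model, not Iceberg code: files are in-memory lists, and `cow_delete` is a hypothetical helper that rewrites any file containing a matching record while leaving the old file objects untouched, which is exactly what keeps earlier snapshots readable for time travel:

```python
def cow_delete(files: dict, predicate) -> dict:
    """Copy-on-write delete: rewrite every file containing a matching record.

    `files` maps file name -> list of records. Returns the new snapshot's
    file set; the input (the old snapshot) is left untouched.
    """
    new_files = {}
    for name, records in files.items():
        if any(predicate(r) for r in records):
            # Affected file: rewrite ALL surviving records into a new file
            new_files[f"{name}-rewritten"] = [r for r in records if not predicate(r)]
        else:
            # Untouched file: reused as-is in the new snapshot
            new_files[name] = records
    return new_files

snapshot1 = {"file-a": [1, 2, 3], "file-b": [4, 5, 6]}
snapshot2 = cow_delete(snapshot1, lambda r: r == 2)
print(snapshot2)   # {'file-a-rewritten': [1, 3], 'file-b': [4, 5, 6]}
print(snapshot1)   # unchanged: the old snapshot is still fully readable
```

Deleting the single record `2` rewrote all of `file-a` into a new file, while `file-b` is reused unchanged, mirroring the ADDED/DELETED/EXISTING entries we saw in the manifests.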
