Apache Iceberg Table — v1 Format — copy-on-write

BigDataEnthusiast
4 min read · Jul 8, 2023


In a previous blog we have already seen the different Iceberg table format versions & the write modes they support. Please refer to the link below.

In this blog we will explore the behavior of update/delete on a v1 format table, i.e. copy-on-write.

Step 1: Create table & append data.

I have created a v1 format table (without partitioning) & appended some data from a Spark application.

import pyspark
from pyspark.sql import SparkSession

conf = (
    pyspark.SparkConf()
    .setAppName('iceberg')
    .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0')
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.iceberg', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.iceberg.type', 'hadoop')
    .set('spark.sql.catalog.iceberg.warehouse', 'iceberg-warehouse')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()

ddl = """CREATE TABLE iceberg.nyc_yellowtaxi_tripdata_v1 (
    vendorid bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    ratecodeid double,
    store_and_fwd_flag string,
    pulocationid bigint,
    dolocationid bigint,
    payment_type bigint,
    fare_amount double,
    extra double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
    congestion_surcharge double,
    airport_fee double
) USING iceberg"""

spark.sql(ddl)

df = spark.read.parquet("/home/docker/data/*")
df.writeTo("iceberg.nyc_yellowtaxi_tripdata_v1").append()

Table iceberg.nyc_yellowtaxi_tripdata_v1 is created; let's explore its metadata files.

Metadata Files

As we have performed an append operation, the 1st snapshot (7016469376893257879) got created. See the excerpt below from the metadata.json file.

Two data files were added via this append. The metadata file also provides details such as the count of data files added, the added record count, etc.

metadata.json
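The same details can be pulled out of metadata.json programmatically. Below is a minimal sketch in plain Python (no Spark needed); the snapshot structure and summary keys (`operation`, `added-data-files`) follow the Iceberg v1 spec, but the sample values here are illustrative, not the actual file contents:

```python
import json

def latest_snapshot_summary(metadata: dict) -> dict:
    """Return the summary of the snapshot referenced by current-snapshot-id."""
    current_id = metadata["current-snapshot-id"]
    for snap in metadata["snapshots"]:
        if snap["snapshot-id"] == current_id:
            return snap["summary"]
    raise ValueError("current snapshot not found")

# Illustrative excerpt mirroring the append snapshot above
metadata = json.loads("""
{
  "current-snapshot-id": 7016469376893257879,
  "snapshots": [
    {
      "snapshot-id": 7016469376893257879,
      "summary": {
        "operation": "append",
        "added-data-files": "2"
      }
    }
  ]
}
""")

summary = latest_snapshot_summary(metadata)
print(summary["operation"], summary["added-data-files"])  # append 2
```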

Manifest List & File

The manifest list provides us the path to the manifest file.

Manifest List

The manifest file provides the details of the actual data files. It clearly shows two entries, i.e. two data files added via the first snapshot.

Manifest File

So as of now we have two data files for the Iceberg table iceberg.nyc_yellowtaxi_tripdata_v1 in its data directory.
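Since we are using a Hadoop catalog, the table lives directly on the filesystem under the configured warehouse path, so the data files can be counted straight off disk. A small sketch, assuming the default layout `iceberg-warehouse/nyc_yellowtaxi_tripdata_v1/data/` (the exact path depends on your catalog configuration):

```python
from pathlib import Path

def count_data_files(table_dir: str) -> int:
    """Count the parquet data files under an Iceberg table's data/ directory."""
    return sum(1 for _ in Path(table_dir, "data").rglob("*.parquet"))

# Hypothetical usage for the Hadoop-catalog table created above:
# count_data_files("iceberg-warehouse/nyc_yellowtaxi_tripdata_v1")
```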

Step 2: Update/Delete

Let’s delete a record to see the copy-on-write behavior of a v1 table.

spark.sql("delete from iceberg.nyc_yellowtaxi_tripdata_v1 where vendorid=1 and tpep_pickup_datetime='2022-01-01 00:44:50' ")

As we have performed a delete operation, a 2nd snapshot (6886885402418486640) got created. See the excerpt below from the new metadata.json file. Notice the operation here is overwrite; it also shows the count of files added/removed and records added/removed.

Latest metadata.json

It is evident from the metadata above that Iceberg has rewritten the whole affected data file, even though only one record changed.

It has also discarded the old copy. Notice the status of the files in the manifest list.

Let’s explore the manifest list. It has two entries, one per manifest file:

  • 1st Manifest File — shows a data file being added.
  • 2nd Manifest File — shows 1 data file deleted & 1 data file existing.
Manifest list — latest snapshot (snap-6886885402418486640-…-.avro)

To read the manifest files, we need to check the status of each data file entry. Refer to the table below.

+--------+-----------------------------------------+
| status | meaning                                 |
+--------+-----------------------------------------+
| 0      | EXISTING (carried over from a previous  |
|        | snapshot)                               |
| 1      | ADDED                                   |
| 2      | DELETED                                 |
+--------+-----------------------------------------+
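The same status mapping can be applied in code to summarize a manifest's entries. A plain-Python sketch; the entry dicts below are illustrative stand-ins for rows read from the Avro manifest files, not real file names:

```python
# Manifest-entry status codes from the Iceberg spec
STATUS = {0: "EXISTING", 1: "ADDED", 2: "DELETED"}

def summarize_entries(entries):
    """Count manifest entries per status."""
    counts = {"EXISTING": 0, "ADDED": 0, "DELETED": 0}
    for entry in entries:
        counts[STATUS[entry["status"]]] += 1
    return counts

# The two manifests of the second snapshot, flattened (illustrative):
entries = [
    {"status": 1, "file": "rewritten-copy.parquet"},   # m1: rewritten copy added
    {"status": 2, "file": "old-affected.parquet"},     # m0: old copy deleted
    {"status": 0, "file": "untouched.parquet"},        # m0: unchanged file kept
]
print(summarize_entries(entries))  # {'EXISTING': 1, 'ADDED': 1, 'DELETED': 1}
```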
  • 1st Manifest File (35dd9eb5-5a27-415e-94bc-59b961e308f2-m1.avro) — shows the rewritten data file as added (status 1).
  • 2nd Manifest File (35dd9eb5-5a27-415e-94bc-59b961e308f2-m0.avro) — shows 1 data file deleted (status 2) & 1 data file existing (status 0).

So here, if you notice, one data file is reused from the previous snapshot (7016469376893257879), as it had no changes.

Now we have three data files in the Iceberg table iceberg.nyc_yellowtaxi_tripdata_v1, though the discarded data file has no use other than time travel.
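The copy-on-write behavior we just observed can be mimicked in a few lines. This is a toy model, not Iceberg code: files are in-memory lists, and `cow_delete` is a hypothetical helper that rewrites any file containing a matching record while leaving the old file objects untouched, which is exactly what keeps earlier snapshots readable for time travel:

```python
def cow_delete(files: dict, predicate) -> dict:
    """Copy-on-write delete: rewrite every file containing a matching record.

    `files` maps file name -> list of records. Returns the new snapshot's
    file set; the input (the old snapshot) is left untouched.
    """
    new_files = {}
    for name, records in files.items():
        if any(predicate(r) for r in records):
            # Affected file: rewrite ALL surviving records into a new file
            new_files[f"{name}-rewritten"] = [r for r in records if not predicate(r)]
        else:
            # Untouched file: reused as-is in the new snapshot
            new_files[name] = records
    return new_files

snapshot1 = {"file-a": [1, 2, 3], "file-b": [4, 5, 6]}
snapshot2 = cow_delete(snapshot1, lambda r: r == 2)
print(snapshot2)   # {'file-a-rewritten': [1, 3], 'file-b': [4, 5, 6]}
print(snapshot1)   # unchanged: the old snapshot is still fully readable
```

Deleting the single record `2` rewrote all of `file-a` into a new file, while `file-b` is reused unchanged, mirroring the ADDED/DELETED/EXISTING entries we saw in the manifests.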
