Step 1. Add the JitPack repository to your build file
Add it in your root settings.gradle at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Add it in your settings.gradle.kts at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
Add it to your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Add it in your build.sbt at the end of resolvers:
resolvers += "jitpack" at "https://jitpack.io"
Add it in your project.clj at the end of repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
Add it in your build.gradle:
dependencies {
    implementation 'com.github.netflix:iceberg:0.6.3'
}
Add it in your build.gradle.kts:
dependencies {
    implementation("com.github.netflix:iceberg:0.6.3")
}
Add it to your pom.xml:
<dependency>
    <groupId>com.github.netflix</groupId>
    <artifactId>iceberg</artifactId>
    <version>0.6.3</version>
</dependency>
Add it in your build.sbt:
libraryDependencies += "com.github.netflix" % "iceberg" % "0.6.3"
Add it in your project.clj:
:dependencies [[com.github.netflix/iceberg "0.6.3"]]
Iceberg has moved! Iceberg has been donated to the Apache Software Foundation.
Please use the new Apache mailing lists, site, and repository:
Iceberg is a new table format for storing large, slow-moving tabular data. It is designed to improve on the de facto standard table layout built into Hive, Presto, and Spark.
Iceberg is under active development at Netflix.
The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on integrating Iceberg into Spark and Presto.
The Iceberg format specification is being actively updated and is open for comment. Until the specification is complete and released, it carries no compatibility guarantees. The spec is currently evolving as the Java reference implementation changes.
Java API javadocs are available for the 0.3.0 (latest) release.
We welcome collaboration on both the Iceberg library and specification. The draft spec is open for comments.
For other discussion, please use the Iceberg mailing list or open issues on the Iceberg GitHub page.
Iceberg is built using Gradle 4.4.
Iceberg table support is organized in library modules:
iceberg-common contains utility classes used in other modules
iceberg-api contains the public Iceberg API
iceberg-core contains implementations of the Iceberg API and support for Avro data files; this is what processing engines should depend on
iceberg-parquet is an optional module for working with tables backed by Parquet files
iceberg-orc is an optional module for working with tables backed by ORC files (experimental)
iceberg-hive is an implementation of Iceberg tables backed by the Hive metastore Thrift client
Iceberg also has modules for adding Iceberg support to processing engines:
iceberg-spark is an implementation of Spark's Datasource V2 API for Iceberg (use iceberg-runtime for a shaded version)
iceberg-data is a client library used to read Iceberg tables from JVM applications
iceberg-pig is an implementation of Pig's LoadFunc API for Iceberg
iceberg-presto-runtime generates a shaded runtime jar that is used by Presto to integrate with Iceberg tables
Iceberg's Spark integration is compatible with the following Spark versions:
| Iceberg version | Spark version |
| --------------- | ------------- |
| 0.2.0+          | 2.3.0         |
| 0.3.0+          | 2.3.2         |
Iceberg tracks individual data files in a table instead of directories. This allows writers to create data files in place and add them to the table only in an explicit commit.
Table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic operation. The table metadata file tracks the table schema, partitioning config, other properties, and snapshots of the table contents. Each snapshot is a complete set of data files in the table at some point in time. Snapshots are listed in the metadata file, but the files in a snapshot are stored in separate manifest files.
The atomic transitions from one table metadata file to the next provide snapshot isolation. Readers use the snapshot that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.
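The atomic swap described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not the Iceberg API: `ToyMetadata`, `ToyTable`, `load`, and `commitAppend` are hypothetical names, and an `AtomicReference` stands in for whatever mechanism atomically replaces the pointer to the current metadata file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a table metadata file: immutable once created.
final class ToyMetadata {
    final int version;
    final List<String> dataFiles;

    ToyMetadata(int version, List<String> dataFiles) {
        this.version = version;
        this.dataFiles = Collections.unmodifiableList(new ArrayList<>(dataFiles));
    }
}

// Hypothetical table whose only mutable state is a pointer to the current metadata.
final class ToyTable {
    private final AtomicReference<ToyMetadata> current =
        new AtomicReference<>(new ToyMetadata(0, Collections.<String>emptyList()));

    // Readers load once and keep using the returned object until they refresh.
    ToyMetadata load() {
        return current.get();
    }

    // Build new metadata on top of `base` and swap it in atomically.
    // Returns false when `base` is stale because another commit won the race.
    boolean commitAppend(ToyMetadata base, List<String> newFiles) {
        List<String> files = new ArrayList<>(base.dataFiles);
        files.addAll(newFiles);
        ToyMetadata next = new ToyMetadata(base.version + 1, files);
        return current.compareAndSet(base, next);
    }
}

final class ToyTableDemo {
    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        ToyMetadata readerView = table.load();

        // A writer commits while the reader is still using its loaded metadata.
        table.commitAppend(table.load(), Arrays.asList("data-0001.parquet"));

        System.out.println("reader sees " + readerView.dataFiles.size() + " files");
        System.out.println("after refresh: " + table.load().dataFiles.size() + " files");
    }
}
```

In this model a reader that holds `readerView` is unaffected by later commits, and a writer that commits against stale metadata gets `false` back and would need to reload and retry.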
Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics. A snapshot is the union of all files in its manifests. Manifest files can be shared between snapshots to avoid rewriting metadata that is slow-changing.
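A minimal sketch of the snapshot and manifest relationship follows. It is a toy model under stated assumptions, not the Iceberg implementation: `ToyManifest` and `ToySnapshot` are hypothetical classes, and a manifest here holds only file paths, whereas real manifests also carry partition data and metrics for each file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical manifest: an immutable list of data file paths.
final class ToyManifest {
    final List<String> dataFiles;

    ToyManifest(String... dataFiles) {
        this.dataFiles = Collections.unmodifiableList(Arrays.asList(dataFiles.clone()));
    }
}

// Hypothetical snapshot: an immutable list of manifests.
final class ToySnapshot {
    final List<ToyManifest> manifests;

    ToySnapshot(List<ToyManifest> manifests) {
        this.manifests = Collections.unmodifiableList(new ArrayList<>(manifests));
    }

    // A snapshot's contents are the union of the files in all of its manifests.
    Set<String> dataFiles() {
        Set<String> files = new LinkedHashSet<>();
        for (ToyManifest manifest : manifests) {
            files.addAll(manifest.dataFiles);
        }
        return files;
    }

    // Appending reuses every existing manifest object and adds one new manifest,
    // so metadata for unchanged files is not rewritten.
    ToySnapshot append(ToyManifest added) {
        List<ToyManifest> next = new ArrayList<>(manifests);
        next.add(added);
        return new ToySnapshot(next);
    }
}
```

Here the snapshot returned by `append` shares the original manifest with the snapshot it was built from; only the manifest for the newly added files is created.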
This design addresses specific problems with the Hive layout: file listing is no longer used to plan jobs, and files are created in place without renaming.
This also provides improved guarantees and performance:
There are several problems with the current format:
Table data is tracked in both a central metastore, for partitions, and the file system, for files. The central metastore can be a scale bottleneck, and the file system doesn't (and shouldn't) provide transactions to isolate concurrent reads and writes. The current table layout cannot be patched to fix its major problems.
In addition to changes in how table contents are tracked, Iceberg's design improves a few other areas: