From ea3012e3729b934d51f2e5be7905fd6faf1ed451 Mon Sep 17 00:00:00 2001
From: Nir Ozery
Date: Tue, 5 Mar 2024 15:25:18 +0200
Subject: [PATCH 1/2] Add README content

---
 README.md | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 113 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f0a3a87..50ff9a4 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,113 @@
-# lakefs-iceberg-catalog
\ No newline at end of file
+[lakeFS logo] [Apache Iceberg logo]
+
+## lakeFS Iceberg Catalog
+
+lakeFS enriches your Iceberg tables with Git capabilities: create a branch and make your changes in isolation, without affecting other team members.
+
+See the instructions below for build, configuration, and usage.
+
+## Build
+
+From the repository root, run the following Maven command:
+
+```sh
+mvn clean install -U -DskipTests
+```
+
+Under the `target` directory you will find the jar:
+
+`lakefs-iceberg-catalog-<version>.jar`
+
+Load this jar into your environment.
+
+## Configuration
+
+lakeFS Catalog uses the lakeFS HadoopFileSystem under the hood to interact with lakeFS.
+In addition, for better performance, we configure the S3A FS to interact directly with the underlying storage:
+
+```scala
+conf.set("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
+conf.set("spark.hadoop.fs.lakefs.access.key", "AKIAIOSFDNN7EXAMPLEQ")
+conf.set("spark.hadoop.fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
+conf.set("spark.hadoop.fs.lakefs.endpoint", "http://localhost:8000/api/v1")
+conf.set("spark.hadoop.fs.s3a.access.key", "")
+conf.set("spark.hadoop.fs.s3a.secret.key", "")
+```
+
+In the catalog configuration, pass the lakefs FS scheme configured previously as the warehouse location:
+
+```scala
+conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
+conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
+conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://")
+```
+
+## Usage
+
+For our examples, assume a lakeFS repository called `myrepo`.
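+
+The snippets below assume a Spark session that already carries the configuration from the previous section. The following is a minimal sketch of such a session, not part of this repository; the credential entries are placeholders:
+
+```scala
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
+
+// Placeholder wiring; reuse the conf entries from the Configuration section,
+// including the fs.lakefs.* and fs.s3a.* credentials.
+val conf = new SparkConf()
+  .set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
+  .set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
+  .set("spark.sql.catalog.lakefs.warehouse", "lakefs://")
+  .set("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
+
+val spark = SparkSession.builder().config(conf).getOrCreate()
+spark.sql("SHOW TABLES IN lakefs.myrepo.main") // sanity check against the `main` branch
+```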
+
+### Create a table
+
+Let's create a table called `table1` under the `main` branch and namespace `name.space.`
+To create the table and seed it with some data, use the following syntax:
+
+```sql
+CREATE TABLE lakefs.myrepo.main.name.space.table1 (id int, data string);
+INSERT INTO lakefs.myrepo.main.name.space.table1 VALUES (1, 'data1'), (2, 'data2');
+```
+
+### Create a branch
+
+We will create a new branch `dev` from `main`, but first let's commit the creation of the table to the `main` branch:
+
+```
+lakectl commit lakefs://myrepo/main -m "my first iceberg commit"
+```
+
+To create a new branch:
+
+```
+lakectl branch create lakefs://myrepo/dev -s lakefs://myrepo/main
+```
+
+### Make changes on the branch
+
+We can now make changes on the `dev` branch:
+
+```sql
+INSERT INTO lakefs.myrepo.dev.name.space.table1 VALUES (3, 'data3');
+```
+
+### Query the table
+
+If we query the table on the `dev` branch, we will see the data we inserted:
+
+```sql
+SELECT * FROM lakefs.myrepo.dev.name.space.table1;
+```
+
+Results in:
+```
++----+------+
+| id | data |
++----+------+
+| 1 | data1|
+| 2 | data2|
+| 3 | data3|
++----+------+
+```
+
+However, data on the `main` branch remains unaffected:
+
+```sql
+SELECT * FROM lakefs.myrepo.main.name.space.table1;
+```
+
+Results in:
+```
++----+------+
+| id | data |
++----+------+
+| 1 | data1|
+| 2 | data2|
++----+------+
+```

From a12b8500877c623462d36c1ac80855c7baa3e616 Mon Sep 17 00:00:00 2001
From: Nir Ozery
Date: Tue, 5 Mar 2024 17:43:19 +0200
Subject: [PATCH 2/2] CR Fixes

---
 README.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 50ff9a4..18b77da 100644
--- a/README.md
+++ b/README.md
@@ -22,24 +22,25 @@ Load this jar into your environment.
 
 ## Configuration
 
-lakeFS Catalog uses the lakeFS HadoopFileSystem under the hood to interact with lakeFS.
+lakeFS Catalog uses the [lakeFS Hadoop FileSystem](https://docs.lakefs.io/integrations/spark.html#lakefs-hadoop-filesystem) under the hood to interact with lakeFS.
 In addition, for better performance, we configure the S3A FS to interact directly with the underlying storage:
 
 ```scala
 conf.set("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
 conf.set("spark.hadoop.fs.lakefs.access.key", "AKIAIOSFDNN7EXAMPLEQ")
 conf.set("spark.hadoop.fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
-conf.set("spark.hadoop.fs.lakefs.endpoint", "http://localhost:8000/api/v1")
+conf.set("spark.hadoop.fs.lakefs.endpoint", "<lakefs-endpoint>/api/v1")
 conf.set("spark.hadoop.fs.s3a.access.key", "")
 conf.set("spark.hadoop.fs.s3a.secret.key", "")
 ```
 
-In the catalog configuration, pass the lakefs FS scheme configured previously as the warehouse location:
+To configure a custom lakeFS catalog using Spark,
+pass the lakefs FS scheme configured previously as the warehouse location:
 
 ```scala
 conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
 conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
-conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://")
+conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://") // Should match the lakefs FS scheme configured above
 ```
 
 ## Usage
@@ -48,7 +49,7 @@ For our examples, assume a lakeFS repository called `myrepo`.
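+A full table identifier breaks down as `<catalog>.<repository>.<branch>.<namespace>.<table>`. As a small illustration (standard Spark SQL, assuming the `lakefs` catalog configured above), the table created below can be addressed on any branch by swapping the branch segment:
+
+```sql
+-- lakefs = catalog, myrepo = repository, main = branch,
+-- name.space = namespace, table1 = table
+DESCRIBE TABLE lakefs.myrepo.main.name.space.table1;
+```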
 
 ### Create a table
 
-Let's create a table called `table1` under the `main` branch and namespace `name.space.`
+Let's create a table called `table1` under the `main` branch and namespace `name.space`.
 To create the table and seed it with some data, use the following syntax:
 
 ```sql
 CREATE TABLE lakefs.myrepo.main.name.space.table1 (id int, data string);
 INSERT INTO lakefs.myrepo.main.name.space.table1 VALUES (1, 'data1'), (2, 'data2');
 ```
@@ -111,3 +112,10 @@ Results in:
 | 2 | data2|
 +----+------+
 ```
+
+### Merge changes
+
+After changing the data on the `dev` branch, you can merge it back to `main` using the lakeFS UI, lakectl, or
+any of our various clients.
+Note that currently only fast-forward merges are supported for Iceberg tables. To ensure the validity of the table history,
+the table on the `main` branch must not be altered before merging from `dev`.
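+
+For example, with `lakectl` (a sketch following the commands used earlier; commit the branch before merging):
+
+```
+lakectl commit lakefs://myrepo/dev -m "insert data on dev"
+lakectl merge lakefs://myrepo/dev lakefs://myrepo/main
+```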