Merge pull request #4 from treeverse/task/add-readme-content
Add README content
Showing 1 changed file with 121 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,121 @@ | ||
# lakefs-iceberg-catalog | ||
<img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/> | ||
|
||
## lakeFS Iceberg Catalog | ||
|
||
lakeFS enriches your Iceberg tables with Git capabilities: create a branch and make your changes in isolation, without affecting other team members. | ||
|
||
See the instructions below for building, configuring, and using the catalog. | ||
|
||
## Build | ||
|
||
From the repository root, run the following Maven command: | ||
|
||
```sh | ||
mvn clean install -U -DskipTests | ||
``` | ||
|
||
Under the `target` directory you will find the jar: | ||
|
||
`lakefs-iceberg-catalog-<version>.jar` | ||
|
||
Load this jar into your environment. | ||
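One way to locate the built jar and hand it to Spark is sketched below. The helper name `find_catalog_jar` is hypothetical, and the sketch assumes the build left a single matching jar under `target`. Depending on your setup, other dependencies may also need to be on the classpath.

```sh
# Hypothetical helper: print the path of the catalog jar built by Maven.
# Takes the build directory as an optional argument (defaults to "target").
# Assumes only one matching jar exists; the first match is printed.
find_catalog_jar() {
  for jar in "${1:-target}"/lakefs-iceberg-catalog-*.jar; do
    # If the glob matched nothing, the literal pattern is returned; skip it.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no lakefs-iceberg-catalog jar found; run the Maven build first" >&2
  return 1
}
```

The result can then be passed to Spark's `--jars` launcher flag, for example `spark-shell --jars "$(find_catalog_jar)"`.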
|
||
## Configuration | ||
|
||
The lakeFS Catalog uses the [lakeFS Hadoop FileSystem](https://docs.lakefs.io/integrations/spark.html#lakefs-hadoop-filesystem) under the hood to interact with lakeFS. | ||
In addition, for better performance, configure the S3A FileSystem to interact directly with the underlying storage: | ||
|
||
```scala | ||
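import org.apache.spark.SparkConf | ||
// Assumption: `conf` is the SparkConf used to build the Spark session | ||
val conf = new SparkConf() | ||
// lakeFS Hadoop FileSystem: implementation class, credentials, and API endpoint | ||
conf.set("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") | ||
conf.set("spark.hadoop.fs.lakefs.access.key", "AKIAIOSFDNN7EXAMPLEQ") | ||
conf.set("spark.hadoop.fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY") | ||
conf.set("spark.hadoop.fs.lakefs.endpoint", "<your-lakefs-endpoint>/api/v1") | ||
// S3A credentials for direct access to the underlying storage | ||
conf.set("spark.hadoop.fs.s3a.access.key", "<your-aws-access-key>") | ||
conf.set("spark.hadoop.fs.s3a.secret.key", "<your-aws-secret-key>") | ||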
``` | ||
|
||
To configure a custom lakeFS catalog in Spark, pass the lakeFS FileSystem scheme configured above (`lakefs://`) as the warehouse location in the catalog configuration: | ||
|
||
```scala | ||
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") | ||
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") | ||
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://") // Must match the scheme of the lakeFS FileSystem configured above (fs.lakefs.impl) | ||
``` | ||
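The same non-secret settings can also be supplied on the command line. The sketch below prints them as `--conf` flags. The function name `lakefs_catalog_flags` is hypothetical, the values are copied from the Scala snippets above, and the credentials and endpoint are deliberately left out so they do not end up in shell history; supply those however your deployment prefers.

```sh
# Hypothetical helper: print the catalog settings as spark-shell --conf flags.
# The catalog name "lakefs" and all values come from the configuration above.
# Credentials and the lakeFS endpoint are intentionally not included here.
lakefs_catalog_flags() {
  printf '%s ' \
    --conf "spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem" \
    --conf "spark.sql.catalog.lakefs=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.lakefs.catalog-impl=io.lakefs.iceberg.LakeFSCatalog" \
    --conf "spark.sql.catalog.lakefs.warehouse=lakefs://"
  printf '\n'
}
```

Because none of the values contain spaces, the output can be expanded unquoted, for example `spark-shell $(lakefs_catalog_flags)`.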
|
||
## Usage | ||
|
||
For the examples below, assume a lakeFS repository called `myrepo`. | ||
|
||
### Create a table | ||
|
||
Let's create a table called `table1` on the `main` branch, in the namespace `name.space`. | ||
To create the table, use the following syntax: | ||
|
||
```sql | ||
CREATE TABLE lakefs.myrepo.main.name.space.table1 (id int, data string); | ||
``` | ||
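The identifier in the statement above appears to be composed of the catalog name, the repository, the branch, the namespace, and the table, in that order. This reading is inferred from the examples in this README rather than from a documented grammar. A minimal sketch of composing such identifiers, with the hypothetical helper name `lakefs_table`:

```sh
# Hypothetical helper: compose a fully qualified table identifier.
# Arguments: repository, branch, namespace (may itself contain dots), table.
# The leading "lakefs" is the catalog name from spark.sql.catalog.lakefs.
lakefs_table() {
  printf 'lakefs.%s.%s.%s.%s\n' "$1" "$2" "$3" "$4"
}
```

For example, `lakefs_table myrepo main name.space table1` prints `lakefs.myrepo.main.name.space.table1`, and swapping `main` for another branch name addresses the same table on that branch.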
|
||
### Create a branch | ||
|
||
We will create a new branch `dev` from `main`, but first let's commit the creation of the table to the `main` branch: | ||
|
||
```sh | ||
lakectl commit lakefs://myrepo/main -m "my first iceberg commit" | ||
``` | ||
|
||
To create a new branch: | ||
|
||
```sh | ||
lakectl branch create lakefs://myrepo/dev -s lakefs://myrepo/main | ||
``` | ||
|
||
### Make changes on the branch | ||
|
||
We can now make changes on the `dev` branch: | ||
|
||
```sql | ||
INSERT INTO lakefs.myrepo.dev.name.space.table1 VALUES (3, 'data3'); | ||
``` | ||
|
||
### Query the table | ||
|
||
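If we query the table on the `dev` branch, we will see the row we inserted alongside the rows inherited from `main` (the output below assumes rows 1 and 2 were inserted and committed on `main` before `dev` was created): | ||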
|
||
```sql | ||
SELECT * FROM lakefs.myrepo.dev.name.space.table1; | ||
``` | ||
|
||
Results in: | ||
``` | ||
+----+------+ | ||
| id | data | | ||
+----+------+ | ||
| 1 | data1| | ||
| 2 | data2| | ||
| 3 | data3| | ||
+----+------+ | ||
``` | ||
|
||
However, data on the `main` branch remains unaffected: | ||
|
||
```sql | ||
SELECT * FROM lakefs.myrepo.main.name.space.table1; | ||
``` | ||
|
||
Results in: | ||
``` | ||
+----+------+ | ||
| id | data | | ||
+----+------+ | ||
| 1 | data1| | ||
| 2 | data2| | ||
+----+------+ | ||
``` | ||
|
||
### Merge changes | ||
|
||
After changing the data on the `dev` branch, you can merge it back to `main` using the lakeFS UI, lakectl, or | ||
any of the lakeFS clients. | ||
Note that only fast-forward merges are currently supported for Iceberg tables. To keep the table history valid, | ||
the table on the `main` branch must not be altered between creating `dev` and merging it back. |
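
With lakectl, the merge could look like the sketch below. The `run` wrapper, the function name `merge_dev_into_main`, and the commit message are illustrative assumptions. The changes on `dev` are committed first, since the insert shown earlier is otherwise still uncommitted.

```sh
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Commit the pending changes on dev, then merge dev into main.
# lakectl merge takes the source ref first and the destination ref second.
merge_dev_into_main() {
  run lakectl commit lakefs://myrepo/dev -m "insert row 3" &&
    run lakectl merge lakefs://myrepo/dev lakefs://myrepo/main
}
```

Running `DRY_RUN=1 merge_dev_into_main` prints the two lakectl commands without contacting lakeFS, which is a cheap way to check the refs before merging for real.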