From 9fdbbfbb324437016e55083aa85550912df11ca6 Mon Sep 17 00:00:00 2001 From: Amit Kesarwani <93291915+kesarwam@users.noreply.github.com> Date: Mon, 19 Aug 2024 17:41:40 -0700 Subject: [PATCH] Added Iceberg integration notebooks which work with AWS Glue --- 00_notebooks/00_index.ipynb | 1 + .../aws-glue-iceberg/README.md | 26 + .../aws-glue-iceberg/iceberg-books.ipynb | 531 ++++++++++++++ .../aws-glue-iceberg/iceberg-lakefs-nyc.ipynb | 694 ++++++++++++++++++ README.md | 1 + 5 files changed, 1253 insertions(+) create mode 100644 01_standalone_examples/aws-glue-iceberg/README.md create mode 100644 01_standalone_examples/aws-glue-iceberg/iceberg-books.ipynb create mode 100644 01_standalone_examples/aws-glue-iceberg/iceberg-lakefs-nyc.ipynb diff --git a/00_notebooks/00_index.ipynb b/00_notebooks/00_index.ipynb index 5c6a3f47d..ae96f808c 100644 --- a/00_notebooks/00_index.ipynb +++ b/00_notebooks/00_index.ipynb @@ -82,6 +82,7 @@ "* [AWS **Databricks**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-databricks/)\n", "* [AWS **Glue and Athena**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-glue-athena/)\n", "* [AWS **Glue and Trino**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-glue-trino/)\n", + "* [AWS **Glue and Iceberg**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-glue-iceberg/)\n", "* [lakeFS + **Dagster**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/dagster-integration/)\n", "* [lakeFS + **Prefect**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/prefect-integration/)\n", "* [Reproducibility and Data Version Control for **LangChain** and **LLM/OpenAI** Models](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/llm-openai-langchain-integration/)
_See also the [accompanying blog](https://lakefs.io/blog/lakefs-langchain-loader/)_\n", diff --git a/01_standalone_examples/aws-glue-iceberg/README.md new file mode 100644 index 000000000..3b89b536d --- /dev/null +++ b/01_standalone_examples/aws-glue-iceberg/README.md @@ -0,0 +1,26 @@ +# Integration of lakeFS with Glue Notebooks and Iceberg + +Start by ⭐️ starring the [lakeFS open source](https://go.lakefs.io/oreilly-course) project. + +This repository includes the following Glue notebooks: + +1. iceberg-books: +* Use Case: Isolated Dev/Test Environments +* This interactive notebook demonstrates the integration of lakeFS with [Glue Notebooks](https://docs.aws.amazon.com/glue/latest/dg/notebook-getting-started.html) and [Iceberg](https://docs.lakefs.io/integrations/iceberg.html) using a basic example. + +2. iceberg-lakefs-nyc: +* Use Case: Isolated Dev/Test Environments +* This interactive notebook demonstrates the integration of lakeFS with Glue Notebooks and Iceberg using the [NYC Film Permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p/about_data) dataset as an example. + + +## Prerequisites +* lakeFS installed and running in your AWS environment or in lakeFS Cloud. If you don't have lakeFS running already, either use [lakeFS Cloud](https://lakefs.cloud/), which provides a free lakeFS server on demand with a single click, or follow the [Deploy lakeFS on AWS](https://docs.lakefs.io/howto/deploy/aws.html) guide. + + +## Setup + +Download these notebooks from GitHub and upload them as Jupyter notebooks in AWS Glue Studio. + +## Demo Instructions + +Open either notebook in Glue Studio and follow the instructions. diff --git a/01_standalone_examples/aws-glue-iceberg/iceberg-books.ipynb new file mode 100644 index 000000000..747cd4408 --- /dev/null +++ b/01_standalone_examples/aws-glue-iceberg/iceberg-books.ipynb @@ -0,0 +1,531 @@ +{ + "metadata": { + "kernelspec": { + "name": "glue_pyspark", + "display_name": "Glue PySpark", + "language": "python" + }, + "language_info": { + "name": "Python_Glue_Session", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "pygments_lexer": "python3", + "file_extension": ".py" + } + }, + "nbformat_minor": 5, + "nbformat": 4, + "cells": [ + { + "cell_type": "markdown", + "source": "## Data As Code - lakeFS basic example ", + "metadata": {}, + "id": "1041ae6f" + }, + { + "cell_type": "markdown", + "source": "### Glue session configuration", + "metadata": {}, + "id": "9af12139-265b-469c-830c-58a5882cf5ea" + }, + { + "cell_type": "code", + "source": "%stop_session\n%session_id_prefix 'iceberg-books-demo'\n%idle_timeout 120\n%glue_version 4.0\n%worker_type G.1X\n%number_of_workers 2\n\n%additional_python_modules 'lakefs'\n%extra_jars https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.3.0/iceberg-spark-runtime-3.3_2.12-1.3.0.jar, https://repo1.maven.org/maven2/io/lakefs/lakefs-iceberg/0.1.3/lakefs-iceberg-0.1.3.jar\n\n%spark_conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "c4c3d384-bb97-4b51-935e-941bf1c8196e" + }, + { + "cell_type": "markdown", + "source": "## Config\n\n**_Change these values to match your lakeFS and storage environment_**", + "metadata": { + "tags": [] + }, + "id": 
"d42d5e48-44f1-4da4-b8c1-b354b1b61e17" + }, + { + "cell_type": "markdown", + "source": "### lakeFS endpoint and credentials", + "metadata": { + "tags": [] + }, + "id": "7d01815b-cc6e-420a-bca9-970b23fb895b" + }, + { + "cell_type": "code", + "source": "lakefsEndPoint = 'https://username.aws_region_name.lakefscloud.io' \nlakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'\nlakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "9ad336dd-cb3d-478a-bc90-cd1df7ca7e6d" + }, + { + "cell_type": "markdown", + "source": "### Object Storage", + "metadata": { + "tags": [] + }, + "id": "fe0cc4bb-ebd4-4731-9ada-d924ae7a08da" + }, + { + "cell_type": "code", + "source": "storageNamespace = 's3://example' # e.g. \"s3://bucket\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "f452ecfd-a7be-4a3a-9e5c-5a59b1d6c9de" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "7fcfc3fe-0617-447c-8c1e-6d0db05717ab" + }, + { + "cell_type": "markdown", + "source": "## Setup\n\n**(you shouldn't need to change anything in this section, just run it)**", + "metadata": { + "tags": [] + }, + "id": "c2595447-3d6c-425f-b2cd-b1e0e38827e9" + }, + { + "cell_type": "code", + "source": "repo_name = \"data-as-code-repo\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "ada2113d-b298-4606-88ef-42e0a3b528ca" + }, + { + "cell_type": "markdown", + "source": "### Versioning Information", + "metadata": {}, + "id": "07cbe85b-12f8-4939-9eae-1a263333e8ed" + }, + { + "cell_type": "code", + "source": "mainBranch = \"main\"\ndevBranch = \"dev\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "0ce82b03-53ce-4e5c-914d-d01d4af980c0" + }, + { + "cell_type": "markdown", + "source": "### Import libraries", + "metadata": {}, + "id": "a686bc3f-4136-4ed7-86e0-a5d70496aed8" + }, + { + "cell_type": "code", + "source": "import os\nimport lakefs\nfrom pyspark.sql.functions import when, col", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "cd4ab1cd-c781-42e8-9646-c554c8c2b6b7" + }, + { + "cell_type": "markdown", + "source": "### Define some helper functions", + "metadata": {}, + "id": "707c6531-31f2-4ee3-866a-eb66aa44031c" + }, + { + "cell_type": "code", + "source": "def print_commit(log):\n from datetime import datetime\n from pprint import pprint\n\n print('Message:', log.message)\n print('ID:', log.id)\n print('Committer:', log.committer)\n print('Creation Date:', datetime.utcfromtimestamp(log.creation_date).strftime('%Y-%m-%d %H:%M:%S'))\n print('Parents:', log.parents)\n print('Metadata:')\n pprint(log.metadata)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "b41ba6c3-e8e0-41f4-860f-a9919825632d" + }, + { + "cell_type": "markdown", + "source": "### Set environment variables", + "metadata": {}, + "id": "ba4efc82-eecb-47af-9554-3449e3b41bae" + }, + { + "cell_type": "code", + "source": "os.environ[\"LAKECTL_SERVER_ENDPOINT_URL\"] = lakefsEndPoint\nos.environ[\"LAKECTL_CREDENTIALS_ACCESS_KEY_ID\"] = lakefsAccessKey\nos.environ[\"LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY\"] = lakefsSecretKey", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "62a4b068-3c55-48ab-8f92-2de80a61f197" + }, + { + 
"cell_type": "markdown", + "source": "#### Verify lakeFS credentials by getting lakeFS version", + "metadata": {}, + "id": "851be970-1ba6-4514-b9fc-d489342db8d0" + }, + { + "cell_type": "code", + "source": "print(\"Verifying lakeFS credentials…\")\ntry:\n v=lakefs.client.Client().version\nexcept:\n print(\"🛑 failed to get lakeFS version\")\nelse:\n print(f\"…✅lakeFS credentials verified\\n\\nℹ️lakeFS version {v}\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "9d332303-8ceb-40a7-9c4c-1956480bdcad" + }, + { + "cell_type": "markdown", + "source": "### Define lakeFS Repository", + "metadata": {}, + "id": "0d36e1b9-607f-467e-8ef2-98cbb2c60e18" + }, + { + "cell_type": "code", + "source": "repo = lakefs.Repository(repo_name).create(storage_namespace=f\"{storageNamespace}/{repo_name}/4/18/a\", default_branch=mainBranch, exist_ok=True)\nbranchMain = repo.branch(mainBranch)\nprint(repo)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "0e2db150-3df3-43b9-97e0-e6fe69aea3d9" + }, + { + "cell_type": "markdown", + "source": "### Set up Spark", + "metadata": {}, + "id": "959c5b07-f5b7-4d72-879c-e18bf6e4f0b1" + }, + { + "cell_type": "code", + "source": "spark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3.impl\",\"org.apache.hadoop.fs.s3a.S3AFileSystem\")\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\",lakefsSecretKey)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\",lakefsEndPoint)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\",lakefsAccessKey)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.path.style.access\",\"true\")\nspark.conf.set(\"spark.sql.defaultCatalog\",\"lakefs\")\nspark.conf.set(\"spark.sql.catalog.lakefs\",\"org.apache.iceberg.spark.SparkCatalog\")\nspark.conf.set(\"spark.sql.catalog.lakefs.catalog-impl\",\"io.lakefs.iceberg.LakeFSCatalog\")\nspark.conf.set(\"spark.sql.catalog.lakefs.warehouse\",f\"lakefs://{repo_name}\")\nspark.conf.set(\"spark.sql.catalog.lakefs.uri\",lakefsEndPoint)\nspark.conf.set(\"spark.sql.catalog.lakefs.cache-enabled\",\"false\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "6aa42e18-cf84-46ac-a395-80116a31b1e7" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "0efb2cfd-9393-46ee-b1a9-f409e0a23b74" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "e58bd823-5b30-43cb-bf7c-cc2aef57db39" + }, + { + "cell_type": "markdown", + "source": "## Create an Iceberg table in the lakeFS catalog `main` branch", + "metadata": {}, + "id": "dbd1fc9b-ca63-4e24-a2a4-83c2b9495408" + }, + { + "cell_type": "code", + "source": "%%sql\n\nCREATE OR REPLACE TABLE main.lakefs_demo.authors(id int, name string) USING iceberg;\n", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "4016b8b0-1bf9-46a1-be2d-31faeeb8e793" + }, + { + "cell_type": "code", + "source": "%%sql\n\nCREATE OR REPLACE TABLE main.lakefs_demo.books(id int, title string, author_id int) USING iceberg;\n", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "9b2b7734-ae2f-4259-b40c-6f527eaf1ba7" + }, + { + "cell_type": "code", + "source": "%%sql\n\nCREATE OR REPLACE TABLE main.lakefs_demo.book_sales(id int, sale_date date, book_id int, price double) USING iceberg;\n", + "metadata": { + 
"trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "6a03042c-f7f0-4fd9-8847-7eb3ab2764ab" + }, + { + "cell_type": "code", + "source": "%%sql \n\nINSERT INTO main.lakefs_demo.authors (id, name)\nVALUES (1, \"J.R.R. Tolkien\"), (2, \"George R.R. Martin\"),\n (3, \"Agatha Christie\"), (4, \"Isaac Asimov\"), (5, \"Stephen King\");", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3d595fe6-a530-4653-ac5d-ae5b6cadd276" + }, + { + "cell_type": "code", + "source": "%%sql\n\nINSERT INTO main.lakefs_demo.books (id, title, author_id)\nVALUES (1, \"The Lord of the Rings\", 1), (2, \"The Hobbit\", 1), \n (3, \"A Song of Ice and Fire\", 2), (4, \"A Clash of Kings\", 2),\n (5, \"And Then There Were None\", 3), (6, \"Murder on the Orient Express\", 3),\n (7, \"Foundation\", 4), (8, \"I, Robot\", 4),\n (9, \"The Shining\", 5), (10, \"It\", 5);", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "0972269d-47d5-41ba-a732-d360094f5a1a" + }, + { + "cell_type": "code", + "source": "%%sql\n\nINSERT INTO main.lakefs_demo.book_sales (id, sale_date, book_id, price)\nVALUES (1, DATE '2024-04-12', 1, 25.50),\n (2, DATE '2024-04-11', 2, 17.99), \n (3, DATE '2024-04-10', 3, 12.95), \n (4, DATE '2024-04-13', 4, 32.00), \n (5, DATE '2024-04-12', 5, 29.99), \n (6, DATE '2024-03-15', 1, 23.99), \n (7, DATE '2024-02-22', 2, 19.50), \n (8, DATE '2024-01-10', 3, 14.95), \n (9, DATE '2023-12-05', 4, 28.00), \n (10, DATE '2023-11-18', 5, 27.99), \n (11, DATE '2023-10-26', 2, 18.99), \n (12, DATE '2023-10-12', 1, 22.50), \n (13, DATE '2024-04-09', 3, 11.95), \n (14, DATE '2024-03-28', 4, 35.00), \n (15, DATE '2024-04-05', 5, 31.99), \n (16, DATE '2024-03-01', 1, 27.50), \n (17, DATE '2024-02-14', 2, 21.99), \n (18, DATE '2024-01-07', 3, 13.95), \n (19, DATE '2023-12-20', 4, 29.00), \n (20, DATE '2023-11-03', 5, 28.99); ", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "968c531f-e7ac-4f94-b2d9-4a79f685b74b" + }, + { + "cell_type": "code", + "source": "ref = branchMain.commit(message=\"Initial data load\",\n metadata={'author': 'lakefs'})\nprint_commit(ref.get_commit())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "c121b4e0-a2b2-492e-8fec-d9f376e0c073" + }, + { + "cell_type": "markdown", + "source": "# Main demo starts here 🚦 👇🏻", + "metadata": {}, + "id": "9464902f-39ce-4d01-813e-fda188188b43" + }, + { + "cell_type": "markdown", + "source": "## Read my production data from my main branch", + "metadata": {}, + "id": "71ac46f0-388e-4a9b-a1cd-7ccafafc2d1d" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT * FROM main.lakefs_demo.authors LIMIT 5;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "4fa402ea-0ccd-4193-af80-b349f6a33f21" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT * FROM main.lakefs_demo.books LIMIT 5;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "09feda26-6c78-4bfc-875c-11cccc3edaf1" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT * FROM main.lakefs_demo.book_sales LIMIT 5;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "cdc3fc7a-94b5-453b-9006-131a401ebc6a" + }, + { + "cell_type": "markdown", + "source": "## Mess with the data - Create a 
development sandbox", + "metadata": {}, + "id": "75663622-69c5-480a-b837-41cd62f9f158" + }, + { + "cell_type": "code", + "source": "branchDev = repo.branch(devBranch).create(source_reference=mainBranch, exist_ok=True)\nprint(f\"{devBranch} ref:\", branchDev.get_commit().id)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "7d2f6c49-44a9-402d-a0cd-ec02d9cf578a" + }, + { + "cell_type": "markdown", + "source": "## Read data from my development sandbox", + "metadata": {}, + "id": "33027782-83e9-43b1-b949-8f039e77e473" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT * FROM dev.lakefs_demo.book_sales LIMIT 5;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "c647b58e-563c-4c82-83f8-cae70b5ce7a0" + }, + { + "cell_type": "code", + "source": "%%sql \nSELECT 'Prod', SUM(price) AS total_sales\nFROM main.lakefs_demo.book_sales\nUNION ALL\nSELECT 'Dev', SUM(price) AS total_sales\nFROM dev.lakefs_demo.book_sales;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "72a1e80b-bd15-4d41-a9c6-4e623821d12c" + }, + { + "cell_type": "code", + "source": "%%sql\nSELECT\n au.name AS author_name,\n ROUND(SUM(s.price), 2) AS total_sales\nFROM main.lakefs_demo.books b\nLEFT JOIN main.lakefs_demo.authors au ON b.author_id = au.id\nLEFT JOIN main.lakefs_demo.book_sales s ON b.id = s.book_id\nGROUP BY au.name\nORDER BY total_sales DESC\nLIMIT 3;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "91cb40f6-863e-4e60-b238-0d07761e715d" + }, + { + "cell_type": "markdown", + "source": "## Running pipelines in isolation", + "metadata": {}, + "id": "d4fd6da2-1183-4fea-a18a-9e1ee8613765" + }, + { + "cell_type": "markdown", + "source": "### Remove Cancelled Sales", + "metadata": {}, + "id": "fce4be00-7588-4e74-bf8b-f58abc72eb80" + }, + { + "cell_type": "code", + "source": "%%sql\n\nDELETE FROM dev.lakefs_demo.book_sales\nWHERE id IN (10, 15, 2, 1, 6);", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "62906bfb-bb7d-4573-8ca8-8af54a9b9f1a" + }, + { + "cell_type": "markdown", + "source": "### Who are my top selling authors?", + "metadata": {}, + "id": "9f17ff41-ff66-4fb4-8f7b-742cf1bfef72" + }, + { + "cell_type": "code", + "source": "%%sql \n\nSELECT\n au.name AS author_name,\n ROUND(SUM(s.price), 2) AS total_sales\nFROM dev.lakefs_demo.books b\nLEFT JOIN dev.lakefs_demo.authors au ON b.author_id = au.id\nLEFT JOIN dev.lakefs_demo.book_sales s ON b.id = s.book_id\nGROUP BY au.name\nORDER BY total_sales DESC\nLIMIT 3;\n", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "cd67a158-72a3-4bcf-8321-da7774172ede" + }, + { + "cell_type": "markdown", + "source": "### Compare dev and main", + "metadata": {}, + "id": "1527eb15-9582-46ea-b613-e37994c25e23" + }, + { + "cell_type": "code", + "source": "%%sql\nSELECT\n au.name AS author_name,\n ROUND(SUM(s.price), 2) AS total_sales\nFROM main.lakefs_demo.books b\nLEFT JOIN main.lakefs_demo.authors au ON b.author_id = au.id\nLEFT JOIN main.lakefs_demo.book_sales s ON b.id = s.book_id\nGROUP BY au.name\nORDER BY total_sales DESC\nLIMIT 3;\n\n", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "d0ce540a-4f64-47b0-9c8a-5d8033d44f7f" + }, + { + "cell_type": "markdown", + "source": "### Commit dev 
changes", + "metadata": {}, + "id": "b0cacc8c-32e0-4eb7-8271-89136e82bc64" + }, + { + "cell_type": "code", + "source": "branchDev = repo.branch(devBranch)\n\nref = branchDev.commit(message=\"Removed Cancelled Sales\",\n metadata={'author': 'lakefs', \n '::lakefs::CodeVersion::url[url:ui]': 'http://localhost:8888/lab/workspaces/auto-y/tree/iceberg-books.ipynb'})\n\nprint_commit(ref.get_commit())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "b8453101-69c5-4e4c-a180-e269dfa574ab" + }, + { + "cell_type": "markdown", + "source": "### Merge Changes", + "metadata": {}, + "id": "e5acded1-302e-4ff8-83a9-2de885016f09" + }, + { + "cell_type": "code", + "source": "res = branchDev.merge_into(branchMain)\nprint(res)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "42a5c777-2838-4757-b8d3-989773aa298b" + }, + { + "cell_type": "code", + "source": "", + "metadata": {}, + "execution_count": null, + "outputs": [], + "id": "882c1bf9-6d3d-403d-ba75-1cbc3c8dd1d5" + } + ] +} \ No newline at end of file diff --git a/01_standalone_examples/aws-glue-iceberg/iceberg-lakefs-nyc.ipynb b/01_standalone_examples/aws-glue-iceberg/iceberg-lakefs-nyc.ipynb new file mode 100644 index 000000000..c90375d44 --- /dev/null +++ b/01_standalone_examples/aws-glue-iceberg/iceberg-lakefs-nyc.ipynb @@ -0,0 +1,694 @@ +{ + "metadata": { + "kernelspec": { + "name": "glue_pyspark", + "display_name": "Glue PySpark", + "language": "python" + }, + "language_info": { + "name": "Python_Glue_Session", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "pygments_lexer": "python3", + "file_extension": ".py" + } + }, + "nbformat_minor": 5, + "nbformat": 4, + "cells": [ + { + "cell_type": "markdown", + "source": "\"lakeFS \"Apache \n\n## lakeFS ❤️ Apache Iceberg - an example using NYC Film Permits dataset", + "metadata": {}, + "id": "2ccf8b6a-7076-43b9-82ee-d84fdfd75522" + }, + { + "cell_type": "code", + "source": "%stop_session\n%session_id_prefix 'iceberg-lakefs-nyc-demo'\n%idle_timeout 120\n%glue_version 4.0\n%worker_type G.1X\n%number_of_workers 2\n\n%additional_python_modules 'lakefs,tabulate'\n%extra_jars https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.3.0/iceberg-spark-runtime-3.3_2.12-1.3.0.jar, https://repo1.maven.org/maven2/io/lakefs/lakefs-iceberg/0.1.3/lakefs-iceberg-0.1.3.jar, https://repo1.maven.org/maven2/io/lakefs/lakefs-spark-extensions_2.12/0.0.3/lakefs-spark-extensions_2.12-0.0.3.jar\n\n%spark_conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "2ddc7800-646c-4e7e-a2e0-f521914b073f" + }, + { + "cell_type": "markdown", + "source": "# Config\n\n**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**", + "metadata": { + "tags": [] + }, + "id": "7e9ba792-9948-4d21-8346-8dac596063ab" + }, + { + "cell_type": "markdown", + "source": "### lakeFS endpoint and credentials", + "metadata": {}, + "id": "5a6cb169-f6ad-4927-abbc-cc740d7e1a82" + }, + { + "cell_type": "code", + "source": "lakefsEndPoint = 'https://username.aws_region_name.lakefscloud.io' # e.g. 
'https://username.aws_region_name.lakefscloud.io' \nlakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'\nlakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "418e64da-c208-49f3-963d-9bf9c488b52a" + }, + { + "cell_type": "markdown", + "source": "### Object Storage", + "metadata": { + "tags": [] + }, + "id": "b1f01e2d-b8a5-4820-bed0-efa56d784b80" + }, + { + "cell_type": "code", + "source": "storageNamespace = 's3://example' # e.g. \"s3://bucket\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "6ce9539c-ca43-466c-a820-ee8f3f6fa312" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "61b9f0c9-e77f-4d4f-a5d5-4b3fcff9bf65" + }, + { + "cell_type": "markdown", + "source": "# Setup\n\n**(you shouldn't need to change anything in this section, just run it)**", + "metadata": { + "tags": [] + }, + "id": "7b052ac3-055a-4b05-8f6a-cc39747ae5d3" + }, + { + "cell_type": "code", + "source": "repo_name = \"lakefs-iceberg-nyc\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "80d51253-2d89-4dc0-9879-1104a5228b60" + }, + { + "cell_type": "markdown", + "source": "### Versioning Information", + "metadata": {}, + "id": "8b702811-4da6-450b-8e5b-87fffb78e68e" + }, + { + "cell_type": "code", + "source": "mainBranch = \"main\"\ndevBranch = \"dev\"", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "244e5a2f-7c7f-4e62-ba58-082024804c3a" + }, + { + "cell_type": "markdown", + "source": "### Import libraries", + "metadata": {}, + "id": "fff51734-059c-4521-9df6-2dc2504ee1ec" + }, + { + "cell_type": "code", + "source": "import os\nimport lakefs", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3c69510e-3107-4055-811e-d232fb11d666" + }, + { + "cell_type": "code", + "source": "def print_diff(diff):\n results = map(\n lambda n:[n.path,n.path_type,n.size_bytes,n.type],\n diff)\n\n from tabulate import tabulate\n print(tabulate(\n results,\n headers=['Path','Path Type','Size(Bytes)','Type']))\n\ndef print_commit(log):\n from datetime import datetime\n from pprint import pprint\n\n print('Message:', log.message)\n print('ID:', log.id)\n print('Committer:', log.committer)\n print('Creation Date:', datetime.utcfromtimestamp(log.creation_date).strftime('%Y-%m-%d %H:%M:%S'))\n print('Parents:', log.parents)\n print('Metadata:')\n pprint(log.metadata)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "f953e6ee-1e6e-43b4-998a-c772bc346207" + }, + { + "cell_type": "markdown", + "source": "### Set environment variables", + "metadata": { + "tags": [] + }, + "id": "69d1ff89-480c-4421-8ed5-1b2e33a1fda9" + }, + { + "cell_type": "code", + "source": "os.environ[\"LAKECTL_SERVER_ENDPOINT_URL\"] = lakefsEndPoint\nos.environ[\"LAKECTL_CREDENTIALS_ACCESS_KEY_ID\"] = lakefsAccessKey\nos.environ[\"LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY\"] = lakefsSecretKey", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "fde3e9b6-2da3-4ae2-968a-acf874867a71" + }, + { + "cell_type": "markdown", + "source": "#### Verify lakeFS credentials by getting lakeFS version", + "metadata": {}, + "id": "af42373c-9c28-47ce-b060-204825fbce34" + }, + { + "cell_type": "code", + "source": "print(\"Verifying 
lakeFS credentials…\")\ntry:\n v=lakefs.client.Client().version\nexcept:\n print(\"🛑 failed to get lakeFS version\")\nelse:\n print(f\"…✅lakeFS credentials verified\\n\\nℹ️lakeFS version {v}\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "88d9debc-e52e-4b42-8e56-b52e7d798410" + }, + { + "cell_type": "markdown", + "source": "### Define lakeFS Repository", + "metadata": {}, + "id": "b7a2ebb0-b842-4584-8776-ef89976feb9a" + }, + { + "cell_type": "code", + "source": "repo = lakefs.Repository(repo_name).create(storage_namespace=f\"{storageNamespace}/{repo_name}\", default_branch=mainBranch, exist_ok=True)\nbranchMain = repo.branch(mainBranch)\nprint(repo)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "bf73bf83-546b-4cc0-a920-cfcb16d87a8c" + }, + { + "cell_type": "markdown", + "source": "### Set up Spark", + "metadata": {}, + "id": "c7403de6-88df-4aed-a8eb-61e6d5899859" + }, + { + "cell_type": "code", + "source": "spark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3.impl\",\"org.apache.hadoop.fs.s3a.S3AFileSystem\")\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\",lakefsSecretKey)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\",lakefsEndPoint)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\",lakefsAccessKey)\nspark.sparkContext._jsc.hadoopConfiguration().set(\"fs.s3a.path.style.access\",\"true\")\nspark.conf.set(\"spark.sql.defaultCatalog\",\"lakefs\")\nspark.conf.set(\"spark.sql.catalog.lakefs\",\"org.apache.iceberg.spark.SparkCatalog\")\nspark.conf.set(\"spark.sql.catalog.lakefs.catalog-impl\",\"io.lakefs.iceberg.LakeFSCatalog\")\nspark.conf.set(\"spark.sql.catalog.lakefs.warehouse\",f\"lakefs://{repo_name}\")\nspark.conf.set(\"spark.sql.catalog.lakefs.uri\",lakefsEndPoint)\nspark.conf.set(\"spark.sql.catalog.lakefs.cache-enabled\",\"false\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "e2169aff-ec7d-4f3d-9631-958c83f3f7ef" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "aae4b709-7631-4fec-b75b-0fc4a4c417f9" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "a226c33f-5a90-4b19-a037-56966130aa6c" + }, + { + "cell_type": "markdown", + "source": "# Main demo starts here 🚦 👇🏻", + "metadata": {}, + "id": "4a827676-a2bc-48f2-9f4b-c4b2d4ba8ff3" + }, + { + "cell_type": "markdown", + "source": "# Load some Data", + "metadata": {}, + "id": "eead44c0" + }, + { + "cell_type": "markdown", + "source": "For this demo, we will use the [New York City Film Permits dataset](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) available as part of the NYC Open Data initiative. 
We're using a locally saved copy of a 1000 record sample, but feel free to download the entire dataset to use in this notebook!\n\nWe'll save the sample dataset into an Iceberg table called `permits`, using lakeFS for the catalog.", + "metadata": {}, + "id": "6f9a9f41" + }, + { + "cell_type": "code", + "source": "df = spark.read.option(\"inferSchema\",\"true\").option(\"multiline\",\"true\").json(f\"s3a://{repo_name}/{mainBranch}/nyc_film_permits.json\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "7df5d071-3b20-4314-a857-dfd88956aba0" + }, + { + "cell_type": "code", + "source": "df.write.saveAsTable(\"lakefs.main.nyc.permits\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "a98c6bf0-746d-46d9-b4f8-0d17dc3c9e1f" + }, + { + "cell_type": "markdown", + "source": "Taking a quick peek at the data, you can see that there are a number of permits for different boroughs in New York.", + "metadata": {}, + "id": "378cf187" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT borough, count(*) AS permit_cnt\nFROM lakefs.main.nyc.permits\nGROUP BY borough", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "f3170161" + }, + { + "cell_type": "markdown", + "source": "### Commit the new table and its data", + "metadata": {}, + "id": "cf24d569-83ca-4c04-8435-2f51d55778dc" + }, + { + "cell_type": "code", + "source": "ref = branchMain.commit(\n message=\"Initial data load\",\n metadata={'author': 'lakefs',\n 'data source': 'https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p'})\nprint_commit(ref.get_commit())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "9b23e4a7-5245-425a-a5a8-8919ecb9a17c" + }, + { + "cell_type": "markdown", + "source": "# Create a new branch\n\n_This is copy-on-write; we're not duplicating the data_", + "metadata": {}, + "id": "3825889e-49bf-4eac-a23e-9250ab0b53e0" + }, + { + "cell_type": "code", + "source": "branchDev = repo.branch(devBranch).create(source_reference=mainBranch, exist_ok=True)\nprint(f\"{devBranch} ref:\", branchDev.get_commit().id)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "2118f922-c2c4-4d4d-b7ae-5fbcc4fff04d" + }, + { + "cell_type": "markdown", + "source": "### Confirm that we can see the data on the `dev` branch", + "metadata": {}, + "id": "4b81878b-3cca-457b-b344-35ef476ce3fb" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT count(*)\nFROM lakefs.dev.nyc.permits;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3591df18-ba42-47a8-bf1b-38374f0cfe79" + }, + { + "cell_type": "markdown", + "source": "# Making [and reverting] changes on the dev branch", + "metadata": {}, + "id": "437088f6" + }, + { + "cell_type": "markdown", + "source": "Let's go big! 
Let's see what happens when we delete the contents of the table with a careless `DELETE` omitting an all-important predicate", + "metadata": {}, + "id": "6cdece16-eb56-4459-94e5-9d0da350b235" + }, + { + "cell_type": "code", + "source": "%%sql\n\nDELETE FROM lakefs.dev.nyc.permits;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3d066bae-77d2-4ead-a04b-ded869e96ace" + }, + { + "cell_type": "markdown", + "source": "How's that data looking now?", + "metadata": {}, + "id": "e2f890ff-3353-4e87-9c5d-8a1f682d4ab1" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT count(*)\nFROM lakefs.dev.nyc.permits;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "5bba978c-b4f7-4c7f-9dfa-f51386072f93" + }, + { + "cell_type": "markdown", + "source": "But `main` is safe and unsullied 😌", + "metadata": {}, + "id": "95b9425b-ce20-44c3-a134-096d0fb9250c" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT count(*)\nFROM lakefs.main.nyc.permits;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "ceea4d24-d556-41f6-9ad4-708f603916c6" + }, + { + "cell_type": "markdown", + "source": "## Reverting changes to the `dev` branch", + "metadata": {}, + "id": "fada3e38-bd04-4639-b7c0-a0edf50b0263" + }, + { + "cell_type": "markdown", + "source": "### Uncommitted objects:", + "metadata": {}, + "id": "ef0ab7c3-0ddc-44f1-b4f0-6c9343994d5c" + }, + { + "cell_type": "code", + "source": "print_diff(branchDev.uncommitted())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "a44da2b8-068a-47c3-91b8-6140f6d2fbbb" + }, + { + "cell_type": "markdown", + "source": "### Reset the branch", + "metadata": {}, + "id": "4fae7c8a-2f9c-4e63-8a47-c3cdddfb7953" + }, + { + "cell_type": "code", + "source": "branchDev.reset_changes(path_type='common_prefix', path=\"nyc/permits/\")", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "fc5e7a5f-afe1-43e5-81cc-76d6422df220" + }, + { + "cell_type": "markdown", + "source": "_This just resets the changes to the files for this table. 
To reset the whole branch use_:\n\n```python\nbranchDev.reset_changes(path_type='reset')\n```", + "metadata": {}, + "id": "18e71e27-9d1e-4918-b674-631e484819f1" + }, + { + "cell_type": "markdown", + "source": "### Uncommitted objects:", + "metadata": {}, + "id": "353fe4a8-f178-46b3-8f6a-5161f69e6cf4" + }, + { + "cell_type": "code", + "source": "print_diff(branchDev.uncommitted())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "6d8342d8-ca4f-4dd9-a7bc-ff1ca5b01392" + }, + { + "cell_type": "markdown", + "source": "## Our data's back!", + "metadata": {}, + "id": "5e57d467-fade-415d-ba87-2ac72eb7927e" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT count(*)\nFROM lakefs.dev.nyc.permits;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "0b2f4a09-52ac-439e-993a-f408de64b94a" + }, + { + "cell_type": "markdown", + "source": "# Making changes to the `dev` branch as a collection", + "metadata": {}, + "id": "aac5a375-c3af-42d6-a12a-00ef37469a9d" + }, + { + "cell_type": "markdown", + "source": "## Delete all rows for permits in `Manhattan` from the table", + "metadata": {}, + "id": "7fe94d73-cbfa-4570-9736-e82498441cab" + }, + { + "cell_type": "code", + "source": "%%sql\n\nDELETE FROM lakefs.dev.nyc.permits WHERE borough='Manhattan';", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "14843243" + }, + { + "cell_type": "markdown", + "source": "## Build an aggregate of the data to show how many permits we issued by category", + "metadata": {}, + "id": "c891a7c9-d235-408c-be70-25fa701f923f" + }, + { + "cell_type": "code", + "source": "%%sql\n\nCREATE OR REPLACE TABLE lakefs.dev.nyc.agg_permit_category AS\nSELECT category, count(*) permit_cnt\nFROM lakefs.dev.nyc.permits\nGROUP BY category;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "e22154ae-e74d-4ae2-ad89-496dedfbcfe4" + }, + { + "cell_type": "code", + "source": "%%sql \n\nSELECT * FROM lakefs.dev.nyc.agg_permit_category LIMIT 5;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "4068734a-80e9-446a-a6ab-ccbe67f952a5" + }, + { + "cell_type": "markdown", + "source": "# Compare `main` and `dev`", + "metadata": {}, + "id": "8eb688c8" + }, + { + "cell_type": "markdown", + "source": "## `dev`", + "metadata": {}, + "id": "63d6217b-9963-46f6-8311-d1bc9286a889" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT borough, count(*) permit_cnt\nFROM lakefs.dev.nyc.permits\nGROUP BY borough", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "dfbd0d4b" + }, + { + "cell_type": "markdown", + "source": "## `main`", + "metadata": {}, + "id": "85332bf5" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT borough, count(*) permit_cnt\nFROM lakefs.main.nyc.permits\nGROUP BY borough", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "95df15e9" + }, + { + "cell_type": "markdown", + "source": "## `Row level diff`", + "metadata": {}, + "id": "c0b0791e-4e1f-42dd-8423-8f02a6a3e8b6" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT * FROM refs_data_diff('lakefs', 'main', 'dev', 'nyc.permits');", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "bad631fd-470c-43bc-a087-1da0469ea549" + 
}, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT lakefs_change, borough, count(*) AS permit_diffs_cnt\nFROM refs_data_diff('lakefs', 'main', 'dev', 'nyc.permits')\nGROUP BY lakefs_change, borough;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "dfe5140d-4f77-42fb-ad7f-e7ae5f1b605f" + }, + { + "cell_type": "markdown", + "source": "# Partition the data in the `dev` branch", + "metadata": {}, + "id": "1d43d15c-db6e-4a78-a094-a7a3efeffca5" + }, + { + "cell_type": "code", + "source": "%%sql\n\nCREATE TABLE lakefs.dev.nyc.permits_partitioned\nUSING iceberg\nPARTITIONED BY (borough)\nAS SELECT * FROM lakefs.dev.nyc.permits\nORDER BY borough;", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "e379178f-4e5d-42b7-b4cc-fd58e03a47e6" + }, + { + "cell_type": "code", + "source": "%%sql\n\nSELECT borough, count(*) permit_cnt\nFROM lakefs.dev.nyc.permits_partitioned\nGROUP BY borough", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3f18a68c-02ef-44be-ae26-e44bf9b454b9" + }, + { + "cell_type": "markdown", + "source": "# Commit the changes to the `dev` branch", + "metadata": {}, + "id": "fa771262-2dde-4ecc-b1a1-3dac9f580cfa" + }, + { + "cell_type": "code", + "source": "ref = branchDev.commit(\n message=\"Remove data for Manhattan from permits dataset, build category aggregate and partition the data\",\n metadata={\"etl job name\": \"etl_job_42\",\n \"author\": \"lakefs\"})\nprint_commit(ref.get_commit())", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "d8b77b63-0011-4051-bc69-4a37c18f88ff" + }, + { + "cell_type": "markdown", + "source": "# Merge the branch back into `main`", + "metadata": {}, + "id": "17435550-2212-4caf-a17e-d534ad9df31e" + }, + { + "cell_type": "code", + "source": "res = branchDev.merge_into(branchMain)\nprint(res)", + "metadata": { + "trusted": true, + "tags": [] + }, + "execution_count": null, + "outputs": [], + "id": "3d17bffc-facd-497f-ba55-5cc309538baf" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "ecea713c-8489-43e0-9968-9f4fd16e2d49" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "380d2d46-d9f8-4880-a924-aaeb7312ed42" + }, + { + "cell_type": "markdown", + "source": "---", + "metadata": {}, + "id": "99a77a15-7f39-4707-a364-99451c6fe0f6" + }, + { + "cell_type": "markdown", + "source": "## More Questions?\n\n###### Join the lakeFS Slack group - https://lakefs.io/slack", + "metadata": {}, + "id": "24c7709f-3395-492a-bc94-53f5d4e0b756" + }, + { + "cell_type": "code", + "source": "", + "metadata": {}, + "execution_count": null, + "outputs": [], + "id": "cff8fcb9-3fc0-4336-8b4a-6fc38f83f701" + } + ] +} \ No newline at end of file diff --git a/README.md b/README.md index 868eb8d0f..15dbaf015 100644 --- a/README.md +++ b/README.md @@ -88,6 +88,7 @@ Under the [standalone_examples](./01_standalone_examples/) folder are a set of e * [Databricks CI/CD](./01_standalone_examples/databricks-ci-cd/) * [AWS Glue and Athena](./01_standalone_examples/aws-glue-athena/) * [AWS Glue and Trino](./01_standalone_examples/aws-glue-trino/) +* [AWS Glue and Iceberg](./01_standalone_examples/aws-glue-iceberg/) * [lakeFS + Dagster](./01_standalone_examples/dagster-integration/) * [lakeFS + Prefect](./01_standalone_examples/prefect-integration/) * [Image Segmentation Demo: ML Data Version Control and 
Reproducibility at Scale](./01_standalone_examples/image-segmentation/)