From fa584adbfd0ac97b593f32d75e21413a4f9e217f Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Thu, 11 Jul 2024 18:25:33 +0530 Subject: [PATCH 1/5] Added how dlt uses arrow by jorrit (#1577) --- .../2024-07-11-how-dlt-uses-apache-arrow.md | 307 ++++++++++++++++++ 1 file changed, 307 insertions(+) create mode 100644 docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md diff --git a/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md b/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md new file mode 100644 index 0000000000..b589bf0071 --- /dev/null +++ b/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md @@ -0,0 +1,307 @@ +--- +slug: how-dlt-uses-apache-arrow +title: "How dlt uses Apache Arrow" +authors: + name: Jorrit Sandbrink + title: Open Source Software Engineer + url: https://github.com/jorritsandbrink + tags: [apache arrow, dlt] +canonical_url: "https://jorritsandbrink.substack.com/p/how-dlt-uses-apache-arrow-for-fast-pipelines" +--- + +:::tip TL;DR: +`dlt` uses Apache Arrow to make pipelines faster. The Arrow format is a better way +to represent tabular data in memory than native Python objects (list of dictionaries). It enables +offloading computation to Arrow’s fast C++ library, and prevents processing rows one by one. +::: + +Speed matters. Pipelines should move data quickly and efficiently. The bigger the data, the more +that holds true. Growing data volumes force performance optimization upon data processing tools. In +this blog I describe how `dlt` uses Arrow and why it makes data pipelines faster. + +## What is `dlt`? + +`dlt` is an open source Python library that lets you build data pipelines as code. It tries to make +data movement between systems easier. It gives data engineers a set of abstractions (e.g. source, +destination, pipeline) and a declarative API that saves them from writing lower level code. + +`dlt` doesn’t use a backend server/database. It’s “just a library” that can be embedded in a Python +process. `pip install dlt` and `import dlt` is all it takes. + +An example use case is loading data from a REST API (the source) into a data warehouse (the +destination) with a `dlt` pipeline that runs in a serverless cloud function (e.g. AWS Lambda). + +## What is Arrow? + +Arrow is an Apache project that standardizes data analytics systems. Among other things, it +specifies a format to represent analytics data in memory. + +Format characteristics: + +- language agnostic → it’s the same in C++, Rust, Python, or any other language + +- columnar → values for a column are stored contiguously + +- lightweight encoding → no general purpose compression (e.g. Snappy) or complex encodings + +- O(1) (constant-time) random access + +System interoperability and performance are two of the benefits of having this standard. + +## How `dlt` works + +Before explaining how `dlt` uses Arrow, I will first describe how `dlt` works at a high level. + +Pipeline steps + +A basic `dlt` pipeline has three main steps: + +1. extract + +1. normalize + +1. load + +**extract →** fetch data from source system and write to local disk + +**normalize →** read extracted data from local disk infer schema and transform data in memory write +transformed data to local disk + +**load →** read normalized data from local disk and ingest into destination system + +### Extraction + +extract is I/O intensive. + +`dlt` uses a Python generator function that fetches data from a source system and yields it into the +pipeline. 
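As a minimal, hypothetical sketch (the endpoint URL, pagination scheme, and field names are made up for illustration), such a generator might look like this:

```py
import dlt
import requests

@dlt.resource(table_name="events")
def fetch_events():
    # hypothetical REST API that serves one page of JSON records per request
    url = "https://api.example.com/events"
    page = 1
    while True:
        response = requests.get(url, params={"page": page})
        response.raise_for_status()
        records = response.json()
        if not records:
            break
        yield records  # a list of dictionaries goes into the pipeline
        page += 1
```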
This function is called a resource.

### Normalization

Steps 1 and 3 of normalize are I/O intensive. Step 2 is compute intensive. Step 2 has several
“substeps”:

1. identify tables, columns and their data types

2. apply a naming convention (e.g. snake_case) to table and column identifiers

3. add system columns → e.g. `_dlt_id` (row identifier) and `_dlt_load_id` (load identifier)

4. split nested data into parent and child tables

> Some of these substeps are already done during extract when using the Arrow route, as I explain
> later in this blog.

### Loading

Load is I/O intensive (and in some cases also compute intensive).

The data files persisted during normalize are loaded into the destination. How this is done differs
per destination.

## How `dlt` uses Arrow

`dlt` currently supports two different pipeline “routes”:

1. The traditional route → has existed since the earliest versions of `dlt`

1. The Arrow route → was added later as an improvement

The user decides which route is taken. It’s an implicit choice that depends on the type of object
yielded by the resource.

![Picture](https://storage.googleapis.com/dlt-blog-images/blog_data_engineering_with_jorrit.png)

## Traditional route

The traditional route uses native Python objects and row orientation to represent tabular data in
memory.

```py
import dlt

@dlt.resource
def my_traditional_resource():

    # native Python objects as table
    table = [
        {"foo": 23, "bar": True},
        {"foo": 7, "bar": False}
    ]

    yield table

# the destination is an arbitrary example; any configured destination works
pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(my_traditional_resource())
```

### extract

The resource yields Python dictionaries or lists of dictionaries into the pipeline. Each dictionary
is a row: keys are column names, values are column values. A list of such dictionaries can be seen
as a table.

The pipeline serializes the Python objects into a JSON-like byte stream (using orjson) and persists
them to binary disk files with a `.typed-jsonl` extension.

### normalize

The pipeline reads the extracted data from `.typed-jsonl` files back into memory and deserializes it.
It iterates over all table values in a nested for loop. The outer loop iterates over the rows, the
inner loop iterates over the columns. While looping, the pipeline performs the steps mentioned in
the paragraph called Normalization.

The normalized data is persisted to disk in a format that works well for the destination it will be
loaded into. For example, two of the formats are:

- jsonl → JSON Lines—default for the filesystem destination

- insert_values → a file storing INSERT SQL statements, compressed by default—default for some of
  the SQL destinations

### load

As mentioned, this step differs per destination. It also depends on the format of the file persisted
during normalize. Here are two examples to give an idea:

- jsonl files and filesystem destination → use a PUT operation

- insert_values files and SQL destination (e.g. postgres) → execute SQL statements on the SQL engine

## Arrow route

The Arrow route uses columnar Arrow objects to represent tabular data in memory. It relies on the
pyarrow Python library.

```py
import dlt
import pyarrow as pa

@dlt.resource
def my_arrow_resource():

    df = ...  # some process that creates a Pandas DataFrame

    # Arrow object as table
    table = pa.Table.from_pandas(df)

    yield table

# the destination is an arbitrary example; any configured destination works
pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(my_arrow_resource())
```

### extract

The resource yields Arrow objects into the pipeline.
These can be Arrow tables (`pyarrow.Table`) or
Arrow record batches (`pyarrow.RecordBatch`). Arrow objects are schema aware, meaning they store
column names and data types alongside the data.

The pipeline serializes the Arrow objects into Parquet files on disk. This is done with pyarrow’s
Parquet writer (`pyarrow.parquet.ParquetWriter`). Like Arrow objects, Parquet files are schema aware.
The Parquet writer simply translates the Arrow schema to a Parquet schema and persists it in the
file.

> The yielded Arrow objects are slightly normalized in the extract step. This prevents a rewrite in
> the normalize step. The normalizations done here are cheap metadata operations that don’t add much
> overhead to extract. For example, column names are adjusted if they don’t match the naming
> convention and column order is adjusted if it doesn’t match the table schema.

### normalize

Schema inference is not needed because the table schema can be read from the Parquet file.

There are three cases—in the ideal case, data does not need to be transformed:

1. **destination supports Parquet loading — no normalization (ideal):** the extracted Parquet
   files are simply “moved” to the load folder using an atomic rename. This is a cheap metadata
   operation. Data is not transformed and the data doesn’t actually move. `dlt` does not add row and
   load identifier columns.

1. **destination supports Parquet loading — normalization needed (okay):** the extracted Parquet
   files are loaded into memory in Arrow format. The necessary transformations (e.g. adding system
   columns or renaming column identifiers) are done using pyarrow methods. These operations are
   relatively cheap. Parquet and Arrow are both columnar and have similar data layouts.
   Transformations are done in batch, not on individual rows. Computations are done in C++, because
   pyarrow is a wrapper around the Arrow C++ library.

1. **destination does not support Parquet loading (not good):** the extracted Parquet files are
   read into memory and converted to a format supported by the destination (e.g. insert_values). This
   is an expensive operation. Parquet’s columnar format needs to be converted to row orientation.
   The rows are iterated over one by one to generate the load file.

### load

This step is the same as in the traditional route.

## Main differences

The most important differences between the traditional and Arrow routes are as follows.

- **in memory format**

  - traditional → native Python objects
  - Arrow → pyarrow objects

- **on disk format for normalized data**

  - traditional → defaults to jsonl
  - Arrow → defaults to parquet

- **schema inference**

  - traditional → handled by `dlt` during normalize—done in Python while iterating over rows
  - Arrow → two cases:
    - source system provides Arrow data: schema taken from source (no schema inference needed)
    - source system does not provide Arrow data: handled by pyarrow during extract when data is
      converted into Arrow objects, done in C++

- **data transformation for normalization**

  - traditional → handled by `dlt`—done in Python while iterating over rows
  - Arrow → handled by pyarrow—done in C++ on columnar batches of rows

## Why `dlt` uses Arrow

`dlt` uses Arrow to make pipelines faster. The normalize step in particular can be much more efficient
in the Arrow route.
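To make the difference concrete, here is a rough sketch (not `dlt`'s actual implementation) of adding a constant `_dlt_load_id` column to a small table, first row by row in Python, then as a single columnar operation in pyarrow; the load id value is made up:

```py
import pyarrow as pa

rows = [{"foo": 23, "bar": True}, {"foo": 7, "bar": False}]
load_id = "1721731200.123"  # made-up load id, for illustration only

# traditional route (sketch): visit every row in a Python loop
rows_with_id = [{**row, "_dlt_load_id": load_id} for row in rows]

# Arrow route (sketch): one columnar operation, executed by Arrow's C++ kernels
table = pa.Table.from_pylist(rows)
table = table.append_column("_dlt_load_id", pa.array([load_id] * table.num_rows))
```

With two rows the difference is invisible, but on millions of rows the first version performs millions of Python-level dictionary operations, while the second allocates a single column and updates table metadata.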
+ +Using pyarrow objects for tabular data is faster than using native Python objects (lists of +dictionaries), because they are: + +- schema aware + +- columnar + +- computed in C++ + +Generally speaking, C++ is much faster than Python. Moreover, Arrow’s C++ implementation can use +vectorization (SIMD) thanks to the columnar data layout. The Arrow route can process batches of +values concurrently in C++, while `dlt’s` traditional route needs iteration over values one by one in +a nested Python loop. + +Schema aware Arrow objects prevents `dlt` from having to infer column types from column values. + +## Further thoughts + +A potential optimization I can think of (but haven’t tested) is to use the Arrow IPC File Format to +serialize data between extract and normalize. This saves two format conversions: + +1. Arrow → Parquet (serialization at the end of extract) + +1. Parquet → Arrow (deserialization at the start of normalize) + +Although Arrow and Parquet have relatively similar layouts (especially when using Parquet without +general purpose compression), removing the (de)serialization steps might still improve performance +significantly. + +Simply disabling compression when writing the Parquet file could be an easier way to achieve similar +results. + +## Context + +I contribute to the open source `dlt` library, but didn’t implement the core framework logic related +to extraction, normalization, and loading described in this post. I’m enthusiastic about Arrow and +its implications for the data ecosystem, but haven’t contributed to their open source libraries. + +# Call to action +Try the SQL connector here with the various backends: [Docs](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database#pick-the-right-backend-to-load-table-data) + +Want to discuss performance? +[Join the dlt slack community!](https://dlthub.com/community) From 22e8e3899d6089273c71b7bc8c87acc4e4a130a3 Mon Sep 17 00:00:00 2001 From: adrianbr Date: Thu, 11 Jul 2024 15:08:26 +0200 Subject: [PATCH 2/5] add image link to blog post 2024-07-11-how-dlt-uses-apache-arrow.md --- docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md b/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md index b589bf0071..4ae6f12013 100644 --- a/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md +++ b/docs/website/blog/2024-07-11-how-dlt-uses-apache-arrow.md @@ -1,6 +1,7 @@ --- slug: how-dlt-uses-apache-arrow title: "How dlt uses Apache Arrow" +image: https://storage.googleapis.com/dlt-blog-images/blog_data_engineering_with_jorrit.png authors: name: Jorrit Sandbrink title: Open Source Software Engineer From 75410e609c4f724bf64a38dcf08f916f507d92db Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Wed, 17 Jul 2024 22:58:45 +0300 Subject: [PATCH 3/5] Remove links to the blog and docs from the navigation (#1602) --- docs/website/docusaurus.config.js | 7 ------- 1 file changed, 7 deletions(-) diff --git a/docs/website/docusaurus.config.js b/docs/website/docusaurus.config.js index d011de95b3..689e20caff 100644 --- a/docs/website/docusaurus.config.js +++ b/docs/website/docusaurus.config.js @@ -84,13 +84,6 @@ const config = { }, items: [ { label: 'dlt ' + (process.env.IS_MASTER_BRANCH ? 
"stable ": "devel ") + (process.env.DOCUSAURUS_DLT_VERSION || "0.0.1"), position: 'left', href: 'https://github.com/dlt-hub/dlt', className: 'version-navbar' }, - { - type: 'doc', - docId: 'intro', - position: 'left', - label: 'Docs', - }, - { to: 'blog', label: 'Blog', position: 'left' }, { href: 'https://dlthub.com/community', label: 'Join community', From 8b3166e5f92ed45203c04fa0bc3ca586f0730510 Mon Sep 17 00:00:00 2001 From: Anton Burnashev Date: Thu, 15 Aug 2024 15:06:59 +0200 Subject: [PATCH 4/5] Docs: handle renamed credential docs pages (#1694) --- docs/website/netlify.toml | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/website/netlify.toml b/docs/website/netlify.toml index 76e233efe0..ec6bd0ce49 100644 --- a/docs/website/netlify.toml +++ b/docs/website/netlify.toml @@ -5,3 +5,18 @@ to = "/docs/intro" [[redirects]] from = "/docs" to = "/docs/intro" + +[[redirects]] +from = "/docs/general-usage/credentials/configuration" +to = "/docs/general-usage/credentials/setup" +status = 301 + +[[redirects]] +from = "/docs/general-usage/credentials/config_providers" +to = "/docs/general-usage/credentials" +status = 301 + +[[redirects]] +from = "/docs/general-usage/credentials/config_specs" +to = "/docs/general-usage/credentials" +status = 301 From 9e9f20f5ca65186ac37c01c08f08f7952cc08bba Mon Sep 17 00:00:00 2001 From: Dave Date: Fri, 6 Sep 2024 10:53:24 +0200 Subject: [PATCH 5/5] Add historic docs versions (cherry picked from commit 8995f702cd26e82e289ca29414a6512e4803084d) --- docs/website/.gitignore | 6 ++ docs/website/README.md | 27 +++++- docs/website/docusaurus.config.js | 56 +++++++----- docs/website/package-lock.json | 27 ++---- docs/website/package.json | 11 ++- docs/website/sidebars.js | 17 ---- docs/website/tools/clear_versions.js | 11 +++ docs/website/tools/update_version_env.js | 17 ---- docs/website/tools/update_versions.js | 107 +++++++++++++++++++++++ 9 files changed, 198 insertions(+), 81 deletions(-) create mode 100644 docs/website/tools/clear_versions.js delete mode 100644 docs/website/tools/update_version_env.js create mode 100644 docs/website/tools/update_versions.js diff --git a/docs/website/.gitignore b/docs/website/.gitignore index bc94ffc693..861b383383 100644 --- a/docs/website/.gitignore +++ b/docs/website/.gitignore @@ -23,3 +23,9 @@ jaffle_shop npm-debug.log* yarn-debug.log* yarn-error.log* + +# ignore all versions, there are generated dynamically +versions.json +versioned_docs +versioned_sidebars +.dlt-repo \ No newline at end of file diff --git a/docs/website/README.md b/docs/website/README.md index b907d1380a..31bdbdde07 100644 --- a/docs/website/README.md +++ b/docs/website/README.md @@ -64,4 +64,29 @@ The site is deployed using `netlify`. The `netlify` build command is: npm run build:netlify ``` -It will place the build in `build/docs` folder. The `netlify.toml` redirects from root path `/` into `/docs`. \ No newline at end of file +It will place the build in `build/docs` folder. The `netlify.toml` redirects from root path `/` into `/docs`. + +## Docs versions + +We keep a few additional versions of our docs for the users to be able read about how former and future versions of dlt work. We use docusaurus versions for this but we do not check the historical versions into the repo but rather use a script to build the former versions on deployment. To locally build the versions run: + +``` +npm run update-versions +``` + +This will execute the script at tools/update_versions.js. 
This tool will do the following: + +* Find all the highest minor versions the tags of the repo (e.g. 0.4.13, 0.5.22, 1.1.3) +* It will create a version for all of these tags that are larger than the minimum version defined in MINIMUM_SEMVER_VERSION in the script. +* It will NOT create a version for the highest version, we assume the most up to date docs for the highest versions are the tip of master +* It will NOT create any docs versions for pre-releases. +* It will create a future version called "devel" from the current commit of this repo. +* It will set up docusaurus to display all of these versions correctly. + +You can clear these versions with + +``` +npm run clear-versions +``` + +The netflify deployment of these docs need to happen from the master branch so that the current version gets properly selected. \ No newline at end of file diff --git a/docs/website/docusaurus.config.js b/docs/website/docusaurus.config.js index 689e20caff..3afdcec52e 100644 --- a/docs/website/docusaurus.config.js +++ b/docs/website/docusaurus.config.js @@ -1,11 +1,43 @@ // @ts-check // Note: type annotations allow type checking and IDEs autocompletion +const fs = require("fs") require('dotenv').config() const lightCodeTheme = require('prism-react-renderer/themes/dracula'); // const lightCodeTheme = require('prism-react-renderer/themes/github'); const darkCodeTheme = require('prism-react-renderer/themes/dracula'); +// create versions config +const versions = {"current": { + label: 'devel', + path: 'devel', + noIndex: true +}} + +// inject master version renaming only if versions present +if (fs.existsSync("versions.json")) { + let latestLabel = "latest" + if (process.env.DOCUSAURUS_DLT_VERSION) { + latestLabel = `${process.env.DOCUSAURUS_DLT_VERSION} (latest)` + } + + + versions["master"] = { + label: latestLabel, + path: '/' + } + // disable indexing for all known versions + for (let v of JSON.parse(fs.readFileSync("versions.json"))) { + if (v == "master") { + continue; + } + versions[v] = { + noIndex: true + } + } + +} + /** @type {import('@docusaurus/types').Config} */ const config = { title: 'dlt Docs', @@ -30,8 +62,6 @@ const config = { locales: ['en'], }, - - presets: [ [ '@docusaurus/preset-classic', @@ -50,12 +80,7 @@ const config = { editUrl: (params) => { return "https://github.com/dlt-hub/dlt/tree/devel/docs/website/docs/" + params.docPath; }, - versions: { - current: { - label: 'current', - }, - }, - lastVersion: 'current', + versions: versions, showLastUpdateAuthor: true, showLastUpdateTime: true, }, @@ -83,7 +108,9 @@ const config = { href: 'https://dlthub.com' }, items: [ - { label: 'dlt ' + (process.env.IS_MASTER_BRANCH ? "stable ": "devel ") + (process.env.DOCUSAURUS_DLT_VERSION || "0.0.1"), position: 'left', href: 'https://github.com/dlt-hub/dlt', className: 'version-navbar' }, + { + type: 'docsVersionDropdown', + }, { href: 'https://dlthub.com/community', label: 'Join community', @@ -189,15 +216,4 @@ const config = { ], }; -if (!process.env.IS_MASTER_BRANCH && config.themeConfig) { - config.themeConfig.announcementBar = { - id: 'devel docs', - content: - 'This is the development version of the dlt docs. 
Go to the stable docs.', - backgroundColor: '#4c4898', - textColor: '#fff', - isCloseable: false, - } -} - module.exports = config; diff --git a/docs/website/package-lock.json b/docs/website/package-lock.json index f99a70dbf7..f1409905f1 100644 --- a/docs/website/package-lock.json +++ b/docs/website/package-lock.json @@ -20,6 +20,7 @@ "react": "^17.0.2", "react-dom": "^17.0.2", "react-twitter-embed": "^4.0.4", + "semver": "^7.6.3", "string-dedent": "^3.0.1", "sync-fetch": "^0.5.2", "toml": "^3.0.0" @@ -11075,12 +11076,10 @@ } }, "node_modules/semver": { - "version": "7.5.4", - "resolved": "https://registry.npmjs.org/semver/-/semver-7.5.4.tgz", - "integrity": "sha512-1bCSESV6Pv+i21Hvpxp3Dx+pSD8lIPt8uVjRrxAUt/nbswYc+tK6Y2btiULjd4+fnq15PX+nqQDC7Oft7WkwcA==", - "dependencies": { - "lru-cache": "^6.0.0" - }, + "version": "7.6.3", + "resolved": "https://registry.npmjs.org/semver/-/semver-7.6.3.tgz", + "integrity": "sha512-oVekP1cKtI+CTDvHWYFUcMtsK/00wmAEfyqKfNdARm8u1wNVhSgaX7A8d4UuIlUI5e84iEwOhs7ZPYRmzU9U6A==", + "license": "ISC", "bin": { "semver": "bin/semver.js" }, @@ -11107,22 +11106,6 @@ "semver": "bin/semver.js" } }, - "node_modules/semver/node_modules/lru-cache": { - "version": "6.0.0", - "resolved": "https://registry.npmjs.org/lru-cache/-/lru-cache-6.0.0.tgz", - "integrity": "sha512-Jo6dJ04CmSjuznwJSS3pUeWmd/H0ffTlkXXgwZi+eq1UCmqQwCh+eLsYOYCwY991i2Fah4h1BEMCx4qThGbsiA==", - "dependencies": { - "yallist": "^4.0.0" - }, - "engines": { - "node": ">=10" - } - }, - "node_modules/semver/node_modules/yallist": { - "version": "4.0.0", - "resolved": "https://registry.npmjs.org/yallist/-/yallist-4.0.0.tgz", - "integrity": "sha512-3wdGidZyq5PB084XLES5TpOSRA3wjXAlIWMhum2kRcv/41Sn2emQ0dycQW4uZXLejwKvg6EsvbdlVL+FYEct7A==" - }, "node_modules/send": { "version": "0.18.0", "resolved": "https://registry.npmjs.org/send/-/send-0.18.0.tgz", diff --git a/docs/website/package.json b/docs/website/package.json index becf1c8bc6..d904c8d77b 100644 --- a/docs/website/package.json +++ b/docs/website/package.json @@ -4,16 +4,18 @@ "private": true, "scripts": { "docusaurus": "docusaurus", - "start": "PYTHONPATH=. poetry run pydoc-markdown && node tools/update_version_env.js && node tools/preprocess_docs.js && concurrently --kill-others \"node tools/preprocess_docs.js --watch\" \"docusaurus start\"", - "build": "node tools/preprocess_docs.js && PYTHONPATH=. poetry run pydoc-markdown && node tools/update_version_env.js && docusaurus build", - "build:netlify": "node tools/preprocess_docs.js && PYTHONPATH=. pydoc-markdown && node tools/update_version_env.js && docusaurus build --out-dir build/docs", + "start": "PYTHONPATH=. poetry run pydoc-markdown && node tools/preprocess_docs.js && concurrently --kill-others \"node tools/preprocess_docs.js --watch\" \"docusaurus start\"", + "build": "node tools/update_versions.js && node tools/preprocess_docs.js && PYTHONPATH=. poetry run pydoc-markdown && docusaurus build", + "build:netlify": "node tools/update_versions.js && node tools/preprocess_docs.js && PYTHONPATH=. pydoc-markdown && docusaurus build --out-dir build/docs", "swizzle": "docusaurus swizzle", "clear": "docusaurus clear", "serve": "docusaurus serve", "write-translations": "docusaurus write-translations", "write-heading-ids": "docusaurus write-heading-ids", "preprocess-docs": "node tools/preprocess_docs.js", - "generate-api-reference": "PYTHONPATH=. poetry run pydoc-markdown" + "generate-api-reference": "PYTHONPATH=. 
poetry run pydoc-markdown", + "clear-versions": "node tools/clear_versions.js", + "update-versions": "node tools/update_versions.js" }, "dependencies": { "@docusaurus/core": "2.4.3", @@ -28,6 +30,7 @@ "react": "^17.0.2", "react-dom": "^17.0.2", "react-twitter-embed": "^4.0.4", + "semver": "^7.6.3", "string-dedent": "^3.0.1", "sync-fetch": "^0.5.2", "toml": "^3.0.0" diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 921c3c0dc4..56c2eb165c 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -356,21 +356,4 @@ if (fs.existsSync('./docs_processed/api_reference/sidebar.json')) { } } -// on the master branch link to devel and vice versa -if (process.env.IS_MASTER_BRANCH) { - sidebars.tutorialSidebar.push( { - type: 'link', - label: 'Switch to Devel Docs', - href: 'https://dlthub.com/devel/intro', - className: 'learn-more-link', - }) -} else { - sidebars.tutorialSidebar.push( { - type: 'link', - label: 'Switch to Stable Docs', - href: 'https://dlthub.com/docs/intro', - className: 'learn-more-link', - }) -} - module.exports = sidebars; diff --git a/docs/website/tools/clear_versions.js b/docs/website/tools/clear_versions.js new file mode 100644 index 0000000000..e66f5ed823 --- /dev/null +++ b/docs/website/tools/clear_versions.js @@ -0,0 +1,11 @@ +const fs = require('fs'); + +version_files = [ + "versions.json", + "versioned_docs", + "versioned_sidebars" +] + +for (let f of version_files) { + fs.rmSync(f, { recursive: true, force: true }) +} \ No newline at end of file diff --git a/docs/website/tools/update_version_env.js b/docs/website/tools/update_version_env.js deleted file mode 100644 index 15d10bd970..0000000000 --- a/docs/website/tools/update_version_env.js +++ /dev/null @@ -1,17 +0,0 @@ -// creates an .env file with the current version of the dlt for consumption in docusaurus.config.js - -const { readFile, writeFile } = require('fs/promises') -const tom = require('toml'); - -const TOML_FILE = '../../pyproject.toml'; -const ENV_FILE = '.env' - -async function update_env() { - const fileContent = await readFile(TOML_FILE, 'utf8'); - const toml = tom.parse(fileContent); - const version = toml['tool']['poetry']['version']; - const envFileContent = `DOCUSAURUS_DLT_VERSION=${version}`; - await writeFile(ENV_FILE, envFileContent, 'utf8'); -} - -update_env(); \ No newline at end of file diff --git a/docs/website/tools/update_versions.js b/docs/website/tools/update_versions.js new file mode 100644 index 0000000000..855766c5dd --- /dev/null +++ b/docs/website/tools/update_versions.js @@ -0,0 +1,107 @@ +const proc = require('child_process') +const fs = require('fs'); +const semver = require('semver') + + +// const +const REPO_DIR = ".dlt-repo" +const REPO_DOCS_DIR = REPO_DIR + "/docs/website" +const REPO_URL = "https://github.com/dlt-hub/dlt.git" +const VERSIONED_DOCS_FOLDER = "versioned_docs" +const VERSIONED_SIDEBARS_FOLDER = "versioned_sidebars" +const ENV_FILE = '.env' + +// no doc versions below this version will be deployed +const MINIMUM_SEMVER_VERSION = "0.5.0" + +// clear old repo version +fs.rmSync(REPO_DIR, { recursive: true, force: true }) + +// checkout fresh +console.log("Checking out dlt repo") +proc.execSync(`git clone ${REPO_URL} ${REPO_DIR}`) + +// find tags +console.log("Discovering versions") +const tags = proc.execSync(`cd ${REPO_DIR} && git tag`).toString().trim().split("\n"); +console.log(`Found ${tags.length} tags`) + +// parse and filter invalid tags +let versions = tags.map(v => semver.valid(v)).filter(v => v != null) + +// remove all tags 
below the min version and sort +min_version = semver.valid(MINIMUM_SEMVER_VERSION) +versions = semver.rsort(versions.filter(v => semver.gt(v, min_version))) + +// remove prelease versions +versions.filter(v => semver.prerelease(v) == null) + +console.log(`Found ${versions.length} elligible versions`) +if (versions.length < 2) { + console.error("Sanity check failed, not enough elligble version tags found") + process.exit(1) +} + +// write last version into env file +const envFileContent = `DOCUSAURUS_DLT_VERSION=${versions[0]}`; +fs.writeFileSync(ENV_FILE, envFileContent, 'utf8'); + +// go through the versions and find all newest versions of any major version +// the newest version is replace by the master branch here so the master head +// always is the "current" doc +const selectedVersions = ["master"]; +let lastVersion = versions[0]; +for (let ver of versions) { + if ( semver.major(ver) != semver.major(lastVersion)) { + selectedVersions.push(ver) + } + lastVersion = ver; +} + +console.log(`Will create docs versions for ${selectedVersions}`) + +// create folders +fs.rmSync(VERSIONED_DOCS_FOLDER, { recursive: true, force: true }) +fs.rmSync(VERSIONED_SIDEBARS_FOLDER, { recursive: true, force: true }) +fs.rmSync("versions.json", { force: true }) + +fs.mkdirSync(VERSIONED_DOCS_FOLDER); +fs.mkdirSync(VERSIONED_SIDEBARS_FOLDER); + +// check that checked out repo is on devel +console.log("Checking branch") +const branch = proc.execSync(`cd ${REPO_DIR} && git rev-parse --abbrev-ref HEAD`).toString().trim() + +// sanity check +if (branch != "devel") { + console.error("Could not check out devel branch") + process.exit(1) +} + +selectedVersions.reverse() +for (const version of selectedVersions) { + + // checkout verison and verify we have the right tag + console.log(`Generating version ${version}, switching to tag:`) + proc.execSync(`cd ${REPO_DIR} && git checkout ${version}`) + + // const tag = proc.execSync(`cd ${REPO_DIR} && git describe --exact-match --tags`).toString().trim() + // if (tag != version) { + // console.error(`Could not checkout version ${version}`) + // process.exit(1) + // } + + // build doc version, we also run preprocessing and markdown gen for each doc version + console.log(`Building docs...`) + proc.execSync(`cd ${REPO_DOCS_DIR} && npm run preprocess-docs && PYTHONPATH=. pydoc-markdown`) + + console.log(`Snapshotting version...`) + proc.execSync(`cd ${REPO_DOCS_DIR} && npm run docusaurus docs:version ${version}`) + + console.log(`Moving snapshot`) + fs.cpSync(REPO_DOCS_DIR+"/"+VERSIONED_DOCS_FOLDER, VERSIONED_DOCS_FOLDER, {recursive: true}) + fs.cpSync(REPO_DOCS_DIR+"/"+VERSIONED_SIDEBARS_FOLDER, VERSIONED_SIDEBARS_FOLDER, {recursive: true}) + +} + +fs.cpSync(REPO_DOCS_DIR+"/versions.json", "versions.json")