merging with main for linting fix
Alex-Monahan committed Aug 24, 2023
2 parents 510d1d6 + 3e11d47 commit 5f6f07e
Showing 58 changed files with 215 additions and 173 deletions.
1 change: 1 addition & 0 deletions .github/linkchecker/linkchecker.conf
@@ -18,3 +18,4 @@ ignore=
https://nbviewer\.org/.*
https://open\.spotify\.com/.*
https://www\.dataengineeringpodcast\.com/.*
https://www.tpc.org/.*
4 changes: 1 addition & 3 deletions .github/workflows/lint.yml
@@ -18,9 +18,7 @@ jobs:
- uses: articulate/actions-markdownlint@main
with:
config: .markdownlint.jsonc
files: 'docs/**/*.md'
# TODO:
# files: 'docs/**/*.md _posts/*.md dev/*.md'
files: 'docs/**/*.md _posts/*.md dev/*.md'

python:
runs-on: ubuntu-latest
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -41,11 +41,13 @@ Some of this style guide is automated with GitHub Actions, but feel free to run
* Quoted blocks (lines starting with `>`) are rendered as [a "Note" box](https://duckdb.org/docs/archive/0.8.1/guides/python/filesystems).
* Always format SQL code, variable names, function names, etc. as code. For example, when talking about the `CREATE TABLE` statement, the keywords should be formatted as code.
* When presenting SQL statements, do not include the DuckDB prompt (`D `) in the documentation.
* SQL statements should end with a semicolon (`;`) to allow readers to quickly paste them into a SQL console.

### Headers

* The title of the page should be encoded in the front matter's `title` property. The body of the page should not start with a repetition of this title.
* In the body of the page, restrict the use of headers to the following levels: h2 (`##`), h3 (`###`), and h4 (`####`).
* Use headline capitalization as defined in the [Chicago Manual of Style](https://headlinecapitalization.com/).

### SQL style

2 changes: 1 addition & 1 deletion _config_exclude_archive.yml
@@ -1,2 +1,2 @@
# bundler exec jekyll serve --incremental --config _config.yml,_config_exclude_archive.yml
exclude: ['docs/archive']
exclude: ['docs/archive', 'vendor']
4 changes: 4 additions & 0 deletions _data/menu_docs_dev.json
@@ -925,6 +925,10 @@
"page": "HTTPFS",
"url": "httpfs"
},
{
"page": "Iceberg",
"url": "iceberg"
},
{
"page": "JSON",
"url": "json"
8 changes: 4 additions & 4 deletions _posts/2021-08-27-external-sorting.md
@@ -372,12 +372,12 @@ We see similar trends at SF10 and SF100, but for SF100, at around 12 payload col
ClickHouse switches to an external sorting strategy, which is much slower than its in-memory strategy.
Therefore, adding a few payload columns results in a runtime that is orders of magnitude higher.
At 20 payload columns ClickHouse runs into the following error:
```
```text
DB::Exception: Memory limit (for query) exceeded: would use 11.18 GiB (attempt to allocate chunk of 4204712 bytes), maximum: 11.18 GiB: (while reading column cs_list_price): (while reading from part ./store/523/5230c288-7ed5-45fa-9230-c2887ed595fa/all_73_108_2/ from mark 4778 with max_rows_to_read = 8192): While executing MergeTreeThread.
```

HyPer also drops in performance before erroring out with the following message:
```
```text
ERROR: Cannot allocate 333982248 bytes of memory: The `global memory limit` limit of 12884901888 bytes was exceeded.
```
As far as we are aware, HyPer uses [`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html), which creates a mapping between memory and a file.
@@ -392,7 +392,7 @@ Using swap usually slows down processing significantly, but the SSD is so fast t
While Pandas loads the data, swap size grows to an impressive \~40 GB: Both the file and the data frame are fully in memory/swap at the same time, rather than streamed into memory.
This goes down to \~20 GB of memory/swap when the file is done being read.
Pandas is able to get quite far into the experiment until it crashes with the following error:
```
```text
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
```

@@ -521,7 +521,7 @@ We have set the number of threads that DuckDB and ClickHouse use to 8 because we

Pandas performs comparatively worse than on the MacBook, because it has a single-threaded implementation, and this CPU has a lower single-thread performance.
Again, Pandas crashes with an error (this machine does not dynamically increase swap):
```
```text
numpy.core._exceptions.MemoryError: Unable to allocate 6.32 GiB for an array with shape (6, 141430723) and data type float64
```

4 changes: 2 additions & 2 deletions _posts/2021-10-29-duckdb-wasm.md
@@ -245,14 +245,14 @@ In 2018, the Spectre and Meltdown vulnerabilities sent crippling shockwaves thro

Without `SharedArrayBuffers`, WebAssembly modules can run in a dedicated web worker to unblock the main event loop but won't be able to spawn additional workers for parallel computations within the same instance. By default, we therefore cannot unleash the parallel query execution of DuckDB in the web. However, browser vendors have recently started to reenable `SharedArrayBuffers` for websites that are [cross-origin-isolated](https://web.dev/coop-coep/). A website is cross-origin-isolated if it ships the main document with the following HTTP headers:

```
```text
Cross-Origin-Embedded-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin
```

These headers will instruct browsers to A) isolate the top-level document from other top-level documents outside its own origin and B) prevent the document from making arbitrary cross-origin requests unless the requested resource explicitly opts in. Both restrictions have far reaching implications for a website since many third-party data sources won't yet provide the headers today and the top-level isolation currently hinders the communication with, for example, OAuth pop up's ([there are plans to lift that](https://github.com/whatwg/html/issues/6364)).

*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users. Share your thoughts with us [here]().*
*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users.*

## Web Shell

2 changes: 1 addition & 1 deletion _posts/2021-11-26-duck-enum.md
@@ -121,7 +121,7 @@ df_out <- dbReadTable(con, "characters")

To demonstrate the performance of DuckDB when running operations on categorical columns of Pandas DataFrames, we present a number of benchmarks. The source code for the benchmarks is available [here](https://raw.githubusercontent.com/duckdb/duckdb-web/main/_posts/benchmark_scripts/enum.py). In our benchmarks we always consume and produce Pandas DataFrames.

#### Dataset
### Dataset

Our dataset is composed of one dataframe with 4 columns and 10 million rows. The first two columns are named ```race``` and ```subrace``` representing races. They are both categorical, with the same categories but different values. The other two columns ```race_string``` and ```subrace_string``` are the string representations of ```race``` and ```subrace```.

2 changes: 1 addition & 1 deletion _posts/2021-12-03-duck-arrow.md
@@ -83,7 +83,7 @@ nyc.filter("year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amou

In this section, we will look at some basic examples of the code needed to read and output Arrow tables in both Python and R.

#### Setup
### Setup

First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below.
```bash
2 changes: 1 addition & 1 deletion _posts/2022-05-04-friendlier-sql.md
@@ -267,7 +267,7 @@ In addition to what has already been implemented, several other improvements hav
- Clickhouse supports this with the [`COLUMNS` expression](https://clickhouse.com/docs/en/sql-reference/statements/select/#columns-expression)
- Incremental column aliases
- Refer to previously defined aliases in subsequent calculated columns rather than re-specifying the calculations
- Dot operators for JSON types
- Dot operators for JSON types
- The JSON extension is brand new ([see our documentation!](https://duckdb.org/docs/extensions/json)) and already implements friendly `->` and `->>` syntax

Thanks for checking out DuckDB! May the Force be with you...
2 changes: 1 addition & 1 deletion _posts/2022-09-30-postgres-scanner.md
@@ -51,7 +51,7 @@ CALL postgres_attach('dbname=myshinydb');
`postgres_attach` takes a single required string parameter, which is the [`libpq` connection string](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING). For example you can pass `'dbname=myshinydb'` to select a different database name. In the simplest case, the parameter is just `''`. There are three additional named parameters to the function:
* `source_schema` the name of a non-standard schema name in Postgres to get tables from. Default is `public`.
* `overwrite` whether we should overwrite existing views in the target schema, default is `false`.
* `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls.
* `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls.

The tables in the database are registered as views in DuckDB, you can list them with
```SQL
14 changes: 8 additions & 6 deletions _posts/2022-11-14-announcing-duckdb-060.md
@@ -107,7 +107,7 @@ CREATE TABLE messages(u UNION(num INT, error VARCHAR));
INSERT INTO messages VALUES (42);
INSERT INTO messages VALUES ('oh my globs');
```
```
```text
SELECT * FROM messages;
┌─────────────┐
│ u │
@@ -142,7 +142,7 @@ CREATE TABLE obs(id INT, val1 INT, val2 INT);
INSERT INTO obs VALUES (1, 10, 100), (2, 20, NULL), (3, NULL, 300);
SELECT MIN(COLUMNS(*)), COUNT(*) from obs;
```
```
```text
┌─────────────┬───────────────┬───────────────┬──────────────┐
│ min(obs.id) │ min(obs.val1) │ min(obs.val2) │ count_star() │
├─────────────┼───────────────┼───────────────┼──────────────┤
@@ -155,7 +155,7 @@ The `COLUMNS` expression supports all star expressions, including [the `EXCLUDE`
```sql
SELECT COLUMNS('val[0-9]+') from obs;
```
```
```text
┌──────┬──────┐
│ val1 │ val2 │
├──────┼──────┤
@@ -170,7 +170,7 @@ SELECT COLUMNS('val[0-9]+') from obs;
```sql
SELECT [x + 1 for x in [1, 2, 3]] AS l;
```
```
```text
┌───────────┐
│ l │
├───────────┤
@@ -211,8 +211,10 @@ The DuckDB shell also offers several improvements over the SQLite shell, such as

The number of rows that are rendered can be changed by using the `.maxrows X` setting, and you can switch back to the old rendering using the `.mode box` command.

```
```sql
D SELECT * FROM '~/Data/nyctaxi/nyc-taxi/2014/04/data.parquet';
```
```text
┌───────────┬─────────────────────┬─────────────────────┬───┬────────────┬──────────────┬──────────────┐
│ vendor_id │ pickup_at │ dropoff_at │ … │ tip_amount │ tolls_amount │ total_amount │
│ varchar │ timestamp │ timestamp │ │ float │ float │ float │
@@ -265,7 +267,7 @@ SELECT student_id FROM 'data/ -> data/grades.csv

**Progress Bars**. DuckDB has [supported progress bars in queries for a while now](https://github.com/duckdb/duckdb/pull/1432), but they have always been opt-in. In this release we have [prettied up the progress bar](https://github.com/duckdb/duckdb/pull/5187) and enabled it by default in the shell. The progress bar will pop up when a query is run that takes more than 2 seconds, and display an estimated time-to-completion for the query.

```
```sql
D copy lineitem to 'lineitem-big.parquet';
32% ▕███████████████████▏ ▏
```
16 changes: 8 additions & 8 deletions _posts/2023-02-13-announcing-duckdb-070.md
@@ -44,7 +44,7 @@ COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));

This will cause the Parquet files to be written in the following directory structure:

```
```text
orders
├── year=2021
│ ├── month=1
@@ -145,7 +145,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
>>> lineitem = duckdb.sql('FROM lineitem.parquet')
>>> lineitem.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -161,7 +161,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
>>> lineitem_filtered = duckdb.sql('FROM lineitem WHERE l_orderkey>5000')
>>> lineitem_filtered.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -176,7 +176,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
```py
>>> duckdb.sql('SELECT MIN(l_orderkey), MAX(l_orderkey) FROM lineitem_filtered').show()
```
```
```text
┌─────────────────┬─────────────────┐
│ min(l_orderkey) │ max(l_orderkey) │
│ int32 │ int32 │
@@ -193,7 +193,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk
>>> lineitem = duckdb.read_csv('lineitem.csv')
>>> lineitem.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -208,7 +208,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk
```py
>>> duckdb.sql('select min(l_orderkey) from lineitem').show()
```
```
```text
┌─────────────────┐
│ min(l_orderkey) │
│ int32 │
@@ -225,7 +225,7 @@ import duckdb
duckdb.sql('select 42').pl()
```

```
```text
shape: (1, 1)
┌─────┐
│ 42 │
@@ -245,7 +245,7 @@ df = pl.DataFrame({'a': 42})
duckdb.sql('select * from df').pl()
```

```
```text
shape: (1, 1)
┌─────┐
│ a │
4 changes: 2 additions & 2 deletions _posts/2023-04-14-h2oai.md
@@ -30,7 +30,7 @@ The time reported is the sum of the time it takes to run all 5 queries twice.

More information about the specific queries can be found below.

#### The Data and Queries
### The Data and Queries

The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation.

@@ -87,7 +87,7 @@ You can also look at the results [here](https://duckdblabs.github.io/db-benchmar

Some solutions may report internal errors for some queries. Feel free to investigate the errors by using the [_utils/repro.sh](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh) script and file a github issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system or if you have any other questions, you can create a github issue to discuss.

# Maintenance plan
## Maintenance plan

DuckDB will continue to maintain this benchmark for the forseeable future. The process for re-running the benchmarks with updated library versions must still be decided.

2 changes: 1 addition & 1 deletion _posts/2023-04-21-swift.md
@@ -69,7 +69,7 @@ One problem with our current `ExoplanetStore` type is that it doesn’t yet cont

There are hundreds of configuration options for this incredible resource, but today we want each exoplanet’s name and its discovery year packaged as a CSV. [Checking the docs](https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html) gives us the following endpoint:

```
```text
https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv
```

2 changes: 1 addition & 1 deletion _posts/2023-05-26-correlated-subqueries-in-sql.md
@@ -238,7 +238,7 @@ WHERE distance=(
);
```

```
```text
┌───────────────────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
4 changes: 2 additions & 2 deletions dev/sqllogictest/intro.md
@@ -62,7 +62,7 @@ A syntax highlighter exists for [Visual Studio Code](https://marketplace.visuals

A syntax highlighter is also available for [CLion](https://plugins.jetbrains.com/plugin/15295-sqltest). It can be installed directly on the IDE by searching SQLTest on the marketplace. A [github repository](https://github.com/pdet/SQLTest) is also available, with extensions and bug reports being welcome.

##### Temporary Files
#### Temporary Files

For some tests (e.g. CSV/Parquet file format tests) it is necessary to create temporary files. Any temporary files should be created in the temporary testing directory. This directory can be used by placing the string `__TEST_DIR__` in a query. This string will be replaced by the path of the temporary testing directory.

Expand All @@ -71,7 +71,7 @@ statement ok
COPY csv_data TO '__TEST_DIR__/output_file.csv.gz' (COMPRESSION GZIP);
```

##### Require & Extensions
#### Require & Extensions

To avoid bloating the core system, certain functionality of DuckDB is available only as an extension. Tests can be build for those extensions by adding a `require` field in the test. If the extension is not loaded, any statements that occurs after the require field will be skipped. Examples of this are `require parquet` or `require icu`.

12 changes: 6 additions & 6 deletions docs/api/cli.md
@@ -275,9 +275,9 @@ D .schema
```

```sql
CREATE TABLE fliers(animal VARCHAR);;
CREATE TABLE swimmers(animal VARCHAR);;
CREATE TABLE walkers(animal VARCHAR);;
CREATE TABLE fliers(animal VARCHAR);
CREATE TABLE swimmers(animal VARCHAR);
CREATE TABLE walkers(animal VARCHAR);
```

## Opening Database Files
@@ -452,7 +452,7 @@ Note that the duck head is built with unicode characters and does not always wor
-- Duck head prompt
.prompt '⚫◗ '
-- Example SQL statement
SELECT 'Begin quacking!' as "Ready, Set, ..."
SELECT 'Begin quacking!' AS "Ready, Set, ...";
```

To invoke that file on initialization, use this command:
Expand All @@ -477,7 +477,7 @@ Use ".open FILENAME" to reopen on a persistent database.
⚫◗
```

## Non-interactive usage
## Non-interactive Usage

To read/process a file and exit immediately, pipe the file contents in to `duckdb`:

@@ -521,7 +521,7 @@ D LOAD 'fts';

<!-- SQL parameters do not appear to work -->

## Reading from stdin and writing to stdout
## Reading from stdin and Writing to stdout

When in a Unix environment, it can be useful to pipe data between multiple commands.
DuckDB is able to read data from stdin as well as write to stdout using the file location of stdin (`/dev/stdin`) and stdout (`/dev/stdout`) within SQL commands, as pipes act very similarly to file handles.
