From 344accae80d18e6af02cada98254ce336eed9e32 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 23 Aug 2023 16:11:06 +0200 Subject: [PATCH 01/17] Add short section on extension template --- docs/extensions/overview.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/extensions/overview.md b/docs/extensions/overview.md index 14fb1ac02a8..4370d74c78e 100644 --- a/docs/extensions/overview.md +++ b/docs/extensions/overview.md @@ -99,6 +99,9 @@ INSTALL 'httpfs.duckdb_extension'; LOAD 'httpfs.duckdb_extension'; ``` +## Extension template + +A template for creating extensions is available in the [`extension-template` repository](https://github.com/duckdb/extension-template/). Note that this project is work-in-progress. ## Pages in this Section From 1f49fc0c9cad49a6d752899f4ad7faa68bacdad0 Mon Sep 17 00:00:00 2001 From: Elliana May Date: Wed, 23 Aug 2023 22:53:33 +0800 Subject: [PATCH 02/17] Update _config_exclude_archive.yml to exclude vendor --- _config_exclude_archive.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config_exclude_archive.yml b/_config_exclude_archive.yml index 7394c3a46c7..6c37c41ac6e 100644 --- a/_config_exclude_archive.yml +++ b/_config_exclude_archive.yml @@ -1,2 +1,2 @@ # bundler exec jekyll serve --incremental --config _config.yml,_config_exclude_archive.yml -exclude: ['docs/archive'] +exclude: ['docs/archive', 'vendor'] From b6fedd2a902ab3679a102f9e0f56926f8aafc47c Mon Sep 17 00:00:00 2001 From: Elliana May Date: Wed, 23 Aug 2023 22:56:17 +0800 Subject: [PATCH 03/17] Update lint.yml to lint more things --- .github/workflows/lint.yml | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index 168b3c5cb6e..1d2a7095840 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -18,9 +18,7 @@ jobs: - uses: articulate/actions-markdownlint@main with: config: .markdownlint.jsonc - files: 'docs/**/*.md' - # TODO: - # files: 'docs/**/*.md _posts/*.md dev/*.md' + files: 'docs/**/*.md _posts/*.md dev/*.md' python: runs-on: ubuntu-latest From 5ef800c6c0c0887360629e8a224fb470481b8458 Mon Sep 17 00:00:00 2001 From: Elliana May Date: Wed, 23 Aug 2023 23:07:30 +0800 Subject: [PATCH 04/17] fix lint issues --- _posts/2021-08-27-external-sorting.md | 8 ++++---- _posts/2021-10-29-duckdb-wasm.md | 4 ++-- _posts/2021-11-26-duck-enum.md | 2 +- _posts/2021-12-03-duck-arrow.md | 2 +- _posts/2022-05-04-friendlier-sql.md | 2 +- _posts/2022-09-30-postgres-scanner.md | 2 +- _posts/2022-11-14-announcing-duckdb-060.md | 14 ++++++++------ _posts/2023-02-13-announcing-duckdb-070.md | 16 ++++++++-------- _posts/2023-04-14-h2oai.md | 4 ++-- _posts/2023-04-21-swift.md | 2 +- .../2023-05-26-correlated-subqueries-in-sql.md | 2 +- dev/sqllogictest/intro.md | 4 ++-- 12 files changed, 32 insertions(+), 30 deletions(-) diff --git a/_posts/2021-08-27-external-sorting.md b/_posts/2021-08-27-external-sorting.md index 838acf68553..5cbc099edb1 100644 --- a/_posts/2021-08-27-external-sorting.md +++ b/_posts/2021-08-27-external-sorting.md @@ -372,12 +372,12 @@ We see similar trends at SF10 and SF100, but for SF100, at around 12 payload col ClickHouse switches to an external sorting strategy, which is much slower than its in-memory strategy. Therefore, adding a few payload columns results in a runtime that is orders of magnitude higher. 
At 20 payload columns ClickHouse runs into the following error: -``` +```text DB::Exception: Memory limit (for query) exceeded: would use 11.18 GiB (attempt to allocate chunk of 4204712 bytes), maximum: 11.18 GiB: (while reading column cs_list_price): (while reading from part ./store/523/5230c288-7ed5-45fa-9230-c2887ed595fa/all_73_108_2/ from mark 4778 with max_rows_to_read = 8192): While executing MergeTreeThread. ``` HyPer also drops in performance before erroring out with the following message: -``` +```text ERROR: Cannot allocate 333982248 bytes of memory: The `global memory limit` limit of 12884901888 bytes was exceeded. ``` As far as we are aware, HyPer uses [`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html), which creates a mapping between memory and a file. @@ -392,7 +392,7 @@ Using swap usually slows down processing significantly, but the SSD is so fast t While Pandas loads the data, swap size grows to an impressive \~40 GB: Both the file and the data frame are fully in memory/swap at the same time, rather than streamed into memory. This goes down to \~20 GB of memory/swap when the file is done being read. Pandas is able to get quite far into the experiment until it crashes with the following error: -``` +```text UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown ``` @@ -521,7 +521,7 @@ We have set the number of threads that DuckDB and ClickHouse use to 8 because we Pandas performs comparatively worse than on the MacBook, because it has a single-threaded implementation, and this CPU has a lower single-thread performance. Again, Pandas crashes with an error (this machine does not dynamically increase swap): -``` +```text numpy.core._exceptions.MemoryError: Unable to allocate 6.32 GiB for an array with shape (6, 141430723) and data type float64 ``` diff --git a/_posts/2021-10-29-duckdb-wasm.md b/_posts/2021-10-29-duckdb-wasm.md index 89bf64317b7..cd50dc8d2b0 100644 --- a/_posts/2021-10-29-duckdb-wasm.md +++ b/_posts/2021-10-29-duckdb-wasm.md @@ -245,14 +245,14 @@ In 2018, the Spectre and Meltdown vulnerabilities sent crippling shockwaves thro Without `SharedArrayBuffers`, WebAssembly modules can run in a dedicated web worker to unblock the main event loop but won't be able to spawn additional workers for parallel computations within the same instance. By default, we therefore cannot unleash the parallel query execution of DuckDB in the web. However, browser vendors have recently started to reenable `SharedArrayBuffers` for websites that are [cross-origin-isolated](https://web.dev/coop-coep/). A website is cross-origin-isolated if it ships the main document with the following HTTP headers: -``` +```text Cross-Origin-Embedded-Policy: require-corp Cross-Origin-Opener-Policy: same-origin ``` These headers will instruct browsers to A) isolate the top-level document from other top-level documents outside its own origin and B) prevent the document from making arbitrary cross-origin requests unless the requested resource explicitly opts in. Both restrictions have far reaching implications for a website since many third-party data sources won't yet provide the headers today and the top-level isolation currently hinders the communication with, for example, OAuth pop up's ([there are plans to lift that](https://github.com/whatwg/html/issues/6364)). -*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. 
We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users. Share your thoughts with us [here]().* +*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users.* ## Web Shell diff --git a/_posts/2021-11-26-duck-enum.md b/_posts/2021-11-26-duck-enum.md index fa552a06dc2..336a14826ff 100644 --- a/_posts/2021-11-26-duck-enum.md +++ b/_posts/2021-11-26-duck-enum.md @@ -121,7 +121,7 @@ df_out <- dbReadTable(con, "characters") To demonstrate the performance of DuckDB when running operations on categorical columns of Pandas DataFrames, we present a number of benchmarks. The source code for the benchmarks is available [here](https://raw.githubusercontent.com/duckdb/duckdb-web/main/_posts/benchmark_scripts/enum.py). In our benchmarks we always consume and produce Pandas DataFrames. -#### Dataset +### Dataset Our dataset is composed of one dataframe with 4 columns and 10 million rows. The first two columns are named ```race``` and ```subrace``` representing races. They are both categorical, with the same categories but different values. The other two columns ```race_string``` and ```subrace_string``` are the string representations of ```race``` and ```subrace```. diff --git a/_posts/2021-12-03-duck-arrow.md b/_posts/2021-12-03-duck-arrow.md index 6ed9f087fbc..ccf414a3ffd 100644 --- a/_posts/2021-12-03-duck-arrow.md +++ b/_posts/2021-12-03-duck-arrow.md @@ -83,7 +83,7 @@ nyc.filter("year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amou In this section, we will look at some basic examples of the code needed to read and output Arrow tables in both Python and R. -#### Setup +### Setup First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below. ```bash diff --git a/_posts/2022-05-04-friendlier-sql.md b/_posts/2022-05-04-friendlier-sql.md index 40ce4400a1b..e66e688e456 100644 --- a/_posts/2022-05-04-friendlier-sql.md +++ b/_posts/2022-05-04-friendlier-sql.md @@ -267,7 +267,7 @@ In addition to what has already been implemented, several other improvements hav - Clickhouse supports this with the [`COLUMNS` expression](https://clickhouse.com/docs/en/sql-reference/statements/select/#columns-expression) - Incremental column aliases - Refer to previously defined aliases in subsequent calculated columns rather than re-specifying the calculations -- Dot operators for JSON types + - Dot operators for JSON types - The JSON extension is brand new ([see our documentation!](https://duckdb.org/docs/extensions/json)) and already implements friendly `->` and `->>` syntax Thanks for checking out DuckDB! May the Force be with you... diff --git a/_posts/2022-09-30-postgres-scanner.md b/_posts/2022-09-30-postgres-scanner.md index 515060aa4bd..2f04793f911 100644 --- a/_posts/2022-09-30-postgres-scanner.md +++ b/_posts/2022-09-30-postgres-scanner.md @@ -51,7 +51,7 @@ CALL postgres_attach('dbname=myshinydb'); `postgres_attach` takes a single required string parameter, which is the [`libpq` connection string](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING). For example you can pass `'dbname=myshinydb'` to select a different database name. In the simplest case, the parameter is just `''`. 
There are three additional named parameters to the function: * `source_schema` the name of a non-standard schema name in Postgres to get tables from. Default is `public`. * `overwrite` whether we should overwrite existing views in the target schema, default is `false`. -* `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls. + * `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls. The tables in the database are registered as views in DuckDB, you can list them with ```SQL diff --git a/_posts/2022-11-14-announcing-duckdb-060.md b/_posts/2022-11-14-announcing-duckdb-060.md index 782f60fa617..f2bbb0c896b 100644 --- a/_posts/2022-11-14-announcing-duckdb-060.md +++ b/_posts/2022-11-14-announcing-duckdb-060.md @@ -107,7 +107,7 @@ CREATE TABLE messages(u UNION(num INT, error VARCHAR)); INSERT INTO messages VALUES (42); INSERT INTO messages VALUES ('oh my globs'); ``` -``` +```text SELECT * FROM messages; ┌─────────────┐ │ u │ @@ -142,7 +142,7 @@ CREATE TABLE obs(id INT, val1 INT, val2 INT); INSERT INTO obs VALUES (1, 10, 100), (2, 20, NULL), (3, NULL, 300); SELECT MIN(COLUMNS(*)), COUNT(*) from obs; ``` -``` +```text ┌─────────────┬───────────────┬───────────────┬──────────────┐ │ min(obs.id) │ min(obs.val1) │ min(obs.val2) │ count_star() │ ├─────────────┼───────────────┼───────────────┼──────────────┤ @@ -155,7 +155,7 @@ The `COLUMNS` expression supports all star expressions, including [the `EXCLUDE` ```sql SELECT COLUMNS('val[0-9]+') from obs; ``` -``` +```text ┌──────┬──────┐ │ val1 │ val2 │ ├──────┼──────┤ @@ -170,7 +170,7 @@ SELECT COLUMNS('val[0-9]+') from obs; ```sql SELECT [x + 1 for x in [1, 2, 3]] AS l; ``` -``` +```text ┌───────────┐ │ l │ ├───────────┤ @@ -211,8 +211,10 @@ The DuckDB shell also offers several improvements over the SQLite shell, such as The number of rows that are rendered can be changed by using the `.maxrows X` setting, and you can switch back to the old rendering using the `.mode box` command. -``` +```sql D SELECT * FROM '~/Data/nyctaxi/nyc-taxi/2014/04/data.parquet'; +``` +```text ┌───────────┬─────────────────────┬─────────────────────┬───┬────────────┬──────────────┬──────────────┐ │ vendor_id │ pickup_at │ dropoff_at │ … │ tip_amount │ tolls_amount │ total_amount │ │ varchar │ timestamp │ timestamp │ │ float │ float │ float │ @@ -265,7 +267,7 @@ SELECT student_id FROM 'data/ -> data/grades.csv **Progress Bars**. DuckDB has [supported progress bars in queries for a while now](https://github.com/duckdb/duckdb/pull/1432), but they have always been opt-in. In this release we have [prettied up the progress bar](https://github.com/duckdb/duckdb/pull/5187) and enabled it by default in the shell. The progress bar will pop up when a query is run that takes more than 2 seconds, and display an estimated time-to-completion for the query. 
-``` +```sql D copy lineitem to 'lineitem-big.parquet'; 32% ▕███████████████████▏ ▏ ``` diff --git a/_posts/2023-02-13-announcing-duckdb-070.md b/_posts/2023-02-13-announcing-duckdb-070.md index 60fac06c88d..caed69fd382 100644 --- a/_posts/2023-02-13-announcing-duckdb-070.md +++ b/_posts/2023-02-13-announcing-duckdb-070.md @@ -44,7 +44,7 @@ COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month)); This will cause the Parquet files to be written in the following directory structure: -``` +```text orders ├── year=2021 │ ├── month=1 @@ -145,7 +145,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2; >>> lineitem = duckdb.sql('FROM lineitem.parquet') >>> lineitem.limit(3).show() ``` -``` +```text ┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐ │ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │ │ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │ @@ -161,7 +161,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2; >>> lineitem_filtered = duckdb.sql('FROM lineitem WHERE l_orderkey>5000') >>> lineitem_filtered.limit(3).show() ``` -``` +```text ┌────────────┬───────────┬───────────┬───┬────────────────┬────────────┬──────────────────────┐ │ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │ │ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │ @@ -176,7 +176,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2; ```py >>> duckdb.sql('SELECT MIN(l_orderkey), MAX(l_orderkey) FROM lineitem_filtered').show() ``` -``` +```text ┌─────────────────┬─────────────────┐ │ min(l_orderkey) │ max(l_orderkey) │ │ int32 │ int32 │ @@ -193,7 +193,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk >>> lineitem = duckdb.read_csv('lineitem.csv') >>> lineitem.limit(3).show() ``` -``` +```text ┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐ │ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │ │ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │ @@ -208,7 +208,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk ```py >>> duckdb.sql('select min(l_orderkey) from lineitem').show() ``` -``` +```text ┌─────────────────┐ │ min(l_orderkey) │ │ int32 │ @@ -225,7 +225,7 @@ import duckdb duckdb.sql('select 42').pl() ``` -``` +```text shape: (1, 1) ┌─────┐ │ 42 │ @@ -245,7 +245,7 @@ df = pl.DataFrame({'a': 42}) duckdb.sql('select * from df').pl() ``` -``` +```text shape: (1, 1) ┌─────┐ │ a │ diff --git a/_posts/2023-04-14-h2oai.md b/_posts/2023-04-14-h2oai.md index 4ee2b43a567..95601dbc7fa 100644 --- a/_posts/2023-04-14-h2oai.md +++ b/_posts/2023-04-14-h2oai.md @@ -30,7 +30,7 @@ The time reported is the sum of the time it takes to run all 5 queries twice. More information about the specific queries can be found below. -#### The Data and Queries +### The Data and Queries The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation. @@ -87,7 +87,7 @@ You can also look at the results [here](https://duckdblabs.github.io/db-benchmar Some solutions may report internal errors for some queries. 
Feel free to investigate the errors by using the [_utils/repro.sh](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh) script and file a github issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system or if you have any other questions, you can create a github issue to discuss. -# Maintenance plan +## Maintenance plan DuckDB will continue to maintain this benchmark for the forseeable future. The process for re-running the benchmarks with updated library versions must still be decided. diff --git a/_posts/2023-04-21-swift.md b/_posts/2023-04-21-swift.md index 0656e1c0a8e..0527b036676 100644 --- a/_posts/2023-04-21-swift.md +++ b/_posts/2023-04-21-swift.md @@ -69,7 +69,7 @@ One problem with our current `ExoplanetStore` type is that it doesn’t yet cont There are hundreds of configuration options for this incredible resource, but today we want each exoplanet’s name and its discovery year packaged as a CSV. [Checking the docs](https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html) gives us the following endpoint: -``` +```text https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv ``` diff --git a/_posts/2023-05-26-correlated-subqueries-in-sql.md b/_posts/2023-05-26-correlated-subqueries-in-sql.md index 7689f8ff153..3471e07c12f 100644 --- a/_posts/2023-05-26-correlated-subqueries-in-sql.md +++ b/_posts/2023-05-26-correlated-subqueries-in-sql.md @@ -238,7 +238,7 @@ WHERE distance=( ); ``` -``` +```text ┌───────────────────────────┐ │ HASH_JOIN │ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ diff --git a/dev/sqllogictest/intro.md b/dev/sqllogictest/intro.md index 7e8eb5006b3..e1a1ede0f6a 100644 --- a/dev/sqllogictest/intro.md +++ b/dev/sqllogictest/intro.md @@ -62,7 +62,7 @@ A syntax highlighter exists for [Visual Studio Code](https://marketplace.visuals A syntax highlighter is also available for [CLion](https://plugins.jetbrains.com/plugin/15295-sqltest). It can be installed directly on the IDE by searching SQLTest on the marketplace. A [github repository](https://github.com/pdet/SQLTest) is also available, with extensions and bug reports being welcome. -##### Temporary Files +#### Temporary Files For some tests (e.g. CSV/Parquet file format tests) it is necessary to create temporary files. Any temporary files should be created in the temporary testing directory. This directory can be used by placing the string `__TEST_DIR__` in a query. This string will be replaced by the path of the temporary testing directory. @@ -71,7 +71,7 @@ statement ok COPY csv_data TO '__TEST_DIR__/output_file.csv.gz' (COMPRESSION GZIP); ``` -##### Require & Extensions +#### Require & Extensions To avoid bloating the core system, certain functionality of DuckDB is available only as an extension. Tests can be build for those extensions by adding a `require` field in the test. If the extension is not loaded, any statements that occurs after the require field will be skipped. Examples of this are `require parquet` or `require icu`. From d5cdfb77ad41d5bd7ec80b5b899c0c7d78b038df Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Wed, 23 Aug 2023 20:17:52 +0200 Subject: [PATCH 05/17] CI: Ignore tpc.org for Link Checker tpc.org frequently times out, causing the CI to go red. At the same time, tpc.org and its links (/tpch, /tpcds) are unlikely to disappear anytime soon, so I put them in the ignore list. 
--- .github/linkchecker/linkchecker.conf | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/linkchecker/linkchecker.conf b/.github/linkchecker/linkchecker.conf index d18f447d125..0867e3ed503 100644 --- a/.github/linkchecker/linkchecker.conf +++ b/.github/linkchecker/linkchecker.conf @@ -18,3 +18,4 @@ ignore= https://nbviewer\.org/.* https://open\.spotify\.com/.* https://www\.dataengineeringpodcast\.com/.* + https://www.tpc.org/.* From 0ad0463b28d8ed2e450b5615a659b7523dafad45 Mon Sep 17 00:00:00 2001 From: Alex-Monahan <52226177+Alex-Monahan@users.noreply.github.com> Date: Wed, 23 Aug 2023 15:01:52 -0700 Subject: [PATCH 06/17] import pandas in main thread There is currently a bug that pandas must be imported before `.df()` can be used in multiple threads. --- docs/guides/python/multiple_threads.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/guides/python/multiple_threads.md b/docs/guides/python/multiple_threads.md index f607868be31..54d15af4548 100644 --- a/docs/guides/python/multiple_threads.md +++ b/docs/guides/python/multiple_threads.md @@ -11,6 +11,7 @@ Feel free to follow along in this [Google Collaboratory Notebook](https://colab. ## Setup First, import duckdb and several modules from the Python standard library. +Note: if using Pandas, add `import pandas` at the top of the script as well (as it must be imported prior to the multi-threading). Then connect to a file-backed DuckDB database and create an example table to store inserted data. This table will track the name of the thread that completed the insert and automatically insert the timestamp when that insert occurred using the [`DEFAULT` expression](../../sql/statements/create_table#syntax). ```python From 6fc27d56d94db6a13f24205622bf68324a6b7b68 Mon Sep 17 00:00:00 2001 From: Mark Harrison Date: Wed, 23 Aug 2023 16:00:59 -0700 Subject: [PATCH 07/17] Noting that default parameters must be named. --- docs/sql/statements/create_macro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sql/statements/create_macro.md b/docs/sql/statements/create_macro.md index d3b5cdf2201..1b24e8c2808 100644 --- a/docs/sql/statements/create_macro.md +++ b/docs/sql/statements/create_macro.md @@ -60,7 +60,7 @@ SELECT add(1, 2); -- 3 ``` -Macro's can have default parameters. +Macro's can have default parameters. Unlike some languages, default parameters must be named. ```sql -- b is a default parameter CREATE MACRO add_default(a, b := 5) AS a + b; From 6e3cb925c6aa6789783c5d08be546e2e57a92c38 Mon Sep 17 00:00:00 2001 From: Mark Harrison Date: Wed, 23 Aug 2023 16:12:28 -0700 Subject: [PATCH 08/17] Making it clear that this is for invocation. --- docs/sql/statements/create_macro.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/sql/statements/create_macro.md b/docs/sql/statements/create_macro.md index 1b24e8c2808..8299980709a 100644 --- a/docs/sql/statements/create_macro.md +++ b/docs/sql/statements/create_macro.md @@ -60,7 +60,8 @@ SELECT add(1, 2); -- 3 ``` -Macro's can have default parameters. Unlike some languages, default parameters must be named. +Macro's can have default parameters. Unlike some languages, default parameters must be named +when the macro is invoked. 
```sql -- b is a default parameter CREATE MACRO add_default(a, b := 5) AS a + b; From 66c0c552512d762f2f271d59d87dc8e7ea68ae0e Mon Sep 17 00:00:00 2001 From: Elliana May Date: Thu, 24 Aug 2023 19:25:59 +0800 Subject: [PATCH 09/17] Correct link to Arrow docs Related to https://github.com/duckdb/duckdb/issues/7647 --- docs/guides/python/sql_on_arrow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/python/sql_on_arrow.md b/docs/guides/python/sql_on_arrow.md index 7eaf317d68c..85ee290dfff 100644 --- a/docs/guides/python/sql_on_arrow.md +++ b/docs/guides/python/sql_on_arrow.md @@ -94,7 +94,7 @@ results = con.execute("SELECT * FROM arrow_scanner").arrow() ## Apache Arrow RecordBatchReaders -[Arrow RecordBatchReaders](https://arrow.apache.org/docs/python/generated/pyarrow.ipc.RecordBatchStreamReader.html) are a reader for Arrow's streaming binary format and can also be queried directly as if they were tables. This streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language runtimes. +[Arrow RecordBatchReaders](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html) are a reader for Arrow's streaming binary format and can also be queried directly as if they were tables. This streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language runtimes. ```python import duckdb import pyarrow as pa From 7438655cbef5db3371a4495a7b7abed0ef1e0ef9 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 14:45:33 +0200 Subject: [PATCH 10/17] Fix FROM page, add TPC-H tables --- docs/sql/query_syntax/from.md | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/docs/sql/query_syntax/from.md b/docs/sql/query_syntax/from.md index 4f775beadf8..0c0af238580 100644 --- a/docs/sql/query_syntax/from.md +++ b/docs/sql/query_syntax/from.md @@ -70,19 +70,45 @@ attributes from one side to attributes from the other side. The conditions can be explicitly specified using an `ON` clause with the join (clearer) or implied by the `WHERE` clause (old-fashioned). 
+We use the `l_regions` and the `l_nations` tables from the TPC-H schema: + +```sql +CREATE TABLE l_regions(r_regionkey INTEGER NOT NULL PRIMARY KEY, + r_name CHAR(25) NOT NULL, + r_comment VARCHAR(152)); + +CREATE TABLE l_nations (n_nationkey INTEGER NOT NULL PRIMARY KEY, + n_name CHAR(25) NOT NULL, + n_regionkey INTEGER NOT NULL, + n_comment VARCHAR(152), + FOREIGN KEY (n_regionkey) REFERENCES l_regions(r_regionkey)); +``` + ```sql -- return the regions for the nations SELECT n.*, r.* -FROM l_nations n, JOIN l_regions r ON (n_regionkey = r_regionkey) +FROM l_nations n JOIN l_regions r ON (n_regionkey = r_regionkey); ``` If the column names are the same and are required to be equal, then the simpler `USING` syntax can be used: +```sql +CREATE TABLE l_regions(regionkey INTEGER NOT NULL PRIMARY KEY, + name CHAR(25) NOT NULL, + comment VARCHAR(152)); + +CREATE TABLE l_nations (nationkey INTEGER NOT NULL PRIMARY KEY, + name CHAR(25) NOT NULL, + regionkey INTEGER NOT NULL, + comment VARCHAR(152), + FOREIGN KEY (regionkey) REFERENCES l_regions(regionkey)); +``` + ```sql -- return the regions for the nations SELECT n.*, r.* -FROM l_nations n, JOIN l_regions r USING (regionkey) +FROM l_nations n JOIN l_regions r USING (regionkey); ``` The expressions to not have to be equalities - any predicate can be used: From 903279d066a0df7fb0cb9c31fe8f4cb56da873eb Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 14:48:41 +0200 Subject: [PATCH 11/17] Recommend SQL statements to end with a semicolon in the documentation --- CONTRIBUTING.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9adb79a0ade..799e59ad5e8 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -41,6 +41,7 @@ Some of this style guide is automated with GitHub Actions, but feel free to run * Quoted blocks (lines starting with `>`) are rendered as [a "Note" box](https://duckdb.org/docs/archive/0.8.1/guides/python/filesystems). * Always format SQL code, variable names, function names, etc. as code. For example, when talking about the `CREATE TABLE` statement, the keywords should be formatted as code. * When presenting SQL statements, do not include the DuckDB prompt (`D `) in the documentation. +* SQL statements should end with a semicolon (`;`) to allow readers to quickly paste them into a SQL console. 
### Headers From 996315f16267de8768fd0901f2d5025f26b56165 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 15:31:04 +0200 Subject: [PATCH 12/17] Add semicolons to SQL statements --- docs/api/cli.md | 8 +++--- docs/data/multiple_files/combining_schemas.md | 6 ++--- docs/data/multiple_files/overview.md | 6 ++--- docs/extensions/httpfs.md | 4 +-- docs/guides/python/multiple_threads.md | 1 + docs/guides/python/sql_on_arrow.md | 2 +- docs/sql/data_types/enum.md | 12 ++++----- docs/sql/duckdb_table_functions.md | 12 +++------ docs/sql/expressions/case.md | 8 +++--- docs/sql/functions/dateformat.md | 2 +- docs/sql/functions/nested.md | 26 +++++++++---------- docs/sql/functions/timestamptz.md | 8 +++--- docs/sql/query_syntax/groupby.md | 4 +-- docs/sql/query_syntax/grouping_sets.md | 2 +- docs/sql/query_syntax/orderby.md | 6 ++--- docs/sql/statements/create_macro.md | 3 ++- docs/sql/statements/create_table.md | 4 +-- docs/sql/statements/export.md | 2 +- docs/sql/statements/pivot.md | 2 +- docs/sql/statements/select.md | 2 +- docs/sql/statements/unpivot.md | 4 +-- docs/sql/window_functions.md | 14 +++++----- 22 files changed, 68 insertions(+), 70 deletions(-) diff --git a/docs/api/cli.md b/docs/api/cli.md index eecc302ab53..db987c0db90 100644 --- a/docs/api/cli.md +++ b/docs/api/cli.md @@ -275,9 +275,9 @@ D .schema ``` ```sql -CREATE TABLE fliers(animal VARCHAR);; -CREATE TABLE swimmers(animal VARCHAR);; -CREATE TABLE walkers(animal VARCHAR);; +CREATE TABLE fliers(animal VARCHAR); +CREATE TABLE swimmers(animal VARCHAR); +CREATE TABLE walkers(animal VARCHAR); ``` ## Opening Database Files @@ -452,7 +452,7 @@ Note that the duck head is built with unicode characters and does not always wor -- Duck head prompt .prompt '⚫◗ ' -- Example SQL statement -SELECT 'Begin quacking!' as "Ready, Set, ..." +SELECT 'Begin quacking!' AS "Ready, Set, ..."; ``` To invoke that file on initialization, use this command: diff --git a/docs/data/multiple_files/combining_schemas.md b/docs/data/multiple_files/combining_schemas.md index a8d411b23f7..c7f3914358d 100644 --- a/docs/data/multiple_files/combining_schemas.md +++ b/docs/data/multiple_files/combining_schemas.md @@ -9,9 +9,9 @@ title: Combining Schemas ```sql -- read a set of CSV files combining columns by position -SELECT * FROM read_csv_auto('flights*.csv') +SELECT * FROM read_csv_auto('flights*.csv'); -- read a set of CSV files combining columns by name -SELECT * FROM read_csv_auto('flights*.csv', union_by_name=True) +SELECT * FROM read_csv_auto('flights*.csv', union_by_name=True); ``` ### Combining Schemas @@ -73,7 +73,7 @@ FlightDate|UniqueCarrier|OriginCityName|DestCityName Reading these when unifying column names **by position** results in an error - as the two files have a different number of columns. When specifying the `union_by_name` option, the columns are correctly unified, and any missing values are set to `NULL`. 
```sql -SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True) +SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True); ``` | FlightDate | OriginCityName | DestCityName | UniqueCarrier | diff --git a/docs/data/multiple_files/overview.md b/docs/data/multiple_files/overview.md index 78139babfc5..c74adf176e8 100644 --- a/docs/data/multiple_files/overview.md +++ b/docs/data/multiple_files/overview.md @@ -17,9 +17,9 @@ SELECT * FROM '*/*/*.csv'; -- read all files with a name ending in ".csv", at any depth in the folder "dir" SELECT * FROM 'dir/**/*.csv'; -- read the CSV files 'flights1.csv' and 'flights2.csv' -SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv']) +SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv']); -- read the CSV files 'flights1.csv' and 'flights2.csv', unifying schemas by name and outputting a `filename` column -SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True, filename=True) +SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True, filename=True); ``` ### Parquet @@ -86,7 +86,7 @@ DuckDB can read multiple CSV files at the same time using either the glob syntax The `filename` argument can be used to add an extra `filename` column to the result that indicates which row came from which file. For example: ```sql -SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True, filename=True) +SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True, filename=True); ``` | FlightDate | OriginCityName | DestCityName | UniqueCarrier | filename | diff --git a/docs/extensions/httpfs.md b/docs/extensions/httpfs.md index a7ff4b93ddf..72031e183eb 100644 --- a/docs/extensions/httpfs.md +++ b/docs/extensions/httpfs.md @@ -141,7 +141,7 @@ File globbing is implemented using the ListObjectV2 API call and allows to use f multiple files, for example: ```sql -SELECT * FROM read_parquet('s3://bucket/*.parquet') +SELECT * FROM read_parquet('s3://bucket/*.parquet'); ``` This query matches all files in the root of the bucket with the parquet extension. @@ -150,7 +150,7 @@ Several features for matching are supported, such as `*` to match any number of character or `[0-9]` for a single character in a range of characters: ```sql -SELECT COUNT(*) FROM read_parquet('s3://bucket/folder*/100?/t[0-9].parquet') +SELECT COUNT(*) FROM read_parquet('s3://bucket/folder*/100?/t[0-9].parquet'); ``` A useful feature when using globs is the `filename` option which adds a column with the file that a row originated from: diff --git a/docs/guides/python/multiple_threads.md b/docs/guides/python/multiple_threads.md index f607868be31..54d15af4548 100644 --- a/docs/guides/python/multiple_threads.md +++ b/docs/guides/python/multiple_threads.md @@ -11,6 +11,7 @@ Feel free to follow along in this [Google Collaboratory Notebook](https://colab. ## Setup First, import duckdb and several modules from the Python standard library. +Note: if using Pandas, add `import pandas` at the top of the script as well (as it must be imported prior to the multi-threading). Then connect to a file-backed DuckDB database and create an example table to store inserted data. This table will track the name of the thread that completed the insert and automatically insert the timestamp when that insert occurred using the [`DEFAULT` expression](../../sql/statements/create_table#syntax). 
```python diff --git a/docs/guides/python/sql_on_arrow.md b/docs/guides/python/sql_on_arrow.md index 7eaf317d68c..85ee290dfff 100644 --- a/docs/guides/python/sql_on_arrow.md +++ b/docs/guides/python/sql_on_arrow.md @@ -94,7 +94,7 @@ results = con.execute("SELECT * FROM arrow_scanner").arrow() ## Apache Arrow RecordBatchReaders -[Arrow RecordBatchReaders](https://arrow.apache.org/docs/python/generated/pyarrow.ipc.RecordBatchStreamReader.html) are a reader for Arrow's streaming binary format and can also be queried directly as if they were tables. This streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language runtimes. +[Arrow RecordBatchReaders](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html) are a reader for Arrow's streaming binary format and can also be queried directly as if they were tables. This streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language runtimes. ```python import duckdb import pyarrow as pa diff --git a/docs/sql/data_types/enum.md b/docs/sql/data_types/enum.md index 49f04d4d1e7..158d1b00841 100644 --- a/docs/sql/data_types/enum.md +++ b/docs/sql/data_types/enum.md @@ -17,10 +17,10 @@ The `ENUM` type represents a dictionary data structure with all possible unique Enum types are created from either a hardcoded set of values or from a select statement that returns a single column of varchars. The set of values in the select statement will be deduplicated, but if the enum is created from a hardcoded set there may not be any duplicates. ```sql -- Create enum using hardcoded values -CREATE TYPE ${enum_name} AS ENUM ([${value_1},${value_2},...]) +CREATE TYPE ${enum_name} AS ENUM ([${value_1},${value_2},...]); -- Create enum using a select statement that returns a single column of varchars -CREATE TYPE ${enum_name} AS ENUM (${SELECT expression}) +CREATE TYPE ${enum_name} AS ENUM (${SELECT expression}); ``` For example: ```sql @@ -93,7 +93,7 @@ DuckDB Enums are automatically cast to `VARCHAR` types whenever necessary. This For example: ```sql -- regexp_matches is a function that takes a VARCHAR, hence current_mood is cast to VARCHAR -SELECT regexp_matches(current_mood, '.*a.*') FROM person +SELECT regexp_matches(current_mood, '.*a.*') FROM person; ---- TRUE FALSE @@ -124,19 +124,19 @@ SELECT * FROM person_2 where current_mood = past_mood; Enum types are stored in the catalog, and a catalog dependency is added to each table that uses them. It is possible to drop an Enum from the catalog using the following command: ```sql -DROP TYPE ${enum_name} +DROP TYPE ${enum_name}; ``` Note that any dependent must be removed before dropping the enum, or the enum must be dropped with the additional `CASCADE` parameter. For example: ```sql -- This will fail since person has a catalog dependency to the mood type -DROP TYPE mood +DROP TYPE mood; DROP TABLE person; DROP TABLE person_2; -- This successfully removes the mood type. 
-- Another option would be to DROP TYPE mood CASCADE (Drops the type and its dependents) -DROP TYPE mood +DROP TYPE mood; ``` diff --git a/docs/sql/duckdb_table_functions.md b/docs/sql/duckdb_table_functions.md index 5898cee1e04..287a4308855 100644 --- a/docs/sql/duckdb_table_functions.md +++ b/docs/sql/duckdb_table_functions.md @@ -9,13 +9,13 @@ The resultset returned by a `duckdb_` table function may be used just like an or Table functions are still functions, and you should write parenthesis after the function name to call it to obtain its returned resultset: ```sql -SELECT * FROM duckdb_settings() +SELECT * FROM duckdb_settings(); ``` Alternatively, you may execute table functions also using the `CALL`-syntax: ```sql -CALL duckdb_settings() +CALL duckdb_settings(); ``` In this case too, the parentheses are mandatory. @@ -26,13 +26,9 @@ Example: ```sql -- duckdb_views table function: returns all views, including those marked internal -SELECT * -FROM duckdb_views() -; +SELECT * FROM duckdb_views(); -- duckdb_views view: returns views that are not marked as internal -SELECT * -FROM duckdb_views -; +SELECT * FROM duckdb_views; ``` ## duckdb_columns diff --git a/docs/sql/expressions/case.md b/docs/sql/expressions/case.md index 46519153880..120aa010330 100644 --- a/docs/sql/expressions/case.md +++ b/docs/sql/expressions/case.md @@ -7,7 +7,7 @@ railroad: expressions/case.js The `CASE` statement performs a switch based on a condition. The basic form is identical to the ternary condition used in many programming languages (`CASE WHEN cond THEN a ELSE b END` is equivalent to `cond ? a : b`). With a single condition this can be expressed with `IF(cond, a, b)`. ```sql -CREATE OR REPLACE TABLE INTEGERS as SELECT UNNEST([1, 2, 3]) as i; +CREATE OR REPLACE TABLE INTEGERS AS SELECT UNNEST([1, 2, 3]) AS i; SELECT i, CASE WHEN i>2 THEN 1 ELSE 0 END AS test FROM integers; -- 1, 2, 3 -- 0, 0, 1 @@ -21,7 +21,7 @@ SELECT i, IF(i > 2, 1, 0) AS test FROM integers; The `WHEN cond THEN expr` part of the `CASE` statement can be chained, whenever any of the conditions returns true for a single tuple, the corresponding expression is evaluated and returned. ```sql -CREATE OR REPLACE TABLE INTEGERS as SELECT UNNEST([1, 2, 3]) as i; +CREATE OR REPLACE TABLE INTEGERS AS SELECT UNNEST([1, 2, 3]) AS i; SELECT i, CASE WHEN i=1 THEN 10 WHEN i=2 THEN 20 ELSE 0 END AS test FROM integers; -- 1, 2, 3 -- 10, 20, 0 @@ -30,7 +30,7 @@ SELECT i, CASE WHEN i=1 THEN 10 WHEN i=2 THEN 20 ELSE 0 END AS test FROM integer The `ELSE` part of the `CASE` statement is optional. If no else statement is provided and none of the conditions match, the `CASE` statement will return `NULL`. ```sql -CREATE OR REPLACE TABLE INTEGERS as SELECT UNNEST([1, 2, 3]) as i; +CREATE OR REPLACE TABLE INTEGERS AS SELECT UNNEST([1, 2, 3]) AS i; SELECT i, CASE WHEN i=1 THEN 10 END AS test FROM integers; -- 1, 2, 3 -- 10, NULL, NULL @@ -39,7 +39,7 @@ SELECT i, CASE WHEN i=1 THEN 10 END AS test FROM integers; After the `CASE` but before the `WHEN` an individual expression can also be provided. When this is done, the `CASE` statement is essentially transformed into a switch statement. 
```sql -CREATE OR REPLACE TABLE INTEGERS as SELECT UNNEST([1, 2, 3]) as i; +CREATE OR REPLACE TABLE INTEGERS AS SELECT UNNEST([1, 2, 3]) AS i; SELECT i, CASE i WHEN 1 THEN 10 WHEN 2 THEN 20 WHEN 3 THEN 30 END AS test FROM integers; -- 1, 2, 3 -- 10, 20, 30 diff --git a/docs/sql/functions/dateformat.md b/docs/sql/functions/dateformat.md index 05a21a79e42..608502a7595 100644 --- a/docs/sql/functions/dateformat.md +++ b/docs/sql/functions/dateformat.md @@ -33,7 +33,7 @@ The date formats can also be specified during CSV parsing, either in the `COPY` ```sql -- in COPY statement -COPY dates FROM 'test.csv' (DATEFORMAT '%d/%m/%Y', TIMESTAMPFORMAT '%A, %-d %B %Y - %I:%M:%S %p') +COPY dates FROM 'test.csv' (DATEFORMAT '%d/%m/%Y', TIMESTAMPFORMAT '%A, %-d %B %Y - %I:%M:%S %p'); -- in read_csv function SELECT * FROM read_csv('test.csv', dateformat='%m/%d/%Y'); diff --git a/docs/sql/functions/nested.md b/docs/sql/functions/nested.md index 7b759494734..44f344a7204 100644 --- a/docs/sql/functions/nested.md +++ b/docs/sql/functions/nested.md @@ -191,7 +191,7 @@ SELECT list_aggregate([2, 4, 8, 42], 'sum'); SELECT list_aggregate([[1, 2], [NULL], [2, 10, 3]], 'last'); -- [2, 10, 3] -SELECT list_aggregate([2, 4, 8, 42], 'string_agg', '|') +SELECT list_aggregate([2, 4, 8, 42], 'string_agg', '|'); -- 2|4|8|42 ``` @@ -230,17 +230,17 @@ By default if no modifiers are provided, DuckDB sorts ASC NULLS FIRST, i.e., the ```sql -- default sort order and default NULL sort order -SELECT list_sort([1, 3, NULL, 5, NULL, -5]) +SELECT list_sort([1, 3, NULL, 5, NULL, -5]); ---- [NULL, NULL, -5, 1, 3, 5] -- only providing the sort order -SELECT list_sort([1, 3, NULL, 2], 'ASC') +SELECT list_sort([1, 3, NULL, 2], 'ASC'); ---- [NULL, 1, 2, 3] -- providing the sort order and the NULL sort order -SELECT list_sort([1, 3, NULL, 2], 'DESC', 'NULLS FIRST') +SELECT list_sort([1, 3, NULL, 2], 'DESC', 'NULLS FIRST'); ---- [NULL, 3, 2, 1] ``` @@ -249,12 +249,12 @@ SELECT list_sort([1, 3, NULL, 2], 'DESC', 'NULLS FIRST') ```sql -- default NULL sort order -SELECT list_sort([1, 3, NULL, 5, NULL, -5]) +SELECT list_sort([1, 3, NULL, 5, NULL, -5]); ---- [NULL, NULL, -5, 1, 3, 5] -- providing the NULL sort order -SELECT list_reverse_sort([1, 3, NULL, 2], 'NULLS LAST') +SELECT list_reverse_sort([1, 3, NULL, 2], 'NULLS LAST'); ---- [3, 2, 1, NULL] ``` @@ -279,17 +279,17 @@ Returns a list that is the result of applying the lambda function to each elemen ```sql -- incrementing each list element by one -SELECT list_transform([1, 2, NULL, 3], x -> x + 1) +SELECT list_transform([1, 2, NULL, 3], x -> x + 1); ---- [2, 3, NULL, 4] -- transforming strings -SELECT list_transform(['duck', 'a', 'b'], duck -> CONCAT(duck, 'DB')) +SELECT list_transform(['duck', 'a', 'b'], duck -> CONCAT(duck, 'DB')); ---- [duckDB, aDB, bDB] -- combining lambda functions with other functions -SELECT list_transform([5, NULL, 6], x -> COALESCE(x, 0) + 1) +SELECT list_transform([5, NULL, 6], x -> COALESCE(x, 0) + 1); ---- [6, 1, 7] ``` @@ -304,17 +304,17 @@ Constructs a list from those elements of the input list for which the lambda fun ```sql -- filter out negative values -SELECT list_filter([5, -6, NULL, 7], x -> x > 0) +SELECT list_filter([5, -6, NULL, 7], x -> x > 0); ---- [5, 7] -- divisible by 2 and 5 -SELECT list_filter(list_filter([2, 4, 3, 1, 20, 10, 3, 30], x -> x % 2 == 0), y -> y % 5 == 0) +SELECT list_filter(list_filter([2, 4, 3, 1, 20, 10, 3, 30], x -> x % 2 == 0), y -> y % 5 == 0); ---- [20, 10, 30] -- in combination with range(...) 
to construct lists -SELECT list_filter([1, 2, 3, 4], x -> x > #1) FROM range(4) +SELECT list_filter([1, 2, 3, 4], x -> x > #1) FROM range(4); ---- [1, 2, 3, 4] [2, 3, 4] @@ -327,7 +327,7 @@ Lambda functions can be arbitrarily nested. ```sql -- nested lambda functions to get all squares of even list elements -SELECT list_transform(list_filter([0, 1, 2, 3, 4, 5], x -> x % 2 = 0), y -> y * y) +SELECT list_transform(list_filter([0, 1, 2, 3, 4, 5], x -> x % 2 = 0), y -> y * y); ---- [0, 4, 16] ``` diff --git a/docs/sql/functions/timestamptz.md b/docs/sql/functions/timestamptz.md index 9f91526c897..5e3edc78e38 100644 --- a/docs/sql/functions/timestamptz.md +++ b/docs/sql/functions/timestamptz.md @@ -38,7 +38,7 @@ This will let you specify an instant correctly without access to time zone infor For portability, `TIMESTAMPTZ` values will always be displayed using GMT offsets: ```sql -SELECT '2022-10-08 13:13:34-07'::TIMESTAMPTZ' +SELECT '2022-10-08 13:13:34-07'::TIMESTAMPTZ; -- 2022-10-08 20:13:34+00 ``` @@ -47,7 +47,7 @@ and cast to a representation in the local time zone: ```sql SELECT '2022-10-08 13:13:34 Europe/Amsterdam'::TIMESTAMPTZ::VARCHAR; --- 2022-10-08 04:13:34-07 +-- 2022-10-08 04:13:34-07 -- the offset will differ based on your local time zone ``` ## ICU Timestamp With Time Zone Operators @@ -131,9 +131,9 @@ Often the same functionality can be implemented more reliably using the `struct` The `AT TIME ZONE` syntax is syntactic sugar for the (two argument) `timezone` function listed above: ```sql -timestamp '2001-02-16 20:38:40' AT TIME ZONE 'America/Denver' +timestamp '2001-02-16 20:38:40' AT TIME ZONE 'America/Denver'; -- 2001-02-16 19:38:40-08 -timestamp with time zone '2001-02-16 20:38:40-05' AT TIME ZONE 'America/Denver' +timestamp with time zone '2001-02-16 20:38:40-05' AT TIME ZONE 'America/Denver'; -- 2001-02-16 18:38:40 ``` diff --git a/docs/sql/query_syntax/groupby.md b/docs/sql/query_syntax/groupby.md index d9b2c697b32..70a9f16c261 100644 --- a/docs/sql/query_syntax/groupby.md +++ b/docs/sql/query_syntax/groupby.md @@ -41,7 +41,7 @@ GROUP BY city, street_name; -- Group by city and street_name to remove any duplicate values SELECT city, street_name FROM addresses -GROUP BY ALL +GROUP BY ALL; -- GROUP BY city, street_name ; @@ -49,7 +49,7 @@ GROUP BY ALL -- Since income is wrapped in an aggregate function, do not include it in the GROUP BY SELECT city, street_name, AVG(income) FROM addresses -GROUP BY ALL +GROUP BY ALL; -- GROUP BY city, street_name ; diff --git a/docs/sql/query_syntax/grouping_sets.md b/docs/sql/query_syntax/grouping_sets.md index bde364b242b..322fb576049 100644 --- a/docs/sql/query_syntax/grouping_sets.md +++ b/docs/sql/query_syntax/grouping_sets.md @@ -78,7 +78,7 @@ GROUP BY course UNION ALL -- group by nothing SELECT NULL AS course, NULL AS type, COUNT(*) -FROM students +FROM students; ``` `CUBE` and `ROLLUP` are syntactic sugar to easily produce commonly used grouping sets. 
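As a quick illustration of that shorthand (reusing the `students` table and the `course`/`type` columns from the queries above), `ROLLUP (course, type)` produces the grouping sets `(course, type)`, `(course)` and `()`, while `CUBE (course, type)` additionally produces `(type)`:

```sql
-- shorthand for GROUP BY GROUPING SETS ((course, type), (course), ())
SELECT course, type, COUNT(*)
FROM students
GROUP BY ROLLUP (course, type);

-- shorthand for GROUP BY GROUPING SETS ((course, type), (course), (type), ())
SELECT course, type, COUNT(*)
FROM students
GROUP BY CUBE (course, type);
```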
diff --git a/docs/sql/query_syntax/orderby.md b/docs/sql/query_syntax/orderby.md index 80224a5fb74..01814195759 100644 --- a/docs/sql/query_syntax/orderby.md +++ b/docs/sql/query_syntax/orderby.md @@ -52,7 +52,7 @@ CREATE OR REPLACE TABLE addresses AS UNION ALL SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111' UNION ALL - SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001' + SELECT '111 Duck Duck Goose Ln', 'Duck Town', '11111-0001'; ; ``` @@ -81,7 +81,7 @@ ORDER BY city COLLATE DE; -- Order from left to right (by address, then by city, then by zip) in ascending order SELECT * FROM addresses -ORDER BY ALL +ORDER BY ALL; ``` | address | city | zip | @@ -96,7 +96,7 @@ ORDER BY ALL -- Order from left to right (by address, then by city, then by zip) in descending order SELECT * FROM addresses -ORDER BY ALL DESC +ORDER BY ALL DESC; ``` | address | city | zip | diff --git a/docs/sql/statements/create_macro.md b/docs/sql/statements/create_macro.md index d3b5cdf2201..8299980709a 100644 --- a/docs/sql/statements/create_macro.md +++ b/docs/sql/statements/create_macro.md @@ -60,7 +60,8 @@ SELECT add(1, 2); -- 3 ``` -Macro's can have default parameters. +Macro's can have default parameters. Unlike some languages, default parameters must be named +when the macro is invoked. ```sql -- b is a default parameter CREATE MACRO add_default(a, b := 5) AS a + b; diff --git a/docs/sql/statements/create_table.md b/docs/sql/statements/create_table.md index 4207c2ef9e8..903f733689b 100644 --- a/docs/sql/statements/create_table.md +++ b/docs/sql/statements/create_table.md @@ -153,10 +153,10 @@ Currently only the `VIRTUAL` kind is supported, and it is also the default optio ```sql -- The simplest syntax for a generated column. -- The type is derived from the expression, and the variant defaults to VIRTUAL -CREATE TABLE t1(x FLOAT, two_x AS (2 * x)) +CREATE TABLE t1(x FLOAT, two_x AS (2 * x)); -- Fully specifying the same generated column for completeness -CREATE TABLE t1(x FLOAT, two_x FLOAT GENERATED ALWAYS AS (2 * x) VIRTUAL) +CREATE TABLE t1(x FLOAT, two_x FLOAT GENERATED ALWAYS AS (2 * x) VIRTUAL); ``` ### Syntax diff --git a/docs/sql/statements/export.md b/docs/sql/statements/export.md index 02cf1543f53..830c1b9b62d 100644 --- a/docs/sql/statements/export.md +++ b/docs/sql/statements/export.md @@ -17,7 +17,7 @@ EXPORT DATABASE 'target_directory' (FORMAT PARQUET); -- export as parquet, compressed with ZSTD, with a row_group_size of 100000 EXPORT DATABASE 'target_directory' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000); --reload the database again -IMPORT DATABASE 'target_directory' +IMPORT DATABASE 'target_directory'; ``` For details regarding the writing of Parquet files, see the [Parquet Files page in the Data Import section](../../data/parquet#writing-to-parquet-files), and the [Copy Statement page](copy). diff --git a/docs/sql/statements/pivot.md b/docs/sql/statements/pivot.md index 99e20d6c710..933b62baeee 100644 --- a/docs/sql/statements/pivot.md +++ b/docs/sql/statements/pivot.md @@ -305,7 +305,7 @@ PIVOT ( [column_2] IN ([in_list]) ... GROUP BY [rows(s)] -) +); ``` Unlike the simplified syntax, the `IN` clause must be specified for each column to be pivoted. If you are interested in dynamic pivoting, the simplified syntax is recommended. 
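As a sketch of the full syntax, assuming a hypothetical `Cities(Country, Name, Year, Population)` table, a pivot over `Year` could be written roughly as follows:

```sql
-- in the SQL-standard syntax the IN list must be spelled out explicitly
FROM Cities
PIVOT (
    SUM(Population)
    FOR Year IN (2000, 2010, 2020)
    GROUP BY Country
);
```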
diff --git a/docs/sql/statements/select.md b/docs/sql/statements/select.md index f0340e88086..07b52f429fc 100644 --- a/docs/sql/statements/select.md +++ b/docs/sql/statements/select.md @@ -39,7 +39,7 @@ HAVING group_filter WINDOW window_expr QUALIFY qualify_filter ORDER BY order_expr -LIMIT n +LIMIT n; ``` Optionally, the `SELECT` statement can be prefixed with a [`WITH` clause](../../sql/query_syntax/with). diff --git a/docs/sql/statements/unpivot.md b/docs/sql/statements/unpivot.md index b3d2496a67b..a9b9c0dd276 100644 --- a/docs/sql/statements/unpivot.md +++ b/docs/sql/statements/unpivot.md @@ -20,7 +20,7 @@ UNPIVOT [dataset] ON [column(s)] INTO NAME [name-column-name] - VALUE [value-column-name(s)] + VALUE [value-column-name(s)]; ``` @@ -255,7 +255,7 @@ The full syntax diagram is below, but the SQL Standard `UNPIVOT` syntax can be s FROM [dataset] UNPIVOT [INCLUDE NULLS] ( [value-column-name(s)] - FOR [name-column-name] IN [column(s)] + FOR [name-column-name] IN [column(s)]; ) ``` diff --git a/docs/sql/window_functions.md b/docs/sql/window_functions.md index 2ffb64d3f4b..a17d3bf64da 100644 --- a/docs/sql/window_functions.md +++ b/docs/sql/window_functions.md @@ -128,9 +128,9 @@ The simplest window function is `ROW_NUMBER()`. This function just computes the 1-based row number within the partition using the query: ```sql -SELECT "Plant", "Date", row_number() over (partition by "Plant" order by "Date") AS "Row" +SELECT "Plant", "Date", row_number() OVER (PARTITION BY "Plant" ORDER BY "Date") AS "Row" FROM "History" -ORDER BY 1, 2 +ORDER BY 1, 2; ``` The result will be @@ -170,7 +170,7 @@ SELECT points, SUM(points) OVER ( ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) we -FROM results +FROM results; ``` This query computes the `SUM` of each point and the points on either side of it: @@ -194,7 +194,7 @@ SELECT "Plant", "Date", AND INTERVAL 3 DAYS FOLLOWING) AS "MWh 7-day Moving Average" FROM "Generation History" -ORDER BY 1, 2 +ORDER BY 1, 2; ``` This query partitions the data by `Plant` (to keep the different power plants' data separate), @@ -232,7 +232,7 @@ WINDOW seven AS ( ORDER BY "Date" ASC RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING) -ORDER BY 1, 2 +ORDER BY 1, 2; ``` The three window functions will also share the data layout, which will improve performance. @@ -259,7 +259,7 @@ WINDOW ORDER BY "Date" ASC RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND INTERVAL 1 DAYS FOLLOWING) -ORDER BY 1, 2 +ORDER BY 1, 2; ``` The queries above do not use a number of clauses commonly found in select statements, like @@ -284,5 +284,5 @@ WINDOW seven AS ( ORDER BY "Date" ASC RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING) -ORDER BY 1, 2 +ORDER BY 1, 2; ``` From 7448a8642f8fc87307216e61698fbecfe6092291 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 15:32:28 +0200 Subject: [PATCH 13/17] Add semicolons to SQL statements in the FROM page --- docs/sql/query_syntax/from.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/sql/query_syntax/from.md b/docs/sql/query_syntax/from.md index 0c0af238580..1e1e2e0dd80 100644 --- a/docs/sql/query_syntax/from.md +++ b/docs/sql/query_syntax/from.md @@ -60,7 +60,7 @@ and it just returns all the possible pairs. 
```sql -- return all pairs of rows -SELECT a.*, b.* FROM a CROSS JOIN b +SELECT a.*, b.* FROM a CROSS JOIN b; ``` #### Conditional Joins @@ -160,7 +160,7 @@ Connecting them using this ordering is called a _positional join_: ```sql -- treat two data frames as a single table SELECT df1.*, df2.* -FROM df1 POSITIONAL JOIN df2 +FROM df1 POSITIONAL JOIN df2; ``` Positional joins are always `FULL OUTER` joins. @@ -175,7 +175,7 @@ This is called an _as-of join_: -- attach prices to stock trades SELECT t.*, p.price FROM trades t ASOF JOIN prices p - ON t.symbol = p.symbol AND t.when >= p.when + ON t.symbol = p.symbol AND t.when >= p.when; ``` The `ASOF` join requires at least one inequality condition on the ordering field. @@ -192,7 +192,7 @@ It can be specified as an `OUTER` join to find unpaired rows -- attach prices or NULLs to stock trades SELECT * FROM trades t ASOF LEFT JOIN prices p - ON t.symbol = p.symbol AND t.when >= p.when + ON t.symbol = p.symbol AND t.when >= p.when; ``` `ASOF` joins can also specify join conditions on matching column names with the `USING` syntax, @@ -201,7 +201,7 @@ which will be greater than or equal to (`>=`): ```sql SELECT * -FROM trades t ASOF JOIN prices p USING (symbol, when) +FROM trades t ASOF JOIN prices p USING (symbol, when); -- Returns symbol, trades.when, price (but NOT prices.when) ``` @@ -212,7 +212,7 @@ To get the `prices` times in the example, you will need to list the columns expl ```sql SELECT t.symbol, t.when AS trade_when, p.when AS price_when, price -FROM trades t ASOF LEFT JOIN prices p USING (symbol, when) +FROM trades t ASOF LEFT JOIN prices p USING (symbol, when); ``` ### Syntax From ffd7cfbba12ea92d5014a36c9fbc665abdbe017b Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 15:40:04 +0200 Subject: [PATCH 14/17] Fix semicolon position --- docs/sql/statements/unpivot.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sql/statements/unpivot.md b/docs/sql/statements/unpivot.md index a9b9c0dd276..2bf8c3689c0 100644 --- a/docs/sql/statements/unpivot.md +++ b/docs/sql/statements/unpivot.md @@ -255,8 +255,8 @@ The full syntax diagram is below, but the SQL Standard `UNPIVOT` syntax can be s FROM [dataset] UNPIVOT [INCLUDE NULLS] ( [value-column-name(s)] - FOR [name-column-name] IN [column(s)]; -) + FOR [name-column-name] IN [column(s)] +); ``` Note that only one column can be included in the `name-column-name` expression. From b6af433a3a5846843447e42b573123a912ca9ab6 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 15:58:20 +0200 Subject: [PATCH 15/17] Add Iceberg page --- _data/menu_docs_dev.json | 4 ++++ docs/extensions/iceberg.md | 7 +++++++ 2 files changed, 11 insertions(+) create mode 100644 docs/extensions/iceberg.md diff --git a/_data/menu_docs_dev.json b/_data/menu_docs_dev.json index 8593092860e..bf419123ddb 100644 --- a/_data/menu_docs_dev.json +++ b/_data/menu_docs_dev.json @@ -925,6 +925,10 @@ "page": "HTTPFS", "url": "httpfs" }, + { + "page": "Iceberg", + "url": "iceberg" + }, { "page": "JSON", "url": "json" diff --git a/docs/extensions/iceberg.md b/docs/extensions/iceberg.md new file mode 100644 index 00000000000..31ee9ac5178 --- /dev/null +++ b/docs/extensions/iceberg.md @@ -0,0 +1,7 @@ +--- +layout: docu +title: Iceberg +--- +The [__iceberg__ extension](https://github.com/duckdblabs/duckdb_iceberg) is a loadable extension that implements support for the [Apache Iceberg format](https://iceberg.apache.org/). 
+ +> This extension currently only works on main branch of DuckDB. From 40a9fb683b03387eab201526a8faabec2f8d575e Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 16:01:23 +0200 Subject: [PATCH 16/17] Add Iceberg to the extensions table --- docs/extensions/overview.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/extensions/overview.md b/docs/extensions/overview.md index 4370d74c78e..9e09785cc5b 100644 --- a/docs/extensions/overview.md +++ b/docs/extensions/overview.md @@ -36,6 +36,7 @@ SELECT * FROM duckdb_extensions(); | [excel](excel) | Adds support for Excel-like format strings | | | [fts](full_text_search) | Adds support for Full-Text Search Indexes | | | [httpfs](httpfs) | Adds support for reading and writing files over a HTTP(S) connection | http, https, s3 | +| [iceberg](iceberg) [![GitHub logo](/images/github-mark.svg)](https://github.com/duckdblabs/duckdb_iceberg) | Adds support for the Apache Iceberg format | | | icu | Adds support for time zones and collations using the ICU library | | | inet | Adds support for IP-related data types and functions | | | jemalloc | Overwrites system allocator with JEMalloc | | From bc73472c735e2425d0e7524666b62d406c218ec6 Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Thu, 24 Aug 2023 16:23:18 +0200 Subject: [PATCH 17/17] Enforce headline capitalization. Fixes #1004 --- CONTRIBUTING.md | 1 + docs/api/cli.md | 4 ++-- docs/api/python/dbapi.md | 2 +- docs/api/python/types.md | 4 ++-- docs/data/csv/tips.md | 10 +++++----- docs/data/json/overview.md | 6 +++--- docs/data/multiple_files/overview.md | 2 +- docs/data/partitioning/hive_partitioning.md | 2 +- docs/data/partitioning/partitioned_writes.md | 2 +- docs/extensions/httpfs.md | 2 +- docs/extensions/overview.md | 16 ++++++++-------- docs/extensions/postgres_scanner.md | 2 +- docs/extensions/sqlite_scanner.md | 4 ++-- docs/guides/python/filesystems.md | 2 +- docs/guides/sql_features/full_text_search.md | 2 +- docs/sql/data_types/struct.md | 20 ++++++++++---------- docs/sql/data_types/union.md | 8 ++++---- docs/sql/functions/overview.md | 2 +- docs/sql/functions/patternmatching.md | 2 +- docs/sql/indexes.md | 4 ++-- docs/sql/query_syntax/groupby.md | 2 +- docs/sql/query_syntax/with.md | 14 +++++++------- docs/sql/statements/pivot.md | 12 ++++++------ 23 files changed, 63 insertions(+), 62 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 799e59ad5e8..69b54813491 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -47,6 +47,7 @@ Some of this style guide is automated with GitHub Actions, but feel free to run * The title of the page should be encoded in the front matter's `title` property. The body of the page should not start with a repetition of this title. * In the body of the page, restrict the use of headers to the following levels: h2 (`##`), h3 (`###`), and h4 (`####`). +* Use headline capitalization as defined in the [Chicago Manual of Style](https://headlinecapitalization.com/). ### SQL style diff --git a/docs/api/cli.md b/docs/api/cli.md index db987c0db90..01257878435 100644 --- a/docs/api/cli.md +++ b/docs/api/cli.md @@ -477,7 +477,7 @@ Use ".open FILENAME" to reopen on a persistent database. ⚫◗ ``` -## Non-interactive usage +## Non-interactive Usage To read/process a file and exit immediately, pipe the file contents in to `duckdb`: @@ -521,7 +521,7 @@ D LOAD 'fts'; -## Reading from stdin and writing to stdout +## Reading from stdin and Writing to stdout When in a Unix environment, it can be useful to pipe data between multiple commands. 
DuckDB is able to read data from stdin as well as write to stdout using the file location of stdin (`/dev/stdin`) and stdout (`/dev/stdout`) within SQL commands, as pipes act very similarly to file handles. diff --git a/docs/api/python/dbapi.md b/docs/api/python/dbapi.md index cc1ed703b2e..81c582081e7 100644 --- a/docs/api/python/dbapi.md +++ b/docs/api/python/dbapi.md @@ -91,7 +91,7 @@ print(con.fetchall()) # [('duck', 'duck', 'goose')] ``` -## Named parameters +## Named Parameters Besides the standard unnamed parameters, like `$1`, `$2` etc, it's also possible to supply named parameters, like `$my_parameter`. When using named parameters, you have to provide a dictionary mapping of `str` to value in the `parameters` argument diff --git a/docs/api/python/types.md b/docs/api/python/types.md index d06695f143b..fb702adaf47 100644 --- a/docs/api/python/types.md +++ b/docs/api/python/types.md @@ -5,7 +5,7 @@ title: Types API The `DuckDBPyType` class represents a type instance of our [data types](../../sql/data_types/overview). -## Converting from other types +## Converting from Other Types To make the API as easy to use as possible, we have added implicit conversions from existing type objects to a DuckDBPyType instance. This means that wherever a DuckDBPyType object is expected, it is also possible to provide any of the options listed below. @@ -41,7 +41,7 @@ The table below shows the mapping of Numpy DType to DuckDB type. |*`float32`*|FLOAT| |*`float64`*|DOUBLE| -### Nested types +### Nested Types #### *`list[child_type]`* diff --git a/docs/data/csv/tips.md b/docs/data/csv/tips.md index 1d76cf77d15..f81d9f42779 100644 --- a/docs/data/csv/tips.md +++ b/docs/data/csv/tips.md @@ -5,7 +5,7 @@ title: CSV Loading Tips Below is a collection of tips to help when attempting to process especially gnarly CSV files. -#### Override the header flag if the header is not correctly detected +#### Override the Header Flag if the Header Is Not Correctly Detected If a file contains only string columns the `header` auto-detection might fail. Provide the `header` option to override this behavior. @@ -13,7 +13,7 @@ If a file contains only string columns the `header` auto-detection might fail. P SELECT * FROM read_csv_auto('flights.csv', header=True); ``` -#### Provide names if the file does not contain a header +#### Provide Names if the File Does Not Contain a Header If the file does not contain a header, names will be auto-generated by default. You can provide your own names with the `names` option. @@ -21,7 +21,7 @@ If the file does not contain a header, names will be auto-generated by default. SELECT * FROM read_csv_auto('flights.csv', names=['FlightDate', 'UniqueCarrier']); ``` -#### Override the types of specific columns +#### Override the Types of Specific Columns The `types` flag can be used to override types of only certain columns by providing a struct of `name -> type` mappings. @@ -29,7 +29,7 @@ The `types` flag can be used to override types of only certain columns by provid SELECT * FROM read_csv_auto('flights.csv', types={'FlightDate': 'DATE'}); ``` -#### Use `COPY` when loading data into a table +#### Use `COPY` When Loading Data into a Table The [`COPY` statement](../../sql/statements/copy) copies data directly into a table. The CSV reader uses the schema of the table instead of auto-detecting types from the file. This speeds up the auto-detection, and prevents mistakes from being made during auto-detection. 
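Since `COPY` takes its column types from the target table, a sketch of the setup step helps; the schema below is a hypothetical one for the `test.csv` that appears in the hunk following, and the final statement mirrors that snippet.

```sql
-- hypothetical schema for test.csv; COPY coerces the file's values
-- to these declared types instead of re-inferring them
CREATE TABLE tbl (FlightDate DATE, UniqueCarrier VARCHAR, ArrDelay INTEGER);
COPY tbl FROM 'test.csv' (AUTO_DETECT TRUE);
```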
@@ -37,7 +37,7 @@ The [`COPY` statement](../../sql/statements/copy) copies data directly into a ta COPY tbl FROM 'test.csv' (AUTO_DETECT 1); ``` -#### Use `union_by_name` when loading files with different schemas +#### Use `union_by_name` When Loading Files with Different Schemas The `union_by_name` option can be used to unify the schema of files that have different or missing columns. For files that do not have certain columns, `NULL` values are filled in. diff --git a/docs/data/json/overview.md b/docs/data/json/overview.md index fdb27dcdc6f..44c47ff6fae 100644 --- a/docs/data/json/overview.md +++ b/docs/data/json/overview.md @@ -61,7 +61,7 @@ Below are parameters that can be passed in to the JSON reader. When using `read_json_auto`, every parameter that supports auto-detection is enabled. -## Examples of format settings +## Examples of Format Settings The JSON extension can attempt to determine the format of a JSON file when setting `format` to `auto`. Here are some example JSON files and the corresponding `format` settings that should be used. @@ -139,7 +139,7 @@ SELECT * FROM read_json_auto(unstructured.json, format=unstructured); | `value2` | `value2` | | `value3` | `value3` | -## Examples of records settings +## Examples of Records Settings The JSON extension can attempt to determine whether a JSON file contains records when setting `records=auto`. When `records=true`, the JSON extension expects JSON objects, and will unpack the fields of JSON objects into individual columns. @@ -191,7 +191,7 @@ SELECT * FROM read_json_auto(arrays.json, records=false); The contents of tables or the result of queries can be written directly to a JSON file using the `COPY` statement. See the [COPY documentation](../../sql/statements/copy#copy-to) for more information. -## read_json_auto function +## read_json_auto Function The `read_json_auto` is the simplest method of loading JSON files: it automatically attempts to figure out the correct configuration of the JSON reader. It also automatically deduces types of columns. diff --git a/docs/data/multiple_files/overview.md b/docs/data/multiple_files/overview.md index c74adf176e8..8db63d89de2 100644 --- a/docs/data/multiple_files/overview.md +++ b/docs/data/multiple_files/overview.md @@ -96,7 +96,7 @@ SELECT * FROM read_csv_auto(['flights1.csv', 'flights2.csv'], union_by_name=True | 1988-01-03 | New York, NY | Los Angeles, CA | AA | flights2.csv | -### Glob function to find filenames +### Glob Function to Find Filenames The glob pattern matching syntax can also be used to search for filenames using the `glob` table function. It accepts one parameter: the path to search (which may include glob patterns). diff --git a/docs/data/partitioning/hive_partitioning.md b/docs/data/partitioning/hive_partitioning.md index f68948cf193..b57dd0df4b9 100644 --- a/docs/data/partitioning/hive_partitioning.md +++ b/docs/data/partitioning/hive_partitioning.md @@ -68,7 +68,7 @@ orders By default the system tries to infer if the provided files are in a hive partitioned hierarchy. And if so, the `hive_partitioning` flag is enabled automatically. The autodetection will look at the names of the folders and search for a 'key'='value' pattern. This behaviour can be overridden by setting the `hive_partitioning` flag manually. 
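A brief sketch of the manual override mentioned above, assuming the usual `orders/year=.../month=...` directory layout; the path and the `year`/`month` partition columns are illustrative rather than taken from the visible diff context.

```sql
-- force hive partition parsing on (or set it to false to disable it),
-- regardless of what autodetection would decide
SELECT *
FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2021 AND month = 11;
```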
-#### Hive types +#### Hive Types `hive_types` is a way to specify the logical types of the hive partitions in a struct: diff --git a/docs/data/partitioning/partitioned_writes.md b/docs/data/partitioning/partitioned_writes.md index 5b648a5b848..c3222a55f73 100644 --- a/docs/data/partitioning/partitioned_writes.md +++ b/docs/data/partitioning/partitioned_writes.md @@ -41,7 +41,7 @@ The values of the partitions are automatically extracted from the data. Note tha By default the partitioned write will not allow overwriting existing directories. Use the `OVERWRITE_OR_IGNORE` option to allow overwriting an existing directory. -#### Filename pattern +#### Filename Pattern By default, files will be named `data_0.parquet` or `data_0.csv`. With the flag `FILENAME_PATTERN` a pattern with `{i}` or `{uuid}` can be defined to create specific filenames: * `{i}` will be replaced by an index diff --git a/docs/extensions/httpfs.md b/docs/extensions/httpfs.md index 72031e183eb..11a81879313 100644 --- a/docs/extensions/httpfs.md +++ b/docs/extensions/httpfs.md @@ -166,7 +166,7 @@ could for example result in: | 1 | examplevalue1 | s3://bucket/file1.parquet | 2 | examplevalue1 | s3://bucket/file2.parquet -#### Hive partitioning +#### Hive Partitioning DuckDB also offers support for the Hive partitioning scheme. In the Hive partitioning scheme, data is partitioned in separate files. The columns by which the data is partitioned, are not actually in the files, but are encoded in the file diff --git a/docs/extensions/overview.md b/docs/extensions/overview.md index 9e09785cc5b..9894556957a 100644 --- a/docs/extensions/overview.md +++ b/docs/extensions/overview.md @@ -4,7 +4,7 @@ title: Extensions --- DuckDB has a number of extensions available for use. Not all of them are included by default in every distribution, but DuckDB has a mechanism that allows for remote installation. -## Remote installation +## Remote Installation If a given extensions is not available with your distribution, you can do the following to make it available. @@ -15,19 +15,19 @@ LOAD 'fts'; If you are using the Python API client, you can install and load them with the `load_extension(name: str)` and `install_extension(name: str)` methods. -## Unsigned extensions +## Unsigned Extensions All verified extensions are signed, if you wish to load your own extensions or extensions from untrusted third-parties you'll need to enable the `allow_unsigned_extensions` flag. To load unsigned extensions using the CLI, you'll need to pass the `-unsigned` flag to it on startup. -## Listing extensions +## Listing Extensions You can check the list of core and installed extensions with the following query: ```sql SELECT * FROM duckdb_extensions(); ``` -## All available extensions +## All Available Extensions | Extension name | Description | Aliases | |---|-----|--| @@ -50,7 +50,7 @@ SELECT * FROM duckdb_extensions(); | tpch | Adds TPC-H data generation and query support | | | visualizer | | | -## Downloading extensions directly from S3 +## Downloading Extensions Directly from S3 Downloading an extension directly could be helpful when building a lambda or container that uses DuckDB. DuckDB extensions are stored in public S3 buckets, but the directory structure of those buckets is not searchable. @@ -79,7 +79,7 @@ The list of supported platforms may increase over time, but the current list of See above for a list of extension names and how to pull the latest list of extensions. 
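The direct-download workflow needs the running DuckDB version and an exact extension name, both of which can be retrieved with SQL; `PRAGMA version` and the `duckdb_extensions()` columns below are standard, while the exact download URL pattern is left to the full page since it does not appear in this hunk.

```sql
-- the library_version to plug into the download path
PRAGMA version;

-- the exact extension_name values referenced by the section above
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
ORDER BY extension_name;
```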
-## Loading an extension from local storage +## Loading an Extension from Local Storage Extensions are stored in gzip format, so they must be unzipped prior to use. There are many methods to decompress gzip. Here is a Python example: @@ -100,11 +100,11 @@ INSTALL 'httpfs.duckdb_extension'; LOAD 'httpfs.duckdb_extension'; ``` -## Extension template +## Extension Template A template for creating extensions is available in the [`extension-template` repository](https://github.com/duckdb/extension-template/). Note that this project is work-in-progress. -## Pages in this Section +## Pages in This Section