merging with main for linting fix
Alex-Monahan committed Aug 24, 2023
2 parents 510d1d6 + 3e11d47 commit 5f6f07e
Showing 58 changed files with 215 additions and 173 deletions.
1 change: 1 addition & 0 deletions .github/linkchecker/linkchecker.conf
@@ -18,3 +18,4 @@ ignore=
https://nbviewer\.org/.*
https://open\.spotify\.com/.*
https://www\.dataengineeringpodcast\.com/.*
https://www.tpc.org/.*
4 changes: 1 addition & 3 deletions .github/workflows/lint.yml
@@ -18,9 +18,7 @@ jobs:
- uses: articulate/actions-markdownlint@main
with:
config: .markdownlint.jsonc
files: 'docs/**/*.md'
# TODO:
# files: 'docs/**/*.md _posts/*.md dev/*.md'
files: 'docs/**/*.md _posts/*.md dev/*.md'

python:
runs-on: ubuntu-latest
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -41,11 +41,13 @@ Some of this style guide is automated with GitHub Actions, but feel free to run
* Quoted blocks (lines starting with `>`) are rendered as [a "Note" box](https://duckdb.org/docs/archive/0.8.1/guides/python/filesystems).
* Always format SQL code, variable names, function names, etc. as code. For example, when talking about the `CREATE TABLE` statement, the keywords should be formatted as code.
* When presenting SQL statements, do not include the DuckDB prompt (`D `) in the documentation.
* SQL statements should end with a semicolon (`;`) to allow readers to quickly paste them into a SQL console.

### Headers

* The title of the page should be encoded in the front matter's `title` property. The body of the page should not start with a repetition of this title.
* In the body of the page, restrict the use of headers to the following levels: h2 (`##`), h3 (`###`), and h4 (`####`).
* Use headline capitalization as defined in the [Chicago Manual of Style](https://headlinecapitalization.com/).

### SQL style

2 changes: 1 addition & 1 deletion _config_exclude_archive.yml
@@ -1,2 +1,2 @@
# bundler exec jekyll serve --incremental --config _config.yml,_config_exclude_archive.yml
exclude: ['docs/archive']
exclude: ['docs/archive', 'vendor']
4 changes: 4 additions & 0 deletions _data/menu_docs_dev.json
@@ -925,6 +925,10 @@
"page": "HTTPFS",
"url": "httpfs"
},
{
"page": "Iceberg",
"url": "iceberg"
},
{
"page": "JSON",
"url": "json"
8 changes: 4 additions & 4 deletions _posts/2021-08-27-external-sorting.md
@@ -372,12 +372,12 @@ We see similar trends at SF10 and SF100, but for SF100, at around 12 payload col
ClickHouse switches to an external sorting strategy, which is much slower than its in-memory strategy.
Therefore, adding a few payload columns results in a runtime that is orders of magnitude higher.
At 20 payload columns ClickHouse runs into the following error:
```
```text
DB::Exception: Memory limit (for query) exceeded: would use 11.18 GiB (attempt to allocate chunk of 4204712 bytes), maximum: 11.18 GiB: (while reading column cs_list_price): (while reading from part ./store/523/5230c288-7ed5-45fa-9230-c2887ed595fa/all_73_108_2/ from mark 4778 with max_rows_to_read = 8192): While executing MergeTreeThread.
```

HyPer also drops in performance before erroring out with the following message:
```
```text
ERROR: Cannot allocate 333982248 bytes of memory: The `global memory limit` limit of 12884901888 bytes was exceeded.
```
As far as we are aware, HyPer uses [`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html), which creates a mapping between memory and a file.
@@ -392,7 +392,7 @@ Using swap usually slows down processing significantly, but the SSD is so fast t
While Pandas loads the data, swap size grows to an impressive \~40 GB: Both the file and the data frame are fully in memory/swap at the same time, rather than streamed into memory.
This goes down to \~20 GB of memory/swap when the file is done being read.
Pandas is able to get quite far into the experiment until it crashes with the following error:
```
```text
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
```

@@ -521,7 +521,7 @@ We have set the number of threads that DuckDB and ClickHouse use to 8 because we

Pandas performs comparatively worse than on the MacBook, because it has a single-threaded implementation, and this CPU has a lower single-thread performance.
Again, Pandas crashes with an error (this machine does not dynamically increase swap):
```
```text
numpy.core._exceptions.MemoryError: Unable to allocate 6.32 GiB for an array with shape (6, 141430723) and data type float64
```

4 changes: 2 additions & 2 deletions _posts/2021-10-29-duckdb-wasm.md
@@ -245,14 +245,14 @@ In 2018, the Spectre and Meltdown vulnerabilities sent crippling shockwaves thro

Without `SharedArrayBuffers`, WebAssembly modules can run in a dedicated web worker to unblock the main event loop but won't be able to spawn additional workers for parallel computations within the same instance. By default, we therefore cannot unleash the parallel query execution of DuckDB in the web. However, browser vendors have recently started to reenable `SharedArrayBuffers` for websites that are [cross-origin-isolated](https://web.dev/coop-coep/). A website is cross-origin-isolated if it ships the main document with the following HTTP headers:

```
```text
Cross-Origin-Embedded-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin
```

These headers will instruct browsers to A) isolate the top-level document from other top-level documents outside its own origin and B) prevent the document from making arbitrary cross-origin requests unless the requested resource explicitly opts in. Both restrictions have far reaching implications for a website since many third-party data sources won't yet provide the headers today and the top-level isolation currently hinders the communication with, for example, OAuth pop up's ([there are plans to lift that](https://github.com/whatwg/html/issues/6364)).

*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users. Share your thoughts with us [here]().*
*We therefore assume that DuckDB-Wasm will find the majority of users on non-isolated websites. We are, however, experimenting with dedicated bundles for isolated sites using the suffix `-next-coi`) and will closely monitor the future need of our users.*

## Web Shell

2 changes: 1 addition & 1 deletion _posts/2021-11-26-duck-enum.md
@@ -121,7 +121,7 @@ df_out <- dbReadTable(con, "characters")

To demonstrate the performance of DuckDB when running operations on categorical columns of Pandas DataFrames, we present a number of benchmarks. The source code for the benchmarks is available [here](https://raw.githubusercontent.com/duckdb/duckdb-web/main/_posts/benchmark_scripts/enum.py). In our benchmarks we always consume and produce Pandas DataFrames.

#### Dataset
### Dataset

Our dataset is composed of one dataframe with 4 columns and 10 million rows. The first two columns are named ```race``` and ```subrace``` representing races. They are both categorical, with the same categories but different values. The other two columns ```race_string``` and ```subrace_string``` are the string representations of ```race``` and ```subrace```.

2 changes: 1 addition & 1 deletion _posts/2021-12-03-duck-arrow.md
@@ -83,7 +83,7 @@ nyc.filter("year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amou

In this section, we will look at some basic examples of the code needed to read and output Arrow tables in both Python and R.

#### Setup
### Setup

First we need to install DuckDB and Arrow. The installation process for both libraries in Python and R is shown below.
```bash
2 changes: 1 addition & 1 deletion _posts/2022-05-04-friendlier-sql.md
@@ -267,7 +267,7 @@ In addition to what has already been implemented, several other improvements hav
- Clickhouse supports this with the [`COLUMNS` expression](https://clickhouse.com/docs/en/sql-reference/statements/select/#columns-expression)
- Incremental column aliases
- Refer to previously defined aliases in subsequent calculated columns rather than re-specifying the calculations
- Dot operators for JSON types
- Dot operators for JSON types
- The JSON extension is brand new ([see our documentation!](https://duckdb.org/docs/extensions/json)) and already implements friendly `->` and `->>` syntax

Thanks for checking out DuckDB! May the Force be with you...
2 changes: 1 addition & 1 deletion _posts/2022-09-30-postgres-scanner.md
@@ -51,7 +51,7 @@ CALL postgres_attach('dbname=myshinydb');
`postgres_attach` takes a single required string parameter, which is the [`libpq` connection string](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING). For example you can pass `'dbname=myshinydb'` to select a different database name. In the simplest case, the parameter is just `''`. There are three additional named parameters to the function:
* `source_schema` the name of a non-standard schema name in Postgres to get tables from. Default is `public`.
* `overwrite` whether we should overwrite existing views in the target schema, default is `false`.
* `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls.
* `filter_pushdown` whether filter predicates that DuckDB derives from the query should be forwarded to Postgres, defaults to `false`. See below for a discussion of what this parameter controls.

The tables in the database are registered as views in DuckDB, you can list them with
```SQL
14 changes: 8 additions & 6 deletions _posts/2022-11-14-announcing-duckdb-060.md
@@ -107,7 +107,7 @@ CREATE TABLE messages(u UNION(num INT, error VARCHAR));
INSERT INTO messages VALUES (42);
INSERT INTO messages VALUES ('oh my globs');
```
```
```text
SELECT * FROM messages;
┌─────────────┐
│ u │
@@ -142,7 +142,7 @@ CREATE TABLE obs(id INT, val1 INT, val2 INT);
INSERT INTO obs VALUES (1, 10, 100), (2, 20, NULL), (3, NULL, 300);
SELECT MIN(COLUMNS(*)), COUNT(*) from obs;
```
```
```text
┌─────────────┬───────────────┬───────────────┬──────────────┐
│ min(obs.id) │ min(obs.val1) │ min(obs.val2) │ count_star() │
├─────────────┼───────────────┼───────────────┼──────────────┤
@@ -155,7 +155,7 @@ The `COLUMNS` expression supports all star expressions, including [the `EXCLUDE`
```sql
SELECT COLUMNS('val[0-9]+') from obs;
```
```
```text
┌──────┬──────┐
│ val1 │ val2 │
├──────┼──────┤
@@ -170,7 +170,7 @@ SELECT COLUMNS('val[0-9]+') from obs;
```sql
SELECT [x + 1 for x in [1, 2, 3]] AS l;
```
```
```text
┌───────────┐
│ l │
├───────────┤
@@ -211,8 +211,10 @@ The DuckDB shell also offers several improvements over the SQLite shell, such as

The number of rows that are rendered can be changed by using the `.maxrows X` setting, and you can switch back to the old rendering using the `.mode box` command.

```
```sql
D SELECT * FROM '~/Data/nyctaxi/nyc-taxi/2014/04/data.parquet';
```
```text
┌───────────┬─────────────────────┬─────────────────────┬───┬────────────┬──────────────┬──────────────┐
│ vendor_id │ pickup_at │ dropoff_at │ … │ tip_amount │ tolls_amount │ total_amount │
│ varchar │ timestamp │ timestamp │ │ float │ float │ float │
@@ -265,7 +267,7 @@ SELECT student_id FROM 'data/ -> data/grades.csv

**Progress Bars**. DuckDB has [supported progress bars in queries for a while now](https://github.com/duckdb/duckdb/pull/1432), but they have always been opt-in. In this release we have [prettied up the progress bar](https://github.com/duckdb/duckdb/pull/5187) and enabled it by default in the shell. The progress bar will pop up when a query is run that takes more than 2 seconds, and display an estimated time-to-completion for the query.

```
```sql
D copy lineitem to 'lineitem-big.parquet';
32% ▕███████████████████▏ ▏
```
16 changes: 8 additions & 8 deletions _posts/2023-02-13-announcing-duckdb-070.md
@@ -44,7 +44,7 @@ COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));

This will cause the Parquet files to be written in the following directory structure:

```
```text
orders
├── year=2021
│ ├── month=1
@@ -145,7 +145,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
>>> lineitem = duckdb.sql('FROM lineitem.parquet')
>>> lineitem.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -161,7 +161,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
>>> lineitem_filtered = duckdb.sql('FROM lineitem WHERE l_orderkey>5000')
>>> lineitem_filtered.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -176,7 +176,7 @@ SELECT * FROM t1 POSITIONAL JOIN t2;
```py
>>> duckdb.sql('SELECT MIN(l_orderkey), MAX(l_orderkey) FROM lineitem_filtered').show()
```
```
```text
┌─────────────────┬─────────────────┐
│ min(l_orderkey) │ max(l_orderkey) │
│ int32 │ int32 │
@@ -193,7 +193,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk
>>> lineitem = duckdb.read_csv('lineitem.csv')
>>> lineitem.limit(3).show()
```
```
```text
┌────────────┬───────────┬───────────┬───┬───────────────────┬────────────┬──────────────────────┐
│ l_orderkey │ l_partkey │ l_suppkey │ … │ l_shipinstruct │ l_shipmode │ l_comment │
│ int32 │ int32 │ int32 │ │ varchar │ varchar │ varchar │
@@ -208,7 +208,7 @@ Note that everything is lazily evaluated. The Parquet file is not read from disk
```py
>>> duckdb.sql('select min(l_orderkey) from lineitem').show()
```
```
```text
┌─────────────────┐
│ min(l_orderkey) │
│ int32 │
@@ -225,7 +225,7 @@ import duckdb
duckdb.sql('select 42').pl()
```

```
```text
shape: (1, 1)
┌─────┐
│ 42 │
@@ -245,7 +245,7 @@ df = pl.DataFrame({'a': 42})
duckdb.sql('select * from df').pl()
```

```
```text
shape: (1, 1)
┌─────┐
│ a │
4 changes: 2 additions & 2 deletions _posts/2023-04-14-h2oai.md
@@ -30,7 +30,7 @@ The time reported is the sum of the time it takes to run all 5 queries twice.

More information about the specific queries can be found below.

#### The Data and Queries
### The Data and Queries

The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation.

@@ -87,7 +87,7 @@ You can also look at the results [here](https://duckdblabs.github.io/db-benchmar

Some solutions may report internal errors for some queries. Feel free to investigate the errors by using the [_utils/repro.sh](https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh) script and file a github issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system or if you have any other questions, you can create a github issue to discuss.

# Maintenance plan
## Maintenance plan

DuckDB will continue to maintain this benchmark for the forseeable future. The process for re-running the benchmarks with updated library versions must still be decided.

2 changes: 1 addition & 1 deletion _posts/2023-04-21-swift.md
@@ -69,7 +69,7 @@ One problem with our current `ExoplanetStore` type is that it doesn’t yet cont

There are hundreds of configuration options for this incredible resource, but today we want each exoplanet’s name and its discovery year packaged as a CSV. [Checking the docs](https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html) gives us the following endpoint:

```
```text
https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv
```

2 changes: 1 addition & 1 deletion _posts/2023-05-26-correlated-subqueries-in-sql.md
@@ -238,7 +238,7 @@ WHERE distance=(
);
```

```
```text
┌───────────────────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
4 changes: 2 additions & 2 deletions dev/sqllogictest/intro.md
@@ -62,7 +62,7 @@ A syntax highlighter exists for [Visual Studio Code](https://marketplace.visuals

A syntax highlighter is also available for [CLion](https://plugins.jetbrains.com/plugin/15295-sqltest). It can be installed directly on the IDE by searching SQLTest on the marketplace. A [github repository](https://github.com/pdet/SQLTest) is also available, with extensions and bug reports being welcome.

##### Temporary Files
#### Temporary Files

For some tests (e.g. CSV/Parquet file format tests) it is necessary to create temporary files. Any temporary files should be created in the temporary testing directory. This directory can be used by placing the string `__TEST_DIR__` in a query. This string will be replaced by the path of the temporary testing directory.

Expand All @@ -71,7 +71,7 @@ statement ok
COPY csv_data TO '__TEST_DIR__/output_file.csv.gz' (COMPRESSION GZIP);
```

##### Require & Extensions
#### Require & Extensions

To avoid bloating the core system, certain functionality of DuckDB is available only as an extension. Tests can be build for those extensions by adding a `require` field in the test. If the extension is not loaded, any statements that occurs after the require field will be skipped. Examples of this are `require parquet` or `require icu`.

12 changes: 6 additions & 6 deletions docs/api/cli.md
@@ -275,9 +275,9 @@ D .schema
```

```sql
CREATE TABLE fliers(animal VARCHAR);;
CREATE TABLE swimmers(animal VARCHAR);;
CREATE TABLE walkers(animal VARCHAR);;
CREATE TABLE fliers(animal VARCHAR);
CREATE TABLE swimmers(animal VARCHAR);
CREATE TABLE walkers(animal VARCHAR);
```

## Opening Database Files
@@ -452,7 +452,7 @@ Note that the duck head is built with unicode characters and does not always wor
-- Duck head prompt
.prompt '⚫◗ '
-- Example SQL statement
SELECT 'Begin quacking!' as "Ready, Set, ..."
SELECT 'Begin quacking!' AS "Ready, Set, ...";
```

To invoke that file on initialization, use this command:
Expand All @@ -477,7 +477,7 @@ Use ".open FILENAME" to reopen on a persistent database.
⚫◗
```

## Non-interactive usage
## Non-interactive Usage

To read/process a file and exit immediately, pipe the file contents in to `duckdb`:

@@ -521,7 +521,7 @@ D LOAD 'fts';

<!-- SQL parameters do not appear to work -->

## Reading from stdin and writing to stdout
## Reading from stdin and Writing to stdout

When in a Unix environment, it can be useful to pipe data between multiple commands.
DuckDB is able to read data from stdin as well as write to stdout using the file location of stdin (`/dev/stdin`) and stdout (`/dev/stdout`) within SQL commands, as pipes act very similarly to file handles.
