Merge pull request #3543 from szarnyasg/globbing
Document globbing's peculiarities
szarnyasg authored Sep 5, 2024
2 parents b2089f9 + 02491c2 commit 25ddb98
Showing 5 changed files with 56 additions and 17 deletions.
2 changes: 1 addition & 1 deletion docs/extensions/httpfs/overview.md
@@ -8,7 +8,7 @@ redirect_from:
---

The `httpfs` extension is an autoloadable extension implementing a file system that allows reading/writing remote files.
For plain HTTP(S), only file reading is supported. For object storage using the S3 API, the `httpfs` extension supports reading/writing/globbing files.
For plain HTTP(S), only file reading is supported. For object storage using the S3 API, the `httpfs` extension supports reading/writing/[globbing]({% link docs/sql/functions/pattern_matching.md %}#globbing) files.

## Installation and Loading

6 changes: 3 additions & 3 deletions docs/extensions/httpfs/s3api.md
@@ -3,7 +3,7 @@ layout: docu
title: S3 API Support
---

The `httpfs` extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read and write to remote files (while regular http servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, that is now common among industry storage providers.
The `httpfs` extension supports reading/writing/[globbing](#globbing) files on object storage servers using the S3 API. S3 offers a standard API to read and write remote files (while regular HTTP servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, which is now common among industry storage providers.

## Platforms

@@ -173,9 +173,9 @@ FROM read_parquet([
]);
```

### Glob
### Globbing

File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example:
File [globbing]({% link docs/sql/functions/pattern_matching.md %}#globbing) is implemented using the ListObjectsV2 API call and allows using filesystem-like glob patterns to match multiple files, for example:

```sql
SELECT *
2 changes: 1 addition & 1 deletion docs/guides/performance/file_formats.md
@@ -98,7 +98,7 @@ SELECT Prompt FROM sniff_csv('part-0001.csv');
Prompt = FROM read_csv('file_path.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={'hello': 'BIGINT', 'world': 'VARCHAR'});
```

Then, you can adjust `read_csv` command, by e.g., applying filename expansion (globbing), and run with the rest of the options detected by the sniffer:
Then, you can adjust the `read_csv` command, e.g., by applying [filename expansion (globbing)]({% link docs/sql/functions/pattern_matching.md %}#globbing), and run it with the rest of the options detected by the sniffer:

```sql
FROM read_csv('part-*.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={'hello': 'BIGINT', 'world': 'VARCHAR'});
2 changes: 1 addition & 1 deletion docs/sql/dialect/friendly_sql.md
@@ -58,7 +58,7 @@ DuckDB offers several advanced SQL features as well syntactic sugar to make SQL
* [Auto-detecting the headers and schema of CSV files]({% link docs/data/csv/auto_detection.md %})
* Directly querying [CSV files]({% link docs/data/csv/overview.md %}) and [Parquet files]({% link docs/data/parquet/overview.md %})
* Loading from files using the syntax `FROM 'my.csv'`, `FROM 'my.csv.gz'`, `FROM 'my.parquet'`, etc.
* Filename expansion (globbing), e.g.: `FROM 'my-data/part-*.parquet'`
* [Filename expansion (globbing)]({% link docs/sql/functions/pattern_matching.md %}#globbing), e.g.: `FROM 'my-data/part-*.parquet'`

## Functions and Expressions

61 changes: 50 additions & 11 deletions docs/sql/functions/pattern_matching.md
@@ -64,14 +64,14 @@ SELECT 'A%c' ILIKE 'a$%c' ESCAPE '$'; -- true

There are also alternative characters that can be used as keywords in place of `LIKE` expressions. These enhance PostgreSQL compatibility.

<div class="narrow_table"></div>
<div class="narrow_table monospace_table"></div>

| LIKE-style | PostgreSQL-style |
|:---|:---|
| `LIKE` | `~~` |
| `NOT LIKE` | `!~~` |
| `ILIKE` | `~~*` |
| `NOT ILIKE` | `!~~*` |
| LIKE | ~~ |
| NOT LIKE | !~~ |
| ILIKE | ~~* |
| NOT ILIKE | !~~* |
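
As a brief illustration of the table above, the following sketch applies each PostgreSQL-style operator to a literal string. The strings and patterns are made up for the example; each result comment assumes the operator behaves identically to its `LIKE`-style counterpart in the same row.

```sql
SELECT 'abc' ~~ 'a%';    -- true, same as 'abc' LIKE 'a%'
SELECT 'abc' !~~ 'a%';   -- false, same as 'abc' NOT LIKE 'a%'
SELECT 'ABC' ~~* 'a%';   -- true, same as 'ABC' ILIKE 'a%'
SELECT 'ABC' !~~* 'a%';  -- false, same as 'ABC' NOT ILIKE 'a%'
```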

## `SIMILAR TO`

@@ -93,18 +93,25 @@ SELECT 'abc' NOT SIMILAR TO 'abc'; -- false

There are also alternative characters that can be used as keywords in place of `SIMILAR TO` expressions. These follow POSIX syntax.

<div class="narrow_table"></div>
<div class="narrow_table monospace_table"></div>

| `SIMILAR TO`-style | POSIX-style |
|:---|:---|
| `SIMILAR TO` | `~` |
| `NOT SIMILAR TO` | `!~` |
| SIMILAR TO | ~ |
| NOT SIMILAR TO | !~ |
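
The POSIX-style operators can be tried in the same way. This is a minimal sketch with made-up literals, assuming each operator is a drop-in replacement for the `SIMILAR TO` form in the same row of the table.

```sql
SELECT 'abc' ~ 'abc';   -- true, same as 'abc' SIMILAR TO 'abc'
SELECT 'abc' !~ 'abc';  -- false, same as 'abc' NOT SIMILAR TO 'abc'
```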

## Globbing

## `GLOB`
DuckDB supports file name expansion, also known as globbing, for discovering files.
DuckDB's glob syntax uses the question mark (`?`) wildcard to match any single character and the asterisk (`*`) to match zero or more characters.
In addition, you can use the bracket syntax (`[...]`) to match any single character contained within the brackets, or within the character range specified by the brackets. An exclamation mark (`!`) may be used inside the first bracket to search for a character that is not contained within the brackets.
To learn more, visit the [“glob (programming)” Wikipedia page](https://en.wikipedia.org/wiki/Glob_(programming)).
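
The wildcards described above can be tried out on plain strings with the `GLOB` operator. The file names below are hypothetical and serve only to show one wildcard per query.

```sql
-- `?` matches exactly one character
SELECT 'part-1.csv' GLOB 'part-?.csv';       -- true
SELECT 'part-10.csv' GLOB 'part-?.csv';      -- false
-- `*` matches zero or more characters
SELECT 'part-.csv' GLOB 'part-*.csv';        -- true
-- `[...]` matches one character from a set or range
SELECT 'part-7.csv' GLOB 'part-[0-9].csv';   -- true
-- `[!...]` matches one character that is not in the set or range
SELECT 'part-a.csv' GLOB 'part-[!0-9].csv';  -- true
```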

### `GLOB`

<div id="rrdiagram3"></div>

The `GLOB` operator returns `true` or `false` if the string matches the `GLOB` pattern. The `GLOB` operator is most commonly used when searching for filenames that follow a specific pattern (for example a specific file extension). Use the question mark (`?`) wildcard to match any single character, and use the asterisk (`*`) to match zero or more characters. In addition, use bracket syntax (`[...]`) to match any single character contained within the brackets, or within the character range specified by the brackets. An exclamation mark (`!`) may be used inside the first bracket to search for a character that is not contained within the brackets. To learn more, visit the [Glob Programming Wikipedia page](https://en.wikipedia.org/wiki/Glob_(programming)).
The `GLOB` operator returns `true` if the string matches the `GLOB` pattern and `false` otherwise. The `GLOB` operator is most commonly used when searching for filenames that follow a specific pattern (for example, a specific file extension).

Some examples:

@@ -154,7 +161,7 @@ Search the current directory for all files:
SELECT * FROM glob('*');
```

<div class="narrow_table"></div>
<div class="narrow_table monospace_table"></div>

| file |
|---------------|
@@ -166,6 +173,38 @@ SELECT * FROM glob('*');
| test2.parquet |
| todos.json |

### Globbing Semantics

DuckDB's globbing implementation follows the semantics of [Python's `glob`](https://docs.python.org/3/library/glob.html) and not the `glob` used in the shell.
A notable difference is the behavior of the `**/` construct: `**/⟨filename⟩` will not return a file named `⟨filename⟩` in the top-level directory.
For example, with a `README.md` file present in the directory, the following query finds it:

```sql
SELECT * FROM glob('README.md');
```

<div class="narrow_table monospace_table"></div>

| file |
|-----------|
| README.md |

However, the following query returns an empty result:

```sql
SELECT * FROM glob('**/README.md');
```

Meanwhile, the globbing of Bash, Zsh, etc. finds the file using the same syntax:

```bash
ls **/README.md
```

```text
README.md
```
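
If matches from both the top-level directory and nested directories are wanted, one possible workaround is to combine two `glob` calls. This is only a sketch, assuming the semantics described above: the first branch covers the top-level file and the second covers files in subdirectories.

```sql
-- Top-level README.md, if present
SELECT * FROM glob('README.md')
UNION ALL
-- README.md files in nested directories
SELECT * FROM glob('**/README.md');
```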

## Regular Expressions

DuckDB's regex support is documented on the [Regular Expressions page]({% link docs/sql/functions/regular_expressions.md %}).
