diff --git a/docs/extensions/httpfs/overview.md b/docs/extensions/httpfs/overview.md index b4f8621f6d8..10323f75ab1 100644 --- a/docs/extensions/httpfs/overview.md +++ b/docs/extensions/httpfs/overview.md @@ -8,7 +8,7 @@ redirect_from: --- The `httpfs` extension is an autoloadable extension implementing a file system that allows reading remote/writing remote files. -For plain HTTP(S), only file reading is supported. For object storage using the S3 API, the `httpfs` extension supports reading/writing/globbing files. +For plain HTTP(S), only file reading is supported. For object storage using the S3 API, the `httpfs` extension supports reading/writing/[globbing]({% link docs/sql/functions/pattern_matching.md %}#globbing) files. ## Installation and Loading diff --git a/docs/extensions/httpfs/s3api.md b/docs/extensions/httpfs/s3api.md index e6c61c8f362..475888931f9 100644 --- a/docs/extensions/httpfs/s3api.md +++ b/docs/extensions/httpfs/s3api.md @@ -3,7 +3,7 @@ layout: docu title: S3 API Support --- -The `httpfs` extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read and write to remote files (while regular http servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, that is now common among industry storage providers. +The `httpfs` extension supports reading/writing/[globbing](#globbing) files on object storage servers using the S3 API. S3 offers a standard API to read and write to remote files (while regular HTTP servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, which is now common among industry storage providers.
## Platforms @@ -173,9 +173,9 @@ FROM read_parquet([ ]); ``` -### Glob +### Globbing -File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example: +File [globbing]({% link docs/sql/functions/pattern_matching.md %}#globbing) is implemented using the ListObjectsV2 API call and allows using filesystem-like glob patterns to match multiple files, for example: ```sql SELECT * diff --git a/docs/guides/performance/file_formats.md b/docs/guides/performance/file_formats.md index d925b7a9526..e2242f1b2bc 100644 --- a/docs/guides/performance/file_formats.md +++ b/docs/guides/performance/file_formats.md @@ -98,7 +98,7 @@ SELECT Prompt FROM sniff_csv('part-0001.csv'); Prompt = FROM read_csv('file_path.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={'hello': 'BIGINT', 'world': 'VARCHAR'}); ``` -Then, you can adjust `read_csv` command, by e.g., applying filename expansion (globbing), and run with the rest of the options detected by the sniffer: +Then, you can adjust the `read_csv` command, e.g., by applying [filename expansion (globbing)]({% link docs/sql/functions/pattern_matching.md %}#globbing), and run it with the rest of the options detected by the sniffer: ```sql FROM read_csv('part-*.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\n', skip=0, header=true, columns={'hello': 'BIGINT', 'world': 'VARCHAR'}); diff --git a/docs/sql/dialect/friendly_sql.md b/docs/sql/dialect/friendly_sql.md index d32f8801046..e9619bf7dbf 100644 --- a/docs/sql/dialect/friendly_sql.md +++ b/docs/sql/dialect/friendly_sql.md @@ -58,7 +58,7 @@ DuckDB offers several advanced SQL features as well syntactic sugar to make SQL * [Auto-detecting the headers and schema of CSV files]({% link docs/data/csv/auto_detection.md %}) * Directly querying [CSV files]({% link docs/data/csv/overview.md %}) and [Parquet files]({% link docs/data/parquet/overview.md %}) 
* Loading from files using the syntax `FROM 'my.csv'`, `FROM 'my.csv.gz'`, `FROM 'my.parquet'`, etc. -* Filename expansion (globbing), e.g.: `FROM 'my-data/part-*.parquet'` +* [Filename expansion (globbing)]({% link docs/sql/functions/pattern_matching.md %}#globbing), e.g.: `FROM 'my-data/part-*.parquet'` ## Functions and Expressions diff --git a/docs/sql/functions/pattern_matching.md b/docs/sql/functions/pattern_matching.md index 8b4f5fb131e..0f1e2c5d6a1 100644 --- a/docs/sql/functions/pattern_matching.md +++ b/docs/sql/functions/pattern_matching.md @@ -64,14 +64,14 @@ SELECT 'A%c' ILIKE 'a$%c' ESCAPE '$'; -- true There are also alternative characters that can be used as keywords in place of `LIKE` expressions. These enhance PostgreSQL compatibility. -
+
| LIKE-style | PostgreSQL-style | |:---|:---| -| `LIKE` | `~~` | -| `NOT LIKE` | `!~~` | -| `ILIKE` | `~~*` | -| `NOT ILIKE` | `!~~*` | +| LIKE | ~~ | +| NOT LIKE | !~~ | +| ILIKE | ~~* | +| NOT ILIKE | !~~* | ## `SIMILAR TO` @@ -93,18 +93,25 @@ SELECT 'abc' NOT SIMILAR TO 'abc'; -- false There are also alternative characters that can be used as keywords in place of `SIMILAR TO` expressions. These follow POSIX syntax. -
+
| `SIMILAR TO`-style | POSIX-style | |:---|:---| -| `SIMILAR TO` | `~` | -| `NOT SIMILAR TO` | `!~` | +| SIMILAR TO | ~ | +| NOT SIMILAR TO | !~ | + +## Globbing -## `GLOB` +DuckDB supports file name expansion, also known as globbing, for discovering files. +DuckDB's glob syntax uses the question mark (`?`) wildcard to match any single character and the asterisk (`*`) to match zero or more characters. +In addition, you can use the bracket syntax (`[...]`) to match any single character contained within the brackets, or within the character range specified by the brackets. An exclamation mark (`!`) may be used inside the first bracket to search for a character that is not contained within the brackets. +To learn more, visit the [“glob (programming)” Wikipedia page](https://en.wikipedia.org/wiki/Glob_(programming)). + +### `GLOB`
-The `GLOB` operator returns `true` or `false` if the string matches the `GLOB` pattern. The `GLOB` operator is most commonly used when searching for filenames that follow a specific pattern (for example a specific file extension). Use the question mark (`?`) wildcard to match any single character, and use the asterisk (`*`) to match zero or more characters. In addition, use bracket syntax (`[...]`) to match any single character contained within the brackets, or within the character range specified by the brackets. An exclamation mark (`!`) may be used inside the first bracket to search for a character that is not contained within the brackets. To learn more, visit the [Glob Programming Wikipedia page](https://en.wikipedia.org/wiki/Glob_(programming)). +The `GLOB` operator returns `true` if the string matches the `GLOB` pattern and `false` otherwise. The `GLOB` operator is most commonly used when searching for filenames that follow a specific pattern (for example, a specific file extension). Some examples: @@ -154,7 +161,7 @@ Search the current directory for all files: SELECT * FROM glob('*'); ``` -
+
| file | |---------------| @@ -166,6 +173,38 @@ SELECT * FROM glob('*'); | test2.parquet | | todos.json | +### Globbing Semantics + +DuckDB's globbing implementation follows the semantics of [Python's `glob`](https://docs.python.org/3/library/glob.html) and not the `glob` used in the shell. +A notable difference is the behavior of the `**/` construct: `**/⟨filename⟩` will not return a file with `⟨filename⟩` in the top-level directory. +For example, with a `README.md` file present in the directory, the following query finds it: + +```sql +SELECT * FROM glob('README.md'); +``` + +
+ +| file | +|-----------| +| README.md | + +However, the following query returns an empty result: + +```sql +SELECT * FROM glob('**/README.md'); +``` + +Meanwhile, the globbing of Bash, Zsh, etc. finds the file using the same syntax: + +```bash +ls **/README.md +``` + +```text +README.md +``` + ## Regular Expressions DuckDB's regex support is documented on the [Regular Expressions page]({% link docs/sql/functions/regular_expressions.md %}).
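
Reviewer note on the new "Globbing Semantics" section: since it points at Python's `glob` as the reference, the `**/` behaviour can be cross-checked in Python itself. The snippet below is a minimal sketch that exercises only Python's standard library and does not run DuckDB. The directory layout (`README.md` at the top level plus `sub/README.md`) and the helper name `relative_matches` are assumptions made for illustration.

```python
import glob
import os
import tempfile


def relative_matches(root, pattern, recursive=False):
    """Return sorted matches of `pattern` under `root`, relative to `root`."""
    # glob.escape protects any glob metacharacters in the temp directory path.
    full_pattern = os.path.join(glob.escape(root), pattern)
    hits = glob.glob(full_pattern, recursive=recursive)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: one README.md at the top level, one in sub/.
        os.mkdir(os.path.join(root, "sub"))
        for rel in ("README.md", os.path.join("sub", "README.md")):
            with open(os.path.join(root, rel), "w") as f:
                f.write("")
        plain = relative_matches(root, "README.md")
        # Without recursive=True, Python treats "**" like a single "*".
        default = relative_matches(root, "**/README.md")
        # With recursive=True, "**" also matches zero directories.
        recursive = relative_matches(root, "**/README.md", recursive=True)
    return plain, default, recursive


plain, default, recursive = demo()
print(plain)      # ['README.md']
print(default)    # ['sub/README.md']
print(recursive)  # ['README.md', 'sub/README.md']
```

In Python, the top-level file is skipped only in the default, non-recursive call; with `recursive=True` it is matched. It may be worth stating in the docs which of the two modes the comparison refers to.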