refactor: Experimental support for filesystem interfaces via fsspec #315

Open
wants to merge 16 commits into
base: main
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
)$

 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.6.3
+  rev: v0.6.4
   hooks:
   - id: ruff
     args: [--fix, --exit-non-zero-on-fix]
118 changes: 113 additions & 5 deletions README.md
@@ -15,14 +15,22 @@ Note: This tap currently does not support incremental state.

## Settings

-| Setting | Required | Default | Description |
-|:--------------------|:--------:|:-------:|:------------|
-| files | False | None | An array of csv file stream settings. |
-| csv_files_definition| False | None | A path to the JSON file holding an array of file settings. |
-| add_metadata_columns| False | False | When True, add the metadata columns (`_sdc_source_file`, `_sdc_source_file_mtime`, `_sdc_source_lineno`) to output. |
+| Setting | Required | Default | Description |
+| :------------------- | :------- | :------ | :----------------------------------------------------------------------------------------------------------------- |
+| files | False | None | An array of csv file stream settings |
+| filesystem | False | local | The filesystem to use for reading files |
+| csv_files_definition | False | None | A path to the JSON file holding an array of file settings |
+| add_metadata_columns | False | False | When True, add the metadata columns (`_sdc_source_file`, `_sdc_source_file_mtime`, `_sdc_source_lineno`) to output |

A full list of supported settings and capabilities is available by running: `tap-csv --about`
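
The effect of `add_metadata_columns` can be illustrated with a minimal sketch that uses only the Python standard library. The function name `read_with_metadata`, the ISO 8601 timestamp format, and the use of the CSV reader's `line_num` for `_sdc_source_lineno` are assumptions made for illustration; the tap's actual implementation may differ.

```python
import csv
import os
from datetime import datetime, timezone


def read_with_metadata(path, encoding="utf-8"):
    """Yield CSV rows as dicts, each extended with the three metadata columns.

    Hypothetical helper: not part of tap-csv.
    """
    # File modification time as an ISO 8601 UTC string (format is an assumption).
    mtime = datetime.fromtimestamp(
        os.path.getmtime(path), tz=timezone.utc
    ).isoformat()
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            row["_sdc_source_file"] = path
            row["_sdc_source_file_mtime"] = mtime
            # line_num counts the lines read so far, header line included.
            row["_sdc_source_lineno"] = reader.line_num
            yield row
```

With this approach, every emitted record carries enough information to trace it back to a specific line of a specific file.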

The `filesystem` setting selects the filesystem that files are read from. The following filesystems are supported:

- `local`, the default, for reading files from the local filesystem.
- [`ftp`](#ftp), for reading files from an FTP server.
- [`github`](#github), for reading files from a GitHub repository.
- [`dropbox`](#dropbox), for reading files from a Dropbox account.

The `config.json` contains an array called `files` that consists of dictionary objects detailing each destination table to be passed to Singer. Each of those entries contains:
* `entity`: The entity name to be passed to singer (i.e. the table)
* `path`: Local path to the file to be ingested. Note that this may be a directory, in which case all files in that directory and any of its subdirectories will be recursively processed
@@ -81,6 +89,106 @@ Optionally, the files definition can be provided by an external json file:
]
```

### Filesystem settings

#### FTP

| Setting | Required | Default | Description |
| :----------- | :------- | :------ | :---------------------- |
| ftp | False | None | FTP connection settings |
| ftp.host | True | None | FTP server host |
| ftp.port | False | 21 | FTP server port |
| ftp.username | False | None | FTP username |
| ftp.password | False | None | FTP password |
| ftp.encoding | False | utf-8 | FTP server encoding |
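
As a rough sketch of how the defaults in the table above could be applied, the following hypothetical helper builds connection arguments from a config dictionary. It assumes the FTP settings are nested under an `ftp` key, mirroring the `github` and `dropbox` example configurations; the function name `ftp_connection_kwargs` is not part of the tap.

```python
def ftp_connection_kwargs(config):
    """Build FTP connection arguments, applying the documented defaults.

    Hypothetical helper: assumes settings live under a nested "ftp" object.
    """
    ftp = config.get("ftp") or {}
    if not ftp.get("host"):
        # ftp.host is the only required FTP setting in the table.
        raise ValueError("ftp.host is required when filesystem is 'ftp'")
    return {
        "host": ftp["host"],
        "port": int(ftp.get("port", 21)),  # documented default: 21
        "username": ftp.get("username"),
        "password": ftp.get("password"),
        "encoding": ftp.get("encoding", "utf-8"),  # documented default: utf-8
    }
```

The resulting dictionary could then be handed to whichever FTP client or filesystem class is in use, provided it accepts these argument names.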

#### GitHub

| Setting | Required | Default | Description |
| :-------------- | :------- | :------ | :---------------------------------------------------------- |
| github | False | None | GitHub connection settings |
| github.org | True | None | GitHub organization or user where the repository is located |
| github.repo | True | None | GitHub repository |
| github.username | False | None | GitHub username |
| github.token | False | None | GitHub token |

<details><summary>Example configuration</summary>
<p>

```json
{
  "add_metadata_columns": true,
  "filesystem": "github",
  "github": {
    "org": "MeltanoLabs",
    "repo": "tap-csv"
  },
  "files": [
    {
      "entity": "alphabet",
      "path": "tap_csv/tests/data/alphabet.csv",
      "keys": [
        "col1"
      ]
    }
  ]
}
```

</p>
</details>


#### Dropbox

| Setting | Required | Default | Description |
| :------------ | :------- | :------ | :-------------------------- |
| dropbox | False | None | Dropbox connection settings |
| dropbox.token | True | None | Dropbox token |

The token needs the `files.content.read` scope:

[![Dropbox scopes](img/dropbox_scopes.png)](https://www.dropbox.com/developers/apps)

<details><summary>Example configuration</summary>
<p>

```json
{
  "add_metadata_columns": true,
  "filesystem": "dropbox",
  "dropbox": {
    "token": "...."
  },
  "files": [
    {
      "entity": "alphabet",
      "path": "/alphabet.csv",
      "keys": [
        "col1"
      ]
    }
  ]
}

</p>
</details>
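
The required settings from the three tables above can be checked before any connection is attempted. The sketch below is one possible way to do that; the names `REQUIRED_SETTINGS` and `missing_filesystem_settings` are hypothetical, and the nesting of settings under `ftp`, `github`, and `dropbox` follows the example configurations.

```python
# Required sub-settings per filesystem, taken from the tables above.
REQUIRED_SETTINGS = {
    "local": [],
    "ftp": ["host"],
    "github": ["org", "repo"],
    "dropbox": ["token"],
}


def missing_filesystem_settings(config):
    """Return the dotted names of required settings that are absent or empty.

    Hypothetical helper: not part of tap-csv.
    """
    filesystem = config.get("filesystem", "local")  # documented default: local
    if filesystem not in REQUIRED_SETTINGS:
        raise ValueError(f"Unsupported filesystem: {filesystem!r}")
    section = config.get(filesystem) or {}
    return [
        f"{filesystem}.{name}"
        for name in REQUIRED_SETTINGS[filesystem]
        if section.get(name) in (None, "")
    ]
```

An empty list means the configuration has everything the selected filesystem needs; otherwise the list names exactly what is missing.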

### Built-in Singer SDK settings

The following settings are supported by the Singer SDK and are automatically handled by the tap:

| Setting | Required | Default | Description |
| :------------------- | :------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| stream_maps | False | None | Config object for stream maps capability. For more information check out [Stream Maps](https://sdk.meltano.com/en/latest/stream_maps.html). |
| stream_map_config | False | None | User-defined config values to be used within map expressions. |
| faker_config | False | None | Config for the [`Faker`](https://faker.readthedocs.io/en/master/) instance variable `fake` used within map expressions. Only applicable if the plugin specifies `faker` as an additional dependency (through the `singer-sdk` `faker` extra or directly). |
| faker_config.seed | False | None | Value to seed the Faker generator for deterministic output: https://faker.readthedocs.io/en/master/#seeding-the-generator |
| faker_config.locale | False | None | One or more LCID locale strings to produce localized output for: https://faker.readthedocs.io/en/master/#localization |
| flattening_enabled | False | None | 'True' to enable schema flattening and automatically expand nested properties. |
| flattening_max_depth | False | None | The max depth to flatten schemas. |
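
To give a sense of what `flattening_enabled` and `flattening_max_depth` control, here is an illustrative sketch of depth-limited flattening. It is not the Singer SDK's code: the `__` separator and the choice to JSON-serialize objects nested deeper than the limit are assumptions made for this example.

```python
import json


def flatten_record(record, max_depth, separator="__", _parent="", _level=0):
    """Expand nested dicts into top-level keys, up to max_depth levels.

    Illustrative only: separator and overflow handling are assumptions.
    """
    flat = {}
    for key, value in record.items():
        name = f"{_parent}{separator}{key}" if _parent else key
        if isinstance(value, dict) and _level < max_depth:
            # Still within the depth limit: recurse and merge the child keys.
            flat.update(
                flatten_record(value, max_depth, separator, name, _level + 1)
            )
        elif isinstance(value, dict):
            # Depth limit reached: keep the nested object as a JSON string.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat
```

A larger `max_depth` produces more, shallower columns; a smaller one leaves deeper structures packed into a single value.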

## Installation

```bash
Binary file added img/dropbox_scopes.png
63 changes: 63 additions & 0 deletions meltano.yml
@@ -18,7 +18,69 @@ plugins:
        keys:
        - col1
      add_metadata_columns: false
    settings_group_validation:
    - [ftp.host]
    - [github.org, github.repo]
    settings:
    - name: filesystem
      kind: options
      options:
      - label: Local Filesystem
        value: local
      - label: FTP
        value: ftp
      - label: GitHub
        value: github

    # FTP settings
    - name: ftp.host
      label: FTP Host
      description: Hostname of the FTP server
      kind: string
    - name: ftp.port
      label: FTP Port
      description: Port of the FTP server
      kind: integer
    - name: ftp.username
      label: FTP Username
      description: Username for the FTP server
      kind: string
    - name: ftp.password
      label: FTP Password
      description: Password for the FTP server
      kind: password
      sensitive: true
    - name: ftp.encoding
      label: FTP Encoding
      description: Encoding for the FTP server
      kind: string

    # GitHub settings
    - name: github.org
      label: GitHub Organization
      description: Organization name on GitHub
      kind: string
    - name: github.repo
      label: GitHub Repository
      description: Repository name on GitHub
      kind: string
    - name: github.username
      label: GitHub Username
      description: Username for GitHub
      kind: string
    - name: github.token
      label: GitHub Token
      description: Token for GitHub
      kind: password
      sensitive: true

    # Dropbox settings
    - name: dropbox.token
      label: Dropbox Token
      description: Token for Dropbox
      kind: password
      sensitive: true

    - name: files
      description: Array of objects containing keys - `entity`, `path`, `keys`, `encoding` (Optional), `delimiter` (Optional), `doublequote` (Optional), `escapechar` (Optional), `quotechar` (Optional), `skipinitialspace` (Optional), `strict` (Optional)
      kind: array
@@ -30,6 +92,7 @@ plugins:
    - name: add_metadata_columns
      description: When True, add the metadata columns (`_sdc_source_file`, `_sdc_source_file_mtime`, `_sdc_source_lineno`) to output.
      kind: boolean

  loaders:
  - name: target-jsonl
    variant: andyh1203
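
The `settings_group_validation` entry in this file lists groups of settings. As a sketch of how such groups might be evaluated, the function below treats a configuration as acceptable when every setting in at least one group is present. That rule, the function name `satisfies_any_group`, and the mapping of dotted names onto nested config objects are all assumptions for illustration, not Meltano's implementation.

```python
def satisfies_any_group(config, groups):
    """Return True if every setting in at least one group is set in config.

    Hypothetical helper: dotted names such as "github.org" are resolved
    against nested dictionaries.
    """

    def is_set(dotted_name):
        value = config
        for part in dotted_name.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        return value is not None

    return any(all(is_set(name) for name in group) for group in groups)
```

For the groups declared above, a config with only `github.org` would not pass, while one with both `github.org` and `github.repo`, or with `ftp.host` alone, would.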