Merge pull request #195 from dlt-hub/setup-guide
GitHub and Google Sheets setup guides
rahuljo authored Mar 30, 2023
2 parents bddb41b + 1a34bd5 commit 6732c4f
Showing 4 changed files with 300 additions and 0 deletions.
175 changes: 175 additions & 0 deletions docs/website/docs/pipelines/github.md
@@ -0,0 +1,175 @@
# GitHub pipeline setup guide

This pipeline can be used to load data on issues or pull requests from any GitHub repository onto a [destination](https://dlthub.com/docs/destinations) of your choice.

This pipeline can access the GitHub API from two `dlt` sources:
1. `github_reactions` with the resource endpoints `issues` and `pullRequests`.
2. `github_repo_events` with the resource endpoint `repo_events`.

## Grab the API auth token
You can optionally add an API access token to avoid making requests as an unauthenticated user, which is subject to stricter rate limits. Note: if you wish to load reaction data, the access token is mandatory.

To get the API token, sign in to your GitHub account and follow these steps:

1. Click on your profile picture in the top right corner.
2. Choose *Settings*.
3. Select *Developer settings* in the left panel.
4. Under *Personal access tokens*, click on *Generate a personal access token* (preferably under *Tokens (classic)*).
5. Grant at least the following scopes to the token by checking them:


| Scope | Description |
| --- | --- |
| public_repo | Limits access to public repositories. |
| read:repo_hook | Grants read and ping access to hooks in public or private repositories. |
| read:org | Read-only access to organization membership, organization projects, and team membership. |
| read:user | Grants access to read a user's profile data. |
| read:project | Grants read-only access to user and organization projects. |
| read:discussion | Allows read access for team discussions. |

6. Finally, select *Generate token*.
7. Copy the token and save it somewhere safe. It will be added to the `dlt` configuration later.

You can learn more about GitHub authentication in the docs [here](https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28#basic-authentication) and API token scopes [here](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/scopes-for-oauth-apps).
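
If you want to double-check that the token works before wiring it into the pipeline, a quick standalone sketch like the one below (assuming the `requests` package is installed; this is not part of the generated project) can call the GitHub API with it:

```python
import requests

# Hypothetical standalone check, not part of the dlt project:
# call the GitHub API with the token and confirm it authenticates.
token = "GITHUB_API_TOKEN"  # paste the token generated above

response = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code)          # 200 means the token is valid
print(response.json().get("login"))  # the account the token belongs to
```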

## Initialize the pipeline

Initialize the pipeline with the following command:
```bash
dlt init github bigquery
```
Here, we chose BigQuery as the destination. To choose a different destination, replace `bigquery` with your choice of destination.

Running this command will create a directory with the following structure:
```bash
github_pipeline
β”œβ”€β”€ .dlt
β”‚   β”œβ”€β”€ .pipelines
β”‚   β”œβ”€β”€ config.toml
β”‚   └── secrets.toml
β”œβ”€β”€ github
β”‚   β”œβ”€β”€ __pycache__
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── queries.py
β”œβ”€β”€ .gitignore
β”œβ”€β”€ github_pipeline.py
└── requirements.txt
```

## Add credentials

1. In the `.dlt` folder, you will find `secrets.toml`, which looks like this:

```toml
# Put your secret values and credentials here
# Note: Do not share this file and do not push it to GitHub!

# GitHub access token (must be classic for the reactions source)
[sources.github]
access_token="GITHUB_API_TOKEN"

[destination.bigquery.credentials] # the credentials required will change based on the destination
project_id = "set me up" # GCP project ID
private_key = "set me up" # Unique private key (including BEGIN and END PRIVATE KEY)
client_email = "set me up" # Service account email
location = "set me up" # Project location (e.g. "US")
```

2. Replace `"GITHUB_API_TOKEN"` with the API token you [copied above](#grab-the-api-auth-token), or leave it blank to make anonymous requests (not possible for the reactions source).
3. Follow the instructions in the [Destinations](https://dlthub.com/docs/destinations) document to add credentials for your chosen destination.
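
If you prefer not to keep the token in `secrets.toml`, `dlt` can also pick configuration up from environment variables named after the config path (uppercase, with `__` separating the sections). A minimal sketch, assuming you export the variable before the pipeline script runs:

```python
import os

# Equivalent of the [sources.github] access_token entry in secrets.toml.
# dlt's environment provider resolves SECTION__SUBSECTION__KEY names.
os.environ["SOURCES__GITHUB__ACCESS_TOKEN"] = "GITHUB_API_TOKEN"
```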

## Modify the script `github_pipeline.py`

To load the data from the desired repository onto the desired destination, a source method needs to be specified in `github_pipeline.py`. For this, you can either write your own method, or modify one of the three example templates:

| Function name | Description |
| --- | --- |
| `load_duckdb_repo_reactions_issues_only` | Loads data on user reactions to issues and user comments on issues from the `duckdb` repository onto the specified destination. To run this method, the API access token must be added to `.dlt/secrets.toml`. |
| `load_airflow_events` | Loads all the events associated with the Apache Airflow repository onto the specified destination. |
| `load_dlthub_dlt_all_data` | Loads data on user reactions to issues, user comments on issues, pull requests, and pull request comments from the `dlt` repository onto the specified destination. To run this method, the API access token must be added to `.dlt/secrets.toml`. |

Include the source method in the `__main__` block and comment out any other functions. For example, if the source method is `load_airflow_events`, the code block would look as follows:
```python
if __name__ == "__main__":
    # load_duckdb_repo_reactions_issues_only()
    load_airflow_events()
    # load_dlthub_dlt_all_data()
```

## Run the pipeline

1. Install the necessary dependencies by running the following command:
```bash
pip install -r requirements.txt
```
2. Now the pipeline can be run by using the command:
```bash
python3 github_pipeline.py
```
3. To make sure that everything is loaded as expected, use the command:
```bash
dlt pipeline github_pipeline show
```
## Customize source methods

You can customize the existing templates in `github_pipeline.py` to load from any repository of your choice.

### a. Load GitHub events from any repository

For this, you can modify the method `load_airflow_events`. By default, this method loads events from the Apache Airflow repository (https://github.com/apache/airflow). The owner name of this repository is `apache` and the repository name is `airflow`. To load events from a different repository, change the owner name and the repository name in the function to those of your chosen repository.

The general template for this method is as follows:
```python
def load_<repo_name>_events() -> None:
    """Loads <repo_name> events. Shows incremental loading. Forces anonymous access token."""
    pipeline = dlt.pipeline("github_events", destination=<destination_name>, dataset_name="<repo_name>_events")
    data = github_repo_events(<owner_name>, <repo_name>, access_token="")
    print(pipeline.run(data))
    # does not load the same events again
    data = github_repo_events(<owner_name>, <repo_name>, access_token="")
    print(pipeline.run(data))
```
1. By default, `<repo_name>` is `airflow` and `<owner_name>` is `apache`. `<destination_name>` is the destination specified when initializing the `dlt` project.
2. To load events from any other repository, change `<repo_name>` and `<owner_name>` in the method to those of the desired repository.
3. The argument `access_token`, if left blank, will make calls to the API anonymously. To use your API access token instead, set `access_token=dlt.secrets.value` so the token is read from `.dlt/secrets.toml`.
4. Lastly, include your source method in the `__main__` block:

```python
if __name__ == "__main__":
    load_<repo_name>_events()
```
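
For illustration only, here is the template filled in for a hypothetical choice of repository (`dlt-hub/dlt`) and destination (DuckDB); the import assumes the source lives in the `github` folder created by `dlt init`:

```python
import dlt
from github import github_repo_events  # source folder created by `dlt init github <destination>`

def load_dlt_hub_dlt_events() -> None:
    """Loads dlt-hub/dlt events anonymously; a second run loads only new events."""
    pipeline = dlt.pipeline("github_events", destination="duckdb", dataset_name="dlt_events")
    data = github_repo_events("dlt-hub", "dlt", access_token="")
    print(pipeline.run(data))

if __name__ == "__main__":
    load_dlt_hub_dlt_events()
```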

### b. Load GitHub reactions from any repository

The template source method for loading user reactions and comments on issues is `load_duckdb_repo_reactions_issues_only`. By default, this method loads data from the duckdb repository (https://github.com/duckdb/duckdb). Both the owner name and the repository name are `duckdb`. To load data from a different repository, change the owner name and the repository name in the function to those of your chosen repository.

The general template for this method is as follows:

```python
def load_<owner_name>_<repo_name>_reactions() -> None:
    """Loads all issues, pull requests and comments for <repo_name>."""
    pipeline = dlt.pipeline("github_reactions", destination=<destination_name>, dataset_name="<repo_name>_reactions", full_refresh=True)
    data = github_reactions(<owner_name>, <repo_name>)
    print(pipeline.run(data))
```
1. By default, `<repo_name>` is `duckdb` and `<owner_name>` is `duckdb`. `<destination_name>` is the destination specified when initializing the `dlt` project.
2. To load data from any other repository, change `<repo_name>` and `<owner_name>` in the method to those of the desired repository.
3. Use the arguments `items_per_page` and `max_items` in `github_reactions` to set limits on the number of items per page and the total number of items:

```python
data = github_reactions(<owner_name>, <repo_name>, items_per_page=100, max_items=300)
# Limits the items per page to 100 and the total number of items to 300
```
4. To limit the data to a single resource (for example, `issues`), use the method `with_resources`:

```python
data = github_reactions(<owner_name>, <repo_name>).with_resources("issues")
```
5. Lastly, include your source method in the `__main__` block:

```python
if __name__ == "__main__":
    load_<owner_name>_<repo_name>_reactions()
```
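
Putting the points above together, a hedged end-to-end sketch for the default `duckdb/duckdb` repository with a DuckDB destination, capped item counts, and only the `issues` resource might look like this (the import again assumes the `github` folder created by `dlt init`):

```python
import dlt
from github import github_reactions  # source folder created by `dlt init github <destination>`

def load_duckdb_duckdb_reactions() -> None:
    """Loads a capped sample of issues (with comments and reactions) for duckdb/duckdb."""
    pipeline = dlt.pipeline(
        "github_reactions",
        destination="duckdb",
        dataset_name="duckdb_reactions",
        full_refresh=True,
    )
    data = github_reactions("duckdb", "duckdb", items_per_page=100, max_items=300).with_resources("issues")
    print(pipeline.run(data))

if __name__ == "__main__":
    load_duckdb_duckdb_reactions()
```
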
125 changes: 125 additions & 0 deletions docs/website/docs/pipelines/google_sheets.md
@@ -0,0 +1,125 @@
# Google Sheets pipeline setup guide

This pipeline can be used to load data from a [Google Sheets](https://www.google.com/sheets/about/) workspace onto a [destination](https://dlthub.com/docs/destinations) of your choice.

1. `dlt` loads each sheet in the workspace as a separate table in the destination.
2. The tables in the destination have the same names as the individual sheets.
3. For **merged** cells, `dlt` retains only the value that was kept during the merge (e.g., the top-leftmost cell), and every other cell in the merged range is given a null value.
4. [**Named Ranges**](https://support.google.com/docs/answer/63175?hl=en&co=GENIE.Platform%3DDesktop) are loaded as a separate column with an automatically generated header.

## Google Sheets API authentication

### Get API credentials:

Before creating the pipeline, we first need to get the necessary API credentials:

1. Sign in to [console.cloud.google.com](http://console.cloud.google.com/).
2. [Create a service account](https://cloud.google.com/iam/docs/service-accounts-create#creating) if you don't already have one.
3. Enable Google Sheets API:
1. In the left panel under *APIs & Services*, choose *Enabled APIs & services*.
2. Click on *+ ENABLE APIS AND SERVICES* and find and select Google Sheets API.
3. Click on *ENABLE*.
4. Generate credentials:
1. In the left panel under *IAM & Admin*, select *Service Accounts*.
2. In the service account table click on the three dots under the column "Actions" for the service account that you wish to use.
3. Select *Manage Keys*.
4. Under *ADD KEY* choose *Create new key*, and for the key type JSON select *CREATE*.
5. This downloads a .json which contains the credentials that we will be using later.

### Share the Google Sheet with the API:

To allow the API to access the Google Sheet, open the sheet that you wish to use and do the following:

1. Select the *Share* button in the top right corner.

![Share_Button](docs_images/Share_button.png)

2. In *Add people and groups*, add the *client_email* with at least viewer privileges. You will find this *client_email* in the JSON that you downloaded above.

![Add people](docs_images/Add_people.png)

3. Finally, click on *Copy link* and save the link. This will need to be added to the `dlt` script.


## Initialize the pipeline

We can now create the pipeline.

Initialize a `dlt` project with the following command:

```bash
dlt init google_sheets bigquery
```
Here, we chose BigQuery as the destination. To choose a different destination, replace `bigquery` with your choice of destination.

Running this command will create a directory with the following structure:
```shell
directory
β”œβ”€β”€ .dlt
β”‚   β”œβ”€β”€ .pipelines
β”‚   β”œβ”€β”€ config.toml
β”‚   └── secrets.toml
β”œβ”€β”€ google_sheets
β”‚   └── helpers
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ api_calls.py
β”‚       └── data_processing.py
β”œβ”€β”€ .gitignore
β”œβ”€β”€ google_sheets_pipelines.py
└── requirements.txt
```

## Add credentials

1. Open `.dlt/secrets.toml`.
2. From the .json file that you downloaded earlier, copy `project_id`, `private_key`, and `client_email` under `[sources.google_spreadsheet.credentials]`:
```toml
[sources.google_spreadsheet.credentials]
project_id = "set me up" # GCP source project ID
private_key = "set me up" # Unique private key (must be copied fully, including BEGIN and END PRIVATE KEY)
client_email = "set me up" # Email of the source service account
location = "set me up" # Project location (e.g. "US")
```
3. Enter credentials for your chosen destination as per the [docs](https://dlthub.com/docs/destinations#google-bigquery).

## Add Spreadsheet ID and URL

1. The following two constants need to be added to `google_sheets_pipelines.py`:

```python
# constants
SPREADSHEET_ID = "Set_me_up"  # Spreadsheet ID (see below)
SPREADSHEET_URL = "Set_me_up"  # URL from Google Sheets > Share > Copy link
```

2. The spreadsheet URL can be found in Google Sheets > *Share* button (at the top right) > *Copy link*.
3. Assign the copied link to `SPREADSHEET_URL`.
4. The `SPREADSHEET_ID` can be found within the `SPREADSHEET_URL`. For example, if the `SPREADSHEET_URL` is:

```
https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing
```

then the `SPREADSHEET_ID` is the part between `spreadsheets/d/` and `/edit?usp=sharing` (a programmatic way to extract it is sketched after this list), i.e.:

```
1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4
```

5. After filling in the variables as above, save the file (Ctrl+S or Cmd+S).
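
If you would rather not copy the ID by hand, a small helper like the sketch below (a hypothetical addition, not part of the generated script) can pull it out of the shared link:

```python
from urllib.parse import urlparse

def spreadsheet_id_from_url(url: str) -> str:
    """Extracts the ID from a link like .../spreadsheets/d/<ID>/edit?usp=sharing."""
    parts = urlparse(url).path.split("/")
    return parts[parts.index("d") + 1]

SPREADSHEET_URL = "https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing"
SPREADSHEET_ID = spreadsheet_id_from_url(SPREADSHEET_URL)
print(SPREADSHEET_ID)  # 1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4
```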

## Run the pipeline
1. Install the requirements by using the following command:

```bash
pip install -r requirements.txt
```

2. Run the pipeline by using the following command:

```bash
python3 google_sheets_pipelines.py
```

3. Use `dlt pipeline google_sheets_pipeline show` to make sure that everything loaded as expected.
