
Commit e46b56f

add Lab 10
1 parent f849a22 commit e46b56f

6 files changed (+34 −19 lines changed)


README.md

+2 −2
@@ -101,8 +101,8 @@ You can find the rubric under the [Assignment](https://courseworks2.columbia.edu
 | 7 | 3/6 | [Databases](lectures/lecture_07.md) | [Readings](readings/week_07.md) | [Databases](labs/lab_07.md) | [Lab 6](labs/lab_06.md) |
 | 8 | 3/13 | [Guest speaker; data warehousing](lectures/lecture_08.md) | [Project Part 4](docs/project.md#part-4) | [Data loading](labs/lab_08.md) | [Lab 7](labs/lab_07.md) |
 | 9 | 3/20 | none ([Spring Recess][recess]) | none | none ([Spring Recess][recess]) | none |
-| 10 | 3/27 | [Data engineering (pipelines, ETL)](lectures/lecture_10.md) | [Project Part 5](docs/project.md#part-5) | TBD | [Lab 8](labs/lab_08.md) |
-| 11 | 4/3 | Infrastructure / cloud computing | [Readings](readings/week_11.md) | TBD | TBD |
+| 10 | 3/27 | [Data engineering (pipelines, ETL)](lectures/lecture_10.md) | [Project Part 5](docs/project.md#part-5) | [Data loading, continued](labs/lab_10.md) | [Lab 8](labs/lab_08.md) |
+| 11 | 4/3 | Infrastructure / cloud computing | [Readings](readings/week_11.md) | TBD | [Lab 10](labs/lab_10.md) |
 | 12 | 4/10 | Big data; algorithms | [Readings](readings/week_12.md) | TBD | TBD |
 | 13 | 4/17 | Privacy | [Readings](readings/week_13.md) | TBD | TBD |
 | 14 | 4/24 | buffer | TBD | TBD | TBD |

docs/project.md

+1 −1
@@ -192,7 +192,7 @@ At this point, your project should be looking more like one of the [examples](#i
 
 ### Steps
 
-Do the following for your regularly-updated data source. Repeating the middle steps for the additional dataset(s) is optional.
+Do the following for your regularly-updated data source. Only do one for now — we'll do the rest in [Lab 10](../labs/lab_10.md).
 
 1. Install [pandas-gbq](https://pandas-gbq.readthedocs.io/).
 1. Load data.
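As a rough sketch of these two steps, assuming the pulled data already sits in a DataFrame (`my_dataset.my_table` and `my-gcp-project` are placeholders, not course values):

```python
import pandas as pd
import pandas_gbq

# Placeholder frame standing in for your regularly-updated data source
df = pd.DataFrame({"date": ["2025-03-01"], "value": [185.0]})

# to_gbq() creates the table on first run; if_exists controls later runs
# ("fail", "replace", or "append")
pandas_gbq.to_gbq(
    df,
    destination_table="my_dataset.my_table",  # placeholder dataset.table
    project_id="my-gcp-project",              # placeholder project ID
    if_exists="append",
)
```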

labs/lab_06.md

+1 −1
@@ -52,4 +52,4 @@ This is one of those times where you'll follow instructions without necessarily
 
 ### Submit
 
-[Submit the links to the pull request(s) via CourseWorks.](https://courseworks2.columbia.edu/courses/210480/assignments)
+[Submit links to the pull request(s) via CourseWorks.](https://courseworks2.columbia.edu/courses/210480/assignments)

labs/lab_08.md

+15 −15
@@ -28,7 +28,7 @@ The trick is avoiding duplicates. Your script might then need to say something l
 ## Lab work
 
 - You'll write methods to load continuously updated data into a database.
-- You'll set up scripts to perform each of the [methods of data loading](#data-loading) into DuckDB.
+- You'll set up scripts to perform each of the [methods of data loading](#data-loading) into DuckDB.
 - You'll [pair](../docs/pairing.md) in your Lab group.
 - Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
@@ -40,13 +40,13 @@ The trick is avoiding duplicates. Your script might then need to say something l
 - We have monthly observations (rows) and monthly vintages (columns)
 
 | DATE | PCPI04M1 | PCPI04M2 | PCPI04M3 |
-|---------|---------:|---------:|---------:|
-| 2003:09 | 185.0 | 185.1 | 185.1 |
-| 2003:10 | 185.0 | 184.9 | 184.9 |
-| 2003:11 | 184.6 | 184.6 | 184.6 |
-| 2003:12 | 185.0 | 184.9 | 184.9 |
-| 2004:01 | #N/A | 185.8 | 185.8 |
-| 2004:02 | #N/A | #N/A | 186.3 |
+| ------- | -------: | -------: | -------: |
+| 2003:09 | 185.0 | 185.1 | 185.1 |
+| 2003:10 | 185.0 | 184.9 | 184.9 |
+| 2003:11 | 184.6 | 184.6 | 184.6 |
+| 2003:12 | 185.0 | 184.9 | 184.9 |
+| 2004:01 | #N/A | 185.8 | 185.8 |
+| 2004:02 | #N/A | #N/A | 186.3 |
 
 - A revision of past data is released in February of each year.
 - A revision released in year `t` can update the values in years `t-5` to `t-1`.
@@ -58,9 +58,9 @@ The trick is avoiding duplicates. Your script might then need to say something l
 Suppose your organization wants to maintain a database of CPI data
 
 - Write a `get_latest_data` function that accepts a `pull_date` and returns the latest data available up to that date
-  - For example, if the `pull_date` is 2004-01-15, the function should return the data from vintage `PCPI04M1`
+  - For example, if the `pull_date` is 2004-01-15, the function should return the data from vintage `PCPI04M1`
 - Write code that pulls the latest data at a given `pull_date` and loads it into a DuckDB database
-  - You will implement each of the methods `append`, `trunc`, and `incremental`
+  - You will implement each of the methods `append`, `trunc`, and `incremental`
 - Loop over a range of `pull_dates` to simulate running the scripts on a daily basis
 - Compare the performance of each method (consistency and speed)
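To make the vintage selection concrete, here is a minimal sketch of `get_latest_data` over a pandas DataFrame (the column-name parsing, the 2000s-only vintages, and `#N/A` arriving as NaN are all assumptions, not part of the lab):

```python
import pandas as pd

def get_latest_data(df: pd.DataFrame, pull_date: str) -> pd.DataFrame:
    """Return DATE plus the newest vintage column released on or before pull_date."""
    pull = pd.Timestamp(pull_date)

    def vintage_month(col: str) -> pd.Timestamp:
        # "PCPI04M1" -> 2004, month 1 (assumes all vintages fall in the 2000s)
        return pd.Timestamp(year=2000 + int(col[4:6]),
                            month=int(col.split("M")[1]), day=1)

    candidates = [c for c in df.columns
                  if c.startswith("PCPI") and vintage_month(c) <= pull]
    latest = max(candidates, key=vintage_month)
    # Drop rows the vintage hadn't observed yet (assuming #N/A was read as NaN)
    return df[["DATE", latest]].dropna()
```

For a `pull_date` of 2004-01-15, only `PCPI04M1` has a vintage month on or before the pull, so that column is returned, matching the example above.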

@@ -79,12 +79,12 @@ Suppose your organization wants to maintain a database of CPI data
    - `_append`
    - `_trunc`
    - `_inc`
-   - Your code should accept a `pull_date` parameter and load the data up to that date
-   - The script should be able to run multiple times without duplicating data
-   - For incremental: a Python script may be easier than a SQL one
+   - Your code should accept a `pull_date` parameter and load the data up to that date
+   - The script should be able to run multiple times without duplicating data
+   - For incremental: a Python script may be easier than a SQL one
 4. On a notebook: simulate your organization running the scripts on a daily basis.
    - Start from empty tables
    - Loop over a range of `pull_dates` (e.g. 2000-01-01 to 2025-02-28) to simulate running the scripts on a daily basis.
    - If the loop takes way too long, use a shorter range
-   - Compare the performance of each method (data consistency and speed)
-5. [Submit the links to the pull request(s) via CourseWorks.](https://courseworks2.columbia.edu/courses/210480/assignments)
+   - Compare the performance of each method (data consistency and speed)
+5. [Submit links to the pull request(s) via CourseWorks.](https://courseworks2.columbia.edu/courses/210480/assignments)
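For the incremental script, one possible shape in Python against DuckDB (the `cpi_inc` table name, the two-column schema, and the date-only de-duplication key are assumptions; picking up revised values would need a richer key):

```python
import duckdb
import pandas as pd

def load_incremental(con: duckdb.DuckDBPyConnection, df: pd.DataFrame) -> None:
    """Idempotent incremental load: insert only rows not already in the table."""
    con.execute("CREATE TABLE IF NOT EXISTS cpi_inc (date VARCHAR, value DOUBLE)")
    # DuckDB can query the in-scope pandas DataFrame `df` by name (replacement
    # scan). The NOT EXISTS filter makes a rerun with the same data a no-op.
    con.execute("""
        INSERT INTO cpi_inc
        SELECT d.date, d.value
        FROM df AS d
        WHERE NOT EXISTS (SELECT 1 FROM cpi_inc AS t WHERE t.date = d.date)
    """)
```

Because the filter runs against the current table contents, looping this over a range of `pull_dates` only ever inserts the rows that are new at each pull.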

labs/lab_10.md

+14
@@ -0,0 +1,14 @@
+# Lab 10
+
+**Objective:** Think through data loading for different data sources
+
+---
+
+Work in your [Project team](../docs/project_teams.csv) to load your other datasets to BigQuery. For each data source:
+
+1. What type of [data loading](lab_08.md#data-loading) will you use? Why? Explain as Markdown in your repository.
+1. Create a data loading script, repeating those steps from [Part 5](../docs/project.md#part-5).
+
+---
+
+[Submit links to the pull request(s) via CourseWorks.](https://courseworks2.columbia.edu/courses/210480/assignments)
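A rough sketch of how each loading type from Lab 8 could map onto the Part 5 call (all names are placeholders; pandas-gbq has no built-in incremental mode):

```python
import pandas as pd
import pandas_gbq

df = pd.DataFrame({"date": ["2025-04-01"], "value": [42.0]})  # placeholder pull

# Append: every run adds rows; avoiding duplicates is your script's job
pandas_gbq.to_gbq(df, "my_dataset.my_table",
                  project_id="my-gcp-project", if_exists="append")

# Truncate and load: every run replaces the table with the current pull
pandas_gbq.to_gbq(df, "my_dataset.my_table",
                  project_id="my-gcp-project", if_exists="replace")

# Incremental: read existing keys back, drop matching rows from df, then
# append only what's new (Lab 8 sketches the de-duplication idea)
```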

readings/week_11.md

+1
@@ -4,6 +4,7 @@
 
 - [Append Load vs Incremental Load vs Truncate and Load](https://medium.com/@santosh_beora/the-3-most-commonly-used-etl-processes-explained-through-everyday-analogies-a7aa9f7a3754)
 - Pipeline assignment
+  - Airflow
 - [Cracking the Cloud_Open](https://www.redhat.com/en/command-line-heroes/season-1/crack-the-cloud-open)
 - [Overview of Cloud Computing](https://dc.arcabc.ca/islandora/object/dc%3A54375?solr_nav%5Bid%5D=c0f46853d72e7e533f04&solr_nav%5Bpage%5D=0&solr_nav%5Boffset%5D=0), Chapters 1-2
