Objective: Understand different methods of loading data
- Append load
- Trunc(ate) and load
- Incremental load
Let's say you were given access to a random table that uses one of the three data loading methods above. How would you tell which it was?
The trick is avoiding duplicates. Your script might then need to say something like:
- What's the latest timestamp in the database?
- Pull data from the API that's more recent than that.
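The steps above can be sketched in a few lines. This is a minimal sketch using `sqlite3` from the standard library as a stand-in for DuckDB; the `cpi` table and the `fetch_from_api` helper are hypothetical, not part of the assignment's API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpi (date TEXT PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO cpi VALUES ('2003-11-01', 184.6), ('2003-12-01', 184.9)")

def fetch_from_api(since):
    # Hypothetical API call; hard-coded rows stand in for the real source.
    rows = [("2003-12-01", 184.9), ("2004-01-01", 185.8)]
    return [r for r in rows if r[0] > (since or "")]

# 1. What's the latest timestamp in the database?
latest = conn.execute("SELECT MAX(date) FROM cpi").fetchone()[0]

# 2. Pull only data more recent than that, then insert it.
conn.executemany("INSERT INTO cpi VALUES (?, ?)", fetch_from_api(latest))

print(conn.execute("SELECT COUNT(*) FROM cpi").fetchone()[0])  # 3: only the new row was added
```

Because the insert is filtered by the latest stored timestamp, re-running the script does not duplicate rows.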
- You'll write methods to load continuously updated data into a database.
- You'll set up scripts to perform each of the methods of data loading into DuckDB.
- You'll pair in your Lab group.
- Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
- We will be using Consumer Price Index data from the Philadelphia Federal Reserve.
- We have monthly observations (rows) and monthly vintages (columns)
| DATE | PCPI04M1 | PCPI04M2 | PCPI04M3 |
|---|---|---|---|
| 2003:09 | 185.0 | 185.1 | 185.1 |
| 2003:10 | 185.0 | 184.9 | 184.9 |
| 2003:11 | 184.6 | 184.6 | 184.6 |
| 2003:12 | 185.0 | 184.9 | 184.9 |
| 2004:01 | #N/A | 185.8 | 185.8 |
| 2004:02 | #N/A | #N/A | 186.3 |
- A revision of past data is released in February of each year.
- A revision released in year t can update the values in years t-5 to t-1.
Suppose your organization wants to maintain a database of CPI data.
- Write a `get_latest_data` function that accepts a `pull_date` and returns the latest data available up to that date
  - For example, if the `pull_date` is 2004-01-15, the function should return the data from vintage `PCPI04M1`
- Write code that pulls the latest data at a given `pull_date` and loads it into a DuckDB database
  - You will implement each of the methods: `append`, `trunc`, and `incremental`
- Loop over a range of `pull_dates` to simulate running the scripts on a daily basis
- Compare the performance of each method (consistency and speed)
- Write out the usage and manual testing instructions as Markdown.
- We're doing this as documentation-driven development.
- What should the user expect to see in the table after running each script?
- Write the `get_latest_data` function
  - This function should return only two columns, e.g. `dates` and `cpi`
  - All other code should interact with the source data only through this function
- Work through each method of data loading
  - Include the type in the script and table names to keep them separate, e.g. `_append`, `_trunc`, `_inc`
  - Your code should accept a `pull_date` parameter and load the data up to that date
  - The script should be able to run multiple times without duplicating data
  - For incremental: a Python script may be easier than a SQL one
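The three methods above can be sketched side by side. This uses `sqlite3` from the standard library as a stand-in for DuckDB (the DuckDB Python API is similar: `duckdb.connect(...).execute(...)`), and a hard-coded `get_latest_data` stub in place of the real function:

```python
import sqlite3

def get_latest_data(pull_date):
    # Stub for the real function; returns (date, cpi) rows.
    return [("2003:12", 184.9), ("2004:01", 185.8)]

def load_append(conn, pull_date):
    # Append: just insert; re-running duplicates rows.
    conn.executemany("INSERT INTO cpi_append VALUES (?, ?)",
                     get_latest_data(pull_date))

def load_trunc(conn, pull_date):
    # Truncate and load: wipe the table, then insert everything fresh.
    conn.execute("DELETE FROM cpi_trunc")
    conn.executemany("INSERT INTO cpi_trunc VALUES (?, ?)",
                     get_latest_data(pull_date))

def load_inc(conn, pull_date):
    # Incremental: insert only rows newer than what's already loaded.
    latest = conn.execute("SELECT MAX(date) FROM cpi_inc").fetchone()[0]
    new_rows = [r for r in get_latest_data(pull_date)
                if latest is None or r[0] > latest]
    conn.executemany("INSERT INTO cpi_inc VALUES (?, ?)", new_rows)

conn = sqlite3.connect(":memory:")
for t in ("cpi_append", "cpi_trunc", "cpi_inc"):
    conn.execute(f"CREATE TABLE {t} (date TEXT, cpi REAL)")

# Running twice shows the difference: append duplicates, the others don't.
for _ in range(2):
    load_append(conn, "2004-01-15")
    load_trunc(conn, "2004-01-15")
    load_inc(conn, "2004-01-15")

for t in ("cpi_append", "cpi_trunc", "cpi_inc"):
    print(t, conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0])
```

Note one wrinkle: a pure max-date check never revisits rows it has already loaded, so this incremental sketch would miss the February revisions to past years; that trade-off is part of what the consistency comparison should surface.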
- In a notebook, simulate your organization running the scripts on a daily basis
  - Start from empty tables
  - Loop over a range of `pull_dates` (e.g. 2000-01-01 to 2025-02-28)
  - If the loop takes far too long, use a shorter range
  - Compare the performance of each method (data consistency and speed)
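The simulation loop can be as simple as the sketch below; `run_script` is a placeholder for one of your load scripts, and the shorter date range is just to keep the example fast:

```python
from datetime import date, timedelta
import time

def run_script(pull_date):
    # Placeholder for one of the load scripts (e.g. the _inc version).
    pass

start, end = date(2004, 1, 1), date(2004, 3, 1)  # shorter range for a quick run

t0 = time.perf_counter()
n_runs = 0
d = start
while d <= end:
    run_script(d)  # simulate the organization's daily job
    n_runs += 1
    d += timedelta(days=1)
elapsed = time.perf_counter() - t0

print(f"{n_runs} daily runs in {elapsed:.2f}s")
```

Timing each method's loop with `time.perf_counter` gives the speed half of the comparison; querying the three tables afterwards gives the consistency half.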
- Submit the links to the pull request(s) via CourseWorks.