Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate PHMSA distribution data into PUDL #3770

Open
11 of 23 tasks
Tracked by #2848
e-belfer opened this issue Aug 7, 2024 · 2 comments
Open
11 of 23 tasks
Tracked by #2848

Integrate PHMSA distribution data into PUDL #3770

e-belfer opened this issue Aug 7, 2024 · 2 comments
Assignees
Labels
community new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration

Comments

@e-belfer
Copy link
Member

e-belfer commented Aug 7, 2024

PHMSA distribution data (1990-present)

PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.

Prep

Before getting into the table transforms, let's update the data itself.

Tasks

Table design

  • All form parts below are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables. The original files (CSV and Excel) are not split by form part.
  • Each report corresponds to 1 state and one commodity group. Further parts of the table have dozens of columns that we should tidy into one categorical column, creating different PKs within the parts of the form. We should try to structure our tables to match these PKs as much as possible.

core_phmsagas__yearly_distribution_operators

Part A: Operator Information

  • Essentially just the operator ID, name, address, and info on the state and commodity pertinent to the report.

Part D-I

  • Each field occurs once per report, these should probably be kept in one table with Part A.
  • Part D, Excavation Damage: 6 fields of summary stats
  • Part E, EFV and Service Valve Data: 4 fields of summary stats
  • Part F: Total number of leaks on federal lands scheduled for repair or repaired.
  • Part G: %age of unaccounted gas.
  • Part H: Additional Info: a Notes field, discusses corrections, changes in calculations, and other ambiguities.
  • Part I: Preparer's info: Email, initial or supplemental report, contact info for preparer.

Operator Table Tasks

core_phmsagas__yearly_distribution_main_and_services

Part B: System Description

  • Summary stats on miles and services by material.
  • Miles of main by material and size of pipe.
  • Miles of services by material and size of pipe.
  • Miles of main and services by decade of installation. (This has a different categorical and might want to be its own table)

Each piece of this has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.

Tasks

core_phmsa__yearly_distribution_leaks_and_repairs

Part C: Leaks and Repairs
Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"

Tasks

General Known PHMSA Cleaning Steps

  • Convert 2-digit report_year into 4-digit years (pre 2000).
  • Standardize report_state to use either shorthand or full state name
  • Ideally, adapt existing wide_to_tidy infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).
  • Standardize the multiple treatments of time noted in Clean up and standardize column names #3277 (filing date, data date, revision date)
  • Handle different aggregations of reporting over time for each form section (e.g., 1 form per state, one form per system)
  • Where granularity increases over time (e.g. onshore becomes onshore types A, B, C), aggregate these increasingly disaggregated columns back to have comparable totals over time.
  • Deal with extremely varying telephone formats
  • Standardize use of office vs. HQ addresses over time, and do general address cleaning

Out of scope

@e-belfer e-belfer changed the title Distribution data Integrate PHMSA distribution data into PUDL Aug 7, 2024
@e-belfer e-belfer added community new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration labels Aug 7, 2024
@e-belfer e-belfer changed the title Integrate PHMSA distribution data into PUDL Integrate PHMSA operators data into PUDL Aug 7, 2024
@e-belfer e-belfer changed the title Integrate PHMSA operators data into PUDL Integrate PHMSA distribution data into PUDL Aug 7, 2024
@seeess1
Copy link
Contributor

seeess1 commented Sep 15, 2024

@e-belfer can you assign me to this ticket?

@seeess1
Copy link
Contributor

seeess1 commented Sep 21, 2024

Working on this branch in my forked repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration
Projects
Status: In progress
Development

No branches or pull requests

2 participants