You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.
Prep
Before getting into the table transforms, let's update the data itself.
The content you are editing has changed. Please copy your edits and refresh the page.
All form parts below are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables. The original files (CSV and Excel) are not split by form part.
Each report corresponds to 1 state and one commodity group. Further parts of the table have dozens of columns that we should tidy into one categorical column, creating different PKs within the parts of the form. We should try to structure our tables to match these PKs as much as possible.
core_phmsagas__yearly_distribution_operators
Part A: Operator Information
Essentially just the operator ID, name, address, and info on the state and commodity pertinent to the report.
Part D-I
Each field occurs once per report, these should probably be kept in one table with Part A.
Part D, Excavation Damage: 6 fields of summary stats
Part E, EFV and Service Valve Data: 4 fields of summary stats
Part F: Total number of leaks on federal lands scheduled for repair or repaired.
Part G: %age of unaccounted gas.
Part H: Additional Info: a Notes field, discusses corrections, changes in calculations, and other ambiguities.
Part I: Preparer's info: Email, initial or supplemental report, contact info for preparer.
The content you are editing has changed. Please copy your edits and refresh the page.
Part C: Leaks and Repairs
Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"
The content you are editing has changed. Please copy your edits and refresh the page.
Convert 2-digit report_year into 4-digit years (pre 2000).
Standardize report_state to use either shorthand or full state name
Ideally, adapt existing wide_to_tidy infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).
Handle different aggregations of reporting over time for each form section (e.g., 1 form per state, one form per system)
Where granularity increases over time (e.g. onshore becomes onshore types A, B, C), aggregate these increasingly disaggregated columns back to have comparable totals over time.
Deal with extremely varying telephone formats
Standardize use of office vs. HQ addresses over time, and do general address cleaning
PHMSA distribution data (1990-present)
PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.
Prep
Before getting into the table transforms, let's update the data itself.
Tasks
Table design
core_phmsagas__yearly_distribution_operators
Part A: Operator Information
Part D-I
Operator Table Tasks
core_phmsagas__yearly_distribution_main_and_services
Part B: System Description
Each piece of this has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.
Tasks
core_phmsa__yearly_distribution_leaks_and_repairs
Part C: Leaks and Repairs
Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"
Tasks
General Known PHMSA Cleaning Steps
report_year
into 4-digit years (pre 2000).report_state
to use either shorthand or full state namewide_to_tidy
infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).Out of scope
The text was updated successfully, but these errors were encountered: