Record linkage for SEC to EIA #120
Conversation
```python
@multi_asset(
    ins={
        "ex21_df": AssetIn("ex21_company_ownership_info"),
    },
    outs={
        "transformed_ex21_subsidiary_table": AssetOut(
            io_manager_key="pandas_parquet_io_manager",
        )
    },
    partitions_def=year_quarter_partitions,
)
```
Not sure if this should be a `multi_asset`.
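For comparison, a single-output transform could be a plain `@asset`; a minimal sketch of that alternative, assuming the same IO manager and partitions as the surrounding module:

```python
from dagster import AssetIn, asset

# Hypothetical alternative: with @asset, the function name becomes the asset
# key, so the single output doesn't need an explicit AssetOut mapping.
@asset(
    ins={"ex21_df": AssetIn("ex21_company_ownership_info")},
    io_manager_key="pandas_parquet_io_manager",
    partitions_def=year_quarter_partitions,
)
def transformed_ex21_subsidiary_table(ex21_df: pd.DataFrame) -> pd.DataFrame:
    ...  # same transformation body as the multi_asset version
```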
```python
def sec_rl_input_table(
    basic_10k_df: pd.DataFrame,
    clean_ex21_df: pd.DataFrame,
    sec10k_filing_metadata: pd.DataFrame,
```
I think I used this `sec10k_filing_metadata` parameter/asset right?
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
```python
basic_10k_df = flatten_companies_across_time(
    df=basic_10k_df, key_cols=["company_name", "street_address"]
)
```
Instead of flattening across time to have one record per CIK, I opted to flatten so that there is one record per unique company name and street address pair. This is because it's helpful to have all address changes in the record linkage process, in case the company is still reporting an old address to EIA. Later, in the output table that includes a `utility_id_eia` connection column, I think we could flatten across time so there's one record per `sec_company_id`, but maybe it's not a big deal for that key to be non-unique, and we should opt to include all address and company name changes in that final table.
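For illustration, a minimal sketch of flattening on the composite key (the real `flatten_companies_across_time` likely differs; `report_date` is an assumed column):

```python
import pandas as pd

def flatten_on_key(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    """Keep one record per unique key, preferring the most recent filing."""
    return (
        df.sort_values("report_date")  # assumed column ordering filings in time
        .drop_duplicates(subset=key_cols, keep="last")
        .reset_index(drop=True)
    )

# e.g. one record per (company_name, street_address) pair:
# flattened = flatten_on_key(basic_10k_df, ["company_name", "street_address"])
```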
```python
# exclude Ex. 21 subs and just match to filers
# once the match has been conducted, add back in the Ex. 21 subs
out_df = basic_10k_df.fillna(np.nan).reset_index(names="record_id")
# TODO: Here we conduct the match to EIA and add on a column with utility_id_eia
```
This is where I think it makes sense to call the splink model and run record linkage between the SEC 10-K filers and the EIA utilities. Basically I think the transformed basic 10-K table should be an input, and the output of the splink model is that basic 10-K table with `utility_id_eia` as a column added on. In my head, I'm calling this matched asset `core_sec_10k__filers`, but maybe the whole way I'm thinking about this structure is wrong.
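A rough sketch of what that splink call could look like, assuming splink v3's DuckDB backend (the comparisons and blocking rule here are placeholders, not the tuned model):

```python
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",  # linking two datasets, no within-dataset dedupe
    "unique_id_column_name": "record_id",
    "blocking_rules_to_generate_predictions": [
        "l.company_name = r.company_name",  # placeholder blocking rule
    ],
    "comparisons": [
        cl.exact_match("company_name"),
        cl.exact_match("street_address"),
    ],
}
linker = DuckDBLinker([basic_10k_df, clean_eia_df], settings)
# (m/u parameter estimation steps omitted for brevity)
preds = linker.predict(threshold_match_probability=0.95).as_pandas_dataframe()
```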
```python
@asset(
    ins={
        "sec10k_filers_matched_df": AssetIn("core_sec_10k__filers"),
        "clean_ex21_df": AssetIn("transformed_ex21_subsidiary_table"),
    },
)
def out_sec_10k__parents_and_subsidiaries(
```
Now take that matched basic 10-K table and merge Ex. 21 subsidiaries on, so we know which of the subsidiaries file a 10-K themselves, and can get parent company information and ownership percentage from that. Additionally, take the subsidiaries that don't file a 10-K and concatenate them onto the dataframe, forming the asset `out_sec_10k__parents_and_subsidiaries`.
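A sketch of that merge/concat step, with assumed column names (`parent_company_name` and `ownership_pct` are illustrative):

```python
import pandas as pd

# subsidiaries that file a 10-K themselves: attach parent info to the filers
matched_df = sec10k_filers_matched_df.merge(
    clean_ex21_df[["sec_company_id", "parent_company_name", "ownership_pct"]],
    on="sec_company_id",
    how="left",
)
# subsidiaries with no 10-K of their own become additional rows
non_filing_subs_df = clean_ex21_df[
    ~clean_ex21_df["sec_company_id"].isin(matched_df["sec_company_id"])
]
out_df = pd.concat([matched_df, non_filing_subs_df]).reset_index(drop=True)
```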
Seems like the seg fault is still causing the CI to fail.
```python
# the last step is to take the EIA utilities that haven't been matched
# to a filer company, and merge them by company name onto the Ex. 21 subs
unmatched_eia_df = clean_eia_df[
    ~clean_eia_df["utility_id_eia"].isin(
        sec_10k_filers_matched_df.utility_id_eia.unique()
    )
].drop_duplicates(subset="company_name")
ex21_non_filing_subs_df = ex21_non_filing_subs_df.merge(
    unmatched_eia_df[["utility_id_eia", "company_name"]],
    how="left",
    on="company_name",
).drop_duplicates(subset="sec_company_id")
```
This is a slightly weird last step, but basically we want to take all the EIA utilities that didn't get matched to SEC filers during the record linkage process, and see if we can just match them to an Ex. 21 subsidiary based on an exact match on company name. This is obviously imperfect because you can have companies that share the same name but aren't actually matches. However, we have basically no other useful shared columns between Ex. 21 subs and EIA utilities, and I believe that there are a lot of EIA utilities that are reported as Ex. 21 subs, so I think the benefit of adding in a bunch of true positive matches outweighs the cost of also adding in a bunch of false positives. I think a user is likely to want to see a possible connection to EIA and then assess for themselves whether that is a valid connection, versus not having the connection at all.
We should probably think of a way to communicate uncertainty to downstream users through documentation or something.
One idea I had was to include the `match_probability` column that `splink` provides in the output SEC table. The threshold for matching is .95, so all probabilities would be between .95 and 1, which is maybe not that helpful from a "human intuition" perspective. I think for now, the best thing to do is just add table-level documentation that conveys that the `utility_id_eia` column is modeled and that the whole extraction of the data is modeled.
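If we did surface it, attaching the column would be straightforward; a sketch, assuming `preds` is the splink predictions frame with `record_id_l` pointing at SEC records:

```python
# keep the single best-scoring EIA match per SEC record, then merge the
# probability on so users can judge match quality themselves
best_preds = (
    preds.sort_values("match_probability", ascending=False)
    .drop_duplicates(subset="record_id_l")
)
sec_out_df = basic_10k_df.merge(
    best_preds[["record_id_l", "match_probability"]],
    left_on="record_id",
    right_on="record_id_l",
    how="left",
).drop(columns="record_id_l")
```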
```python
logger.info(
    "Ex. 21 subsidiary names matched to an EIA utility name: "
    f"{len(ex21_non_filing_subs_df['utility_id_eia'].unique())}"
)
```
We should probably have more logging throughout these transformations and record it with `mlflow`.
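For example, something like this could run inside the transforms (a sketch; assumes an active mlflow run is managed elsewhere):

```python
import mlflow
import pandas as pd

def log_match_stats(df: pd.DataFrame, id_col: str = "utility_id_eia") -> None:
    """Log simple linkage metrics so runs can be compared in mlflow."""
    n_total = len(df)
    n_matched = int(df[id_col].notna().sum())
    mlflow.log_metric("n_records", n_total)
    mlflow.log_metric("n_matched", n_matched)
    mlflow.log_metric("match_rate", n_matched / n_total if n_total else 0.0)
```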
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
@zschira I think we don't have ReviewNB for this repo? But I think the markdown cells in the splink notebook show what I'm thinking in terms of Dagsterizing this. I'm reading in inputs as parquets and then make a note about where the final output table is (it's just the input SEC table with `utility_id_eia` added on).
Overall looks good. I haven't dug too much into the details and focused on high-level structure given time constraints.

One thing I think we probably need to do is define schemas for any `core`/`out` tables and do some basic validation of them. I think if we added `pandera` models like in `models/sec10k/entities.py` that would be a fairly easy way to handle this.

One other thing that comes to mind is that we might want to fully move all our assets in this repo to using PUDL naming conventions, including for raw extracted data and intermediate steps. That can probably be handled in a follow-up PR though.
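A sketch of what such a `pandera` model might look like for `core_sec_10k__filers` (column names and types here are guesses, not the final schema):

```python
import pandera as pa
from pandera.typing import Series

class CoreSec10kFilers(pa.DataFrameModel):
    """Schema for the matched SEC 10-K filers table."""

    sec_company_id: Series[str] = pa.Field(description="PUDL-assigned SEC company ID.")
    company_name: Series[str] = pa.Field()
    utility_id_eia: Series[float] = pa.Field(
        nullable=True,
        description="EIA utility ID from the splink match; modeled, not reported.",
    )

    class Config:
        coerce = True
```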
```python
def _remove_weird_sec_cols(sec_df: pd.DataFrame) -> pd.DataFrame:
    weird_cols = ["]fiscal_year_end", "]irs_number", "]state_of_incorporation"]
```
It probably wouldn't be too hard to add this to the 10-K extraction, but given time constraints, it might be better to just open an issue for future work.
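For reference, the fix at extraction time could be as small as stripping the stray bracket from column names (a hypothetical one-liner):

```python
# drop the leading "]" artifact from extracted 10-K column names
sec_df = sec_df.rename(columns=lambda c: c.lstrip("]"))
```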
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
Overview

Closes #119.

This moves SEC to EIA record linkage into the production pipeline and creates SEC table output assets, including `core_sec_10k__filers`.

Files to look at:

- src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
- src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
- src/mozilla_sec_eia/library/record_linkage_utils.py
- notebooks/18-kl-splink-sec-eia.ipynb

What did you change in this PR?

- src/mozilla_sec_eia/models/sec_eia_record_linkage/create_eia_input.py
Testing
How did you make sure this worked? How can a reviewer verify this?
I'm not sure what the error with the CI means right now.
I materialized all assets and then ran through splink.