Record linkage for SEC to EIA #120
Conversation
```python
@multi_asset(
    ins={
        "ex21_df": AssetIn("ex21_company_ownership_info"),
    },
    outs={
        "transformed_ex21_subsidiary_table": AssetOut(
            io_manager_key="pandas_parquet_io_manager",
        )
    },
    partitions_def=year_quarter_partitions,
)
```
Not sure if this should be a `multi_asset`.
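For comparison, a single-output transform could be a plain `@asset`; a minimal sketch of that alternative, assuming the same IO manager and partitions as the surrounding module:

```python
from dagster import AssetIn, asset

# Hypothetical alternative: with @asset, the function name becomes the asset
# key, so the single output doesn't need an explicit AssetOut mapping.
@asset(
    ins={"ex21_df": AssetIn("ex21_company_ownership_info")},
    io_manager_key="pandas_parquet_io_manager",
    partitions_def=year_quarter_partitions,
)
def transformed_ex21_subsidiary_table(ex21_df: pd.DataFrame) -> pd.DataFrame:
    ...  # same transformation body as the multi_asset version
```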
```python
def sec_rl_input_table(
    basic_10k_df: pd.DataFrame,
    clean_ex21_df: pd.DataFrame,
    sec10k_filing_metadata: pd.DataFrame,
```
I think I used this `sec10k_filing_metadata` parameter/asset right?
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
```python
basic_10k_df = flatten_companies_across_time(
    df=basic_10k_df, key_cols=["company_name", "street_address"]
)
```
Instead of flattening across time to have one record per CIK, I opted to flatten so that there is one record per unique company name and street address pair. This is because it's helpful to have all address changes in the record linkage process, in case the company is still reporting an old address to EIA. Later, in the output table that includes a `utility_id_eia` connection column, I think we could flatten across time so there's one record per `sec_company_id`, but maybe it's not a big deal for that key to be non-unique, and we should opt to include all address and company name changes in that final table.
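For illustration, a minimal sketch of flattening on the composite key (the real `flatten_companies_across_time` likely differs; `report_date` is an assumed column):

```python
import pandas as pd

def flatten_on_key(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    """Keep one record per unique key, preferring the most recent filing."""
    return (
        df.sort_values("report_date")  # assumed column ordering filings in time
        .drop_duplicates(subset=key_cols, keep="last")
        .reset_index(drop=True)
    )

# e.g. one record per (company_name, street_address) pair:
# flattened = flatten_on_key(basic_10k_df, ["company_name", "street_address"])
```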
```python
# exclude Ex. 21 subs and just match to filers
# once the match has been conducted, add back in the Ex. 21 subs
out_df = basic_10k_df.fillna(np.nan).reset_index(names="record_id")
# TODO: Here we conduct the match to EIA and add on a column with utility_id_eia
```
This is where I think it makes sense to call the splink model and run record linkage between the SEC 10-K filers and the EIA utilities. Basically I think the transformed basic 10-K table should be an input, and the output of the splink model is that basic 10-K table with `utility_id_eia` as a column added on. In my head, I'm calling this matched asset `core_sec_10k__filers`, but maybe the whole way I'm thinking about this structure is wrong.
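A rough sketch of what that splink call could look like, assuming splink v3's DuckDB backend (the comparisons and blocking rule here are placeholders, not the tuned model):

```python
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",  # linking two datasets, no within-dataset dedupe
    "unique_id_column_name": "record_id",
    "blocking_rules_to_generate_predictions": [
        "l.company_name = r.company_name",  # placeholder blocking rule
    ],
    "comparisons": [
        cl.exact_match("company_name"),
        cl.exact_match("street_address"),
    ],
}
linker = DuckDBLinker([basic_10k_df, clean_eia_df], settings)
# (m/u parameter estimation steps omitted for brevity)
preds = linker.predict(threshold_match_probability=0.95).as_pandas_dataframe()
```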
```python
@asset(
    ins={
        "sec10k_filers_matched_df": AssetIn("core_sec_10k__filers"),
        "clean_ex21_df": AssetIn("transformed_ex21_subsidiary_table"),
    },
)
def out_sec_10k__parents_and_subsidiaries(
```
Now take that matched basic 10-K table and merge Ex. 21 subsidiaries on, so we know which of the subsidiaries file a 10-K themselves, and can get parent company information and ownership percentage from that. Additionally, take the subsidiaries that don't file a 10-K and concatenate them onto the dataframe, forming the asset `out_sec_10k__parents_and_subsidiaries`.
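A sketch of that merge/concat step, with assumed column names (`parent_company_name` and `ownership_pct` are illustrative):

```python
import pandas as pd

# subsidiaries that file a 10-K themselves: attach parent info to the filers
matched_df = sec10k_filers_matched_df.merge(
    clean_ex21_df[["sec_company_id", "parent_company_name", "ownership_pct"]],
    on="sec_company_id",
    how="left",
)
# subsidiaries with no 10-K of their own become additional rows
non_filing_subs_df = clean_ex21_df[
    ~clean_ex21_df["sec_company_id"].isin(matched_df["sec_company_id"])
]
out_df = pd.concat([matched_df, non_filing_subs_df]).reset_index(drop=True)
```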
Seems like the seg fault is still causing the CI to fail.
```python
# the last step is to take the EIA utilities that haven't been matched
# to a filer company, and merge them by company name onto the Ex. 21 subs
unmatched_eia_df = clean_eia_df[
    ~clean_eia_df["utility_id_eia"].isin(
        sec_10k_filers_matched_df.utility_id_eia.unique()
    )
].drop_duplicates(subset="company_name")
ex21_non_filing_subs_df = ex21_non_filing_subs_df.merge(
    unmatched_eia_df[["utility_id_eia", "company_name"]],
    how="left",
    on="company_name",
).drop_duplicates(subset="sec_company_id")
```
This is a slightly weird last step, but basically we want to take all the EIA utilities that didn't get matched to SEC filers during the record linkage process, and see if we can just match them to an Ex. 21 subsidiary based on an exact match on company name. This is obviously imperfect because you can have companies that share the same name but aren't actually matches. However, we have basically no other useful shared columns between Ex. 21 subs and EIA utilities, and I believe that there are a lot of EIA utilities that are reported as Ex. 21 subs, so I think the benefit of adding in a bunch of true positive matches outweighs the cost of also adding in a bunch of false positives. I think a user is likely to want to see a possible connection to EIA and then assess for themselves whether that is a valid connection, versus not having the connection at all.
We should probably think of a way to communicate uncertainty to downstream users through documentation or something.
One idea I had was to include the `match_probability` column that `splink` provides in the output SEC table. The threshold for matching is .95, so all probabilities would be between .95 and 1, which is maybe not that helpful from a "human intuition" perspective. I think for now, the best thing to do is just add table-level documentation that conveys that the `utility_id_eia` column is modeled and that the whole extraction of the data is modeled.
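If we did surface it, attaching the column would be straightforward; a sketch, assuming `preds` is the splink predictions frame with `record_id_l` pointing at SEC records:

```python
# keep the single best-scoring EIA match per SEC record, then merge the
# probability on so users can judge match quality themselves
best_preds = (
    preds.sort_values("match_probability", ascending=False)
    .drop_duplicates(subset="record_id_l")
)
sec_out_df = basic_10k_df.merge(
    best_preds[["record_id_l", "match_probability"]],
    left_on="record_id",
    right_on="record_id_l",
    how="left",
).drop(columns="record_id_l")
```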
```python
logger.info(
    "Ex. 21 subsidiary names matched to an EIA utility name: "
    f"{len(ex21_non_filing_subs_df['utility_id_eia'].unique())}"
)
```
We should probably have more logging throughout these transformations and record it with `mlflow`.
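For example, something like this could run inside the transforms (a sketch; assumes an active mlflow run is managed elsewhere):

```python
import mlflow
import pandas as pd

def log_match_stats(df: pd.DataFrame, id_col: str = "utility_id_eia") -> None:
    """Log simple linkage metrics so runs can be compared in mlflow."""
    n_total = len(df)
    n_matched = int(df[id_col].notna().sum())
    mlflow.log_metric("n_records", n_total)
    mlflow.log_metric("n_matched", n_matched)
    mlflow.log_metric("match_rate", n_matched / n_total if n_total else 0.0)
```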
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
@zschira I think we don't have ReviewNB for this repo? But I think the markdown cells in the splink notebook show what I'm thinking in terms of Dagsterizing this. I'm reading in inputs as parquets and then make a note about where the final output table is (it's just the input SEC table with `utility_id_eia` added on).
Overall looks good. I haven't dug too much into the details and focused on high-level structure given time constraints.

One thing I think we probably need to do is define schemas for any `core`/`out` tables and do some basic validation of them. I think if we added `pandera` models like in `models/sec10k/entities.py` that would be a fairly easy way to handle this.

One other thing that comes to mind is that we might want to fully move all our assets in this repo to using PUDL naming conventions, including for raw extracted data and intermediate steps. That can probably be handled in a follow-up PR though.
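A sketch of what such a `pandera` model might look like for `core_sec_10k__filers` (column names and types here are guesses, not the final schema):

```python
import pandera as pa
from pandera.typing import Series

class CoreSec10kFilers(pa.DataFrameModel):
    """Schema for the matched SEC 10-K filers table."""

    sec_company_id: Series[str] = pa.Field(description="PUDL-assigned SEC company ID.")
    company_name: Series[str] = pa.Field()
    utility_id_eia: Series[float] = pa.Field(
        nullable=True,
        description="EIA utility ID from the splink match; modeled, not reported.",
    )

    class Config:
        coerce = True
```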
```python
def _remove_weird_sec_cols(sec_df: pd.DataFrame) -> pd.DataFrame:
    weird_cols = ["]fiscal_year_end", "]irs_number", "]state_of_incorporation"]
```
It probably wouldn't be too hard to add this to the 10-K extraction, but given time constraints, it might be better to just open an issue for future work.
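For reference, the fix at extraction time could be as small as stripping the stray bracket from column names (a hypothetical one-liner):

```python
# drop the leading "]" artifact from extracted 10-K column names
sec_df = sec_df.rename(columns=lambda c: c.lstrip("]"))
```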
src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
Overview

Closes #119.

This moves SEC to EIA record linkage into the production pipeline and creates SEC table output assets, including `core_sec_10k__filers`.

Files to look at:

- src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_eia_input.py
- src/mozilla_sec_eia/models/sec_eia_record_linkage/transform_sec_input.py
- src/mozilla_sec_eia/library/record_linkage_utils.py
- notebooks/18-kl-splink-sec-eia.ipynb

What did you change in this PR?

- src/mozilla_sec_eia/models/sec_eia_record_linkage/create_eia_input.py
Testing
How did you make sure this worked? How can a reviewer verify this?
I'm not sure what the error with the CI means right now.
I materialized all assets and then ran through splink.