Record linkage for SEC to EIA #120

Open
wants to merge 171 commits into
base: main
Changes from 168 commits
Commits
5767035
Initial dagster integration
zschira Aug 13, 2024
9d9fbfd
Update validate integration test to dagster infra
zschira Aug 14, 2024
3da9659
Merge branch 'main' into dagster_integration
zschira Aug 20, 2024
ee77e7a
Generalize mltools
zschira Aug 27, 2024
53d3354
Reorg repo to move towards generalized modelling repo
zschira Aug 28, 2024
014bcb1
Change library module structure
zschira Aug 28, 2024
5404148
Turn experiment_tracking into sub-package
zschira Aug 28, 2024
886614f
Remove unused function
zschira Aug 28, 2024
dec80b8
Gracefully handle mlflow run on failure
zschira Aug 28, 2024
e725f3d
Fix variable name
zschira Aug 28, 2024
df44ed5
Change experiment tracker resource names
zschira Aug 28, 2024
93da052
Add mlflow artifact io-manager
zschira Aug 28, 2024
07713e9
Simplify pudl_models decorator
zschira Aug 29, 2024
5d89ec6
Split extraction logging into two funcs
zschira Aug 29, 2024
c57818a
Add mlflow metrics io-manager
zschira Aug 29, 2024
625783b
Change pudl_model to pudl_pipeline
zschira Aug 29, 2024
4f50a7b
Add validation pipeline
zschira Aug 30, 2024
f6ab22c
Streamline construction of dagster jobs for running/testing pudl models
zschira Sep 2, 2024
f20fb7d
Remove old comment
zschira Sep 2, 2024
92e2e00
Add ex21 to dagster jobs
zschira Sep 3, 2024
520e6d1
Prep for multiple code locations
zschira Sep 3, 2024
e99ee1a
Add top-level workspace file
zschira Sep 3, 2024
559c0e6
Restructure docs
zschira Sep 3, 2024
93d02f3
Add train model job
zschira Sep 3, 2024
5190bf9
Log mlflow artifacts as parquet until csv is fixed
zschira Sep 3, 2024
ca9599e
Fix ex21 extraction
zschira Sep 4, 2024
7e7a503
Add development section to docs
zschira Sep 4, 2024
61f48c3
Fix integration tests
zschira Sep 4, 2024
0fd8ffc
Don't run ruff on notebooks
zschira Sep 4, 2024
97d5587
xfail ex21 integration test
zschira Sep 4, 2024
ace268b
Add parquet upath io-manager
zschira Sep 5, 2024
fb1feeb
Remove nb-output clear
zschira Sep 5, 2024
294ec72
Test docker deployment
zschira Sep 5, 2024
4de51b3
Chunk ex 21 extraction
zschira Sep 5, 2024
214e28f
Fix assign copy
zschira Sep 6, 2024
c5736e0
Add job for testing ex21 resource usage
zschira Sep 6, 2024
4a81e88
Merge branch 'test_parquet_logging' into dagster_integration
zschira Sep 6, 2024
ec39633
Remove test docker files
zschira Sep 6, 2024
101ccf1
Remove complex asset factory
zschira Sep 6, 2024
7e0c5a5
Parallelize ex21 extraction
zschira Sep 6, 2024
080d790
Don't chunk in inference module
zschira Sep 6, 2024
44dfc52
Handle failures in converting to pdf
zschira Sep 6, 2024
6e24157
Delete cached pdfs early
zschira Sep 6, 2024
cd06d07
Add metadata to chunk_filings
zschira Sep 9, 2024
e3e8c45
Catch oom errors while extracting ex21
zschira Sep 9, 2024
350defb
Fix ex21 gcs io-manager
zschira Sep 9, 2024
3c80b72
Fix partitions for basic 10k extraction.
zschira Sep 9, 2024
31971b7
Cache layoutlm locally
zschira Sep 9, 2024
634a050
Fix caching model
zschira Sep 9, 2024
69ee4c0
Remove bad call
zschira Sep 9, 2024
63d6600
Test own_per conversion
zschira Sep 10, 2024
c8490d4
Add pandera types for output tables
zschira Sep 10, 2024
fa4f57d
Add missing entities module
zschira Sep 10, 2024
35e917d
Don't cache model, load with io manager
zschira Sep 10, 2024
a7b1c7f
Remove float conversion
zschira Sep 10, 2024
f019117
Add hypothesis to deps
zschira Sep 10, 2024
d7d13d8
Make own_per str
zschira Sep 10, 2024
70f5293
Remove astype
zschira Sep 10, 2024
e406092
Validate ex21 return types
zschira Sep 10, 2024
f3835d9
Clean model download temp dir
zschira Sep 11, 2024
3c995cd
Fix model return type
zschira Sep 11, 2024
ef55e4b
Catch errors in creating ex 21 dataset
zschira Sep 11, 2024
b37450a
Fix column name
zschira Sep 11, 2024
06b18ed
Try to catch empty pdf errors
zschira Sep 12, 2024
abfc006
Print traceback in caught exception
zschira Sep 12, 2024
ff92a55
Fix empty pdf check
zschira Sep 12, 2024
8aa8c95
Actually fix empty pdf check?
zschira Sep 12, 2024
43600bc
Use UPath in GCSArchive
zschira Sep 18, 2024
05ad82c
Make _configure_mlflow a standalone function
zschira Sep 18, 2024
fddc3b2
Merge branch 'main' into error_handling_improvements
zschira Sep 18, 2024
99fc7ed
Try to skip notebooks in ruff check
zschira Sep 18, 2024
b135500
Pull integration test fixes from main
zschira Sep 19, 2024
6e868f2
Fix typos in README.rst
zschira Sep 19, 2024
df4fd09
Cache downloaded layoutlm in dagster home
zschira Sep 19, 2024
74d237d
Merge branch 'error_handling_improvements' of github.com:catalyst-coo…
zschira Sep 19, 2024
3642765
Fix broken test
zschira Sep 19, 2024
830bd74
fix rename filings
katie-lamb Sep 20, 2024
2cd1fe6
fix paths to cache training data
katie-lamb Sep 20, 2024
64dc8c5
update root dir path
katie-lamb Sep 20, 2024
226d91c
Fix UPath initialization
zschira Sep 20, 2024
3c17d33
Fix path in test
zschira Sep 20, 2024
df69f42
Create huggingface dataset outside model execution
zschira Sep 20, 2024
2d3345c
small fixes to path handling
katie-lamb Sep 20, 2024
46e7b40
Merge branch 'error_handling_improvements' into second-pass-ex21-impr…
katie-lamb Sep 20, 2024
6f9d34a
Minor fixes
zschira Sep 23, 2024
07d500a
Start migrating model training to notebook
zschira Sep 23, 2024
81813a7
Create dataset as dataframe for logging
zschira Sep 24, 2024
5174ed7
Modify dataset return type
zschira Sep 24, 2024
7a572c0
Fix dataset types for model signature
zschira Sep 24, 2024
5728026
Migrate ex 21 model training to a notebook
zschira Sep 25, 2024
5fbbfff
Merge initial notebook migration (broken)
zschira Oct 3, 2024
37edd50
Split dataset loading into separate assets
zschira Oct 4, 2024
d6889e3
Minor notebook fixes
zschira Oct 4, 2024
d5e013a
Fix import in notebook
zschira Oct 4, 2024
f9810db
add device to pipeline
zschira Oct 4, 2024
2760881
Fix signature inference
zschira Oct 4, 2024
1dcacfa
Fix notebook dagster config
zschira Oct 4, 2024
39bb45b
Fix config param name
zschira Oct 4, 2024
cb83862
Partition training data
zschira Oct 5, 2024
c71593c
Add partitions to notebook asset
zschira Oct 5, 2024
4efa515
Update ex21 labels
zschira Oct 6, 2024
581b2e3
Use run name for specifying training runs
zschira Oct 6, 2024
c67a1be
Rework how notebook is configured
zschira Oct 6, 2024
b8a5b24
Finetune configuration
zschira Oct 6, 2024
45d5cf8
separate inference dataset creation from model prediction
zschira Oct 7, 2024
3e15b1f
Remove deprecated inference module
zschira Oct 7, 2024
60a1260
Add notebook for training ex21 classifier
zschira Oct 8, 2024
4105110
Pull in model updates
zschira Oct 8, 2024
4d29037
Update classifier model
zschira Oct 8, 2024
85c44ff
Fix set on copy pandas issue
zschira Oct 9, 2024
52e3580
Fix model URIs
zschira Oct 9, 2024
b709053
Fix indices in extraction model
zschira Oct 9, 2024
b8dad3c
Fix typo
zschira Oct 9, 2024
e6b29ff
Add asset factory for loading models
zschira Oct 10, 2024
3d11777
Catch layout classification NaN exception
zschira Oct 10, 2024
df5fe0d
Use GCS pickle io-manager
zschira Oct 10, 2024
d6c41a2
Switch gcs pickle io manager to upath based
zschira Oct 11, 2024
ddd2263
Remove duplicate logger
zschira Oct 11, 2024
93bffcb
Fix config warnings
zschira Oct 11, 2024
d717caa
Test pin sphinx
zschira Oct 11, 2024
09cd189
add splink and model to environment
katie-lamb Oct 13, 2024
15be127
Catch errors while normalizing bounding boxes
zschira Oct 14, 2024
4117d0a
Fix call to pandera example
zschira Oct 14, 2024
8c8dd60
Fix handle failures in converting to pdf
zschira Oct 14, 2024
ff821b5
Actually fix handle failures in converting to pdf
zschira Oct 14, 2024
a8eb359
Add model documentation to sec10k readme
zschira Oct 16, 2024
dc160ac
Fix ex 21 validation integration test
zschira Oct 16, 2024
10b24a9
Improve classifier error handling
zschira Oct 16, 2024
ad54979
Fully broaden classifier errors
zschira Oct 16, 2024
672e123
add more docs on running the notebooks
zschira Oct 16, 2024
b95b2fb
add splink notebooks and preprocessing functions
katie-lamb Oct 21, 2024
2dbdcaa
clean up feature creation in paragraph classifier
katie-lamb Oct 22, 2024
cda3225
fix feature creation function
katie-lamb Oct 22, 2024
509b7a0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
8855e5e
small fixes to read in comments in tracking dataframe
katie-lamb Oct 22, 2024
f806ad8
Merge branch 'prep_paragraph_classifier' of https://github.com/cataly…
katie-lamb Oct 22, 2024
590ba60
updates to model pipeline
katie-lamb Oct 23, 2024
3db47d4
take out logging messages
katie-lamb Oct 23, 2024
858ecff
Merge branch 'prep_paragraph_classifier' into splink-skeleton
katie-lamb Oct 23, 2024
61c8abf
make pudl editable
katie-lamb Oct 23, 2024
e5148d8
add in record linkage modules
katie-lamb Nov 27, 2024
88f17f2
fix errors with asset creation
katie-lamb Nov 29, 2024
c9b62ba
clean up sec output table creation
katie-lamb Nov 30, 2024
01b2d23
splink notebook change
katie-lamb Nov 30, 2024
5d9419b
Merge branch 'main' into splink-skeleton
katie-lamb Nov 30, 2024
30d22c9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 30, 2024
3fca3c9
fix pre commit
katie-lamb Nov 30, 2024
a50dd9d
Merge branch 'splink-skeleton' of https://github.com/catalyst-coopera…
katie-lamb Nov 30, 2024
d53ab25
update python dependency in test environment
katie-lamb Dec 1, 2024
2eb1555
update github tox env
katie-lamb Dec 1, 2024
44eb70b
restructure input table assets
katie-lamb Dec 2, 2024
7563df8
include pseudo code of SEC output table module
katie-lamb Dec 2, 2024
3c88ff2
Try using conda env to run tox
jdangerx Dec 3, 2024
fb3d772
Add PUDL dependency and restrict to Py3.12
jdangerx Dec 3, 2024
390770f
Guess you don't need to specify tox env with setup-micromamba
jdangerx Dec 3, 2024
8e8beee
Install GDAL version via conda since we rely on PUDL now
jdangerx Dec 3, 2024
50e7b7e
notebook has cells for SEC and EIA hook up
katie-lamb Dec 3, 2024
224721b
Merge pull request #123 from catalyst-cooperative/splink-skeleton-tox…
jdangerx Dec 4, 2024
7dc78e1
Fix dagster setup for record linkage inputs
zschira Dec 4, 2024
daa8f0a
fix util functions
katie-lamb Dec 9, 2024
dbefe34
Handle missing partitions in extracted data
zschira Dec 10, 2024
97f5d68
Fix basic_10k partitions
zschira Dec 11, 2024
b26f1f8
debug materialization of rl input assets
katie-lamb Dec 11, 2024
acaf3d1
clean up notebook to work with dagster assets
katie-lamb Dec 16, 2024
f4cceb7
clean up new structure of sec assets
katie-lamb Dec 17, 2024
fa9e52e
add in final match between ex 21 subs and eia utilities
katie-lamb Dec 18, 2024
599ae87
remove sec output table module
katie-lamb Dec 18, 2024
c340718
add drop duplicates on sec company id
katie-lamb Dec 18, 2024
24de7d6
clean up notebook
katie-lamb Dec 19, 2024
26b1a72
add markdown cell note
katie-lamb Dec 19, 2024
70427a0
make asset not multi asset
katie-lamb Dec 19, 2024
2 changes: 1 addition & 1 deletion .github/workflows/tox-pytest.yml
@@ -11,7 +11,7 @@ jobs:
       id-token: write
     strategy:
       matrix:
-        python-version: ["3.10", "3.11"]
+        python-version: ["3.12"]
       fail-fast: false
     defaults:
       run:
1 change: 0 additions & 1 deletion .pre-commit-config.yaml
@@ -35,7 +35,6 @@ repos:
     rev: 24.10.0
     hooks:
       - id: black
-        language_version: python3.11

   - repo: https://github.com/pre-commit/mirrors-prettier
     rev: v4.0.0-alpha.8
8 changes: 6 additions & 2 deletions environment.yml
@@ -5,7 +5,7 @@ channels:
 dependencies:
   # Packages required for setting up the environment
   - pip>=21,<24
-  - python>=3.10,<3.12
+  - python>=3.10,<=3.12
katie-lamb marked this conversation as resolved.
   - setuptools>=66,<69

   # Packages specified in setup.py that need or benefit from binary conda packages
@@ -19,11 +19,15 @@ dependencies:

   # Jupyter packages:
   - jupyterlab>=3.2,<4
-  - nbconvert>=6,<7 # Used to clear notebook outputs in pre-commit hooks
+  - nbconvert>=7 # Used to clear notebook outputs in pre-commit hooks

   # These are not normal Python packages available on PyPI
   - nodejs # Useful for Jupyter and prettier pre-commit hook

+  - dask>=2024
+  - gdal
+
   # Use pip to install the package defined by this repo for development:
   - pip:
+      # - git+https://github.com/catalyst-cooperative/pudl.git@main
       - --editable ./[dev,docs,tests,types]
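
A note on the new `python>=3.10,<=3.12` pin in environment.yml: under ordinary component-wise version ordering, where a missing component counts as zero, a patch release such as 3.12.1 sorts above 3.12 and so would fall outside `<=3.12`. The sketch below illustrates that ordering with the standard library only. The helpers `parse`, `compare`, and `satisfies` are hypothetical names written for this example; they are not conda's resolver, and conda's actual matching rules may differ in detail.

```python
from itertools import zip_longest


def parse(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '3.12.1' into (3, 12, 1)."""
    return tuple(int(part) for part in version.split("."))


def compare(a: str, b: str) -> int:
    """Return -1, 0, or 1, padding the shorter version with zeros (assumed ordering)."""
    for x, y in zip_longest(parse(a), parse(b), fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as '>=3.10,<3.12'."""
    checks = {
        ">=": lambda c: c >= 0,
        "<=": lambda c: c <= 0,
        ">": lambda c: c > 0,
        "<": lambda c: c < 0,
    }
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '<=' is not read as '<'.
        op = next(o for o in (">=", "<=", ">", "<") if clause.startswith(o))
        if not checks[op](compare(version, clause[len(op):])):
            return False
    return True


OLD_SPEC = ">=3.10,<3.12"   # constraint removed in this diff
NEW_SPEC = ">=3.10,<=3.12"  # constraint added in this diff

for candidate in ["3.10.0", "3.11.9", "3.12", "3.12.0", "3.12.1"]:
    print(candidate, satisfies(candidate, OLD_SPEC), satisfies(candidate, NEW_SPEC))
```

Under this assumed ordering, 3.12 and 3.12.0 pass the new spec but 3.12.1 does not. If the intent is to accept any 3.12.x release, an upper bound such as `<3.13` would express that under the same ordering.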