
[pdp] Extend modeling functionality #41

Merged 3 commits from pdp-extend-modeling-functionality into develop on Dec 17, 2024

Conversation

@bdewilde (Member) commented on Dec 8, 2024

changes

adds functions (and unit tests) for common modeling-adjacent tasks: splitting datasets into exclusive subsets, and computing sample weights to deal with class imbalance

context

I've had to do this stuff "custom" for more than one school, but it's the same logic, so it should live in the shared lib code.

Follows PR #40

questions

Are there any other tasks like this to include?

@bdewilde marked this pull request as ready for review on December 8, 2024 21:59
Base automatically changed from upgrade-package-deps to develop on December 16, 2024 15:42
@vishpillai123 left a comment

The sample weight function looks to be an extension of sklearn.utils.class_weight.compute_sample_weight, so it looks great to me.
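For anyone following along, here's a minimal sketch of what `compute_sample_weight(class_weight="balanced", ...)` does, with made-up labels that aren't the PR's data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# imbalanced target: 90 majority-class samples, 10 minority-class samples (made-up labels)
y = np.array(["retained"] * 90 + ["not retained"] * 10)

# "balanced" weights each sample inversely to its class frequency:
# weight = n_samples / (n_classes * count_of_that_class)
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights[0])   # ~0.56 for majority-class samples (100 / (2 * 90))
print(weights[-1])  # 5.0 for minority-class samples (100 / (2 * 10))
```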

For the compute_data_splits() function, I was curious why using a random generator to create the dataset splits was necessary, instead of just using sklearn.model_selection.train_test_split? I've seen train_test_split commonly used across academia and industry, so I'm just curious about your choice to use an rng directly here. Not really requesting changes, and I'm definitely open to your thoughts here; just thought to ask. :)
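For context, a three-way split via `train_test_split` requires chaining two calls, roughly like this (illustrative 60/20/20 fractions, not taken from the PR):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000)})  # stand-in dataset

# carve off the test set first, then split the remainder into train/validate;
# the second test_size must be rescaled to hit 60/20/20 overall (0.25 * 0.8 == 0.2)
df_rest, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_valid = train_test_split(df_rest, test_size=0.25, random_state=42)
```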

@bdewilde (Member, Author)

As the name suggests, train_test_split() only splits the dataset into 2 components -- "train" and "test" -- while we want 3 -- the additional "validate" split. It's possible to apply train_test_split() twice, but that's slower and clunky. Using the underlying numpy functionality directly is faster and accomplishes the same thing. :)
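A minimal sketch of the rng-based approach, assuming `compute_data_splits()` draws one split label per row with numpy's `default_rng`; the actual signature and defaults in this PR may differ:

```python
import numpy as np
import pandas as pd

def compute_data_splits(
    df: pd.DataFrame,
    *,
    labels: tuple[str, ...] = ("train", "test", "validate"),
    fracs: tuple[float, ...] = (0.6, 0.2, 0.2),
    seed: int | None = None,
) -> pd.Series:
    """Randomly assign each row of ``df`` to exactly one of ``labels``,
    in approximately the given fractions. Hypothetical sketch only.
    """
    rng = np.random.default_rng(seed)
    # a single vectorized draw covers any number of splits in one pass
    assignments = rng.choice(labels, size=len(df), p=fracs)
    return pd.Series(assignments, index=df.index, name="split")

# usage: mask the original frame by split label
df = pd.DataFrame({"x": range(1000)})
splits = compute_data_splits(df, seed=42)
df_train, df_test, df_valid = (df.loc[splits.eq(lbl)] for lbl in ("train", "test", "validate"))
```

One trade-off with this sketch: because each row is drawn independently, the split sizes are approximate rather than exact, which is usually fine at realistic dataset sizes.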

@vishpillai123

> As the name suggests, train_test_split() only splits the dataset into 2 components -- "train" and "test" -- while we want 3 -- the additional "validate" split. It's possible to apply train_test_split() twice, but that's slower and clunky. Using the underlying numpy functionality directly is faster and accomplishes the same thing. :)

Ah, I thought you could create a validation split with it and could have sworn I'd used it that way before, but I'm probably mixing it up with something else. Ok, makes sense!

@bdewilde merged commit 8970ef8 into develop on Dec 17, 2024
5 checks passed
@bdewilde deleted the pdp-extend-modeling-functionality branch on December 17, 2024 01:49