
[pdp] Extend modeling functionality #41

Merged 3 commits from pdp-extend-modeling-functionality into develop on Dec 17, 2024

Conversation

@bdewilde (Member) commented on Dec 8, 2024

changes

adds functions (and unit tests) for common modeling-adjacent tasks: splitting datasets into exclusive subsets, and computing sample weights to deal with class imbalance

context

I've had to do this stuff "custom" for more than one school, but it's the same logic, so it should live in the shared lib code.

Follows PR #40

questions

Are there any other tasks like this to include?

@bdewilde marked this pull request as ready for review on December 8, 2024 21:59
Base automatically changed from upgrade-package-deps to develop on December 16, 2024 15:42
@vishpillai123 left a comment

The sample weight function looks to be an extension of sklearn.utils.class_weight.compute_sample_weight, so it looks great to me.
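For anyone following along, here's a minimal sketch of what `compute_sample_weight(class_weight="balanced", ...)` does, with made-up labels that aren't the PR's data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# imbalanced target: 90 majority-class samples, 10 minority-class samples (made-up labels)
y = np.array(["retained"] * 90 + ["not retained"] * 10)

# "balanced" weights each sample inversely to its class frequency:
# weight = n_samples / (n_classes * count_of_that_class)
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights[0])   # ~0.56 for majority-class samples (100 / (2 * 90))
print(weights[-1])  # 5.0 for minority-class samples (100 / (2 * 10))
```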

For the compute_data_splits() function, I was curious why using a random generator to create the dataset splits was necessary, instead of just using sklearn.model_selection.train_test_split? I've seen train_test_split commonly used across academia and industry, so I'm just curious about your choice to use an rng directly here. Not really requesting changes, and I'm definitely open to your thoughts here; just thought to ask. :)
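For context, a three-way split via `train_test_split` requires chaining two calls, roughly like this (illustrative 60/20/20 fractions, not taken from the PR):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000)})  # stand-in dataset

# carve off the test set first, then split the remainder into train/validate;
# the second test_size must be rescaled to hit 60/20/20 overall (0.25 * 0.8 == 0.2)
df_rest, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_valid = train_test_split(df_rest, test_size=0.25, random_state=42)
```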

@bdewilde (Member, Author)

As the name suggests, train_test_split() only splits the dataset into 2 components -- "train" and "test" -- while we want 3 -- the additional "validate" split. It's possible to apply train_test_split() twice, but that's slower and clunky. Using the underlying numpy functionality directly is faster and accomplishes the same thing. :)
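A minimal sketch of the rng-based approach, assuming `compute_data_splits()` draws one split label per row with numpy's `default_rng`; the actual signature and defaults in this PR may differ:

```python
import numpy as np
import pandas as pd

def compute_data_splits(
    df: pd.DataFrame,
    *,
    labels: tuple[str, ...] = ("train", "test", "validate"),
    fracs: tuple[float, ...] = (0.6, 0.2, 0.2),
    seed: int | None = None,
) -> pd.Series:
    """Randomly assign each row of ``df`` to exactly one of ``labels``,
    in approximately the given fractions. Hypothetical sketch only.
    """
    rng = np.random.default_rng(seed)
    # a single vectorized draw covers any number of splits in one pass
    assignments = rng.choice(labels, size=len(df), p=fracs)
    return pd.Series(assignments, index=df.index, name="split")

# usage: mask the original frame by split label
df = pd.DataFrame({"x": range(1000)})
splits = compute_data_splits(df, seed=42)
df_train, df_test, df_valid = (df.loc[splits.eq(lbl)] for lbl in ("train", "test", "validate"))
```

One trade-off with this sketch: because each row is drawn independently, the split sizes are approximate rather than exact, which is usually fine at realistic dataset sizes.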

@vishpillai123

> As the name suggests, train_test_split() only splits the dataset into 2 components -- "train" and "test" -- while we want 3 -- the additional "validate" split. It's possible to apply train_test_split() twice, but that's slower and clunky. Using the underlying numpy functionality directly is faster and accomplishes the same thing. :)

Ah, I thought you could create a validation split with it and could have sworn I'd used it that way before, but I'm probably mixing it up with something else. Ok, makes sense!

@bdewilde merged commit 8970ef8 into develop on Dec 17, 2024
5 checks passed
@bdewilde deleted the pdp-extend-modeling-functionality branch on December 17, 2024 01:49