Documentation update #26
base: master
Conversation
…ng newlines which simulated soft-wrap poorly)
…xes a slight error in the SampleNullityDrop hook, where it was checking the threshold against sample count (rather than feature count)
…currently only one: SimpleImputation)
…e is currently only one: StandardScaling)
…iKit-Learn's root ModelManager
…he extended discussion of how this implementation works around SciKit-Learn's triple-variant approach!
…ple usage. Woohoo
…r use in the broader context of the framework.
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
    null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
"null values": does it mean NaN?
Can be np.NaN, pd.NA, pd.NaT, null, or None by default (as it's just using Pandas null detection under the hood). The Pandas documentation refers to this as NA though, so maybe we should change 'null' to 'NA' to match?
- null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
+ NA values. For example, with a threshold of 0.5, any samples which are missing more than half of their
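(To ground the discussion: a small sketch showing that Pandas' `isna()` treats all of the markers mentioned above as missing, and how a per-sample NA fraction could be compared against a threshold. Purely illustrative; not the hook's actual implementation.)

```python
# Illustrative only: every marker below registers as "NA" to Pandas' isna().
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],                                               # np.nan
    "b": pd.array([1, pd.NA, 3], dtype="Int64"),                           # pd.NA
    "c": [pd.Timestamp("2024-01-01"), pd.NaT, pd.Timestamp("2024-03-01")], # pd.NaT
    "d": ["x", None, "z"],                                                 # None
})

print(df.isna())               # True wherever a value is missing, whatever the marker
print(df.isna().mean(axis=1))  # per-sample NA fraction, e.g. for comparison against a 0.5 threshold
```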
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
Just if I understand it correctly: SampleNullityDrop drops rows (samples), while FeatureNullityDrop drops columns (features), right?
Correct; I plan on adding a small documentation section clarifying what "feature" and "sample" mean in the context of this library.
Given this is confusing here though, I'm going to extend the docstrings of data hooks which directly refer to features/samples with their definitions to avoid this.
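(To make the rows-versus-columns distinction concrete, here is a hedged Pandas sketch; the variable names and threshold semantics are assumptions for illustration, not the hooks' actual code.)

```python
# Assumed semantics: drop along one axis or the other based on NA fraction.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [1.7, np.nan, 1.8],
    "weight": [np.nan, np.nan, 70.0],
})
threshold = 0.5

# "Sample" drop: remove rows whose fraction of missing features exceeds the threshold
samples_kept = df.loc[df.isna().mean(axis=1) <= threshold]

# "Feature" drop: remove columns whose fraction of missing samples exceeds the threshold
features_kept = df.loc[:, df.isna().mean(axis=0) <= threshold]

print(samples_kept)   # rows 0 and 2 survive
print(features_kept)  # only the "height" column survives
```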
{
    "type": "imputation_simple",
    "features": ["color", "species"],
    "strategy": "most_common"
Shouldn't it be `most_frequent`? (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
Good catch; dunno why my brain shut down here.
"strategy": "most_common" | |
"strategy": "most_frequent" |
@@ -15,27 +15,22 @@ it can be easily extended to allow for the analysis of any tabular dataset.
    * `conda activate modular_optuna_ml`
    * `mamba activate modular_optuna_ml`
4. Done!
5. This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing` for an example).

NOTE: This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see the `testing` directory for some examples).

## Running the Program

Four files are needed to run an analysis

* A tabular dataset, containing the metrics you want to run the analysis on
- * A tabular dataset, containing the metrics you want to run the analysis on
+ * A tabular dataset (`.tsv` file), containing the metrics you want to run the analysis on
`tsv` is an informal format; while it's supposed to always use the `.tsv` extension, lots of tools just re-use the `.csv` designation (even though the file is still tab-delimited). Going to clarify your suggestion slightly to account for this.
- * A tabular dataset, containing the metrics you want to run the analysis on
+ * A tabular dataset (usually a `.tsv` file), containing the metrics you want to run the analysis on
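(A small hedged example of the point above: regardless of whether the file carries a `.tsv` or `.csv` extension, a tab-delimited dataset can be loaded by naming the separator explicitly. The file name below is just a placeholder.)

```python
# Placeholder file name; the point is the explicit tab separator.
import pandas as pd

df = pd.read_csv("dataset.tsv", sep="\t")  # also handles tab-delimited files named ".csv"
print(df.head())
```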
  determine the hyperparameters to use.
* Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the
  place of a constant; an example of this can be seen in the `penalty` parameter for the
* If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below)
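(As a rough illustration of the "trial tunable" idea described in the diff above: a dictionary in place of a constant tells the framework to ask Optuna for a fresh value on every trial. The config keys and the `resolve_parameter` helper below are assumptions made for illustration, not the framework's actual schema; only the Optuna calls themselves are real API.)

```python
# Hypothetical config resolution; the schema is assumed, the Optuna calls are real.
import optuna

config = {
    "C": 1.0,                                                     # constant: used as-is
    "penalty": {"type": "categorical", "choices": ["l1", "l2"]},  # "trial tunable"
}

def resolve_parameter(name, value, trial: optuna.Trial):
    """Return constants unchanged; ask the trial to suggest a value for tunable dicts."""
    if not isinstance(value, dict):
        return value
    if value["type"] == "categorical":
        return trial.suggest_categorical(name, value["choices"])
    raise ValueError(f"Unsupported tunable type: {value['type']}")

def objective(trial: optuna.Trial) -> float:
    params = {k: resolve_parameter(k, v, trial) for k, v in config.items()}
    # ... fit a model with `params` on the training split and return its validation score ...
    return 0.0

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)
```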
What do you think about adding some explanatory figure (e.g., from your slides)? As you might remember, it took me a while to understand the concepts of replicate, trial, and split.
Yup, this was next on the docket. Just looking into how to set up Sphinx w/ AutoDoc (so we're not locked to GitHub's wiki should they decide to become tosspots in the future).
Thanks a lot for improving the documentation, @SomeoneInParticular! I left a few minor comments and suggestions.
Addendum suggested by valosekj Co-authored-by: Jan Valosek <[email protected]>
Added additional (common) parameter, as suggested by Jan Co-authored-by: Jan Valosek <[email protected]>
Swapped to GitHub Block formatting, which is a lot better at drawing the eye of the user to this note. Thanks Jan for pointing this out! Co-authored-by: Jan Valosek <[email protected]>
No idea why GitHub had a merge conflict during import; it was solved by just using the import from master
Modified suggestion by Jan
…tuall say what it is