
Documentation update #26

Draft: wants to merge 31 commits into `master`.
Conversation

valosekj (Member) opened this pull request. No description provided.

kalum.ost added 17 commits on January 27, 2025, including:

* …ng newlines which simulated soft-wrap poorly)
* …xes a slight error in the SampleNullityDrop hook, where it was checking the threshold against sample count (rather than feature count)
* …he extended discussion of how this implementation works around SciKit-Learn's triple-variant approach!
* …r use in the broader context of the framework.
data/hooks/encoding.py: outdated review comments (resolved)
```python
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
    null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
```
valosekj (Member Author), on "null values": does it mean NaN?
SomeoneInParticular (Collaborator): Can be np.NaN, pd.NA, pd.NaT, null, or None by default (as it's just using Pandas null detection under the hood). Pandas documentation refers to this as NA though, so maybe we should change 'null' to 'NA' to match?
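A minimal sketch of the Pandas null detection being referred to; the DataFrame below is hypothetical, purely to show which values `pd.isna` treats as NA:

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing every NA flavour mentioned above
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],                            # np.nan
    "b": pd.array([1, pd.NA, 3]),                       # pd.NA
    "c": [pd.Timestamp("2025-01-27"), pd.NaT, pd.NaT],  # pd.NaT
    "d": ["x", None, "z"],                              # None
})

print(df.isna())               # True wherever Pandas detects an NA value
print(df.isna().mean(axis=1))  # per-row nullity ratio, as a threshold of 0.5 would use
```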

SomeoneInParticular (Collaborator) suggested a change (Jan 31, 2025):

```diff
-null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
+NA values. For example, with a threshold of 0.5, any samples which are missing more than half of their
```

```python
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
```
valosekj (Member Author): Just to check that I understand correctly: SampleNullityDrop drops rows (samples), while FeatureNullityDrop drops columns (features), right?

SomeoneInParticular (Collaborator): Correct; I plan on adding a short documentation section clarifying what "feature" and "sample" mean in the context of this library.

Given that this is confusing here, though, I'm going to extend the docstrings of the data hooks which directly refer to features/samples with their definitions to avoid this.
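A hedged illustration of that row/column distinction in plain Pandas; this is not the hooks' actual implementation, just the behaviour being described:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [1.7, np.nan, 1.6],
    "weight": [np.nan, np.nan, 70.0],
})
threshold = 0.5

# SampleNullityDrop-style: drop rows (samples) whose nullity ratio exceeds the threshold
rows_kept = df[df.isna().mean(axis=1) <= threshold]

# FeatureNullityDrop-style: drop columns (features) whose nullity ratio exceeds the threshold
cols_kept = df.loc[:, df.isna().mean(axis=0) <= threshold]
```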

```json
{
    "type": "imputation_simple",
    "features": ["color", "species"],
    "strategy": "most_common"
}
```
valosekj (Member Author) commented (Jan 31, 2025), flagging the `"most_common"` strategy value.
SomeoneInParticular (Collaborator): Good catch; dunno why my brain shut down here.

SomeoneInParticular (Collaborator) suggested a change (Jan 31, 2025):

```diff
-    "strategy": "most_common"
+    "strategy": "most_frequent"
```

```diff
@@ -15,27 +15,22 @@ it can be easily extended to allow for the analysis of any tabular dataset.
   * `conda activate modular_optuna_ml`
   * `mamba activate modular_optuna_ml`
 4. Done!
-5. This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing` for an example).
+
+NOTE: This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see the `testing` directory for some examples).

 ## Running the Program

 Four files are needed to run an analysis

 * A tabular dataset, containing the metrics you want to run the analysis on
```
valosekj (Member Author) suggested a change:

```diff
-* A tabular dataset, containing the metrics you want to run the analysis on
+* A tabular dataset (`.tsv` file), containing the metrics you want to run the analysis on
```

SomeoneInParticular (Collaborator): tsv is an informal format; while it's supposed to always use the .tsv extension, lots of tools just re-use the .csv designation (even though the file is still tab-delimited). Going to clarify your suggestion slightly to account for this.

SomeoneInParticular (Collaborator) suggested a change:

```diff
-* A tabular dataset, containing the metrics you want to run the analysis on
+* A tabular dataset (usually a `.tsv` file), containing the metrics you want to run the analysis on
```
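To illustrate the point about extensions: Pandas, for example, reads tab-delimited data by separator rather than by file name, so a tab-delimited ".csv" loads identically to a ".tsv". The file name below is hypothetical:

```python
import pandas as pd

# sep="\t" makes the delimiter explicit, regardless of the file's extension
df = pd.read_csv("dataset.csv", sep="\t")
```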

README.md: outdated review comment (resolved)
```
determine the hyperparameters to use.
* Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the
  place of a constant; an example of this can be seen in the `penalty` parameter for the
* If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below)
```
valosekj (Member Author): What do you think about adding an explanatory figure (e.g., from your slides)? As you might remember, it took me a while to understand the concepts of replicate, trial, and split.

SomeoneInParticular (Collaborator, Jan 31, 2025): Yup, this was next on the docket. Just looking into how to set up Sphinx w/ AutoDoc (so we're not locked into GitHub's wiki should they decide to become tosspots in the future).
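The "trial tunable" convention quoted in the diff above is worth a sketch. This is a hedged illustration of what a dictionary-in-place-of-a-constant presumably resolves to at run time via Optuna; the parameter names and bounds here are made up for illustration, not the repository's actual schema:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # A constant in the config stays fixed across trials...
    penalty = "l2"
    # ...while a dictionary such as {"low": 0.01, "high": 100.0, "log": true}
    # would become a per-trial suggestion like:
    C = trial.suggest_float("C", 0.01, 100.0, log=True)
    # Placeholder score; a real objective would train and evaluate a model here
    return C

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)
```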

valosekj (Member Author) left a review comment:

Thanks a lot for improving the documentation, @SomeoneInParticular! I left a few minor comments and suggestions.

valosekj added the `documentation` label (Improvements or additions to documentation) on Jan 31, 2025.
SomeoneInParticular and others added 2 commits on January 31, 2025 (12:12):

* Addendum suggested by valosekj (Co-authored-by: Jan Valosek <[email protected]>)
* Added additional (common) parameter, as suggested by Jan (Co-authored-by: Jan Valosek <[email protected]>)
SomeoneInParticular and others added 2 commits on January 31, 2025 (12:31), including:

* Swapped to GitHub Block formatting, which is a lot better at drawing the user's eye to this note. Thanks Jan for pointing this out! (Co-authored-by: Jan Valosek <[email protected]>)
SomeoneInParticular (Collaborator): No idea why GitHub had a merge conflict during the import; it was solved by just using the import from master.
