Documentation update #26
base: master
Conversation
…ng newlines which simulated soft-wrap poorly)
…xes a slight error in the SampleNullityDrop hook, where it was checking the threshold against sample count (rather than feature count)
…currently only one: SimpleImputation)
…e is currently only one: StandardScaling)
…iKit-Learn's root ModelManager
…he extended discussion of how this implementation works around SciKit-Learn's triple-variant approach!
…ple usage. Woohoo
…r use in the broader context of the framework.
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
    null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
"null values": does it mean NaN?
Can be np.NaN, pd.NA, pd.NaT, null, or None by default (as it's just using Pandas null detection under the hood). The Pandas documentation refers to this as NA though, so maybe we should change 'null' to 'NA' to match?
- null values. For example, with a threshold of 0.5, any samples which are missing more than half of their
+ NA values. For example, with a threshold of 0.5, any samples which are missing more than half of their
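(To ground the discussion: a small sketch showing that Pandas' `isna()` treats all of the markers mentioned above as missing, and how a per-sample NA fraction could be compared against a threshold. Purely illustrative; not the hook's actual implementation.)

```python
# Illustrative only: every marker below registers as "NA" to Pandas' isna().
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],                                               # np.nan
    "b": pd.array([1, pd.NA, 3], dtype="Int64"),                           # pd.NA
    "c": [pd.Timestamp("2024-01-01"), pd.NaT, pd.Timestamp("2024-03-01")], # pd.NaT
    "d": ["x", None, "z"],                                                 # None
})

print(df.isna())               # True wherever a value is missing, whatever the marker
print(df.isna().mean(axis=1))  # per-sample NA fraction, e.g. for comparison against a 0.5 threshold
```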
@@ -72,17 +96,39 @@ def from_config(cls, config: dict, logger: Logger = Logger.root) -> Self:

@registered_data_hook("sample_drop_null")
class SampleNullityDrop(NullityDrop):
    """
    Data hook which will automatically remove samples in the dataset which contain more than some threshold amount of
Just if I understand it correctly: SampleNullityDrop drops rows (samples), while FeatureNullityDrop drops columns (features), right?
Correct; I plan on adding a small documentation section clarifying what "feature" and "sample" mean in the context of this library.
Given this is confusing here though, I'm going to extend the docstrings of data hooks which directly refer to features/samples with their definitions to avoid this.
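(To make the rows-versus-columns distinction concrete, here is a hedged Pandas sketch; the variable names and threshold semantics are assumptions for illustration, not the hooks' actual code.)

```python
# Assumed semantics: drop along one axis or the other based on NA fraction.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [1.7, np.nan, 1.8],
    "weight": [np.nan, np.nan, 70.0],
})
threshold = 0.5

# "Sample" drop: remove rows whose fraction of missing features exceeds the threshold
samples_kept = df.loc[df.isna().mean(axis=1) <= threshold]

# "Feature" drop: remove columns whose fraction of missing samples exceeds the threshold
features_kept = df.loc[:, df.isna().mean(axis=0) <= threshold]

print(samples_kept)   # rows 0 and 2 survive
print(features_kept)  # only the "height" column survives
```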
{
    "type": "imputation_simple",
    "features": ["color", "species"],
    "strategy": "most_common"
Shouldn't it be `most_frequent`? (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
Good catch; dunno why my brain shut down here.
"strategy": "most_common" | |
"strategy": "most_frequent" |
@@ -15,27 +15,22 @@ it can be easily extended to allow for the analysis of any tabular dataset.
    * `conda activate modular_optuna_ml`
    * `mamba activate modular_optuna_ml`
4. Done!
5. This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing` for an example).

NOTE: This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see the `testing` directory for some examples).

## Running the Program

Four files are needed to run an analysis

* A tabular dataset, containing the metrics you want to run the analysis on
- * A tabular dataset, containing the metrics you want to run the analysis on
+ * A tabular dataset (`.tsv` file), containing the metrics you want to run the analysis on
`tsv` is an informal format; while it's supposed to always use the `.tsv` extension, lots of tools just re-use the `.csv` designation (even though the file is still tab-delimited). Going to clarify your suggestion slightly to account for this.
- * A tabular dataset, containing the metrics you want to run the analysis on
+ * A tabular dataset (usually a `.tsv` file), containing the metrics you want to run the analysis on
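(A small hedged example of the point above: regardless of whether the file carries a `.tsv` or `.csv` extension, a tab-delimited dataset can be loaded by naming the separator explicitly. The file name below is just a placeholder.)

```python
# Placeholder file name; the point is the explicit tab separator.
import pandas as pd

df = pd.read_csv("dataset.tsv", sep="\t")  # also handles tab-delimited files named ".csv"
print(df.head())
```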
  determine the hyperparameters to use.
* Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the
  place of a constant; an example of this can be seen in the `penalty` parameter for the
* If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below)
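(As a rough illustration of the "trial tunable" idea described in the diff above: a dictionary in place of a constant tells the framework to ask Optuna for a fresh value on every trial. The config keys and the `resolve_parameter` helper below are assumptions made for illustration, not the framework's actual schema; only the Optuna calls themselves are real API.)

```python
# Hypothetical config resolution; the schema is assumed, the Optuna calls are real.
import optuna

config = {
    "C": 1.0,                                                     # constant: used as-is
    "penalty": {"type": "categorical", "choices": ["l1", "l2"]},  # "trial tunable"
}

def resolve_parameter(name, value, trial: optuna.Trial):
    """Return constants unchanged; ask the trial to suggest a value for tunable dicts."""
    if not isinstance(value, dict):
        return value
    if value["type"] == "categorical":
        return trial.suggest_categorical(name, value["choices"])
    raise ValueError(f"Unsupported tunable type: {value['type']}")

def objective(trial: optuna.Trial) -> float:
    params = {k: resolve_parameter(k, v, trial) for k, v in config.items()}
    # ... fit a model with `params` on the training split and return its validation score ...
    return 0.0

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)
```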
What do you think about adding some explanatory figure (e.g., from your slides)? As you might remember, it took me a while to understand the concepts of replicate, trial, and split.
Yup, this was next on the docket. Just looking into how to set up Sphinx w/ AutoDoc (so we're not locked to GitHub's wiki should they decide to become tosspots in the future).
Thanks a lot for improving the documentation, @SomeoneInParticular! I left a few minor comments and suggestions.
Addendum suggested by valosekj Co-authored-by: Jan Valosek <[email protected]>
Added additional (common) parameter, as suggested by Jan Co-authored-by: Jan Valosek <[email protected]>
Swapped to GitHub Block formatting, which is a lot better at drawing the eye of the user to this note. Thanks Jan for pointing this out! Co-authored-by: Jan Valosek <[email protected]>
No idea why GitHub had a merge conflict during import; it was solved by just using the import from master
Modified suggestion by Jan
…tuall say what it is