From 9020f9ee70938ee5177f3797b3efc6334e14fd3e Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Mon, 13 Jan 2025 15:39:22 -0800
Subject: [PATCH 1/8] updated usage.md with more indepth documentation

---
 docs/pages/usage.md | 377 ++++++++++++++++++++++++++++++--------------
 1 file changed, 256 insertions(+), 121 deletions(-)
diff --git a/docs/pages/usage.md b/docs/pages/usage.md
index 6465d4e0..e9e4c255 100644
--- a/docs/pages/usage.md
+++ b/docs/pages/usage.md
@@ -12,127 +12,262 @@ CoderData is a comprehensive package designed for handling cancer benchmark data
 It offers functionalities to download datasets, load them into Python environments, and reformat them according to user needs.
 
 ## Installation
-To install, confirm that you have python avilable and then run the following command in your terminal:
-
-```bash
-pip install coderdata
-```
-
-## Downloading Data
-The `download` function in CoderData facilitates the downloading of datasets from Figshare. Users can specify a dataset prefix to filter the required files.
-
-### Command Line Usage
-To download data via the command line, execute the following command:
-<div class="code-box">
-    <p>coderdata download --prefix [PREFIX]</p>
-</div>
-Replace [PREFIX] with the desired dataset prefix (e.g., 'hcmi', 'beataml'). Omit the prefix argument to download all available datasets.
-
-### Python Usage
-In Python, the download process is handled through the `download_data_by_prefix` function from the downloader module.
-<div class="code-box">
-    <p>import coderdata as cd</p>
-    <p><span class="code-comment"># Download a specific dataset</span></p>
-    <p>cd.download_data_by_prefix('beataml')</p>
-    <p><span class="code-comment"># Download all datasets</span></p>
-    <p>cd.download_data_by_prefix()</p>
-</div>
-
-## Loading Data
-The `DatasetLoader` class in CoderData is designed for loading datasets into Python.  
-It automatically initializes attributes for each dataset type like transcriptomics, proteomics, and mutations.
-<div class="code-box">
-    <p>import coderdata as cd</p>
-    <p><span class="code-comment"># Initialize the DatasetLoader for a specific dataset type</span></p>
-    <p>broad_sanger = cd.DatasetLoader('broad_sanger')</p>
-    <p><span class="code-comment"># Access pandas formatted preview of the samples data</span></p>
-    <p>broad_sanger.samples</p>
-    <p><span class="code-comment"># Access pandas formatted preview of each data type</span></p>
-    <p>broad_sanger.transcriptomics</p>
-    <p>broad_sanger.proteomics</p>
-    <p>broad_sanger.pertubations</p>
-    <p>broad_sanger.mutations</p>
-    <p>broad_sanger.copy_number</p>
-    <p>broad_sanger.drugs</p>
-    <p>broad_sanger.experiments</p>
-    <p>broad_sanger.genes</p>
-</div>
-
-## Joining Datasets
-The `join_datasets` function in CoderData is designed for joining and loading datasets in Python with the most flexibility possible.
-It is capable of joining initialized, previously joined, or non-initialized datasets. This means you may modify a dataset before joining it with another.
-<div class="code-box">
-    <p>import coderdata as cd</p>
-    <p><span class="code-comment"># Initialize the DatasetLoader for a specific dataset type</span></p>
-    <p>hcmi = cd.DatasetLoader('hcmi')</p>
-    <p><span class="code-comment"># Access a datatype of the loaded dataset</span></p>
-    <p>beataml = cd.DatasetLoader('beataml')</p>
-    <p><span class="code-comment"># Join two previously initialized datasets</span></p>
-    <p>joined_dataset1 = cd.join_datasets(beataml, hcmi)</p>
-    <p><span class="code-comment"># Join a previously joined dataset with a non-initialized dataset</span></p>
-    <p><span class="code-comment"># Quotes around a dataset name will load from local files using the DatasetLoader function.</span></p>
-    <p>joined_dataset2 = cd.join_datasets(joined_dataset1, "broad_sanger")</p>
-    <p><span class="code-comment"># Join multiple datasets using every method available</span></p>
-    <p>joined_dataset3 = cd.join_datasets("broad_sanger", beataml)</p>
-    <p>joined_dataset4 = cd.join_datasets(joined_dataset3, "cptac", hcmi)</p>
-</div>
-
-## Reformatting Datasets
-You can reformat datasets into long or wide formats using the `reformat_dataset` method. By default, data is in the long format.  
-Reformatting from long to wide retains three data types, entrez_id and improve_sample_id, value of interest (such as transcriptomics).  
-Datasets cannot be joined while there is a datatype in the wide format.
-<div class="code-box">
-    <p>import coderdata as cd</p>
-    <p><span class="code-comment"># Reformat a specific dataset</span></p>
-    <p>hcmi.reformat_dataset('transcriptomics', 'wide') </p>
-    <p><span class="code-comment"># Reformat all datasets</span></p>
-    <p>hcmi.reformat_dataset('wide')</p>
-    <p><span class="code-comment"># Reformat all datatypes back to 'long' datasets</span></p>
-    <p>hcmi.reformat_dataset('long') </p>
-</div>
-
-## Reloading Datasets
-The `reload_datasets` method is useful for reloading specific datasets or all datasets from local storage, especially if the data files have been updated or altered.
-<div class="code-box">
-    <p>import coderdata as cd</p>
-    <p><span class="code-comment"># Reload a specific dataset</span></p>
-    <p>hcmi.reload_datasets('transcriptomics')</p>
-    <p><span class="code-comment"># Reload all datasets</span></p>
-    <p>hcmi.reload_datasets()</p>
-</div>
-
-## Info Function 
-The `info` method tells you which datatypes are available, their long/wide format, and which datasets they came from.
-<div class="code-box">
-    <span class="code-comment"># Get information about the joined datasets</span><br>
-    joined_dataset4.info()<br>
-    <span class="code-comment"># The output is as follows - </span><br>
-    <span class="code-comment">
-    This is a joined dataset comprising of:<br>
-    - beataml: Beat acute myeloid leukemia (BeatAML) data was collected though GitHub and Synapse.<br>
-    - hcmi: Human Cancer Models Initiative (HCMI) data was collected though the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal.<br>
-    - broad_sanger: The cell line datasets were collected from numerous resources such as the LINCS project, broad_sanger, and the Sanger Institute.<br>
-    - cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI).<br>
-
-    Available Datatypes and Their Formats<br>
-    - copy_number: long format<br>
-    - mutations: long format<br>
-    - proteomics: long format<br>
-    - samples: long format<br>
-    - transcriptomics: long format<br>
-    - drugs: long format<br>
-    - experiments: long format<br>
-
-    Datatype Origins:<br>
-    - proteomics: Data from beataml, broad_sanger, cptac<br>
-    - transcriptomics: Data from beataml, broad_sanger, hcmi, cptac<br>
-    - copy_number: Data from broad_sanger, hcmi, cptac<br>
-    - mutations: Data from beataml, broad_sanger, hcmi, cptac<br>
-    - samples: Data from beataml, broad_sanger, hcmi, cptac<br>
-    - drugs: Data from beataml, broad_sanger<br>
-    - experiments: Data from beataml, broad_sanger<br>
-    </span>
-</div>
+`coderdata` requires `python>=3.9` to be installed. The installed version can be checked via
+```shell
+$ python --version
+Python 3.13.1
+```
+If a Python version older than 3.9 is installed, please refer to the instructions at [python.org](https://www.python.org/about/gettingstarted/#installing) on how to install / update Python.
+
+The preferred way to install `coderdata` is via `pip`. Executing the command below will install the most recent published version of `coderdata` including all required dependencies.
+```shell
+$ pip install coderdata
+```
+
+To check whether the package has been successfully installed, open an interactive Python terminal and import the package. See an example of what to expect below.
+```python
+>>> import coderdata as cd
+>>> cd.__version__
+'0.1.40'
+```
+
+## Usage
+The primary way to interact with coderdata is through the `coderdata` API. Additionally, a command line interface with limited functionality (primarily to download data) is available.
+
+### CLI
+Invoking `coderdata` from the command line will by default print a help / usage message and exit (see below):
+```sh
+$ coderdata
+usage: coderdata [-h] [-l | -v] {download} ...
+
+options:
+  -h, --help     show this help message and exit
+  -l, --list     prints list of available datasets and exits program.
+  -v, --version  prints the versions of the coderdata API and dataset and exits the program
+
+commands:
+  {download}
+    download     subroutine to download datasets. See "coderdata download -h" for more options.
+```
+
+The primary use case of the CLI is to retrieve datasets from the repository. This can be done by invoking the `download` routine of `coderdata`. Without defining a specific dataset the whole repository will be downloaded:
+```sh
+$ coderdata download
+Downloaded 'https://ndownloader.figshare.com/files/48032953' to '/tmp/beataml_drugs.tsv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48032962' to '/tmp/mpnst_drugs.tsv.gz'
+...
+```
+
+Downloading a specific dataset can be achieved by passing the `-n/--name` argument to the `download` routine:
+```sh
+$ coderdata download --name beataml
+Downloaded 'https://ndownloader.figshare.com/files/48032953' to 'beataml_drugs.tsv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48032959' to 'beataml_samples.csv'
+...
+```
+
+A full list of available arguments of the `download` function, including a short explanation of each, can be retrieved via the command shown below:
+```sh
+$ coderdata download -h
+usage: coderdata download [-h] [-n DATASET_NAME] [-p LOCAL_PATH] [-o]
+
+options:
+  -h, --help            show this help message and exit
+  -n, --name DATASET_NAME
+                        name of the dataset to download (e.g., "beataml"). Alternatively, "all" will download the full repository of coderdata datasets. See "coderdata --list" for a
+                        complete list of available datasets. Defaults to "all"
+  -p, --local_path LOCAL_PATH
+                        defines the folder the datasets should be stored in. Defaults to the current working directory if omitted.
+  -o, --overwrite       allow dataset files to be overwritten if they already exist.
+```
+
+In addition to the `download` functionality, the CLI currently supports displaying basic information such as the version numbers of the package and the dataset (see example call below)
+```sh
+$ coderdata --version
+package version: 0.1.40
+dataset version: 0.1.4
+```
+as well as listing the datasets that are available for download (example output below)
+```sh
+$ coderdata --list
+
+Available datasets
+------------------
+
+beataml: Beat acute myeloid leukemia (BeatAML) focuses on acute myeloid leukemia tumor data. Data includes drug response, proteomics, and transcriptomics datasets.
+cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI) focused on improving our understanding of cancer biology through the integration of transcriptomic, proteomic, and genomic data.
+hcmi: Human Cancer Models Initiative (HCMI) encompasses numerous cancer types and includes cell line, organoid, and tumor data. Data includes the transcriptomics, somatic mutation, and copy number datasets.
+mpnst: Malignant Peripheral Nerve Sheath Tumor is a rare, agressive sarcoma that affects peripheral nerves throughout the body.
+
+------------------
+
+To download individual datasets run "coderdata download -name DATASET_NAME" where "DATASET_NAME" is for example "beataml".
+```
+
+### API
+
+#### Downloading data
+Using the `coderdata` API, the download process is handled through the `download` function in the downloader module.
+```python
+>>> import coderdata as cd
+>>> cd.download(name='beataml')
+Downloaded 'https://ndownloader.figshare.com/files/48032953' to 'beataml_drugs.tsv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48032959' to 'beataml_samples.csv'
+Downloaded 'https://ndownloader.figshare.com/files/48032965' to 'beataml_mutations.csv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48032968' to 'beataml_proteomics.csv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48032974' to 'beataml_experiments.tsv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48033052' to 'beataml_transcriptomics.csv.gz'
+Downloaded 'https://ndownloader.figshare.com/files/48033058' to 'beataml_drug_descriptors.tsv.gz'
+```
+As with the CLI download functionality, the local path in which to store the downloaded files, as well as a flag that defines whether existing files should be overwritten, can be defined in the `download()` function. For example, the function call below will download all 'BeatAML' related datasets to the local path `/tmp/coderdata/` and will overwrite files if they already exist.
+```python
+>>> cd.download(name='beataml', local_path='/tmp/coderdata/', exist_ok=True)
+```
+Note that if `exist_ok=False` (the default if omitted) and a file to be downloaded already exists, a warning will be given and the file won't be stored. Finally, if all datasets should be downloaded, the `name` argument can be set manually to `name='all'` or omitted altogether, as `name` defaults to `'all'`.
+
+#### The `Dataset` object
+
+The `Dataset` object is the central data structure in CoderData. It automatically initializes attributes for each dataset type like tumor samples, drug response data, as well as associated omics data like proteomics. Each datatype in a `Dataset` is internally stored in a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/frame.html).
+
+##### Loading data into a `Dataset` object
+The code snippet below will load the [previously downloaded](#downloading-data) 'BeatAML' dataset into a `Dataset` object called `beataml`.
+
+```python
+>>> beataml = cd.load(name='beataml', local_path='/tmp/coderdata')
+Importing raw data ...
+Importing 'transcriptomics' from /tmp/coderdata/beataml_transcriptomics.csv.gz ... DONE
+Importing 'drugs' from /tmp/coderdata/beataml_drugs.tsv.gz ... DONE
+Importing 'proteomics' from /tmp/coderdata/beataml_proteomics.csv.gz ... DONE
+Importing 'drug_descriptors' from /tmp/coderdata/beataml_drug_descriptors.tsv.gz ... DONE
+Importing 'mutations' from /tmp/coderdata/beataml_mutations.csv.gz ... DONE
+Importing 'samples' from /tmp/coderdata/beataml_samples.csv ... DONE
+Importing 'experiments' from /tmp/coderdata/beataml_experiments.tsv.gz ... DONE
+Importing raw data ... DONE
+```
+
+Additionally, the `load()` function also allows for loading data from a previously pickled `Dataset` object (see [Saving manipulated `Dataset` objects](#saving-manipulated-dataset-objects)).
+
+##### Displaying the datatypes in a `Dataset` object
+
+The data types associated with a dataset can be displayed via the `Dataset.types()` function. The function will return a simple list of available datatypes.
+```python
+>>> beataml.types()
+['transcriptomics', 'proteomics', 'mutations', 'samples', 'drugs', 'experiments']
+```
+Individual datatypes can be addressed and manipulated by accessing them as attributes of the dataset. For example, extracting the underlying `pandas.DataFrame` that contains drug response values for 'BeatAML' can be done via the command below:
+```python
+>>> beataml.experiments
+         source  improve_sample_id improve_drug_id    study  time time_unit dose_response_metric  dose_response_value
+0       synapse               3907       SMI_11123  BeatAML    72       hrs              fit_auc               0.0564
+1       synapse               3907       SMI_11211  BeatAML    72       hrs              fit_auc               0.9621
+2       synapse               3907       SMI_12192  BeatAML    72       hrs              fit_auc               0.1691
+3       synapse               3907       SMI_12254  BeatAML    72       hrs              fit_auc               0.4245
+4       synapse               3907       SMI_12469  BeatAML    72       hrs              fit_auc               0.7397
+...         ...                ...             ...      ...   ...       ...                  ...                  ...
+233775  synapse               3626        SMI_7110  BeatAML    72       hrs                  dss               0.0000
+233776  synapse               3626        SMI_7590  BeatAML    72       hrs                  dss               0.0000
+233777  synapse               3626        SMI_8159  BeatAML    72       hrs                  dss               0.1946
+233778  synapse               3626        SMI_8724  BeatAML    72       hrs                  dss               0.0000
+233779  synapse               3626         SMI_987  BeatAML    72       hrs                  dss               0.7165
+
+[233780 rows x 8 columns]
+```
+
+##### Reformatting and exporting datatypes
+
+Internally all data is stored in long format. If different formats are needed for further analysis or as input for the training of machine learning models, the `Dataset.format(data_type, **kwargs)` function is able to return individual data types in altered formats.
+
+For example the drug response data can be reformatted into wide format via the following command:
+```python
+>>> beataml.format(data_type='experiments', shape='wide', metrics=['fit_auc', 'dss'])
+        source  improve_sample_id improve_drug_id    study  time time_unit     dss  fit_auc
+0      synapse               3190       SMI_11123  BeatAML    72       hrs  0.4244   0.5447
+1      synapse               3190       SMI_12192  BeatAML    72       hrs  0.2782   0.4848
+2      synapse               3190       SMI_12254  BeatAML    72       hrs  0.0000   0.5872
+3      synapse               3190       SMI_12469  BeatAML    72       hrs  0.2973   0.4435
+4      synapse               3190       SMI_12953  BeatAML    72       hrs  0.0000   0.5566
+...        ...                ...             ...      ...   ...       ...     ...      ...
+23373  synapse               3916        SMI_7590  BeatAML    72       hrs  0.4537   0.5689
+23374  synapse               3916        SMI_8063  BeatAML    72       hrs  0.0000   0.5640
+23375  synapse               3916        SMI_8159  BeatAML    72       hrs  0.0000   0.5340
+23376  synapse               3916        SMI_8724  BeatAML    72       hrs  0.7033   0.7172
+23377  synapse               3916         SMI_987  BeatAML    72       hrs  0.0000   0.4842
+```
+Note that the `Dataset.format(data_type, **kwargs)` function behaves slightly differently for different `data_type` values. For example, for `data_type='experiments'` the accepted keyword arguments are `shape` & `metrics`. `shape` defines which format the resulting `pandas.DataFrame` should be in (e.g. `long`, `wide` or `matrix`). `metrics` defines the drug response metrics that should be filtered for.
+
+A full list of parameters for the individual data types can be found below:
+- `Dataset.format(data_type='transcriptomics')` returns a `matrix` like `pandas.DataFrame` where each cell contains the measured transcriptomics value for a gene (row - `entrez_id`) in a specific cancer sample (column - `improve_sample_id`).
+- `Dataset.format(data_type='mutations', mutation_type=...)` will return a binary `matrix` like `pandas.DataFrame` with rows representing genes and columns representing samples. `mutation_type` can be any of the recorded mutation types available (e.g. `'Frame_Shift_Del'`, `'Frame_Shift_Ins'`, `'Missense_Mutation'` or `'Start_Codon_SNP'` among others). Cells contain the value of `1` if a mutation in a given gene/sample falls into the category defined by `mutation_type`.
+- `Dataset.format(data_type='copy_number', copy_call=False)` returns a `matrix` like `pandas.DataFrame` where cells report the `mean` copy number value for each combination of gene (row - `entrez_id`) and cancer sample (column - `improve_sample_id`). If `copy_call=True` cells report the discretized measurement ('deep del', 'het loss', 'diploid', 'gain', 'amp') of copy number provided by the schema.
+- `Dataset.format(data_type='proteomics')` returns a `matrix` like `pandas.DataFrame` where each cell contains the measured proteomics value for a gene (row - `entrez_id`) in a specific cancer sample (column - `improve_sample_id`).
+- `Dataset.format(data_type='experiments', shape=..., metrics=...)` returns a `pandas.DataFrame` formatted according to the defined `shape` (`shape` can take the values `'long'`, `'wide'` and `'matrix'`). `metrics` further defines which drug response metrics the resulting output `DataFrame` should be filtered for. Examples are `'fit_auc'`, `'fit_ec50'` or `'dss'`. If `shape='wide'`, a list containing more than one value can be passed to `metrics`.
+- `Dataset.format(data_type='drug_descriptor', shape=..., drug_descriptor_type=...)` returns a `pandas.DataFrame` formatted either in `long` or `wide` (depending on the `shape` argument). `drug_descriptor_type` can be defined as a list of desired `structural_descriptors` in conjunction with `shape='wide'`, to limit the resulting `DataFrame` to only list the desired `structural_descriptors` as columns.
+- `Dataset.format(data_type='drugs')` is equal to `Dataset.drugs`. It returns the underlying `pandas.DataFrame` containing the drug information.
+- `Dataset.format(data_type='genes')` is equal to `Dataset.genes`. It returns the underlying `pandas.DataFrame` containing the gene information.
+- `Dataset.format(data_type='samples')` is equal to `Dataset.samples`. It returns the underlying `pandas.DataFrame` containing the cancer sample data information.
+
+##### Creating training / testing and validation splits with `coderdata`
+
+Using the `Dataset.train_test_validate()` function, the dataset can be split into training, testing and validation sets. The function will return a `Split` object (a Python `@dataclass`) that contains three `Dataset` objects, which can be addressed and retrieved via the attributes `Split.train`, `Split.test` and `Split.validate`.
+
+```python
+>>> split = beataml.train_test_validate()
+>>> split.train.experiments.shape
+(187020, 8)
+>>> split.test.experiments.shape
+(23380, 8)
+>>> split.validate.experiments.shape
+(23380, 8)
+```
+
+By default the returned splits will be `mixed-set` (drugs and cancer samples can appear in all three folds), with a ratio of 8:1:1, no stratification and no set random state (seed). This behaviour can be changed by passing `split_type`, `ratio`, `stratify_by` and `random_state` to the function.
+
+`split_type` can be either `'mixed-set'`, `'drug-blind'` or `'cancer-blind'`:
+- `mixed-set`: Splits randomly, independent of the drug / cancer association of the samples. Individual drugs or cancer types can appear in all three splits.
+- `drug-blind`: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example samples with association to drug A will only be present in the train split, but never in test or validate.
+- `cancer-blind`: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.
+
+`ratio` can be used to adjust the split ratios using a 3 item tuple containing integers. For example `ratio=(5, 3, 2)` would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data respectively.
+
+`random_state` defines a seed value for the random number generator. Defining a `random_state` will guarantee reproducibility, as two runs with the same `random_state` will result in the same splits.
+
+`stratify_by` defines whether the training, testing and validation sets should be stratified. Stratification tries to maintain a similar distribution of feature classes across different splits. For example, assuming a drug response value threshold that defines positive and negative classes (e.g. reduced vs. no change in cancer cell viability), the splitting algorithm could attempt to assign the same ratio of positive to negative class instances to each split. Stratification is performed by `drug_response_value`. Any value other than `None` indicates stratification and defines which `drug_response_value` should be used as the basis for the stratification. `None` indicates that no stratification should be performed. Which type of stratification should be performed can be further customized with keyword arguments (`thresh`, `num_classes`, `quantiles`).
+
+An example call to create a 70/20/10 drug-blind split that is stratified by `fit_auc` could look like this:
+```python
+>>> split = beataml.train_test_validate(
+...     split_type='drug-blind',
+...     ratio=[7,2,1],
+...     random_state=42,
+...     stratify_by='fit_auc',
+...     thresh=0.8
+...     )
+>>> split.train.experiments.shape
+(154840, 8)
+>>> split.test.experiments.shape
+(65750, 8)
+>>> split.validate.experiments.shape
+(13190, 8)
+```
+
+##### Saving manipulated `Dataset` objects (e.g. saving splits)
+In order to save a `Dataset` for later use, the `Dataset.save()` function can be used.
+
+```python
+>>> split.train.save(path='/tmp/coderdata/beataml_train.pickle')
+>>> split.test.save(path='/tmp/coderdata/beataml_test.pickle')
+>>> split.validate.save(path='/tmp/coderdata/beataml_validate.pickle')
+```
+
+This function can be used to save either the individual splits (as demonstrated above) or the raw `Dataset` that was the basis for the splits, for example if any modifications of the dataset were performed.
+
+To reload the splits (or the full dataset) the `coderdata.load()` function (see also [Loading data into a `Dataset` object](#loading-data-into-a-dataset-object)) can be used. To load a pickled `Dataset`, the argument `from_pickle=True` must be passed to the function:
+
+```python
+>>> beataml_train = cd.load('beataml_train', local_path='/tmp/coderdata/', from_pickle=True)
+Importing pickled data ... DONE
+```
+Note that only individual splits (e.g. only train) can be saved and loaded and not the full `Split` object.
 
 ## Conclusion
 CoderData provides a robust and flexible way to work with cancer benchmark data.   

From 9a4c24b4ecc78d3fa8417c1b7b68d4943392d59f Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Tue, 14 Jan 2025 09:49:26 -0800
Subject: [PATCH 2/8] updates to index.md

---
 docs/index.md | 60 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 740e9563..e5d218a2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -7,42 +7,62 @@ title: CoderData
 
 <!-- # Cancer Omics and Drug Experiment Response Data (`coderdata`) Python Package -->
 
-### Introduction
+## Introduction
 CoderData is a cancer benchmark data package developed in Python and R. 
 There are two aspects of this package, the backend build section and the user facing python package.
 The build section is a github workflow that generates four cancer datasets in a format that is easy for users and algorithms to ingest. 
 The python package allows users to easily download the data, load it into python and reformat it as desired.
 
-### Installation and Usage
-##### Bash / Command Line
+## Installation and Usage
+### Installation
 
-To install coderdata, simply run the following command in your terminal:
+Assuming `python>=3.9` is installed on the system, simply run the following command in the terminal to install the most recent release of the coderdata API:
 
 ```bash
-pip install coderdata
+$ pip install coderdata
 ```
 
-##### Bash / Command line
-To download datasets, simply run the following command in your terminal. Remove the prefix argument if you'd like to install all datasets.
+### Bash / Command line
+A full list of available datasets can be retrieved via:
+```sh
+$ coderdata --list
+```
+
+To download datasets, simply run the following command in your terminal, substituting `<DATASET>` with the desired dataset (e.g. `beataml`). To download all datasets use `--name all`.
 
 ```bash
-coderdata download --prefix hcmi
+$ coderdata download --name <DATASET>
 ```
 
-##### Python
-To download, load, and call datasets in python, simply run the following commands. 
-
-<div class="code-box">
-    <p>import coderdata as cd </p>
-    <p>cd.download_data_by_prefix('hcmi')</p>
-    <p>hcmi_data = cd.DatasetLoader('hcmi')</p>
-    <p>hcmi_data.transcriptomics</p>
-</div>
+### Python
+
+To download, load, and call datasets in python, simply run the following commands.
+
+```python
+>>> import coderdata as cd
+>>> cd.download(name='beataml')
+>>> beataml = cd.load('beataml')
+>>> beataml.experiments
+         source  improve_sample_id improve_drug_id    study  time time_unit dose_response_metric  dose_response_value
+0       synapse               3907       SMI_11123  BeatAML    72       hrs              fit_auc               0.0564
+1       synapse               3907       SMI_11211  BeatAML    72       hrs              fit_auc               0.9621
+2       synapse               3907       SMI_12192  BeatAML    72       hrs              fit_auc               0.1691
+3       synapse               3907       SMI_12254  BeatAML    72       hrs              fit_auc               0.4245
+4       synapse               3907       SMI_12469  BeatAML    72       hrs              fit_auc               0.7397
+...         ...                ...             ...      ...   ...       ...                  ...                  ...
+233775  synapse               3626        SMI_7110  BeatAML    72       hrs                  dss               0.0000
+233776  synapse               3626        SMI_7590  BeatAML    72       hrs                  dss               0.0000
+233777  synapse               3626        SMI_8159  BeatAML    72       hrs                  dss               0.1946
+233778  synapse               3626        SMI_8724  BeatAML    72       hrs                  dss               0.0000
+233779  synapse               3626         SMI_987  BeatAML    72       hrs                  dss               0.7165
+
+[233780 rows x 8 columns]
+```
 
-View our [Usage](pages/usage.md) page for full instructions.
+For more in-depth instructions, view our [Usage](pages/usage.md) page.
 
 
-### Datasets
+## Datasets
 
 <table>
   <thead>
@@ -171,7 +191,7 @@ View our [Usage](pages/usage.md) page for full instructions.
 </div> -->
 
 
-### Data Overview
+## Data Overview
 
 <div class="flex-container"> 
     <div class="flex-item">

From 5ad64217c89081080898b20033f48f357305e7f6 Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Thu, 23 Jan 2025 11:05:26 -0800
Subject: [PATCH 3/8] test use of html tags for code blocks

---
 docs/pages/usage.md | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/docs/pages/usage.md b/docs/pages/usage.md
index e9e4c255..f1773720 100644
--- a/docs/pages/usage.md
+++ b/docs/pages/usage.md
@@ -13,9 +13,10 @@ It offers functionalities to download datasets, load them into Python environmen
 
 ## Installation
 `coderdata` requires `python>=3.9` to be installed. The installed version can be checked via
-```shell
-$ python --version
-Python 3.13.1
+<div class="code-block">
+  <p>$ python --version</p>
+  <p>Python 3.13.1</p>
+</div>
 ```
 If a Python version older than 3.9 is installed, please refer to the instructions at [python.org](https://www.python.org/about/gettingstarted/#installing) on how to install / update Python.
 
@@ -36,19 +37,20 @@ The primary way to interact with coderdata is through the `coderdata` API. Addit
 
 ### CLI
 Invoking `coderdata` from the command line will by default print a help / usage message and exit (see below):
-```sh
-$ coderdata
-usage: coderdata [-h] [-l | -v] {download} ...
+<div class="code-block">
+<p>$ coderdata</p>
+<p>usage: coderdata [-h] [-l | -v] {download} ...</p>
+<p></p>
+<p>options:</p>
+<p>  -h, --help     show this help message and exit</p>
+<p>  -l, --list     prints list of available datasets and exits program.</p>
+<p>  -v, --version  prints the versions of the coderdata API and dataset and exits the program</p>
+<p></p>
+<p>commands:</p>
+<p>  {download}</p>
+<p>    download     subroutine to download datasets. See "coderdata download -h" for more options.</p>
+</div>
 
-options:
-  -h, --help     show this help message and exit
-  -l, --list     prints list of available datasets and exits program.
-  -v, --version  prints the versions of the coderdata API and dataset and exits the program
-
-commands:
-  {download}
-    download     subroutine to download datasets. See "coderdata download -h" for more options.
-```
 
 The primary use case of the CLI is to retrieve dataset from the repository. This can be done by invoking the `download` routine of `coderdata`. Without defining a specific dataset the whole repository will be downloaded:
 ```sh

From 128fbdd55c9c976c885f283bc159b1b4ce1124e9 Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Thu, 23 Jan 2025 11:10:52 -0800
Subject: [PATCH 4/8] added test

---
 docs/pages/usage.md | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/pages/usage.md b/docs/pages/usage.md
index f1773720..d91f05b5 100644
--- a/docs/pages/usage.md
+++ b/docs/pages/usage.md
@@ -17,20 +17,20 @@ It offers functionalities to download datasets, load them into Python environmen
   <p>$ python --version</p>
   <p>Python 3.13.1</p>
 </div>
-```
+
 If a Python version older that 3.9 is installed please referr to the instruction at [python.org](https://www.python.org/about/gettingstarted/#installing) on how to install / update Python.
 
 The preferred way to install `coderdata` is via `pip`. Executing the command below will install the most recent published version of `coderdata` including all required dependencies.
-```shell
-$ pip install coderdata
-```
+<div class="code-block">
+  <p>$ pip install coderdata</p>
+</div>
 
 To check if the package has been sucessfully installed open an interactive python termial and import the package. See an example of what to expect below.
-```python
->>> import coderdata as cd
->>> cd.__version__
-'0.1.40'
-```
+<div class="code-block">
+  <p>\>\>\> import coderdata as cd</p>
+  <p>\>\>\> cd.__version__</p>
+  <p>'0.1.40'</p>
+</div>
 
 ## Usage
 The primary way to interact with coderdata is through the `coderdata` API. Additionally a command line interface with limited functionality (primarily to download data) is also available.
@@ -38,17 +38,17 @@ The primary way to interact with coderdata is through the `coderdata` API. Addit
 ### CLI
 Invoking `coderdata` from the command line will by default print a help / usage message and exit (see below):
 <div class="code-block">
-<p>$ coderdata</p>
-<p>usage: coderdata [-h] [-l | -v] {download} ...</p>
-<p></p>
-<p>options:</p>
-<p>  -h, --help     show this help message and exit</p>
-<p>  -l, --list     prints list of available datasets and exits program.</p>
-<p>  -v, --version  prints the versions of the coderdata API and dataset and exits the program</p>
-<p></p>
-<p>commands:</p>
-<p>  {download}</p>
-<p>    download     subroutine to download datasets. See "coderdata download -h" for more options.</p>
+  <p>$ coderdata</p>
+  <p>usage: coderdata [-h] [-l | -v] {download} ...</p>
+  <p></p>
+  <p>options:</p>
+  <p>  -h, --help     show this help message and exit</p>
+  <p>  -l, --list     prints list of available datasets and exits program.</p>
+  <p>  -v, --version  prints the versions of the coderdata API and dataset and exits the program</p>
+  <p></p>
+  <p>commands:</p>
+  <p>  {download}</p>
+  <p>    download     subroutine to download datasets. See "coderdata download -h" for more options.</p>
 </div>
 
 

From faafd4b411130bb64e5e206e2108d181ed4cf4bd Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Thu, 23 Jan 2025 11:40:31 -0800
Subject: [PATCH 5/8] revert changes

---
 docs/pages/usage.md | 52 ++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/docs/pages/usage.md b/docs/pages/usage.md
index d91f05b5..0af09601 100644
--- a/docs/pages/usage.md
+++ b/docs/pages/usage.md
@@ -13,44 +13,42 @@ It offers functionalities to download datasets, load them into Python environmen
 
 ## Installation
 `coderdata` requires `python>=3.9` to be installed. The installed version can be checked via
-<div class="code-block">
-  <p>$ python --version</p>
-  <p>Python 3.13.1</p>
-</div>
-
+```shell
+$ python --version
+Python 3.13.1
+```
 If a Python version older that 3.9 is installed please referr to the instruction at [python.org](https://www.python.org/about/gettingstarted/#installing) on how to install / update Python.
 
 The preferred way to install `coderdata` is via `pip`. Executing the command below will install the most recent published version of `coderdata` including all required dependencies.
-<div class="code-block">
-  <p>$ pip install coderdata</p>
-</div>
+```shell
+$ pip install coderdata
+```
 
 To check if the package has been sucessfully installed open an interactive python termial and import the package. See an example of what to expect below.
-<div class="code-block">
-  <p>\>\>\> import coderdata as cd</p>
-  <p>\>\>\> cd.__version__</p>
-  <p>'0.1.40'</p>
-</div>
+```python
+>>> import coderdata as cd
+>>> cd.__version__
+'0.1.40'
+```
 
 ## Usage
 The primary way to interact with coderdata is through the `coderdata` API. Additionally a command line interface with limited functionality (primarily to download data) is also available.
 
 ### CLI
 Invoking `coderdata` from the command line will by default print a help / usage message and exit (see below):
-<div class="code-block">
-  <p>$ coderdata</p>
-  <p>usage: coderdata [-h] [-l | -v] {download} ...</p>
-  <p></p>
-  <p>options:</p>
-  <p>  -h, --help     show this help message and exit</p>
-  <p>  -l, --list     prints list of available datasets and exits program.</p>
-  <p>  -v, --version  prints the versions of the coderdata API and dataset and exits the program</p>
-  <p></p>
-  <p>commands:</p>
-  <p>  {download}</p>
-  <p>    download     subroutine to download datasets. See "coderdata download -h" for more options.</p>
-</div>
+```sh
+$ coderdata
+usage: coderdata [-h] [-l | -v] {download} ...
 
+options:
+  -h, --help     show this help message and exit
+  -l, --list     prints list of available datasets and exits program.
+  -v, --version  prints the versions of the coderdata API and dataset and exits the program
+
+commands:
+  {download}
+    download     subroutine to download datasets. See "coderdata download -h" for more options.
+```
 
 The primary use case of the CLI is to retrieve dataset from the repository. This can be done by invoking the `download` routine of `coderdata`. Without defining a specific dataset the whole repository will be downloaded:
 ```sh
@@ -273,4 +271,4 @@ Note that only individual splits (e.g. only train) can be saved and loaded and n
 
 ## Conclusion
 CoderData provides a robust and flexible way to work with cancer benchmark data.   
-By using these functionalities, researchers and data scientists can easily manipulate and analyze complex datasets in their Python environments
+By using these functionalities, researchers and data scientists can easily manipulate and analyze complex datasets in their Python environments
\ No newline at end of file

From 69a1e90777ec012b40bc6bfe11615d0a9f30621a Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Thu, 23 Jan 2025 14:19:10 -0800
Subject: [PATCH 6/8] updated css

---
 docs/assets/css/style.css | 75 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/docs/assets/css/style.css b/docs/assets/css/style.css
index a70656df..5ca50529 100644
--- a/docs/assets/css/style.css
+++ b/docs/assets/css/style.css
@@ -147,7 +147,8 @@ code {
     
 }
 
-/* pre > code.language-bash {
+
+pre > code.language-bash {
     background-color: white; 
     width: 60%;
     border: 1px solid #ccc; 
@@ -162,7 +163,7 @@ pre > code.language-python {
     display: block; 
     overflow-x: auto; 
     border: 1px solid #ccc;
-} */
+}
 /* 
 .hamburger {
     display: none;
@@ -230,7 +231,7 @@ pre > code.language-python {
  */
 
 
- .code-box {
+.code-box {
     background-color: white;
     border: 1px solid #ccc;
     font-family: monospace; /* Gives the text a code-like appearance */
@@ -247,7 +248,7 @@ pre > code.language-python {
 
 .flex-container {
     display: flex;
-    justify-content: space-around; /* Or use 'center' if you prefer */
+    justify-content: space-around; /* Or use 'center' if you prefer */
     align-items: center;
     flex-wrap: wrap; /* Allows items to wrap onto the next line on smaller screens */
     width: 100%;
@@ -476,4 +477,68 @@ th {
         font-size: 16px; 
     }
 
-}
\ No newline at end of file
+}
+
+/* GitHub-style syntax highlighting */
+
+.highlight .hll { background-color: #ffffcc }
+.highlight .c { color: #999988; font-style: italic } /* Comment */
+.highlight .err { color: #a61717; background-color: #e3d2d2 } /* Error */
+.highlight .k { color: #000000; font-weight: bold } /* Keyword */
+.highlight .o { color: #000000; font-weight: bold } /* Operator */
+.highlight .cm { color: #999988; font-style: italic } /* Comment.Multiline */
+.highlight .cp { color: #999999; font-weight: bold; font-style: italic } /* Comment.Preproc */
+.highlight .c1 { color: #999988; font-style: italic } /* Comment.Single */
+.highlight .cs { color: #999999; font-weight: bold; font-style: italic } /* Comment.Special */
+.highlight .gd { color: #000000; background-color: #ffdddd } /* Generic.Deleted */
+.highlight .ge { color: #000000; font-style: italic } /* Generic.Emph */
+.highlight .gr { color: #aa0000 } /* Generic.Error */
+.highlight .gh { color: #999999 } /* Generic.Heading */
+.highlight .gi { color: #000000; background-color: #ddffdd } /* Generic.Inserted */
+.highlight .go { color: #888888 } /* Generic.Output */
+.highlight .gp { color: #555555 } /* Generic.Prompt */
+.highlight .gs { font-weight: bold } /* Generic.Strong */
+.highlight .gu { color: #aaaaaa } /* Generic.Subheading */
+.highlight .gt { color: #aa0000 } /* Generic.Traceback */
+.highlight .kc { color: #000000; font-weight: bold } /* Keyword.Constant */
+.highlight .kd { color: #000000; font-weight: bold } /* Keyword.Declaration */
+.highlight .kn { color: #000000; font-weight: bold } /* Keyword.Namespace */
+.highlight .kp { color: #000000; font-weight: bold } /* Keyword.Pseudo */
+.highlight .kr { color: #000000; font-weight: bold } /* Keyword.Reserved */
+.highlight .kt { color: #445588; font-weight: bold } /* Keyword.Type */
+.highlight .m { color: #009999 } /* Literal.Number */
+.highlight .s { color: #d01040 } /* Literal.String */
+.highlight .na { color: #008080 } /* Name.Attribute */
+.highlight .nb { color: #0086B3 } /* Name.Builtin */
+.highlight .nc { color: #445588; font-weight: bold } /* Name.Class */
+.highlight .no { color: #008080 } /* Name.Constant */
+.highlight .nd { color: #3c5d5d; font-weight: bold } /* Name.Decorator */
+.highlight .ni { color: #800080 } /* Name.Entity */
+.highlight .ne { color: #990000; font-weight: bold } /* Name.Exception */
+.highlight .nf { color: #990000; font-weight: bold } /* Name.Function */
+.highlight .nl { color: #990000; font-weight: bold } /* Name.Label */
+.highlight .nn { color: #555555 } /* Name.Namespace */
+.highlight .nt { color: #000080 } /* Name.Tag */
+.highlight .nv { color: #008080 } /* Name.Variable */
+.highlight .ow { color: #000000; font-weight: bold } /* Operator.Word */
+.highlight .w { color: #bbbbbb } /* Text.Whitespace */
+.highlight .mf { color: #009999 } /* Literal.Number.Float */
+.highlight .mh { color: #009999 } /* Literal.Number.Hex */
+.highlight .mi { color: #009999 } /* Literal.Number.Integer */
+.highlight .mo { color: #009999 } /* Literal.Number.Oct */
+.highlight .sb { color: #d01040 } /* Literal.String.Backtick */
+.highlight .sc { color: #d01040 } /* Literal.String.Char */
+.highlight .sd { color: #d01040 } /* Literal.String.Doc */
+.highlight .s2 { color: #d01040 } /* Literal.String.Double */
+.highlight .se { color: #d01040 } /* Literal.String.Escape */
+.highlight .sh { color: #d01040 } /* Literal.String.Heredoc */
+.highlight .si { color: #d01040 } /* Literal.String.Interpol */
+.highlight .sx { color: #d01040 } /* Literal.String.Other */
+.highlight .sr { color: #009926 } /* Literal.String.Regex */
+.highlight .s1 { color: #d01040 } /* Literal.String.Single */
+.highlight .ss { color: #990073 } /* Literal.String.Symbol */
+.highlight .bp { color: #999999 } /* Name.Builtin.Pseudo */
+.highlight .vc { color: #008080 } /* Name.Variable.Class */
+.highlight .vg { color: #008080 } /* Name.Variable.Global */
+.highlight .vi { color: #008080 } /* Name.Variable.Instance */
+.highlight .il { color: #009999 } /* Literal.Number.Integer.Long */
\ No newline at end of file

From 7a7452e9ca987ae69502f6b7215c39f791648dc1 Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Sun, 26 Jan 2025 14:32:34 -0800
Subject: [PATCH 7/8] css update to distinguish inline code and code blocks

---
 docs/assets/css/style.css | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/docs/assets/css/style.css b/docs/assets/css/style.css
index 5ca50529..3f2e7140 100644
--- a/docs/assets/css/style.css
+++ b/docs/assets/css/style.css
@@ -139,7 +139,15 @@ body {
     box-shadow: 0px 2px 4px rgba(0, 0, 0, 0.05);
 }
 
-code {
+/* inline code */
+code.highlighter-rouge{
+    background: white;
+    border: 1px solid #ccc;
+
+}
+
+/* code block */
+div.highlighter-rouge {
     background-color: white; 
     width: 60%;
     border: 1px solid #ccc; 
@@ -147,7 +155,7 @@ code {
     
 }
 
-
+/*
 pre > code.language-bash {
     background-color: white; 
     width: 60%;
@@ -164,6 +172,7 @@ pre > code.language-python {
     overflow-x: auto; 
     border: 1px solid #ccc;
 }
+*/
 /* 
 .hamburger {
     display: none;

From 2cf7cad97c2a9d6df3b9cc7feff629aa51dc30f7 Mon Sep 17 00:00:00 2001
From: Yannick Mahlich <yannick.mahlich@pnnl.gov>
Date: Mon, 27 Jan 2025 09:27:13 -0800
Subject: [PATCH 8/8] fix padding of inline code and x-overflow scrolling
 behaviour of codeblock

---
 docs/assets/css/style.css | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/assets/css/style.css b/docs/assets/css/style.css
index 3f2e7140..c36d90f5 100644
--- a/docs/assets/css/style.css
+++ b/docs/assets/css/style.css
@@ -143,6 +143,7 @@ body {
 code.highlighter-rouge{
     background: white;
     border: 1px solid #ccc;
+    padding: 2.5px;
 
 }
 
@@ -151,7 +152,8 @@ div.highlighter-rouge {
     background-color: white; 
     width: 60%;
     border: 1px solid #ccc; 
-    padding: 2.5px; 
+    padding: 2.5px;
+    overflow-x: auto; 
     
 }