Updated torch docs
jsschreck committed Aug 16, 2024
1 parent 10cf824 commit bc6752b
Showing 2 changed files with 153 additions and 10 deletions.
11 changes: 4 additions & 7 deletions README.md
@@ -49,19 +49,16 @@ conda activate guess

## Using miles-guess

The package contains three scripts for training three regression models, and one for training categorical models.
The regression examples are trained on our surface layer ("SL") dataset for predicting latent heat and other quantities,
and the categorical example is trained on a precipitation dataset ("p-type").

The law of total variance for each model prediction target may be computed as

$$LoTV = E[\sigma^2] + Var[\mu]$$

which is the sum of aleatoric and epistemic contributions, respectively.
which is the sum of aleatoric and epistemic contributions, respectively. The MILES-GUESS package contains options for using either Keras or PyTorch for computing quantities according to the LoTV, as well as utilizing Dempster-Shafer theory uncertainty in the classifier case.
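
As a rough illustration (not the exact package API), the decomposition can be computed from a set of stochastic forward passes, where `mu` and `sigma2` are hypothetical arrays of predicted means and variances:

```python
import numpy as np

# Hypothetical inputs: predicted means and variances from 30 stochastic
# forward passes over 1000 samples and 1 output task.
mu = np.random.normal(size=(30, 1000, 1))
sigma2 = np.random.uniform(0.1, 0.5, size=(30, 1000, 1))

aleatoric = sigma2.mean(axis=0)         # E[sigma^2], averaged over forward passes
epistemic = mu.var(axis=0)              # Var[mu], spread of the predicted means
total_variance = aleatoric + epistemic  # law of total variance
```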

For detailed information about computing the LoTV using training with Keras, refer to [the Keras training details README](mlguess/keras/keras.md).
For detailed information about training with Keras, refer to [the Keras training details README](docs/source/keras.md). There are three scripts for training three regression models, and one for training categorical models. The regression examples are trained on our surface layer ("SL") dataset for predicting latent heat and other quantities,
and the categorical example is trained on a precipitation dataset ("p-type").

For pyTorch, please visit [the pyTorch training details README](mlguess/torch/torch.md).
For pyTorch, please visit [the pyTorch training details README](docs/source/torch.md). There is one training script that works for both evidential and standard classification tasks. Torch examples use the same datasets as the Keras models. The torch code can be used with DDP and FSDP.

<!--
### 1a. Train/evaluate a deterministic multi-layer perceptron (MLP) on the SL dataset:
152 changes: 149 additions & 3 deletions docs/source/torch.md
@@ -1,5 +1,151 @@
Welcome to the pyTorch users page. The instructions below outline how to compute various UQ quantities like aleatoric and epistemic uncertainty using different modeling approaches.
Welcome to the pyTorch users page. The instructions below outline how to compute various UQ quantities like aleatoric and epistemic uncertainty using different modeling approaches. Email [email protected] with questions, concerns, fixes, etc.

Overall, there are two scripts: (1) one to train regression models and (2) one to train categorical models. Let us review the configuration file first, then we will train models.
## Regression usage

(1) Currently, for regression problems only the Amini-evidential MLP and a standard multi-task MLP (e.g. one that does not predict uncertainties) are supported. Support for the Gaussian model will be added eventually. To train a regression MLP
There are two provided scripts which are mostly similar, one for training and one for loading a trained model and predicting.

Run the training script with: `python applications/train_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Arguments:
- `-c, --config`: Path to the YAML configuration file (required)
- `-l`: Submit workers to PBS (optional, default: 0)
- `-m, --mode`: Set the training mode to 'none', 'ddp', or 'fsdp' (optional)

Example:
`python applications/train_regressor_torch.py -c config.yml -m ddp -l 1`

The YAML configuration file should contain settings for the model, training, data, and PBS or Slurm jobs. For distributed training, set the `mode` in the config file or use the `-m` argument to specify 'ddp' or 'fsdp', and use the `-l` flag to submit jobs to PBS or manually set up your distributed environment.

For more detailed information about configuration options and advanced usage, please refer to the code documentation and comments within the script.

Once a model is trained, then run

`python applications/predict_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

which will load the trained model from disk, predict on the training splits, and save the predictions along with some computed metrics to disk. The predicted quantities include the task(s) predictions along with aleatoric and epistemic quantities.

## Classifier usage

For the classifier models, training and evaluating an evidential model on a dataset is performed in the same script, with options for distributed training using either DDP or FSDP.

Run the combined script with: `python applications/train_classifier_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Example:
`python applications/train_classifier_torch.py -c config.yml -m ddp -l 1`

As noted, this script will both train a model and then predict on the supplied training splits. The predicted quantities include the task(s) predictions along with the Dempster-Shafer uncertainty, and aleatoric and epistemic quantities for a $K$-class problem. Please see the full documentation for more.
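
As a rough sketch of that decomposition for an evidential classifier (the exact activations and helper names used in mlguess may differ), the Dempster-Shafer uncertainty can be derived from the predicted Dirichlet evidence along these lines:

```python
import torch
import torch.nn.functional as F

def dirichlet_outputs(logits):
    """Illustrative post-processing for a K-class evidential classifier.

    `logits` is a hypothetical (batch, K) tensor of raw network outputs.
    """
    evidence = F.relu(logits)                   # non-negative evidence per class
    alpha = evidence + 1.0                      # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)  # S = sum_k alpha_k
    probs = alpha / strength                    # expected class probabilities
    num_classes = logits.shape[-1]
    ds_uncertainty = num_classes / strength     # Dempster-Shafer uncertainty u = K / S
    return probs, ds_uncertainty

# Example with random logits for a 4-class problem
probs, u = dirichlet_outputs(torch.randn(8, 4))
```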

## Model and training configuration yaml

The most important fields for training evidential models are the `trainer` and `model` fields in the example config file. These fields apply to both classifier and regression models.
```yaml
trainer:
    mode: fsdp # none, ddp, fsdp
    training_metric: "valid_loss"
    train_batch_size: *batch_size
    valid_batch_size: *batch_size
    batches_per_epoch: 500 # Set to 0 to use len(dataloader)
    valid_batches_per_epoch: 0
    learning_rate: 0.0015285262808755972
    weight_decay: 0.0009378550509012784
    start_epoch: 0
    epochs: 100
    amp: False
    grad_accum_every: 1
    grad_max_norm: 1.0
    thread_workers: 4
    valid_thread_workers: 4
    stopping_patience: 5
    load_weights: False
    load_optimizer: False
    use_scheduler: True
    # scheduler: {'scheduler_type': 'cosine-annealing', first_cycle_steps: 500, cycle_mult: 6.0, max_lr: 5.0e-04, min_lr: 5.0e-07, warmup_steps: 499, gamma: 0.7}
    scheduler: {scheduler_type: plateau, mode: min, factor: 0.1, patience: 2, cooldown: 2, min_lr: 1.0e-07, verbose: true, threshold: 1.0e-04}
    # scheduler: {'scheduler_type': 'lambda'}

model:
    input_size: *input_cols # Reference to data:input_cols
    output_size: *output_cols # Reference to data:output_cols
    layer_size: [1057, 1057, 1057, 1057, 1057] # Example block sizes
    dr: [0.263, 0.263, 0.263, 0.263, 0.263] # Dropout rates
    batch_norm: False # Whether to use batch normalization
    lng: True # Use the evidential layer (True) or not (False)
```
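
As a rough sketch (the actual model classes live in `mlguess.torch`), the `layer_size`, `dr`, and `batch_norm` fields might translate into a torch MLP along these lines, with the `lng` flag swapping the final layer for an evidential head:

```python
import torch.nn as nn

def build_mlp(input_size, output_size, layer_size, dr, batch_norm=False):
    """Hypothetical sketch of how the `model` fields could map to an MLP;
    see mlguess.torch for the actual model classes (including the evidential
    head selected by `lng`)."""
    layers, width_in = [], input_size
    for width, drop in zip(layer_size, dr):
        layers.append(nn.Linear(width_in, width))
        if batch_norm:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.LeakyReLU())
        layers.append(nn.Dropout(drop))
        width_in = width
    layers.append(nn.Linear(width_in, output_size))
    return nn.Sequential(*layers)

# e.g. a network matching the example config above (sizes are placeholders)
net = build_mlp(7, 4, [1057] * 5, [0.263] * 5, batch_norm=False)
```
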
## Trainer Configuration
### General Settings
* **Mode:** Specifies the distributed training strategy.
    * Options: `none`, `ddp`, `fsdp`.
* **Training Metric:** Metric used for monitoring during training.
    * Example: `"valid_loss"`.
* **Batch Sizes:**
    * `train_batch_size`: Batch size used for training.
    * `valid_batch_size`: Batch size used for validation.
* **Epoch Configuration:**
    * `batches_per_epoch`: Number of batches per epoch during training. Set to `0` to use the entire dataloader length.
    * `valid_batches_per_epoch`: Number of batches per epoch during validation.
    * `start_epoch`: The epoch from which training starts.
    * `epochs`: Total number of epochs for training.
* **Learning Parameters:**
    * `learning_rate`: Initial learning rate.
    * `weight_decay`: Weight decay for regularization.
    * `amp`: Use Automatic Mixed Precision (AMP) if set to `True`.
    * `grad_accum_every`: Gradient accumulation steps.
    * `grad_max_norm`: Maximum norm for gradient clipping.
* **Multi-threading:**
    * `thread_workers`: Number of worker threads for training.
    * `valid_thread_workers`: Number of worker threads for validation.
* **Early Stopping:**
    * `stopping_patience`: Number of epochs with no improvement after which training will stop.
* **Checkpointing:**
    * `load_weights`: Load weights from a pre-trained model if `True`.
    * `load_optimizer`: Load optimizer state from a checkpoint if `True`.
* **Learning Rate Scheduler:**
    * `use_scheduler`: Apply learning rate scheduling if `True`.
    * `scheduler`: Dictionary containing scheduler configuration.

```yaml
# Example: Cosine Annealing Scheduler
scheduler:
    scheduler_type: cosine-annealing
    first_cycle_steps: 500
    cycle_mult: 6.0
    max_lr: 5.0e-04
    min_lr: 5.0e-07
    warmup_steps: 499
    gamma: 0.7

# Example: Plateau Scheduler
scheduler:
    scheduler_type: plateau
    mode: min
    factor: 0.1
    patience: 2
    cooldown: 2
    min_lr: 1.0e-07
    verbose: true
    threshold: 1.0e-04
```
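
As a rough sketch of how the `plateau` dictionary could map onto a torch scheduler (the actual factory is part of the mlguess training code; the cosine-annealing variant presumably uses a warmup/restart implementation):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Hypothetical illustration of consuming the `plateau` scheduler dictionary;
# the keys correspond to ReduceLROnPlateau keyword arguments.
conf = {"scheduler_type": "plateau", "mode": "min", "factor": 0.1,
        "patience": 2, "cooldown": 2, "min_lr": 1.0e-07, "threshold": 1.0e-04}

model = nn.Linear(7, 4)  # stand-in model for the example
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3)

kwargs = {k: v for k, v in conf.items() if k != "scheduler_type"}
scheduler = ReduceLROnPlateau(optimizer, **kwargs)

# After each validation epoch, step on the monitored metric:
# scheduler.step(valid_loss)
```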

## Model Configuration

### Input/Output Sizes

* `input_size`: Number of input features (referenced from data).
* `output_size`: Number of output features (referenced from data).

### Architecture

* `layer_size`: List defining the number of neurons in each layer.
* `dr`: List defining the dropout rates for each layer.
* `batch_norm`: Enable/Disable batch normalization. Set to `True` for enabling.

### Evidential Layer

* `lng`: Use evidential layer if `True`. Useful for uncertainty quantification.

### Classifier Models

* `output_activation`: Set to `softmax` for standard classification. If not set, the model will use evidential classification.
