Updated torch docs
jsschreck committed Aug 16, 2024
1 parent 10cf824 commit bc6752b
Showing 2 changed files with 153 additions and 10 deletions.
11 changes: 4 additions & 7 deletions README.md
@@ -49,19 +49,16 @@ conda activate guess

## Using miles-guess

The package contains three scripts for training three regression models, and one for training categorical models.
The regression examples are trained on our surface layer ("SL") dataset for predicting latent heat and other quantities,
and the categorical example is trained on a precipitation dataset ("p-type").

The law of total variance for each model prediction target may be computed as

$$LoTV = E[\sigma^2] + Var[\mu]$$

which is the sum of aleatoric and epistemic contributions, respectively.
which is the sum of aleatoric and epistemic contributions, respectively. The MILES-GUESS package contains options for using either Keras or PyTorch for computing quantities according to the LoTV, as well as utilizing Dempster-Shafer theory uncertainty in the classifier case.
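
As a rough illustration (not the exact package API), the decomposition can be computed from a set of stochastic forward passes, where `mu` and `sigma2` are hypothetical arrays of predicted means and variances:

```python
import numpy as np

# Hypothetical inputs: predicted means and variances from 30 stochastic
# forward passes over 1000 samples and 1 output task.
mu = np.random.normal(size=(30, 1000, 1))
sigma2 = np.random.uniform(0.1, 0.5, size=(30, 1000, 1))

aleatoric = sigma2.mean(axis=0)         # E[sigma^2], averaged over forward passes
epistemic = mu.var(axis=0)              # Var[mu], spread of the predicted means
total_variance = aleatoric + epistemic  # law of total variance
```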

For detailed information about computing the LoTV using training with Keras, refer to [the Keras training details README](mlguess/keras/keras.md).
For detailed information about training with Keras, refer to [the Keras training details README](docs/source/keras.md). There are three scripts for training three regression models, and one for training categorical models. The regression examples are trained on our surface layer ("SL") dataset for predicting latent heat and other quantities,
and the categorical example is trained on a precipitation dataset ("p-type").

For pyTorch, please visit [the pyTorch training details README](mlguess/torch/torch.md).
For pyTorch, please visit [the pyTorch training details README](docs/source/torch.md). There is one training script that works for both evidential and standard classification tasks. Torch examples use the same datasets as the Keras models. The torch code can be used with DDP and FSDP.

<!--
### 1a. Train/evaluate a deterministic multi-layer perceptron (MLP) on the SL dataset:
152 changes: 149 additions & 3 deletions docs/source/torch.md
@@ -1,5 +1,151 @@
Welcome to the pyTorch users page. The instructions below outline how to compute various UQ quantities like aleatoric and epistemic uncertainty using different modeling approaches.
Welcome to the pyTorch users page. The instructions below outline how to compute various UQ quantities like aleatoric and epistemic uncertainty using different modeling approaches. Email [email protected] with questions, concerns, fixes, etc.

Overall, there are two scripts: (1) one to train regression models and (2) one to train categorical models. Let us review the configuration file first, then we will train models.
## Regression usage

(1) Currently, for regression problems only the Amini-evidential MLP and a standard multi-task MLP (e.g. one that does not predict uncertainties) are supported. Support for the Gaussian model will be added eventually. To train a regression MLP
There are two provided scripts which are mostly similar, one for training and one for loading a trained model and predicting.

Run the training script with: `python applications/train_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Arguments:
- `-c, --config`: Path to the YAML configuration file (required)
- `-l`: Submit workers to PBS (optional, default: 0)
- `-m, --mode`: Set the training mode to 'none', 'ddp', or 'fsdp' (optional)

Example:
`python applications/train_regressor_torch.py -c config.yml -m ddp -l 1`

The YAML configuration file should contain settings for the model, training, data, and PBS or Slurm jobs. For distributed training, set the `mode` in the config file or use the `-m` argument to specify 'ddp' or 'fsdp', and use the `-l` flag to submit jobs to PBS or manually set up your distributed environment.

For more detailed information about configuration options and advanced usage, please refer to the code documentation and comments within the script.

Once a model is trained, then run

`python applications/predict_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

which will load the trained model from disk, predict on the training splits, and save the predictions along with some computed metrics to disk. The predicted quantities include the task(s) predictions along with aleatoric and epistemic quantities.

## Classifier usage

For the classifier models, training and evaluating an evidential model on a dataset is performed in the same script, with options for distributed training using either DDP or FSDP.

Run the combined script with: `python applications/train_classifier_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Example:
`python applications/train_classifier_torch.py -c config.yml -m ddp -l 1`

As noted, this script will both train a model and then predict on the supplied training splits. The predicted quantities include the task(s) predictions along with the Dempster-Shafer uncertainty, and aleatoric and epistemic quantities for a $K$-class problem. Please see the full documentation for more.
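
As a rough sketch of that decomposition for an evidential classifier (the exact activations and helper names used in mlguess may differ), the Dempster-Shafer uncertainty can be derived from the predicted Dirichlet evidence along these lines:

```python
import torch
import torch.nn.functional as F

def dirichlet_outputs(logits):
    """Illustrative post-processing for a K-class evidential classifier.

    `logits` is a hypothetical (batch, K) tensor of raw network outputs.
    """
    evidence = F.relu(logits)                   # non-negative evidence per class
    alpha = evidence + 1.0                      # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)  # S = sum_k alpha_k
    probs = alpha / strength                    # expected class probabilities
    num_classes = logits.shape[-1]
    ds_uncertainty = num_classes / strength     # Dempster-Shafer uncertainty u = K / S
    return probs, ds_uncertainty

# Example with random logits for a 4-class problem
probs, u = dirichlet_outputs(torch.randn(8, 4))
```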

## Model and training configuration yaml

The most important fields for training evidential models are the `trainer` and `model` fields in the example config file. These fields apply to both classifier and regression models.
```yaml
trainer:
    mode: fsdp # none, ddp, fsdp
    training_metric: "valid_loss"
    train_batch_size: *batch_size
    valid_batch_size: *batch_size
    batches_per_epoch: 500 # Set to 0 to use len(dataloader)
    valid_batches_per_epoch: 0
    learning_rate: 0.0015285262808755972
    weight_decay: 0.0009378550509012784
    start_epoch: 0
    epochs: 100
    amp: False
    grad_accum_every: 1
    grad_max_norm: 1.0
    thread_workers: 4
    valid_thread_workers: 4
    stopping_patience: 5
    load_weights: False
    load_optimizer: False
    use_scheduler: True
    # scheduler: {'scheduler_type': 'cosine-annealing', first_cycle_steps: 500, cycle_mult: 6.0, max_lr: 5.0e-04, min_lr: 5.0e-07, warmup_steps: 499, gamma: 0.7}
    scheduler: {scheduler_type: plateau, mode: min, factor: 0.1, patience: 2, cooldown: 2, min_lr: 1.0e-07, verbose: true, threshold: 1.0e-04}
    # scheduler: {'scheduler_type': 'lambda'}

model:
    input_size: *input_cols # Reference to data:input_cols
    output_size: *output_cols # Reference to data:output_cols
    layer_size: [1057, 1057, 1057, 1057, 1057] # Example block sizes
    dr: [0.263, 0.263, 0.263, 0.263, 0.263] # Dropout rates
    batch_norm: False # Whether to use batch normalization
    lng: True # Use the evidential layer (True) or not (False)
```
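
As a rough sketch (the actual model classes live in `mlguess.torch`), the `layer_size`, `dr`, and `batch_norm` fields might translate into a torch MLP along these lines, with the `lng` flag swapping the final layer for an evidential head:

```python
import torch.nn as nn

def build_mlp(input_size, output_size, layer_size, dr, batch_norm=False):
    """Hypothetical sketch of how the `model` fields could map to an MLP;
    see mlguess.torch for the actual model classes (including the evidential
    head selected by `lng`)."""
    layers, width_in = [], input_size
    for width, drop in zip(layer_size, dr):
        layers.append(nn.Linear(width_in, width))
        if batch_norm:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.LeakyReLU())
        layers.append(nn.Dropout(drop))
        width_in = width
    layers.append(nn.Linear(width_in, output_size))
    return nn.Sequential(*layers)

# e.g. a network matching the example config above (sizes are placeholders)
net = build_mlp(7, 4, [1057] * 5, [0.263] * 5, batch_norm=False)
```
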
## Trainer Configuration
### General Settings
* **Mode:** Specifies the distributed training strategy.
    * Options: `none`, `ddp`, `fsdp`.
* **Training Metric:** Metric used for monitoring during training.
    * Example: `"valid_loss"`.
* **Batch Sizes:**
    * `train_batch_size`: Batch size used for training.
    * `valid_batch_size`: Batch size used for validation.
* **Epoch Configuration:**
    * `batches_per_epoch`: Number of batches per epoch during training. Set to `0` to use the entire dataloader length.
    * `valid_batches_per_epoch`: Number of batches per epoch during validation.
    * `start_epoch`: The epoch from which training starts.
    * `epochs`: Total number of epochs for training.
* **Learning Parameters:**
    * `learning_rate`: Initial learning rate.
    * `weight_decay`: Weight decay for regularization.
    * `amp`: Use Automatic Mixed Precision (AMP) if set to `True`.
    * `grad_accum_every`: Gradient accumulation steps.
    * `grad_max_norm`: Maximum norm for gradient clipping.
* **Multi-threading:**
    * `thread_workers`: Number of worker threads for training.
    * `valid_thread_workers`: Number of worker threads for validation.
* **Early Stopping:**
    * `stopping_patience`: Number of epochs with no improvement after which training will stop.
* **Checkpointing:**
    * `load_weights`: Load weights from a pre-trained model if `True`.
    * `load_optimizer`: Load optimizer state from a checkpoint if `True`.
* **Learning Rate Scheduler:**
    * `use_scheduler`: Apply learning rate scheduling if `True`.
    * `scheduler`: Dictionary containing scheduler configuration.

```yaml
# Example: Cosine Annealing Scheduler
scheduler:
    scheduler_type: cosine-annealing
    first_cycle_steps: 500
    cycle_mult: 6.0
    max_lr: 5.0e-04
    min_lr: 5.0e-07
    warmup_steps: 499
    gamma: 0.7

# Example: Plateau Scheduler
scheduler:
    scheduler_type: plateau
    mode: min
    factor: 0.1
    patience: 2
    cooldown: 2
    min_lr: 1.0e-07
    verbose: true
    threshold: 1.0e-04
```
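
As a rough sketch of how the `plateau` dictionary could map onto a torch scheduler (the actual factory is part of the mlguess training code; the cosine-annealing variant presumably uses a warmup/restart implementation):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Hypothetical illustration of consuming the `plateau` scheduler dictionary;
# the keys correspond to ReduceLROnPlateau keyword arguments.
conf = {"scheduler_type": "plateau", "mode": "min", "factor": 0.1,
        "patience": 2, "cooldown": 2, "min_lr": 1.0e-07, "threshold": 1.0e-04}

model = nn.Linear(7, 4)  # stand-in model for the example
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3)

kwargs = {k: v for k, v in conf.items() if k != "scheduler_type"}
scheduler = ReduceLROnPlateau(optimizer, **kwargs)

# After each validation epoch, step on the monitored metric:
# scheduler.step(valid_loss)
```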

## Model Configuration

### Input/Output Sizes

* `input_size`: Number of input features (referenced from data).
* `output_size`: Number of output features (referenced from data).

### Architecture

* `layer_size`: List defining the number of neurons in each layer.
* `dr`: List defining the dropout rates for each layer.
* `batch_norm`: Enable/Disable batch normalization. Set to `True` for enabling.

### Evidential Layer

* `lng`: Use evidential layer if `True`. Useful for uncertainty quantification.

### Classifier Models

* `output_activation`: Set to `softmax` for standard classification. If not set, the model will use evidential classification.
