Welcome to the PyTorch users page. The instructions below outline how to compute various UQ quantities, such as aleatoric and epistemic uncertainty, using different modeling approaches. Email [email protected] for questions/concerns/fixes/etc.

Overall, there are two scripts: (1) one to train regression models and (2) one to train categorical models. Let us review the configuration file first, then we will train models.

## Regression usage

Currently, for regression problems only the Amini evidential MLP and a standard multi-task MLP (i.e., one that does not predict uncertainties) are supported. Support for the Gaussian model will be added eventually. To train a regression MLP, there are two provided scripts, which are mostly similar: one for training and one for loading a trained model and predicting.

Run the training script with: `python applications/train_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Arguments:
- `-c, --config`: Path to the YAML configuration file (required)
- `-l`: Submit workers to PBS (optional, default: 0)
- `-m, --mode`: Set the training mode to 'none', 'ddp', or 'fsdp' (optional)

Example:
`python applications/train_regressor_torch.py -c config.yml -m ddp -l 1`

The YAML configuration file should contain settings for the model, training, data, and PBS or SLURM settings. For distributed training, set the `mode` in the config file or use the `-m` argument to specify 'ddp' or 'fsdp', and use the `-l` flag to submit jobs to PBS or manually set up your distributed environment.
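
As a rough sketch of the overall layout (the keys under `data` and `pbs` are placeholders here and depend on your dataset and cluster; the `trainer` and `model` sections are detailed below):

```yaml
data:
  # dataset paths, input/output column definitions, etc.
trainer:
  mode: ddp  # none, ddp, or fsdp; can be overridden with the -m flag
  # optimization and training settings (see the trainer section below)
model:
  # architecture and evidential-layer settings (see the model section below)
pbs:
  # scheduler/job settings used when submitting with -l 1
```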

For more detailed information about configuration options and advanced usage, please refer to the code documentation and comments within the scripts.

Once a model is trained, run

`python applications/predict_regressor_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

which will load the trained model from disk, predict on the training splits, and save the predictions along with some computed metrics to disk. The predicted quantities include the task(s) predictions along with aleatoric and epistemic quantities.
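
For example, to load the model and predict on a single node: `python applications/predict_regressor_torch.py -c config.yml`

As background (the repository's exact post-processing may differ), in the Amini et al. (2020) evidential regression framework the network predicts Normal-Inverse-Gamma parameters $(\gamma, \nu, \alpha, \beta)$ for each task, from which the point prediction and the two uncertainty components are commonly recovered as

$$
\mathbb{E}[\mu] = \gamma, \qquad
\underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad
\underbrace{\mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}.
$$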

## Classifier usage

For the classifier models, training and evaluating an evidential model on a dataset is performed in the same script, with options for distributed training using either DDP or FSDP.

Run the combined script with: `python applications/train_classifier_torch.py -c <path-to-config-file> [-l] [-m <mode>]`

Example:
`python applications/train_classifier_torch.py -c config.yml -m ddp -l 1`

As noted, this script will both train a model and then predict on the supplied training splits. The predicted quantities include the task(s) predictions along with the Dempster-Shafer uncertainty, and aleatoric and epistemic quantities for a $K$-class problem. Please see the full documentation for more.
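
For reference (the repository's implementation may differ in details), in standard evidential classification (Sensoy et al., 2018) the network outputs a non-negative evidence $e_k$ for each of the $K$ classes, and the Dempster-Shafer (vacuity) uncertainty follows from the induced Dirichlet parameters:

$$
\alpha_k = e_k + 1, \qquad S = \sum_{k=1}^{K} \alpha_k, \qquad
u = \frac{K}{S}, \qquad b_k = \frac{e_k}{S}, \qquad \hat{p}_k = \frac{\alpha_k}{S},
$$

where $u$ is the Dempster-Shafer uncertainty, $b_k$ are the per-class belief masses, and $\hat{p}_k$ is the expected class probability.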

## Model and training configuration YAML

The most important fields for training evidential models are the `trainer` and `model` fields in the example config file below. These fields apply to both classifier and regression models.

```yaml
trainer:
  mode: fsdp # none, ddp, fsdp
  training_metric: "valid_loss"
  train_batch_size: *batch_size
  valid_batch_size: *batch_size
  batches_per_epoch: 500 # Set to 0 to use len(dataloader)
  valid_batches_per_epoch: 0
  learning_rate: 0.0015285262808755972
  weight_decay: 0.0009378550509012784
  start_epoch: 0
  epochs: 100
  amp: False
  grad_accum_every: 1
  grad_max_norm: 1.0
  thread_workers: 4
  valid_thread_workers: 4
  stopping_patience: 5
  load_weights: False
  load_optimizer: False
  use_scheduler: True
  # scheduler: {'scheduler_type': 'cosine-annealing', first_cycle_steps: 500, cycle_mult: 6.0, max_lr: 5.0e-04, min_lr: 5.0e-07, warmup_steps: 499, gamma: 0.7}
  scheduler: {scheduler_type: plateau, mode: min, factor: 0.1, patience: 2, cooldown: 2, min_lr: 1.0e-07, verbose: true, threshold: 1.0e-04}
  # scheduler: {'scheduler_type': 'lambda'}

model:
  input_size: *input_cols # Reference to data:input_cols
  output_size: *output_cols # Reference to data:output_cols
  layer_size: [1057, 1057, 1057, 1057, 1057] # Example block sizes
  dr: [0.263, 0.263, 0.263, 0.263, 0.263] # Dropout rates
  batch_norm: False # Whether to use batch normalization
  lng: True # Use the evidential layer (True) or not (False)
```

## Trainer Configuration

### General Settings

* **Mode:** Specifies the distributed training strategy.
  * Options: `none`, `ddp`, `fsdp`.
* **Training Metric:** Metric used for monitoring during training.
  * Example: `"valid_loss"`.
* **Batch Sizes:**
  * `train_batch_size`: Batch size used for training.
  * `valid_batch_size`: Batch size used for validation.
* **Epoch Configuration:**
  * `batches_per_epoch`: Number of batches per epoch during training. Set to `0` to use the entire dataloader length.
  * `valid_batches_per_epoch`: Number of batches per epoch during validation.
  * `start_epoch`: The epoch from which training starts.
  * `epochs`: Total number of epochs for training.
* **Learning Parameters:**
  * `learning_rate`: Initial learning rate.
  * `weight_decay`: Weight decay for regularization.
  * `amp`: Use Automatic Mixed Precision (AMP) if set to `True`.
  * `grad_accum_every`: Gradient accumulation steps.
  * `grad_max_norm`: Maximum norm for gradient clipping.
* **Multi-threading:**
  * `thread_workers`: Number of worker threads for training.
  * `valid_thread_workers`: Number of worker threads for validation.
* **Early Stopping:**
  * `stopping_patience`: Number of epochs with no improvement after which training will stop.
* **Checkpointing:**
  * `load_weights`: Load weights from a pre-trained model if `True`.
  * `load_optimizer`: Load optimizer state from a checkpoint if `True`.
* **Learning Rate Scheduler:**
  * `use_scheduler`: Apply learning rate scheduling if `True`.
  * `scheduler`: Dictionary containing scheduler configuration.

```yaml
# Example: Cosine Annealing Scheduler
scheduler:
  scheduler_type: cosine-annealing
  first_cycle_steps: 500
  cycle_mult: 6.0
  max_lr: 5.0e-04
  min_lr: 5.0e-07
  warmup_steps: 499
  gamma: 0.7
```

```yaml
# Example: Plateau Scheduler
scheduler:
  scheduler_type: plateau
  mode: min
  factor: 0.1
  patience: 2
  cooldown: 2
  min_lr: 1.0e-07
  verbose: true
  threshold: 1.0e-04
```

## Model Configuration

### Input/Output Sizes

* `input_size`: Number of input features (referenced from data).
* `output_size`: Number of output features (referenced from data).
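
The `*input_cols` and `*output_cols` entries in the model block are standard YAML aliases that point to anchors defined elsewhere in the config. A minimal sketch of the mechanics, assuming the anchors live in the `data` block (the actual key names and values in your config may differ):

```yaml
data:
  input_cols: &input_cols 10   # anchor: number of input features
  output_cols: &output_cols 1  # anchor: number of output tasks

model:
  input_size: *input_cols      # alias resolves to 10
  output_size: *output_cols    # alias resolves to 1
```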

### Architecture

* `layer_size`: List defining the number of neurons in each layer.
* `dr`: List defining the dropout rates for each layer; it should have the same length as `layer_size`.
* `batch_norm`: Enable/disable batch normalization. Set to `True` to enable.

### Evidential Layer

* `lng`: Use the evidential layer if `True`. Useful for uncertainty quantification.

### Classifier Models

* `output_activation`: Set to `softmax` for standard classification. If not set, the model will use evidential classification.
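
For instance, a sketch of a classifier `model` block using the standard softmax head rather than the evidential one (the field values are illustrative, not recommendations, and the exact combination of fields accepted for classifiers may differ):

```yaml
model:
  input_size: *input_cols
  output_size: *output_cols    # K classes
  layer_size: [1057, 1057]
  dr: [0.263, 0.263]
  batch_norm: False
  output_activation: softmax   # omit this line to train an evidential classifier
```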