Commit

Simplify save_dir and some directory -> dir renames (#151)
* wip renames

* renames in docs

* readme

* data dir renamme in docs

* rename in code from data_directory to data_dir

* maintaining update

* fix capitalization

* further updates

* tweak

* do not overwrite

* add overwrite save dir

* add overwrite save dir to config

* update configs with all info

* use full train configuration

* only upload if does not exist

* tests for save

* overwrite param

* better set up and test for overwrite

* docs

* update docs with overwrite

* from overwrite_save_dir to overwrite

* missed rename

* remove machine specific from vlc

* unindent so test actually runs

* check for local and cached checkpoints

* should be and

* write out predict config before preds start like we do for train config

* update all configs and use only first 10 digits of hash

* dry run check after save is configured; more robust test

* reorder

* show save directory

* copy edits

* update template

* fix test

* lower case for consistency

* fix test
ejm714 authored Oct 25, 2021
1 parent b4d0a3d commit e1c1f03
Showing 35 changed files with 478 additions and 336 deletions.
2 changes: 1 addition & 1 deletion .github/MAINTAINING.md
@@ -113,7 +113,7 @@ make publish_models

This will generate a public file name for each model based on the config hash and upload the model weights to the three DrivenData public S3 buckets. It will also generate a folder in `zamba/models/official_models/{your_model_name}` that contains the official config as well as reference yaml and json files. You should PR everything in this folder.

Lastly, you need to update the template in `templates`. The template should contain all the same info as the model's `config.yaml`, plus placeholders for `data_directory` and `labels` in `train_config`, and `data_directory`, `filepaths`, and `checkpoint` in `predict_config`.
Lastly, you need to update the template in `templates`. The template should contain all the same info as the model's `config.yaml`, plus placeholders for `data_dir` and `labels` in `train_config`, and `data_dir`, `filepaths`, and `checkpoint` in `predict_config`.

### New model checklist

2 changes: 1 addition & 1 deletion README.md
@@ -82,7 +82,7 @@ See the [Quickstart](https://zamba.drivendata.org/docs/quickstart/) page or the
### Training a model

```console
$ zamba train --data-dir path/to/videos --labels path_to_labels.csv --save-path my_trained_model
$ zamba train --data-dir path/to/videos --labels path_to_labels.csv --save_dir my_trained_model
```

The newly trained model will be saved to the specified save directory. The folder will contain a model checkpoint as well as training configuration, model hyperparameters, and validation and test metrics. Run `zamba train --help` to list all possible options to pass to `train`.
41 changes: 26 additions & 15 deletions docs/docs/configurations.md
@@ -32,7 +32,7 @@ All video loading arguments can be specified either in a [YAML file](yaml-config
from zamba.models.config import PredictConfig
from zamba.models.model_manager import predict_model

predict_config = PredictConfig(data_directory="example_vids/")
predict_config = PredictConfig(data_dir="example_vids/")
video_loader_config = VideoLoaderConfig(
model_input_height=240,
model_input_width=426,
@@ -146,14 +146,16 @@ All possible model inference parameters are defined by the [`PredictConfig` clas

class PredictConfig(ZambaBaseModel)
| PredictConfig(*,
data_directory: DirectoryPath = Path.cwd()
data_dir: DirectoryPath = Path.cwd(),
filepaths: FilePath = None,
checkpoint: FilePath = None,
model_name: zamba.models.config.ModelEnum = <ModelEnum.time_distributed: 'time_distributed'>,
gpus: int = 0,
num_workers: int = 3,
batch_size: int = 2,
save: Union[bool, pathlib.Path] = True,
save: bool = True,
save_dir: Optional[Path] = None,
overwrite: bool = False,
dry_run: bool = False,
proba_threshold: float = None,
output_class_names: bool = False,
@@ -164,9 +166,9 @@
...
```

**Either `data_directory` or `filepaths` must be specified to instantiate `PredictConfig`.** If neither is specified, the current working directory will be used as the default `data_directory`.
**Either `data_dir` or `filepaths` must be specified to instantiate `PredictConfig`.** If neither is specified, the current working directory will be used as the default `data_dir`.

#### `data_directory (DirectoryPath, optional)`
#### `data_dir (DirectoryPath, optional)`

Path to the directory containing videos for inference. Defaults to the current working directory.

@@ -194,9 +196,18 @@ The number of CPUs to use during training. The maximum value for `num_workers` i

The batch size to use for inference. Defaults to `2`

#### `save (bool, optional)`
#### `save (bool)`

Whether to save out the predictions to a CSV file. By default, predictions will be saved at `zamba_predictions.csv`. Defaults to `True`
Whether to save out predictions. If `False`, predictions are not saved. Defaults to `True`.

#### `save_dir (Path, optional)`

An optional directory in which to save the model predictions and configuration yaml. If no `save_dir` is specified and `save` is `True`, outputs will be written to the current working directory. Defaults to `None`.

#### `overwrite (bool)`

If `True`, will overwrite `zamba_predictions.csv` and `predict_configuration.yaml` in `save_dir` if they exist. Defaults to `False`.
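Taken together, the new saving options can be sketched in a YAML configuration file — the folder names below are hypothetical:

```yaml
predict_config:
  data_dir: example_vids/
  save: true
  save_dir: predictions/  # outputs written here instead of the working directory
  overwrite: true         # replace existing zamba_predictions.csv and predict_configuration.yaml
```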

#### `dry_run (bool, optional)`

@@ -237,7 +248,7 @@ All possible model training parameters are defined by the [`TrainConfig` class](
class TrainConfig(ZambaBaseModel)
| TrainConfig(*,
labels: Union[FilePath, pandas.DataFrame],
data_directory: DirectoryPath = # your current working directory ,
data_dir: DirectoryPath = # your current working directory ,
checkpoint: FilePath = None,
scheduler_config: Union[str, zamba.models.config.SchedulerConfig, NoneType] = 'default',
model_name: zamba.models.config.ModelEnum = <ModelEnum.time_distributed: 'time_distributed'>,
@@ -256,8 +267,8 @@ class TrainConfig(ZambaBaseModel)
verbose=True, mode='max'),
weight_download_region: zamba.models.utils.RegionEnum = 'us',
split_proportions: Dict[str, int] = {'train': 3, 'val': 1, 'holdout': 1},
save_directory: pathlib.Path = # your current working directory ,
overwrite_save_directory: bool = False,
save_dir: pathlib.Path = # your current working directory ,
overwrite: bool = False,
skip_load_validation: bool = False,
from_scratch: bool = False,
predict_all_zamba_species: bool = True,
@@ -270,7 +281,7 @@ class TrainConfig(ZambaBaseModel)

Either the path to a CSV file with labels for training, or a dataframe of the training labels. There must be columns for `filename` and `label`. **`labels` must be specified to instantiate `TrainConfig`.**

#### `data_directory (DirectoryPath, optional)`
#### `data_dir (DirectoryPath, optional)`

Path to the directory containing training videos. Defaults to the current working directory.

@@ -326,13 +337,13 @@ Because `zamba` needs to download pretrained weights for the neural network arch

The proportion of data to use during training, validation, and as a holdout set. Defaults to `{"train": 3, "val": 1, "holdout": 1}`
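To make the proportions concrete, here is a hypothetical sketch of turning `split_proportions` into per-split video counts — an illustration only, not zamba's actual splitting logic:

```python
# Hypothetical illustration of proportional splitting; not zamba's actual implementation.
def split_counts(n_videos, proportions):
    """Turn proportions like {'train': 3, 'val': 1, 'holdout': 1} into video counts."""
    total = sum(proportions.values())
    counts = {name: n_videos * share // total for name, share in proportions.items()}
    # Give any rounding remainder to the training split
    counts["train"] += n_videos - sum(counts.values())
    return counts

print(split_counts(100, {"train": 3, "val": 1, "holdout": 1}))
# → {'train': 60, 'val': 20, 'holdout': 20}
```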

#### `save_directory (Path, optional)`
#### `save_dir (Path, optional)`

Directory in which to save model checkpoint and configuration file. If not specified, will save to a `version_*` folder in your working directory.
Directory in which to save model checkpoint and configuration file. If not specified, will save to a `version_n` folder in your current working directory.

#### `overwrite_save_directory (bool, optional)`
#### `overwrite (bool, optional)`

If `True`, will save outputs in `save_directory` and overwrite the directory if it exists. If False, will create an auto-incremented `version_n` folder within `save_directory` with model outputs. Defaults to `False`.
If `True`, will save outputs in `save_dir` and overwrite the directory if it exists. If False, will create an auto-incremented `version_n` folder within `save_dir` with model outputs. Defaults to `False`.
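As a sketch, the renamed training options might appear in a YAML configuration like this (paths are hypothetical):

```yaml
train_config:
  data_dir: example_vids/
  labels: example_labels.csv
  save_dir: my_model/
  overwrite: true  # write into my_model/ directly instead of an auto-incremented version_n folder
```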

#### `skip_load_validation (bool, optional)`

6 changes: 3 additions & 3 deletions docs/docs/debugging.md
@@ -12,7 +12,7 @@ Before kicking off a full run of inference or model training, we recommend testi
In Python, add `dry_run=True` to [`PredictConfig`](configurations.md#prediction-arguments) or [`TrainConfig`](configurations.md#training-arguments):
```python
predict_config = PredictConfig(
data_directory="example_vids/", dry_run=True
data_dir="example_vids/", dry_run=True
)
```

@@ -30,7 +30,7 @@ The dry run will also catch any GPU memory errors. If you hit a GPU memory error
In Python, add `batch_size` to [`PredictConfig`](configurations.md#prediction-arguments) or [`TrainConfig`](configurations.md#training-arguments):
```python
predict_config = PredictConfig(
data_directory="example_vids/", batch_size=1
data_dir="example_vids/", batch_size=1
)
```

@@ -66,7 +66,7 @@ Reduce the number of workers (subprocesses) used for data loading. By default `n
In Python, add `num_workers` to [`PredictConfig`](configurations.md#prediction-arguments) or [`TrainConfig`](configurations.md#training-arguments):
```python
predict_config = PredictConfig(
data_directory="example_vids/", num_workers=1
data_dir="example_vids/", num_workers=1
)
```

10 changes: 5 additions & 5 deletions docs/docs/extra-options.md
@@ -22,7 +22,7 @@ For using a YAML file with the Python package and other details, see the [YAML C
In Python this can be specified in [`PredictConfig`](configurations.md#prediction-arguments) or [`TrainConfig`](configurations.md#training-arguments):
```python
predict_config = PredictConfig(
data_directory="example_vids/",
data_dir="example_vids/",
weight_download_region='asia',
)
```
@@ -50,7 +50,7 @@ Say that you have a large number of videos, and you are more concerned with dete
from zamba.models.config import PredictConfig
from zamba.models.model_manager import predict_model

predict_config = PredictConfig(data_directory="example_vids/")
predict_config = PredictConfig(data_dir="example_vids/")

video_loader_config = VideoLoaderConfig(
model_input_height=50, model_input_width=50, total_frames=16
@@ -139,7 +139,7 @@ For example, to take the 16 frames with the highest probability of detection:
total_frames=16,
)

train_config = TrainConfig(data_directory="example_vids/", labels="example_labels.csv",)
train_config = TrainConfig(data_dir="example_vids/", labels="example_labels.csv",)

train_model(video_loader_config=video_loader_config, train_config=train_config)
```
@@ -162,15 +162,15 @@ Both can be specified in either [`predict_config`](configurations.md#prediction-
=== "YAML file"
```yaml
predict_config:
data_directory: example_vids/
data_dir: example_vids/
num_workers: 5
batch_size: 4
# ... other parameters
```
=== "Python"
```python
predict_config = PredictConfig(
data_directory="example_vids/",
data_dir="example_vids/",
num_workers=5,
batch_size=4,
# ... other parameters
4 changes: 2 additions & 2 deletions docs/docs/models/denspose.md
@@ -41,7 +41,7 @@ Once that is done, here's how to run the DensePose model:
=== "Python"
```python
from zamba.models.densepose import DensePoseConfig
densepose_conf = DensePoseConfig(data_directory="PATH_TO_VIDEOS", render_output=True)
densepose_conf = DensePoseConfig(data_dir="PATH_TO_VIDEOS", render_output=True)
densepose_conf.run_model()
```

@@ -68,7 +68,7 @@ Options:
containing images/videos.
--filepaths PATH Path to csv containing `filepath` column
with videos.
--save-path PATH An optional directory for saving the output.
--save-dir PATH An optional directory for saving the output.
Defaults to the current working directory.
--config PATH Specify options using yaml configuration
file instead of through command line
16 changes: 8 additions & 8 deletions docs/docs/predict-tutorial.md
@@ -37,17 +37,17 @@ Minimum example for prediction using the Python package:
from zamba.models.model_manager import predict_model
from zamba.models.config import PredictConfig

predict_config = PredictConfig(data_directory="example_vids/")
predict_config = PredictConfig(data_dir="example_vids/")
predict_model(predict_config=predict_config)
```

The only two arguments that can be passed to `predict_model` are `predict_config` and (optionally) `video_loader_config`. The first step is to instantiate [`PredictConfig`](configurations.md#prediction-arguments). Optionally, you can also specify video loading arguments by instantiating and passing in [`VideoLoaderConfig`](configurations.md#video-loading-arguments).

### Required arguments

To run `predict_model` in Python, you must specify either `data_directory` or `filepaths` when `PredictConfig` is instantiated.
To run `predict_model` in Python, you must specify either `data_dir` or `filepaths` when `PredictConfig` is instantiated.

* **`data_directory (DirectoryPath)`:** Path to the folder containing your videos.
* **`data_dir (DirectoryPath)`:** Path to the folder containing your videos.

* **`filepaths (FilePath)`:** Path to a CSV file with a column for the filepath to each video you want to classify. The CSV must have a column for `filepath`. Filepaths can be absolute or relative to the data directory.

@@ -57,7 +57,7 @@ For detailed explanations of all possible configuration arguments, see [All Opti

By default, the [`time_distributed`](models/index.md#time-distributed) model will be used. `zamba` will output a `.csv` file with rows labeled by each video filename and columns for each class (i.e., species). The default prediction will store all class probabilities, so that cell (i,j) can be interpreted as *the probability that animal j is present in video i.*

By default, predictions will be saved to `zamba_predictions.csv`. You can save predictions to a custom directory using the `--save-path` argument.
By default, predictions will be saved to `zamba_predictions.csv` in your working directory. You can save predictions to a custom directory using the `--save-dir` argument.
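For instance, a hypothetical run that writes predictions to a custom folder:

```console
$ zamba predict --data-dir example_vids/ --save-dir predictions/
```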

```console
$ cat zamba_predictions.csv
@@ -90,7 +90,7 @@ Add the path to your video folder. For example, if your videos are in a folder c
```
=== "Python"
```python
predict_config = PredictConfig(data_directory='example_vids/')
predict_config = PredictConfig(data_dir='example_vids/')
predict_model(predict_config=predict_config)
```

@@ -109,7 +109,7 @@ Add the model name to your command. The `time_distributed` model will be used if
=== "Python"
```python
predict_config = PredictConfig(
data_directory='example_vids/', model_name='slowfast'
data_dir='example_vids/', model_name='slowfast'
)
predict_model(predict_config=predict_config)
```
@@ -137,7 +137,7 @@ Say we want to generate predictions for the videos in `example_vids` indicating
=== "Python"
```python
predict_config = PredictConfig(
data_directory="example_vids/", proba_threshold=0.5
data_dir="example_vids/", proba_threshold=0.5
)
predict_model(predict_config=predict_config)
predictions = pd.read_csv("zamba_predictions.csv")
@@ -153,7 +153,7 @@ Say we want to generate predictions for the videos in `example_vids` indicating

### 4. Specify any additional parameters

And there's so much more! You can also do things like specify your region for faster model download (`--weight-download-region`), use a saved model checkpoint (`--checkpoint`), or specify a different path where your predictions should be saved (`--save`). To read about a few common considerations, see the [Guide to Common Optional Parameters](extra-options.md) page.
And there's so much more! You can also do things like specify your region for faster model download (`--weight-download-region`), use a saved model checkpoint (`--checkpoint`), or specify a different folder where your predictions should be saved (`--save-dir`). To read about a few common considerations, see the [Guide to Common Optional Parameters](extra-options.md) page.

### 5. Test your configuration with a dry run

30 changes: 16 additions & 14 deletions docs/docs/quickstart.md
@@ -65,7 +65,8 @@ $ zamba predict --data-dir example_vids/
```

`zamba` will output a `.csv` file with rows labeled by each video filename and columns for each class (i.e., species). The default prediction will store all class probabilities, so that cell `(i,j)` is *the probability that animal `j` is present in video `i`.* Comprehensive predictions are helpful when a single video contains multiple species.
Predictions will be saved to `zamba_predictions.csv` in the current working directory by default. You can save out predictions to a different folder using the `--save-path` argument.

Predictions will be saved to `zamba_predictions.csv` in the current working directory by default. You can save out predictions to a different folder using the `--save-dir` argument.

Adding the argument `--output-class-names` will simplify the predictions to return only the *most likely* animal in each video:

@@ -108,7 +109,7 @@ eleph.MP4,elephant
leopard.MP4,leopard
```

By default, the trained model and additional training output will be saved to a `version_*` folder in the current working directory. For example,
By default, the trained model and additional training output will be saved to a `version_n` folder in the current working directory. For example,

```console
$ zamba train --data-dir example_vids/ --labels example_labels.csv
@@ -134,8 +135,6 @@ Once zamba is installed, you can see more details of each function with `--help`
To get help with `zamba predict`:

```console
$ zamba predict --help

Usage: zamba predict [OPTIONS]

Identify species in a video.
@@ -162,11 +161,13 @@ Options:
specified, will use all GPUs found on
machine.
--batch-size INTEGER Batch size to use for training.
--save / --no-save Whether to save out predictions to a csv
file. If you want to specify the location of
the csv, use save_path instead.
--save-path PATH Full path for prediction CSV file. Any
needed parent directories will be created.
--save / --no-save Whether to save out predictions. If you want
to specify the output directory, use
save_dir instead.
--save-dir PATH An optional directory in which to save the
model predictions and configuration yaml.
Defaults to the current working directory if
save is True.
--dry-run / --no-dry-run Runs one batch of inference to check for
bugs.
--config PATH Specify options using yaml configuration
@@ -193,6 +194,8 @@ Options:
loaded prior to inference. Only use if
you're very confident all your videos can be
loaded.
-o, --overwrite Overwrite outputs in the save directory if
they exist.
-y, --yes Skip confirmation of configuration and
proceed right to prediction.
--help Show this message and exit.
@@ -228,11 +231,10 @@ Options:
machine.
--dry-run / --no-dry-run Runs one batch of train and validation to
check for bugs.
--save-dir PATH Directory in which to save model checkpoint
and configuration file. If not specified,
will save to a folder called
'zamba_{model_name}' in your working
directory.
--save-dir PATH An optional directory in which to save the
model checkpoint and configuration file. If
not specified, will save to a `version_n`
folder in your working directory.
--num-workers INTEGER Number of subprocesses to use for data
loading.
--weight-download-region [us|eu|asia]