Add example HPC directories, config, scripts and documentation (#578)
* Add example HPC directories, config, scripts and documentation

* Update training scripts and parser

* Minor install/usage docs updates

* Add GCN cronjob dirs/scripts/bugfix

* Remove .DS_Store files

* Add docs on gcn cronjob

* Add empty directories with README files, update docs

* Add some more description of file origins
bfhealy authored Apr 10, 2024
1 parent e96dfe0 commit 2be406a
Showing 60 changed files with 590,932 additions and 15 deletions.
9 changes: 7 additions & 2 deletions doc/developer.md
@@ -14,6 +14,9 @@
```shell script
scope-initialize
```

- If using GPU-accelerated period-finding algorithms, install [periodfind](https://github.com/ZwickyTransientFacility/periodfind) from source.

- Change directories to `scope` and modify `config.yaml` to finish the initialization process. This config file is used by default when running all scripts. You can also specify another config file using the `--config-path` argument.


@@ -94,16 +97,18 @@ cp config.defaults.yaml config.yaml

Edit `config.yaml` to include Kowalski instance and Fritz tokens in the associated empty `token:` fields.

#### (Optional) Install `periodfind`
If using GPU-accelerated period-finding algorithms, install [periodfind](https://github.com/ZwickyTransientFacility/periodfind) from source.
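One possible install sequence is sketched below; this is an assumption only, and the exact build requirements and commands (which may include a CUDA toolkit) are documented in the periodfind README.

```shell
# Sketch: build and install periodfind from source (exact steps may differ; see the periodfind README)
git clone https://github.com/ZwickyTransientFacility/periodfind.git
cd periodfind
pip install .
```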

#### Testing
Run `scope-test` to test your installation. Note that for the test to pass, you will need access to the Kowalski database. If you do not have Kowalski access, you can run `scope-test-limited` to run a more limited (but still useful) set of tests.

### Troubleshooting
Upon encountering installation/testing errors, manually install the package in question using `conda install xxx`, remove it from `.requirements/dev.txt`, and re-run `pip install -r requirements.txt` to continue.

#### Known issues
- If using GPU-accelerated period-finding algorithms for feature generation, you will need to install [periodfind](https://github.com/ZwickyTransientFacility/periodfind) separately from source.
- Across all platforms, we are currently aware of `scope` dependency issues with Python 3.12.
- Anaconda continues to cause problems with environment setup.
- Anaconda may cause problems with environment setup.
- Using `pip` to install `healpy` on an arm64 Mac can raise an error upon import. We recommend including `h5py` as a requirement during the creation of your `conda` environment.
- On Windows machines, `healpy` and `cesium` raise errors upon installation.
- For `healpy`, see [this](https://healpy.readthedocs.io/en/latest/install.html#installation-on-windows-through-the-windows-subsystem-for-linux) guide for a potential workaround.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -7,6 +7,7 @@ ZTF Variable Source Classification Project
developer
quickstart
usage
scripts
scanner
field_guide
allocation
121 changes: 121 additions & 0 deletions doc/scripts.md
@@ -0,0 +1,121 @@
# SCoPe script guide

The `hpc_files` directory in the `scope-ml` repository contains scripts, files and directory structures that can be used to quick-start SCoPe runs on HPC resources like SDSC Expanse or NCSA Delta. This page documents the contents of this directory and provides a high-level overview of what the scripts do and how to use them. After installing SCoPe, all the contents of `hpc_files` can be placed within the `scope` directory generated by the `scope-initialize` command.
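A minimal sketch of that copy step, assuming the `scope-ml` repository has been cloned and `scope-initialize` has already created the `scope` directory (paths are illustrative):

```shell
# Copy the HPC quick-start files into the working directory created by scope-initialize
cp -r /path/to/scope-ml/hpc_files/. /path/to/scope/
```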

Note that data files are not included in the `hpc_files` directory. The main files necessary to run the scripts detailed below are listed here and available on [Zenodo](https://zenodo.org/doi/10.5281/zenodo.8410825):
- `trained_models_dnn` and `trained_models_xgb`: download from Zenodo, unzip, and place the directories into `models_dnn` and `models_xgb`, respectively
- `training_set.parquet`: download from Zenodo and place it in the `fritzDownload` directory

Note also that most of the included scripts and directories can be generated from scratch using the following SCoPe scripts: `train-algorithm-slurm`, `generate-features-slurm`, `run-inference-slurm`, and `combine-preds-slurm`. The directories generated by these scripts are generally populated with two subdirectories: `logs` to contain slurm logs, and `slurm` to contain slurm scripts.

## Configuration

The `hpc_config.yaml` file contains the settings that have been used for SCoPe through April 2024. This file can be renamed to `config.yaml`, overwriting the standard file generated by `scope-initialize` and fast-tracking HPC runs. Tokens for **kowalski**, **wandb** and **fritz** should be obtained and added to this file to enable the SCoPe code to run.
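For example, run from the `scope` directory after copying the `hpc_files` contents there (the backup step is optional):

```shell
# Keep a copy of the auto-generated config, then adopt the HPC settings
cp config.yaml config.yaml.default
cp hpc_config.yaml config.yaml
```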

It is generally advisable to run SCoPe scripts from the main `scope` directory that contains your config file. You can also provide the `--config-path` argument to any script. Keep in mind that the code will default to checking your current directory for a file called `config.yaml` if you have not specified this argument.
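For example, using the `combine-preds` invocation that appears later in this guide (paths are illustrative):

```shell
# Default behavior: the script looks for ./config.yaml in the current directory
cd /path/to/scope
combine-preds --use-config-fields --write-csv

# Equivalent, with the config file specified explicitly
combine-preds --use-config-fields --write-csv --config-path /path/to/scope/config.yaml
```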

## Training

### Training scripts: `train_dnn_DR16.sh` and `train_xgb_DR16.sh`

Each of these scripts can be generated with `create-training-script`. They contain several calls to `scope-train` and initially served as the primary way to sequentially train each model. When `train-algorithm-job-submission` is run to train all classifiers in parallel, these scripts are parsed to identify the tags, group name and algorithm to pass to the training code.
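As an illustration only, a call inside one of these scripts might resemble the following; the tag and group values are placeholders, and the exact flags written by `create-training-script` may differ:

```shell
# Illustrative scope-train call as it might appear in train_dnn_DR16.sh (flags and values are placeholders)
scope-train --tag vnv --algorithm dnn --group DR16_DNN
```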

### Directories: `dnn_training` and `xgb_training`
These two directories are generated when running `train-algorithm-slurm`. The `slurm` subdirectories within each one are populated with three example scripts:

- `slurm_sing.sub`: trains a single classifier (specified with the `--tag` argument) using `scope-train`
- `slurm.sub`: uses a wildcard to serve as a training script for any `--tag`
- `slurm_submission.sub`: runs the `train-algorithm-job-submission` Python code to submit training jobs for all classifiers, referencing the training scripts mentioned above (see the submission sketch below)
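Assuming these are submitted from the main `scope` directory (so the relative paths inside the scripts resolve), submission might look like:

```shell
# Train the single classifier hard-coded in slurm_sing.sub
sbatch dnn_training/slurm/slurm_sing.sub

# Submit training jobs for all classifiers in parallel
sbatch dnn_training/slurm/slurm_submission.sub
```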

### Output: trained models in `models_dnn` and `models_xgb`
Trained models are saved in these two directories. The `--group` name passed to the training code will determine the subdirectory where the models are saved. Within this, each classifier gets its own subdirectory that includes the model files, diagnostic plots, and feature importance data (XGB only).

**To run inference with the latest trained models, download `trained_dnn_models.zip` and `trained_xgb_models.zip` from Zenodo and unzip them within the corresponding `models_dnn` or `models_xgb` directory.**
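For example (archive names follow the note above; depending on how the archives are packaged, you may need to move the unzipped contents up one directory level):

```shell
# Unzip the latest trained models into the directories created by scope-initialize
unzip trained_dnn_models.zip -d models_dnn/
unzip trained_xgb_models.zip -d models_xgb/
```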

## Generating Features

### Field-by-field feature generation

The primary way to generate features with SCoPe is to specify a ZTF field to run. The following directories contain example slurm scripts to perform this process.

#### Directories: `generated_features_new`, `generated_features_delta`
Each of these directories can be generated with `generate-features-slurm`. `generated_features_new` has been used extensively for SDSC Expanse jobs, while `generated_features_delta` contains experimental slurm scripts for the NCSA Delta resource. The `slurm` subdirectories within each one are populated with a data file and three example scripts:

- `slurm.dat`: the "quadrant file", generated using `check-quads-for-sources`, mapping each field/ccd/quadrant combination to an integer job number. Files for DR16, DR19 and DR20 are also included; the generic `slurm.dat` file is identical to the DR20 file.
- `slurm_sing.sub`: generates features for a single field, CCD, and quad (specified with `--field`, `--ccd`, and `--quad` arguments) using `generate-features`
- `slurm.sub`: uses a wildcard to serve as a feature generation script for any `--quadrant-index` in `slurm.dat`
- `slurm_submission.sub`: runs the `generate-features-job-submission` Python code to submit feature generation jobs for all config-specified fields (`feature_generation: fields_to_run:`) while excluding fields listed in `fields_to_exclude:` (see the submission sketch below)
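As with training, these are assumed to be submitted from the main `scope` directory, for example:

```shell
# Generate features for the single field/CCD/quad hard-coded in slurm_sing.sub
sbatch generated_features_new/slurm/slurm_sing.sub

# Submit feature-generation jobs for all config-specified fields
sbatch generated_features_new/slurm/slurm_submission.sub
```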

### Lightcurve-by-lightcurve feature generation

Another way to run SCoPe feature generation is to provide individual ZTF lightcurve IDs instead of fields. This requires some data wrangling to put the source list into a format SCoPe recognizes.

#### Notebook: `underMS_data_wrangling_notebook.ipynb`

This notebook contains an example of wrangling a list of designations, right ascensions and declinations into a SCoPe-friendly format. It demonstrates running a cone search for all ZTF lightcurves within a specified radius and then formatting the column names as SCoPe requires. The notebook then saves the resulting lightcurve list in batches so the feature generation process does not time out when running on SDSC Expanse.

#### Directory: `generated_features_underMS`

Once the lightcurve lists are generated and saved (in this example to the `underMS_ids_DR20` subdirectory), the following slurm script can be repeatedly queued to run feature generation:

- `dr20_slurm.sub`: uses a wildcard to serve as a feature generation script for any index (`$IDX`) in the batched filenames. For example, run `sbatch --export=IDX=0 dr20_slurm.sub` to run feature generation on `sources_ids_2arcsec_renamed_0.parquet`. A loop over all batches is sketched below.
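A convenience loop for submitting several batches might look like this (the batch count is illustrative; adjust the range to match your filenames):

```shell
# Submit feature generation for batch files 0 through 9
for IDX in $(seq 0 9); do
    sbatch --export=IDX=${IDX} dr20_slurm.sub
done
```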


### General feature generation advice/troubleshooting
The following advice and troubleshooting list is based on running ~70 fields' worth of feature generation on [SDSC Expanse](https://www.sdsc.edu/support/user_guides/expanse.html) resources. It may need to be adjusted when running on other resources.

#### Ensuring all quads run successfully
- When a field/ccd/quad job is queued to run, an empty file with a `.running` extension is saved. The code uses this file to keep track of which fields/ccds/quads have been queued for feature generation. Note that the existence of this file does not mean that feature generation necessarily completed; in some cases, the job may fail. It is important to verify that feature generation actually succeeded for all quads in a field. You can do this either by manually counting the files and comparing with expectations, or by re-running feature generation job submission for the same fields while setting the `--reset-running` flag (assuming all jobs have concluded). This will re-submit any jobs that did not produce the requisite `.parquet` file, or conclude immediately if all jobs are complete (see the sketch below).
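A minimal sketch of that resubmission pass; the `--reset-running` flag is described above, while any other required arguments (e.g. the directory or user) are omitted here and depend on your setup:

```shell
# After all queued jobs have finished, re-submit any field/ccd/quad jobs
# that did not produce the expected .parquet file
generate-features-job-submission --reset-running
```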

#### Fields with > 10,000,000 lightcurves
- Some fields have a particularly large number of lightcurves, especially those near the Galactic Plane. On the Expanse `gpu-shared` partition, there have been out-of-memory issues when using the standard `91G` of memory for fields with more than around 10,000,000 lightcurves. To avoid lost GPU time, identify these fields ahead of time using the included `DR19_field_counts.json` file. (This file was generated by running `scope.utils.get_field_count` on the `DR19_catalog_completeness.json` file, which itself was obtained using `tools.generate_features_slurm.check_quads_for_sources`.) Next, scale up the requested memory in `slurm.sub` in proportion to the number of lightcurves in the field divided by 10,000,000. Scale down the `--max-instances` argument in `slurm_submission.sub` by the same fraction to avoid running into cluster limits on memory requested per user. Note that as a result of this scaling, "large" fields will take more wall-clock time to run than they would if the maximum number of instances could be used simultaneously (see the worked example below).
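A worked example of that scaling, with an illustrative lightcurve count:

```shell
# Sketch: scale memory and instance count for a "large" field (the count is illustrative)
N_LC=18200000                                              # lightcurves in the field, from DR19_field_counts.json
SCALE=$(python3 -c "print(round(${N_LC}/10000000, 2))")    # ~1.82
echo "Scale factor: ${SCALE}"
# In slurm.sub, request roughly 91G * 1.82 ≈ 166G of memory.
# In slurm_submission.sub, reduce --max-instances by the same factor, e.g. 20 / 1.82 ≈ 11.
```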

#### Kowalski query limitations
- Note that while some compute resources may offer many cores that could parallelize and speed up Kowalski queries, once the number of simultaneous queries exceeds ~200 (e.g. beyond the 20 jobs with 9 cores each that we currently run in parallel), queries begin to fail and compute time is wasted.

#### Path to scope code/inputs/outputs should be the same
- While the config file supports specifying a `path_to_features` and `path_to_preds` that differ from the code installation location, it is easiest to install `scope-ml` in the same directory where the inputs will be stored and the outputs will be written. On a cluster, make sure this is not the home or scratch directory, but the project storage location.

#### Lightcurve-by-lightcurve memory requirements
- For lightcurve-by-lightcurve feature generation, try to limit the number of lightcurves in a batch to 100,000 and increase the memory to `182G`. The current code requires the user to manually run `sbatch` for each batch file, modifying the `--export=IDX=N` argument for each `N` in the batched filenames.

## Running Inference

### Inference scripts: `get_all_preds_dnn_DR16.sh` and `get_all_preds_xgb_DR16.sh`

Each of these scripts can be generated with `create-inference-script`. They contain a call to `run-inference` and can be run on their own to perform inference (one field at a time). They can also be used by running `run-inference-job-submission` to perform inference for all fields in parallel.
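For example, to run DNN inference on one field directly (the field number is illustrative and matches the one hard-coded in the included `slurm_sing.sub`):

```shell
# Run inference for a single ZTF field
./get_all_preds_dnn_DR16.sh 881
```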

### Directories: `dnn_inference` and `xgb_inference`

These two directories are generated when running `run-inference-slurm`. The `slurm` subdirectories within each one are populated with three example scripts:

- `slurm_sing.sub`: runs inference on a single field specified in the file
- `slurm.sub`: uses a wildcard to serve as an inference script for any field
- `slurm_submission.sub`: runs the `run-inference-job-submission` Python code to submit inference jobs for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:` (see the sketch below)
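For example, from the main `scope` directory (the `FID` export matches the wildcard used in `slurm.sub`; the field number is illustrative):

```shell
# Inference for a single field via slurm
sbatch --export=FID=881 dnn_inference/slurm/slurm.sub

# Submit inference jobs for all config-specified fields
sbatch dnn_inference/slurm/slurm_submission.sub
```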

## Combining Predictions

### Directory: `combine_preds`

The `slurm` subdirectory here contains a script to combine the predictions for the DNN and XGB algorithms:

- `slurm.sub`: runs `combine-preds` for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`, writing both parquet and CSV files (see the example below)
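For example, from the main `scope` directory:

```shell
# Combine DNN and XGB predictions for all config-specified fields (via slurm)
sbatch combine_preds/slurm/slurm.sub

# Or run the underlying command directly
combine-preds --use-config-fields --write-csv
```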

## Classifying variables near GCN transient candidates

One special application of SCoPe is to classify variable sources that are near (in angular separation) GCN transient candidates listed on fritz. In this workflow, small-scale feature generation is run on SDSC Expanse before running inference locally and uploading any high-confidence classifications to fritz (see **Guide for Fritz Scanners** for more details).

### GCN inference scripts: `get_all_preds_dnn_GCN.sh`, `get_all_preds_xgb_GCN.sh`

These scripts are nearly identical to the inference scripts referenced above, but inference results are saved to different directories.

### Directory: `generated_features_GCN_sources`

The `slurm` subdirectory within it contains two example scripts (a submission example follows the list):

- `gpu-debug_slurm.sub`: uses wildcards to run small-scale feature generation for a list of sources from a given GCN `dateobs`. This is the script that is run by default, since the `gpu-debug` partition on Expanse offers enough resources with shorter wait times than `gpu-shared`.
- `gpu-shared_slurm.sub`: the same as `gpu-debug_slurm.sub`, but running on the `gpu-shared` partition of Expanse.
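These are typically launched with `DOBS` (the GCN event's `dateobs`) and `DS` (the source-list filename) export variables, mirroring the `sbatch` call in `gcn_cronjob.py`; the values below are placeholders, and their exact format follows that script:

```shell
# Small-scale feature generation for sources associated with one GCN event
sbatch --export=DOBS=2024-04-06T12-00-00,DS=specific_ids_sources.parquet \
    generated_features_GCN_sources/slurm/gpu-debug_slurm.sub
```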

### Script: `gcn_cronjob.py`

See more details about how this script can be run automatically in the **Usage/Running automated analyses** section of the documentation.
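As a rough, hypothetical sketch only (the supported setup is described in the Usage documentation; paths, environment location, and cadence here are all placeholders):

```shell
# Hypothetical: schedule gcn_cronjob.py to run hourly via cron, appending output to a log file
( crontab -l 2>/dev/null; \
  echo '0 * * * * cd /path/to/scope && /path/to/miniconda3/envs/scope-env/bin/python gcn_cronjob.py >> gcn_cronjob.log 2>&1' ) | crontab -
```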
2 changes: 1 addition & 1 deletion doc/usage.md
@@ -188,7 +188,7 @@ inputs:
8. --period-batch-size : maximum number of sources to simultaneously perform period finding (int)
9. --doCPU : flag to run config-specified CPU period algorithms (bool)
10. --doGPU : flag to run config-specified GPU period algorithms (bool)
11. --samples_per_peak : number of samples per periodogram peak (int)
11. --samples-per-peak : number of samples per periodogram peak (int)
12. --doScaleMinPeriod : for period finding, scale min period based on min-cadence-minutes (bool). Otherwise, set --max-freq to desired value
13. --doRemoveTerrestrial : remove terrestrial frequencies from period-finding analysis (bool)
14. --Ncore : number of CPU cores to parallelize queries (int)
10 changes: 5 additions & 5 deletions gcn_cronjob.py
@@ -202,8 +202,8 @@ def query_gcn_events(
os.system(
f'ssh -tt {username}@login.expanse.sdsc.edu \
"source .bash_profile && \
cd /expanse/lustre/projects/umn131/{username}/{generated_features_dirname}/slurm && \
sbatch --wait --export=DOBS={save_dateobs},DS={filepath.name} {partition}_slurm.sub"'
cd /expanse/lustre/projects/umn131/{username} && \
sbatch --wait --export=DOBS={save_dateobs},DS={filepath.name} {generated_features_dirname}/slurm/{partition}_slurm.sub"'
)
print("Finished generating features on Expanse.")

@@ -251,7 +251,7 @@ def query_gcn_events(
try:
generator = scope.select_fritz_sample(
fields=[f"{save_dateobs}_specific_ids"],
group="DR16_importance",
group="trained_xgb_models",
algorithm="xgb",
probability_threshold=0.0,
consol_filename=f"inference_results_{save_dateobs}",
@@ -266,7 +266,7 @@ def query_gcn_events(

generator = scope.select_fritz_sample(
fields=[f"{save_dateobs}_specific_ids"],
group="nobalance_DR16_DNN",
group="trained_dnn_models",
algorithm="dnn",
probability_threshold=0.0,
consol_filename=f"inference_results_{save_dateobs}",
@@ -306,7 +306,7 @@ def query_gcn_events(
try:
upload_classification(
file=f"{BASE_DIR}/{combined_preds_dirname}/{save_dateobs}/merged_GCN_sources_{save_dateobs}.parquet",
classification="read",
classification=["read"],
taxonomy_map=f"{BASE_DIR}/{taxonomy_map}",
skip_phot=True,
use_existing_obj_id=True,
1 change: 1 addition & 0 deletions hpc_files/DR19_catalog_completeness.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions hpc_files/DR19_field_counts.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions hpc_files/combine_preds/logs/README
@@ -0,0 +1 @@
Slurm logs go here
15 changes: 15 additions & 0 deletions hpc_files/combine_preds/slurm/slurm.sub
@@ -0,0 +1,15 @@
#!/bin/bash
#SBATCH --job-name=combine_preds.job
#SBATCH --output=combine_preds/logs/combine_preds_%A_%a.out
#SBATCH --error=combine_preds/logs/combine_preds_%A_%a.err
#SBATCH -p shared
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --mem 128G
#SBATCH -A umn131
#SBATCH --time=48:00:00
#SBATCH --mail-type=ALL
#SBATCH [email protected]
module purge
source activate scope-env
combine-preds --use-config-fields --write-csv
1 change: 1 addition & 0 deletions hpc_files/dnn_inference/logs/README
@@ -0,0 +1 @@
Slurm logs go here
16 changes: 16 additions & 0 deletions hpc_files/dnn_inference/slurm/slurm.sub
@@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --job-name=run_inference_dnn.job
#SBATCH --output=dnn_inference/logs/run_inference_dnn_%A_%a.out
#SBATCH --error=dnn_inference/logs/run_inference_dnn_%A_%a.err
#SBATCH -p shared
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 10
#SBATCH --gpus 0
#SBATCH --mem 64G
#SBATCH --time=48:00:00
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -A umn131
module purge
source activate scope-env
./get_all_preds_dnn_DR16.sh $FID
16 changes: 16 additions & 0 deletions hpc_files/dnn_inference/slurm/slurm_sing.sub
@@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --job-name=run_inference_dnn.job
#SBATCH --output=dnn_inference/logs/run_inference_dnn_%A_%a.out
#SBATCH --error=dnn_inference/logs/run_inference_dnn_%A_%a.err
#SBATCH -p shared
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 16
#SBATCH --gpus 0
#SBATCH --mem 64G
#SBATCH --time=48:00:00
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -A umn131
module purge
source activate scope-env
./get_all_preds_dnn_DR16.sh 881
15 changes: 15 additions & 0 deletions hpc_files/dnn_inference/slurm/slurm_submission.sub
@@ -0,0 +1,15 @@
#!/bin/bash
#SBATCH --job-name=run_inference_dnn_submit.job
#SBATCH --output=dnn_inference/logs/run_inference_dnn_submit_%A_%a.out
#SBATCH --error=dnn_inference/logs/run_inference_dnn_submit_%A_%a.err
#SBATCH -p shared
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH -A umn131
#SBATCH --time=24:00:00
#SBATCH --mail-type=ALL
#SBATCH [email protected]
module purge
module add slurm
source activate scope-env
run-inference-job-submission --dirname dnn_inference --user dwarshofsky --algorithm dnn
1 change: 1 addition & 0 deletions hpc_files/dnn_training/logs/README
@@ -0,0 +1 @@
Slurm logs go here
