This file contains instructions for Parameter Estimation when applying the HMMScan method to a use case dataset. The instructions can be used either to replicate the use case results in the paper or to run HMMScan on a new sequence.
- sequence_name: string, name of lot sequence (e.g., dfa_by_date_ex_iqr_outliers or dfa_by_date).
- ae_type: string, name of AE type (e.g., serious_std). This is either serious_std, exp_no_admin_std, serious_std_iqr_outlier_ceiling, or exp_no_admin_std_iqr_outlier_ceiling for paper result replication.
- s_max: integer, maximum number of candidate HMM states. This is 4 for paper result replication.
- c_max: integer, maximum number of candidate HMM mixture components. This is 9 for paper result replication.
- output_subdir: string, name of subdirectory of ae-project/results/use_case/random_initializations to store the HMM fitting results.
Ensure that the file ae-project/data/use_case/[sequence_name].csv has been created. If you are providing your own sequence rather than replicating the paper results, see the User-Provided Input Data for details.
Ensure that the ae-project repository has been downloaded locally (see the repository README file for instructions).
If this repository is downloaded, then the necessary input data files for paper result replication will already be available.
From the top level of this directory on Engaging, run the following command for each combination of sequence_name and ae_type to fit 1-state models:
sbatch --array=0-49 --time=0-00:30:00 hmmscan/cluster/scan_use_case_parallel.sh seed_starter [sequence_name] [ae_type] 1 [output_subdir]
Then, run the following command for each combination of sequence_name and ae_type, once for each number of states n_states from 2 to s_max:
sbatch --array=0-[c_max * 50 - 1] --time=0-01:00:00 hmmscan/cluster/scan_use_case_parallel.sh grid_component_seed_starter [sequence_name] [ae_type] [n_states] [output_subdir]
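The array range in the multi-state command follows directly from c_max: the task indices run from 0 to c_max * 50 - 1, which presumably corresponds to 50 random initializations for each candidate number of mixture components. The following is a minimal Python sketch of that arithmetic. It only builds and prints the command strings and does not submit anything; the helper names array_range and multi_state_commands are hypothetical and are not part of the repository.

```python
def array_range(c_max, n_inits=50):
    """Return the SLURM --array range string 0-[c_max * n_inits - 1]."""
    return f"0-{c_max * n_inits - 1}"


def multi_state_commands(sequence_name, ae_type, s_max, c_max, output_subdir):
    """Build one sbatch command string for each n_states from 2 to s_max."""
    script = "hmmscan/cluster/scan_use_case_parallel.sh"
    return [
        f"sbatch --array={array_range(c_max)} --time=0-01:00:00 {script} "
        f"grid_component_seed_starter {sequence_name} {ae_type} "
        f"{n_states} {output_subdir}"
        for n_states in range(2, s_max + 1)
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in multi_state_commands(
        "dfa_by_date", "serious_std_iqr_outlier_ceiling", 4, 9, "by_date"
    ):
        print(cmd)
```

Printing the commands first makes it easy to check the array range and arguments before submitting them on the cluster.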
The commands above must be run for each of the following sequence_name, ae_type combinations. Items 7-12 create results for Section S5 in Online Resource 1:
1. dfa_by_date_ex_iqr_outliers, serious_std
2. dfb_by_date_ex_iqr_outliers, serious_std
3. dfc_by_date_ex_iqr_outliers, serious_std
4. dfa_by_date_ex_iqr_expedited, exp_no_admin_std
5. dfb_by_date_ex_iqr_expedited, exp_no_admin_std
6. dfc_by_date_ex_iqr_expedited, exp_no_admin_std
7. dfa_by_date, serious_std_iqr_outlier_ceiling
8. dfb_by_date, serious_std_iqr_outlier_ceiling
9. dfc_by_date, serious_std_iqr_outlier_ceiling
10. dfa_by_date, exp_no_admin_std_iqr_outlier_ceiling
11. dfb_by_date, exp_no_admin_std_iqr_outlier_ceiling
12. dfc_by_date, exp_no_admin_std_iqr_outlier_ceiling
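The twelve combinations above follow a regular pattern: three sequence prefixes (dfa, dfb, dfc) crossed with four pairs of sequence suffix and AE type. A minimal Python sketch that regenerates the list in the same order is shown below. The names PREFIXES, VARIANTS, and replication_combinations are hypothetical and chosen only for illustration.

```python
# Sequence prefixes and (sequence suffix, ae_type) pairs, in the listed order.
PREFIXES = ["dfa", "dfb", "dfc"]
VARIANTS = [
    ("by_date_ex_iqr_outliers", "serious_std"),
    ("by_date_ex_iqr_expedited", "exp_no_admin_std"),
    ("by_date", "serious_std_iqr_outlier_ceiling"),
    ("by_date", "exp_no_admin_std_iqr_outlier_ceiling"),
]


def replication_combinations():
    """Return the (sequence_name, ae_type) pairs for items 1-12, in order."""
    return [
        (f"{prefix}_{suffix}", ae_type)
        for suffix, ae_type in VARIANTS
        for prefix in PREFIXES
    ]


if __name__ == "__main__":
    for item, (sequence_name, ae_type) in enumerate(replication_combinations(), 1):
        print(item, sequence_name, ae_type)
```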
Here is the command for the single state model fitting for items 1-6. Use by_date for the final argument for items 7-12.
sbatch --array=0-49 --time=0-00:30:00 hmmscan/cluster/scan_use_case_parallel.sh seed_starter [sequence_name] [ae_type] 1 by_date_ex_iqr
Here is the command for the multiple state model fitting. This is run for n_states equal to 2, 3, then 4.
sbatch --array=0-449 --time=0-01:00:00 hmmscan/cluster/scan_use_case_parallel.sh grid_component_seed_starter [sequence_name] [ae_type] [n_states] by_date_ex_iqr
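As a rough check on the size of the replication run, the number of SLURM array tasks can be worked out from the --array ranges shown above. The sketch below is only arithmetic on those ranges, assuming the paper settings s_max = 4 and c_max = 9 and twelve combinations. The function name is hypothetical, and the count says nothing about how many models each task fits.

```python
def count_array_tasks(s_max=4, c_max=9, n_inits=50, n_combinations=12):
    """Count SLURM array tasks implied by the --array ranges."""
    single_state = n_inits                      # --array=0-49
    per_multi_state_run = c_max * n_inits       # --array=0-449 when c_max is 9
    multi_state = (s_max - 1) * per_multi_state_run  # n_states = 2, ..., s_max
    per_combination = single_state + multi_state
    return per_combination, per_combination * n_combinations


if __name__ == "__main__":
    per_combination, total = count_array_tasks()
    print(f"{per_combination} tasks per combination, {total} tasks in total")
```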
Step 2 generates a file for each combination of sequence_name, ae_type, number of hidden states, number of mixture components, and random initialization in ae-project/results/use_case/random_initializations/[output_subdir].
On Engaging, run the following commands in an interactive session:
- Load R 4.1: module load R/4.1.0
- Run aggregate_scan_results.R: Rscript hmmscan/scripts/scans/aggregate_scan_results.R scans/use_case/random_initializations/[output_subdir]

This script will generate a CSV file called ae-project/results/scans/use_case/random_initializations/[output_subdir].csv.
Use these commands:
Rscript hmmscan/scripts/scans/aggregate_scan_results.R scans/use_case/random_initializations/by_date_ex_iqr
Rscript hmmscan/scripts/scans/aggregate_scan_results.R scans/use_case/random_initializations/by_date
On Engaging, run the following commands in an interactive session:
- Load R 4.1: module load R/4.1.0
- Run get_best_initializations.R: Rscript hmmscan/scripts/scans/get_best_initializations.R [output_subdir].csv

This script will generate a CSV file called ae-project/results/scans/use_case/best_initializations/[output_subdir].csv.
Use these commands:
Rscript hmmscan/scripts/scans/get_best_initializations.R by_date_ex_iqr.csv
Rscript hmmscan/scripts/scans/get_best_initializations.R by_date.csv
For this section, you will need to look at the CSV file generated in step 4 and find the best number of states and mixture components for each sequence_name and ae_type combination.
The best structure is referred to below as best_n_states and best_n_mix_comps.
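Finding the best structure by hand can be error-prone when there are many candidates, so a small script may help. The following is a minimal Python sketch, not the repository's method. It assumes that the CSV has columns named sequence_name, ae_type, n_states, n_mix_comps, and bic, and that a lower BIC is better; these column names and the BIC convention are assumptions that should be checked against the actual file before use. The rows in TOY_CSV are made up for illustration and are not results from the paper.

```python
import csv
import io


def best_structures(rows):
    """Map (sequence_name, ae_type) to the (n_states, n_mix_comps) with lowest BIC.

    Column names are hypothetical; adjust them to match the real CSV header.
    """
    best = {}
    for row in rows:
        key = (row["sequence_name"], row["ae_type"])
        bic = float(row["bic"])
        if key not in best or bic < best[key][0]:
            best[key] = (bic, int(row["n_states"]), int(row["n_mix_comps"]))
    return {key: (n_states, n_mix) for key, (_, n_states, n_mix) in best.items()}


# Toy rows for illustration only; these are not results from the paper.
TOY_CSV = """sequence_name,ae_type,n_states,n_mix_comps,bic
seq_x,type_y,1,1,120.0
seq_x,type_y,2,3,95.5
seq_x,type_y,3,2,101.2
"""

if __name__ == "__main__":
    print(best_structures(csv.DictReader(io.StringIO(TOY_CSV))))
```

To apply this to a real file, the StringIO object would be replaced with an open file handle for the CSV generated in step 4.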
On Engaging, run the following commands in an interactive session:
- Load Python 3.9: module load python/3.9.4
- Run scripts/state_prediction.py: python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/[output_subdir].csv use_case [sequence_name] [ae_type] [best_n_states] [best_n_mix_comps]

This script will generate a file in ae-project/results/state_prediction/use_case for each sequence_name, ae_type, best_n_states, and best_n_mix_comps.
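Because the state prediction command takes six positional arguments, it is easy to transpose two of them. The following minimal Python sketch assembles the command string from named parameters, using the argument order shown above. The function name state_prediction_command is hypothetical, and the sketch only builds the string without running it.

```python
import shlex


def state_prediction_command(
    output_subdir, sequence_name, ae_type, best_n_states, best_n_mix_comps
):
    """Build the state prediction command string for one combination."""
    return shlex.join(
        [
            "python",
            "-m",
            "hmmscan.scripts.state_prediction.state-prediction",
            f"scans/use_case/best_initializations/{output_subdir}.csv",
            "use_case",
            sequence_name,
            ae_type,
            str(best_n_states),
            str(best_n_mix_comps),
        ]
    )


if __name__ == "__main__":
    print(
        state_prediction_command(
            "by_date_ex_iqr", "dfa_by_date_ex_iqr_expedited", "exp_no_admin_std", 3, 3
        )
    )
```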
Run these commands, where items 7-12 create results for Section S5 in Online Resource 1:
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfa_by_date_ex_iqr_expedited exp_no_admin_std 3 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfb_by_date_ex_iqr_expedited exp_no_admin_std 2 2
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfc_by_date_ex_iqr_expedited exp_no_admin_std 1 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfa_by_date_ex_iqr_outliers serious_std 3 2
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfb_by_date_ex_iqr_outliers serious_std 3 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date_ex_iqr.csv use_case dfc_by_date_ex_iqr_outliers serious_std 2 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfa_by_date serious_std_iqr_outlier_ceiling 3 2
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfb_by_date serious_std_iqr_outlier_ceiling 3 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfc_by_date serious_std_iqr_outlier_ceiling 2 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfa_by_date exp_no_admin_std_iqr_outlier_ceiling 3 2
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfb_by_date exp_no_admin_std_iqr_outlier_ceiling 2 3
python -m hmmscan.scripts.state_prediction.state-prediction scans/use_case/best_initializations/by_date.csv use_case dfc_by_date exp_no_admin_std_iqr_outlier_ceiling 1 3
It is probably easiest to generate the necessary plots locally rather than on Engaging. To do so, copy ae-project/results/scans/use_case/best_initializations/[output_subdir].csv and the contents of ae-project/results/state_prediction/use_case into the same relative file locations in your local version of ae-project.
Then, you can run hmmscan/scripts/viz/bic.R to view the BICs of the HMM model candidates, and hmmscan/scripts/viz/best_model_dists_and_predictions.R to view the characteristics of the models with the best BICs.
If you are using your own lot sequence and not replicating the paper results, then you will need to adjust these visualization scripts.