Fix hyperlinks #219

Open · wants to merge 9 commits into base: master
12 changes: 6 additions & 6 deletions docs/source/README.md
@@ -139,7 +139,7 @@ you can use any Pyomo-supported solver by setting the environment variable `HATC
`glpk`, or any other Pyomo-supported solver. For example, `export HATCHET_COMPUTE_CN_SOLVER=cbc`.

Alternatively, you can set the `solver` key in the `compute_cn`
section of your `hatchet.ini` (if using the [hatchet run](doc_runhatchet.html) command) to a specific Pyomo-supported
section of your `hatchet.ini` (if using the [hatchet run](doc_runhatchet.md) command) to a specific Pyomo-supported
solver. Make sure the relevant solver binaries are in your `$PATH`, otherwise Pyomo will not be able to find them
correctly.
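As a minimal sketch of the configuration described above: the `[compute_cn]` section name and `solver` key come from the docs, while the exact ini layout and the precedence between the file and the `HATCHET_COMPUTE_CN_SOLVER` environment variable are assumptions here.

```python
# Sketch only: reads an assumed hatchet.ini layout with Python's stdlib
# configparser, then lets the environment variable override it.
import configparser
import os

ini_text = """
[compute_cn]
solver = cbc
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# Assumed precedence: environment variable first, then the ini key.
solver = os.environ.get("HATCHET_COMPUTE_CN_SOLVER",
                        config.get("compute_cn", "solver"))
print(solver)
```

Whichever solver name ends up selected, the corresponding binary must still be discoverable on `$PATH` for Pyomo to use it.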

@@ -173,14 +173,14 @@ HATCHet requires 3 input data files:
The repository includes all the components that are required to cover every step of the entire HATCHet pipeline, starting from the processing of raw data reported in a BAM file through the analysis of the final results.
We provide:

- <a name="fullpipelineandtutorial"></a> A script representing the [full pipeline](doc_fullpipeline.html#fullpipelineandtutorial) of HATCHet, and we describe in details the whole script through a tutorial with instructions for usage.
- <a name="demos"></a> [Demos](doc_fullpipeline.html#demos) that correspond to guided executions of HATCHet on some examples, and explain in detail the usage of HATCHet when considering standard datasets, real datasets with high noise, and different kind of data.
- <a name="custompipelines"></a> [Custom pipelines](doc_fullpipeline.html#custompipelines) which adapt the full HATCHet's pipeline to special conditions or integrates pre-processed data belonging to different pipelines.
- <a name="fullpipelineandtutorial"></a> A script representing the [full pipeline](doc_fullpipeline.md#fullpipelineandtutorial) of HATCHet; we describe the whole script in detail through a tutorial with instructions for usage.
- <a name="demos"></a> [Demos](doc_fullpipeline.md#demos) that correspond to guided executions of HATCHet on some examples, explaining in detail the usage of HATCHet on standard datasets, real datasets with high noise, and different kinds of data.
- <a name="custompipelines"></a> [Custom pipelines](doc_fullpipeline.md#custompipelines) which adapt the full HATCHet pipeline to special conditions or integrate pre-processed data from different pipelines.

The implementation of HATCHet is highly modular and one can replace any HATCHet's module with any other method to obtain the required results (especially for the pre-processing modules).
As such, we also provide here an overview of the entire pipeline and we describe the <a name="detailedsteps"></a> [details of each step](doc_fullpipeline.html#detailedsteps) in a dedicated section of the manual.
As such, we also provide here an overview of the entire pipeline and we describe the <a name="detailedsteps"></a> [details of each step](doc_fullpipeline.md#detailedsteps) in a dedicated section of the manual.

- <a name="recommendations"></a> [Recommendations](doc_fullpipeline.html#recommendations), especially for noisy datasets or with different features, to guide the user in the interpretation of HATCHet's inference. We explain how to perform quality control to guarantee the best-quality results, and describe how the user can control and tune some of the parameters to obtain the best-fitting results.
- <a name="recommendations"></a> [Recommendations](doc_fullpipeline.md#recommendations), especially for noisy datasets or with different features, to guide the user in the interpretation of HATCHet's inference. We explain how to perform quality control to guarantee the best-quality results, and describe how the user can control and tune some of the parameters to obtain the best-fitting results.

## Current issues
<a name="currentissues"></a>
2 changes: 1 addition & 1 deletion docs/source/doc_check.md
@@ -6,7 +6,7 @@ All checks can be run simultaneously via `hatchet check`, or an individual comma

The check for `compute-cn` runs the step on a set of small data files (.bbc/.seg) pre-packaged with HATCHet, and is a quick way to verify if your solver is working correctly.
If you are unable to run this command, it likely indicates a licensing issue with the default (Gurobi) solver. To use alternative solvers, see the
[Using a different Pyomo-supported solver](README.html#usingasolver_other) section of the README for more details.
[Using a different Pyomo-supported solver](README.md#usingasolver_other) section of the README for more details.

## Input

10 changes: 5 additions & 5 deletions docs/source/doc_cluster_bins_gmm.md
@@ -61,13 +61,13 @@ If your clusters do not appear to be cohesive, try lowering the maximum number o
| `-K`, `--initclusters` | Maximum number of clusters | The parameter specifies the maximum number of clusters to infer, i.e., the maximum number of GMM components | 50 |
| `-c`, `--concentration` | Concentration parameter for clustering | This parameter determines how much confidence the GMM has in different types of clusterings. Higher values (e.g., 10 or 100) favor fewer clusters, and smaller values (e.g., 0.01 or 0.001) favor more clusters. For experts, this is the alpha parameter for the Dirichlet process prior. | 1/K |

3. cluster-bins-gmm offers a bootstraping approach that allows a succesfull clustering even when there is a limited number genomic bins that are considred. The bootstraping approach generates sinthetic (i.e. used only for clustering) bins based on the data of the given bins. The bootstraping is controlled by the following parameters.
3. cluster-bins-gmm offers a bootstrapping approach that allows a successful clustering even when a limited number of genomic bins is considered. The bootstrapping approach generates synthetic bins (i.e., bins used only for clustering) based on the data of the given bins. The bootstrapping is controlled by the following parameters.

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-u`, `--bootclustering` | Number of sinthetic bins to generate | Sinthetic bins can be generated based on the RDR and BAF of given bins and are added only to the clustering to improve it when the total number of bins is low (e.g. when considering data from WES) | 0, not used |
| `-dR`,`--ratiodeviation` | Standard deviation for generate RDR of sinthetic bins | The parameter affects the variance of the generated data, this value can be estimated from given bins and plot-bins generates informative plots to do this | 0.02 |
| `-dB`,`--bafdeviation` | Standard deviation for generate BAF of sinthetic bins | The parameter affects the variance of the generated data, this value can be estimated from given bins and plot-bins generates informative plots to do this | 0.02 |
| `-u`, `--bootclustering` | Number of synthetic bins to generate | Synthetic bins can be generated based on the RDR and BAF of given bins and are added only to the clustering to improve it when the total number of bins is low (e.g. when considering data from WES) | 0, not used |
| `-dR`,`--ratiodeviation` | Standard deviation for generating the RDR of synthetic bins | The parameter affects the variance of the generated data; this value can be estimated from the given bins, and plot-bins generates informative plots to help with this | 0.02 |
| `-dB`,`--bafdeviation` | Standard deviation for generating the BAF of synthetic bins | The parameter affects the variance of the generated data; this value can be estimated from the given bins, and plot-bins generates informative plots to help with this | 0.02 |
| `-s`, `--seed` | Random seed | The value is used to seed the random generation of RDR and BAF of synthetic bins | 0 |
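The bootstrapping idea above can be sketched as follows. This is an illustration, not HATCHet's implementation: observed (RDR, BAF) bins are resampled and perturbed with Gaussian noise, with defaults mirroring `-dR`/`-dB` (0.02) and the seed `-s` (0).

```python
# Hypothetical sketch of bootstrapped synthetic bins; the function name and
# exact procedure are assumptions for illustration.
import random

def bootstrap_bins(bins, n_synthetic, d_rdr=0.02, d_baf=0.02, seed=0):
    """Generate n_synthetic (RDR, BAF) points from the observed bins."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        rdr, baf = rng.choice(bins)  # resample one observed bin
        synthetic.append((rdr + rng.gauss(0.0, d_rdr),
                          baf + rng.gauss(0.0, d_baf)))
    return synthetic

observed = [(1.0, 0.5), (0.6, 0.3), (1.4, 0.45)]
extra = bootstrap_bins(observed, n_synthetic=100)
print(len(extra))  # 100
```

Because the generator is seeded, repeated runs produce the same synthetic bins, which keeps the clustering reproducible.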

4. cluster-bins-gmm offers a basic iterative process to merge clusters according to given tolerances. This feature can be used to refine the results of the GMM clustering and merge distinct clusters that are not sufficiently distinguished. This process can be controlled by the following parameters.
@@ -81,5 +81,5 @@ If your clusters do not appear to be cohesive, try lowering the maximum number o

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-v`, `--verbose` | Verbose logging flag | When enabled, combine-counts outputs a verbose log of the executiong | Not used |
| `-v`, `--verbose` | Verbose logging flag | When enabled, combine-counts outputs a verbose log of the execution | Not used |
| `-r`, `--disablebar` | Disabling progress-bar flag | When enabled, the output progress bar is disabled | Not used |
2 changes: 1 addition & 1 deletion docs/source/doc_combine_counts.md
@@ -6,7 +6,7 @@ This step constructs variable-length bins that ensure that each bin has at least

`combine-counts` takes as input the output from `count-reads` (i.e., two gzipped files `ch.total.gz` and `ch.thresholds.gz` for each chromosome `ch`). Use the `-A, --array` argument to specify a directory containing these input files.

It also requires (specified by the flag `-b`, `--baffile`) a tab-separated file specifying the allele counts for heterzygous germline SNPs from all tumor samples. The tab separated file would typically be produced by the `count-alleles` command and has the following fields:
It also requires (specified by the flag `-b`, `--baffile`) a tab-separated file specifying the allele counts for heterozygous germline SNPs from all tumor samples. This tab-separated file would typically be produced by the `count-alleles` command and has the following fields:

| Field | Description |
|-------|-------------|
2 changes: 1 addition & 1 deletion docs/source/doc_combine_counts_fw.md
@@ -67,7 +67,7 @@ combine-counts has some main parameters; the main values of these parameters all

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-v`, `--verbose` | Verbose logging flag | When enabled, combine-counts outputs a verbose log of the executiong | Not used |
| `-v`, `--verbose` | Verbose logging flag | When enabled, combine-counts outputs a verbose log of the execution | Not used |
| `-r`, `--disablebar` | Disabling progress-bar flag | When enabled, the output progress bar is disabled | Not used |
| `-b`, `--normalbafs` | File of allele counts for SNPs in matched-normal sample | When provided, combine-counts attempts to correct the estimated BAF using the variance in matched-normal sample. | Not used (deprecated) |
| `-d`, `--diploidbaf` | Maximum expected shift from 0.5 for BAF of diploid or tetraploid clusters | The maximum shift is used to identify potential bins with base states (1, 1) or (2, 2) whose BAF needs to be corrected. The value depends on the variance in the data (related to noise and coverage); generally, higher variance requires a higher shift. Information provided by plot-bins can help to decide this value in special datasets. | 0.08 (other typically suggested values are 0.1-0.11 for higher variance and 0.06 for low variance) |
8 changes: 4 additions & 4 deletions docs/source/doc_compute_cn.md
@@ -86,15 +86,15 @@ This heuristic can be controlled by the following parameters:

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-c`, `--clonal` | The required clusters and corresponding copy number states | User can directly specifies the required clusters and the corresponding copy number states to compute the allele-specific fractional copy number states. These must be speficied in the format `IDX-1:A-1:B-1, ..., IDX-M:A-M:B-M` where `IDX-S` is the name of cluster `S` and `(A-S, B-S)` is the corresponding copy-number state. Moreover, user can specify addittional clusters and copy numbers beyond the required ones. The copy-numbers for these clusters will be fixed during the computation. This can be an usefule feature for especially noisy datasets. | None |
| `-c`, `--clonal` | The required clusters and corresponding copy number states | The user can directly specify the required clusters and the corresponding copy-number states used to compute the allele-specific fractional copy numbers. These must be specified in the format `IDX-1:A-1:B-1, ..., IDX-M:A-M:B-M` where `IDX-S` is the name of cluster `S` and `(A-S, B-S)` is the corresponding copy-number state. Moreover, the user can specify additional clusters and copy numbers beyond the required ones; the copy numbers for these clusters will be fixed during the computation. This can be a useful feature for especially noisy datasets. | None |
| `-ts`, `--minsize` | Threshold for size of clusters | The minimum size of the clusters to consider for the heuristic that identifies clonal clusters. The non-selected clusters will not be considered as potential tumor-clonal clusters. The threshold must be expressed as a fraction of the entire genome. | 0.02, e.g. `2%` of genome |
| `-tc`, `--minchrs` | Threshold for number of chromosomes | The minimum number of chromosomes covered by the clusters to consider for the heuristic that identifies clonal clusters. The non-selected clusters will not be considered as potential tumor-clonal clusters. | 1 |
| `-td`, `--maxneutralshift` | Maximum BAF shift allowed for diploid cluster | The maximum expected shift from 0.5 for BAF for a diploid or tetraploid cluster (i.e. with copy-number states (1, 1) or (2, 2)). This threshold is used for two goals to identify the diploid or tetraploid cluster. | 0.1 |
| `-tR`, `--toleranceRDR` | Maximum RDR tolerance | The maximum RDR tolerance used by the heuristic when estimating the position of all clonal clusters | 0.04 |
| `-tB`, `--toleranceBAF` | Maximum BAF tolerance | The maximum BAF tolerance used by the heuristic when estimating the position of all clonal clusters | 0.03 |
| `--merge` | Activate merging of clusters | When activated, the heuristic will merge together clusters that appear to have the same values of RDR and BAF, according to the values below. This procedure can help the heuristic by refining the clustering and merging clusters that are likely to have the same copy-number states and unlikely to be clonal. | False, not used |
| `-mR`, `--mergeRDR` | RDR merging threhsold | The maximum difference in RDR considered by the merging procedure | 0 |
| `-mB`, `--mergeBAF` | BAF merging threhsold | The maximum difference in BAF considered by the merging procedure | 0 |
| `-mR`, `--mergeRDR` | RDR merging threshold | The maximum difference in RDR considered by the merging procedure | 0 |
| `-mB`, `--mergeBAF` | BAF merging threshold | The maximum difference in BAF considered by the merging procedure | 0 |
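The `--merge` procedure above can be sketched as a greedy grouping of clusters whose RDR and BAF differ by at most the given thresholds. This is an illustration only: the `-mR`/`-mB` defaults are 0, so the example below uses the `-tR`/`-tB` tolerance values (0.04/0.03) as plausible merging thresholds, and the real procedure in hatchet may differ.

```python
# Hypothetical sketch of threshold-based cluster merging; names and the
# greedy strategy are assumptions for illustration.
def merge_clusters(clusters, m_rdr=0.04, m_baf=0.03):
    """clusters: dict name -> (rdr, baf). Returns merged groups of names."""
    groups = []
    for name, (rdr, baf) in sorted(clusters.items()):
        for group in groups:
            g_rdr, g_baf = clusters[group[0]]  # compare to group representative
            if abs(rdr - g_rdr) <= m_rdr and abs(baf - g_baf) <= m_baf:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

cl = {"0": (1.00, 0.50), "1": (1.02, 0.49), "2": (0.60, 0.30)}
print(merge_clusters(cl))  # [['0', '1'], ['2']]
```

Clusters 0 and 1 fall within both tolerances of each other and are merged, while cluster 2 stays separate.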

## Simultaneous factorization

@@ -107,7 +107,7 @@ hatchet solves a constrained and distance-based variant of the factorization tha
| `-eD`, `--diploidcmax` | Maximum copy number with no WGD | The value of maximum copy number that is considered when assuming no WGD. When `0`-value is specified the maximum copy number is directly inferred from the data by rounding the maximum fractional copy number | 8 |
| `-eT`, `--tetraploidcmax` | Maximum copy number with a WGD | The value of maximum copy number that is considered when assuming there is a WGD. When `0`-value is specified the maximum copy number is directly inferred from the data by rounding the maximum fractional copy number | 8 |
| `-u`, `--minprop` | Minimum clone proportion | In every sample, each clone either is non present (clone proportion equal to 0.0) or has a clone proportion higher than this threshold | 0.03 |
| `-f`, `--noampdel` | Activate clone evolutionary contraints | User can decide whether to enable or not constrained about the evolution of tumor clones. These constrained force each allele to be either amplified or deleted across all tumor clones | Activated |
| `-f`, `--noampdel` | Activate clone evolutionary constraints | The user can decide whether to enable constraints on the evolution of tumor clones. These constraints force each allele to be either amplified or deleted across all tumor clones | Activated |
| `-d`, `--cnstates` | Maximum number of distinct copy-number states per cluster | When enabled, the maximum number of distinct copy-number states per cluster is fixed. This option is deprecated | Not used |
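The effect of the `-u`/`--minprop` constraint can be sketched as follows: in each sample a clone is either absent or present above the minimum proportion. Note this is only an after-the-fact illustration; the real factorization enforces the rule as a constraint during optimization rather than as post-processing.

```python
# Hypothetical sketch: zero out clone proportions below the threshold and
# renormalize the rest. 0.03 mirrors the documented default.
def apply_min_proportion(props, u=0.03):
    kept = [p if p >= u else 0.0 for p in props]
    total = sum(kept)
    return [p / total for p in kept]

print(apply_min_proportion([0.60, 0.38, 0.02]))
```

The third clone (2%) is dropped entirely, and the remaining proportions are rescaled to sum to 1.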

HATCHet implements two methods to solve the constrained and distance-based simultaneous factorization: (1) an integer linear programming (ILP) formulation and (2) a coordinate-descent (CD) method.
12 changes: 6 additions & 6 deletions docs/source/doc_count_alleles.md
@@ -1,6 +1,6 @@
# count-alleles

Given one or more BAM files and lists of heterozygous SNP positions, this step of HATCHet counts the number of reads covering both the alleles of each identified heterozgyous SNP in every tumor sample.
Given one or more BAM files and lists of heterozygous SNP positions, this step of HATCHet counts the number of reads covering both the alleles of each identified heterozygous SNP in every tumor sample.

## Input

@@ -39,12 +39,12 @@ count-alleles has some main parameters; the main values of these parameters allo

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-S`, `--samples` | White-space separater list of a names | The first name is used for the matched-normal sample, while the others are for the tumor samples and they match the same order of the corresponding BAM files | File names are used |
| `-st`, `--samtools` | Path to `bin` directory of SAMtools | The path to this direcoty needs to be specified when it is not included in `$PATH` | Path is expected in the enviroment variable `$PATH` |
| `-bt`, `--bcftools` | Path to `bin` directory of BCFtools | The path to this direcoty needs to be specified when it is not included in `$PATH` | Path is expected in the enviroment variable `$PATH` |
| `-S`, `--samples` | White-space-separated list of names | The first name is used for the matched-normal sample, while the others are for the tumor samples and they match the same order of the corresponding BAM files | File names are used |
| `-st`, `--samtools` | Path to `bin` directory of SAMtools | The path to this directory needs to be specified when it is not included in `$PATH` | Path is expected in the environment variable `$PATH` |
| `-bt`, `--bcftools` | Path to `bin` directory of BCFtools | The path to this directory needs to be specified when it is not included in `$PATH` | Path is expected in the environment variable `$PATH` |
| `-c`, `--mincov` | Minimum coverage | Minimum number of reads that have to cover a variant for it to be called; the value can be increased when considering a dataset with high depth (>60x) | 8 |
| `-C`, `--maxcov` | Maximum coverage | Maximum number of reads that may cover a variant for it to be called; the suggested value is typically twice the expected coverage, to avoid sequencing and mapping artifacts | 300 |
| `-j`, `--processes` | Number of parallele jobs | Parallel jobs are used to consider the chromosomes in different samples on parallel. The higher the number the better the running time | 22 |
| `-j`, `--processes` | Number of parallel jobs | Parallel jobs are used to process the chromosomes of different samples in parallel. Higher values improve the running time | 22 |
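The `-c`/`-C` coverage window can be sketched as a simple filter: a SNP is kept only when its total read coverage falls within `[mincov, maxcov]` (defaults 8 and 300). The data layout below is an assumption for illustration.

```python
# Hypothetical sketch of the coverage filter applied by count-alleles.
def filter_by_coverage(snps, mincov=8, maxcov=300):
    """snps: list of (position, coverage) pairs."""
    return [(pos, cov) for pos, cov in snps if mincov <= cov <= maxcov]

snps = [(101, 5), (202, 42), (303, 950)]
print(filter_by_coverage(snps))  # [(202, 42)]
```

The first SNP is dropped for low coverage and the third for suspiciously high coverage, which is the kind of mapping artifact the `-C` cap is meant to exclude.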


## Optional parameters
@@ -53,7 +53,7 @@ count-alleles has some optional parameters; changes in the default values of the

| Name | Description | Usage | Default |
|------|-------------|-------|---------|
| `-v`, `--verbose` | Verbose logging flag | When enabled, count-alleles outputs a verbose log of the executiong | Not used |
| `-v`, `--verbose` | Verbose logging flag | When enabled, count-alleles outputs a verbose log of the execution | Not used |
| `-g`, `--gamma` | Level of confidence for selecting germline heterozygous SNPs | This value is the level of confidence for the binomial model used to assess whether a called SNP is in fact germline heterozygous | 0.05 |
| `-q`, `--readquality` | Threshold for phred-score quality of sequencing reads | The value can be either decreased (e.g. 10) or increased (e.g. 30) to adjust the filtering of sequencing reads | 20 |
| `-Q`, `--basequality` | Threshold for phred-score quality of sequenced nucleotide bases | The value can be either decreased (e.g. 10) or increased (e.g. 30) to adjust the filtering of sequenced nucleotide bases | 20 |
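The binomial model behind `-g`/`--gamma` can be sketched as follows: under a Binomial(n, 0.5) model for the reference-allele count of a truly heterozygous SNP, a SNP whose two-sided exact p-value falls below the confidence level gamma is rejected as heterozygous. HATCHet's actual test may be parameterized differently; this only illustrates the statistical idea.

```python
# Hypothetical sketch of an exact two-sided binomial test with p = 0.5,
# built from the stdlib (math.comb) rather than any HATCHet internals.
from math import comb

def is_heterozygous(ref_count, alt_count, gamma=0.05):
    n = ref_count + alt_count
    k = min(ref_count, alt_count)
    # By symmetry of Binomial(n, 0.5), the two-sided p-value is twice the
    # lower tail at the smaller allele count.
    p_value = sum(comb(n, i) for i in range(k + 1)) * 2 / 2**n
    return p_value >= gamma

print(is_heterozygous(18, 22))  # True: balanced counts look heterozygous
print(is_heterozygous(39, 1))   # False: highly skewed counts are rejected
```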