diff --git a/.github/.gitignore b/.github/.gitignore index 2d19fc76..5c86aa40 100644 --- a/.github/.gitignore +++ b/.github/.gitignore @@ -1 +1,3 @@ *.html + +/.quarto/ diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 180ecf6c..f9f8de97 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -5,42 +5,87 @@ For a detailed discussion on contributing to this and other tidyverse packages, ## Fixing typos -You can fix typos, spelling mistakes, or grammatical errors in the documentation directly using the GitHub web interface, as long as the changes are made in the _source_ file. -This generally means you'll need to edit [roxygen2 comments](https://roxygen2.r-lib.org/articles/roxygen2.html) in an `.R`, not a `.Rd` file. +You can fix typos, spelling mistakes, or grammatical errors in the documentation directly using the GitHub web interface, as long as the changes are made in the _source_ file. +This generally means you'll need to edit [roxygen2 comments](https://roxygen2.r-lib.org/articles/roxygen2.html) in an `.R`, not a `.Rd` file. You can find the `.R` file that generates the `.Rd` by reading the comment in the first line. ## Bigger changes -If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed. -If you’ve found a bug, please file an issue that illustrates the bug with a minimal +If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed. +If you’ve found a bug, please file an issue that illustrates the bug with a minimal [reprex](https://www.tidyverse.org/help/#reprex) (this will also help you write a unit test, if needed). See our guide on [how to create a great issue](https://code-review.tidyverse.org/issues/) for more advice. ### Pull request process -* Fork the package and clone onto your computer. If you haven't done this before, we recommend using `usethis::create_from_github("JRaviLab/MolEvolvR", fork = TRUE)`. +- Fork the package and clone onto your computer. If you haven't done this before, we recommend using `usethis`. -* Install all development dependencies with `devtools::install_dev_deps()`, and then make sure the package passes R CMD check by running `devtools::check()`. - If R CMD check doesn't pass cleanly, it's a good idea to ask for help before continuing. -* Create a Git branch for your pull request (PR). We recommend using `usethis::pr_init("brief-description-of-change")`. +- Install and load the `usethis` package with: -* Make your changes, commit to git, and then create a PR by running `usethis::pr_push()`, and following the prompts in your browser. - The title of your PR should briefly describe the change. - The body of your PR should contain `Fixes #issue-number`. + ``` + install.packages("usethis") -* For user-facing changes, add a bullet to the top of `NEWS.md` (i.e. just below the first header). Follow the style described in . 
+  library("usethis")
+  ```
+
+- Fork and clone the MolEvolvR package using:
+  ```
+  usethis::create_from_github("JRaviLab/MolEvolvR", fork = TRUE)
+  ```
+- Install BiocManager from Bioconductor:
+
+  ```
+  if (!require("BiocManager", quietly = TRUE))
+      install.packages("BiocManager")
+  BiocManager::install(version = "3.19")
+  ```
+
+- Install other development dependencies and then ensure that the package passes R CMD check using `devtools`:
+
+  ```
+  install.packages("devtools")
+
+  library("devtools")
+
+  devtools::install_dev_deps()
+
+  devtools::check()
+  ```
+
+  _If R CMD check doesn't pass cleanly, it's a good idea to ask for help before continuing._
+
+- Create a Git branch for your pull request (PR). We recommend using:
+
+  ```
+  usethis::pr_init("brief-description-of-change")
+  ```
+
+- Make your changes, commit to git, and then create a PR by running `usethis::pr_push()`, and following the prompts in your browser.
+  The title of your PR should briefly describe the change.
+  The body of your PR should contain `Fixes #issue-number`.
+
 ### Code style
 
-* New code should follow the tidyverse [style guide](https://style.tidyverse.org).
-  You can use the [styler](https://CRAN.R-project.org/package=styler) package to apply these styles, but please don't restyle code that has nothing to do with your PR.
-
-* Lint Your Code: Ensure your code adheres to our style guidelines by using [lintr](https://lintr.r-lib.org/): `lintr::lint("path/to/your/file.R")`
+- New code should follow the tidyverse [style guide](https://style.tidyverse.org).
+  You can use the [styler](https://CRAN.R-project.org/package=styler) package to apply these styles, but please don't restyle code that has nothing to do with your PR.
+- Lint your code: ensure it adheres to our style guidelines by using [lintr](https://lintr.r-lib.org/):
+
+  ```
+  install.packages("lintr")
+
+  library("lintr")
+
+  lintr::lint("path/to/your/file.R")
+  ```
 
-* We use [roxygen2](https://cran.r-project.org/package=roxygen2), with [Markdown syntax](https://cran.r-project.org/web/packages/roxygen2/vignettes/rd-formatting.html), for documentation.
+- We use [roxygen2](https://cran.r-project.org/package=roxygen2), with [Markdown syntax](https://cran.r-project.org/web/packages/roxygen2/vignettes/rd-formatting.html), for documentation.
 
-* We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
-  Contributions with test cases included are easier to accept.
+- We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
+  Contributions with test cases included are easier to accept.
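  For example, a minimal test sketch (the file name and expectation here are
  illustrative only, exercising the exported `convert2TitleCase()` helper):

  ```
  # tests/testthat/test-convert2TitleCase.R (illustrative)
  test_that("convert2TitleCase() capitalizes each word", {
    expect_equal(convert2TitleCase("molevolvr package"), "Molevolvr Package")
  })
  ```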
## Code of Conduct diff --git a/NAMESPACE b/NAMESPACE index dbab97b3..74c4614a 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -4,7 +4,6 @@ export(BinaryDomainNetwork) export(GCA2Lineage) export(GenContextNetwork) export(IPG2Lineage) -export(RepresentativeAccNums) export(acc2FA) export(acc2Lineage) export(acc2fa) @@ -12,33 +11,32 @@ export(addLeaves2Alignment) export(addLineage) export(addName) export(addTaxID) -export(add_leaves) -export(add_name) -export(advanced_opts2est_walltime) export(alignFasta) -export(assign_job_queue) +export(assignJobQueue) +export(calculateEstimatedWallTimeFromOpts) +export(calculateProcessRuntime) export(cleanClusters) export(cleanDomainArchitecture) export(cleanGeneDescription) export(cleanGenomicContext) export(cleanLineage) export(cleanSpecies) -export(combine_files) -export(combine_full) -export(combine_ipr) +export(combineFiles) +export(combineFullAnalysis) +export(combineIPR) export(condenseRepeatedDomains) export(convert2TitleCase) export(convertAlignment2FA) export(convertAlignment2Trees) export(convertFA2Tree) -export(convert_aln2fa) export(countByColumn) export(createFA2Tree) export(createJobResultsURL) export(createJobStatusEmailMessage) +export(createLineageLookup) +export(createRepresentativeAccNum) export(createWordCloud2Element) export(createWordCloudElement) -export(create_lineage_lookup) export(domain_network) export(downloadAssemblySummary) export(efetchIPG) @@ -46,22 +44,20 @@ export(extractAccNum) export(filterByDomains) export(filterByFrequency) export(findParalogs) -export(find_top_acc) export(formatJobArgumentsHTML) export(gc_undirected_network) export(generateAllAlignments2FA) -export(generate_all_aln2fa) export(generate_msa) -export(get_accnums_from_fasta_file) -export(get_proc_medians) -export(get_proc_weights) -export(ipr2viz) -export(ipr2viz_web) -export(make_opts2procs) +export(getAccNumFromFA) +export(getProcessRuntimeWeights) +export(getTopAccByLinDomArch) export(mapAcc2Name) -export(map_acc2name) -export(map_advanced_opts2procs) +export(mapAdvOption2Process) +export(mapOption2Process) export(msa_pdf) +export(plotEstimatedWallTimes) +export(plotIPR2Viz) +export(plotIPR2VizWeb) export(plotLineageDA) export(plotLineageDomainRepeats) export(plotLineageHeatmap) @@ -72,7 +68,6 @@ export(plotStackedLineage) export(plotSunburst) export(plotTreemap) export(plotUpSet) -export(plot_estimated_walltimes) export(prepareColumnParams) export(prepareSingleColumnParams) export(proteinAcc2TaxID) @@ -83,28 +78,27 @@ export(removeTails) export(renameFA) export(rename_fasta) export(replaceQuestionMarks) -export(reveql) -export(reverse_operon) +export(reverseOperonSeq) export(runDeltaBlast) export(runRPSBlast) export(selectLongestDuplicate) export(sendJobStatusEmail) export(shortenLineage) export(sinkReset) +export(straightenOperonSeq) export(summarizeByLineage) export(summarizeDomArch) export(summarizeDomArch_ByLineage) export(summarizeGenContext) export(summarizeGenContext_ByDomArchLineage) export(summarizeGenContext_ByLineage) -export(theme_genes2) -export(to_titlecase) +export(themeGenes2) export(totalGenContextOrDomArchCounts) export(validateCountDF) export(wordcloud3) -export(write.MsaAAMultipleAlignment) -export(write_proc_medians_table) -export(write_proc_medians_yml) +export(writeMSA_AA2FA) +export(writeProcessRuntime2TSV) +export(writeProcessRuntime2YML) importFrom(Biostrings,AAStringSet) importFrom(Biostrings,readAAStringSet) importFrom(Biostrings,toString) @@ -117,6 +111,7 @@ importFrom(assertthat,assert_that) importFrom(assertthat,has_name) 
importFrom(base64enc,base64encode) importFrom(biomartr,getKingdomAssemblySummary) +importFrom(d3r,d3_nest) importFrom(data.table,as.data.table) importFrom(data.table,fread) importFrom(data.table,fwrite) @@ -181,6 +176,7 @@ importFrom(ggplot2,theme) importFrom(ggplot2,theme_classic) importFrom(ggplot2,theme_grey) importFrom(ggplot2,theme_minimal) +importFrom(ggplot2,unit) importFrom(ggplot2,xlab) importFrom(ggplot2,ylab) importFrom(grDevices,adjustcolor) @@ -237,6 +233,7 @@ importFrom(readr,write_file) importFrom(readr,write_lines) importFrom(readr,write_tsv) importFrom(rentrez,entrez_fetch) +importFrom(rlang,.data) importFrom(rlang,as_string) importFrom(rlang,sym) importFrom(sendmailR,mime_part) @@ -244,6 +241,7 @@ importFrom(sendmailR,sendmail) importFrom(seqinr,dist.alignment) importFrom(seqinr,read.alignment) importFrom(shiny,showNotification) +importFrom(stats,as.formula) importFrom(stats,complete.cases) importFrom(stats,logLik) importFrom(stats,na.omit) @@ -264,6 +262,7 @@ importFrom(stringr,str_sub) importFrom(stringr,str_trim) importFrom(stringr,word) importFrom(sunburstR,sunburst) +importFrom(sunburstR,sund2b) importFrom(tibble,as_tibble) importFrom(tibble,tibble) importFrom(tidyr,drop_na) diff --git a/R/CHANGED-pre-msa-tree.R b/R/CHANGED-pre-msa-tree.R index c4a97589..2f6c8a62 100644 --- a/R/CHANGED-pre-msa-tree.R +++ b/R/CHANGED-pre-msa-tree.R @@ -54,7 +54,7 @@ convert2TitleCase <- function(x, y = " ") { ################################ ## Function to add leaves to an alignment file ## !! Add DA to leaves? -#' Adding Leaves to an alignment file w/ accessions +#' addLeaves2Alignment #' #' @author Janani Ravi #' @keywords alignment, accnum, leaves, lineage, species @@ -178,7 +178,7 @@ addLeaves2Alignment <- function(aln_file = "", } -#' Add Name +#' addName #' #' @author Samuel Chen, Janani Ravi #' @description This function adds a new 'Name' column that is comprised of components from @@ -252,7 +252,7 @@ addName <- function(data, ################################ ## Function to convert alignment 'aln' to fasta format for MSA + Tree -#' Adding Leaves to an alignment file w/ accessions +#' convertAlignment2FA #' #' @author Janani Ravi #' @keywords alignment, accnum, leaves, lineage, species @@ -320,6 +320,9 @@ convertAlignment2FA <- function(aln_file = "", return(fasta) } +#' mapAcc2Name +#' +#' @description #' Default renameFA() replacement function. Maps an accession number to its name #' #' @param line The line of a fasta file starting with '>' @@ -382,6 +385,9 @@ renameFA <- function(fa_path, outpath, ################################ ## generateAllAlignments2FA +#' generateAllAlignments2FA +#' +#' @description #' Adding Leaves to an alignment file w/ accessions #' #' @keywords alignment, accnum, leaves, lineage, species @@ -441,10 +447,11 @@ generateAllAlignments2FA <- function(aln_path = here("data/rawdata_aln/"), # accessions <- c("P12345","Q9UHC1","O15530","Q14624","P0DTD1") # accessions <- rep("ANY95992.1", 201) -#' acc2FA converts protein accession numbers to a fasta format. +#' acc2FA #' #' @description -#' Resulting fasta file is written to the outpath. +#' converts protein accession numbers to a fasta format. Resulting +#' fasta file is written to the outpath. 
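#' A minimal usage sketch (the accession and output path are illustrative
#' placeholders, reusing one of the test accessions noted above):
#'   acc2FA(accessions = c("ANY95992.1"), outpath = "ANY95992.fa")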
#' #' @author Samuel Chen, Janani Ravi #' @keywords accnum, fasta @@ -539,6 +546,9 @@ acc2FA <- function(accessions, outpath, plan = "sequential") { return(result) } +#' createRepresentativeAccNum +#' +#' @description #' Function to generate a vector of one Accession number per distinct observation from 'reduced' column #' #' @author Samuel Chen, Janani Ravi @@ -556,7 +566,7 @@ acc2FA <- function(accessions, outpath, plan = "sequential") { #' @export #' #' @examples -RepresentativeAccNums <- function(prot_data, +createRepresentativeAccNum <- function(prot_data, reduced = "Lineage", accnum_col = "AccNum") { # Get Unique reduced column and then bind the AccNums back to get one AccNum per reduced column @@ -585,6 +595,9 @@ RepresentativeAccNums <- function(prot_data, return(accessions) } +#' alignFasta +#' +#' @description #' Perform a Multiple Sequence Alignment on a FASTA file. #' #' @author Samuel Chen, Janani Ravi @@ -610,12 +623,12 @@ alignFasta <- function(fasta_file, tool = "Muscle", outpath = NULL) { ) if (typeof(outpath) == "character") { - write.MsaAAMultipleAlignment(aligned, outpath) + writeMSA_AA2FA(aligned, outpath) } return(aligned) } -#' Write MsaAAMultpleAlignment Objects as algined fasta sequence +#' writeMSA_AA2FA #' #' @description #' MsaAAMultipleAlignment Objects are generated from calls to msaClustalOmega @@ -632,7 +645,7 @@ alignFasta <- function(fasta_file, tool = "Muscle", outpath = NULL) { #' @export #' #' @examples -write.MsaAAMultipleAlignment <- function(alignment, outpath) { +writeMSA_AA2FA <- function(alignment, outpath) { l <- length(rownames(alignment)) fasta <- "" for (i in 1:l) @@ -645,7 +658,7 @@ write.MsaAAMultipleAlignment <- function(alignment, outpath) { return(fasta) } -#' Get accnums from fasta file +#' getAccNumFromFA #' #' @param fasta_file #' @@ -655,7 +668,7 @@ write.MsaAAMultipleAlignment <- function(alignment, outpath) { #' @export #' #' @examples -get_accnums_from_fasta_file <- function(fasta_file) { +getAccNumFromFA <- function(fasta_file) { txt <- read_file(fasta_file) accnums <- stringi::stri_extract_all_regex(fasta_file, "(?<=>)[\\w,.]+")[[1]] return(accnums) diff --git a/R/acc2lin.R b/R/acc2lin.R index 1984ec3c..5f25afe2 100644 --- a/R/acc2lin.R +++ b/R/acc2lin.R @@ -10,6 +10,7 @@ #' Sink Reset #' #' @return No return, but run to close all outstanding `sink()`s +#' #' @export #' #' @examples @@ -18,25 +19,32 @@ #' } sinkReset <- function() { for (i in seq_len(sink.number())) { - sink(NULL) + sink(NULL) } } #' addLineage #' -#' @param df -#' @param acc_col -#' @param assembly_path -#' @param lineagelookup_path -#' @param ipgout_path -#' @param plan +#' @param df A `data.frame` containing the input data. One column must contain +#' the accession numbers. +#' @param acc_col A string specifying the column name in `df` that holds the +#' accession numbers. Defaults to `"AccNum"`. +#' @param assembly_path A string specifying the path to the `assembly_summary.txt` +#' file. This file contains metadata about assemblies. +#' @param lineagelookup_path A string specifying the path to the lineage lookup +#' file, which contains a mapping from tax IDs to their corresponding lineages. +#' @param ipgout_path (Optional) A string specifying the path where IPG database +#' fetch results will be saved. If `NULL`, the results are not written to a file. +#' @param plan A string specifying the parallelization strategy for the future +#' package, such as `"sequential"` or `"multisession"`. 
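#' An illustrative call (the file paths are placeholders; the lookup files are
#' assumed to have been generated beforehand):
#'   addLineage(df, acc_col = "AccNum",
#'       assembly_path = "assembly_summary.txt",
#'       lineagelookup_path = "lineage_lookup.tsv",
#'       plan = "sequential")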
#' #' @importFrom dplyr pull #' @importFrom magrittr %>% #' @importFrom rlang sym #' -#' @return Describe return, in detail +#' @return A `data.frame` that combines the original `df` with the lineage +#' information. #' @export #' #' @examples @@ -49,18 +57,20 @@ addLineage <- function(df, acc_col = "AccNum", assembly_path, accessions <- df %>% pull(acc_col) lins <- acc2Lineage(accessions, assembly_path, lineagelookup_path, ipgout_path, plan) - # Drop a lot of the unimportant columns for now? will make merging much easier - lins <- lins[, c( + # Drop a lot of the unimportant columns for now? + # will make merging much easier + lins <- lins[, c( "Strand", "Start", "Stop", "Nucleotide Accession", "Source", "Id", "Strain" - ) := NULL] - lins <- unique(lins) + ) := NULL] + lins <- unique(lins) - # dup <- lins %>% group_by(Protein) %>% summarize(count = n()) %>% filter(count > 1) %>% - # pull(Protein) + # dup <- lins %>% group_by(Protein) %>% + # summarize(count = n()) %>% filter(count > 1) %>% + # pull(Protein) - merged <- merge(df, lins, by.x = acc_col, by.y = "Protein", all.x = TRUE) - return(merged) + merged <- merge(df, lins, by.x = acc_col, by.y = "Protein", all.x = TRUE) + return(merged) } @@ -78,9 +88,12 @@ addLineage <- function(df, acc_col = "AccNum", assembly_path, #' (taxid to lineage mapping). This file can be generated using the #' @param ipgout_path Path to write the results of the efetch run of the accessions #' on the ipg database. If NULL, the file will not be written. Defaults to NULL -#' @param plan +#' @param plan A string specifying the parallelization strategy for the future +#' package, such as `"sequential"` or `"multisession"`. #' -#' @return Describe return, in detail +#' @return A `data.table` that contains the lineage information, mapping protein +#' accessions to their tax IDs and lineages. +#' @export #' @export #' #' @examples @@ -97,10 +110,10 @@ acc2Lineage <- function(accessions, assembly_path, lineagelookup_path, ipgout_pa lins <- IPG2Lineage(accessions, ipgout_path, assembly_path, lineagelookup_path) - if (tmp_ipg) { - unlink(tempdir(), recursive = T) - } - return(lins) + if (tmp_ipg) { + unlink(tempdir(), recursive = T) + } + return(lins) } #' efetchIPG @@ -108,17 +121,17 @@ acc2Lineage <- function(accessions, assembly_path, lineagelookup_path, ipgout_pa #' @author Samuel Chen, Janani Ravi #' #' @description Perform efetch on the ipg database and write the results to out_path -#' #' @param accnums Character vector containing the accession numbers to query on #' the ipg database #' @param out_path Path to write the efetch results to -#' @param plan +#' @param plan A string specifying the parallelization strategy for the future +#' package, such as `"sequential"` or `"multisession"`. #' #' @importFrom furrr future_map #' @importFrom future plan #' @importFrom rentrez entrez_fetch #' -#' @return Describe return, in detail +#' @return No return value. The function writes the fetched results to `out_path`. #' @export #' #' @examples @@ -133,46 +146,52 @@ efetchIPG <- function(accnums, out_path, plan = "sequential", ...) 
{ # limit of 10/second w/ key l <- length(in_data) - partitioned <- list() - for (i in 1:groups) - { - partitioned[[i]] <- in_data[seq.int(i, l, groups)] - } - - return(partitioned) - } - - plan(strategy = plan, .skip = T) - - - min_groups <- length(accnums) / 200 - groups <- min(max(min_groups, 15), length(accnums)) - partitioned_acc <- partition(accnums, groups) - sink(out_path) - - a <- future_map(1:length(partitioned_acc), function(x) { - # Avoid hitting the rate API limit - if (x %% 9 == 0) { - Sys.sleep(1) - } - cat( - entrez_fetch( - id = partitioned_acc[[x]], - db = "ipg", - rettype = "xml", - api_key = "YOUR_KEY_HERE" ## Can this be included in public package? - ) - ) - }) - sink(NULL) + partitioned <- list() + for (i in 1:groups){ + partitioned[[i]] <- in_data[seq.int(i, l, groups)] + } + + return(partitioned) } + + # Set the future plan strategy + plan(strategy = plan, .skip = T) + + + min_groups <- length(accnums) / 200 + groups <- min(max(min_groups, 15), length(accnums)) + partitioned_acc <- partition(accnums, groups) + + # Open the sink to the output path + sink(out_path) + + a <- future_map(1:length(partitioned_acc), function(x) { + # Avoid hitting the rate API limit + if (x %% 9 == 0) { + Sys.sleep(1) + } + cat( + entrez_fetch( + id = partitioned_acc[[x]], + db = "ipg", + rettype = "xml", + api_key = "YOUR_KEY_HERE" ## Can this be included in public package? + ) + ) + }) + sink(NULL) + + } } + + #' IPG2Lineage #' #' @author Samuel Chen, Janani Ravi #' -#' @description Takes the resulting file of an efetch run on the ipg database and +#' @description Takes the resulting file +#' of an efetch run on the ipg database and #' #' @param accessions Character vector of protein accessions #' @param ipg_file Filepath to the file containing results of an efetch run on the @@ -182,11 +201,12 @@ efetchIPG <- function(accnums, out_path, plan = "sequential", ...) { #' This file can be generated using the \link[MolEvolvR]{downloadAssemblySummary} function #' @param lineagelookup_path String of the path to the lineage lookup file #' (taxid to lineage mapping). This file can be generated using the -#' "create_lineage_lookup()" function +#' "createLineageLookup()" function #' #' @importFrom data.table fread #' -#' @return Describe return, in detail +#' @return A `data.table` with the lineage information for the provided protein +#' accessions. #' @export #' #' @examples @@ -197,8 +217,10 @@ efetchIPG <- function(accnums, out_path, plan = "sequential", ...) { IPG2Lineage <- function(accessions, ipg_file, assembly_path, lineagelookup_path, ...) { ipg_dt <- fread(ipg_file, sep = "\t", fill = T) + # Filter the IPG data table to only include the accessions ipg_dt <- ipg_dt[Protein %in% accessions] + # Rename the 'Assembly' column to 'GCA_ID' ipg_dt <- setnames(ipg_dt, "Assembly", "GCA_ID") lins <- GCA2Lineage(prot_data = ipg_dt, assembly_path, lineagelookup_path) diff --git a/R/assign_job_queue.R b/R/assign_job_queue.R index bc5253d4..69609417 100644 --- a/R/assign_job_queue.R +++ b/R/assign_job_queue.R @@ -3,24 +3,31 @@ # pipeline. 
# to use this, construct paths like so: file.path(common_root, "path", "to", "file.R") # for example, the reference for this file would be: -# file.path(common_root, "molevol_scripts", "R", "assign_job_queue.R") +# file.path(common_root, "molevol_scripts", "R", "assignJobQueue.R") common_root <- Sys.getenv("COMMON_SRC_ROOT") +#' mapOption2Process +#' +#' @description #' Construct list where names (MolEvolvR advanced options) point to processes #' #' @return list where names (MolEvolvR advanced options) point to processes #' -#' example: list_opts2procs <- make_opts2procs +#' example: list_opts2procs <- mapOption2Process #' @export -make_opts2procs <- function() { - opts2processes <- list( - "homology_search" = c("dblast", "dblast_cleanup"), - "domain_architecture" = c("iprscan", "ipr2lineage", "ipr2da"), - "always" = c("blast_clust", "clust2table") # processes always present agnostic of advanced options - ) - return(opts2processes) +mapOption2Process <- function() { + opts2processes <- list( + "homology_search" = c("dblast", "dblast_cleanup"), + "domain_architecture" = c("iprscan", "ipr2lineage", "ipr2da"), + # processes always present agnostic of advanced options + "always" = c("blast_clust", "clust2table") + ) + return(opts2processes) } +#' mapAdvOption2Process +#' +#' @description #' Use MolEvolvR advanced options to get associated processes #' #' @param advanced_opts character vector of MolEvolvR advanced options @@ -30,19 +37,22 @@ make_opts2procs <- function() { #' #' example: #' advanced_opts <- c("homology_search", "domain_architecture") -#' procs <- map_advanced_opts2procs(advanced_opts) +#' procs <- mapAdvOption2Process(advanced_opts) #' @export -map_advanced_opts2procs <- function(advanced_opts) { - # append 'always' to add procs that always run - advanced_opts <- c(advanced_opts, "always") - opts2proc <- make_opts2procs() - # setup index for opts2proc based on advanced options - idx <- which(names(opts2proc) %in% advanced_opts) - # extract processes that will run - procs <- opts2proc[idx] |> unlist() - return(procs) +mapAdvOption2Process <- function(advanced_opts) { + # append 'always' to add procs that always run + advanced_opts <- c(advanced_opts, "always") + opts2proc <- mapOption2Process() + # setup index for opts2proc based on advanced options + idx <- which(names(opts2proc) %in% advanced_opts) + # extract processes that will run + procs <- opts2proc[idx] |> unlist() + return(procs) } +#' calculateProcessRuntime +#' +#' @description #' Scrape MolEvolvR logs and calculate median processes #' #' @param dir_job_results [chr] path to MolEvolvR job_results @@ -58,49 +68,54 @@ map_advanced_opts2procs <- function(advanced_opts) { #' #' 1) #' dir_job_results <- "/data/scratch/janani/molevolvr_out" -#' list_proc_medians <- get_proc_medians(dir_job_results) +#' list_proc_medians <- calculateProcessRuntime(dir_job_results) #' #' 2) from outside container environment #' common_root <- "/data/molevolvr_transfer/molevolvr_dev" #' dir_job_results <- "/data/molevolvr_transfer/molevolvr_dev/job_results" -#' list_proc_medians <- get_proc_medians(dir_job_results) +#' list_proc_medians <- calculateProcessRuntime(dir_job_results) #' @export -get_proc_medians <- function(dir_job_results) { - source(file.path(common_root, "molevol_scripts", "R", "metrics.R")) +calculateProcessRuntime <- function(dir_job_results) { + source(file.path(common_root, "molevol_scripts", "R", "metrics.R")) - # aggregate logs from - path_log_data <- file.path(common_root, "molevol_scripts", "log_data", "prod_logs.rda") 
+ # aggregate logs from + path_log_data <- file.path(common_root, + "molevol_scripts", "log_data", "prod_logs.rda") - # ensure the folder exists to the location - if (!dir.exists(path_log_data)) { - dir.create(dirname(path_log_data), recursive = TRUE, showWarnings = FALSE) - } + # ensure the folder exists to the location + if (!dir.exists(path_log_data)) { + dir.create(dirname(path_log_data), + recursive = TRUE, showWarnings = FALSE) + } - # attempt to load pre-generated logdata - if (!file.exists(path_log_data)) { - logs <- aggregate_logs(dir_job_results, latest_date = Sys.Date() - 60) - save(logs, file = path_log_data) - } else { - load(path_log_data) # loads the logs object - } - df_log <- logs$df_log - procs <- c( - "dblast", "dblast_cleanup", "iprscan", - "ipr2lineage", "ipr2da", "blast_clust", - "clust2table" - ) - list_proc_medians <- df_log |> - dplyr::select(dplyr::all_of(procs)) |> - dplyr::summarise( - dplyr::across( - dplyr::everything(), - \(x) median(x, na.rm = TRUE) - ) - ) |> - as.list() - return(list_proc_medians) + # attempt to load pre-generated logdata + if (!file.exists(path_log_data)) { + logs <- aggregate_logs(dir_job_results, latest_date = Sys.Date() - 60) + save(logs, file = path_log_data) + } else { + load(path_log_data) # loads the logs object + } + df_log <- logs$df_log + procs <- c( + "dblast", "dblast_cleanup", "iprscan", + "ipr2lineage", "ipr2da", "blast_clust", + "clust2table" + ) + list_proc_medians <- df_log |> + dplyr::select(dplyr::all_of(procs)) |> + dplyr::summarise( + dplyr::across( + dplyr::everything(), + \(x) median(x, na.rm = TRUE) + ) + ) |> + as.list() + return(list_proc_medians) } +#' writeProcessRuntime2TSV +#' +#' @description #' Write a table of 2 columns: 1) process and 2) median seconds #' #' @param dir_job_results [chr] path to MolEvolvR job_results @@ -113,53 +128,61 @@ get_proc_medians <- function(dir_job_results) { #' #' @return [tbl_df] 2 columns: 1) process and 2) median seconds #' -#' example: write_proc_medians_table( +#' example: writeProcessRuntime2TSV( #' "/data/scratch/janani/molevolvr_out/", #' "/data/scratch/janani/molevolvr_out/log_tbl.tsv" #' ) #' @export -write_proc_medians_table <- function(dir_job_results, filepath) { - df_proc_medians <- get_proc_medians(dir_job_results) |> - tibble::as_tibble() |> - tidyr::pivot_longer( - dplyr::everything(), - names_to = "process", - values_to = "median_seconds" - ) |> - dplyr::arrange(dplyr::desc(median_seconds)) - readr::write_tsv(df_proc_medians, file = filepath) - return(df_proc_medians) +writeProcessRuntime2TSV <- function(dir_job_results, filepath) { + df_proc_medians <- calculateProcessRuntime(dir_job_results) |> + tibble::as_tibble() |> + tidyr::pivot_longer( + dplyr::everything(), + names_to = "process", + values_to = "median_seconds" + ) |> + dplyr::arrange(dplyr::desc(median_seconds)) + + # Write the resulting tibble to a TSV file + readr::write_tsv(df_proc_medians, file = filepath) + return(df_proc_medians) } +#' writeProcessRuntime2YML +#' +#' @description #' Compute median process runtimes, then write a YAML list of the processes and #' their median runtimes in seconds to the path specified by 'filepath'. #' #' The default value of filepath is the value of the env var -#' MOLEVOLVR_PROC_WEIGHTS, which get_proc_weights() also uses as its default +#' MOLEVOLVR_PROC_WEIGHTS, which getProcessRuntimeWeights() also uses as its default #' read location. 
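#' The written YAML is a flat mapping of process name to median runtime in
#' seconds, along these lines (values illustrative, mirroring the hardcoded
#' fallbacks in getProcessRuntimeWeights()):
#'   dblast: 2810
#'   iprscan: 1016
#'   blast_clust: 2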
#' #' @param dir_job_results [chr] path to MolEvolvR job_results directory -#' @param filepath [chr] path to save YAML file; if NULL, uses ./molevol_scripts/log_data/job_proc_weights.yml +#' @param filepath [chr] path to save YAML file; if NULL, +#' uses ./molevol_scripts/log_data/job_proc_weights.yml #' #' @importFrom yaml write_yaml #' #' @examples #' \dontrun{ -#' write_proc_medians_yml( +#' writeProcessRuntime2YML( #' "/data/scratch/janani/molevolvr_out/", #' "/data/scratch/janani/molevolvr_out/log_tbl.yml" #' ) #' } #' @export -write_proc_medians_yml <- function(dir_job_results, filepath = NULL) { - if (is.null(filepath)) { - filepath <- file.path(common_root, "molevol_scripts", "log_data", "job_proc_weights.yml") - } - - medians <- get_proc_medians(dir_job_results) - yaml::write_yaml(medians, filepath) +writeProcessRuntime2YML <- function(dir_job_results, filepath = NULL) { + if (is.null(filepath)) { + filepath <- file.path(common_root, "molevol_scripts", "log_data", "job_proc_weights.yml") + } + medians <- calculateProcessRuntime(dir_job_results) + yaml::write_yaml(medians, filepath) } +#' getProcessRuntimeWeights +#' +#' @description #' Quickly get the runtime weights for MolEvolvR backend processes #' #' @param dir_job_results [chr] path to MolEvolvR job_results @@ -170,50 +193,55 @@ write_proc_medians_yml <- function(dir_job_results, filepath = NULL) { #' #' @return [list] names: processes; values: median runtime (seconds) #' -#' example: get_proc_weights() +#' example: writeProcessRuntime2YML() #' @export -get_proc_weights <- function(medians_yml_path = NULL) { - if (is.null(medians_yml_path)) { - medians_yml_path <- file.path(common_root, "molevol_scripts", "log_data", "job_proc_weights.yml") - } +getProcessRuntimeWeights <- function(medians_yml_path = NULL) { + if (is.null(medians_yml_path)) { + medians_yml_path <- file.path(common_root, + "molevol_scripts", + "log_data", + "job_proc_weights.yml") + } - proc_weights <- tryCatch( - { - # attempt to read the weights from the YAML file produced by - # write_proc_medians_yml() - if (stringr::str_trim(medians_yml_path) == "") { - stop( - stringr::str_glue("medians_yml_path is empty ({medians_yml_path}), returning default weights") - ) - } + proc_weights <- tryCatch({ + # attempt to read the weights from the YAML file produced by + # writeProcessRuntime2YML() + if (stringr::str_trim(medians_yml_path) == "") { + stop( + stringr::str_glue("medians_yml_path is empty + ({medians_yml_path}), returning default weights") + ) + } - proc_weights <- yaml::read_yaml(medians_yml_path) - }, - # to avoid fatal errors in reading the proc weights yaml, - # some median process runtimes have been hardcoded based on - # the result of get_proc_medians() from Jan 2024 - error = function(cond) { - proc_weights <- list( - "dblast" = 2810, - "iprscan" = 1016, - "dblast_cleanup" = 79, - "ipr2lineage" = 18, - "ipr2da" = 12, - "blast_clust" = 2, - "clust2table" = 2 - ) - proc_weights - } + proc_weights <- yaml::read_yaml(medians_yml_path) + }, + # to avoid fatal errors in reading the proc weights yaml, + # some median process runtimes have been hardcoded based on + # the result of calculateProcessRuntime() from Jan 2024 + error = function(cond) { + proc_weights <- list( + "dblast" = 2810, + "iprscan" = 1016, + "dblast_cleanup" = 79, + "ipr2lineage" = 18, + "ipr2da" = 12, + "blast_clust" = 2, + "clust2table" = 2 ) + proc_weights + }) - return(proc_weights) + return(proc_weights) } +#' calculateEstimatedWallTimeFromOpts +#' +#' @description #' Given MolEvolvR 
advanced options and number of inputs,
#' calculate the total estimated walltime for the job
#'
#' @param advanced_opts character vector of MolEvolvR advanced options
-#' (see make_opts2procs for the options)
+#' (see mapOption2Process for the options)
#' @param n_inputs total number of input proteins
#'
#' @importFrom dplyr if_else
#'
#' @return total estimated number of seconds a job will process (walltime)
#'
-#' example: advanced_opts2est_walltime(c("homology_search", "domain_architecture"), n_inputs = 3, n_hits = 50L)
+#' example: calculateEstimatedWallTimeFromOpts(c("homology_search",
+#'                                                "domain_architecture"),
+#'                                              n_inputs = 3, n_hits = 50L)
#' @export
-advanced_opts2est_walltime <- function(advanced_opts, n_inputs = 1L, n_hits = NULL, verbose = FALSE) {
-    # to calculate est walltime for a homology search job, the number of hits
-    # must be provided
-    validation_fail <- is.null(n_hits) && "homology_search" %in% advanced_opts
-    stopifnot(!validation_fail)
+calculateEstimatedWallTimeFromOpts <- function(advanced_opts,
+                                               n_inputs = 1L,
+                                               n_hits = NULL,
+                                               verbose = FALSE) {
+    # to calculate est walltime for a homology search job, the number of hits
+    # must be provided
+    validation_fail <- is.null(n_hits) && "homology_search" %in% advanced_opts
+    stopifnot(!validation_fail)
 
-    proc_weights <- get_proc_weights()
-    # sort process weights by names and convert to vec
-    proc_weights <- proc_weights[order(names(proc_weights))] |> unlist()
-    all_procs <- names(proc_weights) |> sort()
-    # get processes from advanced options and sort by names
-    procs_from_opts <- map_advanced_opts2procs(advanced_opts)
-    procs_from_opts <- sort(procs_from_opts)
-    # binary encode: yes proc will run (1); else 0
-    binary_proc_vec <- dplyr::if_else(all_procs %in% procs_from_opts, 1L, 0L)
-    # dot product of weights and procs to run; scaled by the number of inputs
-    est_walltime <- (n_inputs * (binary_proc_vec %*% proc_weights)) |>
-        as.numeric()
-    # calculate the additional processes to run for the homologous hits
-    if ("homology_search" %in% advanced_opts) {
-        opts2procs <- make_opts2procs()
-        # exclude the homology search processes for the homologous hits
-        procs2exclude_for_homologs <- opts2procs[["homology_search"]]
-        procs_homologs <- procs_from_opts[!(procs_from_opts %in% procs2exclude_for_homologs)]
-        binary_proc_vec_homolog <- dplyr::if_else(all_procs %in% procs_homologs, 1L, 0L)
-        # add the estimated walltime for processes run on the homologous hits
-        est_walltime <- est_walltime +
-            (n_hits * (binary_proc_vec_homolog %*% proc_weights) |> as.numeric())
-    }
-    if (verbose) {
-        msg <- stringr::str_glue(
-            "warnings from advanced_opts2est_walltime():\n",
-            "\tn_inputs={n_inputs}\n",
-            "\tn_hits={ifelse(is.null(n_hits), 'null', n_hits)}\n",
-            "\test_walltime={est_walltime}\n\n"
-        )
-        cat(file = stderr(), msg)
-    }
-    return(est_walltime)
+    # Get process weights
+    proc_weights <- getProcessRuntimeWeights()
+
+    # sort process weights by names and convert to vec
+    proc_weights <- proc_weights[order(names(proc_weights))] |> unlist()
+    all_procs <- names(proc_weights) |> sort()
+    # get processes from advanced options and sort by names
+    procs_from_opts <- mapAdvOption2Process(advanced_opts)
+    procs_from_opts <- sort(procs_from_opts)
+    # binary encode: yes proc will run (1); else 0
+    binary_proc_vec <- dplyr::if_else(all_procs %in% procs_from_opts, 1L, 0L)
+    # dot product of weights and procs to run; scaled by the number of inputs
+    est_walltime
<- (n_inputs * (binary_proc_vec %*% proc_weights)) |> + as.numeric() + # calculate the additional processes to run for the homologous hits + if ("homology_search" %in% advanced_opts) { + opts2procs <- mapOption2Process() + # exclude the homology search processes for the homologous hits + procs2exclude_for_homologs <- opts2procs[["homology_search"]] + procs_homologs <- procs_from_opts[!(procs_from_opts + %in% procs2exclude_for_homologs)] + binary_proc_vec_homolog <- dplyr::if_else(all_procs + %in% procs_homologs, 1L, 0L) + # add the estimated walltime for processes run on the homologous hits + est_walltime <- est_walltime + + (n_hits * (binary_proc_vec_homolog + %*% proc_weights) |> as.numeric()) + } + if (verbose) { + msg <- stringr::str_glue( + "warnings from calculateEstimatedWallTimeFromOpts ():\n", + "\tn_inputs={n_inputs}\n", + "\tn_hits={ifelse(is.null(n_hits), 'null', n_hits)}\n", + "\test_walltime={est_walltime}\n\n" + ) + cat(file = stderr(), msg) + } + return(est_walltime) } + +#' assignJobQueue +#' +#' @description #' Decision function to assign job queue #' #' @param t_sec_estimate estimated number of seconds a job will process -#' (from advanced_opts2est_walltime()) +#' (from calculateEstimatedWallTimeFromOpts ()) #' @param t_long threshold value that defines the lower bound for assigning a #' job to the "long queue" #' #' @return a string of "short" or "long" #' #' example: -#' advanced_opts2est_walltime(c("homology_search", "domain_architecture"), 3) |> -#' assign_job_queue() +#' calculateEstimatedWallTimeFromOpts (c("homology_search", +#' "domain_architecture"), 3) |> +#' assignJobQueue() #' @export -assign_job_queue <- function( - t_sec_estimate, - t_cutoff = 21600 # 6 hours - ) { - queue <- ifelse(t_sec_estimate > t_cutoff, "long", "short") - return(queue) +assignJobQueue <- function( + t_sec_estimate, + t_cutoff = 21600 # 6 hours +) { + queue <- ifelse(t_sec_estimate > t_cutoff, "long", "short") + return(queue) } +#' plotEstimatedWallTimes +#' +#' @description #' Plot the estimated runtimes for different advanced options and number #' of inputs #' @@ -297,81 +343,88 @@ assign_job_queue <- function( #' @return line plot object #' #' example: -#' p <- plot_estimated_walltimes() -#' ggplot2::ggsave(filename = "/data/molevolvr_transfer/molevolvr_dev/molevol_scripts/docs/estimate_walltimes.png", plot = p) +#' p <- plotEstimatedWallTimes() +#' ggplot2::ggsave(filename = "/data/molevolvr_transfer/molevolvr_ +#' dev/molevol_scripts/docs/estimate_walltimes.png", plot = p) #' @export -plot_estimated_walltimes <- function() { - opts <- make_opts2procs() |> names() +plotEstimatedWallTimes <- function() { + opts <- mapOption2Process() |> names() # get all possible submission permutations (powerset) get_powerset <- function(vec) { - # generate powerset (do not include empty set) - n <- length(vec) - indices <- 1:n - powerset <- lapply(1:n, function(x) combn(indices, x, simplify = FALSE)) - powerset <- unlist(powerset, recursive = FALSE) - powerset <- lapply(powerset, function(index) vec[index]) - powerset + # generate powerset (do not include empty set) + n <- length(vec) + indices <- 1:n + powerset <- lapply(1:n, function(x) combn(indices, x, simplify = FALSE)) + powerset <- unlist(powerset, recursive = FALSE) + powerset <- lapply(powerset, function(index) vec[index]) + powerset } opts_power_set <- get_powerset(opts) est_walltimes <- list() for (i in 1:20) { - est_walltimes <- append( - x = est_walltimes, - values = sapply( - opts_power_set, - FUN = function(advanced_opts) { - # for 
simplicity, assume the default number of homologus hits (100) - n_hits <- if ("homology_search" %in% advanced_opts) { - 100 - } else { - NULL - } - est_walltime <- advanced_opts2est_walltime( - advanced_opts, - n_inputs = i, - n_hits = n_hits, - verbose = TRUE - ) - names(est_walltime) <- paste0(advanced_opts, collapse = "_") - est_walltime - } + est_walltimes <- append( + x = est_walltimes, + values = sapply( + opts_power_set, + FUN = function(advanced_opts) { + # for simplicity, assume the default number of homologus hits (100) + n_hits <- if ("homology_search" %in% advanced_opts) { + 100 + } else { + NULL + } + est_walltime <- calculateEstimatedWallTimeFromOpts ( + advanced_opts, + n_inputs = i, + n_hits = n_hits, + verbose = TRUE ) + names(est_walltime) <- paste0(advanced_opts, collapse = "_") + est_walltime + } ) + ) } # concat all results to their unique names est_walltimes <- tapply( - unlist( - est_walltimes, - use.names = FALSE - ), - rep( - names(est_walltimes), - lengths(est_walltimes) - ), - FUN = c + unlist( + est_walltimes, + use.names = FALSE + ), + rep( + names(est_walltimes), + lengths(est_walltimes) + ), + FUN = c ) df_walltimes <- est_walltimes |> - unlist() |> - matrix(nrow = length(est_walltimes[[1]]), ncol = length(names(est_walltimes))) + unlist() |> + matrix(nrow = length(est_walltimes[[1]]), + ncol = length(names(est_walltimes))) colnames(df_walltimes) <- names(est_walltimes) df_walltimes <- df_walltimes |> tibble::as_tibble() # rm always col or powerset outcome without the "always" processes col_idx_keep <- grep(pattern = "always$", x = names(df_walltimes)) df_walltimes <- df_walltimes |> - dplyr::select(col_idx_keep) + dplyr::select(col_idx_keep) # bind n_inputs df_walltimes <- df_walltimes |> - dplyr::mutate(n_inputs = 1:20) - df_walltimes <- tidyr::gather(df_walltimes, key = "advanced_opts", value = "est_walltime", -n_inputs) + dplyr::mutate(n_inputs = 1:20) + df_walltimes <- tidyr::gather(df_walltimes, + key = "advanced_opts", + value = "est_walltime", + n_inputs) # sec to hrs df_walltimes <- df_walltimes |> - dplyr::mutate(est_walltime = est_walltime / 3600) - p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = n_inputs, y = est_walltime, color = advanced_opts)) + - ggplot2::geom_line() + - ggplot2::labs( - title = "MolEvolvR estimated runtimes", - x = "Number of inputs", - y = "Estimated walltime (hours)" - ) + dplyr::mutate(est_walltime = est_walltime / 3600) + p <- ggplot2::ggplot(df_walltimes, ggplot2::aes(x = n_inputs, + y = est_walltime, + color = advanced_opts)) + + ggplot2::geom_line() + + ggplot2::labs( + title = "MolEvolvR estimated runtimes", + x = "Number of inputs", + y = "Estimated walltime (hours)" + ) return(p) } diff --git a/R/blastWrappers.R b/R/blastWrappers.R index dc11f589..48753afa 100755 --- a/R/blastWrappers.R +++ b/R/blastWrappers.R @@ -22,21 +22,22 @@ runDeltaBlast <- function(deltablast_path, db_search_path, out, num_alignments, num_threads = 1) { start <- Sys.time() - system(paste0("export BLASTDB=/", db_search_path)) + system(paste0("export BLASTDB=/", db_search_path)) - system2( - command = deltablast_path, - args = c( - "-db", db, - "-query", query, - "-evalue", evalue, - "-out", out, - "-num_threads", num_threads, - "-num_alignments", num_alignments - # ,"-outfmt", outfmt - ) + system2( + command = deltablast_path, + args = c( + "-db", db, + "-query", query, + "-evalue", evalue, + "-out", out, + "-num_threads", num_threads, + "-num_alignments", num_alignments + # ,"-outfmt", outfmt ) - print(Sys.time() - start) + ) + 
print(Sys.time() - start) + } diff --git a/R/clean_clust_file.R b/R/clean_clust_file.R index d3f813e5..87dcde70 100755 --- a/R/clean_clust_file.R +++ b/R/clean_clust_file.R @@ -55,9 +55,9 @@ #' #' @examples #' \dontrun{ -#' clean_clust_file("data/pspa.op_ins_cls", writepath = NULL, query = "pspa") +#' cleanClusterFile("data/pspa.op_ins_cls", writepath = NULL, query = "pspa") #' } -clean_clust_file <- function(path, writepath = NULL, query) { +cleanClusterFile <- function(path, writepath = NULL, query) { # ?? does the following line need to be changed to read_lines()? prot <- read_tsv(path, col_names = F) diff --git a/R/cleanup.R b/R/cleanup.R index 39b4b8d2..4fe074ee 100755 --- a/R/cleanup.R +++ b/R/cleanup.R @@ -88,12 +88,12 @@ ensureUniqAccNum <- function(accnums) { # for the index of occurence for each accession number df_accnums <- tibble::tibble("accnum" = accnums) df_accnums <- df_accnums |> - dplyr::group_by(accnum) |> + dplyr::group_by(.data$accnum) |> dplyr::mutate(suffix = dplyr::row_number()) |> dplyr::ungroup() |> - dplyr::mutate(accnum_adjusted = paste0(accnum, "_", suffix)) |> - dplyr::arrange(accnum_adjusted) - accnums_adjusted <- df_accnums |> dplyr::pull(accnum_adjusted) + dplyr::mutate(accnum_adjusted = paste0(.data$accnum, "_", .data$suffix)) |> + dplyr::arrange(.data$accnum_adjusted) + accnums_adjusted <- df_accnums |> dplyr::pull(.data$accnum_adjusted) return(accnums_adjusted) } diff --git a/R/combine_analysis.R b/R/combine_analysis.R index bb3b3ce2..55e36925 100755 --- a/R/combine_analysis.R +++ b/R/combine_analysis.R @@ -17,9 +17,9 @@ #' @export #' #' @examples -combine_full <- function(inpath, ret = FALSE) { +combineFullAnalysis <- function(inpath, ret = FALSE) { ## Combining full_analysis files - full_combnd <- combine_files(inpath, + full_combnd <- combineFiles(inpath, pattern = "*.full_analysis.tsv", skip = 0, col_names = T ) @@ -44,9 +44,9 @@ combine_full <- function(inpath, ret = FALSE) { #' @export #' #' @examples -combine_ipr <- function(inpath, ret = FALSE) { +combineIPR <- function(inpath, ret = FALSE) { ## Combining clean ipr files - ipr_combnd <- combine_files(inpath, + ipr_combnd <- combineFiles(inpath, pattern = "*.iprscan_cln.tsv", skip = 0, col_names = T ) diff --git a/R/combine_files.R b/R/combine_files.R index 76c5fa09..455ddd53 100755 --- a/R/combine_files.R +++ b/R/combine_files.R @@ -38,7 +38,7 @@ #' @export #' #' @examples -combine_files <- function(inpath = c("../molevol_data/project_data/phage_defense/"), +combineFiles <- function(inpath = c("../molevol_data/project_data/phage_defense/"), pattern = "*full_analysis.tsv", delim = "\t", skip = 0, col_names = T) { @@ -67,7 +67,7 @@ combine_files <- function(inpath = c("../molevol_data/project_data/phage_defense ## Sample Runs ## ################# # ## Combining full_analysis files -# full_combnd <- combine_files(inpath, +# full_combnd <- combineFiles(inpath, # pattern="*full_analysis.txt", skip=0, # col_names=T) # @@ -75,7 +75,7 @@ combine_files <- function(inpath = c("../molevol_data/project_data/phage_defense # path="../molevol_data/project_data/slps/full_combined.tsv") # # ## Combining clean files -# cln_combnd <- combine_files(inpath, +# cln_combnd <- combineFiles(inpath, # pattern="^.*cln.txt", skip=0, # col_names=T) # @@ -86,14 +86,14 @@ combine_files <- function(inpath = c("../molevol_data/project_data/phage_defense # ## Less helpful examples! 
# ## Combining BLAST files # ## Likely makes no sense since clustering is done per query -# cl_blast_combnd <- combine_files(inpath, +# cl_blast_combnd <- combineFiles(inpath, # pattern="^.*refseq.1e-5.txt", skip=0, # col_names=cl_blast_colnames) %>% # select(-PcPositive, -ClusterID) # # ## Combining IPR files # ## Likely makes no sense since there may be repeated AccNum from indiv. files! -# ipr_combnd <- combine_files(inpath, +# ipr_combnd <- combineFiles(inpath, # pattern="*iprscan.lins*", skip=0, # col_names=ipr_colnames) # diff --git a/R/create_lineage_lookup.R b/R/create_lineage_lookup.R index e7374df3..2408c5e6 100644 --- a/R/create_lineage_lookup.R +++ b/R/create_lineage_lookup.R @@ -3,6 +3,9 @@ # library(biomartr) +#' createLineageLookup +#' +#' @description #' Create a look up table that goes from TaxID, to Lineage #' #' @author Samuel Chen @@ -26,9 +29,9 @@ #' @export #' #' @examples -create_lineage_lookup <- function(lineage_file = here("data/rankedlineage.dmp"), +createLineageLookup <- function(lineage_file = here("data/rankedlineage.dmp"), outfile, taxonomic_rank = "phylum") { - shorten_NA <- function(Lineage) { + .shortenNA <- function(Lineage) { first_NA <- str_locate(Lineage, "NA")[1] if (is.na(first_NA)) { # No NAs @@ -92,7 +95,7 @@ create_lineage_lookup <- function(lineage_file = here("data/rankedlineage.dmp"), # Takes a while (2million rows after all) rankedLinsCombined <- rankedLins %>% unite(col = "Lineage", all_of(combined_taxonomy), sep = ">") %>% - mutate(Lineage = unlist(map(Lineage, shorten_NA))) + mutate(Lineage = unlist(map(Lineage, .shortenNA))) @@ -101,7 +104,7 @@ create_lineage_lookup <- function(lineage_file = here("data/rankedlineage.dmp"), -#' CreateLineageLookup <- function(assembly_path, updateAssembly = FALSE, file_type = "tsv") +#' createLineageLookup <- function(assembly_path, updateAssembly = FALSE, file_type = "tsv") #' { #' #' Create a look up table that goes from GCA_ID, to TaxID, to Lineage #' #' @author Samuel Chen diff --git a/R/fa2domain.R b/R/fa2domain.R index 01a56918..6dc6f622 100644 --- a/R/fa2domain.R +++ b/R/fa2domain.R @@ -138,10 +138,10 @@ createIPRScanDomainTable <- function( # filter for the accnum of interest (note: it's possible the accession # number is not in the table [i.e., it had no domains]) df_iprscan_accnum <- df_iprscan |> - dplyr::filter(Analysis %in% analysis) |> - dplyr::filter(AccNum == accnum) |> + dplyr::filter(.data$Analysis %in% analysis) |> + dplyr::filter(.data$AccNum == accnum) |> dplyr::select(dplyr::all_of(c("AccNum", "DB.ID", "StartLoc", "StopLoc"))) |> - dplyr::arrange(StartLoc) + dplyr::arrange(.data$StartLoc) # handle the case of no records after filtering by "Analysis"; return the tibble # with 0 rows quickly if (nrow(df_iprscan_accnum) < 1) { @@ -153,9 +153,9 @@ createIPRScanDomainTable <- function( dplyr::rowwise() |> dplyr::mutate( seq_domain = XVector::subseq( - fasta[[grep(pattern = AccNum, x = names(fasta), fixed = TRUE)]], - start = StartLoc, - end = StopLoc + fasta[[grep(pattern = .data$AccNum, x = names(fasta), fixed = TRUE)]], + start = .data$StartLoc, + end = .data$StopLoc ) |> as.character() ) @@ -166,7 +166,7 @@ createIPRScanDomainTable <- function( id_domain = stringr::str_glue("{AccNum}-{DB.ID}-{StartLoc}_{StopLoc}") ) |> dplyr::ungroup() |> - dplyr::relocate(id_domain, .before = 1) + dplyr::relocate(.data$id_domain, .before = 1) return(df_iprscan_domains) } diff --git a/R/ipr2viz.R b/R/ipr2viz.R index bf3650f7..9b625d4e 100644 --- a/R/ipr2viz.R +++ b/R/ipr2viz.R @@ -13,9 +13,9 @@ 
################################# ## Modified gggenes::theme_genes ################################# -## theme_genes2 adapted from theme_genes (w/o strip.text()) +## themeGenes2 adapted from theme_genes (w/o strip.text()) ## https://github.com/wilkox/gggenes/blob/master/R/theme_genes.R -#' Theme Genes2 +#' themeGenes2 #' #' @importFrom ggplot2 element_blank element_line theme theme_grey #' @@ -23,7 +23,7 @@ #' @export #' #' @examples -theme_genes2 <- function() { +themeGenes2 <- function() { ggplot2::theme_grey() + ggplot2::theme( panel.background = ggplot2::element_blank(), panel.grid.major.y = ggplot2::element_line(colour = "grey80", size = 0.2), @@ -41,7 +41,8 @@ theme_genes2 <- function() { ################################## ## Get Top N AccNum by Lin+DomArch ################################## -#' Group by lineage + DA then take top 20 +#' getTopAccByLinDomArch +#' @description Group by lineage + DA then take top 20 #' #' @param infile_full #' @param DA_col @@ -53,12 +54,13 @@ theme_genes2 <- function() { #' @importFrom shiny showNotification #' @importFrom stats na.omit #' @importFrom rlang sym +#' @importFrom rlang .data #' #' @return #' @export #' #' @examples -find_top_acc <- function(infile_full, +getTopAccByLinDomArch <- function(infile_full, DA_col = "DomArch.Pfam", lin_col = "Lineage_short", n = 20, @@ -91,7 +93,7 @@ find_top_acc <- function(infile_full, ############################################# ## IPR + FULL files --> DomArch Visualization ############################################# -#' IPR2Viz +#' plotIPR2Viz #' #' @param infile_ipr #' @param infile_full @@ -105,15 +107,16 @@ find_top_acc <- function(infile_full, #' #' @importFrom dplyr distinct filter select #' @importFrom gggenes geom_gene_arrow geom_subgene_arrow -#' @importFrom ggplot2 aes aes_string as_labeller element_text facet_wrap ggplot guides margin scale_fill_manual theme theme_minimal ylab +#' @importFrom ggplot2 aes aes_string as_labeller element_text facet_wrap ggplot guides margin scale_fill_manual theme theme_minimal unit ylab #' @importFrom readr read_tsv #' @importFrom tidyr pivot_wider +#' @importFrom stats as.formula #' #' @return #' @export #' #' @examples -ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), +plotIPR2Viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), analysis = c("Pfam", "Phobius", "TMHMM", "Gene3D"), group_by = "Analysis", # "Analysis" topn = 20, name = "Name", text_size = 15, query = "All") { @@ -134,15 +137,15 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), ADDITIONAL_COLORS <- sample(CPCOLS, 1000, replace = TRUE) CPCOLS <- append(x = CPCOLS, values = ADDITIONAL_COLORS) ## Read IPR file - ipr_out <- read_tsv(infile_ipr, col_names = T, col_types = iprscan_cols) - ipr_out <- ipr_out %>% filter(Name %in% accessions) + ipr_out <- read_tsv(infile_ipr, col_names = T, col_types = MolEvolvR::iprscan_cols) + ipr_out <- ipr_out %>% filter(.data$Name %in% accessions) analysis_cols <- paste0("DomArch.", analysis) - infile_full <- infile_full %>% select(analysis_cols, Lineage_short, QueryName, PcPositive, AccNum) + infile_full <- infile_full %>% select(.data$analysis_cols, .data$Lineage_short, .data$QueryName, .data$PcPositive, .data$AccNum) ## To filter by Analysis analysis <- paste(analysis, collapse = "|") ## @SAM: This can't be set in stone since the analysis may change! 
- ## Getting top n accession numbers using find_top_acc() - top_acc <- find_top_acc( + ## Getting top n accession numbers using getTopAccByLinDomArch() + top_acc <- getTopAccByLinDomArch( infile_full = infile_full, DA_col = "DomArch.Pfam", ## @SAM, you could pick by the Analysis w/ max rows! @@ -157,22 +160,22 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), ## Need to fix this eventually based on the 'real' gene orientation! :) ipr_out$Strand <- rep("forward", nrow(ipr_out)) - ipr_out <- ipr_out %>% arrange(AccNum, StartLoc, StopLoc) + ipr_out <- ipr_out %>% arrange(.data$AccNum, .data$StartLoc, .data$StopLoc) ipr_out_sub <- filter( ipr_out, - grepl(pattern = analysis, x = Analysis) + grepl(pattern = analysis, x = .data$Analysis) ) # dynamic analysis labeller analyses <- ipr_out_sub %>% - select(Analysis) %>% + select(.data$Analysis) %>% distinct() analysis_labeler <- analyses %>% - pivot_wider(names_from = Analysis, values_from = Analysis) + pivot_wider(names_from = .data$Analysis, values_from = .data$Analysis) lookup_tbl_path <- "/data/research/jravilab/common_data/cln_lookup_tbl.tsv" - lookup_tbl <- read_tsv(lookup_tbl_path, col_names = T, col_types = lookup_table_cols) + lookup_tbl <- read_tsv(lookup_tbl_path, col_names = T, col_types = MolEvolvR::lookup_table_cols) - lookup_tbl <- lookup_tbl %>% select(-ShortName) # Already has ShortName -- Just needs SignDesc + lookup_tbl <- lookup_tbl %>% select(-.data$ShortName) # Already has ShortName -- Just needs SignDesc # ipr_out_sub = ipr_out_sub %>% select(-ShortName) # TODO: Fix lookup table and uncomment below # ipr_out_sub <- merge(ipr_out_sub, lookup_tbl, by.x = "DB.ID", by.y = "DB.ID") @@ -195,14 +198,14 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), ), color = "white") + geom_gene_arrow(fill = NA, color = "grey") + # geom_blank(data = dummies) + - facet_wrap(~Analysis, + facet_wrap(~.data$Analysis, strip.position = "top", ncol = 5, labeller = as_labeller(analysis_labeler) ) + # , ncol = 1 + #scales = "free", scale_fill_manual(values = CPCOLS, na.value = "#A9A9A9") + theme_minimal() + - theme_genes2() + + themeGenes2() + theme( legend.position = "bottom", legend.box = "horizontal", @@ -216,9 +219,9 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), plot <- ggplot( ipr_out_sub, aes( - xmin = 1, xmax = SLength, - y = Analysis, # y = AccNum - label = ShortName + xmin = 1, xmax = .data$SLength, + y = .data$Analysis, # y = AccNum + label = .data$ShortName ) ) + geom_subgene_arrow(data = ipr_out_sub, aes_string( @@ -232,7 +235,7 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), ) + scale_fill_manual(values = CPCOLS, na.value = "#A9A9A9") + theme_minimal() + - theme_genes2() + + themeGenes2() + theme( legend.position = "bottom", legend.box = "horizontal", @@ -246,7 +249,7 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), return(plot) } -#' IPR2Viz Web +#' plotIPR2VizWeb #' #' @param infile_ipr #' @param accessions @@ -268,7 +271,7 @@ ipr2viz <- function(infile_ipr = NULL, infile_full = NULL, accessions = c(), #' @export #' #' @examples -ipr2viz_web <- function(infile_ipr, +plotIPR2VizWeb <- function(infile_ipr, accessions, analysis = c("Pfam", "Phobius", "TMHMM", "Gene3D"), group_by = "Analysis", name = "Name", @@ -295,7 +298,7 @@ ipr2viz_web <- function(infile_ipr, ## @SAM, colnames, merges, everything neeeds to be done now based on the ## combined lookup table from "common_data" 
lookup_tbl_path <- "/data/research/jravilab/common_data/cln_lookup_tbl.tsv" - lookup_tbl <- read_tsv(lookup_tbl_path, col_names = T, col_types = lookup_table_cols) + lookup_tbl <- read_tsv(lookup_tbl_path, col_names = T, col_types = MolEvolvR::lookup_table_cols) ## Read IPR file and subset by Accessions ipr_out <- read_tsv(infile_ipr, col_names = T) @@ -303,7 +306,7 @@ ipr2viz_web <- function(infile_ipr, ## Need to fix eventually based on 'real' gene orientation! ipr_out$Strand <- rep("forward", nrow(ipr_out)) - ipr_out <- ipr_out %>% arrange(AccNum, StartLoc, StopLoc) + ipr_out <- ipr_out %>% arrange(.data$AccNum, .data$StartLoc, .data$StopLoc) ipr_out_sub <- filter( ipr_out, grepl(pattern = analysis, x = Analysis) @@ -344,7 +347,7 @@ ipr2viz_web <- function(infile_ipr, # , ncol = 1 + #scales = "free", scale_fill_manual(values = CPCOLS, na.value = "#A9A9A9") + theme_minimal() + - theme_genes2() + + themeGenes2() + theme( legend.position = "bottom", legend.box = "horizontal", @@ -374,7 +377,7 @@ ipr2viz_web <- function(infile_ipr, ) + scale_fill_manual(values = CPCOLS, na.value = "#A9A9A9") + theme_minimal() + - theme_genes2() + + themeGenes2() + theme( legend.position = "bottom", legend.box = "horizontal", diff --git a/R/lineage.R b/R/lineage.R index d14246d7..ef4fe586 100644 --- a/R/lineage.R +++ b/R/lineage.R @@ -77,7 +77,7 @@ downloadAssemblySummary <- function(outpath, #' This file can be generated using the "downloadAssemblySummary()" function #' @param lineagelookup_path String of the path to the lineage lookup file #' (taxid to lineage mapping). This file can be generated using the -#' "create_lineage_lookup()" function +#' "createLineageLookup()" function #' @param acc_col #' #' @importFrom dplyr pull @@ -309,7 +309,7 @@ efetchIPG <- function(accessions, out_path, plan = "multicore") { #' @param genbank_assembly_path #' @param lineagelookup_path String of the path to the lineage lookup file #' (taxid to lineage mapping). This file can be generated using the -#' "create_lineage_lookup()" function +#' "createLineageLookup()" function #' #' @importFrom data.table fread setnames #' diff --git a/R/msa.R b/R/msa.R index e56cc32c..0b1b6e34 100644 --- a/R/msa.R +++ b/R/msa.R @@ -197,21 +197,21 @@ msa_pdf <- function(fasta_path, out_path = NULL, #' #' @examples generate_msa <- function(fa_file = "", outfile = "") { - prot_aa <- readAAStringSet( - path = fa_file, - format = "fasta" - ) - prot_aa + prot_aa <- readAAStringSet( + fa_file, + format = "fasta" + ) + prot_aa - ## Install kalign ?rMSA_INSTALL - ## Messed up! Reimplement from kalign.R - ## https://github.com/mhahsler/rMSA/blob/master/R/kalign.R + ## Install kalign ?rMSA_INSTALL + ## Messed up! Reimplement from kalign.R + ## https://github.com/mhahsler/rMSA/blob/master/R/kalign.R - # source("scripts/c2r.R") + # source("scripts/c2r.R") - ## align the sequences - al <- kalign(prot_aa) # !! won't work! - al + ## align the sequences + al <- kalign(prot_aa) # !! won't work! 
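One note on the `readAAStringSet()` call above: the first formal of `Biostrings::readAAStringSet()` is `filepath`, not `path`, which is presumably why the named argument is dropped here. A standalone sketch with a placeholder path:

```
library(Biostrings)

# Positional or filepath= both work; path= is not an argument of this reader.
prot_aa <- readAAStringSet(filepath = "data/alns/pspa_snf7.fa", format = "fasta")
prot_aa
```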
+ al } ############################ diff --git a/R/plotting.R b/R/plotting.R index 5c8de823..7191eace 100644 --- a/R/plotting.R +++ b/R/plotting.R @@ -521,8 +521,8 @@ plotLineageNeighbors <- function(query_data = "prot", query = "pspa", gather(key = TopNeighbors.DA, value = count, 19:ncol(query_data)) %>% select("Lineage", "TopNeighbors.DA", "count") %>% # "DomArch.norep","GenContext.norep", group_by(TopNeighbors.DA, Lineage) %>% - summarise(lincount = sum(count), bin = as.numeric(as.logical(lincount))) %>% - arrange(desc(lincount)) %>% + summarise(lincount =sum(count), bin = as.numeric(as.logical(.data$lincount))) %>% + arrange(desc(.data$lincount)) %>% within(TopNeighbors.DA <- factor(TopNeighbors.DA, levels = rev(names(sort(table(TopNeighbors.DA), decreasing = TRUE @@ -538,9 +538,9 @@ plotLineageNeighbors <- function(query_data = "prot", query = "pspa", geom_tile( data = subset( query.ggplot, - !is.na(lincount) + !is.na(.data$lincount) ), # bin - aes(fill = lincount), # bin + aes(fill = .data$lincount), # bin colour = "coral3", size = 0.3 ) + # , width=0.7, height=0.7), scale_fill_gradient(low = "white", high = "darkred") + @@ -1183,10 +1183,11 @@ createWordCloud2Element <- function(query_data = "prot", #' then the legend will be in the descending order of the top level hierarchy. #' will be rendered. If the type is sund2b, a sund2b plot will be rendered. #' +#' @importFrom d3r d3_nest #' @importFrom dplyr arrange desc group_by_at select summarise #' @importFrom htmlwidgets onRender #' @importFrom rlang sym -#' @importFrom sunburstR sunburst +#' @importFrom sunburstR sunburst sund2b #' @importFrom tidyr drop_na separate #' #' @return @@ -1227,9 +1228,9 @@ plotLineageSunburst <- function(prot, lineage_column = "Lineage", # Plot sunburst if (type == "sunburst") { - result <- sunburst(tree, legend = list(w = 225, h = 15, r = 5, s = 5), colors = cpcols, legendOrder = legendOrder, width = "100%", height = "100%") + result <- sunburst(tree, legend = list(w = 225, h = 15, r = 5, s = 5), colors = .data$cpcols, legendOrder = legendOrder, width = "100%", height = "100%") } else if (type == "sund2b") { - result <- sund2b(tree) + result <- .data$sund2b(tree) } if (showLegend) { diff --git a/R/pre-msa-tree.R b/R/pre-msa-tree.R index 44979c3c..290a1644 100644 --- a/R/pre-msa-tree.R +++ b/R/pre-msa-tree.R @@ -49,7 +49,7 @@ api_key <- Sys.getenv("ENTREZ_API_KEY", unset = "YOUR_KEY_HERE") #' @export #' #' @examples -to_titlecase <- function(x, y = " ") { +convert2TitleCase <- function(x, y = " ") { s <- strsplit(x, y)[[1]] paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = y @@ -59,7 +59,7 @@ to_titlecase <- function(x, y = " ") { ################################ ## Function to add leaves to an alignment file ## !! Add DA to leaves? -#' Adding Leaves to an alignment file w/ accessions +#' addLeaves2Alignment #' #' @author Janani Ravi #' @@ -95,9 +95,9 @@ to_titlecase <- function(x, y = " ") { #' #' @examples #' \dontrun{ -#' add_leaves("pspa_snf7.aln", "pspa.txt") +#' addLeaves2Alignment("pspa_snf7.aln", "pspa.txt") #' } -add_leaves <- function(aln_file = "", +addLeaves2Alignment <- function(aln_file = "", lin_file = "data/rawdata_tsv/all_semiclean.txt", # !! finally change to all_clean.txt!! 
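The renamed `convert2TitleCase()` helper is easiest to see on toy strings; the commented outputs assume the implementation shown above:

```
convert2TitleCase("alpha proteobacteria")               # "Alpha Proteobacteria"
convert2TitleCase("candidatus_thermoplasmatota", "_")   # "Candidatus_Thermoplasmatota"
```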
# lin_file="data/rawdata_tsv/PspA.txt", reduced = FALSE) { @@ -184,7 +184,7 @@ add_leaves <- function(aln_file = "", } -#' Title +#' addName #' #' @author Samuel Chen, Janani Ravi #' @@ -209,7 +209,7 @@ add_leaves <- function(aln_file = "", #' @export #' #' @examples -add_name <- function(data, +addName <- function(data, accnum_col = "AccNum", spec_col = "Species", lin_col = "Lineage", lin_sep = ">", out_col = "Name") { cols <- c(accnum_col, "Kingdom", "Phylum", "Genus", "Spp") @@ -258,7 +258,7 @@ add_name <- function(data, ################################ ## Function to convert alignment 'aln' to fasta format for MSA + Tree -#' Adding Leaves to an alignment file w/ accessions +#' convertAlignment2FA #' #' @author Janani Ravi #' @@ -288,9 +288,9 @@ add_name <- function(data, #' #' @examples #' \dontrun{ -#' add_leaves("pspa_snf7.aln", "pspa.txt") +#' convertAlignment2FA("pspa_snf7.aln", "pspa.txt") #' } -convert_aln2fa <- function(aln_file = "", +convertAlignment2FA <- function(aln_file = "", lin_file = "data/rawdata_tsv/all_semiclean.txt", # !! finally change to all_clean.txt!! fa_outpath = "", reduced = FALSE) { @@ -324,6 +324,9 @@ convert_aln2fa <- function(aln_file = "", return(fasta) } +#' mapAcc2Name +#' +#' @description #' Default rename_fasta() replacement function. Maps an accession number to its name #' #' @param line he line of a fasta file starting with '>' @@ -340,7 +343,7 @@ convert_aln2fa <- function(aln_file = "", #' @export #' #' @examples -map_acc2name <- function(line, acc2name, acc_col = "AccNum", name_col = "Name") { +mapAcc2Name <- function(line, acc2name, acc_col = "AccNum", name_col = "Name") { # change to be the name equivalent to an add_names column # Find the first ' ' end_acc <- str_locate(line, " ")[[1]] @@ -386,7 +389,10 @@ rename_fasta <- function(fa_path, outpath, } ################################ -## generate_all_aln2fa +## generateAllAlignments2FA +#' generateAllAlignments2FA +#' +#' @description #' Adding Leaves to an alignment file w/ accessions #' #' @author Janani Ravi @@ -413,9 +419,9 @@ rename_fasta <- function(fa_path, outpath, #' #' @examples #' \dontrun{ -#' generate_all_aln2fa() +#' generateAllAlignments2FA() #' } -generate_all_aln2fa <- function(aln_path = here("data/rawdata_aln/"), +generateAllAlignments2FA <- function(aln_path = here("data/rawdata_aln/"), fa_outpath = here("data/alns/"), lin_file = here("data/rawdata_tsv/all_semiclean.txt"), reduced = F) { @@ -448,6 +454,10 @@ generate_all_aln2fa <- function(aln_path = here("data/rawdata_aln/"), # accessions <- rep("ANY95992.1", 201) #' acc2fa #' +#' @description +#' converts protein accession numbers to a fasta format. Resulting +#' fasta file is written to the outpath. 
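A hypothetical round trip through the two renamed alignment helpers; all paths are placeholders that mirror the defaults used in the roxygen examples:

```
# Attach lineage-derived leaf labels to an alignment.
leaves <- addLeaves2Alignment(
    aln_file = "data/rawdata_aln/pspa_snf7.aln",
    lin_file = "data/rawdata_tsv/pspa.txt",
    reduced  = FALSE
)

# Convert the same alignment to FASTA for the downstream MSA/tree steps.
fasta <- convertAlignment2FA(
    aln_file   = "data/rawdata_aln/pspa_snf7.aln",
    lin_file   = "data/rawdata_tsv/pspa.txt",
    fa_outpath = "data/alns/pspa_snf7.fa"
)
```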
+#' #' @author Samuel Chen, Janani Ravi #' @keywords accnum, fasta #' @@ -546,7 +556,7 @@ acc2fa <- function(accessions, outpath, plan = "sequential") { return(result) } -#' RepresentativeAccNums +#' createRepresentativeAccNum #' #' @description #' Function to generate a vector of one Accession number per distinct observation from 'reduced' column @@ -566,7 +576,7 @@ acc2fa <- function(accessions, outpath, plan = "sequential") { #' @export #' #' @examples -RepresentativeAccNums <- function(prot_data, +createRepresentativeAccNum <- function(prot_data, reduced = "Lineage", accnum_col = "AccNum") { # Get Unique reduced column and then bind the AccNums back to get one AccNum per reduced column @@ -623,15 +633,15 @@ alignFasta <- function(fasta_file, tool = "Muscle", outpath = NULL) { ) if (typeof(outpath) == "character") { - write.MsaAAMultipleAlignment(aligned, outpath) + writeMSA_AA2FA(aligned, outpath) } return(aligned) } -#' write.MsaAAMultipleAlignment +#' writeMSA_AA2FA #' #' @description -#' Write MsaAAMultpleAlignment Objects as algined fasta sequence +#' Write MsaAAMultpleAlignment Objects as aligned fasta sequence #' MsaAAMultipleAlignment Objects are generated from calls to msaClustalOmega #' and msaMuscle from the 'msa' package #' @@ -647,7 +657,7 @@ alignFasta <- function(fasta_file, tool = "Muscle", outpath = NULL) { #' @export #' #' @examples -write.MsaAAMultipleAlignment <- function(alignment, outpath) { +writeMSA_AA2FA <- function(alignment, outpath) { l <- length(rownames(alignment)) fasta <- "" for (i in 1:l) @@ -660,7 +670,7 @@ write.MsaAAMultipleAlignment <- function(alignment, outpath) { return(fasta) } -#' get_accnums_from_fasta_file +#' getAccNumFromFA #' #' @param fasta_file #' @@ -671,7 +681,7 @@ write.MsaAAMultipleAlignment <- function(alignment, outpath) { #' @export #' #' @examples -get_accnums_from_fasta_file <- function(fasta_file) { +getAccNumFromFA <- function(fasta_file) { txt <- read_file(fasta_file) accnums <- stringi::stri_extract_all_regex(fasta_file, "(?<=>)[\\w,.]+")[[1]] return(accnums) diff --git a/R/reverse_operons.R b/R/reverse_operons.R index e4bbd50e..a2570e8d 100755 --- a/R/reverse_operons.R +++ b/R/reverse_operons.R @@ -3,7 +3,7 @@ # Modified by Janani Ravi and Samuel Chen -#' reveql +#' straightenOperonSeq #' #' @param prot #' @@ -11,7 +11,7 @@ #' @export #' #' @examples -reveql <- function(prot) { +straightenOperonSeq <- function(prot) { w <- prot # $GenContext.orig # was 'x' y <- rep(NA, length(w)) @@ -57,7 +57,7 @@ reveql <- function(prot) { ## The function to reverse operons -#' reverse_operon +#' reverseOperonSeq #' #' @param prot #' @@ -65,7 +65,7 @@ reveql <- function(prot) { #' @export #' #' @examples -reverse_operon <- function(prot) { +reverseOperonSeq <- function(prot) { gencontext <- prot$GenContext gencontext <- gsub(pattern = ">", replacement = ">|", x = gencontext) @@ -108,7 +108,7 @@ reverse_operon <- function(prot) { - ge <- lapply(1:length(ge), function(x) reveql(ge[[x]])) + ge <- lapply(1:length(ge), function(x) straightenOperonSeq(ge[[x]])) ye <- te[withouteq] @@ -141,4 +141,4 @@ reverse_operon <- function(prot) { # colnames(prot) <- c("AccNum","GenContext.orig","len", "GeneName","TaxID","Species") ## ??? 
straighten operons -# prot$GenContext.orig <- reverse_operon(prot) +# prot$GenContext.orig <- reverseOperonSeq(prot) diff --git a/R/summarize.R b/R/summarize.R index e0dae1c4..2816f174 100644 --- a/R/summarize.R +++ b/R/summarize.R @@ -10,7 +10,7 @@ # suppressPackageStartupMessages(library(rlang)) # conflicted::conflict_prefer("filter", "dplyr") -#' Filter by Domains +#' filterByDomains #' #' @author Samuel Chen, Janani Ravi #' @description filterByDomains filters a data frame by identifying exact domain matches @@ -88,21 +88,35 @@ filterByDomains <- function(prot, column = "DomArch", doms_keep = c(), doms_remo ## COUNTS of DAs and GCs ## ## Before/after break up ## ########################### -## Function to obtain element counts (DA, GC) -#' Count By Column -#' -#' @param prot -#' @param column -#' @param min.freq + +#' countByColumn +#' @description +#' Function to obtain element counts (DA, GC) +#' +#' @param prot A data frame containing the dataset to analyze, typically with +#' multiple columns including the one specified by the `column` parameter. +#' @param column A character string specifying the name of the column to analyze. +#' The default is "DomArch". +#' @param min.freq An integer specifying the minimum frequency an element must +#' have to be included in the output. Default is 1. #' #' @importFrom dplyr arrange as_tibble filter select #' -#' @return Describe return, in detail +#' @return A tibble with two columns: +#' \describe{ +#' \item{`column`}{The unique elements from the specified column +#' (e.g., "DomArch").} +#' \item{`freq`}{The frequency of each element, i.e., the number of times +#' each element appears in the specified column.} +#' } +#' The tibble is filtered to only include elements that have a frequency +#' greater than or equal to `min.freq` and does not include elements with `NA` +#' values or those starting with a hyphen ("-"). #' @export #' #' @examples #' \dontrun{ -#' countByColumn() +#' countByColumn(prot = my_data, column = "DomArch", min.freq = 10) #' } countByColumn <- function(prot = prot, column = "DomArch", min.freq = 1) { counts <- prot %>% @@ -117,25 +131,36 @@ countByColumn <- function(prot = prot, column = "DomArch", min.freq = 1) { return(counts) } -#' Elements 2 Words +#' elements2Words #' #' @description #' Break string ELEMENTS into WORDS for domain architecture (DA) and genomic #' context (GC) #' -#' @param prot [dataframe] -#' @param column [string] column name -#' @param conversion_type [string] type of conversion: 'da2doms': domain architectures to -#' domains. 'gc2da' genomic context to domain architectures +#' @param prot A dataframe containing the dataset to analyze. The specified +#' `column` contains the string elements to be processed. +#' @param column A character string specifying the name of the column to analyze. +#' Default is "DomArch". +#' @param conversion_type A character string specifying the type of conversion. +#' Two options are available: +#' \describe{ +#' \item{`da2doms`}{Convert domain architectures into individual domains by +#' replacing `+` symbols with spaces.} +#' \item{`gc2da`}{Convert genomic context into domain architectures by +#' replacing directional symbols (`<-`, `->`, and `|`) with spaces.} +#' } #' #' @importFrom dplyr pull #' @importFrom stringr str_replace_all #' -#' @return [string] with words delimited by spaces +#' @return A single string where elements are delimited by spaces. 
The function +#' performs necessary substitutions based on the `conversion_type` and cleans up +#' extraneous characters like newlines, tabs, and multiple spaces. #' #' @examples #' \dontrun{ -#' tibble::tibble(DomArch = c("aaa+bbb", "a+b", "b+c", "b-c")) |> elements2Words() +#' tibble::tibble(DomArch = c("aaa+bbb", +#' "a+b", "b+c", "b-c")) |> elements2Words() #' } #' elements2Words <- function(prot, column = "DomArch", conversion_type = "da2doms") { @@ -170,16 +195,25 @@ elements2Words <- function(prot, column = "DomArch", conversion_type = "da2doms" return(z3) } -#' Words 2 Word Counts +#' words2WordCounts #' #' @description #' Get word counts (wc) [DOMAINS (DA) or DOMAIN ARCHITECTURES (GC)] #' -#' @param string +#' @param string A character string containing the elements (words) to count. +#' This would typically be a space-delimited string representing domain +#' architectures or genomic contexts. #' -#' @importFrom dplyr as_tibble filter +#' @importFrom dplyr as_tibble filter arrange +#' @importFrom stringr str_replace_all #' -#' @return [tbl_df] table with 2 columns: 1) words & 2) counts/frequency +#' @return A tibble (tbl_df) with two columns: +#' \describe{ +#' \item{`words`}{A column containing the individual words +#' (domains or domain architectures).} +#' \item{`freq`}{A column containing the frequency counts for each word.} +#' } +#' #' #' @examples #' \dontrun{ @@ -216,13 +250,20 @@ words2WordCounts <- function(string) { arrange(-freq) return(df_word_count) } -## Function to filter based on frequencies -#' Filter Frequency + +#' filterByFrequency +#' @description +#' Function to filter based on frequencies +#' +#' @param x A tibble (tbl_df) containing at least two columns: one for +#' elements (e.g., `words`) and one for their frequency (e.g., `freq`). +#' @param min.freq A numeric value specifying the minimum frequency threshold. +#' Only elements with frequencies greater than or equal to this value will be +#' retained. #' -#' @param x -#' @param min.freq +#' @return A tibble with the same structure as `x`, but filtered to include +#' only rows where the frequency is greater than or equal to `min.freq`. #' -#' @return Describe return, in detail #' @export #' #' @examples @@ -237,17 +278,30 @@ filterByFrequency <- function(x, min.freq) { ######################### ## SUMMARY FUNCTIONS #### ######################### -#' Summarize by Lineage +#' MolEvolvR Summary +#' @name MolEvolvR_summary +#' @description +#' A collection of summary functions for the MolEvolvR package. +#' +NULL + +#' summarizeByLineage #' -#' @param prot -#' @param column -#' @param by -#' @param query +#' @param prot A dataframe or tibble containing the data. +#' @param column A string representing the column to be summarized +#' (e.g., `DomArch`). Default is "DomArch". +#' @param by A string representing the grouping column (e.g., `Lineage`). +#' Default is "Lineage". +#' @param query A string specifying the query pattern for filtering the target +#' column. Use "all" to skip filtering and include all rows. #' #' @importFrom dplyr arrange filter group_by summarise #' @importFrom rlang sym #' -#' @return Describe return, in detail +#' @return A tibble summarizing the counts of occurrences of elements in +#' the `column`, grouped by the `by` column. The result includes the number +#' of occurrences (`count`) and is arranged in descending order of count. 
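The counting helpers documented in this hunk compose naturally; a toy sketch with invented data:

```
library(tibble)

prot <- tibble(DomArch = c("aaa+bbb", "a+b", "b+c", "a+b"))

# Split architectures into domains, count the words, keep the frequent ones.
prot |>
    elements2Words(column = "DomArch", conversion_type = "da2doms") |>
    words2WordCounts() |>
    filterByFrequency(min.freq = 2)
```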
+#' @rdname MolEvolvR_summary #' @export #' #' @examples @@ -283,11 +337,18 @@ summarizeByLineage <- function(prot = "prot", column = "DomArch", by = "Lineage" #' Function to summarize and retrieve counts by Domains & Domains+Lineage #' #' -#' @param x +#' @param x A dataframe or tibble containing the data. It must have columns +#' named `DomArch` and `Lineage`. #' #' @importFrom dplyr arrange count desc filter group_by summarise #' -#' @return Describe return, in detail +#' @return A tibble summarizing the counts of unique domain architectures +#' (`DomArch`) per lineage (`Lineage`). The resulting table contains three +#' columns: `DomArch`, `Lineage`, and `count`, which indicates the frequency +#' of each domain architecture for each lineage. The results are arranged in +#' descending order of `count`. +#' @rdname MolEvolvR_summary +#' #' @export #' #' @examples @@ -302,17 +363,25 @@ summarizeDomArch_ByLineage <- function(x) { arrange(desc(count)) } -## Function to retrieve counts of how many lineages a DomArch appears in + #' summarizeDomArch #' #' @description #' Function to retrieve counts of how many lineages a DomArch appears in #' -#' @param x +#' @param x A dataframe or tibble containing the data. It must have a column +#' named `DomArch` and a count column, such as `count`, which represents the +#' occurrences of each architecture in various lineages. #' #' @importFrom dplyr arrange group_by filter summarise #' -#' @return Describe return, in detail +#' @return A tibble summarizing each unique `DomArch`, along with the following +#' columns: +#' - `totalcount`: The total occurrences of each `DomArch` across all lineages. +#' - `totallin`: The total number of unique lineages in which each `DomArch` +#' appears. +#' The results are arranged in descending order of `totallin` and `totalcount`. +#' @rdname MolEvolvR_summary #' @export #' #' @examples @@ -330,11 +399,21 @@ summarizeDomArch <- function(x) { #' summarizeGenContext_ByDomArchLineage #' -#' @param x +#' @param x A dataframe or tibble containing the data. It must have columns +#' named `GenContext`, `DomArch`, and `Lineage`. #' #' @importFrom dplyr arrange desc filter group_by n summarise #' -#' @return Define return, in detail +#' @return A tibble summarizing each unique combination of `GenContext`, +#' `DomArch`, and `Lineage`, along with the following columns: +#' - `GenContext`: The genomic context for each entry. +#' - `DomArch`: The domain architecture for each entry. +#' - `Lineage`: The lineage associated with each entry. +#' - `count`: The total number of occurrences for each combination of +#' `GenContext`, `DomArch`, and `Lineage`. +#' +#' The results are arranged in descending order of `count`. +#' @rdname MolEvolvR_summary #' @export #' #' @examples @@ -354,11 +433,12 @@ summarizeGenContext_ByDomArchLineage <- function(x) { #' summarizeGenContext_ByLineage #' -#' @param x +#' @param x A dataframe or tibble containing the data. #' #' @importFrom dplyr arrange desc filter group_by n summarise #' #' @return Describe return, in detail +#' @rdname MolEvolvR_summary #' @export #' #' @examples @@ -378,11 +458,20 @@ summarizeGenContext_ByLineage <- function(x) { #' summarizeGenContext #' -#' @param x +#' @param x A dataframe or tibble containing the data. It must have columns +#' named `GenContext`, `DomArch`, and `Lineage`. 
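A small sketch of the summarize family gathered under `MolEvolvR_summary`, reusing the toy tibble from the `summarizeByLineage()` roxygen example:

```
library(tibble)

prot <- tibble(
    DomArch = c("a+b", "a+b", "b+c", "a+b"),
    Lineage = c("l1", "l1", "l1", "l2")
)

# Counts of each DomArch per lineage ("all" skips the query filter).
prot |> summarizeByLineage(column = "DomArch", by = "Lineage", query = "all")

# Per-lineage counts, then how many lineages each DomArch spans.
prot |>
    summarizeDomArch_ByLineage() |>
    summarizeDomArch()
```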
#' -#' @importFrom dplyr arrange desc filter group_by n_distinct summarise +#' @importFrom dplyr arrange desc filter group_by n n_distinct summarise #' -#' @return Describe return, in detail +#' @return A tibble summarizing each unique combination of `GenContext` and +#' `Lineage`, along with the following columns: +#' - `GenContext`: The genomic context for each entry. +#' - `Lineage`: The lineage associated with each entry. +#' - `count`: The total number of occurrences for each combination of +#' `GenContext` and `Lineage`. +#' +#' The results are arranged in descending order of `count`. +#' @rdname MolEvolvR_summary #' @export #' #' @examples @@ -404,7 +493,7 @@ summarizeGenContext <- function(x) { ################## -#' Total Counts +#' totalGenContextOrDomArchCounts #' #' @description #' Creates a data frame with a totalcount column @@ -414,16 +503,28 @@ summarizeGenContext <- function(x) { #' #' @param prot A data frame that must contain columns: #' \itemize{\item Either 'GenContext' or 'DomArch.norep' \item count} -#' @param column Character. The column to summarize -#' @param lineage_col -#' @param cutoff Numeric. Cutoff for total count. Counts below cutoff value will not be shown. Default is 0. -#' @param RowsCutoff -#' @param digits +#' @param column Character. The column to summarize, default is "DomArch". +#' @param lineage_col Character. The name of the lineage column, default is +#' "Lineage". +#' @param cutoff Numeric. Cutoff for total count. Counts below this cutoff value +#' will not be shown. Default is 0. +#' @param RowsCutoff Logical. If TRUE, filters based on cumulative percentage +#' cutoff. Default is FALSE. +#' @param digits Numeric. Number of decimal places for percentage columns. +#' Default is 2. +#' #' #' @importFrom dplyr arrange distinct filter group_by left_join mutate select summarise ungroup #' @importFrom rlang as_string sym #' -#' @return Define return, in detail +#' @return A data frame with the following columns: +#' - `{{ column }}`: Unique values from the specified column. +#' - `totalcount`: The total count of occurrences for each unique value in +#' the specified column. +#' - `IndividualCountPercent`: The percentage of each `totalcount` relative to +#' the overall count. +#' - `CumulativePercent`: The cumulative percentage of total counts. +#' @rdname MolEvolvR_summary #' @export #' #' @note Please refer to the source code if you have alternate file formats and/or @@ -575,7 +676,7 @@ totalGenContextOrDomArchCounts <- function(prot, column = "DomArch", lineage_col -#' Find Paralogs +#' findParalogs #' #' @description #' Creates a data frame of paralogs. diff --git a/R/tree.R b/R/tree.R index 8eb641d9..5cdc20d1 100755 --- a/R/tree.R +++ b/R/tree.R @@ -37,14 +37,23 @@ ## !! FastTree will only work if there are unique sequence names!! #' convertFA2Tree #' -#' @param fa_path -#' @param tre_path -#' @param fasttree_path +#' @param fa_path Path to the input FASTA alignment file (.fa). Default is the +#' path to "data/alns/pspa_snf7.fa". +#' @param tre_path Path to the output file where the generated tree (.tre) will +#' be saved. Default is the path to "data/alns/pspa_snf7.tre". +#' @param fasttree_path Path to the FastTree executable, which is used to +#' generate the phylogenetic tree. Default is "src/FastTree". #' -#' @return +#' @return No return value. The function generates a tree file (.tre) from the +#' input FASTA file. 
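Rounding out the summary family, a hypothetical call to `totalGenContextOrDomArchCounts()`; `prot_counts` is a placeholder for a table carrying the `DomArch`/`GenContext`, `Lineage`, and `count` columns described above:

```
# Per the roxygen above: counts below `cutoff` are dropped, and RowsCutoff
# switches to a cumulative-percentage filter instead.
totals <- totalGenContextOrDomArchCounts(
    prot        = prot_counts,   # placeholder data frame
    column      = "DomArch",
    lineage_col = "Lineage",
    cutoff      = 90,
    RowsCutoff  = FALSE,
    digits      = 2
)
totals
```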
#' @export #' #' @examples +#' \dontrun{ +#' convert_fa2tre(here("data/alns/pspa_snf7.fa"), +#' here("data/alns/pspa_snf7.tre"), +#' here("src/FastTree") +#' } convertFA2Tree <- function(fa_path = here("data/alns/pspa_snf7.fa"), tre_path = here("data/alns/pspa_snf7.tre"), fasttree_path = here("src/FastTree")) { @@ -72,16 +81,22 @@ convertFA2Tree <- function(fa_path = here("data/alns/pspa_snf7.fa"), #' @description #' Generate Trees for ALL fasta files in "data/alns" #' -#' @param aln_path +#' @param aln_path Path to the directory containing all the alignment FASTA +#' files (.fa) for which trees will be generated. Default is "data/alns/". +#' #' #' @importFrom here here #' @importFrom purrr pmap #' @importFrom stringr str_replace_all #' -#' @return +#' @return No return value. The function generates tree files (.tre) for each +#' alignment file in the specified directory. #' @export #' #' @examples +#' \dontrun{ +#' generate_trees(here("data/alns/")) +#' } convertAlignment2Trees <- function(aln_path = here("data/alns/")) { # finding all fasta alignment files fa_filenames <- list.files(path = aln_path, pattern = "*.fa") @@ -111,16 +126,19 @@ convertAlignment2Trees <- function(aln_path = here("data/alns/")) { #' @description #' Generating phylogenetic tree from alignment file '.fa' #' -#' @param fa_file Character. Path to file. -#' Default is 'pspa_snf7.fa' -#' @param out_file +#' @param fa_file Character. Path to the alignment FASTA file (.fa) from which +#' the phylogenetic tree will be generated. Default is 'pspa_snf7.fa'. +#' @param out_file Path to the output file where the generated tree (.tre) will +#' be saved. Default is "data/alns/pspa_snf7.tre". #' #' @importFrom ape write.tree #' @importFrom phangorn bootstrap.pml dist.ml NJ modelTest phyDat plotBS pml pml.control pratchet optim.parsimony optim.pml read.phyDat upgma #' @importFrom seqinr dist.alignment read.alignment #' @importFrom stats logLik #' -#' @return +#' @return No return value. The function generates a phylogenetic tree file +#' (.tre) based on different approaches like Neighbor Joining, UPGMA, and +#' Maximum Likelihood. #' @export #' #' @details The alignment file would need two columns: 1. accession + diff --git a/man/GCA2Lineage.Rd b/man/GCA2Lineage.Rd index 9ec0ce56..9a2a7a30 100644 --- a/man/GCA2Lineage.Rd +++ b/man/GCA2Lineage.Rd @@ -19,7 +19,7 @@ This file can be generated using the "downloadAssemblySummary()" function} \item{lineagelookup_path}{String of the path to the lineage lookup file (taxid to lineage mapping). This file can be generated using the -"create_lineage_lookup()" function} +"createLineageLookup()" function} \item{acc_col}{} } diff --git a/man/IPG2Lineage.Rd b/man/IPG2Lineage.Rd index e24ab617..118812ab 100644 --- a/man/IPG2Lineage.Rd +++ b/man/IPG2Lineage.Rd @@ -29,16 +29,18 @@ file} \item{lineagelookup_path}{String of the path to the lineage lookup file (taxid to lineage mapping). This file can be generated using the -"create_lineage_lookup()" function} +"createLineageLookup()" function} \item{assembly_path}{String of the path to the assembly_summary path This file can be generated using the \link[MolEvolvR]{downloadAssemblySummary} function} } \value{ -Describe return, in detail +A \code{data.table} with the lineage information for the provided protein +accessions. 
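The tree builders touched in this hunk, sketched end to end; the paths and the FastTree binary location are placeholders taken from the function defaults:

```
library(here)

# Single alignment -> tree via the bundled FastTree binary.
convertFA2Tree(
    fa_path       = here("data/alns/pspa_snf7.fa"),
    tre_path      = here("data/alns/pspa_snf7.tre"),
    fasttree_path = here("src/FastTree")
)

# Build trees for every .fa alignment in a directory.
convertAlignment2Trees(aln_path = here("data/alns/"))

# createFA2Tree() instead builds trees with ape/phangorn approaches
# (Neighbor Joining, UPGMA, Maximum Likelihood).
createFA2Tree(
    fa_file  = here("data/alns/pspa_snf7.fa"),
    out_file = here("data/alns/pspa_snf7.tre")
)
```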
} \description{ -Takes the resulting file of an efetch run on the ipg database and +Takes the resulting file +of an efetch run on the ipg database and Takes the resulting file of an efetch run on the ipg database and append lineage, and taxid columns diff --git a/man/MolEvolvR_summary.Rd b/man/MolEvolvR_summary.Rd new file mode 100644 index 00000000..262c4719 --- /dev/null +++ b/man/MolEvolvR_summary.Rd @@ -0,0 +1,157 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{MolEvolvR_summary} +\alias{MolEvolvR_summary} +\alias{summarizeByLineage} +\alias{summarizeDomArch_ByLineage} +\alias{summarizeDomArch} +\alias{summarizeGenContext_ByDomArchLineage} +\alias{summarizeGenContext_ByLineage} +\alias{summarizeGenContext} +\alias{totalGenContextOrDomArchCounts} +\title{MolEvolvR Summary} +\usage{ +summarizeByLineage(prot = "prot", column = "DomArch", by = "Lineage", query) + +summarizeDomArch_ByLineage(x) + +summarizeDomArch(x) + +summarizeGenContext_ByDomArchLineage(x) + +summarizeGenContext_ByLineage(x) + +summarizeGenContext(x) + +totalGenContextOrDomArchCounts( + prot, + column = "DomArch", + lineage_col = "Lineage", + cutoff = 90, + RowsCutoff = FALSE, + digits = 2 +) +} +\arguments{ +\item{prot}{A data frame that must contain columns: +\itemize{\item Either 'GenContext' or 'DomArch.norep' \item count}} + +\item{column}{Character. The column to summarize, default is "DomArch".} + +\item{by}{A string representing the grouping column (e.g., \code{Lineage}). +Default is "Lineage".} + +\item{query}{A string specifying the query pattern for filtering the target +column. Use "all" to skip filtering and include all rows.} + +\item{x}{A dataframe or tibble containing the data. It must have columns +named \code{GenContext}, \code{DomArch}, and \code{Lineage}.} + +\item{lineage_col}{Character. The name of the lineage column, default is +"Lineage".} + +\item{cutoff}{Numeric. Cutoff for total count. Counts below this cutoff value +will not be shown. Default is 0.} + +\item{RowsCutoff}{Logical. If TRUE, filters based on cumulative percentage +cutoff. Default is FALSE.} + +\item{digits}{Numeric. Number of decimal places for percentage columns. +Default is 2.} +} +\value{ +A tibble summarizing the counts of occurrences of elements in +the \code{column}, grouped by the \code{by} column. The result includes the number +of occurrences (\code{count}) and is arranged in descending order of count. + +A tibble summarizing the counts of unique domain architectures +(\code{DomArch}) per lineage (\code{Lineage}). The resulting table contains three +columns: \code{DomArch}, \code{Lineage}, and \code{count}, which indicates the frequency +of each domain architecture for each lineage. The results are arranged in +descending order of \code{count}. + +A tibble summarizing each unique \code{DomArch}, along with the following +columns: +\itemize{ +\item \code{totalcount}: The total occurrences of each \code{DomArch} across all lineages. +\item \code{totallin}: The total number of unique lineages in which each \code{DomArch} +appears. +The results are arranged in descending order of \code{totallin} and \code{totalcount}. +} + +A tibble summarizing each unique combination of \code{GenContext}, +\code{DomArch}, and \code{Lineage}, along with the following columns: +\itemize{ +\item \code{GenContext}: The genomic context for each entry. +\item \code{DomArch}: The domain architecture for each entry. +\item \code{Lineage}: The lineage associated with each entry. 
+\item \code{count}: The total number of occurrences for each combination of +\code{GenContext}, \code{DomArch}, and \code{Lineage}. +} + +The results are arranged in descending order of \code{count}. + +Describe return, in detail + +A tibble summarizing each unique combination of \code{GenContext} and +\code{Lineage}, along with the following columns: +\itemize{ +\item \code{GenContext}: The genomic context for each entry. +\item \code{Lineage}: The lineage associated with each entry. +\item \code{count}: The total number of occurrences for each combination of +\code{GenContext} and \code{Lineage}. +} + +The results are arranged in descending order of \code{count}. + +A data frame with the following columns: +\itemize{ +\item \code{{{ column }}}: Unique values from the specified column. +\item \code{totalcount}: The total count of occurrences for each unique value in +the specified column. +\item \code{IndividualCountPercent}: The percentage of each \code{totalcount} relative to +the overall count. +\item \code{CumulativePercent}: The cumulative percentage of total counts. +} +} +\description{ +A collection of summary functions for the MolEvolvR package. + +Function to summarize and retrieve counts by Domains & Domains+Lineage + +Function to retrieve counts of how many lineages a DomArch appears in + +Creates a data frame with a totalcount column + +This function is designed to sum the counts column by either Genomic Context or Domain Architecture and creates a totalcount column from those sums. +} +\note{ +Please refer to the source code if you have alternate file formats and/or +column names. +} +\examples{ +\dontrun{ +library(tidyverse) +tibble(DomArch = c("a+b", "a+b", "b+c", "a+b"), Lineage = c("l1", "l1", "l1", "l2")) |> + summarizeByLineage(query = "all") +} + +\dontrun{ +summarizeDomArch_ByLineage() +} +\dontrun{ +summarizeDomArch() +} +\dontrun{ +summarizeGenContext_ByDomArchLineage +} +\dontrun{ +summarizeGenContext_ByLineage() +} +\dontrun{ +summarizeGenContext() +} +\dontrun{ +totalGenContextOrDomArchCounts(pspa - gc_lin_counts, 0, "GC") +} +} diff --git a/man/acc2Lineage.Rd b/man/acc2Lineage.Rd index a24bdc9a..a46b6f20 100644 --- a/man/acc2Lineage.Rd +++ b/man/acc2Lineage.Rd @@ -35,7 +35,8 @@ on the ipg database. If NULL, the file will not be written. Defaults to NULL} \item{plan}{} } \value{ -Describe return, in detail +A \code{data.table} that contains the lineage information, mapping protein +accessions to their tax IDs and lineages. } \description{ This function combines 'efetchIPG()' and 'IPG2Lineage()' to map a set diff --git a/man/acc2fa.Rd b/man/acc2fa.Rd index 158b2d51..3e7a756d 100644 --- a/man/acc2fa.Rd +++ b/man/acc2fa.Rd @@ -15,6 +15,9 @@ Function may not work for vectors of length > 10,000} \item{plan}{} } \description{ +converts protein accession numbers to a fasta format. Resulting +fasta file is written to the outpath. + acc2fa converts protein accession numbers to a fasta format. Resulting fasta file is written to the outpath. 
} diff --git a/man/addLeaves2Alignment.Rd b/man/addLeaves2Alignment.Rd index a758ebd5..d00e6df7 100644 --- a/man/addLeaves2Alignment.Rd +++ b/man/addLeaves2Alignment.Rd @@ -1,9 +1,15 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{addLeaves2Alignment} \alias{addLeaves2Alignment} -\title{Adding Leaves to an alignment file w/ accessions} +\title{addLeaves2Alignment} \usage{ +addLeaves2Alignment( + aln_file = "", + lin_file = "data/rawdata_tsv/all_semiclean.txt", + reduced = FALSE +) + addLeaves2Alignment( aln_file = "", lin_file = "data/rawdata_tsv/all_semiclean.txt", @@ -11,7 +17,7 @@ addLeaves2Alignment( ) } \arguments{ -\item{aln_file}{haracter. Path to file. Input tab-delimited file + +\item{aln_file}{Character. Path to file. Input tab-delimited file + alignment file accnum & alignment. Default is 'pspa_snf7.aln'} @@ -23,15 +29,25 @@ Default is 'pspa.txt'} only one sequence per lineage. Default is FALSE.} } \description{ +Adding Leaves to an alignment file w/ accessions +Genomic Contexts vs Domain Architectures. + Adding Leaves to an alignment file w/ accessions Genomic Contexts vs Domain Architectures. } \details{ +The alignment file would need two columns: 1. accession + +number and 2. alignment. The protein homolog accession to lineage mapping + +file should have + The alignment file would need two columns: 1. accession + number and 2. alignment. The protein homolog accession to lineage mapping + file should have } \note{ +Please refer to the source code if you have alternate + +file formats and/or column names. + Please refer to the source code if you have alternate + file formats and/or column names. } @@ -39,6 +55,9 @@ file formats and/or column names. \dontrun{ addLeaves2Alignment("pspa_snf7.aln", "pspa.txt") } +\dontrun{ +addLeaves2Alignment("pspa_snf7.aln", "pspa.txt") +} } \author{ Janani Ravi diff --git a/man/addLineage.Rd b/man/addLineage.Rd index 6694e94c..ab02a5ab 100644 --- a/man/addLineage.Rd +++ b/man/addLineage.Rd @@ -23,10 +23,26 @@ addLineage( ) } \arguments{ +\item{df}{A \code{data.frame} containing the input data. One column must contain +the accession numbers.} + +\item{acc_col}{A string specifying the column name in \code{df} that holds the +accession numbers. Defaults to \code{"AccNum"}.} + +\item{assembly_path}{A string specifying the path to the \code{assembly_summary.txt} +file. This file contains metadata about assemblies.} + +\item{lineagelookup_path}{A string specifying the path to the lineage lookup +file, which contains a mapping from tax IDs to their corresponding lineages.} + +\item{ipgout_path}{(Optional) A string specifying the path where IPG database +fetch results will be saved. If \code{NULL}, the results are not written to a file.} + \item{plan}{} } \value{ -Describe return, in detail +A \code{data.frame} that combines the original \code{df} with the lineage +information. 
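A hypothetical `addLineage()` call matching the parameter docs added above; `prot_df` and both lookup paths are placeholders (see `downloadAssemblySummary()` and `createLineageLookup()` for producing the lookups):

```
df_with_lineage <- addLineage(
    df                 = prot_df,                  # placeholder: must have an AccNum column
    acc_col            = "AccNum",
    assembly_path      = "data/assembly_summary.txt",
    lineagelookup_path = "data/lineage_lookup.tsv",
    ipgout_path        = NULL                      # don't write the IPG fetch to disk
)
```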
} \description{ addLineage diff --git a/man/addName.Rd b/man/addName.Rd index e04f9849..6f171456 100644 --- a/man/addName.Rd +++ b/man/addName.Rd @@ -1,9 +1,18 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{addName} \alias{addName} -\title{Add Name} +\title{addName} \usage{ +addName( + data, + accnum_col = "AccNum", + spec_col = "Species", + lin_col = "Lineage", + lin_sep = ">", + out_col = "Name" +) + addName( data, accnum_col = "AccNum", @@ -28,9 +37,14 @@ addName( Lineage, and AccNum info} } \value{ +Original data with a 'Name' column + Original data with a 'Name' column } \description{ +This function adds a new 'Name' column that is comprised of components from +Kingdom, Phylum, Genus, and species, as well as the accession + This function adds a new 'Name' column that is comprised of components from Kingdom, Phylum, Genus, and species, as well as the accession } diff --git a/man/add_leaves.Rd b/man/add_leaves.Rd deleted file mode 100644 index f1eeed10..00000000 --- a/man/add_leaves.Rd +++ /dev/null @@ -1,50 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{add_leaves} -\alias{add_leaves} -\title{Adding Leaves to an alignment file w/ accessions} -\usage{ -add_leaves( - aln_file = "", - lin_file = "data/rawdata_tsv/all_semiclean.txt", - reduced = FALSE -) -} -\arguments{ -\item{aln_file}{Character. Path to file. Input tab-delimited file + -alignment file accnum & alignment. -Default is 'pspa_snf7.aln'} - -\item{lin_file}{Character. Path to file. Protein file with accession + -number to lineage mapping. -Default is 'pspa.txt'} - -\item{reduced}{Boolean. If TRUE, a reduced data frame will be generated with -only one sequence per lineage. Default is FALSE.} -} -\description{ -Adding Leaves to an alignment file w/ accessions -Genomic Contexts vs Domain Architectures. -} -\details{ -The alignment file would need two columns: 1. accession + -number and 2. alignment. The protein homolog accession to lineage mapping + -file should have -} -\note{ -Please refer to the source code if you have alternate + -file formats and/or column names. 
-} -\examples{ -\dontrun{ -add_leaves("pspa_snf7.aln", "pspa.txt") -} -} -\author{ -Janani Ravi -} -\keyword{accnum,} -\keyword{alignment,} -\keyword{leaves,} -\keyword{lineage,} -\keyword{species} diff --git a/man/add_name.Rd b/man/add_name.Rd deleted file mode 100644 index f19139e1..00000000 --- a/man/add_name.Rd +++ /dev/null @@ -1,39 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{add_name} -\alias{add_name} -\title{Title} -\usage{ -add_name( - data, - accnum_col = "AccNum", - spec_col = "Species", - lin_col = "Lineage", - lin_sep = ">", - out_col = "Name" -) -} -\arguments{ -\item{data}{Data to add name column to} - -\item{accnum_col}{Column containing accession numbers} - -\item{spec_col}{Column containing species} - -\item{lin_col}{Column containing lineage} - -\item{lin_sep}{Character separating lineage levels} - -\item{out_col}{Column that contains the new 'Name' derived from Species, -Lineage, and AccNum info} -} -\value{ -Original data with a 'Name' column -} -\description{ -This function adds a new 'Name' column that is comprised of components from -Kingdom, Phylum, Genus, and species, as well as the accession -} -\author{ -Samuel Chen, Janani Ravi -} diff --git a/man/alignFasta.Rd b/man/alignFasta.Rd index 21b020cf..02a3026b 100644 --- a/man/alignFasta.Rd +++ b/man/alignFasta.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{alignFasta} \alias{alignFasta} -\title{Perform a Multiple Sequence Alignment on a FASTA file.} +\title{alignFasta} \usage{ alignFasta(fasta_file, tool = "Muscle", outpath = NULL) @@ -21,6 +21,8 @@ aligned fasta sequence as a MsaAAMultipleAlignment object aligned fasta sequence as a MsaAAMultipleAlignment object } \description{ +Perform a Multiple Sequence Alignment on a FASTA file. + Perform a Multiple Sequence Alignment on a FASTA file. 
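The accession-to-alignment helpers renamed in this patch chain together roughly as follows; the accession numbers and file names are placeholders, and Muscle must be available to the msa package:

```
# 1. Fetch sequences for a set of accessions into a FASTA file.
acc2fa(c("ANY95992.1", "AAC12345.1"), outpath = "seqs.fa")

# 2. Align them (returns an MsaAAMultipleAlignment) and write the alignment as FASTA.
aln <- alignFasta("seqs.fa", tool = "Muscle")
writeMSA_AA2FA(aln, outpath = "seqs.aln.fa")

# 3. Pull the accession numbers back out of a FASTA file.
getAccNumFromFA("seqs.aln.fa")
```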
} \author{ diff --git a/man/assign_job_queue.Rd b/man/assignJobQueue.Rd similarity index 58% rename from man/assign_job_queue.Rd rename to man/assignJobQueue.Rd index ceb6fa77..de646a82 100644 --- a/man/assign_job_queue.Rd +++ b/man/assignJobQueue.Rd @@ -1,14 +1,14 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{assign_job_queue} -\alias{assign_job_queue} -\title{Decision function to assign job queue} +\name{assignJobQueue} +\alias{assignJobQueue} +\title{assignJobQueue} \usage{ -assign_job_queue(t_sec_estimate, t_cutoff = 21600) +assignJobQueue(t_sec_estimate, t_cutoff = 21600) } \arguments{ \item{t_sec_estimate}{estimated number of seconds a job will process -(from advanced_opts2est_walltime())} +(from calculateEstimatedWallTimeFromOpts ())} \item{t_long}{threshold value that defines the lower bound for assigning a job to the "long queue"} @@ -17,8 +17,9 @@ job to the "long queue"} a string of "short" or "long" example: -advanced_opts2est_walltime(c("homology_search", "domain_architecture"), 3) |> -assign_job_queue() +calculateEstimatedWallTimeFromOpts (c("homology_search", +"domain_architecture"), 3) |> +assignJobQueue() } \description{ Decision function to assign job queue diff --git a/man/advanced_opts2est_walltime.Rd b/man/calculateEstimatedWallTimeFromOpts.Rd similarity index 58% rename from man/advanced_opts2est_walltime.Rd rename to man/calculateEstimatedWallTimeFromOpts.Rd index ea4b29e6..d5361001 100644 --- a/man/advanced_opts2est_walltime.Rd +++ b/man/calculateEstimatedWallTimeFromOpts.Rd @@ -1,11 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{advanced_opts2est_walltime} -\alias{advanced_opts2est_walltime} -\title{Given MolEvolvR advanced options and number of inputs, -calculate the total estimated walltime for the job} +\name{calculateEstimatedWallTimeFromOpts} +\alias{calculateEstimatedWallTimeFromOpts} +\title{calculateEstimatedWallTimeFromOpts} \usage{ -advanced_opts2est_walltime( +calculateEstimatedWallTimeFromOpts( advanced_opts, n_inputs = 1L, n_hits = NULL, @@ -14,14 +13,16 @@ advanced_opts2est_walltime( } \arguments{ \item{advanced_opts}{character vector of MolEvolvR advanced options -(see make_opts2procs for the options)} +(see mapOption2Process for the options)} \item{n_inputs}{total number of input proteins} } \value{ total estimated number of seconds a job will process (walltime) -example: advanced_opts2est_walltime(c("homology_search", "domain_architecture"), n_inputs = 3, n_hits = 50L) +example: calculateEstimatedWallTimeFromOpts (c("homology_search", +"domain_architecture"), +n_inputs = 3, n_hits = 50L) } \description{ Given MolEvolvR advanced options and number of inputs, diff --git a/man/get_proc_medians.Rd b/man/calculateProcessRuntime.Rd similarity index 72% rename from man/get_proc_medians.Rd rename to man/calculateProcessRuntime.Rd index b6db0b56..579ea2b6 100644 --- a/man/get_proc_medians.Rd +++ b/man/calculateProcessRuntime.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{get_proc_medians} -\alias{get_proc_medians} -\title{Scrape MolEvolvR logs and calculate median processes} +\name{calculateProcessRuntime} +\alias{calculateProcessRuntime} +\title{calculateProcessRuntime} \usage{ -get_proc_medians(dir_job_results) +calculateProcessRuntime(dir_job_results) } \arguments{ \item{dir_job_results}{\link{chr} path to MolEvolvR job_results @@ -21,12 +21,12 @@ 
examples: } dir_job_results <- "/data/scratch/janani/molevolvr_out" -list_proc_medians <- get_proc_medians(dir_job_results) +list_proc_medians <- calculateProcessRuntime(dir_job_results) \enumerate{ \item from outside container environment common_root <- "/data/molevolvr_transfer/molevolvr_dev" dir_job_results <- "/data/molevolvr_transfer/molevolvr_dev/job_results" -list_proc_medians <- get_proc_medians(dir_job_results) +list_proc_medians <- calculateProcessRuntime(dir_job_results) } } \description{ diff --git a/man/clean_clust_file.Rd b/man/cleanClusterFile.Rd similarity index 82% rename from man/clean_clust_file.Rd rename to man/cleanClusterFile.Rd index bba3072e..d2818662 100644 --- a/man/clean_clust_file.Rd +++ b/man/cleanClusterFile.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/clean_clust_file.R -\name{clean_clust_file} -\alias{clean_clust_file} +\name{cleanClusterFile} +\alias{cleanClusterFile} \title{Clean Cluster File} \usage{ -clean_clust_file(path, writepath = NULL, query) +cleanClusterFile(path, writepath = NULL, query) } \arguments{ \item{path}{A character to the path of the cluster file to be cleaned} @@ -24,6 +24,6 @@ This function reads a space-separated cluster file and converts it to a cleaned } \examples{ \dontrun{ -clean_clust_file("data/pspa.op_ins_cls", writepath = NULL, query = "pspa") +cleanClusterFile("data/pspa.op_ins_cls", writepath = NULL, query = "pspa") } } diff --git a/man/combine_files.Rd b/man/combineFiles.Rd similarity index 92% rename from man/combine_files.Rd rename to man/combineFiles.Rd index 4126eb9e..3b56b923 100644 --- a/man/combine_files.Rd +++ b/man/combineFiles.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/combine_files.R -\name{combine_files} -\alias{combine_files} +\name{combineFiles} +\alias{combineFiles} \title{Download the combined assembly summaries of genbank and refseq} \usage{ -combine_files( +combineFiles( inpath = c("../molevol_data/project_data/phage_defense/"), pattern = "*full_analysis.tsv", delim = "\\t", diff --git a/man/combine_full.Rd b/man/combineFullAnalysis.Rd similarity index 69% rename from man/combine_full.Rd rename to man/combineFullAnalysis.Rd index f4e6597b..35925e86 100644 --- a/man/combine_full.Rd +++ b/man/combineFullAnalysis.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/combine_analysis.R -\name{combine_full} -\alias{combine_full} +\name{combineFullAnalysis} +\alias{combineFullAnalysis} \title{Combining full_analysis files} \usage{ -combine_full(inpath, ret = FALSE) +combineFullAnalysis(inpath, ret = FALSE) } \arguments{ \item{ret}{} diff --git a/man/combine_ipr.Rd b/man/combineIPR.Rd similarity index 74% rename from man/combine_ipr.Rd rename to man/combineIPR.Rd index 52aa3057..035c4274 100644 --- a/man/combine_ipr.Rd +++ b/man/combineIPR.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/combine_analysis.R -\name{combine_ipr} -\alias{combine_ipr} +\name{combineIPR} +\alias{combineIPR} \title{Combining clean ipr files} \usage{ -combine_ipr(inpath, ret = FALSE) +combineIPR(inpath, ret = FALSE) } \arguments{ \item{ret}{} diff --git a/man/convert2TitleCase.Rd b/man/convert2TitleCase.Rd index 84e7fa00..72619285 100644 --- a/man/convert2TitleCase.Rd +++ b/man/convert2TitleCase.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit 
documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{convert2TitleCase} \alias{convert2TitleCase} \alias{totitle,} @@ -7,6 +7,8 @@ \title{Changing case to 'Title Case'} \usage{ convert2TitleCase(text, delimitter) + +to_titlecase(text, delimitter) } \arguments{ \item{x}{Character vector.} @@ -15,8 +17,13 @@ convert2TitleCase(text, delimitter) } \description{ Translate string to Title Case w/ delimitter. + +Translate string to Title Case w/ delimitter. +Changing case to 'Title Case' } \seealso{ +chartr, toupper, and tolower. + chartr, toupper, and tolower. } \author{ diff --git a/man/convertAlignment2FA.Rd b/man/convertAlignment2FA.Rd index d6b4dc56..8e9ceb94 100644 --- a/man/convertAlignment2FA.Rd +++ b/man/convertAlignment2FA.Rd @@ -1,9 +1,16 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{convertAlignment2FA} \alias{convertAlignment2FA} -\title{Adding Leaves to an alignment file w/ accessions} +\title{convertAlignment2FA} \usage{ +convertAlignment2FA( + aln_file = "", + lin_file = "data/rawdata_tsv/all_semiclean.txt", + fa_outpath = "", + reduced = FALSE +) + convertAlignment2FA( aln_file = "", lin_file = "data/rawdata_tsv/all_semiclean.txt", @@ -31,11 +38,18 @@ Adding Leaves to an alignment file w/ accessions Genomic Contexts vs Domain Architectures. } \details{ +The alignment file would need two columns: 1. accession + +number and 2. alignment. The protein homolog accession to lineage mapping + +file should have + The alignment file would need two columns: 1. accession + number and 2. alignment. The protein homolog accession to lineage mapping + file should have } \note{ +Please refer to the source code if you have alternate + +file formats and/or column names. + Please refer to the source code if you have alternate + file formats and/or column names. } @@ -44,6 +58,9 @@ file formats and/or column names. addLeaves2Alignment("pspa_snf7.aln", "pspa.txt") } +\dontrun{ +convertAlignment2FA("pspa_snf7.aln", "pspa.txt") +} } \author{ Janani Ravi diff --git a/man/convertAlignment2Trees.Rd b/man/convertAlignment2Trees.Rd index 002f5203..e0c8fe34 100644 --- a/man/convertAlignment2Trees.Rd +++ b/man/convertAlignment2Trees.Rd @@ -7,8 +7,18 @@ convertAlignment2Trees(aln_path = here("data/alns/")) } \arguments{ -\item{aln_path}{} +\item{aln_path}{Path to the directory containing all the alignment FASTA +files (.fa) for which trees will be generated. Default is "data/alns/".} +} +\value{ +No return value. The function generates tree files (.tre) for each +alignment file in the specified directory. } \description{ Generate Trees for ALL fasta files in "data/alns" } +\examples{ +\dontrun{ +generate_trees(here("data/alns/")) +} +} diff --git a/man/convertFA2Tree.Rd b/man/convertFA2Tree.Rd index b2fb93de..f97cd3d7 100644 --- a/man/convertFA2Tree.Rd +++ b/man/convertFA2Tree.Rd @@ -11,8 +11,26 @@ convertFA2Tree( ) } \arguments{ -\item{fasttree_path}{} +\item{fa_path}{Path to the input FASTA alignment file (.fa). Default is the +path to "data/alns/pspa_snf7.fa".} + +\item{tre_path}{Path to the output file where the generated tree (.tre) will +be saved. Default is the path to "data/alns/pspa_snf7.tre".} + +\item{fasttree_path}{Path to the FastTree executable, which is used to +generate the phylogenetic tree. Default is "src/FastTree".} +} +\value{ +No return value. The function generates a tree file (.tre) from the +input FASTA file. 
} \description{ convertFA2Tree } +\examples{ +\dontrun{ +convert_fa2tre(here("data/alns/pspa_snf7.fa"), + here("data/alns/pspa_snf7.tre"), + here("src/FastTree") +} +} diff --git a/man/convert_aln2fa.Rd b/man/convert_aln2fa.Rd deleted file mode 100644 index 8bebe31d..00000000 --- a/man/convert_aln2fa.Rd +++ /dev/null @@ -1,53 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{convert_aln2fa} -\alias{convert_aln2fa} -\title{Adding Leaves to an alignment file w/ accessions} -\usage{ -convert_aln2fa( - aln_file = "", - lin_file = "data/rawdata_tsv/all_semiclean.txt", - fa_outpath = "", - reduced = FALSE -) -} -\arguments{ -\item{aln_file}{Character. Path to file. Input tab-delimited file + -alignment file accnum & alignment. -Default is 'pspa_snf7.aln'} - -\item{lin_file}{Character. Path to file. Protein file with accession + -number to lineage mapping. -Default is 'pspa.txt'} - -\item{fa_outpath}{Character. Path to the written fasta file. -Default is 'NULL'} - -\item{reduced}{Boolean. If TRUE, the fasta file will contain only one sequence per lineage. -Default is 'FALSE'} -} -\description{ -Adding Leaves to an alignment file w/ accessions -} -\details{ -The alignment file would need two columns: 1. accession + -number and 2. alignment. The protein homolog accession to lineage mapping + -file should have -} -\note{ -Please refer to the source code if you have alternate + -file formats and/or column names. -} -\examples{ -\dontrun{ -add_leaves("pspa_snf7.aln", "pspa.txt") -} -} -\author{ -Janani Ravi -} -\keyword{accnum,} -\keyword{alignment,} -\keyword{leaves,} -\keyword{lineage,} -\keyword{species} diff --git a/man/countByColumn.Rd b/man/countByColumn.Rd new file mode 100644 index 00000000..57ff9ac4 --- /dev/null +++ b/man/countByColumn.Rd @@ -0,0 +1,38 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{countByColumn} +\alias{countByColumn} +\title{countByColumn} +\usage{ +countByColumn(prot = prot, column = "DomArch", min.freq = 1) +} +\arguments{ +\item{prot}{A data frame containing the dataset to analyze, typically with +multiple columns including the one specified by the \code{column} parameter.} + +\item{column}{A character string specifying the name of the column to analyze. +The default is "DomArch".} + +\item{min.freq}{An integer specifying the minimum frequency an element must +have to be included in the output. Default is 1.} +} +\value{ +A tibble with two columns: +\describe{ +\item{\code{column}}{The unique elements from the specified column +(e.g., "DomArch").} +\item{\code{freq}}{The frequency of each element, i.e., the number of times +each element appears in the specified column.} +} +The tibble is filtered to only include elements that have a frequency +greater than or equal to \code{min.freq} and does not include elements with \code{NA} +values or those starting with a hyphen ("-"). +} +\description{ +Function to obtain element counts (DA, GC) +} +\examples{ +\dontrun{ +countByColumn(prot = my_data, column = "DomArch", min.freq = 10) +} +} diff --git a/man/createFA2Tree.Rd b/man/createFA2Tree.Rd index 76da7807..90054280 100644 --- a/man/createFA2Tree.Rd +++ b/man/createFA2Tree.Rd @@ -10,10 +10,16 @@ createFA2Tree( ) } \arguments{ -\item{fa_file}{Character. Path to file. -Default is 'pspa_snf7.fa'} +\item{fa_file}{Character. Path to the alignment FASTA file (.fa) from which +the phylogenetic tree will be generated. 
Default is 'pspa_snf7.fa'.} -\item{out_file}{} +\item{out_file}{Path to the output file where the generated tree (.tre) will +be saved. Default is "data/alns/pspa_snf7.tre".} +} +\value{ +No return value. The function generates a phylogenetic tree file +(.tre) based on different approaches like Neighbor Joining, UPGMA, and +Maximum Likelihood. } \description{ Generating phylogenetic tree from alignment file '.fa' diff --git a/man/create_lineage_lookup.Rd b/man/createLineageLookup.Rd similarity index 85% rename from man/create_lineage_lookup.Rd rename to man/createLineageLookup.Rd index 51670f35..132019ce 100644 --- a/man/create_lineage_lookup.Rd +++ b/man/createLineageLookup.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/create_lineage_lookup.R -\name{create_lineage_lookup} -\alias{create_lineage_lookup} -\title{Create a look up table that goes from TaxID, to Lineage} +\name{createLineageLookup} +\alias{createLineageLookup} +\title{createLineageLookup} \usage{ -create_lineage_lookup( +createLineageLookup( lineage_file = here("data/rankedlineage.dmp"), outfile, taxonomic_rank = "phylum" diff --git a/man/RepresentativeAccNums.Rd b/man/createRepresentativeAccNum.Rd similarity index 60% rename from man/RepresentativeAccNums.Rd rename to man/createRepresentativeAccNum.Rd index f617cde4..3bd20522 100644 --- a/man/RepresentativeAccNums.Rd +++ b/man/createRepresentativeAccNum.Rd @@ -1,12 +1,20 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R -\name{RepresentativeAccNums} -\alias{RepresentativeAccNums} -\title{Function to generate a vector of one Accession number per distinct observation from 'reduced' column} +\name{createRepresentativeAccNum} +\alias{createRepresentativeAccNum} +\title{createRepresentativeAccNum} \usage{ -RepresentativeAccNums(prot_data, reduced = "Lineage", accnum_col = "AccNum") +createRepresentativeAccNum( + prot_data, + reduced = "Lineage", + accnum_col = "AccNum" +) -RepresentativeAccNums(prot_data, reduced = "Lineage", accnum_col = "AccNum") +createRepresentativeAccNum( + prot_data, + reduced = "Lineage", + accnum_col = "AccNum" +) } \arguments{ \item{prot_data}{Data frame containing Accession Numbers} @@ -18,6 +26,8 @@ One accession number will be assigned for each of these observations} \item{accnum_col}{Column from prot_data that contains Accession Numbers} } \description{ +Function to generate a vector of one Accession number per distinct observation from 'reduced' column + Function to generate a vector of one Accession number per distinct observation from 'reduced' column } \author{ diff --git a/man/efetchIPG.Rd b/man/efetchIPG.Rd index 6a5d85a4..047e2652 100644 --- a/man/efetchIPG.Rd +++ b/man/efetchIPG.Rd @@ -20,7 +20,7 @@ the ipg database} the ipg database} } \value{ -Describe return, in detail +No return value. The function writes the fetched results to \code{out_path}. } \description{ Perform efetch on the ipg database and write the results to out_path diff --git a/man/elements2Words.Rd b/man/elements2Words.Rd index 1094d363..bfd3071b 100644 --- a/man/elements2Words.Rd +++ b/man/elements2Words.Rd @@ -2,20 +2,30 @@ % Please edit documentation in R/summarize.R \name{elements2Words} \alias{elements2Words} -\title{Elements 2 Words} +\title{elements2Words} \usage{ elements2Words(prot, column = "DomArch", conversion_type = "da2doms") } \arguments{ -\item{prot}{\link{dataframe}} +\item{prot}{A dataframe containing the dataset to analyze. 
The specified +\code{column} contains the string elements to be processed.} -\item{column}{\link{string} column name} +\item{column}{A character string specifying the name of the column to analyze. +Default is "DomArch".} -\item{conversion_type}{\link{string} type of conversion: 'da2doms': domain architectures to -domains. 'gc2da' genomic context to domain architectures} +\item{conversion_type}{A character string specifying the type of conversion. +Two options are available: +\describe{ +\item{\code{da2doms}}{Convert domain architectures into individual domains by +replacing \code{+} symbols with spaces.} +\item{\code{gc2da}}{Convert genomic context into domain architectures by +replacing directional symbols (\verb{<-}, \verb{->}, and \code{|}) with spaces.} +}} } \value{ -\link{string} with words delimited by spaces +A single string where elements are delimited by spaces. The function +performs necessary substitutions based on the \code{conversion_type} and cleans up +extraneous characters like newlines, tabs, and multiple spaces. } \description{ Break string ELEMENTS into WORDS for domain architecture (DA) and genomic @@ -23,7 +33,8 @@ context (GC) } \examples{ \dontrun{ -tibble::tibble(DomArch = c("aaa+bbb", "a+b", "b+c", "b-c")) |> elements2Words() +tibble::tibble(DomArch = c("aaa+bbb", +"a+b", "b+c", "b-c")) |> elements2Words() } } diff --git a/man/filterByDomains.Rd b/man/filterByDomains.Rd new file mode 100644 index 00000000..afb3e5cb --- /dev/null +++ b/man/filterByDomains.Rd @@ -0,0 +1,44 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{filterByDomains} +\alias{filterByDomains} +\title{filterByDomains} +\usage{ +filterByDomains( + prot, + column = "DomArch", + doms_keep = c(), + doms_remove = c(), + ignore.case = FALSE +) +} +\arguments{ +\item{prot}{Dataframe to filter} + +\item{column}{Column to search for domains in (DomArch column)} + +\item{doms_keep}{Vector of domains that must be identified within column in order for +observation to be kept} + +\item{doms_remove}{Vector of domains that, if found within an observation, will be removed} + +\item{ignore.case}{Should the matching be non case sensitive} +} +\value{ +Filtered data frame +} +\description{ +filterByDomains filters a data frame by identifying exact domain matches +and either keeping or removing rows with the identified domain +} +\note{ +There is no need to make the domains 'regex safe', that will be handled by this function +} +\examples{ +\dontrun{ +filterByDomains() +} +} +\author{ +Samuel Chen, Janani Ravi +} diff --git a/man/filterByFrequency.Rd b/man/filterByFrequency.Rd new file mode 100644 index 00000000..15d06d67 --- /dev/null +++ b/man/filterByFrequency.Rd @@ -0,0 +1,28 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{filterByFrequency} +\alias{filterByFrequency} +\title{filterByFrequency} +\usage{ +filterByFrequency(x, min.freq) +} +\arguments{ +\item{x}{A tibble (tbl_df) containing at least two columns: one for +elements (e.g., \code{words}) and one for their frequency (e.g., \code{freq}).} + +\item{min.freq}{A numeric value specifying the minimum frequency threshold. +Only elements with frequencies greater than or equal to this value will be +retained.} +} +\value{ +A tibble with the same structure as \code{x}, but filtered to include +only rows where the frequency is greater than or equal to \code{min.freq}. 
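The new summarize.R helpers documented above — countByColumn(), filterByDomains(), and filterByFrequency() — are easiest to follow when chained together. The sketch below is illustrative only and is not part of the diff: it assumes the package is attached (e.g., library(MolEvolvR), per the repository name), builds a made-up tibble prot_df, and simply follows the argument signatures documented in these .Rd entries.

```
# Illustrative sketch only; prot_df is a toy stand-in for real MolEvolvR output.
library(MolEvolvR) # assumed package name, per the repository JRaviLab/MolEvolvR
library(tibble)

prot_df <- tibble(
  DomArch = c("PspA+Snf7", "PspA+Snf7", "PspA", "Snf7+Vps4", "PspA")
)

# Count occurrences of each domain architecture
# (documented signature: countByColumn(prot, column, min.freq)).
da_counts <- countByColumn(prot = prot_df, column = "DomArch", min.freq = 1)

# Keep rows whose DomArch contains PspA but not Vps4
# (documented signature: filterByDomains(prot, column, doms_keep, doms_remove, ignore.case)).
pspa_rows <- filterByDomains(prot_df,
  column = "DomArch",
  doms_keep = c("PspA"),
  doms_remove = c("Vps4")
)

# Keep only architectures counted at least twice
# (documented signature: filterByFrequency(x, min.freq)).
frequent_das <- filterByFrequency(da_counts, min.freq = 2)
```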
+} +\description{ +Function to filter based on frequencies +} +\examples{ +\dontrun{ +filterByFrequency() +} +} diff --git a/man/findParalogs.Rd b/man/findParalogs.Rd new file mode 100644 index 00000000..d92edf71 --- /dev/null +++ b/man/findParalogs.Rd @@ -0,0 +1,26 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{findParalogs} +\alias{findParalogs} +\title{findParalogs} +\usage{ +findParalogs(prot) +} +\arguments{ +\item{prot}{A data frame filtered by a Query, containing columns Species and Lineage} +} +\value{ +returns a dataframe containing paralogs and the counts. +} +\description{ +Creates a data frame of paralogs. +} +\note{ +Please refer to the source code if you have alternate file formats and/or +column names. +} +\examples{ +\dontrun{ +findParalogs(pspa) +} +} diff --git a/man/generateAllAlignments2FA.Rd b/man/generateAllAlignments2FA.Rd index 3bf9938a..8f9d8ffc 100644 --- a/man/generateAllAlignments2FA.Rd +++ b/man/generateAllAlignments2FA.Rd @@ -1,9 +1,16 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{generateAllAlignments2FA} \alias{generateAllAlignments2FA} -\title{Adding Leaves to an alignment file w/ accessions} +\title{generateAllAlignments2FA} \usage{ +generateAllAlignments2FA( + aln_path = here("data/rawdata_aln/"), + fa_outpath = here("data/alns/"), + lin_file = here("data/rawdata_tsv/all_semiclean.txt"), + reduced = F +) + generateAllAlignments2FA( aln_path = here("data/rawdata_aln/"), fa_outpath = here("data/alns/"), @@ -15,28 +22,44 @@ generateAllAlignments2FA( \item{aln_path}{Character. Path to alignment files. Default is 'here("data/rawdata_aln/")'} -\item{fa_outpath}{Character. Path to file. Master protein file with AccNum & lineages. -Default is 'here("data/rawdata_tsv/all_semiclean.txt")'} - -\item{lin_file}{Character. Path to the written fasta file. +\item{fa_outpath}{Character. Path to the written fasta file. Default is 'here("data/alns/")'.} +\item{lin_file}{Character. Path to file. Master protein file with AccNum & lineages. +Default is 'here("data/rawdata_tsv/all_semiclean.txt")'} + \item{reduced}{Boolean. If TRUE, the fasta file will contain only one sequence per lineage. Default is 'FALSE'.} } \description{ +Adding Leaves to an alignment file w/ accessions + +Adding Leaves to all alignment files w/ accessions & DAs? + +Adding Leaves to an alignment file w/ accessions + Adding Leaves to all alignment files w/ accessions & DAs? } \details{ +The alignment files would need two columns separated by spaces: 1. AccNum and 2. alignment. The protein homolog file should have AccNum, Species, Lineages. + The alignment files would need two columns separated by spaces: 1. AccNum and 2. alignment. The protein homolog file should have AccNum, Species, Lineages. } \note{ +Please refer to the source code if you have alternate + file formats and/or column names. + Please refer to the source code if you have alternate + file formats and/or column names. 
} \examples{ \dontrun{ generateAllAlignments2FA() } +\dontrun{ +generateAllAlignments2FA() +} +} +\author{ +Janani Ravi } \keyword{accnum,} \keyword{alignment,} diff --git a/man/generate_all_aln2fa.Rd b/man/generate_all_aln2fa.Rd deleted file mode 100644 index ad6b7136..00000000 --- a/man/generate_all_aln2fa.Rd +++ /dev/null @@ -1,48 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{generate_all_aln2fa} -\alias{generate_all_aln2fa} -\title{Adding Leaves to an alignment file w/ accessions} -\usage{ -generate_all_aln2fa( - aln_path = here("data/rawdata_aln/"), - fa_outpath = here("data/alns/"), - lin_file = here("data/rawdata_tsv/all_semiclean.txt"), - reduced = F -) -} -\arguments{ -\item{aln_path}{Character. Path to alignment files. -Default is 'here("data/rawdata_aln/")'} - -\item{fa_outpath}{Character. Path to the written fasta file. -Default is 'here("data/alns/")'.} - -\item{lin_file}{Character. Path to file. Master protein file with AccNum & lineages. -Default is 'here("data/rawdata_tsv/all_semiclean.txt")'} - -\item{reduced}{Boolean. If TRUE, the fasta file will contain only one sequence per lineage. -Default is 'FALSE'.} -} -\description{ -Adding Leaves to all alignment files w/ accessions & DAs? -} -\details{ -The alignment files would need two columns separated by spaces: 1. AccNum and 2. alignment. The protein homolog file should have AccNum, Species, Lineages. -} -\note{ -Please refer to the source code if you have alternate + file formats and/or column names. -} -\examples{ -\dontrun{ -generate_all_aln2fa() -} -} -\author{ -Janani Ravi -} -\keyword{accnum,} -\keyword{alignment,} -\keyword{leaves,} -\keyword{lineage,} -\keyword{species} diff --git a/man/getAccNumFromFA.Rd b/man/getAccNumFromFA.Rd new file mode 100644 index 00000000..d3ab8177 --- /dev/null +++ b/man/getAccNumFromFA.Rd @@ -0,0 +1,18 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R +\name{getAccNumFromFA} +\alias{getAccNumFromFA} +\title{getAccNumFromFA} +\usage{ +getAccNumFromFA(fasta_file) + +getAccNumFromFA(fasta_file) +} +\arguments{ +\item{fasta_file}{} +} +\description{ +getAccNumFromFA + +getAccNumFromFA +} diff --git a/man/get_proc_weights.Rd b/man/getProcessRuntimeWeights.Rd similarity index 64% rename from man/get_proc_weights.Rd rename to man/getProcessRuntimeWeights.Rd index 0f4beb57..de0e2ea6 100644 --- a/man/get_proc_weights.Rd +++ b/man/getProcessRuntimeWeights.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{get_proc_weights} -\alias{get_proc_weights} -\title{Quickly get the runtime weights for MolEvolvR backend processes} +\name{getProcessRuntimeWeights} +\alias{getProcessRuntimeWeights} +\title{getProcessRuntimeWeights} \usage{ -get_proc_weights(medians_yml_path = NULL) +getProcessRuntimeWeights(medians_yml_path = NULL) } \arguments{ \item{dir_job_results}{\link{chr} path to MolEvolvR job_results @@ -13,7 +13,7 @@ directory} \value{ \link{list} names: processes; values: median runtime (seconds) -example: get_proc_weights() +example: writeProcessRuntime2YML() } \description{ Quickly get the runtime weights for MolEvolvR backend processes diff --git a/man/find_top_acc.Rd b/man/getTopAccByLinDomArch.Rd similarity index 70% rename from man/find_top_acc.Rd rename to man/getTopAccByLinDomArch.Rd index 780cde11..b8571350 100644 --- a/man/find_top_acc.Rd +++ b/man/getTopAccByLinDomArch.Rd @@ -1,10 
+1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ipr2viz.R -\name{find_top_acc} -\alias{find_top_acc} -\title{Group by lineage + DA then take top 20} +\name{getTopAccByLinDomArch} +\alias{getTopAccByLinDomArch} +\title{getTopAccByLinDomArch} \usage{ -find_top_acc( +getTopAccByLinDomArch( infile_full, DA_col = "DomArch.Pfam", lin_col = "Lineage_short", diff --git a/man/get_accnums_from_fasta_file.Rd b/man/get_accnums_from_fasta_file.Rd deleted file mode 100644 index 84c163cc..00000000 --- a/man/get_accnums_from_fasta_file.Rd +++ /dev/null @@ -1,18 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R -\name{get_accnums_from_fasta_file} -\alias{get_accnums_from_fasta_file} -\title{Get accnums from fasta file} -\usage{ -get_accnums_from_fasta_file(fasta_file) - -get_accnums_from_fasta_file(fasta_file) -} -\arguments{ -\item{fasta_file}{} -} -\description{ -Get accnums from fasta file - -get_accnums_from_fasta_file -} diff --git a/man/mapAcc2Name.Rd b/man/mapAcc2Name.Rd index 0f5d447d..39ecb065 100644 --- a/man/mapAcc2Name.Rd +++ b/man/mapAcc2Name.Rd @@ -1,13 +1,15 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/CHANGED-pre-msa-tree.R +% Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R \name{mapAcc2Name} \alias{mapAcc2Name} -\title{Default renameFA() replacement function. Maps an accession number to its name} +\title{mapAcc2Name} \usage{ +mapAcc2Name(line, acc2name, acc_col = "AccNum", name_col = "Name") + mapAcc2Name(line, acc2name, acc_col = "AccNum", name_col = "Name") } \arguments{ -\item{line}{The line of a fasta file starting with '>'} +\item{line}{The line of a fasta file starting with '>'} \item{acc2name}{Data Table containing a column of accession numbers and a name column} @@ -18,4 +20,6 @@ are mapped to} } \description{ Default renameFA() replacement function. Maps an accession number to its name + +Default rename_fasta() replacement function.
Maps an accession number to its name } diff --git a/man/map_advanced_opts2procs.Rd b/man/mapAdvOption2Process.Rd similarity index 66% rename from man/map_advanced_opts2procs.Rd rename to man/mapAdvOption2Process.Rd index 631708b4..6a210a20 100644 --- a/man/map_advanced_opts2procs.Rd +++ b/man/mapAdvOption2Process.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{map_advanced_opts2procs} -\alias{map_advanced_opts2procs} -\title{Use MolEvolvR advanced options to get associated processes} +\name{mapAdvOption2Process} +\alias{mapAdvOption2Process} +\title{mapAdvOption2Process} \usage{ -map_advanced_opts2procs(advanced_opts) +mapAdvOption2Process(advanced_opts) } \arguments{ \item{advanced_opts}{character vector of MolEvolvR advanced options} @@ -15,7 +15,7 @@ the advanced options example: advanced_opts <- c("homology_search", "domain_architecture") -procs <- map_advanced_opts2procs(advanced_opts) +procs <- mapAdvOption2Process(advanced_opts) } \description{ Use MolEvolvR advanced options to get associated processes diff --git a/man/make_opts2procs.Rd b/man/mapOption2Process.Rd similarity index 58% rename from man/make_opts2procs.Rd rename to man/mapOption2Process.Rd index 07e208b2..9645617b 100644 --- a/man/make_opts2procs.Rd +++ b/man/mapOption2Process.Rd @@ -1,15 +1,15 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{make_opts2procs} -\alias{make_opts2procs} -\title{Construct list where names (MolEvolvR advanced options) point to processes} +\name{mapOption2Process} +\alias{mapOption2Process} +\title{mapOption2Process} \usage{ -make_opts2procs() +mapOption2Process() } \value{ list where names (MolEvolvR advanced options) point to processes -example: list_opts2procs <- make_opts2procs +example: list_opts2procs <- mapOption2Process } \description{ Construct list where names (MolEvolvR advanced options) point to processes diff --git a/man/map_acc2name.Rd b/man/map_acc2name.Rd deleted file mode 100644 index fcdb3023..00000000 --- a/man/map_acc2name.Rd +++ /dev/null @@ -1,21 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{map_acc2name} -\alias{map_acc2name} -\title{Default rename_fasta() replacement function. Maps an accession number to its name} -\usage{ -map_acc2name(line, acc2name, acc_col = "AccNum", name_col = "Name") -} -\arguments{ -\item{line}{he line of a fasta file starting with '>'} - -\item{acc2name}{Data Table containing a column of accession numbers and a name column} - -\item{acc_col}{Name of the column containing Accession numbers} - -\item{name_col}{Name of the column containing the names that the accession numbers -are mapped to} -} -\description{ -Default rename_fasta() replacement function. 
Maps an accession number to its name -} diff --git a/man/plotEstimatedWallTimes.Rd b/man/plotEstimatedWallTimes.Rd new file mode 100644 index 00000000..36b0ecd5 --- /dev/null +++ b/man/plotEstimatedWallTimes.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/assign_job_queue.R +\name{plotEstimatedWallTimes} +\alias{plotEstimatedWallTimes} +\title{plotEstimatedWallTimes} +\usage{ +plotEstimatedWallTimes() +} +\value{ +line plot object + +example: +p <- plotEstimatedWallTimes() +ggplot2::ggsave(filename = "/data/molevolvr_transfer/molevolvr_ +dev/molevol_scripts/docs/estimate_walltimes.png", plot = p) +} +\description{ +Plot the estimated runtimes for different advanced options and number +of inputs + +this function was just for fun; very, very messy code +} diff --git a/man/ipr2viz.Rd b/man/plotIPR2Viz.Rd similarity index 80% rename from man/ipr2viz.Rd rename to man/plotIPR2Viz.Rd index 79063497..7ed420c9 100644 --- a/man/ipr2viz.Rd +++ b/man/plotIPR2Viz.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ipr2viz.R -\name{ipr2viz} -\alias{ipr2viz} -\title{IPR2Viz} +\name{plotIPR2Viz} +\alias{plotIPR2Viz} +\title{plotIPR2Viz} \usage{ -ipr2viz( +plotIPR2Viz( infile_ipr = NULL, infile_full = NULL, accessions = c(), @@ -20,5 +20,5 @@ ipr2viz( \item{query}{} } \description{ -IPR2Viz +plotIPR2Viz } diff --git a/man/ipr2viz_web.Rd b/man/plotIPR2VizWeb.Rd similarity index 77% rename from man/ipr2viz_web.Rd rename to man/plotIPR2VizWeb.Rd index 896445bd..3b94a5a7 100644 --- a/man/ipr2viz_web.Rd +++ b/man/plotIPR2VizWeb.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ipr2viz.R -\name{ipr2viz_web} -\alias{ipr2viz_web} -\title{IPR2Viz Web} +\name{plotIPR2VizWeb} +\alias{plotIPR2VizWeb} +\title{plotIPR2VizWeb} \usage{ -ipr2viz_web( +plotIPR2VizWeb( infile_ipr, accessions, analysis = c("Pfam", "Phobius", "TMHMM", "Gene3D"), @@ -20,5 +20,5 @@ ipr2viz_web( \item{rows}{} } \description{ -IPR2Viz Web +plotIPR2VizWeb } diff --git a/man/plot_estimated_walltimes.Rd b/man/plot_estimated_walltimes.Rd deleted file mode 100644 index 3669e0e0..00000000 --- a/man/plot_estimated_walltimes.Rd +++ /dev/null @@ -1,19 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/assign_job_queue.R -\name{plot_estimated_walltimes} -\alias{plot_estimated_walltimes} -\title{Plot the estimated runtimes for different advanced options and number -of inputs} -\usage{ -plot_estimated_walltimes() -} -\value{ -line plot object - -example: -p <- plot_estimated_walltimes() -ggplot2::ggsave(filename = "/data/molevolvr_transfer/molevolvr_dev/molevol_scripts/docs/estimate_walltimes.png", plot = p) -} -\description{ -this function was just for fun; very, very messy code -} diff --git a/man/reverse_operon.Rd b/man/reverseOperonSeq.Rd similarity index 56% rename from man/reverse_operon.Rd rename to man/reverseOperonSeq.Rd index 270e2a62..d61ec5f2 100644 --- a/man/reverse_operon.Rd +++ b/man/reverseOperonSeq.Rd @@ -1,14 +1,14 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/reverse_operons.R -\name{reverse_operon} -\alias{reverse_operon} -\title{reverse_operon} +\name{reverseOperonSeq} +\alias{reverseOperonSeq} +\title{reverseOperonSeq} \usage{ -reverse_operon(prot) +reverseOperonSeq(prot) } \arguments{ \item{prot}{} } \description{ -reverse_operon +reverseOperonSeq } diff --git a/man/reveql.Rd b/man/straightenOperonSeq.Rd similarity index 
53% rename from man/reveql.Rd rename to man/straightenOperonSeq.Rd index 9dc2bcb8..fcd0c923 100644 --- a/man/reveql.Rd +++ b/man/straightenOperonSeq.Rd @@ -1,14 +1,14 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/reverse_operons.R -\name{reveql} -\alias{reveql} -\title{reveql} +\name{straightenOperonSeq} +\alias{straightenOperonSeq} +\title{straightenOperonSeq} \usage{ -reveql(prot) +straightenOperonSeq(prot) } \arguments{ \item{prot}{} } \description{ -reveql +straightenOperonSeq } diff --git a/man/summarizeDomArch.Rd b/man/summarizeDomArch.Rd deleted file mode 100644 index 11db1afa..00000000 --- a/man/summarizeDomArch.Rd +++ /dev/null @@ -1,22 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarize.R -\name{summarizeDomArch} -\alias{summarizeDomArch} -\title{summarizeDomArch} -\usage{ -summarizeDomArch(x) -} -\arguments{ -\item{x}{} -} -\value{ -Describe return, in detail -} -\description{ -Function to retrieve counts of how many lineages a DomArch appears in -} -\examples{ -\dontrun{ -summarizeDomArch() -} -} diff --git a/man/summarizeDomArch_ByLineage.Rd b/man/summarizeDomArch_ByLineage.Rd deleted file mode 100644 index cf5fac22..00000000 --- a/man/summarizeDomArch_ByLineage.Rd +++ /dev/null @@ -1,22 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarize.R -\name{summarizeDomArch_ByLineage} -\alias{summarizeDomArch_ByLineage} -\title{summarizeDomArch_ByLineage} -\usage{ -summarizeDomArch_ByLineage(x) -} -\arguments{ -\item{x}{} -} -\value{ -Describe return, in detail -} -\description{ -Function to summarize and retrieve counts by Domains & Domains+Lineage -} -\examples{ -\dontrun{ -summarizeDomArch_ByLineage() -} -} diff --git a/man/summarizeGenContext.Rd b/man/summarizeGenContext.Rd deleted file mode 100644 index 5a40811b..00000000 --- a/man/summarizeGenContext.Rd +++ /dev/null @@ -1,22 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarize.R -\name{summarizeGenContext} -\alias{summarizeGenContext} -\title{summarizeGenContext} -\usage{ -summarizeGenContext(x) -} -\arguments{ -\item{x}{} -} -\value{ -Describe return, in detail -} -\description{ -summarizeGenContext -} -\examples{ -\dontrun{ -summarizeGenContext() -} -} diff --git a/man/summarizeGenContext_ByDomArchLineage.Rd b/man/summarizeGenContext_ByDomArchLineage.Rd deleted file mode 100644 index 59e0376e..00000000 --- a/man/summarizeGenContext_ByDomArchLineage.Rd +++ /dev/null @@ -1,22 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarize.R -\name{summarizeGenContext_ByDomArchLineage} -\alias{summarizeGenContext_ByDomArchLineage} -\title{summarizeGenContext_ByDomArchLineage} -\usage{ -summarizeGenContext_ByDomArchLineage(x) -} -\arguments{ -\item{x}{} -} -\value{ -Define return, in detail -} -\description{ -summarizeGenContext_ByDomArchLineage -} -\examples{ -\dontrun{ -summarizeGenContext_ByDomArchLineage -} -} diff --git a/man/summarizeGenContext_ByLineage.Rd b/man/summarizeGenContext_ByLineage.Rd deleted file mode 100644 index 932fe6a7..00000000 --- a/man/summarizeGenContext_ByLineage.Rd +++ /dev/null @@ -1,22 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarize.R -\name{summarizeGenContext_ByLineage} -\alias{summarizeGenContext_ByLineage} -\title{summarizeGenContext_ByLineage} -\usage{ -summarizeGenContext_ByLineage(x) -} -\arguments{ -\item{x}{} -} -\value{ -Describe return, in 
detail -} -\description{ -summarizeGenContext_ByLineage -} -\examples{ -\dontrun{ -summarizeGenContext_ByLineage() -} -} diff --git a/man/theme_genes2.Rd b/man/themeGenes2.Rd similarity index 55% rename from man/theme_genes2.Rd rename to man/themeGenes2.Rd index 29f79673..64ae9273 100644 --- a/man/theme_genes2.Rd +++ b/man/themeGenes2.Rd @@ -1,11 +1,11 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ipr2viz.R -\name{theme_genes2} -\alias{theme_genes2} -\title{Theme Genes2} +\name{themeGenes2} +\alias{themeGenes2} +\title{themeGenes2} \usage{ -theme_genes2() +themeGenes2() } \description{ -Theme Genes2 +themeGenes2 } diff --git a/man/to_titlecase.Rd b/man/to_titlecase.Rd deleted file mode 100644 index 45139d3b..00000000 --- a/man/to_titlecase.Rd +++ /dev/null @@ -1,25 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pre-msa-tree.R -\name{to_titlecase} -\alias{to_titlecase} -\alias{totitle,} -\alias{to_title} -\title{To Titlecase} -\usage{ -to_titlecase(text, delimitter) -} -\arguments{ -\item{x}{Character vector.} - -\item{y}{Delimitter. Default is space (" ").} -} -\description{ -Translate string to Title Case w/ delimitter. -Changing case to 'Title Case' -} -\seealso{ -chartr, toupper, and tolower. -} -\author{ -Andrie, Janani Ravi -} diff --git a/man/words2WordCounts.Rd b/man/words2WordCounts.Rd new file mode 100644 index 00000000..370dec7f --- /dev/null +++ b/man/words2WordCounts.Rd @@ -0,0 +1,32 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarize.R +\name{words2WordCounts} +\alias{words2WordCounts} +\title{words2WordCounts} +\usage{ +words2WordCounts(string) +} +\arguments{ +\item{string}{A character string containing the elements (words) to count. 
+This would typically be a space-delimited string representing domain +architectures or genomic contexts.} +} +\value{ +A tibble (tbl_df) with two columns: +\describe{ +\item{\code{words}}{A column containing the individual words +(domains or domain architectures).} +\item{\code{freq}}{A column containing the frequency counts for each word.} +} +} +\description{ +Get word counts (wc) \link{DOMAINS (DA) or DOMAIN ARCHITECTURES (GC)} +} +\examples{ +\dontrun{ +tibble::tibble(DomArch = c("aaa+bbb", "a+b", "b+c", "b-c")) |> + elements2Words() |> + words2WordCounts() +} + +} diff --git a/man/write.MsaAAMultipleAlignment.Rd b/man/writeMSA_AA2FA.Rd similarity index 65% rename from man/write.MsaAAMultipleAlignment.Rd rename to man/writeMSA_AA2FA.Rd index 17a05f50..a6798469 100644 --- a/man/write.MsaAAMultipleAlignment.Rd +++ b/man/writeMSA_AA2FA.Rd @@ -1,12 +1,12 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/CHANGED-pre-msa-tree.R, R/pre-msa-tree.R -\name{write.MsaAAMultipleAlignment} -\alias{write.MsaAAMultipleAlignment} -\title{Write MsaAAMultpleAlignment Objects as algined fasta sequence} +\name{writeMSA_AA2FA} +\alias{writeMSA_AA2FA} +\title{writeMSA_AA2FA} \usage{ -write.MsaAAMultipleAlignment(alignment, outpath) +writeMSA_AA2FA(alignment, outpath) -write.MsaAAMultipleAlignment(alignment, outpath) +writeMSA_AA2FA(alignment, outpath) } \arguments{ \item{alignment}{MsaAAMultipleAlignment object to be written as a fasta} @@ -17,7 +17,7 @@ write.MsaAAMultipleAlignment(alignment, outpath) MsaAAMultipleAlignment Objects are generated from calls to msaClustalOmega and msaMuscle from the 'msa' package -Write MsaAAMultpleAlignment Objects as algined fasta sequence +Write MsaAAMultipleAlignment Objects as aligned fasta sequence MsaAAMultipleAlignment Objects are generated from calls to msaClustalOmega and msaMuscle from the 'msa' package } diff --git a/man/write_proc_medians_table.Rd b/man/writeProcessRuntime2TSV.Rd similarity index 67% rename from man/write_proc_medians_table.Rd rename to man/writeProcessRuntime2TSV.Rd index 2ae7a97b..0e045a5c 100644 --- a/man/write_proc_medians_table.Rd +++ b/man/writeProcessRuntime2TSV.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{write_proc_medians_table} -\alias{write_proc_medians_table} -\title{Write a table of 2 columns: 1) process and 2) median seconds} +\name{writeProcessRuntime2TSV} +\alias{writeProcessRuntime2TSV} +\title{writeProcessRuntime2TSV} \usage{ -write_proc_medians_table(dir_job_results, filepath) +writeProcessRuntime2TSV(dir_job_results, filepath) } \arguments{ \item{dir_job_results}{\link{chr} path to MolEvolvR job_results} @@ -14,7 +14,7 @@ write_proc_medians_table(dir_job_results, filepath) \value{ \link{tbl_df} 2 columns: 1) process and 2) median seconds -example: write_proc_medians_table( +example: writeProcessRuntime2TSV( "/data/scratch/janani/molevolvr_out/", "/data/scratch/janani/molevolvr_out/log_tbl.tsv" ) diff --git a/man/write_proc_medians_yml.Rd b/man/writeProcessRuntime2YML.Rd similarity index 50% rename from man/write_proc_medians_yml.Rd rename to man/writeProcessRuntime2YML.Rd index a3d8ee5f..5e0a05a4 100644 --- a/man/write_proc_medians_yml.Rd +++ b/man/writeProcessRuntime2YML.Rd @@ -1,25 +1,28 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/assign_job_queue.R -\name{write_proc_medians_yml} -\alias{write_proc_medians_yml} -\title{Compute median process runtimes, then write a YAML list of
the processes and -their median runtimes in seconds to the path specified by 'filepath'.} +\name{writeProcessRuntime2YML} +\alias{writeProcessRuntime2YML} +\title{writeProcessRuntime2YML} \usage{ -write_proc_medians_yml(dir_job_results, filepath = NULL) +writeProcessRuntime2YML(dir_job_results, filepath = NULL) } \arguments{ \item{dir_job_results}{\link{chr} path to MolEvolvR job_results directory} -\item{filepath}{\link{chr} path to save YAML file; if NULL, uses ./molevol_scripts/log_data/job_proc_weights.yml} +\item{filepath}{\link{chr} path to save YAML file; if NULL, +uses ./molevol_scripts/log_data/job_proc_weights.yml} } \description{ +Compute median process runtimes, then write a YAML list of the processes and +their median runtimes in seconds to the path specified by 'filepath'. + The default value of filepath is the value of the env var -MOLEVOLVR_PROC_WEIGHTS, which get_proc_weights() also uses as its default +MOLEVOLVR_PROC_WEIGHTS, which getProcessRuntimeWeights() also uses as its default read location. } \examples{ \dontrun{ -write_proc_medians_yml( +writeProcessRuntime2YML( "/data/scratch/janani/molevolvr_out/", "/data/scratch/janani/molevolvr_out/log_tbl.yml" )
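# Illustrative follow-up, not part of the original example: once the YAML of
# median process runtimes has been written, it can be read back with
# getProcessRuntimeWeights(), which by default resolves the same file through
# the MOLEVOLVR_PROC_WEIGHTS environment variable (see the documentation above).
proc_weights <- getProcessRuntimeWeights(
  medians_yml_path = "/data/scratch/janani/molevolvr_out/log_tbl.yml"
)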