kalis v2 #4

Merged 43 commits on Nov 13, 2024
43 commits
61570f0
Start work on v2
louisaslett Sep 23, 2024
55c3672
Introducing O(n) clustering algorithm (blobby) and C-core for clade c…
ryanchrist Sep 23, 2024
e0df97f
Introducing R interface for Clade/Sprig calling and Clade Matrix (gen…
ryanchrist Sep 23, 2024
2805f31
Introducing optimal checkpointing routines and iterator interface for…
ryanchrist Sep 23, 2024
ca8fc82
Introducing efficient algorithm for parallelized implicit matrix trac…
ryanchrist Sep 23, 2024
add21d5
Introducing tests for CladeMat
ryanchrist Sep 23, 2024
9a6a9e6
Adding vignette for iterating over target loci sequentially
ryanchrist Sep 23, 2024
a23711f
Registrations required for previous C commits
ryanchrist Sep 23, 2024
bd5eb3c
Reducing memory footprint of loading haplotypes
ryanchrist Sep 23, 2024
073ec63
Changing rhdf5 h5write call to h5writeDataset based on some prior fai…
ryanchrist Sep 23, 2024
d7e3e6f
Minor bug fixes to Probs
ryanchrist Sep 23, 2024
7f9252b
roxygenize all updates
ryanchrist Sep 23, 2024
9749c4b
reintroducing inputting haplotypes to kalis vignette
ryanchrist Sep 23, 2024
2fbd289
iterating ready to go!
ryanchrist Sep 23, 2024
1338a48
Fixes for CHECK error:
louisaslett Sep 25, 2024
31c64cc
Fixes for CHECK errors:
louisaslett Sep 25, 2024
718f617
Partial fix for CHECK errors:
louisaslett Sep 25, 2024
2b0341b
Fix N checking Rd line widths ...
louisaslett Sep 25, 2024
5d325b5
Change to markdown \code and \link's and also fix W checking Rd cros…
louisaslett Sep 25, 2024
2ea7fdd
Update all doc \code and \link to markdown versions, including spotti…
louisaslett Sep 25, 2024
6550df1
Correct indentation of YAML and markdown blocks
louisaslett Sep 25, 2024
8fa8200
Turn off tests on package check for speed (kalis tests take ~30 minutes)
louisaslett Sep 26, 2024
d8f78a0
Eliminate compiler warnings about printing size_t's when int expected…
louisaslett Sep 26, 2024
43d2d4c
Avoid compiler warnings about using undefined pointers
louisaslett Sep 26, 2024
7beab0a
Eliminate apparently redundant variable causing compiler warnings
louisaslett Sep 26, 2024
667c798
Update configure script to have strict POSIX shell support
louisaslett Sep 26, 2024
e641ceb
Eliminate dangling a.out.dSYM after configure run on Macs
louisaslett Sep 26, 2024
ff8ee3e
Add cleanup script to remove temporary files created during configure…
louisaslett Sep 30, 2024
e7df962
Remove general checkpointing solution into separate dev branch and el…
louisaslett Oct 1, 2024
f595d72
Remove incorrectly tracked vignette files from version control
louisaslett Oct 1, 2024
6faf7a7
Eliminate some dev testing code from MakeUpdateCache (still tracked i…
louisaslett Oct 1, 2024
20794ef
Include vignette building in installation instructions
louisaslett Oct 1, 2024
1c9543a
First batch of documentation fixes, some TODO items remain in these.
louisaslett Oct 2, 2024
4d751e3
Update to description file with ORCIDs, first kalis paper, and bug re…
louisaslett Oct 2, 2024
e4243e2
Second batch of documentation fixes, many TODOs remain.
louisaslett Oct 2, 2024
a6015ba
Remove exit() call from CladeMat() C function.
louisaslett Oct 2, 2024
047287b
Allocation checks not required for R_alloc (see Writing R Extensions)
louisaslett Oct 2, 2024
cc97b94
documenting new kalis v2 functions
ryanchrist Nov 12, 2024
4cde13f
Tweaks to added documentation
louisaslett Nov 13, 2024
1ad8095
Fix pkgdown maths
louisaslett Nov 13, 2024
a9ed50a
Add kalis paper to all v1 functions and tidy other references in v1 f…
louisaslett Nov 13, 2024
13dbdc4
Documentation and exported function review by Ryan (pushed by me)
louisaslett Nov 13, 2024
1b02952
Final documentation sweep for correct references, and other minor tweaks
louisaslett Nov 13, 2024
34 changes: 19 additions & 15 deletions DESCRIPTION
@@ -1,31 +1,34 @@
Package: kalis
Type: Package
Title: High Performance Li & Stephens Local Ancestry Inference
Version: 1.0.0
Version: 2.0.0
Authors@R: c(person("Louis", "Aslett", role = c("aut", "cre"),
email = "[email protected]"),
email = "[email protected]",
comment = c(ORCID = "0000-0003-2211-233X")),
person("Ryan", "Christ", role = "aut",
email = "[email protected]"))
email = "[email protected]",
comment = c(ORCID = "0000-0002-2049-3389")))
Author: Louis Aslett [aut, cre],
Ryan Christ [aut]
Maintainer: Louis Aslett <[email protected]>
Description: kalis provides a high performance implementation of the Li &
Stephens model <https://www.ncbi.nlm.nih.gov/pubmed/14704198> for local
ancestry inference (local referring to a region of the genome). For a set of N
phased haplotypes, kalis computes the posterior marginal probability of each
haplotype copying every other haplotype by running N hidden Markov models in
parallel. This yields an N x N distance matrix that summarizes the recent
local ancestry at each variant of interest. The package provides functionality
for specifying a recombination map, site-specific mutation rates, and
differing prior copying probabilities for each recipient haplotype. Extensive
use is made of low level threading and CPU vector instructions.
Description: kalis <doi:10.1186/s12859-024-05688-8> provides a high performance
implementation of the Li & Stephens model <doi:10.1093/genetics/165.4.2213>
for local ancestry inference (local referring to a region of the genome). For
a set of N phased haplotypes, kalis computes the posterior marginal
probability of each haplotype copying every other haplotype by running N
hidden Markov models in parallel. This yields an N x N distance matrix that
summarizes the recent local ancestry at each variant of interest. The package
provides functionality for specifying a recombination map, site-specific
mutation rates, and differing prior copying probabilities for each recipient
haplotype. Extensive use is made of low level threading and CPU vector
instructions.
License: GPL (>= 3)
BugReports: https://github.com/louisaslett/kalis/issues
URL: https://kalis.louisaslett.com/, https://github.com/louisaslett/kalis
LazyData: TRUE
Depends: R (>= 3.5.0)
Imports:
utils,
methods,
stats,
parallel,
dplyr,
@@ -47,7 +50,8 @@ Suggests:
rmarkdown,
fastcluster,
lattice,
testthat (>= 3.0.0)
testthat (>= 3.0.0),
data.table
VignetteBuilder: knitr
Encoding: UTF-8
Config/testthat/edition: 3
18 changes: 18 additions & 0 deletions NAMESPACE
@@ -1,31 +1,49 @@
# Generated by roxygen2: do not edit by hand

S3method(plot,kalisDistanceMatrix)
S3method(plot,kalisIterator)
S3method(print,kalisBackwardTable)
S3method(print,kalisCheckpointTable)
S3method(print,kalisForwardTable)
S3method(print,kalisIterator)
S3method(print,kalisParameters)
S3method(targets,kalisIterator)
export(Backward)
export(CacheHaplotypes)
export(CacheSummary)
export(CalcCheckpointTables)
export(CalcRho)
export(CalcTraces)
export(CladeMat)
export(ClearHaplotypeCache)
export(CopyTable)
export(CreateForwardTableCache)
export(DistMat)
export(FillTableCache)
export(Forward)
export(ForwardIterator)
export(ForwardUsingTableCache)
export(L)
export(MakeBackwardTable)
export(MakeForwardTable)
export(N)
export(Parameters)
export(PostProbs)
export(PruneCladeMat)
export(QueryCache)
export(ReadHaplotypes)
export(ResetTable)
export(Sprigs)
export(WriteHaplotypes)
import(checkmate)
import(dplyr)
importFrom(digest,digest)
importFrom(glue,glue)
importFrom(glue,glue_collapse)
importFrom(graphics,axis)
importFrom(prettyunits,pretty_bytes)
importFrom(rlang,duplicate)
importFrom(stats,sd)
importFrom(utils,getFromNamespace)
importFrom(utils,tail)
useDynLib(kalis, .registration = TRUE, .fixes = "CCall_")
11 changes: 10 additions & 1 deletion R/CacheHaplotypes.R
@@ -36,7 +36,7 @@ assign("L", NA, envir = pkgVars) # must be integer
#'
#' (num rows)x(num cols) = (num variants)x(num haplotypes).
#'
#' It is fine to delete this matrix from R after calling \code{CacheHaplotypes}.
#' It is fine to delete this matrix from R after calling [CacheHaplotypes()].
#'
#'
#' **HDF5 format**
@@ -48,6 +48,9 @@ assign("L", NA, envir = pkgVars) # must be integer
#'
#'
#'
#' @references
#' Aslett, L.J.M. and Christ, R.R. (2024) "kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R", *BMC Bioinformatics*, **25**(1). Available at: \doi{10.1186/s12859-024-05688-8}.
#'
#' @param haps can be the name of a file from which the haplotypes are to be read, or can be an R matrix containing only 0/1s.
#' See Details section for supported file types.
#' @param loci.idx an optional vector of indices specifying the variants to load into the cache, indexed from 1.
@@ -219,6 +222,9 @@ CacheHaplotypes.err <- function(err) {
#' To achieve higher performance, kalis internally represents haplotypes in an efficient raw binary format in memory which cannot be directly viewed or manipulated in R.
#' This function enables you to copy whole or partial views of haplotypes/variants out of this low-level format and into a standard R matrix of 0's and 1's.
#'
#' @references
#' Aslett, L.J.M. and Christ, R.R. (2024) "kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R", *BMC Bioinformatics*, **25**(1). Available at: \doi{10.1186/s12859-024-05688-8}.
#'
#' @param loci.idx which variants to retrieve from the cache, specified as a (vector) index.
#' This enables specifying variants by offset in the order they were loaded into the cache (from 1 to the number of variants).
#' @param hap.idx which haplotypes to retrieve from the cache, specified as a (vector) index.
@@ -295,6 +301,9 @@ QueryCache <- function(loci.idx = NULL, hap.idx = NULL) {
#' In particular, this cache sits outside R's memory management and will never be garbage collected (unless R is quit or the package is unloaded).
#' Therefore, this function is provided to enable freeing the memory used by this cache.
#'
#' @references
#' Aslett, L.J.M. and Christ, R.R. (2024) "kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R", *BMC Bioinformatics*, **25**(1). Available at: \doi{10.1186/s12859-024-05688-8}.
#'
#' @return Nothing is returned.
#'
#' @seealso [CacheHaplotypes()] to create a haplotype cache;
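The cache lifecycle documented in this file can be sketched end to end as follows. This is a minimal illustration rather than an excerpt from the package: it only uses functions exported in the NAMESPACE above, the toy matrix `haps` and the variable names `dims` and `view` are made up for the example, and the whole block is skipped when kalis is not installed.

```r
# Sketch only: runs if kalis is installed, otherwise does nothing.
if (requireNamespace("kalis", quietly = TRUE)) {
  library(kalis)

  # Toy 0/1 matrix (illustrative data): rows are variants, columns are haplotypes
  set.seed(42)
  haps <- matrix(sample(0:1, 20 * 8, replace = TRUE), nrow = 20, ncol = 8)

  CacheHaplotypes(haps)   # copy the haplotypes into the package's internal cache
  rm(haps)                # the docs state the R copy may be deleted afterwards

  CacheSummary()          # prints the current state of the cache
  dims <- c(L(), N())     # number of variants, number of haplotypes

  # Copy a partial view back out of the cache as an ordinary R matrix
  view <- QueryCache(loci.idx = 1:5, hap.idx = 1:3)

  ClearHaplotypeCache()   # free the memory that sits outside R's garbage collector
}
```

Because the cache lives outside R's memory management, the final `ClearHaplotypeCache()` call is the only way shown here to release it short of quitting R or unloading the package.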
6 changes: 4 additions & 2 deletions R/CacheHaplotypes_hdf5.R
@@ -112,6 +112,7 @@ CacheHaplotypes.hdf5.hdf5r <- function(hdf5.file,
else
res <- t(h5.haps[hap.idx[current.step:upto],loci.idx])
current.step <<- upto + 1
storage.mode(res) <- "integer"
res
}
}
@@ -201,10 +202,11 @@ CacheHaplotypes.hdf5.rhdf5 <- function(hdf5.file,
}
upto <- min(current.step + step.size - 1, N)
if(!transpose)
res <- matrix(as.integer(rhdf5::h5read(hdf5.file, haps.path, index = list(loci.idx, hap.idx[current.step:upto]))), nrow = length(loci.idx))
res <- rhdf5::h5read(hdf5.file, haps.path, index = list(loci.idx, hap.idx[current.step:upto]))
else
res <- t(matrix(as.integer(rhdf5::h5read(hdf5.file, haps.path, index = list(hap.idx[current.step:upto], loci.idx))), ncol = length(loci.idx)))
res <- t(rhdf5::h5read(hdf5.file, haps.path, index = list(hap.idx[current.step:upto], loci.idx)))
current.step <<- upto + 1
storage.mode(res) <- "integer"
res
}
}
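The change in this file swaps `matrix(as.integer(...))` for an in-place `storage.mode(res) <- "integer"`. A small base R sketch, using a made-up double matrix `x` in place of a real HDF5 read, suggests the two patterns give the same result. Whether the new form actually lowers peak memory depends on R's copy semantics, which the diff does not quantify.

```r
# A double-typed 0/1 matrix standing in for the result of an HDF5 read
# (illustrative data; variants in rows, haplotypes in columns)
x <- matrix(c(0, 1, 1, 0, 1, 0), nrow = 3, ncol = 2)

# Old pattern: as.integer() drops the dim attribute, so the matrix is rebuilt
old <- matrix(as.integer(x), nrow = nrow(x))

# New pattern: change the storage mode and keep the existing dims
new <- x
storage.mode(new) <- "integer"

# Both are integer matrices with the same dimensions and values
same <- identical(old, new)
```

The new pattern also removes the need to pass `nrow` or `ncol` by hand, so the transposed and non-transposed branches can share the same conversion line.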
12 changes: 8 additions & 4 deletions R/CacheSummary.R
@@ -1,12 +1,15 @@
#' Retrieve information about the haplotype cache
#'
#' @references
#' Aslett, L.J.M. and Christ, R.R. (2024) "kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R", *BMC Bioinformatics*, **25**(1). Available at: \doi{10.1186/s12859-024-05688-8}.
#'
#' @return
#' \code{CacheSummary()} prints information about the current state of the kalis cache.
#' Also invisibly returns a vector giving the dimensions of the cached haplotype data (num variants, num haplotypes), or \code{NULL} if the cache is empty.
#' `CacheSummary()` prints information about the current state of the kalis cache.
#' Also invisibly returns a vector giving the dimensions of the cached haplotype data (num variants, num haplotypes), or `NULL` if the cache is empty.
#'
#' \code{N()} returns the number of haplotypes currently in the kalis cache, or \code{NULL} if the cache is empty.
#' `N()` returns the number of haplotypes currently in the kalis cache, or `NULL` if the cache is empty.
#'
#' \code{L()} returns the number of variants currently in the kalis cache, or \code{NULL} if the cache is empty.
#' `L()` returns the number of variants currently in the kalis cache, or `NULL` if the cache is empty.
#'
#' @examples
#' # First fill the cache with the toy data included in the package
@@ -24,6 +27,7 @@
#' N()
#' L()
#'
#' @importFrom prettyunits pretty_bytes
#' @export
CacheSummary <- function() {
N <- get("N", envir = pkgVars)
41 changes: 41 additions & 0 deletions R/CalcTraces.R
@@ -0,0 +1,41 @@
#' Fast Calculation of Matrix Trace and Hilbert Schmidt Norm
#'
#' Provides multithreaded calculation of trace and Hilbert Schmidt Norm of a matrix \eqn{PMP} (where \eqn{P} is a projection matrix and \eqn{M} is real symmetric) without explicitly forming \eqn{PMP}.
#'
#' \eqn{P} here is assumed to have the form \eqn{I-QQ'} for some matrix \eqn{Q} with orthonormal columns.
#'
#' @references
#' Christ, R.R., Wang, X., Aslett, L.J.M., Steinsaltz, D. and Hall, I. (2024) "Clade Distillation for Genome-wide Association Studies", bioRxiv 2024.09.30.615852. Available at: \doi{10.1101/2024.09.30.615852}.
#'
#' @param M
#' a real symmetric R matrix
#' @param tX
#' `t((Q %*% (J%*%Q)) - (M %*% Q))`
#' @param tQ
#' `t(Q)`
#' @param J
#' `crossprod(Q, M)`
#' @param from_recipient
#' haplotype index at which to start trace calculation --- useful for distributed computation (experimental feature, more documentation to come<!-- TODO -->)
#' @param nthreads
#' the number of CPU cores to use.
#' By default uses the `parallel` package to detect the number of physical cores.
#'
#' @return
#' A list containing three elements:
#'
#' \describe{
#' \item{`trace`}{the trace, \eqn{\mathrm{tr}(PMP)};}
#' \item{`hsnorm2`}{the *squared* Hilbert Schmidt Norm of \eqn{PMP}, \eqn{\mathrm{tr}((PMP)'PMP)};}
#' \item{`diag`}{the diagonal of \eqn{PMP}.}
#' }
#'
#' @examples
#' # TODO
#'
#' @export
CalcTraces <- function(M, tX, tQ, J,
from_recipient = 1L,
nthreads = min(parallel::detectCores(logical = FALSE), ncol(M))) {
.Call(CCall_CalcTraces, M, tX, tQ, J, from_recipient, nthreads)
}
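The parameter definitions above fit together as follows. With `J = Q'M` and `X = Q(JQ) - MQ`, expanding `P = I - QQ'` gives `PMP = M - QJ + XQ'`, so every row of `PMP` can be produced from `M`, `tX`, `tQ` and `J` without forming `P` or `PMP`. The base R sketch below is a single-threaded reference check on made-up data, not the package's C implementation. Apart from the four documented inputs, all names are illustrative.

```r
set.seed(1)
n <- 6; k <- 2
A <- matrix(rnorm(n * n), n, n)
M <- (A + t(A)) / 2                         # real symmetric matrix
Q <- qr.Q(qr(matrix(rnorm(n * k), n, k)))   # n x k with orthonormal columns

# Inputs as described in the CalcTraces() parameter documentation
J  <- crossprod(Q, M)                       # k x n
tX <- t((Q %*% (J %*% Q)) - (M %*% Q))      # k x n
tQ <- t(Q)                                  # k x n

# Row i of PMP is M[i, ] - Q[i, ] J + X[i, ] Q'; accumulate one row at a time
tr <- 0; hs2 <- 0; d <- numeric(n)
for (i in seq_len(n)) {
  row_i <- M[i, ] - drop(crossprod(tQ[, i], J)) + drop(crossprod(tX[, i], tQ))
  d[i] <- row_i[i]             # diagonal entry of PMP
  tr   <- tr + row_i[i]        # running trace
  hs2  <- hs2 + sum(row_i^2)   # running squared Hilbert Schmidt norm
}
implicit <- list(trace = tr, hsnorm2 = hs2, diag = d)

# Explicit reference: form P and PMP directly (only feasible for small n)
P   <- diag(n) - tcrossprod(Q)
PMP <- P %*% M %*% P
explicit <- list(trace = sum(diag(PMP)), hsnorm2 = sum(PMP^2), diag = diag(PMP))

agree <- isTRUE(all.equal(implicit, explicit))
```

Because each row is independent, a loop of this shape can be split across threads or started at a later row, which is one plausible reading of the `nthreads` and `from_recipient` arguments.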
83 changes: 83 additions & 0 deletions R/CladeMat.R
@@ -0,0 +1,83 @@
#' Fast clade matrix construction
#'
#' Constructs a clade matrix using forward and backward tables.
#' The clade matrix captures genetic relatedness information in the distances from the Li & Stephens model that are not captured in the called clades.
#'
#' `CladeMat()` uses the forward and backward tables to construct the corresponding clade matrix which can then be tested, for example using a standard quadratic form score statistic.
#'
#' @references
#' Christ, R.R., Wang, X., Aslett, L.J.M., Steinsaltz, D. and Hall, I. (2024) "Clade Distillation for Genome-wide Association Studies", bioRxiv 2024.09.30.615852. Available at: \doi{10.1101/2024.09.30.615852}.
#'
#' @param fwd
#' a `kalisForwardTable` object, as returned by [MakeForwardTable()] and propagated to a target variant by [Forward()].
#' This table must be at the same variant location as argument `bck`.
#' @param bck
#' a `kalisBackwardTable` object, as returned by [MakeBackwardTable()] and propagated to a target variant by [Backward()].
#' This table must be at the same variant location as argument `fwd`.
#' @param M
#' a matrix with half the number of rows and columns as the corresponding forward/backward tables.
#' This matrix is overwritten in place with the clade matrix result for performance reasons.
#' @param unit.dist
#' the change in distance that is expected to correspond to a single mutation (typically \eqn{-\log(\mu)} for the LS model)
#' @param thresh
#' a regularization parameter: differences of distances must exceed this threshold (in `unit.dist` units) in order to cause the introduction of a probabilistic clade.
#' Defaults to `0.2`.
#' @param max1var
#' a logical regularization parameter.
#' When `TRUE`, differences in distances exceeding 1 `unit.dist` are set to 1 (so that any edge in the latent ancestral tree carrying multiple mutations is treated as if it carried only one).
#' @param nthreads
#' the number of CPU cores to use.
#' By default uses the `parallel` package to detect the number of physical cores.
#'
#' @return
#' A list whose first element contains a list of tied nearest neighbours (one for each haplotype).
#' Other elements of the returned list are for internal use by [PruneCladeMat()] to allow for efficient removal of singletons and sprigs.
#'
#' @examples
#' # TODO
#'
#'
#' @export CladeMat
CladeMat <- function(fwd, bck, M, unit.dist, thresh = 0.2, max1var = FALSE,
nthreads = min(parallel::detectCores(logical = FALSE), fwd$to_recipient-fwd$from_recipient+1)){

# input checks
#########################
input_checks_for_probs_and_dist_mat(fwd,bck)

if(nrow(fwd$alpha)%%2 !=0 || ncol(fwd$alpha)%%2 !=0 || nrow(bck$beta)%%2 !=0 || ncol(bck$beta)%%2 !=0 ){
stop("fwd and bck must both have an even number of recipient haplotypes and an even number of donor haplotypes")
}

if(!is.matrix(M) || !is.double(M) || nrow(M) != nrow(fwd$alpha)/2 || ncol(M) != ncol(fwd$alpha)/2){
stop("M must be a matrix of doubles with nrow(fwd$alpha)/2 rows and ncol(fwd$alpha)/2 columns")}

if(!is.atomic(unit.dist) || length(unit.dist)!=1L || !is.finite(unit.dist) || unit.dist <= 0){
stop("unit.dist must be a number greater than 0")}

if(is.integer(unit.dist)){
unit.dist <- as.double(unit.dist)
} else {
if(!is.double(unit.dist)){stop("unit.dist must be a number greater than 0")}}

if(!is.atomic(thresh) || length(thresh)!=1L || !is.finite(thresh) || thresh < 0 || thresh > 1){
stop("thresh must be a number in [0,1]")}

if(is.integer(thresh)){
thresh <- as.double(thresh)
} else {
if(!is.double(thresh)){stop("thresh must be a number in [0,1]")}}

if(!is.logical(max1var) || length(max1var) != 1L || is.na(max1var)){
stop("max1var must be TRUE or FALSE")}

nthreads <- as.integer(nthreads)
if(!is.integer(nthreads) || length(nthreads)!=1L || !is.finite(nthreads) || nthreads <= 0){
stop("nthreads must be a positive integer")}

if(nthreads > ncol(fwd$alpha)/2){
stop("nthreads cannot be greater than the number of recipient haplotypes divided by 2.")
}

invisible(.Call(CCall_CladeMat, fwd, bck, M, unit.dist, thresh, max1var, nthreads))
}
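The `unit.dist` and `thresh` checks in `CladeMat()` follow the same pattern: a finite scalar within a range, coerced to double. One way that pattern could be factored out is sketched below. The helper `check_scalar_number()` is hypothetical and not part of kalis. It mirrors the inline checks only roughly, for example by using `is.numeric()` in place of the separate integer and double branches.

```r
# Hypothetical helper (not part of kalis): validate a finite scalar number in
# the given range and return it as a double.
check_scalar_number <- function(x, name, lower = -Inf, upper = Inf,
                                lower_open = FALSE) {
  ok <- is.atomic(x) && length(x) == 1L && is.numeric(x) && is.finite(x) &&
    (if (lower_open) x > lower else x >= lower) && x <= upper
  if (!ok) {
    stop(sprintf("%s is not a valid value for this argument", name), call. = FALSE)
  }
  as.double(x)
}

# unit.dist must be strictly positive; thresh must lie in [0, 1]
unit.dist <- check_scalar_number(2L, "unit.dist", lower = 0, lower_open = TRUE)
thresh    <- check_scalar_number(0.2, "thresh", lower = 0, upper = 1)

# Invalid input raises an error, captured here so the script keeps running
bad <- try(check_scalar_number(NA_real_, "thresh", lower = 0, upper = 1),
           silent = TRUE)
```

The `&&` chain short-circuits, so `is.finite()` and the range comparisons are never evaluated on non-numeric or zero-length input.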