Skip to content

bioDS/Pint

Folders and files

NameName
Last commit message
Last commit date

Latest commit

8bcc0a6 · Mar 26, 2024
Apr 13, 2022
Oct 4, 2022
Mar 26, 2024
Aug 15, 2023
Aug 8, 2022
Apr 8, 2022
Jul 25, 2019
Jun 23, 2021
Jun 10, 2020
Jun 10, 2020
Dec 3, 2020
Jun 10, 2020
Aug 14, 2023
Jul 23, 2021
Jun 10, 2020
Aug 14, 2023
Jul 31, 2019
Feb 28, 2021
Aug 15, 2019
Sep 1, 2019
Sep 1, 2019
Sep 1, 2019
Oct 4, 2022
Sep 1, 2019
Apr 8, 2022
Jul 23, 2021
Apr 5, 2022
Aug 7, 2019
Aug 15, 2019
Aug 17, 2020
Aug 3, 2020
Aug 3, 2020
Apr 2, 2019
Apr 2, 2019
Aug 4, 2020
Oct 4, 2022
Apr 2, 2019
Apr 2, 2019
Apr 12, 2019
Apr 3, 2019
Apr 2, 2019
Sep 16, 2019
Apr 2, 2019
Apr 2, 2019
Sep 16, 2019
Apr 3, 2019
Jul 11, 2019

Repository files navigation

tests cod cov

R package

This repo is an installable R package, you can install a locally cloned copy with R CMD INSTALL ./cloned-location.

Alternatively, install directly from github with:

install.packages("https://github.com/bioDS/Pint/archive/refs/heads/main.tar.gz", repos=NULL)

This library provides a single function that performs square root lasso regularised linear regression on all pairs of columns in the input matrix X, otherwise modelling Y ~ X. The primary function (including default arguments) is:

output <- interaction_lasso(X, Y, n = dim(X)[1], p = dim(X)[2], lambda_min = -1, halt_error_diff=1.01, max_interaction_distance=-1, max_nz_beta=-1, max_lambdas=200, verbose=FALSE, log_filename="regression.log", depth=2, log_level="none", estimate_unbiased=FALSE, use_intercept=TRUE, num_threads=-1, approximate_hierarchy=FALSE, check_duplicates=FALSE, continuous_X=FALSE)

Arguments:

X : A binary n × p matrix.

Y : A vector of n real values.

lambda_min : optionally set the final value of lambda. If $ < 0$ the default value of ϕ ¹ ( 0.95 2 × p ) is used.

halt_error_diff : The loss-threshold to determine when an iteration is complete.

max_interaction_distance : The maximum distance between any two components of an interaction effect. Set to '-1' for no limit (default).

max_nz_beta : If >=0, halt after this many β values are non-zero (note the the current λ iteration will be completed first, so more values may be set). '-1' implies no limit.

max_lambdas : maximum number of iterations (i.e. number of λ values). Initial iterations in which no β values are changed do not count.

depth : Maximum number of columns that may be included in an interaction. If depth=1, only main effects (columns on their own) are included. If depth=2, pairwise interactions are also included. If depth=3 main effects, pairwise and three-way interactions are included.

estimate_unbiased : once the non-zero β values have been determined, optionally re-fit with λ = 0 to avoid the minimising effect on β values, while still keeping the result sparse.

use_intercept : If true, allow a non-zero intercept.

approximate_hierarchy : Approximates a strong hierarchy by only allowing interactions between columns that are (or were at a larger λ value) non-zero. Note that a main effect may still be set to zero after the interactions is included, so this does not strictly enforce either a strong or weak hierarchy. This can considerably speed up fitting interactions on large data sets.

check_duplicates : Identify and report any duplicate columns or interactions, and only assign an effect to the one of them.

num_threads : Number of threads to use, use '-1' (default) to use all available CPU cores.

Experimental Features

A number of options have been implemented, but not thoroughly tested. These are:

continuous_X : If true, use floating point values for X. If false, all non-zero values in X are treated as 1. Note that this currently disables duplicate column detection.

log_filename : name of file to save current progress in case the process needs to be interrupted and resumed.

log_level : options are 'none' (no logging), and 'lambda' where progress is saved after each λ iteration is completed.

Return Values

A list of non-zero pairwise/interaction and main effects is returned.

More precisely:

final_lambda : the final value of λ .

intercept : (if use_intercept=TRUE) the intercept value.

main : A data frame effects containing i , β i for individual columns X i , and a list eqiuvalent of the columns/interactions that were indistinguishable from each (if check_duplicates was enabled).

pairwise (if depth 2 ) A data frame effects containing i , j , β i , j for X i X j and a list equivalent of the columns/interactions that were indistinguishable from each (if check_duplicates was enabled).

triple (if depth 3 ) A data frame effects containing i , j , k , β i , j , k for X i X j X k and a list equivalent of the columns/interactions that were indistinguishable from each (if check_duplicates was enabled).

estimate_unbiased : (if estimate_unbiased=TRUE) β i , β i , j , β i , j , k fit with λ = 0 , including only the effects that are non-zero for $lambda = $ final_lambda. For an estimate of the best fit, while excluding columns lasso regression sets to zero.

For an example that finds non-zero interactions with pint, before finding a more accurate estimate of effect strengths and various summary statistics with lm() see lm_example.R

Build Requirements

Compiling on Ubuntu 22.04 requires the following package:

libxxhash-dev

Additionally, the following are required for the standalone executable and/or running tests:

libgsl-dev
ninja-build
libglib2.0-dev
meson
gcovr

Standalone Executable

There is an executable version (primarily for testing) that can be run on X/Y as .csv files.

Build Utils

meson --buildtype release build
ninja -C build

Usage

./build/utils/src/lasso_exe X.csv Y.csv [main/int] verbose=T/F [max lambda] N P [max interaction distance] [frac overlap allowed] [q/t/filename] [log_level [i]ter/[l]ambda/[n]one]

All arguments must be supplied.

Argument Use
X.csv Path to X matrix in .csv format (see testX.csv for an example)
Y.csv Path to Y matrix in .csv format (see testY.csv for an example)
main/int: Find only main effects, or interactions. Main effects only intended for testing and may be broken.
verbose: For debugging purposes.
max lambda: Initial lambda value for regression, must be > 0.
N: Number of rows of X/Y (e.g. no. fitness scores)
P: Number of columns of X (e.g. no. genes)
max interaction distance: Only columns within this distance in X will be considered. -1 to use all pairs.
frac overlap: fraction of columns being updated at the same time that is allowed to overlap. No longer used.
q/t/filename: output mode. [q]uit immediately without printing output, [t]erminal: prints first 10 values < -500 to terminal, [filename]: prints all non-zero effects to the given file.
log_level: Whether and how to log partial results. iter -> every iteration, lambda -> every new lambda, none -> do not log.

Acknowledgements

This project includes the following work:

  • xxHash (for identifying identical columns) - BSD 2-Clause License.
  • Malte Skarupke's flat hash map - Boost Software License, Version 1.0.