The goal of this repository is to compile and compare a list of statistical methods and algorithms that take bivariate data (point clouds) as input and attempt to infer a causal direction.
Good references on this topic are:
- A Very Comprehensive Benchmark of methods using Additive Noise Models, and all the surrounding concepts
- Several machine-learning algorithms using distribution embeddings have been designed: RCC, KCDC. A more statistical approach is QCDC (copulas + quantile scores)
- The SLOPE algorithm is a framework assuming a set of basis functions, and iteratively weights goodness of fit against function complexity to find the "best" model. Various instantiations exist, such as Slope-S, Slope-D, an identifiable variant, etc. More information can be found in their journal paper
- RECI is a statistical approach based on regression, identifiable in the low noise setting
- IGCI justifies a statistical approach in the case where the relationship is deterministic and invertible. Additional material can be found in their subsequent paper.
- A good review of graphical models for more than two variables can also be helpful to understand the general point of view.
- CGNN connects graphical models, generative models, and bivariate methods in an interpretable fashion (using neural networks). It is a good bridge between bivariate and graph methods. The authors are currently building a very helpful Python causal discovery library
Many causal algorithms rely on independence tests and similarity tests. Some examples are:
- Bivariate methods using Additive Noise Models often use Mutual Information or HSIC (a minimal sketch of the HSIC-based approach is given after this list)
- Constraint-based methods for graph data use conditional independence tests. A good statistical test is the KCI Test and the related KPC algorithm. In case one needs a faster, approximate method, the authors (and others) have recently designed approximations such as RCIT and RCOT. Another good, though quadratic-complexity, conditional independence test is PCIT.
- A good review of dependence tests can be found in this interesting thesis
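To make the ANM idea concrete, here is a minimal sketch of direction detection via HSIC on regression residuals. It uses a biased Gaussian-kernel HSIC estimate and a simple polynomial regression as a stand-in for the nonparametric regressions used in the literature; all function names are illustrative and not part of this repository's API.

```python
# Illustrative sketch: ANM-based direction detection via HSIC on residuals.
import numpy as np

def _gaussian_gram(x, sigma):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y):
    """Biased HSIC estimate between two 1-d samples (rough median-heuristic bandwidths)."""
    n = len(x)
    sx = np.median(np.abs(x[:, None] - x[None, :])) + 1e-12
    sy = np.median(np.abs(y[:, None] - y[None, :])) + 1e-12
    K, L = _gaussian_gram(x, sx), _gaussian_gram(y, sy)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def anm_direction(x, y, deg=3):
    """Prefer the direction whose regression residuals look more independent of the input."""
    res_y = y - np.polyval(np.polyfit(x, y, deg), x)   # residuals of y given x
    res_x = x - np.polyval(np.polyfit(y, x, deg), y)   # residuals of x given y
    return "x->y" if hsic(x, res_y) < hsic(y, res_x) else "y->x"

# Toy additive-noise pair: y = x^3 + noise, so 'x->y' should be preferred.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = x ** 3 + 0.1 * rng.normal(size=300)
print(anm_direction(x, y))
```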
Here we are interested in differentiable versions of various statistical tests. We implemented some tests in PyTorch using smooth approximations of existing tests, allowing backpropagation w.r.t. inputs and parameters.
- PyTorch HSIC Test and an example of HSIC minimization ( code ) for ANM-detection. Although the HSIC test is differentiable w.r.t. all inputs, our implementation doesn't yet support hyperparameter fitting. A minimal differentiable HSIC sketch is given after this list.
- Examples of a 2D Gaussian HSIC-Gamma test and of ANM-detection tests will be uploaded.
- Might re-implement relative HSIC between two models.
- PyTorch MMD Test with Gamma Approximation.
- Might re-implement optimized MMD from here, or relative MMD between two models.
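As an illustration of what differentiability buys us, the following is a minimal sketch (not the repository's actual implementation) of a biased Gaussian-kernel HSIC statistic in PyTorch, used to fit a linear additive-noise model by minimizing the dependence between the cause and the residual; names and hyperparameters are assumptions.

```python
# Illustrative sketch of a differentiable (biased) HSIC statistic in PyTorch.
import torch

def gaussian_gram(z, sigma):
    d2 = (z.unsqueeze(1) - z.unsqueeze(0)) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased HSIC estimate; differentiable w.r.t. x, y, and the bandwidths."""
    n = x.shape[0]
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(y, sigma_y)
    H = torch.eye(n) - torch.ones(n, n) / n      # centering matrix
    return torch.trace(K @ H @ L @ H) / n ** 2

# HSIC minimization in the spirit of ANM fitting: adjust a linear model so that
# the residual y - a*x becomes as independent of x as possible.
torch.manual_seed(0)
x = torch.randn(200)
y = 2.0 * x + 0.3 * torch.randn(200)
a = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([a], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = hsic_biased(x, y - a * x)
    loss.backward()
    opt.step()
print(a.item())   # should move towards 2.0
```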
Classifier Two Sample Tests (C2ST) have been introduced and tested in this paper. Here, we re-implement and slightly adapt the authors' Lua code, which includes:
- C2ST-NN: using a shallow neural network classifier (ReLU + Sigmoid) with 20 hidden units by default. While adding layers/hidden units is a good idea, we usually work with 500-5000 samples per distribution and/or aim for accuracy higher than 55% to reject P = Q
- C2ST-KNN: a K-nearest-neighbors classifier with k = floor(n_te / 2). Usually worse than neural nets.
The idea, in broad terms, is that under H0 (P = Q) the classifier cannot exceed 50% accuracy (in expectation), and n_te * acc is distributed as Binomial(n_te, 0.5). The accuracy acc under H0 can therefore be approximated as Normal(0.5, 0.25 / n_te); we use this approximate null to compute a p-value on the test accuracy and reject H0 accordingly.
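A minimal sketch of this decision rule, assuming scipy is available (function names are illustrative):

```python
# Sketch of the C2ST decision rule under the normal approximation of the
# Binomial(n_te, 0.5) null; `acc` and `n_te` come from a held-out test split.
from scipy.stats import norm

def c2st_p_value(acc, n_te):
    """One-sided p-value for observing test accuracy `acc` when P = Q."""
    null_std = (0.25 / n_te) ** 0.5      # std of acc under H0: Normal(0.5, 0.25/n_te)
    return 1.0 - norm.cdf(acc, loc=0.5, scale=null_std)

# Example: 58% accuracy on 500 held-out samples.
p = c2st_p_value(0.58, 500)
print(p)                                 # reject H0 (P = Q) if p < alpha, e.g. 0.05
```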
Some basic examples can be found in this subdirectory.
We are currently re-implementing SLOPE in Python, supporting both NumPy and PyTorch datatypes. An example of the SLOPE fit for 13 basis functions can be found in this folder ( code ), which also contains mixed fits for 8 functions, and a little bit more.
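For intuition only, here is a toy sketch of the fit-versus-complexity trade-off over a small set of basis functions, using a BIC-style penalty. This is not the actual SLOPE score; the basis set, the penalty form, and all names are illustrative assumptions.

```python
# Toy illustration (not the actual SLOPE score): pick, from a small set of basis
# functions, the regression that best trades off goodness of fit against the
# number of parameters, then compare the two directions.
import numpy as np

BASES = {
    "linear":    lambda x: np.column_stack([np.ones_like(x), x]),
    "quadratic": lambda x: np.column_stack([np.ones_like(x), x, x ** 2]),
    "cubic":     lambda x: np.column_stack([np.ones_like(x), x, x ** 2, x ** 3]),
}

def best_fit_score(x, y):
    """Return the lowest penalized score over the candidate basis functions."""
    n = len(x)
    best = np.inf
    for name, basis in BASES.items():
        X = basis(x)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ coef) ** 2)
        # BIC-style score: fit term plus a penalty growing with the parameter count.
        score = 0.5 * n * np.log(sse / n + 1e-12) + 0.5 * X.shape[1] * np.log(n)
        best = min(best, score)
    return best

def slope_like_direction(x, y):
    """Prefer the direction whose regression achieves the lower penalized score."""
    return "x->y" if best_fit_score(x, y) < best_fit_score(y, x) else "y->x"
```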
Fit a GMM ( code ) with a flexible number of components; a minimal sketch is given after the list below.
- One dimensional on synthetic data (can be applied to estimate marginal complexity)
- Two dimensional on synthetic data (as an example of causality-agnostic distribution fitting)
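A minimal sketch of such a flexible fit, assuming scikit-learn's GaussianMixture and BIC-based selection of the number of components (the repository's own code may differ):

```python
# Choose the number of GMM components by minimizing BIC over a range of candidates.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_flexible(X, max_components=10, random_state=0):
    """Fit GaussianMixture models with 1..max_components and keep the best by BIC."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

# One-dimensional example (e.g. estimating marginal complexity); X must be 2-D.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)]).reshape(-1, 1)
print(fit_gmm_flexible(X).n_components)   # expected to settle around 2
```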
With few exceptions, every picture and experiment reported can be found in the tests/data subdirectory. However, for particularly large files or a high number of pictures, a separate picture-only repo is available!
The dependencies can be installed using `pip install -r requirements.txt` or `pip3 install -r requirements.txt`.