
Commit

Improved code quality.
Updated documentation.
Version bump to 0.1.1.
lingfeiwang committed Jan 29, 2025
1 parent 3d4c815 commit 86c2e98
Showing 22 changed files with 143 additions and 589 deletions.
21 changes: 13 additions & 8 deletions README.rst
Original file line number Diff line number Diff line change
@@ -3,35 +3,40 @@ Airqtl
=========
Airqtl is an efficient method to map expression quantitative trait loci (eQTLs) and infer causal gene regulatory networks (cGRNs) from population-scale single-cell studies. The core of airqtl is the Array of Interleaved Repeats (AIR), an efficient data structure to store and process donor-level data in the cell-donor hierarchical setting. Airqtl offers over eight orders of magnitude of acceleration in eQTL mapping with linear mixed models, arising from its superior time complexity and Graphics Processing Unit (GPU) utilization.
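The interleaved-repeats idea can be illustrated with a toy sketch (hypothetical code, not airqtl's actual API): donor-level values are stored once alongside per-donor cell counts, and expanded to the cell level only on demand.

```python
# Toy illustration of the AIR idea: keep one value per donor plus per-donor
# cell counts, rather than materializing the full cell-level array upfront.
donor_values = [0.5, 1.2, 0.9]   # e.g. genotype dosages, one per donor
cells_per_donor = [2, 3, 1]      # repeat counts in the cell-donor hierarchy

def to_cell_level(values, repeats):
    """Expand donor-level values into the full cell-level array."""
    return [v for v, r in zip(values, repeats) for _ in range(r)]

print(to_cell_level(donor_values, cells_per_donor))
# → [0.5, 0.5, 1.2, 1.2, 1.2, 0.9]
```

Operations that are constant per donor can then run on the short donor-level array rather than once per cell, which is the kind of saving the acceleration claim rests on.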

**This repository is being actively updated. Please check back later.**

Installation
=============
Airqtl is on `PyPI <https://pypi.org/project/airqtl>`_. To install airqtl, first install `Pytorch 2 <https://pytorch.org/get-started/locally/>`_. Then install airqtl with pip: ``pip install airqtl`` or from github: ``pip install git+https://github.com/grnlab/airqtl.git``. Make sure you have added airqtl's install path to your PATH environment variable before using the command-line interface (see FAQ_). Installation can take several minutes, including dependencies.

Usage
=====
Airqtl provides command-line and python interfaces. For starters, you can run airqtl by typing ``airqtl -h`` on command-line. See our tutorials below.
Airqtl provides both command-line and python interfaces. For starters, run airqtl by typing ``airqtl -h`` on the command line. Try our tutorial below and adapt it to your own dataset.

Tutorials
==========================
Currently we provide one tutorial to map cell state-specific single-cell eQTLs and infer cGRNs from the Randolph et al dataset in `docs/tutorials`. We are working on better documentation so you can easily understand the tutorial and repurpose it for your own dataset.
Currently we provide `one tutorial <docs/tutorials/randolph>`_ to map cell state-specific single-cell eQTLs and infer cGRNs from the Randolph et al dataset in `docs/tutorials`.

Issues
==========================
Please raise an issue on `github <https://github.com/grnlab/airqtl/issues/new>`_.

References
==========================
TBA
* `"Airqtl dissects cell state-specific causal gene regulatory networks with efficient single-cell eQTL mapping" <https://www.biorxiv.org/content/10.1101/2025.01.15.633041>`_ (2025) by Lingfei Wang. bioRxiv 2025.01.15.633041.

FAQ
==========================
* What does airqtl stand for?
* **What does airqtl stand for**?
Array of Interleaved Repeats for Quantitative Trait Loci

* I installed airqtl but typing ``airqtl`` says 'command not found'.
* **Why do I see this error:** ``AssertionError: Torch not compiled with CUDA enabled``?

This is because you installed a CPU-only PyTorch build but tried to run it on a GPU. You have several options:

1. To run pytorch on **CPU**, set `device='cpu'` in `Snakefile.config` of the tutorial pipeline you use.
2. To run pytorch on **GPU**, reinstall pytorch with GPU support at `Installation`_.
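As a quick check before editing `Snakefile.config`, the appropriate `device` value can be probed from Python (a minimal sketch; it only assumes PyTorch's standard `torch.cuda.is_available` API):

```python
# Choose a device value for Snakefile.config from the installed PyTorch build.
# Falls back to 'cpu' when PyTorch is absent or was built without CUDA.
try:
    import torch
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # PyTorch not installed in this environment
print(device)
```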

* **I installed airqtl but typing ``airqtl`` says 'command not found'**.
See below.

* How do I use a specific python version for airqtl's command-line interface?
* **How do I use a specific python version for airqtl's command-line interface**?
You can always use the python command to run airqtl, such as ``python3 -m airqtl`` in place of the ``airqtl`` command. You can also use a specific python path or version, such as ``python3.12 -m airqtl`` or ``/usr/bin/python3.12 -m airqtl``. Make sure you have installed airqtl for that python version.
37 changes: 33 additions & 4 deletions docs/tutorials/randolph/README.md
@@ -1,9 +1,38 @@
# Cell state-specific single-cell eQTL mapping and cGRN inference

This is an [airqtl](https://github.com/grnlab/airqtl) tutorial to map expression quantitative trait loci (eQTLs) and infer causal gene regulatory networks (cGRNs) at the cell state level of specificity. The Randolph et al [dataset](https://zenodo.org/records/4273999) from their [original study](https://www.science.org/doi/full/10.1126/science.abg0928) is used.
This is an [airqtl](https://github.com/grnlab/airqtl) tutorial to map single-cell expression quantitative trait loci (sceQTLs) and infer causal gene regulatory networks (cGRNs) at the cell state level of specificity. The Randolph et al [dataset](https://zenodo.org/records/4273999) from their [original study](https://www.science.org/doi/full/10.1126/science.abg0928) is used.

To use this tutorial, first [install airqtl](https://github.com/grnlab/airqtl#installation) and download this folder. You may need to update `Snakefile.config`, especially the `device` parameter if you prefer to use a CPU or a different GPU. Then the pipeline can be run with `snakemake -j 1` in shell environment. It takes ~1 day on a top-end Dell Alienware Aurora R16®, in which single-cell eQTL mapping takes ~10mins for each cell state.
**This tutorial is being actively updated. Please check back often.**

More documentation is underway to help you understand and customize this pipeline and repurpose it for your own data.
## Running the tutorial
1. [Install airqtl](https://github.com/grnlab/airqtl#installation) and download this folder
2. (Optional) Customize pipeline configuration in `Snakefile.config`, especially the `device` parameter if you prefer to use a CPU or a different GPU. See [Understanding and customizing the tutorial](#Understanding-and-customizing-the-tutorial).
3. Run the pipeline with `snakemake -j 1` **twice** in shell. The first run downloads the raw dataset from Zenodo. The second run reads in the cell states, then maps sceQTLs and infers cGRNs for each cell state.
4. Check the sceQTL output files at `data/association` and cGRN output file at `data/merge.tsv.gz`.

If you face any issues or need any assistance, see [FAQ](https://github.com/grnlab/airqtl#faq) and [Issues](https://github.com/grnlab/airqtl#issues).
The whole run takes ~1 day on a top-end Dell Alienware Aurora R16®, in which single-cell eQTL mapping takes ~10 minutes per cell state. The download step can take longer if your internet connection is slow.

After a successful run of this tutorial, you can [repurpose it for your own dataset](#Repurposing-the-tutorial-pipeline-for-your-own-dataset).

## Understanding and customizing the tutorial
* Input files of the pipeline are described in the `datasetfiles_data` and `datasetfiles_meta` variables in [airqtl.pipeline.dataset](../../../src/airqtl/pipeline/dataset.py). Check the downloaded files in `data/raw/` to understand their format.
* Each step of the pipeline is defined as a rule sequentially in `Snakefile`. Take sceQTL association as an example: it corresponds to i) the shell command `airqtl eqtl association` and ii) the python function `airqtl.pipeline.eqtl.association`. Therefore, you can learn more from either the command `airqtl eqtl association -h` or the docstring of `airqtl.pipeline.eqtl.association`. The output files and logs of each step are located in `data/x` and `log/x.log` respectively, where x is the name of the step/rule; each output can be either a folder or a file with a name suffix. Some steps are run once for the whole dataset while others are run separately for each cell state.
* To change pipeline parameters, modify `Snakefile.config`. You can pass custom command-line parameters to each step, according to the accepted parameters shown by e.g. `airqtl eqtl association -h`.
* To run the tutorial pipeline in parallel or on a cluster, modify `Snakefile` which is based on [Snakemake](https://snakemake.readthedocs.io/en/stable/).

## Repurposing the tutorial pipeline for your own dataset
1. [Run this tutorial pipeline](#Running-the-tutorial) successfully
2. [Understand the format of input files](#Understanding-and-customizing-the-tutorial) in `data/raw` folder
3. Perform initial quality control of your own dataset
4. Download this tutorial folder to a new location on your computer
5. Reformat your own dataset into the accepted format and place the files in a newly created `data/raw` folder
6. [Customize the pipeline](#Understanding-and-customizing-the-tutorial) as needed
7. [Run the pipeline](#Running-the-tutorial) for your own dataset
8. Check the output files

## Issues
If you encounter an error, we suggest first troubleshooting on your own. Error logs are located in the console output and the `log` folder.

If you cannot resolve the error or have any other question, please [check the FAQ](../../../#faq) or [raise an issue](../../../#issues).

If you applied any fix to the code or pipeline, we strongly suggest starting over from step 1 of [Running the tutorial](#Running-the-tutorial), unless you are experienced and know what you are doing.
6 changes: 3 additions & 3 deletions docs/tutorials/randolph/Snakefile.config
@@ -5,9 +5,9 @@

#Device to use for association mapping. See Pytorch documentation for details.
device='cuda:0'
#Base directory for data for pipeline
#Base directory of data for pipeline
dirbase='data'
#Base directory for log for pipeline
#Base directory of log for pipeline
dirlbase='log'
#List of discrete cell and donor covariates to subset cells into distinct groups for separate sceQTL mapping
subset_covs=(['celltype'],['infection'])
@@ -16,7 +16,7 @@ covs="none"
#Whether to print verbose log messages
verbose=True

#Optional parameters to finetune each step. See `airqtl subset --help` or similar for details.
#Optional parameters to finetune each step. See `airqtl eqtl subset --help` or similar for details.
params_subset=''
params_qc=''
params_association=f'--device {device}'
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -1,14 +1,14 @@
[project]
name = "airqtl"
version = "0.1.0"
version = "0.1.1"
authors = [
{ name="Lingfei Wang", email="[email protected]" },
]
description = "Array of Interleaved Repeats for Quantitative Trait Loci"
readme = "README.rst"
license = {file = "LICENSE"}
keywords = ["qtl","eqtl","scqtl","sceqtl","lmm","quantitative trait loci","expression quantitative trait loci","single-cell quantitative trait loci","single-cell expression quantitative trait loci","linear mixed model","linear mixed models","population-scale scRNA-seq","gene regulatory network","network inference","causal inference","mendelian randomization"]
requires-python = ">=3.8"
requires-python = ">=3.12"
classifiers = [
"Development Status :: 4 - Beta",
"License :: OSI Approved :: BSD License",
9 changes: 9 additions & 0 deletions setup.cfg
@@ -0,0 +1,9 @@
[pylint.MASTER]
ignore-paths=.*/[.].*,^[.].*,__pycache__

[pylint.MESSAGES CONTROL]
disable=W0311,C0301,C0413,W0406,C0103,C0415,E0102,E1101,C0206,C0303,C0209,W0611,R0913,R0914,W1514,C0302,W0511,R0902,E1121,R0915,R0912,E1136,W1202,W0102,W0201,E0101,C0200,W1401,W1201,R1710,W1203,E0401,R0903,E1102,R0401,R0904,C0325,R1735

[flake8]
ignore=F405,E225,E231,F403,E402,N801,W191,E501,E226,W293,E123,E301,E265,E302,E303,E227,E741,W291,E228,C901,E252,E128,E126,N806,N803,E124
exclude = __pycache__,.*,src/dictys/net/layout.py,src/dictys/scripts
17 changes: 6 additions & 11 deletions src/airqtl/__init__.py
@@ -3,36 +3,31 @@
#
# This file is part of airqtl.

__all__ = ['air','association', 'cov', 'heritability', 'io', 'kinship', 'op', 'pipeline', 'sim', 'utils']
__all__ = ['air','association', 'cov', 'heritability', 'kinship', 'op', 'pipeline', 'sim', 'utils']

from . import *


def _main_func_parser(parser,funcs):
parser.add_argument('-v',
dest='verbose',
action='store_true',
help='Verbose mode.')
parser.add_argument('-v',dest='verbose',action='store_true',help='Verbose mode.')
return parser,funcs

def _main_func_args(args):
import logging
import sys
logging.basicConfig(
format=
'%(levelname)s:%(process)d:%(asctime)s:%(pathname)s:%(lineno)d:%(message)s',
level=logging.DEBUG if args.verbose else logging.WARNING)
logging.basicConfig(format='%(levelname)s:%(process)d:%(asctime)s:%(pathname)s:%(lineno)d:%(message)s',level=logging.DEBUG if args.verbose else logging.WARNING)
logging.info('Started: '+' '.join([f"'{x}'" for x in sys.argv]))
return args

def _main_func_ret(args,ret):
def _main_func_ret(_,ret):
import logging
import sys
logging.info('Completed: '+' '.join([f"'{x}'" for x in sys.argv]))
return ret

def main():
import docstring2argparse as d
d.docstringrunner('airqtl.pipeline',func_filter=lambda name,obj:name[0]!='_' and hasattr(obj,'_da') and obj._da==True,func_parser=_main_func_parser,func_args=_main_func_args,func_ret=_main_func_ret)
d.docstringrunner('airqtl.pipeline',func_filter=lambda name,obj:name[0]!='_' and hasattr(obj,'_da') and obj._da is True,func_parser=_main_func_parser,func_args=_main_func_args,func_ret=_main_func_ret)


assert __name__ != "__main__"
2 changes: 1 addition & 1 deletion src/airqtl/__main__.py
@@ -5,4 +5,4 @@

if __name__ == "__main__":
import airqtl
airqtl.main()
airqtl.main()
23 changes: 9 additions & 14 deletions src/airqtl/air.py
@@ -106,9 +106,8 @@ def tofull(self,axis:Union[list[Optional[int]],int,None]=None)->Union[torch.Tens
if any([self.r[x] is not None and x not in axis for x in range(self.ndim)]):
#Output self class
return self.__class__(v,[self.r[x] if x not in axis else None for x in range(self.ndim)]).reduce()
else:
#Output tensor
return v
#Output tensor
return v
def tensor(self)->torch.Tensor:
ans=self.tofull()
assert isinstance(ans,torch.Tensor)
@@ -218,8 +217,8 @@ def __getitem__(self,key:Tuple[Iterable[int],int,slice])->Union['air',torch.Tens
r[xi]=None
#Reduce dimensions
if len(reduce_dims)>0:
assert all([v.shape[x]==1 for x in reduce_dims])
assert all([r[x] is None for x in reduce_dims])
assert all(v.shape[x]==1 for x in reduce_dims)
assert all(r[x] is None for x in reduce_dims)
v=v.squeeze(reduce_dims)
r=[r[x] for x in range(self.ndim) if x not in reduce_dims]
if v.ndim==0:
@@ -276,7 +275,6 @@ def toreduce(self,method:str,axis:dict[int,torch.Tensor])->'air':
axis=dict(zip(*axis))
v=self.v
r=list(self.r)
change=False
for xi in axis:
if r[xi] is None:
d=torch.zeros(v.shape[:xi]+(len(axis[xi]),)+v.shape[xi+1:],dtype=v.dtype,device=v.device,requires_grad=self.requires_grad)
@@ -430,24 +428,21 @@ def reduce(self,inplace:bool=False)->Union[air,torch.Tensor,'composite',None]:
self.vs,self.axis=ans
self._refresh()
return
else:
return self.__class__(*ans)
return self.__class__(*ans)
if isinstance(ans,(torch.Tensor,air)):
if inplace:
self.vs=[ans]
self.axis=0
self._refresh()
return
else:
return ans
return ans
if isinstance(ans,self.__class__):
if inplace:
self.vs=ans.vs
self.axis=ans.axis
self._refresh()
return
else:
return ans
return ans
raise TypeError(f'Unsupported type {type(ans)}.')
def _resolve_axis(self,axis:int)->int:
"""
@@ -580,8 +575,7 @@ def __matmul__(self,other:Union[air,torch.Tensor,'composite'])->Union[air,torch.
if self.axis!=self.ndim-1:
if self.axis==self.ndim-2:
return self.__class__([x@other for x in self.vs],self.axis).reduce()
else:
return self.__class__([self.vs[x]@other.swapaxes(self.axis,0)[self.sizesc[x]:self.sizesc[x+1]].swapaxes(self.axis,0) for x in range(len(self.vs))],self.axis).reduce()
return self.__class__([self.vs[x]@other.swapaxes(self.axis,0)[self.sizesc[x]:self.sizesc[x+1]].swapaxes(self.axis,0) for x in range(len(self.vs))],self.axis).reduce()
if isinstance(other,self.__class__) and other.axis==other.ndim-2 and len(self.sizes)==len(other.sizes) and (self.sizes==other.sizes).all():
ans=reduce(add,[x@y for x,y in zip(self.vs,other.vs)])
else:
@@ -616,4 +610,5 @@ def sum(self,axis:int=None)->Union[air,torch.Tensor,'composite',float,int]:
return reduce(add,[x.sum(axis=axis) for x in self.vs])
return self.__class__([x.sum(axis=axis) for x in self.vs],self.axis-(axis<self.axis)).reduce()


assert __name__ != "__main__"
5 changes: 2 additions & 3 deletions src/airqtl/association.py
@@ -110,8 +110,7 @@ def multi(dx,dy,dc0,dc1,ncs,mkl,mku,l0,f0,f1,nxd,out,fmt,bsx:int=128,bsy:int=327
from scipy.stats import beta

from .air import air
from .op import (mmatmul, mmatmul1, mmatmul2, mmmatmul, mmsquare, msquare,
msquare1)
from .op import (mmatmul, mmatmul1, mmatmul2, mmmatmul, mmsquare, msquare,msquare1)
if mkl is None:
if mku is not None or l0 is not None:
raise TypeError('Set mkl, mku, l0 all to None for linear model or all to not None for linear mixed model.')
@@ -130,7 +129,6 @@ def multi(dx,dy,dc0,dc1,ncs,mkl,mku,l0,f0,f1,nxd,out,fmt,bsx:int=128,bsy:int=327
ns = ncs.sum()
if dc0 is None:
dc0=np.zeros([0,ns],dtype=float)
nc0 = dc0.shape[0]
nc1 = dc1.shape[0]
# Validity checks
if nd == 0:
@@ -526,4 +524,5 @@ def multi_gxc(dx,dy,dc0,dc1,ncs,mkl,mku,l0,nxd,out,fmt,dom:bool=False,**ka):
raise ValueError('nxd must be greater than zero and less than or equal to the number of intermediate covariates (dc0.shape[0]).')
return multi(dx,dy,dc0,dc1,ncs,mkl,mku,l0,partial(multi_gxc_f0,nxd,dom=dom),partial(multi_gxc_f1,nxd),nxd,out,fmt,**ka)


assert __name__ != "__main__"
1 change: 1 addition & 0 deletions src/airqtl/cov.py
@@ -92,4 +92,5 @@ def o2d(dc,missing={}):
dcnew=dcnew.loc[:,dcnew.nunique()!=1].copy()
return dcnew


assert __name__ != "__main__"
2 changes: 0 additions & 2 deletions src/airqtl/heritability.py
@@ -20,7 +20,6 @@ def nll2(vx, mkl, ns):
"""Analytical computation of negative log likelihood given the MLEs"""
import numpy as np
sx0, sx = vx[:2]
xb = vx[2:].ravel()

n = ns.sum()
beta = sx**2
@@ -32,7 +31,6 @@ def nll3_est(sx, mkl, dtr, dcr, dpt, dpc, dptc, ns):
"""MLE estimator of other variables, given sx"""
import numpy as np
from normalisr.association import inv_rank
nn = len(ns)

beta = sx**2
t0 = (1 - 1 / (1 + beta * mkl))
