Merge pull request #64 from MannLabs/develop

Develop
MannLabs · Jan 26, 2021 · 39ef5c6
2 parents 5907261 + f66c4bc

Showing 18 changed files with 108 additions and 68 deletions.
diff --git a/README.md b/README.md
@@ -266,7 +266,7 @@ Typical performance statistics on data in-/output and slicing of standard [HeLa
 | DIA  | 6 min    | 158,552,099   | 1.09 s / 381 ms    | 403 ms | 6.40 / 26.7 / 626 / 109     |
 | DDA  | 21 min   | 295,251,252   | 3.07 s / 913 ms    | 757 ms | 1.74 / 72.5 / 122 / 186      |
 | DIA  | 21 min   | 730,564,765   | 4.54 s / 2.20 s    | 1.85 s | 0.855 / 122 / 5040 / 404    |
-| DDA  | 120 min  | 2,074,019,899 | 24.1 s / 10.6 s    | 5.7 s  | 0.709 / 371 / 609 / 1200    |
+| DDA  | 120 min  | 2,074,019,899 | 24.1 s / 10.6 s    | 5.70 s  | 0.709 / 371 / 609 / 1200    |
 
 All slices were performed in a single dimension. Including more slices makes the analysis more stringent and hence faster. The considered dimensions were:
 
@@ -275,7 +275,7 @@ All slices were performed in a single dimension. Including more slices makes the
 * **Quadrupole:** 700.0 <= quad_mz_values < 710.0
 * **TOF:** 621.9 <= tof_mz_values < 622.1
 
-All of these analyses were timed with `timeit` and are the average of 5 runs. They were obtained on the following system:
+All of these analyses were timed with `timeit` and are the average of at least 7 runs. They were obtained on the following system:
 
 * **MacBook Pro:** (13-inch, 2020, Four Thunderbolt 3 ports)
 * **OS version:** macOS Catalina 10.15.7
@@ -291,7 +291,7 @@ Full details are available in the [performance notebook](nbs/performance.ipynb).
 The basic workflow of AlphaTims looks as follows:
 
 * Read data from a [Bruker `.d` folder](#bruker-raw-data).
-* Convert data to a [TimsTOF object in Python](#timstof-objects-in-python) and store them as a persistent HDF5 file.
+* Convert data to a [TimsTOF object in Python](#timstof-objects-in-python) and store them as a persistent [HDF5 file](https://www.hdfgroup.org/solutions/hdf5/).
 * Use Python's [slicing mechanism](#slicing-timstof-objects) to retrieve data from this object e.g. for visualisation.
 
 ### Bruker raw data
@@ -315,20 +315,20 @@ After reading the `PasefFrameMSMSInfo` or `DiaFrameMsMsWindows` table from the `
 * A `quad_indptr` array that indexes the `tof_indptr` array. Each element points to an index of the `tof_indptr` where the voltage on the quadrupole and collision cell is adjusted. For PASEF acquisitions, this is typically 20 times per MSMS frame (turning on and off a value for 10 precursor selections) and once per change from an MS (precursor) frame to an MSMS (fragment) frame. For diaPASEF, this is typically twice to 10 times per frame and with a repetitive pattern over the frame cycle. This results in an array of approximately `len(quad_indptr) = 100 * gradient_length_in_seconds`. As with the `tof_indptr` array, this array is converted to an offset array with size `+1`.
 * A `quad_low_values` array of `len(quad_indptr) - 1`. This array stores the lower m/z boundary that is selected with the quadrupole. For precursors without quadrupole selection, this value is set to -1.
 * A `quad_high_values` array, similar to `quad_low_values`.
-* A `precursor_indices` array of `len(quad_indptr) - 1`. For PASEF this array stores the index of the selected precursor. For diaPASEF, this array stores the `WindowGroup` of the fragment frame. As with the `quad_low_values` and `quad_high_values`, a value of -1 indicates a precursor without quadrupole selection.
+* A `precursor_indices` array of `len(quad_indptr) - 1`. For PASEF this array stores the index of the selected precursor. For diaPASEF, this array stores the `WindowGroup` of the fragment frame. A value of 0 indicates an MS1 ion (i.e. precursor) without quadrupole selection.
 
-After processing this summarising information from the `analysis.tdf` SQL database, the actual raw data from the `analysis.tdf_bin` binary file is read and stored in the empty `tof_indices`, `intensities` and `tof_indptr` arrays. This is done with the `tims_read_scans_v2` function from Bruker's `timsdata.dll` library (available in the [alphatims/ext](alphatims/ext) folder).
+After processing this summarising information from the `analysis.tdf` SQL database, the actual raw data from the `analysis.tdf_bin` binary file is read and stored in the empty `tof_indices`, `intensities` and `tof_indptr` arrays.
 
-Finally, three arrays are defined that allows quick translation of `frame_`, `scan_` and `tof_indices` to `rt_values`, `mobility_values` and `mz_values` arrays.
+Finally, three arrays are defined that allow quick translation of `frame_`, `scan_` and `tof_indices` to `rt_values`, `mobility_values` and `mz_values` arrays.
 * The `rt_values` array is read directly from the `Frames` table in `analysis.tdf` and has a length equal to `frame_max_index + 1`. Note that an empty zeroth frame with `rt = 0` is created to make Python's 0-indexing compatible with Bruker's 1-indexing.
 * The `mobility_values` array is defined by using the function `tims_scannum_to_oneoverk0` from `timsdata.dll` on the first frame and typically has a length of `1000`.
 * Similarly, the `mz_values` array is defined by using the function `tims_index_to_mz` from `timsdata.dll` on the first frame. Typically this has a length of `400000`.
 
-All these arrays can be loaded into memory, taking up roughly twice as much RAM as the `.d` folder on disk. This increase in RAM memory is mainly due to the compression used in the `analysis.tdf_bin` file. If the Python object is stored as an HDF5 file, the empty `tof_indices` and `intensity` arrays can be created and filled on-disk, thereby minimizing RAM memory usage to less than 1 GB even for files that take up several GB on-disk. The HDF5 file can also be compressed so that its size is roughly halved and thereby has the same size as the Bruker `.d` folder, but (de)compression reduces accession times by 3-6 fold.
+All these arrays can be loaded into memory, taking up roughly twice as much RAM as the `.d` folder on disk. This increase in memory usage is mainly due to the compression used in the `analysis.tdf_bin` file. The HDF5 file can also be compressed so that its size is roughly halved and thereby matches the size of the Bruker `.d` folder, but (de)compression slows accession by 3-6 fold.
 
 ### Slicing TimsTOF objects
 
-Once a Python TimsTOF object is available, it can be loaded into memory for ultrafast accession. Accession of the `data` object is done by simple Python slicing such as e.g. `selected_ion_indices = data[frame_selection, scan_selection, quad_selection, tof_selection]`. These ion indices are then easily parsed to a `pd.DataFrame` with the function `df = data.as_dataframe(selected_ion_indices)`. The columns of this dataframe contain all information, i.e. `frame`, `scan`, `precursor` and `tof` indices and `rt`, `mobility`, `quad_low`, `quad_high`, `mz` and `intensity` values.
+Once a Python TimsTOF object is available, it can be loaded into memory for ultrafast accession. Accession of the `data` object is done by simple Python slicing such as `selected_ion_indices = data[frame_selection, scan_selection, quad_selection, tof_selection]`. This slicing returns a `pd.DataFrame` for subsequent analysis. The columns of this dataframe contain all information for all selected ions, i.e. `frame`, `scan`, `precursor` and `tof` indices and `rt`, `mobility`, `quad_low`, `quad_high`, `mz` and `intensity` values. See the [tutorial jupyter notebook](nbs/tutorial.ipynb) for usage examples.
 
 ---
 ## Future perspectives
@@ -341,4 +341,4 @@ Once a Python TimsTOF object is available, it can be loaded into memory for ultr
 ---
 ## How to contribute
 
-All contributions are welcome. Feel free to post a new issue or clone the repository and create a PR with a new branch.
+All contributions are welcome. Feel free to post a new issue or clone the repository and create a PR with a new branch. For more information see [the Contributors License Agreement](misc/CLA.md)
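
The README hunks above describe `tof_indptr` and `quad_indptr` as offset arrays of size `+1`. A minimal sketch of how such an offset array can map a raw index back to the row that contains it, assuming made-up counts and hypothetical helper names (`build_indptr`, `row_of`) that are not part of AlphaTims:

```python
from bisect import bisect_right
from itertools import accumulate


def build_indptr(counts):
    """Turn per-row counts into an offset array with len(counts) + 1 entries."""
    return [0, *accumulate(counts)]


def row_of(indptr, raw_index):
    """Return the row r whose half-open range [indptr[r], indptr[r + 1]) holds raw_index."""
    if not 0 <= raw_index < indptr[-1]:
        raise IndexError(raw_index)
    # bisect_right skips over empty rows, which repeat the same offset.
    return bisect_right(indptr, raw_index) - 1


# Toy data (assumption): four rows holding 3, 0, 2 and 4 detector events.
counts = [3, 0, 2, 4]
indptr = build_indptr(counts)

# All raw indices that belong to row 2.
row_2_indices = list(range(indptr[2], indptr[3]))
```

With this layout, the events of row `r` are always the contiguous range `indptr[r]` to `indptr[r + 1]`, so an empty row costs one repeated offset and no extra storage.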
diff --git a/alphatims/__init__.py b/alphatims/__init__.py
@@ -2,7 +2,7 @@
 
 
 __project__ = "alphatims"
-__version__ = "0.0.210122"
+__version__ = "0.0.210126"
 __license__ = "MIT"
 __description__ = "A python package for Bruker TimsTOF raw data accession and visualization"
 __author__ = "Sander Willems"

diff --git a/alphatims/bruker.py b/alphatims/bruker.py
@@ -29,6 +29,7 @@
     )
 else:
     logging.warning(
+        "WARNING: "
         "No Bruker libraries are available for this operating system. "
         "Intensities are uncalibrated, resulting in (very) small differences. "
         "However, mobility and m/z values need to be estimated. "
@@ -754,6 +755,8 @@ def __init__(
             )
         logging.info(f"Successfully imported data from {bruker_d_folder_name}")
         self.slice_as_dataframe = slice_as_dataframe
+        # Precompile
+        self[0, "raw"]
 
     def _import_data_from_d_folder(
         self,
@@ -939,6 +942,7 @@ def convert_from_indices(
         quad_indices=None,
         scan_indices=None,
         tof_indices=None,
+        return_raw_indices: bool = False,
         return_frame_indices: bool = False,
         return_scan_indices: bool = False,
         return_quad_indices: bool = False,
@@ -964,6 +968,9 @@ def convert_from_indices(
             The scan indices for which coordinates need to be retrieved.
         tof_indices : np.int64[:], None
             The tof indices for which coordinates need to be retrieved.
+        return_raw_indices : bool
+            If True, include "raw_indices" in the dict.
+            Default is False.
         return_frame_indices : bool
             If True, include "frame_indices" in the dict.
             Default is False.
@@ -1040,6 +1047,8 @@ def convert_from_indices(
             )
         if (return_tof_indices or return_mz_values) and (tof_indices is None):
             tof_indices = self.tof_indices[raw_indices]
+        if return_raw_indices:
+            result["raw_indices"] = raw_indices
         if return_frame_indices:
             result["frame_indices"] = frame_indices
         if return_scan_indices:
@@ -1231,8 +1240,9 @@ def bin_intensities(self, indices: np.ndarray, axis: tuple):
 
     def as_dataframe(
         self,
-        raw_indices: np.ndarray,
+        indices: np.ndarray,
         *,
+        raw_indices: bool = True,
         frame_indices: bool = True,
         scan_indices: bool = True,
         quad_indices: bool = False,
@@ -1248,8 +1258,11 @@ def as_dataframe(
 
         Parameters
         ----------
-        raw_indices : np.int64[:]
+        indices : np.int64[:]
             The raw indices for which coordinates need to be retrieved.
+        raw_indices : bool
+            If True, include "raw_indices" in the dataframe.
+            Default is True.
         frame_indices : bool
             If True, include "frame_indices" in the dataframe.
             Default is True.
@@ -1289,7 +1302,8 @@ def as_dataframe(
         """
         return pd.DataFrame(
            self.convert_from_indices(
-                raw_indices,
+                indices,
+                return_raw_indices=raw_indices,
                 return_frame_indices=frame_indices,
                 return_scan_indices=scan_indices,
                 return_quad_indices=quad_indices,

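The `as_dataframe` hunk above renames the first positional parameter to `indices` and reuses the name `raw_indices` for a keyword-only boolean flag. A minimal stand-alone sketch of that calling convention, assuming hypothetical toy functions rather than the real `TimsTOF` methods, and plain dicts instead of a `pd.DataFrame`:

```python
def convert_from_indices(raw_indices, *, return_raw_indices=False, return_doubled=False):
    """Toy stand-in: build a dict that only holds the requested columns."""
    result = {}
    if return_raw_indices:
        result["raw_indices"] = list(raw_indices)
    if return_doubled:
        # Hypothetical derived column, standing in for e.g. frame or scan indices.
        result["doubled"] = [2 * i for i in raw_indices]
    return result


def as_dataframe(indices, *, raw_indices=True, doubled=True):
    """Toy stand-in: forward boolean column flags to convert_from_indices."""
    return convert_from_indices(
        indices,
        return_raw_indices=raw_indices,
        return_doubled=doubled,
    )


table = as_dataframe([4, 7, 9])
without_raw = as_dataframe([4, 7, 9], raw_indices=False)

# Passing the index array under its old keyword name fails,
# because the required `indices` argument is then missing.
try:
    as_dataframe(raw_indices=[4, 7, 9])
    old_keyword_call_works = True
except TypeError:
    old_keyword_call_works = False
```

If the real method behaves like this sketch, positional calls are unaffected, while callers that passed the index array as `raw_indices=...` would need updating.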
diff --git a/alphatims/plotting.py b/alphatims/plotting.py
@@ -24,10 +24,12 @@ def line_plot(
 
     Parameters
     ----------
-    timstof_data : aphatims.bruker.TimsTOF
-        An aphatims.bruker.TimsTOF data object.
+    timstof_data : alphatims.bruker.TimsTOF
+        An alphatims.bruker.TimsTOF data object.
     selected_indices : np.int64[:]
-        The raw indices that are selected for this plot
+        The raw indices that are selected for this plot.
+        These are typically obtained by slicing the TimsTOF data object with
+        e.g. data[..., "raw"].
     x_axis_label : str
         A label that is used for projection
         (i.e. intensities are summed) on the x-axis. Options are:
@@ -116,7 +118,7 @@ def heatmap(
     Parameters
     ----------
     df : pd.DataFrame
-        A dataframe wirth coordinates.
+        A dataframe with coordinates.
         This should be obtained by slicing an alphatims.bruker.TimsTOF object.
     x_axis_label : str
         A label that is used for projection
@@ -127,7 +129,7 @@ def heatmap(
             - "Inversed IM, V·s·cm\u207B\u00B2"
     y_axis_label : str
         A label that is used for projection
-        (i.e. intensities are summed) on the x-axis. Options are:
+        (i.e. intensities are summed) on the y-axis. Options are:
 
             - "m/z, Th"
             - "RT, min"
@@ -207,8 +209,8 @@ def tic_plot(
 
     Parameters
     ----------
-    timstof_data : aphatims.bruker.TimsTOF
-        An aphatims.bruker.TimsTOF data object.
+    timstof_data : alphatims.bruker.TimsTOF
+        An alphatims.bruker.TimsTOF data object.
     title : str
         The title of this plot.
         Will be prepended with "TIC".

diff --git a/alphatims/utils.py b/alphatims/utils.py
@@ -280,7 +280,7 @@ def set_threads(threads: int, set_global: bool = True) -> int:
     threads : int
         The number of threads.
         If larger than available cores, it is trimmed to the available maximum.
-        If 0, it is set the the maximum cores available.
+        If 0, it is set to the maximum cores available.
         If negative, it indicates how many cores NOT to use.
     set_global : bool
         If False, the number of threads is only parsed to a valid value.
@@ -415,6 +415,7 @@ def pjit(
     *,
     thread_count=None,
     cache: bool = True,
+    **kwargs
 ):
     """A decorator that parallelizes the numba.njit decorator with threads.
 
@@ -439,7 +440,7 @@ def pjit(
         Default is None.
     cache : bool
         See numba.njit decorator.
-        Default is True (in contrast to numba) .
+        Default is True (in contrast to numba).
 
     Returns
     -------
@@ -452,7 +453,11 @@ def pjit(
     import numpy as np
 
     def parallel_compiled_func_inner(func):
-        numba_func = numba.njit(nogil=True, cache=True)(func)
+        if "cache" in kwargs:
+            cache = kwargs.pop("cache")
+        else:
+            cache = True
+        numba_func = numba.njit(nogil=True, cache=cache, **kwargs)(func)
 
         @numba.njit(nogil=True, cache=True)
         def numba_func_parallel(
@@ -830,7 +835,7 @@ class Global_Stack(object):
     i.e. option_value = self[option_key]
 
     Attributes
-    
+
         - is_locked : bool
             After each successful update, undo or redo,
             the stack is locked and cannot be modified unless explicitly unlocked.

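In the `pjit` hunk above, `cache` is a named keyword-only parameter of `pjit`, so Python binds any `cache=...` argument to that parameter and never places it in `**kwargs`. Assuming the signature is exactly as shown in this diff, the `"cache" in kwargs` branch cannot trigger and the inner assignment always yields `True`. A minimal sketch of this behaviour and of one possible alternative, using hypothetical names (`fake_njit`, `pjit_as_in_diff`, `pjit_forwarding`) and no numba dependency:

```python
def fake_njit(**options):
    """Hypothetical stand-in for numba.njit that only records its options."""
    def decorator(func):
        func.options = options
        return func
    return decorator


def pjit_as_in_diff(*, thread_count=None, cache=True, **kwargs):
    def inner(func):
        # Mirrors the hunk: `cache` is assigned here, so it is local to `inner`
        # and the outer `cache` argument is never read.
        if "cache" in kwargs:
            cache = kwargs.pop("cache")
        else:
            cache = True
        return fake_njit(nogil=True, cache=cache, **kwargs)(func)
    return inner


def pjit_forwarding(*, thread_count=None, cache=True, **kwargs):
    def inner(func):
        # Alternative (assumption, not the author's code): forward the outer
        # `cache` argument unchanged, together with any extra options.
        return fake_njit(nogil=True, cache=cache, **kwargs)(func)
    return inner


@pjit_as_in_diff(cache=False)
def f():
    return 1


@pjit_forwarding(cache=False, fastmath=True)
def g():
    return 2
```

In this sketch `f.options["cache"]` stays `True` although `cache=False` was requested, whereas `g.options["cache"]` is `False`. Extra options such as `fastmath` still pass through `**kwargs` in both variants.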
diff --git a/docs/_build/doctrees/alphatims.bruker.doctree b/docs/_build/doctrees/alphatims.bruker.doctree
diff --git a/docs/_build/doctrees/alphatims.plotting.doctree b/docs/_build/doctrees/alphatims.plotting.doctree
diff --git a/docs/_build/doctrees/alphatims.utils.doctree b/docs/_build/doctrees/alphatims.utils.doctree
diff --git a/docs/_build/doctrees/environment.pickle b/docs/_build/doctrees/environment.pickle