Merge pull request #64 from MannLabs/develop
Develop
swillems authored Jan 26, 2021
2 parents 5907261 + f66c4bc commit 39ef5c6
Showing 18 changed files with 108 additions and 68 deletions.
18 changes: 9 additions & 9 deletions README.md
Expand Up @@ -266,7 +266,7 @@ Typical performance statistics on data in-/output and slicing of standard [HeLa
| DIA | 6 min | 158,552,099 | 1.09 s / 381 ms | 403 ms | 6.40 / 26.7 / 626 / 109 |
| DDA | 21 min | 295,251,252 | 3.07 s / 913 ms | 757 ms | 1.74 / 72.5 / 122 / 186 |
| DIA | 21 min | 730,564,765 | 4.54 s / 2.20 s | 1.85 s | 0.855 / 122 / 5040 / 404 |
| DDA | 120 min | 2,074,019,899 | 24.1 s / 10.6 s | 5.7 s | 0.709 / 371 / 609 / 1200 |
| DDA | 120 min | 2,074,019,899 | 24.1 s / 10.6 s | 5.70 s | 0.709 / 371 / 609 / 1200 |

All slices were performed in a single dimension. Including more slices makes the analysis more stringent and hence faster. The considered dimensions were:

Expand All @@ -275,7 +275,7 @@ All slices were performed in a single dimension. Including more slices makes the
* **Quadrupole:** 700.0 <= quad_mz_values < 710.0
* **TOF:** 621.9 <= tof_mz_values < 622.1

All of these analyses were timed with `timeit` and are the average of 5 runs. They were obtained on the following system:
All of these analyses were timed with `timeit` and are the average of at least 7 runs. They were obtained on the following system:
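
A minimal sketch of how such an average could be measured with Python's standard `timeit` module. The function `slice_tof` and the list `mz_values` are hypothetical stand-ins invented for this illustration; the actual benchmarks slice a loaded TimsTOF object and may have used a different `timeit` front end.

```python
import timeit


def slice_tof(values, low=621.9, high=622.1):
    """Toy stand-in for a single-dimension slice: keep low <= value < high."""
    return [value for value in values if low <= value < high]


# Hypothetical m/z values; a real measurement would use a TimsTOF object.
mz_values = [600.0 + 0.01 * i for i in range(5000)]

# Time the slice 7 times with one call per repeat, then average the timings.
timings = timeit.repeat(lambda: slice_tof(mz_values), repeat=7, number=1)
average_seconds = sum(timings) / len(timings)
print(f"average of {len(timings)} runs: {average_seconds:.6f} s")
```
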

* **MacBook Pro:** (13-inch, 2020, Four Thunderbolt 3 ports)
* **OS version:** macOS Catalina 10.15.7
Expand All @@ -291,7 +291,7 @@ Full details are available in the [perfomance notebook](nbs/performance.ipynb).
The basic workflow of AlphaTims looks as follows:

* Read data from a [Bruker `.d` folder](#bruker-raw-data).
* Convert data to a [TimsTOF object in Python](#timstof-objects-in-python) and store them as a persistent HDF5 file.
* Convert data to a [TimsTOF object in Python](#timstof-objects-in-python) and store them as a persistent [HDF5 file](https://www.hdfgroup.org/solutions/hdf5/).
* Use Python's [slicing mechanism](#slicing-timstof-objects) to retrieve data from this object e.g. for visualisation.

### Bruker raw data
Expand All @@ -315,20 +315,20 @@ After reading the `PasefFrameMSMSInfo` or `DiaFrameMsMsWindows` table from the `
* A `quad_indptr` array that indexes the `tof_indptr` array. Each element points to an index of the `tof_indptr` where the voltage on the quadrupole and collision cell is adjusted. For PASEF acquisitions, this is typically 20 times per MSMS frame (turning on and off a value for 10 precursor selections) and once per change from an MS (precursor) frame to an MSMS (fragment) frame. For diaPASEF, this is typically twice to 10 times per frame and with a repetitive pattern over the frame cycle. This results in an array of approximately `len(quad_indptr) = 100 * gradient_length_in_seconds`. As with the `tof_indptr` array, this array is converted to an offset array with size `+1`.
* A `quad_low_values` array of `len(quad_indptr) - 1`. This array stores the lower m/z boundary that is selected with the quadrupole. For precursors without quadrupole selection, this value is set to -1.
* A `quad_high_values` array, similar to `quad_low_values`.
* A `precursor_indices` array of `len(quad_indptr) - 1`. For PASEF this array stores the index of the selected precursor. For diaPASEF, this array stores the `WindowGroup` of the fragment frame. As with the `quad_low_values` and `quad_high_values`, a value of -1 indicates a precursor without quadrupole selection.
* A `precursor_indices` array of `len(quad_indptr) - 1`. For PASEF this array stores the index of the selected precursor. For diaPASEF, this array stores the `WindowGroup` of the fragment frame. A value of 0 indicates an MS1 ion (i.e. precursor) without quadrupole selection.
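
The offset-array layout described in this list can be sketched with plain Python lists. All values below are invented toy data, and `push_index`, `quad_segment` and `quad_settings` are hypothetical names for this illustration, not part of AlphaTims. The sketch assumes that segment `i` covers the positions `quad_indptr[i]` up to but excluding `quad_indptr[i + 1]` of the `tof_indptr` array.

```python
from bisect import bisect_right

# Toy offset array with size +1: three segments need four boundaries.
quad_indptr = [0, 4, 6, 10]
# One entry per segment, i.e. len(quad_indptr) - 1 entries each.
quad_low_values = [-1.0, 700.0, -1.0]   # -1 marks no quadrupole selection
quad_high_values = [-1.0, 710.0, -1.0]
precursor_indices = [0, 1, 0]           # 0 marks MS1 ions without selection


def quad_segment(push_index):
    """Return the segment that contains a position of the tof_indptr array."""
    if not 0 <= push_index < quad_indptr[-1]:
        raise IndexError(f"push index {push_index} is out of range")
    # The last boundary that is <= push_index starts the matching segment.
    return bisect_right(quad_indptr, push_index) - 1


def quad_settings(push_index):
    """Return (quad_low, quad_high, precursor_index) for a position."""
    segment = quad_segment(push_index)
    return (
        quad_low_values[segment],
        quad_high_values[segment],
        precursor_indices[segment],
    )


print(quad_settings(5))
```

Because only segment boundaries are stored, the per-segment arrays stay small, and a binary search recovers the quadrupole settings for any position.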

After processing this summarising information from the `analysis.tdf` SQL database, the actual raw data from the `analysis.tdf_bin` binary file is read and stored in the empty `tof_indices`, `intensities` and `tof_indptr` arrays. This is done with the `tims_read_scans_v2` function from Bruker's `timsdata.dll` library (available in the [alphatims/ext](alphatims/ext) folder).
After processing this summarising information from the `analysis.tdf` SQL database, the actual raw data from the `analysis.tdf_bin` binary file is read and stored in the empty `tof_indices`, `intensities` and `tof_indptr` arrays.

Finally, three arrays are defined that allows quick translation of `frame_`, `scan_` and `tof_indices` to `rt_values`, `mobility_values` and `mz_values` arrays.
Finally, three arrays are defined that allow quick translation of `frame_`, `scan_` and `tof_indices` to `rt_values`, `mobility_values` and `mz_values` arrays.
* The `rt_values` array is read read directly from the `Frames` table in `analysis.tdf` and has a length equal to `frame_max_index + 1`. Note that an empty zeroth frame with `rt = 0` is created to make Python's 0-indexing compatible with Bruker's 1-indexing.
* The `mobility_values` array is defined by using the function `tims_scannum_to_oneoverk0` from `timsdata.dll` on the first frame and typically has a length of `1000`.
* Similarly, the `mz_values` array is defined by using the function `tims_index_to_mz` from `timsdata.dll` on the first frame. Typically this has a length of `400000`.
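
The zeroth-frame convention for `rt_values` can be illustrated with a small sketch. The retention times and the helper `rt_of_frame` are hypothetical and chosen only for this example.

```python
# Hypothetical retention times in seconds for Bruker frames 1, 2 and 3.
bruker_frame_times = [0.52, 0.63, 0.74]

# Prepend an empty zeroth frame with rt = 0, so that Bruker's 1-indexed
# frame ids can be used directly as 0-indexed Python positions.
rt_values = [0.0] + bruker_frame_times
frame_max_index = len(bruker_frame_times)


def rt_of_frame(bruker_frame_id):
    """Look up the retention time of a frame by its 1-indexed Bruker id."""
    return rt_values[bruker_frame_id]


print(len(rt_values) == frame_max_index + 1)
print(rt_of_frame(2))
```
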

All these arrays can be loaded into memory, taking up roughly twice as much RAM as the `.d` folder on disk. This increase in RAM memory is mainly due to the compression used in the `analysis.tdf_bin` file. If the Python object is stored as an HDF5 file, the empty `tof_indices` and `intensity` arrays can be created and filled on-disk, thereby minimizing RAM memory usage to less than 1 GB even for files that take up several GB on-disk. The HDF5 file can also be compressed so that its size is roughly halved and thereby has the same size as the Bruker `.d` folder, but (de)compression reduces accession times by 3-6 fold.
All these arrays can be loaded into memory, taking up roughly twice as much RAM as the `.d` folder on disk. This increase in RAM memory is mainly due to the compression used in the `analysis.tdf_bin` file. The HDF5 file can also be compressed so that its size is roughly halved and thereby has the same size as the Bruker `.d` folder, but (de)compression reduces accession times by 3-6 fold.

### Slicing TimsTOF objects

Once a Python TimsTOF object is available, it can be loaded into memory for ultrafast accession. Accession of the `data` object is done by simple Python slicing such as e.g. `selected_ion_indices = data[frame_selection, scan_selection, quad_selection, tof_selection]`. These ion indices are then easily parsed to a `pd.DataFrame` with the function `df = data.as_dataframe(selected_ion_indices)`. The columns of this dataframe contain all information, i.e. `frame`, `scan`, `precursor` and `tof` indices and `rt`, `mobility`, `quad_low`, `quad_high`, `mz` and `intensity` values.
Once a Python TimsTOF object is available, it can be loaded into memory for ultrafast accession. Accession of the `data` object is done by simple Python slicing such as e.g. `selected_ion_indices = data[frame_selection, scan_selection, quad_selection, tof_selection]`. This slicing returns a `pd.DataFrame` for subsequent analysis. The columns of this dataframe contain all information for all selected ions, i.e. `frame`, `scan`, `precursor` and `tof` indices and `rt`, `mobility`, `quad_low`, `quad_high`, `mz` and `intensity` values. See the [tutorial jupyter notebook](nbs/tutorial.ipynb) for usage examples.

---
## Future perspectives
Expand All @@ -341,4 +341,4 @@ Once a Python TimsTOF object is available, it can be loaded into memory for ultr
---
## How to contribute

All contributions are welcome. Feel free to post a new issue or clone the repository and create a PR with a new branch.
All contributions are welcome. Feel free to post a new issue or clone the repository and create a PR with a new branch. For more information see [the Contributors License Agreement](misc/CLA.md)
2 changes: 1 addition & 1 deletion alphatims/__init__.py
Expand Up @@ -2,7 +2,7 @@


__project__ = "alphatims"
__version__ = "0.0.210122"
__version__ = "0.0.210126"
__license__ = "MIT"
__description__ = "A python package for Bruker TimsTOF raw data accession and visualization"
__author__ = "Sander Willems"
Expand Down
20 changes: 17 additions & 3 deletions alphatims/bruker.py
Expand Up @@ -29,6 +29,7 @@
)
else:
logging.warning(
"WARNING: "
"No Bruker libraries are available for this operating system. "
"Intensities are uncalibrated, resulting in (very) small differences. "
"However, mobility and m/z values need to be estimated. "
Expand Down Expand Up @@ -754,6 +755,8 @@ def __init__(
)
logging.info(f"Succesfully imported data from {bruker_d_folder_name}")
self.slice_as_dataframe = slice_as_dataframe
# Precompile
self[0, "raw"]

def _import_data_from_d_folder(
self,
Expand Down Expand Up @@ -939,6 +942,7 @@ def convert_from_indices(
quad_indices=None,
scan_indices=None,
tof_indices=None,
return_raw_indices: bool = False,
return_frame_indices: bool = False,
return_scan_indices: bool = False,
return_quad_indices: bool = False,
Expand All @@ -964,6 +968,9 @@ def convert_from_indices(
The scan indices for which coordinates need to be retrieved.
tof_indices : np.int64[:], None
The tof indices for which coordinates need to be retrieved.
return_raw_indices : bool
If True, include "raw_indices" in the dict.
Default is False.
return_frame_indices : bool
If True, include "frame_indices" in the dict.
Default is False.
Expand Down Expand Up @@ -1040,6 +1047,8 @@ def convert_from_indices(
)
if (return_tof_indices or return_mz_values) and (tof_indices is None):
tof_indices = self.tof_indices[raw_indices]
if return_raw_indices:
result["raw_indices"] = raw_indices
if return_frame_indices:
result["frame_indices"] = frame_indices
if return_scan_indices:
Expand Down Expand Up @@ -1231,8 +1240,9 @@ def bin_intensities(self, indices: np.ndarray, axis: tuple):

def as_dataframe(
self,
raw_indices: np.ndarray,
indices: np.ndarray,
*,
raw_indices: bool = True,
frame_indices: bool = True,
scan_indices: bool = True,
quad_indices: bool = False,
Expand All @@ -1248,8 +1258,11 @@ def as_dataframe(
Parameters
----------
raw_indices : np.int64[:]
indices : np.int64[:]
The raw indices for which coordinates need to be retrieved.
raw_indices : bool
If True, include "raw_indices" in the dataframe.
Default is True.
frame_indices : bool
If True, include "frame_indices" in the dataframe.
Default is True.
Expand Down Expand Up @@ -1289,7 +1302,8 @@ def as_dataframe(
"""
return pd.DataFrame(
self.convert_from_indices(
raw_indices,
indices,
return_raw_indices=raw_indices,
return_frame_indices=frame_indices,
return_scan_indices=scan_indices,
return_quad_indices=quad_indices,
Expand Down
16 changes: 9 additions & 7 deletions alphatims/plotting.py
Expand Up @@ -24,10 +24,12 @@ def line_plot(
Parameters
----------
timstof_data : aphatims.bruker.TimsTOF
An aphatims.bruker.TimsTOF data object.
timstof_data : alphatims.bruker.TimsTOF
An alphatims.bruker.TimsTOF data object.
selected_indices : np.int64[:]
The raw indices that are selected for this plot
The raw indices that are selected for this plot.
These are typically obtained by slicing the TimsTOF data object with
e.g. data[..., "raw"].
x_axis_label : str
A label that is used for projection
(i.e. intensities are summed) on the x-axis. Options are:
Expand Down Expand Up @@ -116,7 +118,7 @@ def heatmap(
Parameters
----------
df : pd.DataFrame
A dataframe wirth coordinates.
A dataframe with coordinates.
This should be obtained by slicing an alphatims.bruker.TimsTOF object.
x_axis_label : str
A label that is used for projection
Expand All @@ -127,7 +129,7 @@ def heatmap(
- "Inversed IM, V·s·cm\u207B\u00B2"
y_axis_label : str
A label that is used for projection
(i.e. intensities are summed) on the x-axis. Options are:
(i.e. intensities are summed) on the y-axis. Options are:
- "m/z, Th"
- "RT, min"
Expand Down Expand Up @@ -207,8 +209,8 @@ def tic_plot(
Parameters
----------
timstof_data : aphatims.bruker.TimsTOF
An aphatims.bruker.TimsTOF data object.
timstof_data : alphatims.bruker.TimsTOF
An alphatims.bruker.TimsTOF data object.
title : str
The title of this plot.
Will be prepended with "TIC".
Expand Down
13 changes: 9 additions & 4 deletions alphatims/utils.py
Expand Up @@ -280,7 +280,7 @@ def set_threads(threads: int, set_global: bool = True) -> int:
threads : int
The number of threads.
If larger than available cores, it is trimmed to the available maximum.
If 0, it is set the the maximum cores available.
If 0, it is set to the maximum cores available.
If negative, it indicates how many cores NOT to use.
set_global : bool
If False, the number of threads is only parsed to a valid value.
Expand Down Expand Up @@ -415,6 +415,7 @@ def pjit(
*,
thread_count=None,
cache: bool = True,
**kwargs
):
"""A decorator that parallelizes the numba.njit decorator with threads.
Expand All @@ -439,7 +440,7 @@ def pjit(
Default is None.
cache : bool
See numba.njit decorator.
Default is True (in contrast to numba) .
Default is True (in contrast to numba).
Returns
-------
Expand All @@ -452,7 +453,11 @@ def pjit(
import numpy as np

def parallel_compiled_func_inner(func):
numba_func = numba.njit(nogil=True, cache=True)(func)
if "cache" in kwargs:
cache = kwargs.pop("cache")
else:
cache = True
numba_func = numba.njit(nogil=True, cache=cache, **kwargs)(func)

@numba.njit(nogil=True, cache=True)
def numba_func_parallel(
Expand Down Expand Up @@ -830,7 +835,7 @@ class Global_Stack(object):
i.e. option_value = self[option_key]
Attributes
- is_locked : bool
After each succesful update, undo or redo,
the stack is locked and cannot be modified unless explicitly unlocked.
Expand Down
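
A note on the `pjit` hunk in `alphatims/utils.py`: as far as the visible lines show, `cache` is a named keyword-only parameter of `pjit`, so Python binds it there and it can never end up in `**kwargs`. The toy model below reproduces only this keyword handling, without numba. The function names are hypothetical, and `fastmath` merely stands in for any extra option passed through `**kwargs`.

```python
def options_as_in_diff(*, thread_count=None, cache=True, **kwargs):
    """Mimic the keyword handling shown in the hunk (toy model, no numba)."""
    def inner():
        # "cache" is bound to the named parameter above, so it is never
        # found in kwargs and the else branch is always taken.
        if "cache" in kwargs:
            cache_used = kwargs.pop("cache")
        else:
            cache_used = True
        return {"nogil": True, "cache": cache_used, **kwargs}
    return inner()


def options_forwarding_cache(*, thread_count=None, cache=True, **kwargs):
    """One possible alternative: forward the named parameter directly."""
    return {"nogil": True, "cache": cache, **kwargs}


print(options_as_in_diff(cache=False, fastmath=True))
print(options_forwarding_cache(cache=False, fastmath=True))
```

In this toy model, the first function reports `cache` as `True` even when the caller passes `cache=False`, while the second one respects the caller's choice. Whether the same applies to the real decorator depends on code outside the visible hunk.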
Binary file modified docs/_build/doctrees/alphatims.bruker.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/alphatims.plotting.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/alphatims.utils.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.