imputeTestbenchG: imputation testbench for Genomics data

Neeraj Bokde edited this page Apr 3, 2022 · 5 revisions

Background

Data cleaning is one of the most important and time-consuming steps in data science and data analytics. Numerous methods, models, and algorithms exist for data cleaning, and they can be grouped under imputation, outlier detection, formatting, and visualization, among others. Evaluating these methods for a given dataset is challenging because of the volume of the data and the time/space complexity of the methods. Automating the performance evaluation can significantly reduce human effort and time while providing an unbiased comparison environment.

Last year, GSoC-2021 produced an R package named cleanTS (Publication), which has great potential for cleaning large time series efficiently, accurately, and without bias, while reducing human effort and intervention. The package is gaining popularity as a handy tool that can handle several anomalies in a time series simultaneously. Handling missing values and missing patterns is one of the crucial steps in cleanTS, and it is done mostly with the imputeTestbench (Publication) package.

The imputeTestbench package is an AutoML tool that automates the performance evaluation and comparison of imputation methods for a given time series dataset under different scenarios. Several research teams have used it for its ability to artificially generate missing patterns in a time series and to evaluate multiple imputation methods simultaneously. In its present form, imputeTestbench handles time series or temporal datasets. It also has great potential for genetic/genomic datasets, and there is scope to introduce parallel processing and high-performance computing concepts.

With this as motivation, we are inviting a contributor to adapt the imputeTestbench package to genomics applications with better computational capabilities. The new package will be named 'imputeTestbenchG'.
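The testbench idea described above (mask known values, impute them, and score each method against the truth) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper names (mask_mcar, impute_mean, impute_linear, rmse) are hypothetical and are not the imputeTestbench API, and the toy sine series stands in for a real dataset.

```r
# Minimal testbench sketch in base R (illustrative; not the imputeTestbench API).
set.seed(42)

# A toy series; any numeric vector without missing values would do.
x <- sin(seq(0, 4 * pi, length.out = 200))

# Remove a fraction of interior points completely at random (MCAR).
# The endpoints are kept so that linear interpolation is always defined.
mask_mcar <- function(x, frac) {
  idx <- sample(2:(length(x) - 1), size = round(frac * length(x)))
  x[idx] <- NA
  x
}

# Two candidate imputation methods (hypothetical helper names).
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
impute_linear <- function(x) {
  # approx() ignores the NA pairs and interpolates at every index.
  approx(seq_along(x), x, xout = seq_along(x))$y
}

# Error metric: root mean squared error against the complete series.
rmse <- function(truth, est) sqrt(mean((truth - est)^2))

methods <- list(mean = impute_mean, linear = impute_linear)
fractions <- c(0.1, 0.3, 0.5)

# Rows: missing fractions; columns: methods; cells: RMSE averaged over 20 repetitions.
results <- sapply(methods, function(f) {
  sapply(fractions, function(p) {
    mean(replicate(20, rmse(x, f(mask_mcar(x, p)))))
  })
})
rownames(results) <- paste0(fractions * 100, "%")
print(round(results, 4))
```

Because every method sees the same kind of artificial gaps and is scored with the same metric, the comparison does not depend on manual judgment, which is the point of automating the evaluation.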

Related work

The proposed package will be a modification of the imputeTestbench package, designed so that it can be integrated with the cleanTS package.

Details of your coding project

The goal of this coding project is to develop a new R package, named 'imputeTestbenchG'. The expected tasks for this project are as follows:

  • Understand the concept of the existing 'imputeTestbench' package and AutoML tools.
  • The package needs to be adapted to work with the various data formats used in genomics. It should accept DNA/RNA sequence data. Genomics data can be stored in a plain text file or in formats such as FASTQ, BAM, FASTA, VCF, WIG, etc.
  • Currently, the imputeTestbench package uses base R functions and data structures. Its performance can be improved by switching to data.table and integrating it with Apache Spark (or a similar system). Parallel processing can improve performance further.
  • Various imputation algorithms are designed specifically for genomics datasets. These yield far better results than generic imputation methods, so implementing such algorithms and providing them by default will be useful. The R package NAsImpute provides some of these imputation methods. The minimac3 paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5157836/) introduces an imputation method based on state space reduction of the Hidden Markov Model.
  • R's shiny package is a framework for creating web applications. A dashboard for the package would make it easier to use and would display output more intuitively. It would also let users work with the package without writing code.
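As a rough illustration of how the genomics and parallel-processing tasks above might fit together, the sketch below simulates a genotype matrix, masks entries at random, imputes them with a simple per-marker mode baseline, and spreads the repetitions over two workers with R's bundled parallel package. Everything here is an assumption made for illustration: the data are simulated, the 0/1/2 allele-count coding is one common convention, the helper names are hypothetical, and the mode imputer is only a placeholder baseline, not one of the genomics-specific algorithms mentioned above.

```r
# Sketch only: simulated genotypes, a mode-based baseline imputer, and
# repetitions distributed with the base 'parallel' package.
library(parallel)

set.seed(1)
n_samples <- 100
n_markers <- 50

# Simulated genotype matrix coded as allele counts 0/1/2 (samples x markers).
maf <- runif(n_markers, 0.05, 0.5)
geno <- sapply(maf, function(p) rbinom(n_samples, size = 2, prob = p))

# Mask a fraction of entries completely at random.
mask_genotypes <- function(g, frac) {
  idx <- sample(length(g), size = round(frac * length(g)))
  g[idx] <- NA
  g
}

# Baseline imputer: replace missing calls with the most frequent genotype per marker.
impute_mode <- function(g) {
  apply(g, 2, function(col) {
    counts <- table(col)  # table() drops NA by default
    col[is.na(col)] <- as.integer(names(counts)[which.max(counts)])
    col
  })
}

# Concordance: share of masked entries that were recovered exactly.
concordance <- function(truth, masked, imputed) {
  hidden <- is.na(masked)
  mean(truth[hidden] == imputed[hidden])
}

# One repetition: mask, impute, score. The index i is unused but required by parSapply.
one_run <- function(i, frac) {
  masked <- mask_genotypes(geno, frac)
  concordance(geno, masked, impute_mode(masked))
}

# A PSOCK cluster works on all platforms; two workers keep the sketch light.
cl <- makeCluster(2)
clusterSetRNGStream(cl, 123)  # reproducible random streams on the workers
clusterExport(cl, c("geno", "mask_genotypes", "impute_mode", "concordance"),
              envir = environment())
scores <- parSapply(cl, 1:10, one_run, frac = 0.2)
stopCluster(cl)

print(summary(scores))
```

Each repetition is independent, so the same pattern should scale to more workers or to a different backend; a genomics-specific imputer could be swapped in for impute_mode without changing the evaluation loop.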

Expected impact

This project will introduce a new R package that can be a stepping stone in AutoML applications in genomics data imputation processes.

Mentors

EVALUATING MENTOR: Neeraj Dhanraj Bokde, Postdoctoral researcher, Aarhus University, Denmark. [email protected]. Neeraj holds a Ph.D. in Data Science and has contributed several R packages related to time series analysis and testbenches, as well as domain-specific ones. Neeraj has been a GSoC mentor since 2020. https://www.neerajbokde.in/

Co-mentor: Mogens Sandø Lund, Director and Head of Center for Quantitative Genetics and Genomics, Aarhus University, Denmark. [email protected]. Mogens has focused on research on developing and applying statistical genetic models to estimate population parameters, effects of single genes, and prediction of total genetic merit using genome-wide markers.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

  • Easy: Download the imputeTestbench package and demonstrate it with a naturally occurring time series. Document it with RMarkdown.

  • Medium: Suggest possible updates or a new feature you would like to include in the next version of the imputeTestbench package.

  • Hard: Develop a dummy package with 5 functions and a vignette, and pass it through https://win-builder.r-project.org/ with no Error/Warning/Note.

Solutions of tests

Contributors, please post a link to your test results here.

  • EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
Contributor Name | GitHub Profile | Test Results
Mayur Shende | https://github.com/Mayur1009 | https://github.com/Mayur1009/GSoC22