Mayur Shende edited this page Apr 7, 2021 · 3 revisions

Background

Data cleaning is one of the most important tasks in data analysis. Detecting and handling dirty data is a perennial challenge in data analytics, and failing to do so can result in inaccurate analytics and unreliable decisions. Real-world data is usually unclean and requires pre-processing, and errors are especially prevalent in time series data. Cleaning such data properly takes a lot of time, and the difficulty grows with the amount of data. This project aims to provide an easy-to-use and reliable system that automates the cleaning of univariate time series data, which greatly reduces the time required. Because visualizing a large amount of data at once is not very effective, the proposed package provides a way to analyze the data at different scales and resolutions. It also provides users with tools and a benchmark system for comparing various data cleaning techniques.
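As a rough illustration of what automated cleaning of a univariate series could involve, the sketch below fills missing values by linear interpolation and then replaces points that lie far from the median. It uses base R only. The function name `clean_series`, the threshold `k`, and the median/MAD rule are assumptions made for this example and are not taken from 'cleanTS' or any other package.

```r
# Hypothetical helper for illustration; not part of any released package.
clean_series <- function(x, k = 3) {
  stopifnot(is.numeric(x), sum(!is.na(x)) >= 2)
  idx <- seq_along(x)
  ok <- !is.na(x)

  # Step 1: fill gaps by linear interpolation. rule = 2 carries the
  # nearest observed value outward for leading or trailing gaps.
  filled <- approx(idx[ok], x[ok], xout = idx, rule = 2)$y

  # Step 2: flag points more than k scaled MADs away from the median.
  centre <- median(filled)
  spread <- mad(filled)
  outlier <- rep(FALSE, length(filled))
  if (spread > 0) {
    outlier <- abs(filled - centre) > k * spread
  }

  # Step 3: treat flagged points as missing and interpolate again.
  cleaned <- filled
  if (any(outlier) && sum(!outlier) >= 2) {
    cleaned <- approx(idx[!outlier], filled[!outlier], xout = idx, rule = 2)$y
  }

  list(cleaned = cleaned, was_missing = !ok, was_outlier = outlier)
}

# Toy series with two gaps and one obvious spike.
x <- c(10, 11, NA, 13, 100, 15, 16, NA, 18, 19)
res <- clean_series(x)
print(res$cleaned)
```

A global median/MAD rule is only reasonable for series without a strong trend or seasonality. A real cleaning package would more likely detect outliers on residuals or within local windows.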

Related work

In R, there are several packages for data imputation, data formations, and outlier detection, but none of them addresses time series cleaning as its sole target. Moreover, an automated (AutoML) approach to such analysis has not yet been observed in R.

Details of the coding project

The goal of this coding project is to develop a new R package, named 'cleanTS'. The expected tasks for this project are as follows:

  • to understand the concepts behind existing AutoML tools.
  • to review the R packages that can be used for time series cleaning (such as outlier detection and imputation) and handling.
  • to develop various functions for the 'cleanTS' package (including time series analysis, statistics, use of testbenches, summary reporting, visualization, a graphical user interface, etc.) as guided by the mentors.
  • to maintain and publish the R package (e.g. building, installing, checking, version control, pull requests, and GitHub).
  • to demonstrate the potential of the developed package on several real-world problems.
  • to work on novel animated visualizations of time series.
  • to maintain documentation, tests, vignettes, and a website for the newly developed package.
  • to draft a manuscript describing the newly developed package for journal submission.

Expected impact

This project will introduce a new R package that can serve as a stepping stone for AutoML applications in time series cleaning.

Mentors

Please get in touch after completing at least one of the tests below.

EVALUATING MENTOR: Neeraj Dhanraj Bokde, Postdoctoral Researcher, Aarhus University, Denmark [email protected]. Neeraj holds a Ph.D. in Data Science and has contributed several R packages related to time series analysis, testbenches, and domain-specific applications. Neeraj has previously been a GSoC mentor (https://summerofcode.withgoogle.com/archive/2020/projects/5767451238727680/).

Co-mentor: Andrés E. Feijóo-Lorenzo, Associate Professor, University of Vigo, Spain [email protected]. Andrés holds a Ph.D. in Electrical Engineering and has extensive experience in wind energy analysis.

Tests

Students, please do one or more of the following tests before contacting the mentors above.

  • Easy: Download the 'imputeTestbench' package and demonstrate it with a naturally occurring time series. Document it with RMarkdown.
  • Medium: Suggest possible updates or a new feature you would like to include in the next version of the 'imputeTestbench' package.
  • Hard: Develop a dummy package with 5 functions and a vignette, and pass it through https://win-builder.r-project.org/ with no Error/Warning/Note.
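For orientation, the general idea behind an imputation testbench can be sketched in base R: hide known values from a complete series, impute them with competing methods, and compare the errors. This is an illustration of the concept only and does not use the 'imputeTestbench' API. The helper names `impute_linear`, `impute_mean`, and `rmse`, the 20% missing rate, and the choice of the built-in `AirPassengers` dataset are all assumptions made for this example.

```r
# Concept sketch only: base R, not the 'imputeTestbench' API.
set.seed(42)

truth <- as.numeric(AirPassengers)  # complete monthly series shipped with R
n <- length(truth)

# Hide about 20% of the interior points at random. The endpoints are kept
# so that linear interpolation is defined everywhere.
hidden <- sort(sample(2:(n - 1), size = round(0.2 * n)))
observed <- truth
observed[hidden] <- NA

# Two simple imputation methods to compare (hypothetical helpers).
impute_linear <- function(x) {
  ok <- !is.na(x)
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Root mean squared error, evaluated only at the hidden positions.
rmse <- function(est, ref, at) sqrt(mean((est[at] - ref[at])^2))

scores <- c(
  linear = rmse(impute_linear(observed), truth, hidden),
  mean   = rmse(impute_mean(observed), truth, hidden)
)
print(scores)
```

A fuller benchmark would repeat the masking many times, vary the missing percentage, and also test block-wise gaps rather than only isolated random points.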

Solutions of tests

Students, please post a link to your test results here.

  • EXAMPLE STUDENT 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
S No. | STUDENT NAME | GITHUB PROFILE | TEST RESULTS LINK
1     | Mayur Shende | Mayur1009      | Tests