Format input data

Zhe Zhang edited this page Jun 1, 2016 · 2 revisions

Properly formatting input data is the most critical step in generating an RoCA report. In principle, acceptable input data is template-specific, because each R Markdown template imports data from the files specified in its YAML file and checks the validity of that data at runtime. Template developers should therefore provide sufficient information about the input data. Developers are strongly encouraged to:

  • describe the requirements of input data at the beginning of the YAML file, such as
    • file type: Excel, txt, etc.
    • variable type: character, numeric, etc.
    • ID type of row names: gene symbol, Entrez ID, etc.
    • matching names, such as sample names vs. column names of the data matrix
    • whether missing values are allowed
  • provide sample data known to be acceptable to the template
  • handle ill-formatted data properly within the R Markdown template
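
As an illustration of the last recommendation, a template might validate its input before running any analysis. The following is a minimal sketch in base R; `CheckInputMatrix` and its arguments are hypothetical names chosen for this example and are not part of the RoCA package.

```r
# Hypothetical helper (not part of RoCA): validate a data matrix
# against the sample names a template expects.
CheckInputMatrix <- function(mat, sample.names, allow.na = FALSE) {
  if (!is.matrix(mat) || !is.numeric(mat)) {
    stop("Input must be a numeric matrix.")
  }
  if (is.null(rownames(mat)) || anyDuplicated(rownames(mat)) > 0) {
    stop("Row names must be present and unique.")
  }
  # Every expected sample must appear among the column names.
  missing.samples <- setdiff(sample.names, colnames(mat))
  if (length(missing.samples) > 0) {
    stop("Samples not found in column names: ",
         paste(missing.samples, collapse = ", "))
  }
  if (!allow.na && anyNA(mat)) {
    stop("Missing values are not allowed.")
  }
  invisible(TRUE)
}

expr <- matrix(c(1.2, 3.4, 5.6, 7.8), nrow = 2,
               dimnames = list(c("TP53", "BRCA1"), c("S1", "S2")))
CheckInputMatrix(expr, sample.names = c("S1", "S2"))
```

Stopping with a specific message, rather than letting a later step fail, tells the user which requirement the input violates.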

To reduce the burden of preparing and importing input data, the RoCA package also supports several common data formats, as described below.

Import data from local files in supported formats

RoCA provides helper functions to import several common data types formatted as described below. The helper functions assume that the data is saved in a local file; they do not yet support importing data from a remote file.

Table-like data

The table-like data supported by RoCA must meet the following requirements:

  • The first column must be unique row names.
  • The first row must be unique column names.
  • Each column is a character or numeric vector.
  • Columns are separated by the default separator: ',' for .csv files and '\t' for .txt and .tab files
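
To make these requirements concrete, the sketch below writes a small compliant table to .csv and .txt files and reads it back. It uses base R only (`write.csv`, `write.table`, `read.csv`, `read.delim`), not RoCA's own import function, and the gene and sample names are made up for illustration.

```r
# A small table with unique row names and unique column names.
tbl <- data.frame(S1 = c(1.2, 3.4), S2 = c(5.6, 7.8),
                  row.names = c("TP53", "BRCA1"))

# Comma-separated: row names are written as the first column.
csv.file <- tempfile(fileext = ".csv")
write.csv(tbl, csv.file, quote = FALSE)

# Tab-separated: col.names = NA writes an empty header cell above the row names.
txt.file <- tempfile(fileext = ".txt")
write.table(tbl, txt.file, sep = "\t", quote = FALSE, col.names = NA)

# row.names = 1 uses the first column as row names;
# check.names = FALSE keeps column names exactly as written.
from.csv <- read.csv(csv.file, row.names = 1, check.names = FALSE)
from.txt <- read.delim(txt.file, row.names = 1, check.names = FALSE)
```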

The ImportTable {RoCA} function can be used to import table-like data saved in any of the following file types. It determines the file type from known file extensions, such as .txt and .html, so users can import any supported file with the same function. File extensions are case-insensitive. By default, the first row holds the unique column names and the first column holds the unique row names in all text, Excel, and HTML files.

  • R file (.rdata, .rda, .rds): could be a matrix or data.frame
  • Tab separated text file (.txt, .tab): with '\t' as separator by default
  • Comma separated text file (.csv): with ',' as separator by default
  • Excel file (.xls, .xlsx): import the first worksheet by default
  • HTML file (.html, .htm): import the first table on the html page by default
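
Case-insensitive detection by file extension can be sketched as follows. This is not RoCA's implementation; `GuessTableFormat` and the format labels it returns are hypothetical and only show the idea, using `tools::file_ext` from base R.

```r
# Hypothetical helper (not RoCA's code): map a file name to a format label
# based on its extension, ignoring case.
GuessTableFormat <- function(file.name) {
  ext <- tolower(tools::file_ext(file.name))
  switch(ext,
         rdata = , rda = , rds = "r",
         txt = , tab = "tab-separated",
         csv = "comma-separated",
         xls = , xlsx = "excel",
         html = , htm = "html",
         stop("Unsupported file extension: ", ext))
}

GuessTableFormat("counts.CSV")
GuessTableFormat("expression.Rda")
```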

List-like data

RoCA supports list-like data saved in Excel or text files as long as they meet the following requirements. Note, however, that neither Excel nor text files are a recommended option for storing list-like data.

  • The list has only one level
  • There is no header line unless it is commented out
  • First value of each row is the element name
  • In text files, values are separated by the default separator: ',' for .csv files and '\t' for .txt and .tab files
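
The sketch below shows what a compliant tab-separated list file might look like and how it maps to a one-level named list. It uses base R only, not RoCA's import function. The source does not say which comment character is recognized, so the use of '#' for the commented header is an assumption, and the set names are made up.

```r
# Each row: element name first, then that element's values.
txt.file <- tempfile(fileext = ".txt")
writeLines(c("# set name, then members",       # commented header ('#' is assumed)
             "apoptosis\tTP53\tBAX\tCASP3",
             "repair\tBRCA1\tATM"), txt.file)

rows <- readLines(txt.file)
rows <- rows[!grepl("^#", rows)]                # drop commented lines
fields <- strsplit(rows, "\t", fixed = TRUE)

# The first value of each row names the element; the rest are its values.
gene.sets <- lapply(fields, function(x) x[-1])
names(gene.sets) <- vapply(fields, function(x) x[1], character(1))
```

Rows may have different lengths, which is why a list rather than a table is the natural structure here.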

The ImportList {RoCA} function can be used to import list-like data saved in any of the following file types. It determines the file type from known file extensions, such as .txt and .rdata, so users can import any supported file with the same function. File extensions are case-insensitive, and by default the first value of each row is the element name in all text and Excel files.

  • R file (.rdata, .rda, .rds): loaded exactly as it was saved
  • Tab separated text file (.txt, .tab, .bed): with '\t' as separator and first value as element name by default
  • Comma separated text file (.csv): with ',' as separator and first value as element name by default
  • Excel file (.xls, .xlsx): import the first worksheet, with the first column as element name by default
  • HTML file (.html, .htm): import the first list on the html page by default
  • YAML file (.yaml, .yml): could have multiple levels
  • JSON file (.json): could have multiple levels

Vector-like data

RoCA supports vector-like data saved in Excel or text files as long as they meet the following requirements.

  • There is no header line unless it is commented out
  • If the file has a single row or a single column, it will be imported as a nameless vector; if the file has at least 2 columns, the first column will be used as element names and the second column as element values.
  • In text files, values are separated by the default separator: ',' for .csv files and '\t' for .txt and .tab files
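
The two cases can be sketched in base R as follows. This only illustrates the file layouts described above, not RoCA's import function, and the names and values are made up.

```r
# Single column: imported as a nameless vector.
one.col <- tempfile(fileext = ".txt")
writeLines(c("0.5", "1.5", "2.5"), one.col)
nameless <- scan(one.col, quiet = TRUE)

# Two columns, no header: first column gives names, second gives values.
two.col <- tempfile(fileext = ".csv")
writeLines(c("TP53,0.5", "BRCA1,1.5"), two.col)
kv <- read.csv(two.col, header = FALSE, stringsAsFactors = FALSE)
named <- setNames(kv[[2]], kv[[1]])
```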

The ImportVector {RoCA} function can be used to import vector-like data saved in any of the following file types. It determines the file type from known file extensions, such as .txt and .rdata, so users can import any supported file with the same function. File extensions are case-insensitive. The imported data will be a nameless vector if the Excel or text file has a single column or row. If the file has at least two columns, the first column supplies the element names and the second column supplies the values.

  • R file (.rdata, .rda, .rds): loaded exactly as it was saved
  • Tab separated text file (.txt, .tab): with '\t' as separator by default
  • Comma separated text file (.csv): with ',' as separator by default
  • Excel file (.xls, .xlsx): import the first worksheet by default

R objects

R objects can be saved in .rdata, .rda, or .rds files. The ImportR {RoCA} function recognizes these file types and loads the R objects exactly as they were saved.
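
For reference, the two families of R files are written and read differently in base R. The sketch below uses `saveRDS`/`readRDS` and `save`/`load` directly; how ImportR handles each case internally is not described here, so this is only an illustration of the file types.

```r
expr <- matrix(1:4, nrow = 2,
               dimnames = list(c("g1", "g2"), c("S1", "S2")))

# .rds stores a single object without its name.
rds.file <- tempfile(fileext = ".rds")
saveRDS(expr, rds.file)
from.rds <- readRDS(rds.file)

# .rdata / .rda stores one or more named objects; load() restores them
# by name and returns those names.
rda.file <- tempfile(fileext = ".rdata")
save(expr, file = rda.file)
env <- new.env()
loaded.names <- load(rda.file, envir = env)
from.rda <- get(loaded.names[1], envir = env)
```

Loading into a separate environment avoids overwriting objects of the same name in the current workspace.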

Import data from remote files
