rtracklayer improvements
R is a popular tool for genomic data analysis, and the Bioconductor project provides a unified, extensible platform for generating insights from genomic data. Computational biologists depend on visualization to make inferences across loosely coupled datasets and annotations, particularly through their coincidence on the genome.
The UCSC Genome Browser is a popular tool for viewing data in the context of genomic annotations. UCSC enables users to share datasets through a repository known as a track hub. To optimize data retrieval, a track hub requires the data to be stored in binary form. UCSC also provides its own database of standard data and annotations, which are useful in local exploratory analyses.
There is an opportunity to improve support in R for (1) generating UCSC track hubs, (2) reading and writing the standard binary file format BigBed and (3) accessing the UCSC database.
- PeakSegPipeline has code for generating track hubs, and currently relies on the bedToBigBed command line program for creating bigBed files. The command line program cannot be portably installed on all supported R platforms (e.g. Windows), may be slower than an in-memory implementation because it has to read files from disk, and is also unavailable for commercial use.
- rtracklayer is a core Bioconductor package for interacting with the UCSC Genome Browser and interoperating with standard genomic file formats. It supports neither track hubs nor the BigBed format. Its mode of accessing the UCSC database directly manipulates HTML forms, instead of calling the newly introduced REST API.
- trackhub is a Python module for creating track hub meta-data files.
The interested student will implement at least two of the three proposed features for the rtracklayer package:
A track hub is a group of text files that describes a set of genomic data to display on the UCSC browser. It contains links to binary indexed files such as bigWig and bigBed. R needs a function for creating such files.
The student should implement functions such as
trackHub(
multiWig(
bigWig("http://path/to/data.bigWig", "red"),
bigWig("http://path/to/peaks.bigWig", "black")),
bigBed("http://path/to/labels.bigBed"),
trackDb="trackDb.txt",
genomes="genomes.txt",
db="hg19",
hub="hub.txt")
which would generate trackDb.txt, genomes.txt, and hub.txt. These files could then be uploaded to a web server for display on UCSC.
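As a rough illustration of what such a function has to produce, the following base R sketch writes the three text files for a flat list of tracks. The names `stanza` and `write_track_hub`, their arguments, and the example URLs and email address are hypothetical and are not part of rtracklayer or PeakSegPipeline; the sketch also ignores composite containers such as multiWig and only emits a small set of commonly used settings.

```r
# Hypothetical helper: turn named arguments into "key value" lines.
stanza <- function(...) {
  fields <- list(...)
  paste(names(fields), unlist(fields), collapse = "\n")
}

# Hypothetical sketch: write hub.txt, genomes.txt and trackDb.txt into `dir`.
# `tracks` is a list of lists with elements name, url, type and color.
write_track_hub <- function(dir, hub_name, email, db, tracks) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # hub.txt: describes the hub and points at the genomes file.
  writeLines(stanza(hub = hub_name, shortLabel = hub_name,
                    longLabel = hub_name, genomesFile = "genomes.txt",
                    email = email),
             file.path(dir, "hub.txt"))

  # genomes.txt: maps each assembly to its trackDb file.
  writeLines(stanza(genome = db, trackDb = "trackDb.txt"),
             file.path(dir, "genomes.txt"))

  # trackDb.txt: one stanza per track, separated by blank lines.
  stanzas <- vapply(tracks, function(tr) {
    rgb <- paste(grDevices::col2rgb(tr$color), collapse = ",")
    stanza(track = tr$name, bigDataUrl = tr$url, shortLabel = tr$name,
           longLabel = tr$name, type = tr$type, color = rgb)
  }, character(1))
  writeLines(paste(stanzas, collapse = "\n\n"), file.path(dir, "trackDb.txt"))

  invisible(dir)
}

# Example usage with placeholder URLs, written to a temporary directory.
hub_dir <- write_track_hub(
  tempfile("hub"), hub_name = "demoHub", email = "user@example.com",
  db = "hg19",
  tracks = list(
    list(name = "data", url = "http://path/to/data.bigWig",
         type = "bigWig", color = "red"),
    list(name = "labels", url = "http://path/to/labels.bigBed",
         type = "bigBed", color = "black")))
cat(readLines(file.path(hub_dir, "trackDb.txt")), sep = "\n")
```

A real implementation would presumably represent tracks as objects rather than plain lists, so that containers like multiWig can nest child tracks.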
The bigBed file format is useful for displaying genomic regions on UCSC track hubs. The student should implement a BigBedFile class with methods import, export, etc., similar to the existing BigWigFile class. This would involve implementing an R wrapper around the public domain C library for manipulating BigBed data.
The student will port the existing functionality in rtracklayer for retrieving data from the UCSC table browser so that it uses the new REST API instead.
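To give a sense of what the REST calls look like, here is a minimal base R sketch that builds request URLs for the UCSC API at api.genome.ucsc.edu. The function names `ucsc_api_url` and `fetch_ucsc` are hypothetical and not part of rtracklayer. The endpoints shown (`list/ucscGenomes`, `getData/track`) and the semicolon-separated parameters follow the examples in the UCSC API documentation as best understood here. The fetch helper assumes the jsonlite package is installed and is not called in the sketch.

```r
# Hypothetical helper: build a UCSC REST API URL from an endpoint and
# named query parameters.
ucsc_api_url <- function(endpoint, ..., base = "https://api.genome.ucsc.edu") {
  params <- list(...)
  url <- paste0(base, "/", endpoint)
  if (length(params) > 0) {
    values <- vapply(params, function(p) {
      # Avoid scientific notation for coordinates such as 100000.
      txt <- if (is.numeric(p)) format(p, scientific = FALSE, trim = TRUE)
             else as.character(p)
      utils::URLencode(txt, reserved = TRUE)
    }, character(1))
    url <- paste0(url, "?", paste0(names(params), "=", values, collapse = ";"))
  }
  url
}

# Hypothetical helper: download and parse the JSON response.
# Assumes jsonlite is available; requires network access when called.
fetch_ucsc <- function(url) {
  jsonlite::fromJSON(url)
}

genomes_url <- ucsc_api_url("list/ucscGenomes")
track_url <- ucsc_api_url("getData/track", genome = "hg38",
                          track = "gold", chrom = "chrM")
print(genomes_url)
print(track_url)
```

In a port of rtracklayer, the parsed JSON would still need to be converted into the objects the existing table browser interface returns, for example GRanges.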
This project will provide R/Bioconductor with functionality for creating track hubs, interoperating with bigBed files, and/or sustainably downloading data from the UCSC database.
Students, please contact mentors below after completing at least one of the tests below.
- EVALUATING MENTOR: Michael Lawrence [email protected] is a member of R-core and the author of R package rtracklayer, which will be extended by these projects.
- Toby Hocking [email protected] is the author of R package PeakSegPipeline which has code for generating track hubs, and currently relies on the bedToBigBed command line program for creating bigBed files.
Students, please do one or more of the following tests before contacting the mentors above.
- Write a test for the track hub export feature (create_track_hub function) in the PeakSegPipeline package.
- Identify some of the functions in the Kent library that would need to be called to read and write BigBed files. Explain (possibly with pseudocode that mentions those functions) how you plan to implement the read/write features.
- Use the restfulr package to retrieve the list of available UCSC genomes from their REST API. Provide a link to a web page with your R code and results/output (possibly using Rmd/Rpubs, http://rpubs.com/).
Students, please post a link to your test results here.
- Student name: Sanchit Saini
Email: [email protected]
University: Guru Gobind Singh Indraprastha University
Program: Master of Computer Applications(MCA)
Solution to Tests: Solutions