Insufficient documentation #1

Open
cooketho opened this issue Sep 27, 2016 · 3 comments

Comments

@cooketho

I'd like to use HiC-Box to prepare my data for genome finishing using GRAAL, but I have a number of questions and points to raise about the documentation.

  1. The main page doesn't describe what the software does.

  2. There is no link from the HiC-Box page to the GRAAL page, or from the GRAAL page to HiC-Box. It is not apparent that the two programs are designed to work together, even though they are.

  3. There is no guidance for people who have already mapped and/or processed their sequencing reads and want to start HiC-Box downstream of the mapping step using, for example, a bam file or bowtie output.

  4. There is no description of the advanced parameters or guidance as to how to use them. The README says "tweak if needed" but doesn't say how to determine when tweaking is needed. What is "Total reads length"? Can it not handle reads of different lengths, for example from different experiments, or due to trimming? What is "Tag length"? My reads don't have a tag. Is HiC-Box going to try and trim 6 bp off anyway?

  5. Upon running "python main.py" a window pops up that prompts the user for reads in FASTQ format. Obviously 3C/Hi-C data is paired-end, and hence there are two FASTQ files (one for read one and one for read two), but there is only one box and it apparently accepts only one filename argument. This presents a problem, and I don't know how to proceed. The advanced settings box has a "Paired wise FASTQ" option that can be checked, which I presume relates to this, as well as a "Length paired wise FASTQ" option that is set to 3 by default. I don't know what this means. Does it mean that the FASTQ reads are meant to be supplied in interleaved format in groups of three? If so, this calls for pre-processing for which no instructions are given. Also, is it able to handle multiple FASTQ files for each read? Bowtie simply accepts a comma-separated list, but it's unclear what HiC-Box expects. Can it handle gzipped files? Also unclear.

  6. The instructions say to build a pyramid, but GRAAL also has a pyramid building step. Is this redundant? At which step am I supposed to stop with HiC-Box? Instructions are unclear.

  7. A comment related to point 3: If HiC-Box just took bowtie output as its input, there would be no need to ask many of these questions, since bowtie is already well-documented. One problem is that HiC-Box packages the functionality of bowtie in an obscure way (a "black box"). Is this necessary? If so, it would be beneficial to explain what it does and why (again related to point 1).

I'm fully aware that in a research environment it's difficult to keep the documentation up to speed with the latest projects. If my comments here seem long-winded, it's because I'm trying to help by giving thorough feedback. That said, I'd appreciate any advice/updates you can give. Thanks!

@rkoszul
Member

rkoszul commented Sep 27, 2016

  1. and 2. Good points - we will add descriptions as soon as possible.

  3. The HiC-Box visualizer (and GRAAL as well) needs a specific output that can't be derived from the bowtie output alone. For instance, it needs to know the position of every single restriction site in the genome and store a fragment list somewhere in order to build the pyramid. I can add a more detailed pyramid template in the documentation for people who wish to manually convert the combination of a bowtie output + a genome into a browsable pyramid if that's of any help.

  4. Reads should indeed be of equal length. If your reads don't have a tag, you can set tag length to 0.

  5. You can select several filenames - simply hold ctrl and select both as needed. You can also manually type in both file paths separated by a comma (/path/to/file.end1,/different/path/to/file.end2), or you can just select the one that ends in *.end1 and it will try to look for a file in the same folder with the same name that ends in *.end2.

  6. The pyramid-building step is the same in both tools and both need it; the difference is that the box lets you visualize the pyramid, while GRAAL reassembles it. Once the terminal outputs "ready for computation" your dataset is generated and you can skip to the GRAAL part and build the pyramids with it. Or you can click on Pyramid in HiC-Box, build it ("pyramid built") and then move on to GRAAL, which will then skip the pyramid-building step - it doesn't matter. Ideally the two programs would be merged so as to avoid redundant parts, but GRAAL can be very demanding and somewhat complicated to deploy on a machine due to its specific requirements regarding the parts written in CUDA, so HiC-Box can be thought of as a more lightweight pipeline for visualizing the data without necessarily running an assembly step.

  7. What the box does is get basic info about the genome and its restriction fragments, map the reads onto it with bowtie and convert the SAM output to a very large sparse matrix in text format that is basically the contact map, which is then processed into pyramids through binning and filtering etc. What you can do is run bowtie manually as you wish and place its output in HiC-Box's intended folder before running HiC-Box - when it detects a SAM file it will assume the alignment part has already been done and will simply convert the file into a matrix. This can be handy if you want to run bowtie with specific options.
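
To make the two accepted input styles for paired reads concrete (a comma-separated pair, or a single `*.end1` file whose `*.end2` sibling is looked up in the same folder), here is a minimal Python sketch. It is not taken from the HiC-Box source: `resolve_fastq_pair` is a hypothetical helper name, and the error handling is an assumption made for illustration.

```python
import os


def resolve_fastq_pair(selection):
    """Return (end1_path, end2_path) from a user-supplied selection string.

    Hypothetical helper, not part of HiC-Box: it only mirrors the two
    input styles described in the reply above.
    """
    parts = [p.strip() for p in selection.split(",") if p.strip()]
    if len(parts) == 2:
        # Both files were given explicitly, separated by a comma.
        return parts[0], parts[1]
    if len(parts) == 1 and parts[0].endswith(".end1"):
        # Only the first mate was given: look for a file with the same
        # name in the same folder that ends in ".end2" instead.
        end1 = parts[0]
        end2 = end1[: -len(".end1")] + ".end2"
        if not os.path.isfile(end2):
            raise FileNotFoundError("no matching .end2 file next to " + end1)
        return end1, end2
    raise ValueError("expected 'file.end1,file.end2' or one path ending in .end1")
```

Calling `resolve_fastq_pair("/path/to/file.end1,/different/path/to/file.end2")` returns the two paths unchanged; the single-file form only succeeds if the `.end2` sibling actually exists on disk.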
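
For readers who want a feel for the processing described in the last point (restriction fragments, then a sparse contact map, then binning), the following is a rough Python sketch under several assumptions: a single reference sequence, fragment boundaries placed at the start of each recognition site, mapped positions already extracted from the alignment as 0-based pairs (SAM positions are 1-based), and a simple fixed-factor binning. All function names are hypothetical, and this is not HiC-Box's actual code or file format.

```python
import bisect
from collections import Counter


def fragment_starts(sequence, site):
    """Return sorted 0-based start positions of restriction fragments."""
    starts = [0]
    pos = sequence.find(site)
    while pos != -1:
        if pos > 0:
            starts.append(pos)  # assumption: a new fragment begins at the site
        pos = sequence.find(site, pos + 1)
    return starts


def fragment_index(starts, position):
    """Index of the fragment that contains a 0-based position."""
    return bisect.bisect_right(starts, position) - 1


def contact_counts(starts, read_pairs):
    """Sparse contact map: {(i, j): count} with i <= j, one entry per fragment pair."""
    counts = Counter()
    for pos1, pos2 in read_pairs:
        i = fragment_index(starts, pos1)
        j = fragment_index(starts, pos2)
        counts[(min(i, j), max(i, j))] += 1
    return counts


def bin_contacts(counts, factor):
    """One coarser level: merge `factor` adjacent fragments into each bin."""
    binned = Counter()
    for (i, j), n in counts.items():
        binned[(i // factor, j // factor)] += n
    return binned


# Toy example: three GATC sites split a 24 bp sequence into four fragments.
genome = "AAGATCTTTTGATCCCCCGATCAA"
starts = fragment_starts(genome, "GATC")
pairs = [(3, 12), (0, 20), (11, 4)]  # mapped positions of the two mates
contacts = contact_counts(starts, pairs)
coarse = bin_contacts(contacts, 2)
```

A real dataset would need per-chromosome fragment lists, filtering of unmapped or low-quality pairs, and whatever on-disk layout HiC-Box and GRAAL expect, none of which is specified in this thread.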

@cooketho
Author

Thank you for the speedy and clear response!

I would be interested in manually converting the combination of bowtie output + genome into a browsable pyramid, so it would be a great help if you uploaded the pyramid template you referred to, as well as any other documentation I might need. When you say to place the bowtie output in HiC-Box's intended folder, is there a specific directory structure and/or file naming convention it expects? Do you have an example? Thanks again!

@rkoszul
Member

rkoszul commented Sep 28, 2016

I've added a short description. More should follow, especially details on what the box's output folder is like. Thank you for your feedback.
