data layout from SRA is not reliable #66

Open
aguang opened this issue May 13, 2020 · 0 comments

aguang commented May 13, 2020

The data layout field from SRA (i.e. paired-end or single-end) is not reliable. This has now led twice to issues where bioflows expects the data to be paired-end and searches for the mate file, but the data is not actually paired, so the workflow errors out. In one case (SRX1726841), fastq-dump split the file in two by putting the first half of the reads in one file and the second half in the other, but the read IDs did not match because the reads were never paired.

One potential solution is to write a check on the FASTQ files to see whether the first x read IDs match up. However, this would mean processing the FASTQ files an additional time, on top of the pass FastQC already makes over them, so it would increase bioflows run time and may not be desirable behavior.
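
A rough sketch of what such a check could look like, in Python. This is only an illustration, not existing bioflows code: the function names `first_read_ids` and `looks_paired` and the default of 100 reads are made up for the example. It also assumes plain 4-line FASTQ records (optionally gzipped), and that mates share the same first token in the header line, apart from an optional `/1` or `/2` suffix.

```python
import gzip
from itertools import islice


def first_read_ids(path, n=100):
    """Return the read IDs of the first n records of a FASTQ file.

    Assumes 4-line records (no wrapped sequence lines).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    ids = []
    with opener(path, "rt") as handle:
        # The header is the first line of every group of 4 lines.
        for header in islice(handle, 0, 4 * n, 4):
            fields = header[1:].split()  # drop the leading '@'
            read_id = fields[0] if fields else ""
            # Drop an optional mate suffix so that r1/1 and r1/2 compare equal.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            ids.append(read_id)
    return ids


def looks_paired(fastq_1, fastq_2, n=100):
    """True if the first n read IDs of both files match, in order."""
    ids_1 = first_read_ids(fastq_1, n)
    ids_2 = first_read_ids(fastq_2, n)
    # An empty file is never treated as evidence of pairing.
    return bool(ids_1) and ids_1 == ids_2
```

Because it stops after the first `4 * n` lines of each file, a check like this should not need a full extra pass over the data, although the actual impact on run time has not been measured. A case like the one above, where the second file holds the second half of the reads, would fail the comparison on the very first ID.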
