read screen incorrectly counting basepairs #588

Closed
kapsakcj opened this issue Aug 19, 2024 · 5 comments · Fixed by #650
Assignees
Labels
bug This issue is a bug: something doesn't work right

Comments

@kapsakcj
Contributor

🐛

📝 Describe the Issue

I came across some ONT data where the (raw) read screen task fails to accurately count the number of basepairs in some ONT FASTQ files. This bash one-liner...

read1_bp=`eval "${cat_reads} ~{read1}" | paste - - - - | cut -f2 | tr -d '\n' | wc -c`

...fails because tabs are present in the FASTQ header. Example:

@95188ab4-78ef-44dd-8c17-67881e09547c   qs:f:3.89145    du:f:4.8936     mx:i:2  ch:i:2  st:Z:2024-08-11T18:05:38.980+00:00      rn:i:-1 fn:Z:FAW95599_9a7a72fa_fad33a0d_70.pod5 sm:f:414.135    sd:f:105.409      sv:Z:pa dx:i:0  RG:Z:fad33a0d89d184b44598c61e6e434edb1ebe861e_dna_r10.4.1_e8.2_400bps_sup@v5.0.0        pi:Z:974f9db2-55ba-4d24-a674-e7a22d5daa1a       sp:i:403278     BC:Z:V23004070
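The failure can be reproduced on a tiny synthetic record. The sketch below is only an illustration: the read name, tags, and 10 bp sequence are made up (shortened from the header above), and `cat` stands in for whatever `${cat_reads}` resolves to in the task.

```shell
# Build a one-record FASTQ whose header carries tab-separated tags
# (synthetic data, not taken from the real sample).
fastq=$(mktemp)
printf '@read1\tqs:f:3.89145\tch:i:2\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$fastq"

# Same pipeline as the one-liner: paste joins each 4-line record with tabs,
# so the tabs already inside the header push the sequence out of field 2.
field2=$(cat "$fastq" | paste - - - - | cut -f2)
buggy_bp=$(cat "$fastq" | paste - - - - | cut -f2 | tr -d '\n' | wc -c)

echo "field 2: ${field2}"
echo "counted bp: ${buggy_bp} (the sequence is 10 bp)"
rm -f "$fastq"
```

Here field 2 is the first header tag rather than the sequence, so the count reflects the length of that tag instead of the read length.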

For this particular sample, the raw_read_screen outputs this and FAILS the sample:

"FAIL; the number of basepairs (1043344) is below the minimum of 2241820"

However, fastq-scan shows the true number of basepairs, which is much higher for this 360 MB .fastq.gz file:

$ zcat sample01_sup.fastq.gz | fastq-scan -q
{
    "qc_stats": {
        "total_bp": 434069800,
        "coverage": 0.00,
        "read_total": 87764,
        "read_min": 1,
        "read_mean": 4945.88,
        "read_std": 7898.69,
        "read_median": 2086,
        "read_max": 337738,
        "read_25th": 460,
        "read_75th": 5951,
        "qual_min": 2,
        "qual_mean": 28.9399,
        "qual_std": 12.0037,
        "qual_max": 47,
        "qual_median": 33,
        "qual_25th": 19,
        "qual_75th": 39
    }
}

🔁 How to Reproduce

Ask me for a link to the theiaprok_ont workflow in Terra where this behavior was observed. I'd prefer not to post this link publicly, to preserve privacy.

Feel free to answer the following questions to help us understand:

  • Was the workflow run on the Terra platform? Was it Terra on Azure or GCP? Terra on GCP
    • If necessary, we may ask you to share your Terra workspace with us. Usually READER access is sufficient, but we may ask for WRITER access if we need to make changes to the workspace to reproduce the issue.
  • Was the workflow run locally using miniwdl or cromwell?
    • If so, what was the exact command used to launch the workflow?

💻 Version Information

TheiaProk_ONT v2.1.0

kapsakcj added the bug label on Aug 19, 2024
@kapsakcj
Contributor Author

fastq-scan is already present in the docker image used by the screen task. My suggestion for fixing this bug is to use fastq-scan to count the number of bases instead of the bash one-liner.
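As a rough sketch of that suggestion, the count could be read from the `total_bp` field of fastq-scan's JSON report. This is an assumption-laden illustration, not the fix that was eventually merged: the report is pasted in as a stand-in (trimmed from the output above), `sed` is used so that no JSON tool has to be present in the image, and `min_basepairs` is a hypothetical variable name.

```shell
# Stand-in for the report; in the task it would come from something like
#   eval "${cat_reads} ~{read1}" | fastq-scan -q
# (hypothetical wiring, not taken from the actual fix).
report='{
    "qc_stats": {
        "total_bp": 434069800,
        "read_total": 87764
    }
}'

# Pull the integer that follows the "total_bp" key.
read1_bp=$(printf '%s\n' "$report" | sed -n 's/.*"total_bp": *\([0-9][0-9]*\).*/\1/p')

min_basepairs=2241820  # threshold quoted in the failure message above
if [ "$read1_bp" -lt "$min_basepairs" ]; then
    echo "FAIL; the number of basepairs (${read1_bp}) is below the minimum of ${min_basepairs}"
else
    echo "PASS"  # illustrative message only
fi
```

Because the value comes from fastq-scan's own parsing of the records, tabs in the header no longer affect it.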

@sage-wright
Member

Does it work without the tabs? Is the cut -f2 grabbing just the qs:f:3.89145 section instead of the second line?

Either way, I agree using an established tool that is already in the docker image is a good idea.

@kapsakcj
Contributor Author

I'm not sure if it works without the tabs.

Yes, it's grabbing the qs:f:3.89145 string instead of the second line
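For what it's worth, a quick check on synthetic data suggests the one-liner does count correctly when the header has no tabs. The sketch below compares the paste/cut approach with a position-based `awk` count on the same made-up read, with and without header tags. The function names are hypothetical, and the `awk` variant assumes strict 4-line FASTQ records; it is shown only as one possible alternative, not as the fix on the dev branch.

```shell
tmpdir=$(mktemp -d)
# Same synthetic 10 bp read, once with a plain header and once with a tabbed tag.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$tmpdir/plain.fastq"
printf '@read1\tqs:f:3.89145\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$tmpdir/tabbed.fastq"

count_paste() {  # the original approach: select the sequence by tab field
    paste - - - - < "$1" | cut -f2 | tr -d '\n' | wc -c
}
count_awk() {  # select the sequence by line position within each record
    awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }' "$1"
}

plain_paste=$(count_paste "$tmpdir/plain.fastq")
tabbed_paste=$(count_paste "$tmpdir/tabbed.fastq")
plain_awk=$(count_awk "$tmpdir/plain.fastq")
tabbed_awk=$(count_awk "$tmpdir/tabbed.fastq")

echo "paste/cut: plain=${plain_paste} tabbed=${tabbed_paste}"
echo "awk:       plain=${plain_awk} tabbed=${tabbed_awk}"
rm -rf "$tmpdir"
```

Only the paste/cut count changes when the tag is added; the position-based count stays at the true sequence length in both cases.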

@kapsakcj
Contributor Author

I started a dev branch to address this issue, called cjk-read-screen-ont-fix. Really quick fix with not a ton of thought put into it, but it resolved the issue with a couple ONT samples that failed previously.

We can obviously take the solution a different direction than what I implemented, but wanted to get our user past this blocker.

@kapsakcj
Contributor Author

Our PHL partner that reported this issue confirmed that the dev branch did resolve this issue for their recent batch of samples sequenced on ONT 👍

Despite the success, if the team thinks my solution is not robust enough or knows of a better way, let's discuss.

2 participants