read screen incorrectly counting basepairs #588

kapsakcj · 2024-08-19T14:42:17Z

📝 Describe the Issue

I came across some ONT FASTQ files for which the (raw) read screen task fails to accurately count the number of basepairs. This bash one-liner...

public_health_bioinformatics/tasks/quality_control/comparisons/task_screen.wdl

Line 55 in 1508bb5

    
           read1_bp=`eval "${cat_reads} ~{read1}" | paste - - - - | cut -f2 | tr -d '\n' | wc -c`

...is failing due to tabs being present in the FASTQ header, example:

@95188ab4-78ef-44dd-8c17-67881e09547c   qs:f:3.89145    du:f:4.8936     mx:i:2  ch:i:2  st:Z:2024-08-11T18:05:38.980+00:00      rn:i:-1 fn:Z:FAW95599_9a7a72fa_fad33a0d_70.pod5 sm:f:414.135    sd:f:105.409      sv:Z:pa dx:i:0  RG:Z:fad33a0d89d184b44598c61e6e434edb1ebe861e_dna_r10.4.1_e8.2_400bps_sup@v5.0.0        pi:Z:974f9db2-55ba-4d24-a674-e7a22d5daa1a       sp:i:403278     BC:Z:V23004070
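To illustrate the failure mode, here is a minimal sketch using a synthetic one-record FASTQ (the read name, tags, and sequence are made up for the demo, and plain `cat` stands in for the task's `eval "${cat_reads} ~{read1}"`). Because `paste - - - -` joins the four lines of a record with tabs, any tab already in the header shifts the sequence out of field 2. A space-delimited copy of the same record is included as a control.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: the one-liner's paste/cut/wc logic applied to one file.
# The trailing `tr -d ' '` only strips the padding some wc builds add.
one_liner_bp() {
  cat "$1" | paste - - - - | cut -f2 | tr -d '\n' | wc -c | tr -d ' '
}

workdir=$(mktemp -d)

# Synthetic record with tab-separated header tags, like the ONT header above.
printf '@read1\tqs:f:3.89145\tch:i:2\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$workdir/tabs.fastq"
# Control: the same record with spaces in the header instead of tabs.
printf '@read1 qs:f:3.89145 ch:i:2\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$workdir/spaces.fastq"

tabs_bp=$(one_liner_bp "$workdir/tabs.fastq")      # field 2 is the "qs:f:3.89145" tag
spaces_bp=$(one_liner_bp "$workdir/spaces.fastq")  # field 2 is the 10-base sequence

echo "tab-delimited header:   ${tabs_bp} (true length 10)"
echo "space-delimited header: ${spaces_bp} (true length 10)"
rm -rf "$workdir"
```

In this sketch the tab-delimited record is counted as the length of the first header tag rather than the length of the sequence, while the space-delimited control is counted correctly.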

For this particular sample, the raw_read_screen outputs this and FAILS the sample:

"FAIL; the number of basepairs (1043344) is below the minimum of 2241820"

However, fastq-scan shows the true number of basepairs, which is much higher for this 360 MB .fastq.gz file:

$ zcat sample01_sup.fastq.gz | fastq-scan -q
{
    "qc_stats": {
        "total_bp": 434069800,
        "coverage": 0.00,
        "read_total": 87764,
        "read_min": 1,
        "read_mean": 4945.88,
        "read_std": 7898.69,
        "read_median": 2086,
        "read_max": 337738,
        "read_25th": 460,
        "read_75th": 5951,
        "qual_min": 2,
        "qual_mean": 28.9399,
        "qual_std": 12.0037,
        "qual_max": 47,
        "qual_median": 33,
        "qual_25th": 19,
        "qual_75th": 39
    }
}

🔁 How to Reproduce

Ask me for a link to the theiaprok_ont workflow in Terra where this behavior was observed. Prefer not to post this link publicly to preserve privacy.

Feel free to answer the following questions to help us understand:

Was the workflow run on the Terra platform? Was it Terra on Azure or GCP? Terra on GCP
- If necessary, we may ask you to share your Terra workspace with us. Usually READER access is sufficient, but we may ask for WRITER access if we need to make changes to the workspace to reproduce the issue.
Was the workflow run locally using miniwdl or cromwell?
- If so, what was the exact command used to launch the workflow?

💻 Version Information

TheiaProk_ONT v2.1.0


kapsakcj · 2024-08-19T14:46:17Z

fastq-scan is already present in the docker image used by the screen task. My suggestion to fix this bug is to use fastq-scan to count the number of bases instead of the bash one-liner.
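A rough sketch of what that could look like, not the code that ended up in the task: `count_bp` is a hypothetical helper name, and the use of `jq` to pull `total_bp` out of the fastq-scan JSON is an assumption (whether `jq` ships in the screen task's docker image is not confirmed here). If either tool is missing, the sketch falls back to summing sequence-line lengths with `awk`, which does not depend on tabs in the header.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_bp FILE: print the total number of bases in a plain or gzipped FASTQ.
# Hypothetical helper; not the function used in task_screen.wdl.
count_bp() {
  local fastq="$1" reader="cat"
  # Pick a reader based on the file extension.
  case "$fastq" in
    *.gz) reader="gzip -dc" ;;
  esac

  if command -v fastq-scan >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
    # fastq-scan reads FASTQ on stdin and prints JSON; total_bp sits under qc_stats.
    $reader "$fastq" | fastq-scan -q | jq '.qc_stats.total_bp'
  else
    # Fallback (assumption): sum the length of the 2nd line of every 4-line record.
    $reader "$fastq" | awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }'
  fi
}
```

Usage would be along the lines of `read1_bp=$(count_bp sample01_sup.fastq.gz)`. Both branches assume standard four-line FASTQ records.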

sage-wright · 2024-08-19T14:53:00Z

Does it work without the tabs? Is the cut -f2 grabbing just the qs:f:3.89145 section instead of the second line?

Either way, I agree using an established tool that is already in the docker image is a good idea.

kapsakcj · 2024-08-19T15:45:44Z

I'm not sure if it works without the tabs.

Yes, it's grabbing the qs:f:3.89145 string instead of the second line

kapsakcj · 2024-08-20T18:28:14Z

I started a dev branch to address this issue, called cjk-read-screen-ont-fix. Really quick fix with not a ton of thought put into it, but it resolved the issue with a couple ONT samples that failed previously.

We can obviously take the solution a different direction than what I implemented, but wanted to get our user past this blocker.

kapsakcj · 2024-08-26T18:53:08Z

Our PHL partner that reported this issue confirmed that the dev branch did resolve this issue for their recent batch of samples sequenced on ONT 👍

Despite the success, if the team thinks my solution is not robust enough or knows of a better way let's discuss.

kapsakcj added the bug label Aug 19, 2024

sage-wright assigned kapsakcj and unassigned kapsakcj Aug 28, 2024

sage-wright assigned kapsakcj Sep 4, 2024

kapsakcj mentioned this issue Oct 17, 2024

[TheiaCov & TheiaProk & TheiaEuk] read screen ONT bugfix and improvements #650 (Merged)

sage-wright closed this as completed in #650 Oct 17, 2024
