Parsing input error #231

Open
Ge0rges opened this issue Jul 15, 2024 · 10 comments
Labels
logging Additional information should be logged troubleshooting workflow and data preparation questions

Comments


Ge0rges commented Jul 15, 2024

Hi @ArtRand,

The following command, which I believe I have run on identical files in the past (perhaps on 0.3.0), now seems to produce the error below:

modkit dmr multi \
  -s methylation_10/brevundimonas_r-contigs/barcode01.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode02.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode03.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode05.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode06.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode07.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode08.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode09.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode10.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode11.bed.gz barcode11 \
  -s methylation_10/brevundimonas_r-contigs/barcode12.bed.gz barcode12 \
  -s methylation_10/brevundimonas_r-contigs/barcode13.bed.gz barcode13 \
  -s methylation_10/brevundimonas_r-contigs/barcode14.bed.gz barcode14 \
  -r methylation_10/brevundimonas_r-contigs/gene-coordinates.txt \
  -o methylation_10/brevundimonas_r-contigs/dmr_by_gene/ \
  -t 20 \
  --ref mags/brevundimonas_r-contigs.fna \
  --base C \
  --base A \
  --min-valid-coverage 10

Error: > Error! Parsing Error: Error { input: "\t\t", code: Many1 }

Is this due to a change or formatting mistake in my input files that I might have missed, or does it look like a bug in modkit? The error is a bit mysterious.

Contributor

ArtRand commented Jul 15, 2024

@Ge0rges,

I agree, the parsing errors should be more informative. I'll fix that.

Could you tell me which version of modkit you used to generate the input data (the pileups)? Also could you attach or paste the gene-coordinates.txt file? (email is also fine).

Author

Ge0rges commented Jul 15, 2024

I used 0.3.1. Also, the gene-coordinates file is the problem: I just looked at it and it's not normal. Guess that was the issue! I'll fix it and confirm.

Author

Ge0rges commented Jul 15, 2024

Seems like that fixed it, @ArtRand. Next time I'll review my input files instead of trusting the script! Sneaky updates sneak past me...

@Ge0rges Ge0rges closed this as completed Jul 15, 2024
Contributor

ArtRand commented Jul 15, 2024

@Ge0rges I'm going to re-open this issue to track work for better error messages when input fails to parse. Some other users have encountered the same error and it's not clear enough what the problem is.

@ArtRand ArtRand reopened this Jul 15, 2024
@ArtRand ArtRand added the logging Additional information should be logged label Jul 15, 2024
@Rpowellnz

Hi @ArtRand,

I've also encountered a parsing error. I'm trying to run the script below, using the regions.bed.gz files output by wf_human_variation run with --mod as the sample inputs. I have also tried with the wf_mods.bedmethyl.gz files.

For -r/--regions-bed, I downloaded the NCBI RefSeq track in BED format.

# Define variables for paths
REF="/projects/health_sciences/oms/pathology/powry48p/202404ONT/reference/ref_genome/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
OUT_DIR="/weka/powry48p/results/modkit_output/"

# Run modkit dmr
./modkit dmr multi \
  -s barcode17.regions.bed.gz Tri102_1 \
  -s barcode19.regions.bed.gz Tri102_2 \
  -s barcode21.regions.bed.gz Tri103_1 \
  -s barcode23.regions.bed.gz Tri103_2 \
  -o $OUT_DIR \
  -r refseq.bed \
  --ref $REF \
  -m C \
  --log-filepath dmr_multi.log

Error:

error fetching line from regions BED, stream did not contain valid UTF-8
error fetching line from regions BED, stream did not contain valid UTF-8
Error! Parsing Error: Error { input: "= {", code: Digit }

Any tips would be appreciated, thanks!

Contributor

ArtRand commented Aug 1, 2024

Hello @Rpowellnz,

Could you tell me what the output of

$ head -n 5 refseq.bed 

looks like?

@ArtRand ArtRand added the troubleshooting workflow and data preparation questions label Aug 1, 2024
@Rpowellnz

Hi @ArtRand,

The output from $ head -n 5 refseq.bed is below, and I'm guessing it is not correctly formatted. Could you provide some guidance on how to generate an appropriate .bed file for -r, for a genome-wide differential methylation analysis of protein-coding genes?

bplist00�_WebMainResource�

_ebResourceTextEncodingName_WebResourceData_WebResourceMIMEType_WebResourceFrameName^WebResourceURLUUTF-8O�S<style type="text/css"></style>

chr1	201283451	201332993	NM_000299	0	+	201283702	201328836	0	15	453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920,	0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
chr1 67092165 67134970 NM_001276351 0 - 6709300467127240 0 8 1439,187,70,113,158,92,86,41, 0,3069,4086,23186,33586,35000,38976,42764,
chr1 201283505 201332989 NM_001005337 0 + 201283702 201328836 0 14 399,104,395,145,208,178,115,156,177,154,187,85,107,2916, 0,10436,29660,33047,34066,35112,36761,38472,39507,40922,41435,42248,45256,46568,
chr1 67092165 67134970 NM_001276352 0 - 6709357967127240 0 9 1439,70,145,68,113,158,92,86,41, 0,4086,11072,19411,23186,33586,35000,38976,42764,

Contributor

ArtRand commented Aug 1, 2024

Hello @Rpowellnz,

You certainly need to remove any of those HTML tags at the start. The BED file should be a plain text file with 3 or 4 tab-separated fields: chrom, start, end, <name> (<name> is optional). You should also remove those blank lines.
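
The requirements described above (plain text, 3 or 4 tab-separated fields, no blank lines) can be checked before running modkit. The following is a minimal sketch, not part of modkit: the function name check_regions_bed is hypothetical, and the rule that start and end must be non-negative integers is an assumption based on the usual BED convention rather than something stated in this thread.

```python
import re
from pathlib import Path

# Assumption: start/end are plain non-negative integers, as in standard BED.
_INTEGER = re.compile(r"[0-9]+")


def check_regions_bed(path):
    """Return a list of (line_number, problem) tuples for a regions BED file.

    Hypothetical helper; an empty list means no problems were found.
    """
    problems = []
    # Read bytes so that a binary header (e.g. a web archive) is reported
    # per line instead of aborting the whole read.
    for number, raw_line in enumerate(Path(path).read_bytes().splitlines(), start=1):
        try:
            line = raw_line.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        fields = line.split("\t")
        if len(fields) not in (3, 4):
            problems.append(
                (number, f"expected 3 or 4 tab-separated fields, found {len(fields)}")
            )
            continue
        chrom, start, end = fields[:3]
        if not chrom:
            problems.append((number, "empty chrom field"))
        elif not (_INTEGER.fullmatch(start) and _INTEGER.fullmatch(end)):
            problems.append((number, "start and end must be non-negative integers"))
    return problems
```

Calling check_regions_bed on the file passed to -r and printing the returned tuples gives the line numbers to inspect. A space-separated line is reported as having one field, which makes a tabs-versus-spaces mix-up easy to spot.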

@Rpowellnz

Hi @ArtRand

I removed the HTML tags, so now $ head -n 5 refseq1.bed produces the output below.

chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
chr1 67092165 67134970 NM_001276351 0 - 67093004 67127240 0 8 1439,187,70,113,158,92,86,41, 0,3069,4086,23186,33586,35000,38976,42764,
chr1 201283505 201332989 NM_001005337 0 + 201283702 201328836 0 14 399,104,395,145,208,178,115,156,177,154,187,85,107,2916, 0,10436,29660,33047,34066,35112,36761,38472,39507,40922,41435,42248,45256,46568,
chr1 67092165 67134970 NM_001276352 0 - 67093579 67127240 0 9 1439,70,145,68,113,158,92,86,41, 0,4086,11072,19411,23186,33586,35000,38976,42764,
chr1 67092165 67134970 NR_075077 0 - 67134970 67134970 0 10 1439,70,145,68,143,113,158,92,86,41, 0,4086,11072,19411,21448,23186,33586,35000,38976,42764,

Running modkit dmr as below still produces the error:

./modkit dmr multi \
  -s barcode17.regions.bed.gz Tri102_1 \
  -s barcode19.regions.bed.gz Tri102_2 \
  -s barcode21.regions.bed.gz Tri103_1 \
  -s barcode23.regions.bed.gz Tri103_2 \
  -o $OUT_DIR \
  -r refseq1.bed \
  --ref $REF \
  -m C \
  --log-filepath dmr_multi.log

Error! Parsing Error: Error { input: "= {", code: Digit }
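
Given the earlier guidance that the regions file should have 3 or 4 tab-separated fields, one thing that may be worth trying is reducing the 12-column rows shown above to chrom, start, end, name. The sketch below is only an illustration: the function name to_four_column_bed is hypothetical, it assumes names contain no whitespace, and nothing in this thread confirms that the extra columns are what triggers the error.

```python
def to_four_column_bed(lines):
    """Yield tab-separated 'chrom, start, end, name' rows from wider BED rows.

    Hypothetical helper. Input fields may be separated by tabs or spaces;
    lines with fewer than three fields (including blank lines) are skipped.
    """
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) < 3:
            continue
        # Keep at most the first four columns and re-join them with tabs.
        yield "\t".join(fields[:4])
```

For example, opening refseq1.bed, passing the file handle to to_four_column_bed, and writing each yielded row plus a newline to a new file gives a four-column, tab-separated regions file to pass to -r.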

Contributor

ArtRand commented Aug 27, 2024

@Rpowellnz The latest version will report which file is failing to parse. Could you confirm that the issue is with the argument to -r (the regions file)?
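
On a version that does not yet name the failing file, one way to narrow it down is to search every input for the fragment quoted in the error (here "= {"). This is a rough sketch under the assumption that the `input` value in the error is text taken from the offending file; the function name find_fragment is hypothetical and not part of modkit.

```python
import gzip


def find_fragment(paths, fragment):
    """Return (path, line_number, line) for every line containing `fragment`.

    Hypothetical helper. Files ending in .gz are read through gzip; bytes
    that are not valid UTF-8 are replaced rather than raising an error.
    """
    hits = []
    for path in paths:
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if fragment in line:
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits
```

Passing the -s sample files together with the -r regions file, and "= {" as the fragment, lists any file and line where that text occurs.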
