import automation in some cases picks up the md5 sum file instead of the fastq file #421

Closed
aclum opened this issue Mar 10, 2025 · 2 comments · Fixed by #426

@aclum
Contributor

aclum commented Mar 10, 2025

{
"resources": [
{
"id": "nmdc:dobj-11-wsa33q95",
"type": "nmdc:DataObject",
"name": "52644.1.406928.GTGAGTGA-CTCTGGTT.fastq.gz.md5",
"file_size_bytes": 33,
"md5_checksum": "04eb414d478126b1f1b994ed424cc5c1",
"data_object_type": "Metagenome Raw Reads",
"was_generated_by": "nmdc:omprc-11-zdqkf654",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-zdqkf654//global/cfs/cdirs/m3408/ficus/pipeline_products/nmdc:omprc-11-zdqkf654/nmdc:omprc-11-zdqkf654/52644.1.406928.GTGAGTGA-CTCTGGTT.fastq.gz.md5",
"description": "Metagenome Raw Reads for nmdc:omprc-11-zdqkf654"
}
]
}

Fix the regex so that the md5 file doesn't get picked up. The import automation is picking up this file instead of the raw fastq file and specifying it as has_input to reads filtering.

cc @mbthornton-lbl

@AmitBinf
Contributor

AmitBinf commented Mar 10, 2025

Changing import_suffix for data_object_type: Metagenome Raw Reads in import.yaml and import-mt.yaml
from --> import_suffix: .[A,C,G,T]+-[A,C,G,T]+.fastq.gz
to --> import_suffix: \.[ACGT]+-[ACGT]+\.fastq\.gz$
should resolve this issue.

\. matches a literal dot (.).

[ACGT]+ matches one or more occurrences of the letters A, C, G, or T.

- matches a literal hyphen (-).

[ACGT]+ matches one or more occurrences of the letters A, C, G, or T again.

\.fastq\.gz matches the literal string .fastq.gz.

$ ensures that the pattern matches only at the end of the string, so it won’t match .fastq.gz.md5
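
To illustrate the difference between the two patterns, here is a minimal Python sketch. It assumes the import automation applies `import_suffix` as an unanchored regex search against the file name (for example with `re.search`); that is an assumption about the implementation, and the helper name `matches_suffix` is hypothetical. The file names are taken from the data object reported above.

```python
import re

# Patterns copied from the comment above.
OLD_SUFFIX = r".[A,C,G,T]+-[A,C,G,T]+.fastq.gz"
NEW_SUFFIX = r"\.[ACGT]+-[ACGT]+\.fastq\.gz$"


def matches_suffix(pattern: str, filename: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in filename.

    Assumes an unanchored search (re.search); the real import code may differ.
    """
    return re.search(pattern, filename) is not None


fastq_name = "52644.1.406928.GTGAGTGA-CTCTGGTT.fastq.gz"
md5_name = fastq_name + ".md5"

# The old pattern has no end anchor, so it also matches the .md5 file.
print(matches_suffix(OLD_SUFFIX, fastq_name))  # True
print(matches_suffix(OLD_SUFFIX, md5_name))    # True

# The new pattern ends with $, so only the fastq file matches.
print(matches_suffix(NEW_SUFFIX, fastq_name))  # True
print(matches_suffix(NEW_SUFFIX, md5_name))    # False
```

Under that assumption, the missing `$` anchor is what lets the old pattern match `.fastq.gz.md5`. The commas inside the old character classes are also treated as literal characters that may match, which the new pattern avoids.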

@aclum @mbthornton-lbl

@aclum
Contributor Author

aclum commented Mar 11, 2025

Please submit a PR including an updated test to this effect.
