Commit

Merge pull request #71 from CDCgov/master
Update dev to v1.2.1
dthoward96 committed Sep 11, 2024
2 parents b3e339f + 2e0bd2f commit c3ce914
Showing 520 changed files with 11,498 additions and 8,943 deletions.
3 changes: 3 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
+/shiny/*
+/vignettes/*
+/docs/*
39 changes: 11 additions & 28 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,44 +1,27 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
about: Create a report to help us improve SeqSender
title: "[BUG]"
labels: bug
assignees: dthoward96

---

**Describe the bug**
A clear and concise description of what feature is not working.

**Impact**
Please describe the impact this bug is causing to your program or organization.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.
- Which databases are you attempting submission to?
- Is the error related to a specific database/metadata field? If so, which?
- Steps to reproduce the behavior:

**Logs**
If applicable, please attach logs to help describe your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]
**Version**
- Version Number: [e.g. v1.2.0.]
- SeqSender Version: [e.g. Singularity, Docker, Script]
- OS [e.g. Linux, Mac]

**Additional context**
Add any other context about the problem here.
14 changes: 7 additions & 7 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -1,20 +1,20 @@
---
name: Feature request
-about: Suggest an idea for this project
-title: ''
-labels: ''
+about: Suggest a new feature for SeqSender
+title: "[FEATURE REQUEST]"
+labels: enhancement
assignees: ''

---

-**Is your feature request related to a problem? Please describe.**
+**Is this feature for general SeqSender usage or for submitting to a specific database, if so what database?**
+SeqSender or a specified database name.
+
+**Is your feature request related to a problem? If so please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

-**Describe alternatives you've considered**
-A clear and concise description of any alternative solutions or features you've considered.
-
**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/general.md
@@ -0,0 +1,12 @@
+---
+name: General
+about: SeqSender Issue
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is this related to a specific database? If so, which database?**
+
+**Describe your issue below:**
17 changes: 0 additions & 17 deletions .github/ISSUE_TEMPLATE/maintenance.md

This file was deleted.

5 changes: 5 additions & 0 deletions .gitignore
@@ -10,6 +10,7 @@ submit.ready
*report.xml
test_input/test_metadata.tsv
upload_log.csv
+submission_log.csv
*.vscode
*.Rproj
*.Rhistory
@@ -19,3 +20,7 @@ docker-compose-*.yaml

# ignore folders
**/.Rproj.user
+**/test_data/*
+**/gisaid_cli/*
+**/COV_TEST_DATA/*
+**/FLU_TEST_DATA/*
2 changes: 1 addition & 1 deletion README.Rmd
@@ -26,7 +26,7 @@ github_pages_url <- description$GITHUB_PAGES

<p style="font-size: 16px;"><em>Public Database Submission Pipeline</em></p>

-**Beta Version**: v1.2.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!
+**Beta Version**: v1.2.1. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!

**General Disclaimer**: This repository was created for use by CDC programs to collaborate on public health related projects in support of the [CDC mission](https://www.cdc.gov/about/organization/mission.htm). GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

80 changes: 1 addition & 79 deletions README.md
@@ -11,7 +11,7 @@

</p>

**Beta Version**: 1.2.0. This pipeline is currently in Beta testing, and
**Beta Version**: 1.2.1. This pipeline is currently in Beta testing, and
issues could appear during submission. Please use it at your own risk.
Feedback and suggestions are welcome\!

@@ -45,84 +45,6 @@ issue.
| Maintainer | [Dakota Howard](https://github.com/dthoward96) |
| Back-Up | [Reina Chau](https://github.com/rchau88), [Brian Lee](https://github.com/leebrian) |

## Prerequisites

- **NCBI Submissions**

`seqsender` utilizes an UI-Less Data Submission Protocol to bulk upload
submission files (e.g., *submission.xml*, *submission.zip*, etc.) to
NCBI archives. The submission files are uploaded to the NCBI server via
FTP on the command line. Before attempting to submit a submission using
`seqsender`, submitter will need to

1. Have a NCBI account. To sign up, visit [NCBI
website](https://account.ncbi.nlm.nih.gov/).

2. Required for CDC users and highly recommended for others is creating
a center account for your institution/lab [NCBI Center Account
Instructions](https://submit.ncbi.nlm.nih.gov/sarscov2/sra/#step6).
Center accounts allow you to perform submissions UI-less submissions
as your institution/lab.

3. Required for CDC users and also recommended is creating a submission
group in [NCBI Submission Portal](https://submit.ncbi.nlm.nih.gov).
A group should include all individuals who need access to UI-less
submissions through the web interface with your center account. Each
member of the group must also have an individual NCBI account. [NCBI
website](https://account.ncbi.nlm.nih.gov/).

4. Refer to this page for information regarding requirements for
GenBank submissions via FTP only. This page applies only for COVID
and Influenza [NCBI GenBank FTP
Submissions](https://submit.ncbi.nlm.nih.gov/sarscov2/genbank/#step5)
For further questions contact
<a href="mailto:[email protected]">[email protected]</a>
to discuss requirements for submissions.

5. Coordinate a NCBI namespace name (**spuid\_namespace**) that will be
used with Submitter Provided Unique Identifiers (**spuid**) in the
submission. The liaison of **spuid\_namespace** and **spuid** is
used to report back assigned accessions as well as for cross-linking
objects within submission. The values of **spuid\_namespace** are up
to the submitter to decide but they must be unique and
well-coordinated prior to make a submission.

<!-- end list -->

- **GISAID Submissions**

`seqsender` makes use of GISAID’s Command Line Interface tools to bulk
uploading meta- and sequence-data to GISAID databases. Presently, the
pipeline supports upload to EpiFlu (**Influenza A Virus**), EpiCoV
(**SARS-COV-2**), EpiPox (**Monkeypox**), and EpiArbo (**Arbovirus**).
Before uploading, submitter needs to

1. Have a GISAID account. To sign up, visit [GISAID
Platform](https://gisaid.org/).

2. Request a client-ID for your specified Epi(Flu/CoV/Pox/Arbo)
database in order to use its CLI tool. The CLI utilizes the
client-ID along with the username and password to authenticate the
database prior to make a submission. To obtain a client-ID, please
email
<a href="mailto:[email protected]" >[email protected]</a> to
request. ***Important note**: If submitter would like to upload a
“test” submission first to familiarize themselves with the
submission process prior to make a real submission, one should
additionally request a test client-id to perform such submissions.*

3. Download the
<a href="https://cdcgov.github.io/seqsender/articles/images/fluCLI_download.png" target="_blank">EpiFlu</a>
or
<a href="https://cdcgov.github.io/seqsender/articles/images/covCLI_download.png" target="_blank">EpiCoV</a>
CLI from the **GISAID platform** and stored them in the destination
of choice prior to perform a batch upload.

Here is a quick look of where to store the downloaded **GISAID CLI**
package.

![](man/figures/gisaid_cli_dir.png)

## Code Attributions

Dakota Howard and Reina Chau for majority of the code base with input
2 changes: 1 addition & 1 deletion argument_handler.py
@@ -72,7 +72,7 @@ def args_parser():
required=True)
file_parser.add_argument("--fasta_file",
help="Fasta file used to generate submission files; fasta header should match the column 'sequence_name' stored in your metadata. Input either full file path or if just file name it must be stored at '<submission_dir>/<submission_name>/<fasta_file>'.",
-required=True)
+default = None)
file_parser.add_argument("--table2asn",
help="Perform a table2asn submission instead of GenBank FTP submission for organism choices 'FLU' or 'COV'.",
required=False,
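
With `--fasta_file` no longer required, presumably so that runs without sequence data can omit it, something downstream still has to reject a missing FASTA when one is needed. A minimal sketch of that pattern follows; `build_parser`, `check_fasta`, and the `FASTA_DATABASES` set are hypothetical names for illustration and are not taken from SeqSender.

```python
import argparse
from typing import List, Optional

# Assumption for illustration only: which targets need sequence data.
FASTA_DATABASES = {"GENBANK", "GISAID"}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser; only --fasta_file mirrors the diff.
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--fasta_file", default=None,
                        help="FASTA file; optional when no sequence database is targeted.")
    return parser


def check_fasta(fasta_file: Optional[str], databases: List[str]) -> Optional[str]:
    # Fail early with a clear message instead of failing later on a None path.
    needing = FASTA_DATABASES.intersection(databases)
    if fasta_file is None and needing:
        raise ValueError("--fasta_file is required for: " + ", ".join(sorted(needing)))
    return fasta_file


if __name__ == "__main__":
    args = build_parser().parse_args([])  # no --fasta_file given
    print(check_fasta(args.fasta_file, ["BIOSAMPLE", "SRA"]))  # prints None
```

Using `default=None` rather than `required=False` alone makes the "not provided" state explicit, so the later check can test `is None`.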
70 changes: 47 additions & 23 deletions biosample_sra_handler.py
@@ -58,7 +58,6 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
column_ordered = ["sample_name","library_ID"]
prefix = "sra-"
# Create SRA specific fields
-metadata["sra-title"] = config_dict["Description"]["Title"]
filename_cols = [col for col in metadata.columns.tolist() if re.match("sra-file_[1-9]\d*", col)]
# Correct index for filename column
for col in filename_cols:
@@ -69,8 +68,8 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
rename_columns[col] = col.replace("sra-file_", "sra-filename")
elif "BIOSAMPLE" in database:
metadata_regex = "^bs-|^organism$|^collection_date$"
-rename_columns = {"bs-description":"sample_title","bioproject":"bioproject_accession"}
-drop_columns = ["bs-package"]
+rename_columns = {"bioproject":"bioproject_accession"}
+drop_columns = ["bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"]
column_ordered = ["sample_name"]
prefix = "bs-"
else:
@@ -92,14 +91,31 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
file_handler.save_csv(df=database_df, file_path=submission_dir, file_name="metadata.tsv", sep="\t")

# Create submission XML
-def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame, failed_seqs_auto_removed: bool = True) -> bytes:
+def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame) -> bytes:
# Submission XML header
root = etree.Element("Submission")
description = etree.SubElement(root, "Description")
title = etree.SubElement(description, "Title")
-title.text = config_dict["Description"]["Title"]
-comment = etree.SubElement(description, "Comment")
-comment.text = config_dict["Description"]["Comment"]
+if "BIOSAMPLE" in database:
+if "bs-title" in metadata and pd.notnull(metadata["bs-title"].iloc[0]) and metadata["bs-title"].iloc[0].strip() != 0:
+title.text = metadata["bs-title"].iloc[0]
+else:
+title.text = submission_name + "-BS"
+comment = etree.SubElement(description, "Comment")
+if "bs-comment" in metadata and pd.notnull(metadata["bs-comment"].iloc[0]) and metadata["bs-comment"].iloc[0].strip() != 0:
+comment.text = metadata["bs-comment"].iloc[0]
+else:
+comment.text = "BioSample Submission"
+elif "SRA" in database:
+if "sra-title" in metadata and pd.notnull(metadata["sra-title"].iloc[0]) and metadata["sra-title"].iloc[0].strip() != 0:
+title.text = metadata["sra-title"].iloc[0]
+else:
+title.text = submission_name + "-SRA"
+comment = etree.SubElement(description, "Comment")
+if "sra-comment" in metadata and pd.notnull(metadata["sra-comment"].iloc[0]) and metadata["sra-comment"].iloc[0].strip() != 0:
+comment.text = metadata["sra-comment"].iloc[0]
+else:
+comment.text = "SRA Submission"
# Description info including organization and contact info
organization = etree.SubElement(description, "Organization", type=config_dict["Description"]["Organization"]["Type"], role=config_dict["Description"]["Organization"]["Role"])
org_name = etree.SubElement(organization, "Name")
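
The hunk above moves the submission title and comment out of the config file and into optional metadata columns, falling back to `<submission_name>-BS` / `<submission_name>-SRA` and a fixed comment when the column is absent or blank. A compact sketch of that fallback logic is given here under some assumptions: the standard-library `xml.etree.ElementTree` stands in for the `etree` module used in the diff, a plain dict stands in for the first metadata row, and `has_text` and `build_description` are hypothetical helper names.

```python
import math
import xml.etree.ElementTree as ET


def has_text(value) -> bool:
    # True only for values that are present, not NaN, and not blank.
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return str(value).strip() != ""


def build_description(database: str, submission_name: str, first_row: dict) -> ET.Element:
    # Pick the column prefix and the defaults that the diff uses per database.
    if "BIOSAMPLE" in database:
        prefix, suffix, default_comment = "bs-", "-BS", "BioSample Submission"
    else:
        prefix, suffix, default_comment = "sra-", "-SRA", "SRA Submission"
    root = ET.Element("Submission")
    description = ET.SubElement(root, "Description")
    title = ET.SubElement(description, "Title")
    value = first_row.get(prefix + "title")
    title.text = value if has_text(value) else submission_name + suffix
    comment = ET.SubElement(description, "Comment")
    value = first_row.get(prefix + "comment")
    comment.text = value if has_text(value) else default_comment
    return root


if __name__ == "__main__":
    root = build_description("BIOSAMPLE", "run1", {"bs-title": "   "})
    print(ET.tostring(root, encoding="unicode"))
```

Factoring the presence test into one helper keeps the four title/comment branches from drifting apart.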
@@ -125,13 +141,18 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
sampleid = etree.SubElement(biosample, "SampleId")
spuid = etree.SubElement(sampleid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = row["bs-sample_name"]
-descriptor = etree.SubElement(biosample, "Descriptor")
-title = etree.SubElement(descriptor, "Title")
-title.text = row["bs-description"]
+if ("bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip != "") or ("bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip != ""):
+descriptor = etree.SubElement(biosample, "Descriptor")
+if "bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip != "":
+sample_title = etree.SubElement(descriptor, "Title")
+sample_title.text = row["bs-sample_title"]
+if "bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip != "":
+sample_description = etree.SubElement(descriptor, "Description")
+sample_description.text = row["bs-sample_description"]
organismxml = etree.SubElement(biosample, "Organism")
organismname = etree.SubElement(organismxml, "OrganismName")
organismname.text = row["organism"]
-if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
+if "bioproject" in metadata and pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
bioproject = etree.SubElement(biosample, "BioProject")
primaryid = etree.SubElement(bioproject, "PrimaryId", db="BioProject")
primaryid.text = row["bioproject"]
@@ -140,10 +161,12 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
# Attributes
attributes = etree.SubElement(biosample, "Attributes")
# Remove columns with bs-prefix that are not attributes
-biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-description"])]
+biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"])]
for col in biosample_cols:
-attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
-attribute.text = row[col]
+attribute_value = row[col]
+if pd.notnull(attribute_value) and attribute_value.strip() != "":
+attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
+attribute.text = row[col]
# Add collection date to Attributes
attribute = etree.SubElement(attributes, "Attribute", attribute_name="collection_date")
attribute.text = row["collection_date"]
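
The change in this hunk is that an `<Attribute>` element is now emitted only when the metadata value is present and non-blank, instead of unconditionally. The following sketch shows the same filtering in isolation. It is an illustration under assumptions: `xml.etree.ElementTree` replaces the diff's `etree`, a dict replaces the DataFrame row, and `add_attributes` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Columns that carry the bs- prefix but are not BioSample attributes (list taken from the diff).
NON_ATTRIBUTE_COLS = {"bs-sample_name", "bs-package", "bs-title", "bs-comment",
                      "bs-sample_title", "bs-sample_description"}


def add_attributes(biosample: ET.Element, row: dict) -> ET.Element:
    attributes = ET.SubElement(biosample, "Attributes")
    for col, value in row.items():
        if not col.startswith("bs-") or col in NON_ATTRIBUTE_COLS:
            continue
        # Only string values with visible content become <Attribute> elements.
        if isinstance(value, str) and value.strip() != "":
            attribute = ET.SubElement(attributes, "Attribute",
                                      attribute_name=col.replace("bs-", "", 1))
            attribute.text = value
    return attributes


if __name__ == "__main__":
    row = {"bs-sample_name": "s1", "bs-host": "Homo sapiens", "bs-isolate": "  "}
    print(ET.tostring(add_attributes(ET.Element("BioSample"), row), encoding="unicode"))
```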
@@ -174,20 +197,21 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
datatype = etree.SubElement(file, "DataType")
datatype.text = "generic-data"
# Remove columns with sra- prefix that are not attributes
-sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match("(sra-sample_name|sra-file_location|sra-file_\d*)", col)]
+sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match("(sra-sample_name|sra-title|sra-comment|sra-file_location|sra-file_\d*)", col)]
for col in sra_cols:
-attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
-attribute.text = row[col]
+attribute_value = row[col]
+if pd.notnull(attribute_value) and attribute_value.strip() != "":
+attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
+attribute.text = row[col]
if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioProject")
refid = etree.SubElement(attribute_ref_id, "RefId")
primaryid = etree.SubElement(refid, "PrimaryId")
primaryid.text = row["bioproject"]
-if config_dict["Link_Sample_Between_NCBI_Databases"] and metadata.columns.str.contains("bs-sample_name").any():
-attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
-refid = etree.SubElement(attribute_ref_id, "RefId")
-spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
-spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
+attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
+refid = etree.SubElement(attribute_ref_id, "RefId")
+spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
+spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
identifier = etree.SubElement(addfiles, "Identifier")
spuid = etree.SubElement(identifier, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = row["sra-sample_name"]
@@ -209,7 +233,7 @@ def create_biosample_sra_submission(organism: str, database: str, submission_nam
create_raw_reads_list(submission_dir=submission_dir, raw_files_list=raw_files_list)
manual_df = metadata.copy()
create_manual_submission_files(database=database, submission_dir=submission_dir, metadata=manual_df, config_dict=config_dict)
-xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict, failed_seqs_auto_removed=True)
+xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict)
file_handler.save_xml(xml_str, submission_dir)

# Read xml report and get status of the submission
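
Several of the added guards in this file compare against the wrong thing: `metadata["bs-title"].iloc[0].strip() != 0` compares a string with an integer, and `row["bs-sample_title"].strip != ""` compares the method object rather than its result. Both expressions are always true in Python, so blank values would pass the check. The short sketch below uses a plain string rather than the pandas objects in the diff to show the difference; the variable names are illustrative only.

```python
value = "   "  # a blank metadata cell

# str compared with int: never equal, so this is True even for blank text.
check_vs_zero = value.strip() != 0

# The bound method itself (not called) compared with a str: also always True.
check_uncalled = value.strip != ""

# Calling strip() and comparing with the empty string gives the intended result.
check_intended = value.strip() != ""

print(check_vs_zero, check_uncalled, check_intended)  # True True False
```

Only the third form distinguishes blank from non-blank text, which is the form the `bioproject` and attribute checks in the same diff already use.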