Merge pull request #71 from CDCgov/master

Update dev to v1.2.1
CDCgov · Sep 11, 2024 · c3ce914
2 parents b3e339f + 2e0bd2f
commit c3ce914
Showing 520 changed files with 11,498 additions and 8,943 deletions.
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,3 @@
+/shiny/*
+/vignettes/*
+/docs/*
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -1,44 +1,27 @@
 ---
 name: Bug report
-about: Create a report to help us improve
-title: ''
-labels: ''
-assignees: ''
+about: Create a report to help us improve SeqSender
+title: "[BUG]"
+labels: bug
+assignees: dthoward96
 
 ---
 
 **Describe the bug**
 A clear and concise description of what feature is not working.
 
-**Impact**
-Please describe the impact this bug is causing to your program or organization.
-
 **To Reproduce**
-Steps to reproduce the behavior:
-1. Go to '...'
-2. Click on '....'
-3. Scroll down to '....'
-4. See error
-
-**Expected behavior**
-A clear and concise description of what you expected to happen.
-
-**Screenshots**
-If applicable, add screenshots to help explain your problem.
+ - Which databases are you attempting submission to?
+ - Is the error related to a specific database/metadata field? If so, which?
+ - Steps to reproduce the behavior:
 
 **Logs**
 If applicable, please attach logs to help describe your problem.
 
-**Desktop (please complete the following information):**
- - OS: [e.g. iOS]
- - Browser [e.g. chrome, safari]
- - Version [e.g. 22]
-
-**Smartphone (please complete the following information):**
- - Device: [e.g. iPhone6]
- - OS: [e.g. iOS8.1]
- - Browser [e.g. stock browser, safari]
- - Version [e.g. 22]
+**Version**
+ - Version Number: [e.g. v1.2.0.]
+ - SeqSender Version: [e.g. Singularity, Docker, Script]
+ - OS [e.g. Linux, Mac]
 
 **Additional context**
 Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -1,20 +1,20 @@
 ---
 name: Feature request
-about: Suggest an idea for this project
-title: ''
-labels: ''
+about: Suggest a new feature for SeqSender
+title: "[FEATURE REQUEST]"
+labels: enhancement
 assignees: ''
 
 ---
 
-**Is your feature request related to a problem? Please describe.**
+**Is this feature for general SeqSender usage or for submitting to a specific database? If so, which database?**
+SeqSender or a specified database name.
+
+**Is your feature request related to a problem? If so, please describe.**
 A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
 
 **Describe the solution you'd like**
 A clear and concise description of what you want to happen.
 
-**Describe alternatives you've considered**
-A clear and concise description of any alternative solutions or features you've considered.
-
 **Additional context**
 Add any other context or screenshots about the feature request here.
diff --git a/.github/ISSUE_TEMPLATE/general.md b/.github/ISSUE_TEMPLATE/general.md
@@ -0,0 +1,12 @@
+---
+name: General
+about: SeqSender Issue
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is this related to a specific database? If so, which database?**
+
+**Describe your issue below:**
diff --git a/.github/ISSUE_TEMPLATE/maintenance.md b/.github/ISSUE_TEMPLATE/maintenance.md
diff --git a/.gitignore b/.gitignore
@@ -10,6 +10,7 @@ submit.ready
 *report.xml
 test_input/test_metadata.tsv
 upload_log.csv
+submission_log.csv
 *.vscode
 *.Rproj
 *.Rhistory
@@ -19,3 +20,7 @@ docker-compose-*.yaml
 
 # ignore folders
 **/.Rproj.user
+**/test_data/*
+**/gisaid_cli/*
+**/COV_TEST_DATA/*
+**/FLU_TEST_DATA/*
diff --git a/README.Rmd b/README.Rmd
@@ -26,7 +26,7 @@ github_pages_url <- description$GITHUB_PAGES
 
 <p style="font-size: 16px;"><em>Public Database Submission Pipeline</em></p>
 
-**Beta Version**: v1.2.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome! 
+**Beta Version**: v1.2.1. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome! 
 
 **General Disclaimer**: This repository was created for use by CDC programs to collaborate on public health related projects in support of the [CDC mission](https://www.cdc.gov/about/organization/mission.htm).  GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
 

diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
 
 </p>
 
-**Beta Version**: 1.2.0. This pipeline is currently in Beta testing, and
+**Beta Version**: 1.2.1. This pipeline is currently in Beta testing, and
 issues could appear during submission. Please use it at your own risk.
 Feedback and suggestions are welcome\!
 
@@ -45,84 +45,6 @@ issue.
 | Maintainer | [Dakota Howard](https://github.com/dthoward96)                                           |
 | Back-Up    | [Reina Chau](https://github.com/rchau88), [Brian Lee](https://github.com/leebrian)       |
 
-## Prerequisites
-
-  - **NCBI Submissions**
-
-`seqsender` utilizes an UI-Less Data Submission Protocol to bulk upload
-submission files (e.g., *submission.xml*, *submission.zip*, etc.) to
-NCBI archives. The submission files are uploaded to the NCBI server via
-FTP on the command line. Before attempting to submit a submission using
-`seqsender`, submitter will need to
-
-1.  Have a NCBI account. To sign up, visit [NCBI
-    website](https://account.ncbi.nlm.nih.gov/).
-
-2.  Required for CDC users and highly recommended for others is creating
-    a center account for your institution/lab [NCBI Center Account
-    Instructions](https://submit.ncbi.nlm.nih.gov/sarscov2/sra/#step6).
-    Center accounts allow you to perform submissions UI-less submissions
-    as your institution/lab.
-
-3.  Required for CDC users and also recommended is creating a submission
-    group in [NCBI Submission Portal](https://submit.ncbi.nlm.nih.gov).
-    A group should include all individuals who need access to UI-less
-    submissions through the web interface with your center account. Each
-    member of the group must also have an individual NCBI account. [NCBI
-    website](https://account.ncbi.nlm.nih.gov/).
-
-4.  Refer to this page for information regarding requirements for
-    GenBank submissions via FTP only. This page applies only for COVID
-    and Influenza [NCBI GenBank FTP
-    Submissions](https://submit.ncbi.nlm.nih.gov/sarscov2/genbank/#step5)
-    For further questions contact
-    <a href="mailto:[email protected]">[email protected]</a>
-    to discuss requirements for submissions.
-
-5.  Coordinate a NCBI namespace name (**spuid\_namespace**) that will be
-    used with Submitter Provided Unique Identifiers (**spuid**) in the
-    submission. The liaison of **spuid\_namespace** and **spuid** is
-    used to report back assigned accessions as well as for cross-linking
-    objects within submission. The values of **spuid\_namespace** are up
-    to the submitter to decide but they must be unique and
-    well-coordinated prior to make a submission.
-
-<!-- end list -->
-
-  - **GISAID Submissions**
-
-`seqsender` makes use of GISAID’s Command Line Interface tools to bulk
-uploading meta- and sequence-data to GISAID databases. Presently, the
-pipeline supports upload to EpiFlu (**Influenza A Virus**), EpiCoV
-(**SARS-COV-2**), EpiPox (**Monkeypox**), and EpiArbo (**Arbovirus**).
-Before uploading, submitter needs to
-
-1.  Have a GISAID account. To sign up, visit [GISAID
-    Platform](https://gisaid.org/).
-
-2.  Request a client-ID for your specified Epi(Flu/CoV/Pox/Arbo)
-    database in order to use its CLI tool. The CLI utilizes the
-    client-ID along with the username and password to authenticate the
-    database prior to make a submission. To obtain a client-ID, please
-    email
-    <a href="mailto:[email protected]" >[email protected]</a> to
-    request. ***Important note**: If submitter would like to upload a
-    “test” submission first to familiarize themselves with the
-    submission process prior to make a real submission, one should
-    additionally request a test client-id to perform such submissions.*
-
-3.  Download the
-    <a href="https://cdcgov.github.io/seqsender/articles/images/fluCLI_download.png" target="_blank">EpiFlu</a>
-    or
-    <a href="https://cdcgov.github.io/seqsender/articles/images/covCLI_download.png" target="_blank">EpiCoV</a>
-    CLI from the **GISAID platform** and stored them in the destination
-    of choice prior to perform a batch upload.
-
-Here is a quick look of where to store the downloaded **GISAID CLI**
-package.
-
-![](man/figures/gisaid_cli_dir.png)
-
 ## Code Attributions
 
 Dakota Howard and Reina Chau for majority of the code base with input

diff --git a/argument_handler.py b/argument_handler.py
@@ -72,7 +72,7 @@ def args_parser():
 		required=True)
 	file_parser.add_argument("--fasta_file",
 		help="Fasta file used to generate submission files; fasta header should match the column 'sequence_name' stored in your metadata. Input either full file path or if just file name it must be stored at '<submission_dir>/<submission_name>/<fasta_file>'.",
-		required=True)
+		default = None)
 	file_parser.add_argument("--table2asn",
 		help="Perform a table2asn submission instead of GenBank FTP submission for organism choices 'FLU' or 'COV'.",
 		required=False,

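Review note: with `--fasta_file` no longer `required=True`, a run that does need sequence data can get past argument parsing with `fasta_file=None`. The block below is a minimal sketch of one way to reject that case early, not SeqSender's actual handling. `DATABASES_NEEDING_FASTA`, `build_parser`, `parse`, and the `--databases` flag are hypothetical names used only for illustration.

```python
import argparse
from typing import List, Optional

# Hypothetical set of databases that need sequence data (an assumption,
# not taken from the SeqSender source).
DATABASES_NEEDING_FASTA = {"GENBANK", "GISAID"}


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser where --fasta_file is optional."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--databases", nargs="+", required=True)
    parser.add_argument("--fasta_file", default=None)
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and enforce the conditional FASTA requirement."""
    parser = build_parser()
    args = parser.parse_args(argv)
    needs_fasta = DATABASES_NEEDING_FASTA.intersection(
        db.upper() for db in args.databases
    )
    if needs_fasta and args.fasta_file is None:
        # parser.error prints a usage message and raises SystemExit(2).
        parser.error(
            "--fasta_file is required for: " + ", ".join(sorted(needs_fasta))
        )
    return args
```

Checking after `parse_args` keeps the flag optional for read-only submissions while still failing fast when a sequence database is requested.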
diff --git a/biosample_sra_handler.py b/biosample_sra_handler.py
@@ -58,7 +58,6 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
 		column_ordered = ["sample_name","library_ID"]
 		prefix = "sra-"
 		# Create SRA specific fields
-		metadata["sra-title"] = config_dict["Description"]["Title"]
 		filename_cols = [col for col in metadata.columns.tolist() if re.match("sra-file_[1-9]\d*", col)]
 		# Correct index for filename column
 		for col in filename_cols:
@@ -69,8 +68,8 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
 				rename_columns[col] = col.replace("sra-file_", "sra-filename")
 	elif "BIOSAMPLE" in database:
 		metadata_regex = "^bs-|^organism$|^collection_date$"
-		rename_columns = {"bs-description":"sample_title","bioproject":"bioproject_accession"}
-		drop_columns = ["bs-package"]
+		rename_columns = {"bioproject":"bioproject_accession"}
+		drop_columns = ["bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"]
 		column_ordered = ["sample_name"]
 		prefix = "bs-"
 	else:
@@ -92,14 +91,31 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
 	file_handler.save_csv(df=database_df, file_path=submission_dir, file_name="metadata.tsv", sep="\t")
 
 # Create submission XML
-def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame, failed_seqs_auto_removed: bool = True) -> bytes:
+def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame) -> bytes:
 	# Submission XML header
 	root = etree.Element("Submission")
 	description = etree.SubElement(root, "Description")
 	title = etree.SubElement(description, "Title")
-	title.text = config_dict["Description"]["Title"]
-	comment = etree.SubElement(description, "Comment")
-	comment.text = config_dict["Description"]["Comment"]
+	if "BIOSAMPLE" in database:
+		if "bs-title" in metadata and pd.notnull(metadata["bs-title"].iloc[0]) and metadata["bs-title"].iloc[0].strip() != "":
+			title.text = metadata["bs-title"].iloc[0]
+		else:
+			title.text = submission_name + "-BS"
+		comment = etree.SubElement(description, "Comment")
+		if "bs-comment" in metadata and pd.notnull(metadata["bs-comment"].iloc[0]) and metadata["bs-comment"].iloc[0].strip() != "":
+			comment.text = metadata["bs-comment"].iloc[0]
+		else:
+			comment.text = "BioSample Submission"
+	elif "SRA" in database:
+		if "sra-title" in metadata and pd.notnull(metadata["sra-title"].iloc[0]) and metadata["sra-title"].iloc[0].strip() != "":
+			title.text = metadata["sra-title"].iloc[0]
+		else:
+			title.text = submission_name + "-SRA"
+		comment = etree.SubElement(description, "Comment")
+		if "sra-comment" in metadata and pd.notnull(metadata["sra-comment"].iloc[0]) and metadata["sra-comment"].iloc[0].strip() != "":
+			comment.text = metadata["sra-comment"].iloc[0]
+		else:
+			comment.text = "SRA Submission"
 	# Description info including organization and contact info
 	organization = etree.SubElement(description, "Organization", type=config_dict["Description"]["Organization"]["Type"], role=config_dict["Description"]["Organization"]["Role"])
 	org_name = etree.SubElement(organization, "Name")
@@ -125,13 +141,18 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
 			sampleid = etree.SubElement(biosample, "SampleId")
 			spuid = etree.SubElement(sampleid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
 			spuid.text = row["bs-sample_name"]
-			descriptor = etree.SubElement(biosample, "Descriptor")
-			title = etree.SubElement(descriptor, "Title")
-			title.text = row["bs-description"]
+			if ("bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip() != "") or ("bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip() != ""):
+				descriptor = etree.SubElement(biosample, "Descriptor")
+				if "bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip() != "":
+					sample_title = etree.SubElement(descriptor, "Title")
+					sample_title.text = row["bs-sample_title"]
+				if "bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip() != "":
+					sample_description = etree.SubElement(descriptor, "Description")
+					sample_description.text = row["bs-sample_description"]
 			organismxml = etree.SubElement(biosample, "Organism")
 			organismname = etree.SubElement(organismxml, "OrganismName")
 			organismname.text = row["organism"]
-			if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
+			if "bioproject" in metadata and pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
 				bioproject = etree.SubElement(biosample, "BioProject")
 				primaryid = etree.SubElement(bioproject, "PrimaryId", db="BioProject")
 				primaryid.text = row["bioproject"]
@@ -140,10 +161,12 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
 			# Attributes
 			attributes = etree.SubElement(biosample, "Attributes")
 			# Remove columns with bs-prefix that are not attributes
-			biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-description"])]
+			biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"])]
 			for col in biosample_cols:
-				attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
-				attribute.text = row[col]
+				attribute_value = row[col]
+				if pd.notnull(attribute_value) and attribute_value.strip() != "":
+					attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
+					attribute.text = row[col]
 			# Add collection date to Attributes
 			attribute = etree.SubElement(attributes, "Attribute", attribute_name="collection_date")
 			attribute.text = row["collection_date"]
@@ -174,20 +197,21 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
 				datatype = etree.SubElement(file, "DataType")
 				datatype.text = "generic-data"
 			# Remove columns with sra- prefix that are not attributes
-			sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match("(sra-sample_name|sra-file_location|sra-file_\d*)", col)]
+			sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match(r"(sra-sample_name|sra-title|sra-comment|sra-file_location|sra-file_\d*)", col)]
 			for col in sra_cols:
-				attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
-				attribute.text = row[col]
+				attribute_value = row[col]
+				if pd.notnull(attribute_value) and attribute_value.strip() != "":
+					attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
+					attribute.text = row[col]
 			if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
 				attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioProject")
 				refid = etree.SubElement(attribute_ref_id, "RefId")
 				primaryid = etree.SubElement(refid, "PrimaryId")
 				primaryid.text = row["bioproject"]
-			if config_dict["Link_Sample_Between_NCBI_Databases"] and metadata.columns.str.contains("bs-sample_name").any():
-				attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
-				refid = etree.SubElement(attribute_ref_id, "RefId")
-				spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
-				spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
+			attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
+			refid = etree.SubElement(attribute_ref_id, "RefId")
+			spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
+			spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
 			identifier = etree.SubElement(addfiles, "Identifier")
 			spuid = etree.SubElement(identifier, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
 			spuid.text = row["sra-sample_name"]
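Review note: the title, comment, and attribute branches in `create_submission_xml` all repeat the same "column exists, value is not null, value is not blank" test. A minimal sketch of how that check could be factored out is shown below. `has_text` and `first_value_or_default` are hypothetical helpers, not functions in the SeqSender code base, and the sketch assumes metadata values are read as strings.

```python
import pandas as pd


def has_text(value: object) -> bool:
    """Return True only for strings with at least one non-whitespace character."""
    # NaN and None are not str instances, so they fall through to False.
    return isinstance(value, str) and value.strip() != ""


def first_value_or_default(metadata: pd.DataFrame, column: str, default: str) -> str:
    """Return the first row's value for `column`, or `default` if it is absent or blank."""
    if column in metadata.columns and len(metadata) > 0:
        value = metadata[column].iloc[0]
        if has_text(value):
            return value
    return default
```

With helpers like these, the BioSample title fallback would reduce to something like `title.text = first_value_or_default(metadata, "bs-title", submission_name + "-BS")`.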
@@ -209,7 +233,7 @@ def create_biosample_sra_submission(organism: str, database: str, submission_nam
 		create_raw_reads_list(submission_dir=submission_dir, raw_files_list=raw_files_list)
 	manual_df = metadata.copy()
 	create_manual_submission_files(database=database, submission_dir=submission_dir, metadata=manual_df, config_dict=config_dict)
-	xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict, failed_seqs_auto_removed=True)
+	xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict)
 	file_handler.save_xml(xml_str, submission_dir)
 
 # Read xml report and get status of the submission