# Handling NovaSeq sequencing outputs

Here we discuss how to process the raw sequencing reads directly from the Illumina NovaSeq sequencer.

## What you should have "out of the box" 🗃️

Our runs are stored in Vault storage, and need to be transferred to the M3 MASSIVE cluster for processing. To inspect your files, the simplest way is to use FileZilla by setting up an SFTP connection as below. You need to ensure you have file access to the Vault prior to this.

![Vault Set-up](../assets/Utilities/seq_data/vaultv2.png){ align=center }

The basic file structure on the Vault should look something like below, with a main folder (long name) that contains the relevant files you need, and generally some sort of metadata file. You need to ensure that you have given all permissions to every file so that you can transfer them to the cluster – you can do this by right clicking the NovaSeq parent folder, selecting `File Attributes...`, and then adding all of the `Read`, `Write`, and `Execute` permissions, ensuring you select `Recurse into subdirectories`.

![FileZilla](../assets/Utilities/seq_data/fz_novaseq_files.png){ align=center }

## Transfer files to the cluster

### Sequencing data transfer 🚛

Navigate to an appropriate project folder on the cluster. An example command is shown below for transferring the data folder into a new folder called `raw_data` using `rsync`. If it doesn't exist, the folder you name will be created for you (just make sure you put a `/` after the new folder name).

```bash
rsync -aHWv --stats --progress MONASH\\[email protected]:Marsland-CCS-RAW-Sequencing-Archive/vault/03_NovaSeq/NovaSeq25_Olaf_Shotgun/231025_A00611_0223_AHGMNNDRX2/ raw_data/
```
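If you want to confirm the transfer arrived intact, one option is to write a checksum manifest once the copy finishes. This is a sketch: `raw_data` matches the destination folder above, and the manifest file name is arbitrary.

```bash
# Build an md5 manifest of everything that was transferred
find raw_data -type f -exec md5sum {} + > raw_data.md5

# Re-verify at any later point (e.g. after moving the data between filesystems)
md5sum -c raw_data.md5
```

The manifest can be kept alongside the data so the run can be re-verified after any future move.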

### BCL Convert sample sheet preparation 🗒️

Create a sample sheet document for BCL Convert (the tool that will demultiplex and prepare our FASTQ files from the raw data). The full documentation can be viewed [here](https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl_convert/bcl-convert-v3-7-5-software-guide-1000000163594-00.pdf).

The document should be in the following format, where `index` is the `i7 adapter sequence` and `index2` is the `i5 adapter sequence`. An additional first column called `Lane` can be provided to specify a particular lane number only for FASTQ file generation. We will call this file `samplesheet.txt`.

For the indexes, **both** sequences used on the sample sheet should be the reverse complement of the actual sequences.
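If you only have the sequences in their original orientation, a quick way to reverse-complement one in the shell (a sketch using the standard `rev` and `tr` utilities) is:

```bash
# Reverse the sequence, then substitute each base with its complement
printf 'TAAGGCGA' | rev | tr 'ACGT' 'TGCA'
# prints TCGCCTTA
```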

If you make this on a Windows system, ensure you save your output encoded as `UTF-8` and not `UTF-8 with BOM`.

```bash
[Header]
FileFormatVersion,2

[BCLConvert_Settings]
CreateFastqForIndexReads,0

[BCLConvert_Data]
Sample_ID,i7_adapter,index,i5_adapter,index2
Abx1_d21,N701,TAAGGCGA,S502,ATAGAGAG
Abx2_d21,N702,CGTACTAG,S502,ATAGAGAG
Abx3_d21,N703,AGGCAGAA,S502,ATAGAGAG
Abx4_d21,N704,TCCTGAGC,S502,ATAGAGAG
Abx5_d21,N705,GGACTCCT,S502,ATAGAGAG
#etc.
```

## BCL Convert 🔄

### Install ⬇️

If you feel the need to have the latest version, visit the Illumina support website and copy the link for the latest CentOS version of the BCL Convert tool.

Otherwise use the version that is available on the M3 MASSIVE cluster, and skip to the run section.

```bash
# Download from the support website in the main folder
wget https://webdata.illumina.com/downloads/software/bcl-convert/bcl-convert-4.2.4-2.el7.x86_64.rpm

# Install using rpm2cpio (change file name as required)
module load rpm2cpio
rpm2cpio bcl-convert-4.2.4-2.el7.x86_64.rpm | cpio -idv
```

The most up-to-date `bcl-convert` will be inside the output `usr/bin/` folder, and can be called from that location.
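To call the extracted binary as plain `bcl-convert` rather than by its full path, you can prepend that folder to your `PATH` for the current session (a sketch, assuming you ran `rpm2cpio` in the current directory):

```bash
# Make the extracted binary callable without the full path (current session only)
export PATH="$PWD/usr/bin:$PATH"
```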

### Run 🏃

With the `raw_data` folder and `samplesheet.txt` both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session.

You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves.

You can request an increased open file limit from the M3 MASSIVE help desk.
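You can check the limits currently in effect for your session before launching a run:

```bash
# Soft limit: what processes actually get by default
ulimit -Sn

# Hard limit: the ceiling a non-root user can raise the soft limit to
ulimit -Hn
```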

!!! question "Can I run this on my local machine?"

    **Please note that the node `m3k010` has been decommissioned due to system upgrades.**

    However, it is more than possible to run this process quickly on a local machine if you have the raw BCL files available.
    The minimum requirements (as of BCL Convert v4.0) are:

    - **Hardware requirements**
        - Single multiprocessor or multicore computer
        - Minimum 64 GB RAM
    - **Software requirements**
        - Root access to your computer
        - File system access to adjust ulimit

You can start an interactive bash session and increase the open file limit as follows:

```bash
# Begin a new interactive bash session on the designated node
srun --pty --partition=genomics --qos=genomics --nodelist=m3k010 --mem=320GB --ntasks=1 --cpus-per-task=48 bash -i

# Increase the open file limit to 65,535
ulimit -n 65535
```

```bash
# Run bcl-convert
bcl-convert \
    --bcl-input-directory raw_data \
    --output-directory fastq_files \
    --sample-sheet samplesheet.txt
```

This will create a new output folder called `fastq_files` that contains your demultiplexed samples.
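As a quick sanity check, you can count the reads in each output file (FASTQ stores one read per four lines). The file-name pattern below is an example and may differ depending on your sample sheet settings.

```bash
# Report the read count for each R1 FASTQ file
for f in fastq_files/*_R1_001.fastq.gz
do
    printf '%s\t%s\n' "$f" "$(( $(zcat "$f" | wc -l) / 4 ))"
done
```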

### Merge lanes ⛙

If you ran your samples without lane splitting, then you can merge the two lanes together using the following code, saved in the main project folder as `merge_lanes.sh`, and run using the command: `bash merge_lanes.sh`.

```bash title="merge_lanes.sh"
#!/bin/bash

# Merge lanes 1 and 2, looping once per sample via the lane 1 R1 files
cd fastq_files
for f in *_L001_R1_001.fastq.gz
do
    Basename=${f%_L00*}
    ## merge R1
    cat ${Basename}_L00*_R1_001.fastq.gz > ${Basename}_R1.fastq.gz
    ## merge R2
    cat ${Basename}_L00*_R2_001.fastq.gz > ${Basename}_R2.fastq.gz
done

# Remove the individual per-lane files to make space
rm -f *_L00*
```
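Before the `rm` step deletes the per-lane files, you may wish to confirm that a merged file contains every read from its lanes. This is a sketch run from the main project folder; `Abx1_d21` is an example sample name taken from the sample sheet above.

```bash
# Line counts should match exactly: merged file = sum of the per-lane files
lanes=$(zcat fastq_files/Abx1_d21_L00*_R1_001.fastq.gz | wc -l)
merged=$(zcat fastq_files/Abx1_d21_R1.fastq.gz | wc -l)
[ "$lanes" -eq "$merged" ] && echo "counts match"
```

This works because concatenated gzip streams decompress as the concatenation of their contents, so no re-compression step is needed during the merge.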
# SRA sequencing data submission

A guide to submitting sequencing data to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database. Includes information on uploading data to the SRA using the high-speed Aspera Connect tool.

**Patient-derived sequencing files**

If your samples are derived from humans, ensure that your **file names include no reference to patient identifiers**. Once uploaded to the SRA database, it is very difficult to change the names of files: you must contact the database directly to arrange removal of the files, reupload the data, and then go through a difficult process of having them re-map the new uploads to your existing SRA metadata files.

Also ensure that you only include the absolute minimum amount of metadata, in a manner that protects patient confidentiality. Absolutely no information should be unique to one single patient in your cohort, even an age (if you have a patient with a unique age, this should be replaced with `NA` for the purposes of SRA submission). For manuscripts, you can include a phrase indicating that further metadata is available upon reasonable request. **The important thing here is to not infringe on patient privacy and confidentiality.**

Things you could potentially include:

- Modified and anonymised patient ID
- Sampling group
- Timepoint (not exact days or months)
- Sex
- Collection year (no exact dates)
- Tissue
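A quick way to spot values that would be unique to a single patient is to count occurrences per column. The sketch below assumes a hypothetical comma-separated `metadata.csv` with age in the third column; adjust the column index for your own file.

```bash
# Print any age value that appears exactly once across the cohort
# (these should be replaced with NA before submission)
awk -F',' 'NR > 1 { count[$3]++ } END { for (a in count) if (count[a] == 1) print a }' metadata.csv
```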

## Process overview

1. Register a BioProject
2. Register BioSamples for the related BioProject
3. Submit data to SRA

## Register a BioProject 📔

The BioProject is an important element that can link together different types of sequencing data, and represents all the sequencing data for a given experiment.

Go to the [SRA submission](https://submit.ncbi.nlm.nih.gov/) website to register a new BioProject.

- Sample scope: Multispecies (if you have microbiome data)
- Target description: Bacterial 16S metagenomics (change if you have shotgun metagenomics and/or host transcriptomics)
- Organism name: Human (change if using mouse or rat data)
- Project type: Metagenome (add transcriptome if you also have host transcriptomics)

## Register BioSamples :test_tube:

### Microbiome data 🦠

Microbiome samples will be registered as *MIMARKS Specimen* samples. On the **BioSample Attributes** tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/).

- Use the BioProject accession number previously generated
- **Organism**: `human metagenome` (or as appropriate)
- **Env broad scale**: `host-associated`
- **Env local scale**: `mammalia-associated habitat`
- **Env medium**: (as appropriate)
- **Strain, isolate, cultivar, ecotype**: `NA`
- Add any other relevant host information in the table, as well as the host tissue samples
- Any other column which is not relevant can be set to `NA`

The **SRA Metadata** tab is what will join everything together. Once again, download the provided Excel template, and fill everything in carefully.

- **Sample name**: the base name of your samples
- **Library ID**: you may have named your files differently than your sample names – provide this if so, otherwise you can repeat the sample name
- **Title**: a short description of the sample in the form "`{methodology}` of `{organism}`: `{sample_info}`" – e.g. "Shotgun metagenomics of Homo sapiens: childhood bronchial brushing".
- **Library strategy**: `WGS`
- **Library source**: `METAGENOMIC`
- **Library selection**: `RANDOM`
- **Library layout**: `paired`
- **Platform**: `ILLUMINA`
- **Instrument model**: `Illumina NovaSeq 6000`
- **Design description**: `NA`
- **Filetype**: `fastq`
- **Filename**: the file name of the forward reads
- **Filename2**: the file name of the reverse reads
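To avoid typos in the **Filename**/**Filename2** columns, you can print the exact file names straight from the data folder and paste them into the template. The folder name and suffix pattern below are examples; adjust them to match your own files.

```bash
# List forward-read file names only, without the directory prefix
for f in fastq_files/*_R1.fastq.gz
do
    basename "$f"
done
```

Repeat with `*_R2.fastq.gz` for the reverse reads, keeping the two lists in the same order.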

### Transcriptomics data 👨🐭

Host transcriptomics samples will be registered as either *HUMAN* or *Model organism or animal* samples. On the **BioSample Attributes** tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/).

- Use the BioProject accession number previously generated
- **Organism**: `Homo sapiens` (or `Mus musculus`/`Rattus norvegicus` as appropriate)
- **Isolate**: `NA`
- **Age**: fill this in, but leave `NA` for human samples if it would result in a unique combination of metadata variables with potential to allow identification of any individual
- **Biomaterial provider**: enter the lab, organisation etc. that provided the samples
- **Collection date**: do not enter any exact dates for human samples
- **Geo loc name**: country in which samples were collected
- **Sex**: provide sex of host
- **Tissue**: specify tissue origin of samples
- Add any other relevant data, such as sampling group

As above, the **SRA Metadata** tab is where the magic will happen :magic_wand::sparkles:. Once again, download the provided Excel template, and fill everything in carefully.

- **Sample name**: the base name of your samples
- **Library ID**: you may have named your files differently than your sample names – provide this if so, otherwise you can repeat the sample name
- **Title**: a short description of the sample in the form "`{methodology}` of `{organism}`: `{sample_info}`" – e.g. "RNA-Seq of Homo sapiens: childhood bronchial brushing".
- **Library strategy**: `RNA-Seq`
- **Library source**: `TRANSCRIPTOMIC`
- **Library selection**: `RANDOM`
- **Library layout**: `paired`
- **Platform**: `ILLUMINA`
- **Instrument model**: `Illumina NovaSeq 6000`
- **Design description**: `NA`
- **Filetype**: `fastq`
- **Filename**: the file name of the forward reads
- **Filename2**: the file name of the reverse reads

## Submit data to SRA 📤

!!! question "Which upload option should I choose?"

    You can choose either of the following upload options, and each has pros and cons.

    - **FileZilla** allows parallel uploads according to your settings, but upload speed is typically slower.
    - **Aspera Connect** (at least with NCBI) only allows sequential uploads, but the upload speed is significantly faster.

### FileZilla 🦖

Using [FileZilla](https://filezilla-project.org/) is more effective when you have large files and/or a large number of files.

In FileZilla, open the Site Manager and connect to NCBI as follows:

- Protocol: `FTP`
- Host: `ftp-private.ncbi.nlm.nih.gov`
- Username: `subftp`
- Password: this is your user-specific NCBI password given when you submit your data

In the `Advanced` tab next to the `General` tab, set the `Default remote directory` field to the directory specified by NCBI. This will look something like: `/uploads/{username}_{uniqueID}`.

Select connect, and gain access to your account folder on the NCBI FTP server.

**Create a new project folder** within the main upload folder, and enter the folder. Add your files to the upload queue, and begin the upload process.

### Aspera Connect

The IBM Aspera Connect tool allows for much faster uploads than FileZilla, and is a good alternative for large files.

#### Linux process 🐧

The process described here is for Linux, but is similar for Windows and MacOS operating systems. More information is provided on the [IBM website](https://www.ibm.com/docs/en/aspera-connect/4.2?topic=suc-installation).

1. Download the [Aspera Connect software](https://www.ibm.com/aspera/connect/).
2. Open a new terminal window (`Ctrl+Alt+T`).
3. Navigate to downloads, and extract the `tar.gz` file.
4. Run the install script.

```bash
# Extract the file
tar -zxvf ibm-aspera-connect-version+platform.tar.gz
# Run the install script
./ibm-aspera-connect-version+platform.sh
```

5. Add the Aspera Connect bin folder to your PATH variable (reopen the terminal to apply the changes).

```bash
# Add the folder to PATH (note the closing quote comes before the redirection)
echo 'export PATH=$PATH:/home/{user}/.aspera/connect/bin/' >> ~/.bashrc
```

6. Download the NCBI Aspera Connect [key file](https://submit.ncbi.nlm.nih.gov/preload/aspera_key/).
7. Navigate to the parent folder of the folder containing the files you want to upload to the SRA database, and create a new bash script.

```bash
# Create a new bash script file
touch upload_seq_data.sh
```

8. Add the following code to the bash script file.
    - The `-i` argument is the path to the key file, and must be given as a full path (not a relative one).
    - The `-d` argument specifies that the directory will be created if it doesn't exist.
    - You can adjust the maximum upload speed using the `-l500m` argument, where `500` is the speed in Mbps. You could increase or decrease this as desired.
    - Add the folder containing the data to upload, which can be relative to the folder containing the bash script.
    - Next provide the upload folder provided by NCBI, which will be user-specific, and **ensure you provide a project folder** at the end of this. Data will not be available if it is uploaded into the main uploads folder.

```bash title="upload_seq_data.sh"
#!/bin/bash
ascp -i {/full/path/to/key-file/aspera.openssh} -QT -l500m -k1 -d {./name-of-seq-data-folder} [email protected]:uploads/{user-specific-ID}/{name-of-project}
```

9. Run the bash script, and upload all files. The default settings will allow you to resume uploads if they are interrupted, and will not overwrite files that are identical in the destination folder.

```bash
# Run script
bash upload_seq_data.sh
```