repo cleanup part 2

ISUgenomics · Aug 9, 2021 · 0aa4508 · 0aa4508
1 parent 6073b95
commit 0aa4508
Showing 10 changed files with 48 additions and 37 deletions.
diff --git a/02b_scripts.md b/02b_scripts.md
@@ -14,7 +14,8 @@ where
 
 ```
 
-## Assembly with Flye genome assembler
+## Assembly of each line with Flye genome assembler
+
 
 Running flye
 
@@ -24,12 +25,24 @@ flye --nano-raw /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/01
  --genome-size 15m --threads 30 -i 4
  ```
 
-## Allignmnet
+## Alignment
 
-### Alliging the insert onto the assemblies with minimap2
+### Aligning the inserts on the assemblies with minimap2
 
 ```bash
 for n in 4 5 6 7 8 9 ; do
   # the insert FASTA is the minimap2 target here and each assembly is the query
   minimap2 -aLx map-ont  /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData/inserts-all.fasta  /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/02-fly/out_flye_b$n/assembly.fasta  > aln-b$n.sam
 done
 ```
+
+### Selecting 10K bp upstream from the insert location on each assembly
+
+We used `samtools` to extract the 10 kb section of the assembly upstream of each insert location. We then aligned these sections back to the reference genome to estimate the insert locations on the reference.
+
+### Aligning the 10kb upstream (from the insert location) sections of the assemblies on the reference genome with minimap2
+
+```bash
+for n in 4 5 6 7 8 9 ; do
+minimap2 -aLx map-ont  /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna  inserts$n-minimap.fasta > aln$n.sam
+done
+```
+From the sam files we can estimate the insert locations on the reference genome.
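
The window selection described in this hunk can be sketched in plain bash. This is only an illustration, not the command that was actually run: `upstream_region` is a hypothetical helper, and the window is assumed to run from `x - len` to `x`, following the "x-1000 bp till x bp" wording used elsewhere in the notebook. The contig and position are the barcode 13 values recorded in `03a-alignmentBarcodes13.md`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print a samtools-style
# region string covering `len` bp upstream of position `pos` on `contig`.
upstream_region() {
    local contig=$1 pos=$2 len=$3
    local start=$(( pos - len ))
    # samtools regions are 1-based, so clamp windows that run off the contig start.
    if (( start < 1 )); then
        start=1
    fi
    printf '%s:%d-%d\n' "$contig" "$start" "$pos"
}

# Example with the contig_18 location from the barcode 13 notebook.
upstream_region contig_18 784739 10000

# Assumed usage with samtools (needs samtools and an indexed FASTA):
#   samtools faidx assembly-b13.fasta "$(upstream_region contig_18 784739 10000)"
```

Keeping the coordinate arithmetic in one function makes it easy to switch between the 100 bp, 1000 bp and 10 kb windows that were tried.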
diff --git a/Notebook_Maryam/03a-alignmentBarcodes13.md b/Notebook_Maryam/03a-alignmentBarcodes13.md
@@ -3,13 +3,16 @@
 * Nova
 * Jan 21, 2021
 
+* Note1: I used 100 bp sections first around the insert location. I think 100 bp is not long enough. So later I tried 1000 bp sections.
+* Note2: I first started with both `minimap2` and `gmap` for alignment. However I decided to just continue with `minimap2`.
+
 The goal is to align the insert to the assembled genomes and find the location of inserts.
 
 #### Inserts
 
 The format of the insert sequences provided to us is `.dna`. I have to convert them to fasta format first.
 
-The program that generats `.dna` formats is [SnapGene](https://www.snapgene.com
+The program that generates `.dna` formats is [SnapGene](https://www.snapgene.com
   ). I download a free trial for my local machine and converted all the inserts to `fasta` format.
 
 * location of inserts: `/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData`
@@ -19,7 +22,7 @@ Inserts:
 ```
 BNU-loxP_sequence.fasta
 BSA4upstream.fasta
-BTDNU-loxP_sequence.fasta  
+BTDNU-loxP_sequence.fasta
 BTDN_sequence.fasta
 GSA4_sequence.fasta
 GTDNUdownstream.fasta
@@ -129,7 +132,7 @@ making a new directory for the reads +50bp and -50bp of each insert:
 cd /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/
 mkdir uniq-alignment
 
-module laod samtools
+module load samtools
 cp ../../../02-fly/out_flye_b13/assembly.fasta ./assembly-b13.fasta
 
 samtools faidx assembly-b13.fasta
@@ -210,12 +213,8 @@ sbatch $filename
 
 running in a for loop.
 
-```bash
-for n in 13 14 15 16 17 18 19
 
 
-```
-
 ####  Barcode13 (strain 6):
 
 I am trying to look at the aligned regions.
@@ -263,7 +262,7 @@ For `2.GTDNUdownstream` I think this is the best alignment `2.GTDNUdownstream 0
 
 I now proceed to run alignments on read sections of the assembly near the insertion area.
 
-First I use minimap to find alignemnt of a section of the assembly from  x-1000 bp till x bp.
+First I use minimap2 to find the alignment of a section of the assembly from  x-1000 bp till x bp.
 
 * 1.BSA4upstream (x=contig_18 784739)
 
@@ -394,7 +393,7 @@ less b13-gmap.gff3 | grep -v "#"| awk '{print $9}' | sed 's/:/ /g' | awk '{print
 ID=scaffold_3-2.GTDNUdownstream
 ```
 
-gmap probably did only picked `GTDNUdownstream` insert because it is more used for long less noisy reads and more sensitive to mismatches. minimap2 is designed for shorter more noisy reads.
+`gmap` probably picked only the `GTDNUdownstream` insert because it is intended for less noisy sequences and is more sensitive to mismatches. `minimap2` is designed for long, noisy reads.
 
 ##### Location of each insert respect to the reference genome
 
@@ -464,7 +463,7 @@ Which was confirmed by minimap2 for `2.GTDNUdownstream`. Still didn't get any al
 * Feb 1 , 2021
 * Nova:/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment
 
-We decided that gmap is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So From now on I will only focus on minimap alignments.
+We decided that `gmap` is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So from now on I will only focus on `minimap2` alignments.
 
 
 I have aligned sections of assembly 1000 bp upstream of the insertion on the reference genome to estimate the insertion site on the ref. In some cases the location we got on the ref and assembly did not match. We have 2 pieces of evidence.
@@ -496,5 +495,5 @@ I renamed the sections so it is easier to find them in the sam file.
 Running minimap
 ```bash
 minimap2 -aLx map-ont  /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/inserts-minimap.fasta  /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna > /work/gif/Maryam/projects/Zengyi
--2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/aln-b13.sam 
+-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/aln-b13.sam
 ```
diff --git a/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md b/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md
@@ -30,9 +30,9 @@ ln -s ../../00-RawData/barcode13/combine-barcode13.fastq .
 I tried aligning the raw reads onto the insert and also tried aligning insert onto the raw reads. I also changed some parameters:
 
 
--N to get more alignments on the same target ( default was 5). This is specially important whan I am trying to align the reads on one insert!
+-N sets how many secondary alignments to retain (the default was 5). Raising it gives more alignments on the same target. This is especially important when I am trying to align the reads on one insert!
 
--p this is the ratio of primary to secondary ratio. by reducing this number we get more alignments, we get some alignment with lower alignment score. The default  is 0.8. With p=0.8 we get very few alignments. With 0.1 we got too much noise.  
+-p sets the minimum secondary-to-primary score ratio. Reducing this number gives more alignments, including some with a lower alignment score. The default is 0.8. With p=0.8 we get very few alignments. With 0.1 we got too much noise.
 
 
 
@@ -62,6 +62,3 @@ Now extract the reads from the list from the raw reads.
 module load seqtk
 seqtk subseq combine-barcode13.fastq name.list > rawread-sublist.fastq
 ```
-
-333817 + 15065 = 348882
-333817 + 8873 = 342690
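
How `name.list` was produced is not shown in this hunk. As a rough sketch, assuming the raw reads were the query in the minimap2 run, the names of mapped reads could be pulled from the SAM output as follows. `mapped_read_names` is a hypothetical helper, and the SAM records in the example are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read SAM on stdin and print each mapped read name once.
mapped_read_names() {
    awk -F'\t' '
        /^@/ { next }                  # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }  # FLAG bit 0x4 set: read is unmapped
        !seen[$1]++ { print $1 }       # keep the first occurrence of each name
    '
}

# Toy input: r1 has a primary and a supplementary record, r2 is unmapped.
printf '@HD\tVN:1.6\nr1\t0\tins\t1\nr1\t2048\tins\t50\nr2\t4\t*\t0\nr3\t16\tins\t9\n' |
    mapped_read_names

# Assumed usage ahead of the seqtk step above (the SAM file name is a placeholder):
#   mapped_read_names < aln.sam > name.list
```

The bit test uses integer division rather than gawk's `and()` so that it also runs under other awk implementations.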
diff --git a/...istogramsOfrawReadAlignments2reference.md → ...istogramsOfrawReadAlignments2reference.md b/...istogramsOfrawReadAlignments2reference.md → ...istogramsOfrawReadAlignments2reference.md
@@ -1,5 +1,6 @@
 # Alignment histograms
 
+* Note: We looked at the histogram of 
 ## Barcode 13
 
 For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` have been integrated.
@@ -8,7 +9,7 @@ For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` ha
 | --- |  ---| ---|---|
 |1.BSA4upstream  |  PQNB01000001.1 777685 |517M484S |contig_18 784739|
 |2.GTDNUdownstream |   PQNB01000001.1 333817| 395S112M1D71M1D414M9S|scaffold_3 1109630|
-|2.GTDNUdownstream | | | scaffold_3 1101573| 
+|2.GTDNUdownstream | | | scaffold_3 1101573|
 
 ```
 ./run_alignment.bash 13 1.BSA4upstream

diff --git a/Notebook_Maryam/08-dotplots.md b/Notebook_Maryam/08-dotplots.md
@@ -41,10 +41,12 @@ I looked at the BUSCO score for the assemblies :
 
 I think for barcodes 16 and 19, the assembly is a little worse than barcode 13 and 18 ( from the slight increase in the number of missing BUSCOs). So the insertion location of insert 1 on these strains are different from each other among each method. I think that is why the histogram method provides the same location because it is based on the alignment of the reads on the reference genome not the assembly !
 
+Note: After much more deliberation on the insert locations, and after trying to understand why the direction of insert 1 differs between the b13 and b18 assemblies and the b16 and b19 assemblies, we figured out that the assemblies are in reverse directions, not the inserts. This will affect how I choose the 10 kb upstream section and how I need to correct the insert location. After the necessary modifications, insert 1 was integrated at the same location in all cases, as expected.
+
 
 
 -----------
-## Chromose information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)
+### Chromosome information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)
 
 Chromosome 1	CM000437.1	=	NC_009068.1
 Chromosome 2	CP000496.1	=	NC_009042.1

diff --git a/Notebook_Maryam/CpOfResults.md → Notebook_Maryam/ResultsAndConclusion.md b/Notebook_Maryam/CpOfResults.md → Notebook_Maryam/ResultsAndConclusion.md
@@ -1,4 +1,16 @@
-# Results
+# Insert locations
+
+* Note: Our initial thought was to add 10K to the insert location to correct for the length of the section we aligned. Remember that we took 10K upstream sections from the insert location on the assemblies and aligned them on the reference genome.
+Because we used 10K upstream, we assumed that the location on the reference genome should be corrected by adding 10K to the insert location on the reference genome.
+We later figured out that we had to correct for the direction of the assembly as well. The numbers in the tables in this section are not corrected for the direction of the assembly and might not match the final results.
+
+
+![+direction](png/insertLocation.png)
+
+![-direction](png/insertLocationReverseDirection.png)
+
+
+
 
 ## Table 1 , Estimating the insertion locations from the alignment
 
@@ -57,11 +69,6 @@
 
 
 
-| strain | insert | 10kb upstream location 2019 ref | direction of insert |5 kb upstream location 2019 ref | estimated insert location |
-| --- | --- | --- | --- | --- | ---|
-| 6 (b13) | 1.BSA4upstream | PQNB01000001.1 768684:778196 | + |PQNB01000001.1 773690:778196 |
-||||||||
-| 53 (b16)| 1.BSA4upstream | PQNB01000001.1 778200:787694 | - | PQNB01000001.1 	778200:782687|
 
 
 *
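
The direction correction described in the note at the top of `ResultsAndConclusion.md` can be sketched as follows. This is one plausible reading, not necessarily the exact correction that was applied: it assumes the upstream section ends at the insert, so a forward-strand hit places the insert at the rightmost aligned reference base and a reverse-strand hit (SAM FLAG bit 16) places it at the leftmost one. `estimate_insert_pos` is a hypothetical helper, and the inputs are the uncorrected coordinates from the table removed in this commit, used only as illustrative values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate the insert position on the reference from the
# SAM FLAG, the leftmost POS, and the aligned reference length of a section.
estimate_insert_pos() {
    local flag=$1 pos=$2 ref_len=$3
    if (( flag & 16 )); then
        # Reverse-strand hit: the insert-proximal end maps to the leftmost base.
        echo "$pos"
    else
        # Forward-strand hit: the insert-proximal end is the rightmost aligned base.
        echo $(( pos + ref_len - 1 ))
    fi
}

# b13, "+" direction: section spans PQNB01000001.1 768684:778196
estimate_insert_pos 0 768684 9513
# b16, "-" direction: section spans PQNB01000001.1 778200:787694
estimate_insert_pos 16 778200 9495
```

Under this reading the two strains land within a few bases of each other, whereas adding a fixed 10K to the leftmost position regardless of direction would put them roughly 10 kb apart.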

diff --git a/Notebook_Maryam/ToDo.md b/Notebook_Maryam/ToDo.md
diff --git a/Notebook_Maryam/methods.md b/Notebook_Maryam/methods.md
@@ -4,7 +4,7 @@
 ## Sample Collection
 
 ## Sequence Analysis
-In order to estimate the insert locations, all strains have beed subject to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and locate the inserts on that genome as will be discussed further bellow.   
+In order to estimate the insert locations, all strains have been subjected to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and located the inserts on that genome, as discussed further below.
 
 ## QC and trimming
 
@@ -22,6 +22,6 @@ We used Flye (v.2.8.2-b1691) for de novo assembly of each of the strain's genome
 ## Comparing genomes using dotplot
 
 We used [re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/) to compare the reference genome and assembled genome for each strain.
+
 =======
 In order to visualize pairwise comparisons between sequences we used a desktop app ([re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/)) from Babraham Bioinformatics to plot interactively and also save the plots as images. We made pairwise comparisons between the reference genome and each of the seven assembled genomes to look for insert locations and for any genome rearrangements. To check the insert locations, we also plotted each assembly against the relevant insert sequences. For detailed information, Simon Andrews, the developer, has a really good [video tutorial](https://www.youtube.com/watch?v=qPxl2hflG9Q&feature=emb_logo).
-
diff --git a/Notebook_Maryam/png/insertLocation.png b/Notebook_Maryam/png/insertLocation.png
diff --git a/Notebook_Maryam/png/insertLocationReverseDirection.png b/Notebook_Maryam/png/insertLocationReverseDirection.png