diff --git a/02b_scripts.md b/02b_scripts.md
index 934d417..dc7c0ed 100644
--- a/02b_scripts.md
+++ b/02b_scripts.md
@@ -14,7 +14,8 @@ where
 ```
-## Assembly with Flye genome assembler
+## Assembly of each line with Flye genome assembler
+
 Runing flye
@@ -24,12 +25,24 @@ flye --nano-raw /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/01
 --genome-size 15m --threads 30 -i 4
 ```
-## Allignmnet
+## Alignment
-### Alliging the insert onto the assemblies with minimap2
+### Aligning the inserts on the assemblies with minimap2
 ```bash
 for n in 4 5 6 7 8 9 ; do
 minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData/inserts-all.fasta /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/02-fly/out_ flye_b$n/assembly.fasta > aln-b$n\.sam " >> minimap2_$n
 ```
+
+### Selecting 10K bp upstream from the insert location on each assembly
+
+We used `samtools` to extract 10kb upstream sections of the assembly for each insert location. We aligned back these sections on the reference genome to estimate the insert locations on the reference genome.
+
+### Aligning the 10kb upstream (from the insert location) sections of the assemblies on the reference genome with minimap2
+
+```bash
+for n in 4 5 6 7 8 9 ; do
+minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna inserts$n-minimap.fasta > aln$n.sam
+```
+From the sam files we can estimate the insert locations on the reference genome.
diff --git a/Notebook_Maryam/03a-alignmentBarcodes13.md b/Notebook_Maryam/03a-alignmentBarcodes13.md
index 4a176d3..fc03c4c 100644
--- a/Notebook_Maryam/03a-alignmentBarcodes13.md
+++ b/Notebook_Maryam/03a-alignmentBarcodes13.md
@@ -3,13 +3,16 @@
 * Nova
 * Jan 21, 2021
+* Note1: I used 100 bp sections first around the insert location. I think 100 bp is not long enough. So later I tried 1000 bp sections.
+* Note2: I first started with both `minimap2` and `gmap` for alignment. However I decided to just continue with `minimap2`.
+
 The goal is to align the insert to the assembled genomes and find the location of inserts.
 #### Inserts
 The format of the insert sequences provided to us is `.dna`. I have to convert them to fasta format first.
-The program that generats `.dna` formats is [SnapGene](https://www.snapgene.com
+The program that generates `.dna` formats is [SnapGene](https://www.snapgene.com
 ). I download a free trial for my local machine and converted all the inserts to `fasta` format.
 * location of inserts: `/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData`
@@ -19,7 +22,7 @@ Inserts:
 ```
 BNU-loxP_sequence.fasta
 BSA4upstream.fasta
-BTDNU-loxP_sequence.fasta
+BTDNU-loxP_sequence.fasta
 BTDN_sequence.fasta
 GSA4_sequence.fasta
 GTDNUdownstream.fasta
@@ -129,7 +132,7 @@ making a new directory for the reads +50bp and -50bp of each insert:
 cd /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/
 mkdir uniq-alignment
-module laod samtools
+module load samtools
 cp ../../../02-fly/out_flye_b13/assembly.fasta ./assembly-b13.fasta
 samtools --indexing faixd assembly-b13.fasta
@@ -210,12 +213,8 @@ sbatch $filename
 running in a for loop.
-```bash
-for n in 13 14 15 16 17 18 19
-```
-
 #### Barcode13 (strain 6):
 I am trying to look at the aligned regions.
@@ -263,7 +262,7 @@ For `2.GTDNUdownstream` I think this is the best alignment `2.GTDNUdownstream 0
 I now proceed to run alignments on read sections of the assembly near the insertion area.
-First I use minimap to find alignemnt of a section of the assembly from x-1000 bp till x bp.
+First I use minimap2 to find the alignment of a section of the assembly from x-1000 bp till x bp.
 * 1.BSA4upstream (x=contig_18 784739)
@@ -394,7 +393,7 @@ less b13-gmap.gff3 | grep -v "#"| awk '{print $9}' | sed 's/:/ /g' | awk '{print
 ID=scaffold_3-2.GTDNUdownstream
 ```
-gmap probably did only picked `GTDNUdownstream` insert because it is more used for long less noisy reads and more sensitive to mismatches. minimap2 is designed for shorter more noisy reads.
+`gmap` probably did only picked `GTDNUdownstream` insert because it is more used for long less noisy reads and more sensitive to mismatches. `minimap2` is designed for shorter more noisy reads.
 ##### Location of each insert respect to the reference genome
@@ -464,7 +463,7 @@ Which was confirmed by minimap2 for `2.GTDNUdownstream`. Still didn't get any al
 * Feb 1 , 2021
 * Nova:/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment
-We decided that gmap is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So From now on I will only focus on minimap alignments.
+We decided that `gmap` is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So From now on I will only focus on `minimap` alignments.
 I have aligned sections of assembly 1000 bp upstream of the insertion on the reference genome to estimate the insertion site on the ref. In some cases the location we got on the ref and assembly did not match. We have 2 pieces of evidence.
@@ -496,5 +495,5 @@ I renamed the sections so it is easier to find them in the sam file.
 Running minimap
 ```bash
 minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/inserts-minimap.fasta /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna > /work/gif/Maryam/projects/Zengyi
--2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/aln-b13.sam
+-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/aln-b13.sam
 ```
diff --git a/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md b/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md
index 491f5a2..236a13c 100644
--- a/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md
+++ b/Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md
@@ -30,9 +30,9 @@ ln -s ../../00-RawData/barcode13/combine-barcode13.fastq .
 I tried aligning the raw reads onto the insert and also tried aligning insert onto the raw reads. I also changed some parameters:
--N to get more alignments on the same target ( default was 5). This is specially important whan I am trying to align the reads on one insert!
+-N to get more alignments on the same target ( default was 5). This is specially important when I am trying to align the reads on one insert!
--p this is the ratio of primary to secondary ratio. by reducing this number we get more alignments, we get some alignment with lower alignment score. The default is 0.8. With p=0.8 we get very few alignments. With 0.1 we got too much noise.
+-p this is the ratio of primary to secondary ratio. by reducing this number we get more alignments, we get some alignment with lower alignment score. The default is 0.8. With p=0.8 we get very few alignments. With 0.1 we got too much noise.
@@ -62,6 +62,3 @@ Now extract the reads from the list from the raw reads.
 module load seqtk
 seqtk subseq combine-barcode13.fastq name.list > rawread-sublist.fastq
 ```
-
-333817 + 15065 = 348882
-333817 + 8873 = 342690
diff --git a/Notebook_Maryam/04-histogramsOfrawReadAlignments2reference.md b/Notebook_Maryam/04b-histogramsOfrawReadAlignments2reference.md
similarity index 98%
rename from Notebook_Maryam/04-histogramsOfrawReadAlignments2reference.md
rename to Notebook_Maryam/04b-histogramsOfrawReadAlignments2reference.md
index 1bfec9e..632ec52 100644
--- a/Notebook_Maryam/04-histogramsOfrawReadAlignments2reference.md
+++ b/Notebook_Maryam/04b-histogramsOfrawReadAlignments2reference.md
@@ -1,5 +1,6 @@
 # Alignment histograms
+* Note: We looked at the histogram of
 ## Barcode 13
 For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` have been integrated.
@@ -8,7 +9,7 @@ For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` ha
 | --- | ---| ---|---|
 |1.BSA4upstream | PQNB01000001.1 777685 |517M484S |contig_18 784739|
 |2.GTDNUdownstream | PQNB01000001.1 333817| 395S112M1D71M1D414M9S|scaffold_3 1109630|
-|2.GTDNUdownstream | | | scaffold_3 1101573|
+|2.GTDNUdownstream | | | scaffold_3 1101573|
 ```
 ./run_alignment.bash 13 1.BSA4upstream
diff --git a/Notebook_Maryam/08-dotplots.md b/Notebook_Maryam/08-dotplots.md
index de5cde3..808727c 100644
--- a/Notebook_Maryam/08-dotplots.md
+++ b/Notebook_Maryam/08-dotplots.md
@@ -41,10 +41,12 @@ I looked at the BUSCO score for the assemblies :
 I think for barcodes 16 and 19, the assembly is a little worse than barcode 13 and 18 ( from the slight increase in the number of missing BUSCOs). So the insertion location of insert 1 on these strains are different from each other among each method. I think that is why the histogram method provides the same location because it is based on the alignment of the reads on the reference genome not the assembly !
+Note: After much more deliberations on the insert locations and trying to understand why the direction of insert 1 is different in assemblies with inserts b13 and b18 versus b16 and b19, we figured that the assemblies are in the reverse directions not the inserts. This will effect how I choose 10Kb upstream and how I need to correct for the insert location. After the necessary modifications, we had insert 1, integrated at the same locations in all cases as expected.
+
 -----------
-## Chromose information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)
+### Chromose information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)
 Chromosome 1 CM000437.1 = NC_009068.1
 Chromosome 2 CP000496.1 = NC_009042.1
diff --git a/Notebook_Maryam/CpOfResults.md b/Notebook_Maryam/ResultsAndConclusion.md
similarity index 88%
rename from Notebook_Maryam/CpOfResults.md
rename to Notebook_Maryam/ResultsAndConclusion.md
index 3b8c0d5..279e349 100644
--- a/Notebook_Maryam/CpOfResults.md
+++ b/Notebook_Maryam/ResultsAndConclusion.md
@@ -1,4 +1,16 @@
-# Results
+# Insert locations
+
+* Note: Our initial thought was adding 10K to the insert location to correct for the length of the section we did alignment with. Remember that we used 10K upstream sections from the insert location on the assemblies and aligned them on the reference genome.
+Because we used 10K upstream, we assumed that the correct location on the reference genome should be correct by adding 10K to the insert location on the reference genome.
+We later figured that we had to correct for the direction of the assembly as well. The numbers in the tables in this section are not corrected for the direction of the assembly and might not match the final results.
+
+
+![+direction](png/insertLocation.png)
+
+![-direction](png/insertLocationReverseDirection.png)
+
+
+
 ## Table 1 , Estimating the insertion locations from the alignment
@@ -57,11 +69,6 @@
-| strain | insert | 10kb upstream location 2019 ref | direction of insert |5 kb upstream location 2019 ref | estimated insert location |
-| --- | --- | --- | --- | --- | ---|
-| 6 (b13) | 1.BSA4upstream | PQNB01000001.1 768684:778196 | + |PQNB01000001.1 773690:778196 |
-||||||||
-| 53 (b16)| 1.BSA4upstream | PQNB01000001.1 778200:787694 | - | PQNB01000001.1 778200:782687|
 *
diff --git a/Notebook_Maryam/ToDo.md b/Notebook_Maryam/ToDo.md
deleted file mode 100644
index 49ee60a..0000000
--- a/Notebook_Maryam/ToDo.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# To Do list
-
-1- QC
-2- Seven genome assembly
-3-Polishing
-4-Align cassettes to the Genome
-5-Find the insert locations for the figures in publication
-6-Deposit the data
diff --git a/Notebook_Maryam/methods.md b/Notebook_Maryam/methods.md
index e18a443..7ba9e7e 100644
--- a/Notebook_Maryam/methods.md
+++ b/Notebook_Maryam/methods.md
@@ -4,7 +4,7 @@
 ## Sample Collection
 ## Sequence Analysis
-In order to estimate the insert locations, all strains have beed subject to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and locate the inserts on that genome as will be discussed further bellow.
+In order to estimate the insert locations, all strains have beed subject to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and locate the inserts on that genome as will be discussed further bellow.
 ## QC and trimming
@@ -22,6 +22,6 @@ We used Flye (v.2.8.2-b1691) for de novo assembly of each of the strain's genome
 ## Comparing genomes using dotplot
 We used [re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/) to compare the reference genome and assembled genome for each strain.
+
 =======
 In order to visualize pairwise comparison between sequences we used a desktop app ([re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/) ) from Babraham Bioinformatics to plot interactively and also save the plots as images. We made pairwise comparisons between the reference genome and each of the seven assembled genomes to look for insert location and for any genome re-arrangements. In order to check the location Additionally we also plotted each assembly against the relevant insert sequences. For detailed information; Simon Andrews, the developer has a really good [video tutorial](https://www.youtube.com/watch?v=qPxl2hflG9Q&feature=emb_logo).
-
diff --git a/Notebook_Maryam/png/insertLocation.png b/Notebook_Maryam/png/insertLocation.png
new file mode 100644
index 0000000..b548346
Binary files /dev/null and b/Notebook_Maryam/png/insertLocation.png differ
diff --git a/Notebook_Maryam/png/insertLocationReverseDirection.png b/Notebook_Maryam/png/insertLocationReverseDirection.png
new file mode 100644
index 0000000..52d7e34
Binary files /dev/null and b/Notebook_Maryam/png/insertLocationReverseDirection.png differ
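
The `02b_scripts.md` changes in this diff say that `samtools` was used to extract 10kb upstream sections around each insert location, but the command itself is not shown. A minimal sketch of how the region could be built, assuming the section runs from x-len to x as in the notebook's "x-1000 bp till x bp" wording and that the assembly is in the forward direction. The helper name `upstream_region` and the output file name are hypothetical, and the reverse-direction case described in `08-dotplots.md` is not handled here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the notebook): print a samtools-style
# region string "contig:start-end" covering len bp upstream of position x.
upstream_region() {
    local contig=$1 pos=$2 len=${3:-10000}
    local start=$(( pos - len ))
    # Clamp to the first base when the insert sits near the contig start.
    if (( start < 1 )); then start=1; fi
    printf '%s:%d-%d\n' "$contig" "$start" "$pos"
}

# Possible usage on an indexed assembly (assumed file names, not run here):
# samtools faidx assembly-b13.fasta "$(upstream_region contig_18 784739)" > upstream-b13.fasta
```

For `1.BSA4upstream` on barcode 13 (x=contig_18 784739), this gives the region `contig_18:774739-784739`.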