repo cleanup part 2
Maria-ISU committed Aug 9, 2021
1 parent 6073b95 commit 0aa4508
Showing 10 changed files with 48 additions and 37 deletions.
19 changes: 16 additions & 3 deletions 02b_scripts.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@ where

```

## Assembly with Flye genome assembler
## Assembly of each line with Flye genome assembler


Running flye

Expand All @@ -24,12 +25,24 @@ flye --nano-raw /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/01
--genome-size 15m --threads 30 -i 4
```
## Allignmnet
## Alignment
### Alliging the insert onto the assemblies with minimap2
### Aligning the inserts on the assemblies with minimap2
```bash
for n in 4 5 6 7 8 9 ; do
# write each minimap2 command to its own file (minimap2_4, minimap2_5, ...)
echo "minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData/inserts-all.fasta /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/02-fly/out_flye_b$n/assembly.fasta > aln-b$n.sam" >> minimap2_$n
done
```
### Selecting 10 kb upstream of the insert location on each assembly
We used `samtools` to extract the 10 kb section of the assembly upstream of each insert location. We then aligned these sections back to the reference genome to estimate the insert locations on the reference genome.
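
A minimal sketch of how the 10 kb upstream region could be written as a `samtools faidx` region string, assuming the assembly is in the forward orientation and that "upstream" means the bases immediately before the insert position. The function name `upstream_region` is hypothetical; the example coordinate (`contig_18`, 784739) is the insert location noted for barcode 13.

```bash
# Hypothetical helper: print a samtools-style region (contig:start-end)
# covering `len` bp upstream of position `pos`, clamped at base 1.
upstream_region() {
  local contig="$1"
  local pos="$2"
  local len="${3:-10000}"
  local start=$(( pos - len ))
  if [ "$start" -lt 1 ]; then start=1; fi
  printf '%s:%d-%d\n' "$contig" "$start" "$pos"
}

upstream_region contig_18 784739 10000   # prints contig_18:774739-784739
# The region string could then be passed to samtools, for example:
# samtools faidx assembly-b13.fasta "$(upstream_region contig_18 784739)"
```

Clamping at 1 keeps the region valid when an insert sits closer than 10 kb to the start of a contig.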
### Aligning the 10kb upstream (from the insert location) sections of the assemblies on the reference genome with minimap2
```bash
for n in 4 5 6 7 8 9 ; do
minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna inserts$n-minimap.fasta > aln$n.sam
done
```
From the SAM files we can estimate the insert locations on the reference genome.
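
One way to read the locations out of a SAM file is sketched below, assuming the standard tab-separated SAM columns (query name, flag, reference name, 1-based position, MAPQ, CIGAR). The function name `sam_locations` is hypothetical and not part of the original workflow.

```bash
# Hypothetical helper: print query name, flag, reference name, 1-based
# leftmost position and CIGAR for every mapped record in a SAM file.
sam_locations() {
  awk -F'\t' -v OFS='\t' '
    /^@/ { next }          # skip header lines
    $3 == "*" { next }     # skip records with no reference (unmapped)
    { print $1, $2, $3, $4, $6 }
  ' "$@"
}

# Usage: sam_locations aln4.sam
```

The position column gives the leftmost aligned base on the reference, so the CIGAR string is still needed to see how much of the section aligned and where soft clipping begins.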
21 changes: 10 additions & 11 deletions Notebook_Maryam/03a-alignmentBarcodes13.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,16 @@
* Nova
* Jan 21, 2021

* Note1: I first used 100 bp sections around the insert location. I think 100 bp is not long enough, so I later tried 1000 bp sections.
* Note2: I first started with both `minimap2` and `gmap` for alignment. However, I decided to continue with `minimap2` only.

The goal is to align the insert to the assembled genomes and find the location of inserts.

#### Inserts

The format of the insert sequences provided to us is `.dna`. I have to convert them to fasta format first.

The program that generats `.dna` formats is [SnapGene](https://www.snapgene.com
The program that generates `.dna` formats is [SnapGene](https://www.snapgene.com
). I downloaded a free trial for my local machine and converted all the inserts to `fasta` format.

* location of inserts: `/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/00-RawData`
Expand All @@ -19,7 +22,7 @@ Inserts:
```
BNU-loxP_sequence.fasta
BSA4upstream.fasta
BTDNU-loxP_sequence.fasta
BTDNU-loxP_sequence.fasta
BTDN_sequence.fasta
GSA4_sequence.fasta
GTDNUdownstream.fasta
Expand Down Expand Up @@ -129,7 +132,7 @@ making a new directory for the reads +50bp and -50bp of each insert:
cd /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/
mkdir uniq-alignment

module laod samtools
module load samtools
cp ../../../02-fly/out_flye_b13/assembly.fasta ./assembly-b13.fasta

samtools faidx assembly-b13.fasta
Expand Down Expand Up @@ -210,12 +213,8 @@ sbatch $filename

running in a for loop.

```bash
for n in 13 14 15 16 17 18 19


```
#### Barcode13 (strain 6):

I am trying to look at the aligned regions.
Expand Down Expand Up @@ -263,7 +262,7 @@ For `2.GTDNUdownstream` I think this is the best alignment `2.GTDNUdownstream 0
I now proceed to run alignments on read sections of the assembly near the insertion area.
First I use minimap to find alignemnt of a section of the assembly from x-1000 bp till x bp.
First I use minimap2 to find the alignment of a section of the assembly from x-1000 bp till x bp.
* 1.BSA4upstream (x=contig_18 784739)
Expand Down Expand Up @@ -394,7 +393,7 @@ less b13-gmap.gff3 | grep -v "#"| awk '{print $9}' | sed 's/:/ /g' | awk '{print
ID=scaffold_3-2.GTDNUdownstream
```

gmap probably did only picked `GTDNUdownstream` insert because it is more used for long less noisy reads and more sensitive to mismatches. minimap2 is designed for shorter more noisy reads.
`gmap` probably picked up only the `GTDNUdownstream` insert because it is meant for less noisy sequences and is more sensitive to mismatches, whereas `minimap2` (with `-x map-ont`) is designed for long, noisy nanopore reads.

##### Location of each insert with respect to the reference genome

Expand Down Expand Up @@ -464,7 +463,7 @@ Which was confirmed by minimap2 for `2.GTDNUdownstream`. Still didn't get any al
* Feb 1 , 2021
* Nova:/work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment

We decided that gmap is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So From now on I will only focus on minimap alignments.
We decided that `gmap` is not suitable for aligning nanopore reads. It is designed for shorter reads that are less noisy. So from now on I will focus only on `minimap2` alignments.


I have aligned sections of assembly 1000 bp upstream of the insertion on the reference genome to estimate the insertion site on the ref. In some cases the location we got on the ref and assembly did not match. We have 2 pieces of evidence.
Expand Down Expand Up @@ -496,5 +495,5 @@ I renamed the sections so it is easier to find them in the sam file.
Running minimap
```bash
minimap2 -aLx map-ont /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/inserts-minimap.fasta /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/GCA_006942115.1_ASM694211v1_genomic.fna > /work/gif/Maryam/projects/Zengyi-2021-ScheffersomycesStipitis/03-alignment/uniq-alignemnt/b13/aln-b13.sam
```
7 changes: 2 additions & 5 deletions Notebook_Maryam/04a-b13rawReadAlignment2RefGenome.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,9 +30,9 @@ ln -s ../../00-RawData/barcode13/combine-barcode13.fastq .
I tried aligning the raw reads onto the insert and also tried aligning insert onto the raw reads. I also changed some parameters:


-N to get more alignments on the same target ( default was 5). This is specially important whan I am trying to align the reads on one insert!
`-N` sets the maximum number of secondary alignments to output for each query (the default is 5). Raising it is especially important when I am trying to align the reads onto a single insert!

-p this is the ratio of primary to secondary ratio. by reducing this number we get more alignments, we get some alignment with lower alignment score. The default is 0.8. With p=0.8 we get very few alignments. With 0.1 we got too much noise.
`-p` is the minimum secondary-to-primary score ratio for reporting secondary alignments. Reducing this number gives more alignments, including some with lower alignment scores. The default is 0.8. With `-p 0.8` we got very few alignments; with `-p 0.1` we got too much noise.
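
A sketch of how several `-p` settings could be lined up for comparison. It only prints the commands and does not run `minimap2`. The function name `minimap2_sweep`, the file name `insert.fasta`, the output names `aln-p*.sam`, the value `-N 20` and the intermediate value 0.5 are all placeholders, not settings taken from the notebook.

```bash
# Hypothetical helper: print one minimap2 command per -p value so the
# settings can be compared side by side. Nothing is executed here.
minimap2_sweep() {
  local target="$1"
  local query="$2"
  local n_secondary="$3"
  local p
  shift 3
  for p in "$@"; do
    printf 'minimap2 -ax map-ont -N %s -p %s %s %s > aln-p%s.sam\n' \
      "$n_secondary" "$p" "$target" "$query" "$p"
  done
}

# Placeholder inputs: insert.fasta and -N 20 are assumed for illustration.
minimap2_sweep insert.fasta combine-barcode13.fastq 20 0.8 0.5 0.1
```

Printing the commands first makes it easy to review them or redirect them into a job script before anything is submitted.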



Expand Down Expand Up @@ -62,6 +62,3 @@ Now extract the reads from the list from the raw reads.
module load seqtk
seqtk subseq combine-barcode13.fastq name.list > rawread-sublist.fastq
```

333817 + 15065 = 348882
333817 + 8873 = 342690
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# Alignment histograms

* Note: We looked at the histogram of
## Barcode 13

For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` have been integrated.
Expand All @@ -8,7 +9,7 @@ For this strain we know that inserts `1.BSA4upstream` and `2.GTDNUdownstream` ha
| --- | ---| ---|---|
|1.BSA4upstream | PQNB01000001.1 777685 |517M484S |contig_18 784739|
|2.GTDNUdownstream | PQNB01000001.1 333817| 395S112M1D71M1D414M9S|scaffold_3 1109630|
|2.GTDNUdownstream | | | scaffold_3 1101573|
|2.GTDNUdownstream | | | scaffold_3 1101573|

```
./run_alignment.bash 13 1.BSA4upstream
Expand Down
4 changes: 3 additions & 1 deletion Notebook_Maryam/08-dotplots.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,10 +41,12 @@ I looked at the BUSCO score for the assemblies :

I think for barcodes 16 and 19, the assembly is a little worse than for barcodes 13 and 18 (judging from the slight increase in the number of missing BUSCOs). So the insertion locations of insert 1 in these strains differ from each other depending on the method. I think that is why the histogram method provides the same location: it is based on the alignment of the reads to the reference genome, not to the assembly!

Note: After much more deliberation on the insert locations, and after trying to understand why the direction of insert 1 differs between the assemblies for b13 and b18 and those for b16 and b19, we figured out that the assemblies are in reverse directions, not the inserts. This affects how I choose the 10 kb upstream section and how I need to correct the insert location. After the necessary modifications, insert 1 was integrated at the same location in all cases, as expected.



-----------
## Chromose information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)
### Chromosome information based on 2007 genome [ASM20916v1](https://www.ncbi.nlm.nih.gov/assembly/GCF_000209165.1)

Chromosome 1 CM000437.1 = NC_009068.1
Chromosome 2 CP000496.1 = NC_009042.1
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,16 @@
# Results
# Insert locations

* Note: Our initial thought was to add 10 kb to the insert location to correct for the length of the section we aligned. Remember that we took 10 kb sections upstream of the insert location on the assemblies and aligned them to the reference genome.
Because we used 10 kb upstream, we assumed that the location on the reference genome should be corrected by adding 10 kb to the aligned position on the reference genome.
We later figured out that we had to correct for the direction of the assembly as well. The numbers in the tables in this section are not corrected for the direction of the assembly and might not match the final results.
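
The direction correction can be sketched as follows, under the assumption that the upstream section ends at the insert when the assembly has the same orientation as the reference and begins at the insert when the assembly is reversed. The function name `estimate_insert_pos` is hypothetical, and this is not necessarily the exact correction used for the final results. The example intervals are the 10 kb upstream alignments recorded for strains 6 (b13) and 53 (b16).

```bash
# Hypothetical helper: given the reference interval covered by the upstream
# section and the assembly direction, print the estimated insert position.
estimate_insert_pos() {
  local ref_start="$1"
  local ref_end="$2"
  local direction="$3"
  case "$direction" in
    +) printf '%d\n' "$ref_end" ;;    # section ends at the insert
    -) printf '%d\n' "$ref_start" ;;  # reversed: section starts at the insert
    *) echo "direction must be + or -" >&2; return 1 ;;
  esac
}

estimate_insert_pos 768684 778196 +   # prints 778196
estimate_insert_pos 778200 787694 -   # prints 778200
```

With this rule the two strains land within a few bases of each other on PQNB01000001.1, which is consistent with insert 1 being integrated at the same location.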


![+direction](png/insertLocation.png)

![-direction](png/insertLocationReverseDirection.png)




## Table 1 , Estimating the insertion locations from the alignment

Expand Down Expand Up @@ -57,11 +69,6 @@



| strain | insert | 10kb upstream location 2019 ref | direction of insert |5 kb upstream location 2019 ref | estimated insert location |
| --- | --- | --- | --- | --- | ---|
| 6 (b13) | 1.BSA4upstream | PQNB01000001.1 768684:778196 | + |PQNB01000001.1 773690:778196 |
||||||||
| 53 (b16)| 1.BSA4upstream | PQNB01000001.1 778200:787694 | - | PQNB01000001.1 778200:782687|


*
Expand Down
8 changes: 0 additions & 8 deletions Notebook_Maryam/ToDo.md

This file was deleted.

4 changes: 2 additions & 2 deletions Notebook_Maryam/methods.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
## Sample Collection

## Sequence Analysis
In order to estimate the insert locations, all strains have beed subject to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and locate the inserts on that genome as will be discussed further bellow.
In order to estimate the insert locations, all strains have been subjected to nanopore sequencing using [GridION X5](https://nanoporetech.com/products/gridion) from Oxford Nanopore Technologies. For each strain we assembled a genome and located the inserts on that genome, as discussed further below.

## QC and trimming

Expand All @@ -22,6 +22,6 @@ We used Flye (v.2.8.2-b1691) for de novo assembly of each of the strain's genome
## Comparing genomes using dotplot

We used [re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/) to compare the reference genome and assembled genome for each strain.

In order to visualize pairwise comparisons between sequences, we used a desktop app ([re-DOT-able](https://www.bioinformatics.babraham.ac.uk/projects/redotable/)) from Babraham Bioinformatics to plot interactively and to save the plots as images. We made pairwise comparisons between the reference genome and each of the seven assembled genomes to look for insert locations and for any genome rearrangements. To check the locations, we additionally plotted each assembly against the relevant insert sequences. For detailed information, Simon Andrews, the developer, has a really good [video tutorial](https://www.youtube.com/watch?v=qPxl2hflG9Q&feature=emb_logo).

Binary file added Notebook_Maryam/png/insertLocation.png