Here, we will create a list of TxDb objects from a list
of GRanges objects using the function
-makeTxDbFromGRanges from txdbmaker.
+makeTxDbFromGRanges() from txdbmaker.
Importantly, to create a TxDb from a GRanges,
the GRanges object must contain genomic coordinates for all
features, including transcripts, exons, etc. Because of that, we will
@@ -687,53 +687,6 @@
library(txdbmaker)
-#> Loading required package: BiocGenerics
-#>
-#> Attaching package: 'BiocGenerics'
-#> The following objects are masked from 'package:stats':
-#>
-#> IQR, mad, sd, var, xtabs
-#> The following objects are masked from 'package:base':
-#>
-#> anyDuplicated, aperm, append, as.data.frame, basename, cbind,
-#> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
-#> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
-#> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
-#> Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
-#> tapply, union, unique, unsplit, which.max, which.min
-#> Loading required package: S4Vectors
-#> Loading required package: stats4
-#>
-#> Attaching package: 'S4Vectors'
-#> The following object is masked from 'package:utils':
-#>
-#> findMatches
-#> The following objects are masked from 'package:base':
-#>
-#> expand.grid, I, unname
-#> Loading required package: GenomeInfoDb
-#> Loading required package: IRanges
-#> Loading required package: GenomicRanges
-#> Loading required package: GenomicFeatures
-#> Loading required package: AnnotationDbi
-#> Loading required package: Biobase
-#> Welcome to Bioconductor
-#>
-#> Vignettes contain introductory material; view with
-#> 'browseVignettes()'. To cite Bioconductor, see
-#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
-#>
-#> Attaching package: 'txdbmaker'
-#> The following objects are masked from 'package:GenomicFeatures':
-#>
-#> browseUCSCtrack, getChromInfoFromBiomart, makeFDbPackageFromUCSC,
-#> makeFeatureDbFromUCSC, makePackageName, makeTxDb,
-#> makeTxDbFromBiomart, makeTxDbFromEnsembl, makeTxDbFromGFF,
-#> makeTxDbFromGRanges, makeTxDbFromUCSC, makeTxDbPackage,
-#> makeTxDbPackageFromBiomart, makeTxDbPackageFromUCSC,
-#> supportedMiRBaseBuildValues, supportedUCSCFeatureDbTables,
-#> supportedUCSCFeatureDbTracks, supportedUCSCtables,
-#> UCSCFeatureDbTableSchema
# Create a list of `TxDb` objects from a list of `GRanges` objects
txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
txdb_list
@@ -744,7 +697,7 @@
The full scheme (SSD
#> # Genome: NA
#> # Nb of transcripts: 6631
#> # Db created by: txdbmaker package from Bioconductor
-#> # Creation time: 2024-07-25 09:05:44 +0000 (Thu, 25 Jul 2024)
+#> # Creation time: 2024-10-02 09:49:23 +0000 (Wed, 02 Oct 2024)
#> # txdbmaker version at creation time: 1.1.1
#> # RSQLite version at creation time: 2.3.7
#> # DBSCHEMAVERSION: 1.2
@@ -756,7 +709,7 @@
The full scheme (SSD
#> # Genome: NA
#> # Nb of transcripts: 5389
#> # Db created by: txdbmaker package from Bioconductor
-#> # Creation time: 2024-07-25 09:05:45 +0000 (Thu, 25 Jul 2024)
+#> # Creation time: 2024-10-02 09:49:23 +0000 (Wed, 02 Oct 2024)
#> # txdbmaker version at creation time: 1.1.1
#> # RSQLite version at creation time: 2.3.7
#> # DBSCHEMAVERSION: 1.2
Importantly, pairs2kaks() expects all genes in the gene
+pairs to be present in the CDS, with matching names. Species
+abbreviations in gene pairs (added by syntenet)
+are automatically removed, so you should not add them to the sequence
+names of your CDS.
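
The name-matching requirement above can be checked before running the analysis. The following is a minimal base R sketch on toy data, not part of the package: the objects `pairs` and `cds_names`, the abbreviation `"Sce"`, and the `"<abbreviation>_"` prefix format are all assumptions made for illustration.

```r
# Toy gene pairs carrying a hypothetical species abbreviation ("Sce_")
pairs <- data.frame(
    dup1 = c("Sce_YGR032W", "Sce_YNL160W"),
    dup2 = c("Sce_YLR342W", "Sce_YDR077W")
)

# Toy CDS sequence names, without species abbreviations
cds_names <- c("YGR032W", "YLR342W", "YNL160W")

# Strip the assumed "<abbreviation>_" prefix from gene IDs
abbrev <- "Sce"
strip_prefix <- function(x) sub(paste0("^", abbrev, "_"), "", x)

# Genes that appear in the pairs but are absent from the CDS names
genes_in_pairs <- unique(strip_prefix(c(pairs$dup1, pairs$dup2)))
missing_genes <- setdiff(genes_in_pairs, cds_names)
missing_genes
```

Stripping a known abbreviation, rather than any short prefix followed by an underscore, avoids damaging gene IDs that themselves contain underscores (for example, IDs starting with `GLYMA_`).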
Age groups can also be used to identify SD gene pairs that likely
+originated from whole-genome duplications. The rationale here is that
+segmental duplicates with Ks values near Ks peaks (indicating WGD
+events) were likely created by such WGDs. By similar logic, SD pairs
+with Ks values that are too distant from Ks peaks (e.g., >2 standard
+deviations away from the mean) were likely created by duplications of
+large genomic segments, but not duplications of the entire genome.
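
The distance rule mentioned above (more than 2 standard deviations away from a peak mean) can be sketched in base R as follows. This is only an illustration on made-up numbers: the data frame `sd_ks`, the peak parameters `peak_mean` and `peak_sd`, and the column `near_peak` are hypothetical and do not come from the package.

```r
# Hypothetical Ks values for four SD gene pairs
sd_ks <- data.frame(
    dup1 = c("g1", "g3", "g5", "g7"),
    dup2 = c("g2", "g4", "g6", "g8"),
    Ks = c(0.55, 0.62, 1.40, 0.20)
)

# Hypothetical mean and standard deviation of a Ks peak
peak_mean <- 0.60
peak_sd <- 0.10

# Flag pairs whose Ks lies within 2 standard deviations of the peak mean
sd_ks$near_peak <- abs(sd_ks$Ks - peak_mean) <= 2 * peak_sd
sd_ks
```

Pairs flagged `TRUE` would be candidates for a WGD origin under this rule, while pairs flagged `FALSE` would be attributed to duplications of large genomic segments.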
+
As an example, to find gene pairs in the soybean genome that likely
+originated from the WGD event shared by all legumes (at ~58 million
+years ago), you’d need to extract SD pairs in age group 2 using the
+following code:
+
+# Get all pairs in age group 2
+pairs_ag2<-pairs_age_group$pairs[pairs_age_group$pairs$peak==2, c(1,2)]
+
+# Get all SD pairs
+sd_pairs<-gmax_ks[gmax_ks$type=="SD", c(1,2)]
+
+# Merge tables
+pairs_wgd_legumes<-merge(pairs_ag2, sd_pairs)
+
+head(pairs_wgd_legumes)
+#> dup1 dup2
+#> 1 GLYMA_01G001800 GLYMA_07G130700
+#> 2 GLYMA_01G002100 GLYMA_05G221300
+#> 3 GLYMA_01G002300 GLYMA_07G130100
+#> 4 GLYMA_01G002600 GLYMA_07G129700
+#> 5 GLYMA_01G003500 GLYMA_05G222800
+#> 6 GLYMA_01G003500 GLYMA_08G029700
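
The merge step above behaves like an inner join on the shared columns `dup1` and `dup2`. A small base R sketch on toy data (the gene IDs here are invented) shows which rows survive:

```r
# Toy pairs in an age group
pairs_ag2 <- data.frame(
    dup1 = c("a1", "a2", "a3"),
    dup2 = c("b1", "b2", "b3")
)

# Toy SD pairs
sd_pairs <- data.frame(
    dup1 = c("a2", "a3", "a4"),
    dup2 = c("b2", "b9", "b4")
)

# merge() joins on all shared column names by default,
# so only rows with identical dup1 AND dup2 are kept
shared <- merge(pairs_ag2, sd_pairs)
shared
```

Only the pair `a2`/`b2` is returned, because `a3` is paired with a different gene in each table. Note that matching is order-sensitive: a pair stored as (`b2`, `a2`) in one table would not match (`a2`, `b2`) in the other.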
Data visualization
@@ -1112,7 +1106,7 @@
Visualizing the freque
demonstrate how this works, we will use an example data set with
duplicate pairs for 3 fungi species (and substitution rates, which will
be ignored by duplicates2counts()).
-
+
# Load data set with pre-computed duplicates for 3 fungi species
data(fungi_kaks)
names(fungi_kaks)
@@ -1154,7 +1148,7 @@
Visualizing the freque
duplication type with the function plot_duplicate_freqs().
You can visualize frequencies in three different ways, as demonstrated
below.
-
Visualizing the freque
# Combine plots, one per row
patchwork::wrap_plots(p1, p2, p3, nrow = 3) +
    patchwork::plot_annotation(tag_levels = "A")
-
+
If you want to visualize the frequency of duplicated
genes (not gene pairs), you’d first need to classify
genes into unique modes of duplication with
classify_genes(), and then repeat the code above. For
example:
-
+
# Frequency of duplicated genes by mode
classify_genes(fungi_kaks) |> # classify genes into unique duplication types
    duplicates2counts() |> # get a data frame of counts (long format)
    plot_duplicate_freqs() # plot frequencies
-
+
Visualizing
@@ -1189,7 +1183,7 @@
Visualizing
distribution for the whole paranome, you will use the function
plot_ks_distro().
-
+
ks_df <- fungi_kaks$saccharomyces_cerevisiae

# A) Histogram, whole paranome
@@ -1204,7 +1198,7 @@
Visualizing
# Combine plots side by side
patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
    patchwork::plot_annotation(tag_levels = "A")
-
+
However, visualizing the distribution for the whole paranome can mask
patterns that only happen for duplicates originating from particular
duplication types. For instance, when looking for evidence of WGD
@@ -1214,7 +1208,7 @@
Visualizing
cluster together, suggesting the presence of WGD history. To visualize
the distribution by duplication type, use bytype = TRUE in
plot_ks_distro().
-
+
# A) Duplicates by type, histogram
p1 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "histogram")
@@ -1224,7 +1218,7 @@
Visualizing
# Combine plots side by side
patchwork::wrap_plots(p1, p2) +
    patchwork::plot_annotation(tag_levels = "A")
-
+
Visualizing substitution rates by species
@@ -1238,7 +1232,7 @@
Visualizing substitution rate
by species. You can choose which rate you want to visualize, and whether
or not to group gene pairs by duplication mode, as demonstrated
below.
-
+
# A) Ks for each species
p1 <- plot_rates_by_species(fungi_kaks)
@@ -1248,65 +1242,65 @@
Visualizing substitution rate
# Combine plots - one per row
patchwork::wrap_plots(p1, p2, nrow = 2) +
    patchwork::plot_annotation(tag_levels = "A")
-
+
Session information
This document was created under the following conditions:
@Manual{,
title = {doubletrouble: Identification and classification of duplicated genes},
author = {Fabrício Almeida-Silva and Yves {Van de Peer}},
year = {2024},
- note = {R package version 1.5.1},
+ note = {R package version 1.5.2},
url = {https://github.com/almeidasilvaf/doubletrouble},
}
The major goal of doubletrouble is to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. Duplicates can be classified using four different classification schemes, which increase the complexity and level of details in a stepwise manner. The classification schemes and the duplication modes they can classify are: