
Section 2 Compara Centric Commands

Gavin Huttley edited this page Jul 7, 2024 · 1 revision

Compara Centric Commands

And this is where Ensembl really shines. Whole genome alignment is hard. Homology classification is hard. These are the centerpieces of nearly all of the Ensembl genomic data resources.

At present we have only two subcommands pertinent to genomic comparison: homologs and alignments.

To use these, let's switch to the ape installation, which I have provided for you. This installation has both the homology data and whole genome sequence alignments.

Before you do anything, use the installed subcommand pointed at the directory ~/workshop/data/apes_112 to see what data has been installed. Then use the species-summary subcommand for a couple of the species to give you an idea about their scale.
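As a starting point, those two checks might look like the sketch below. The wrapper name show_install is hypothetical, and the -i and --species options for these two subcommands are assumptions based on how -i is used with homologs later on this page; confirm them with elt installed --help and elt species-summary --help.

```shell
# show_install is a hypothetical helper, not part of elt.
# Assumption: both subcommands accept -i, and species-summary accepts --species.
show_install() {
    install_dir="$1"
    shift
    # list what has been installed
    elt installed -i "$install_dir"
    # summarise each species named on the command line
    for species in "$@"; do
        elt species-summary -i "$install_dir" --species "$species"
    done
}
```

You could then call show_install ~/workshop/data/apes_112 human, adding whichever other species names the installed output reports.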

homologs

The homologs subcommand currently exports DNA sequence data for the CDS for 1-to-1 orthologs. (Strictly speaking, it's the sequence for the CDS of the canonical transcript for each gene where the canonical transcript is defined by Ensembl.)

At present, this command requires that you specify a reference species. The genes in that species are used to query for orthologs in the remaining species in the installation. My own preference is to use the best annotated genome as the reference, which in this case is human. The following command limits the query to 100 records. I've done this in the interest of time, and because prototyping on small datasets is always a good idea.

elt homologs -i ~/workshop/data/apes_112 -o ~/workshop/data/ortho-100 --ref human --limit 100

This took about 1 minute 40 seconds on my laptop using a single CPU. We will interrogate the results of this subcommand in the Cogent section.

NOTE You can use as many CPUs as you like for this sampling process via the -np # command line option, where # is the number of CPUs to use.
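For example, the earlier command could be wrapped so the CPU count is easy to vary. This is only a sketch: the function name run_homologs and the default of four CPUs are illustrative, while the flags are the ones already shown on this page.

```shell
# run_homologs is a hypothetical wrapper around the command shown earlier.
# $1 = output directory, $2 = number of CPUs (defaults to 4 here, an arbitrary choice)
run_homologs() {
    out_dir="$1"
    n_cpus="${2:-4}"
    elt homologs -i ~/workshop/data/apes_112 -o "$out_dir" --ref human --limit 100 -np "$n_cpus"
}
```

Running time run_homologs ~/workshop/data/ortho-100-np4 4 (the output directory name is made up for this example) lets you compare the wall time against the single-CPU run.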

Question

Aside from including other relationship types, what other capabilities would be useful for this subcommand?

alignments

As the name implies, this subcommand exports whole genome alignments matching your input criteria. One key piece of information you need is the name of the alignment that has been installed; refer back to the output of the installed subcommand that you ran a few minutes ago.

The other crucial input is a table of Ensembl stable IDs for genes. The genomic coordinates of each gene are used to identify the segments of the whole genome alignment to be exported. This file must contain a header row with a column named stableid, in lower case. The easiest way to get a couple of stable IDs for a sample file is to list the contents of the directory generated by the homologs subcommand. Here's an example:

stableid
ENSG00000007923
ENSG00000008130
ENSG00000041988
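If you would rather generate such a file than type it, something like the following sketch could work. It assumes, without having checked, that each file written by homologs is named after a human stable ID (for example ENSG00000007923 followed by an extension); make_stableids is a hypothetical helper name.

```shell
# make_stableids is a hypothetical helper, not part of elt.
# Assumption: files in the homologs output directory are named <stable ID>.<extension>.
# $1 = homologs output directory, $2 = file to write, $3 = how many IDs to keep (default 3)
make_stableids() {
    src_dir="$1"
    out_file="$2"
    max_ids="${3:-3}"
    # the required header row
    printf 'stableid\n' > "$out_file"
    for path in "$src_dir"/ENSG*; do
        [ -e "$path" ] || continue      # skip if nothing matched the pattern
        name=$(basename "$path")
        printf '%s\n' "${name%%.*}"     # drop everything from the first dot
    done | sort -u | head -n "$max_ids" >> "$out_file"
}
```

For instance, make_stableids ~/workshop/data/ortho-100 my-stableids.tsv 3 would write a header plus three IDs to my-stableids.tsv (a made-up file name). Check the result with cat before passing it to elt.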

I have created that file as data/stableids.tsv. The relative paths below assume you are in the ~/workshop directory. Then use the command:

elt alignments --installed data/apes_112 --outdir data/aligns-demo --align_name "10_p*" --ref human --ref_genes_file data/stableids.tsv

Notice you have to provide both the reference species AND the gene ID file. Also, I've used a wildcard to simplify writing the alignment name; the quotes stop the shell from expanding it before elt sees it.

This command writes out gzipped alignment files into the nominated outdir.
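To take a quick look at what was written without unpacking anything, a small helper along these lines may be handy. It makes no assumption about the format inside the files, only that their names end in .gz; peek_gz is a hypothetical name.

```shell
# peek_gz is a hypothetical helper: print the first few lines of each gzipped file.
# $1 = directory to inspect, $2 = number of lines to show per file (default 5)
peek_gz() {
    out_dir="$1"
    n_lines="${2:-5}"
    for path in "$out_dir"/*.gz; do
        [ -e "$path" ] || continue      # skip if no .gz files are present
        printf '== %s\n' "$(basename "$path")"
        # decompress to stdout, leaving the file on disk untouched
        gzip -dc "$path" | head -n "$n_lines"
    done
}
```

For the command above, peek_gz data/aligns-demo 4 would show the first four lines of each exported alignment.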

Question

What would you like to see for this subcommand? Are pairwise whole genome alignments important to you?

Next up, using Cogent3 to polish the data in preparation for the rest of your research.