CE-Symm is a tool for detecting internal symmetry in protein structures. CE-Symm 2.0 is able to detect both open and closed symmetry and provide a multiple alignment of all repeats.
When using CE-Symm, please cite:
Douglas Myers-Turnbull, Spencer E Bliven, Peter W Rose, Zaid K Aziz, Philippe Youkharibache, Philip E Bourne, Andreas Prlic (2013) Systematic detection of internal symmetry in proteins using CE-Symm. J Mol Biol, 426(11), 2255–2268. [PMID 24681267]
or the following preprint about CE-Symm 2.0:
Spencer E Bliven, Aleix Lafita,Peter W Rose, Guido Capitani, Andreas Prlić, Philip E Bourne. Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm. Preprint on BioRxiv: https://doi.org/10.1101/297960
CE-Symm requires Java 8 or newer to run.
The latest release can be downloaded from Github and contains the cesymm jar file and the runQuatSymm.sh
wrapper script. The script may be called directly or added to your PATH.
The easiest way to run CE-Symm is via the included wrapper script.
usage: runCESymm.sh [OPTIONS] [structures...]
CE-Symm can also run directly through java without using the wrapper script. On many systems, double clicking the JAR file should open the program in interactive mode. From the command line, it can be run as
java -Xmx500M -jar cesymm-*.jar [OPTIONS] [structures...]
Options are specified in gnu style. Boolean options can be negated by prefixing with "no". All common options have short forms.
Short Option | Long Option | Description |
---|---|---|
-h | --help | Print usage information |
--version | Print CE-Symm version | |
-i | --input=file | File listing whitespace-delimited query structures |
-v | --verbose | Output verbose logging information. |
-q | --noverbose | Disable verbose logging information, as well as the default (--simple) output. |
-o | --simple=file | Output result in a simple format (default) |
--stats=file | Output a tsv file with detailed symmetry information | |
--tsv=file | Output alignment as a tsv-formatted list of aligned residues. | |
--xml=file | Output alignment as XML | |
--fatcat=file | Output alignment as FATCAT output | |
--fasta=file | Output alignment as FASTA alignment output | |
--axes=file | Output each axis of symmetry found in the structure as two points and a rotation angle | |
-j | --show3d | Force Jmol display for each structure [default for <10 structures when specified on command line] |
-J | --noshow3d | Disable Jmol display [default with --input or for >=10 structures] |
--ordermethod=Class | Order detection method: SEQUENCE_FUNCTION (default), GRAPH_COMPONENT, ANGLE, or USER_INPUT | |
--order | Force a particular order. If positive, implies --ordermethod=USER_INPUT. | |
--refinemethod=Class | Refiner method: SEQUENCE_FUNCTION (default), NOT_REFINED, or GRAPH_COMPONENT | |
--symmtype=Class | Restrict symmetry to: CLOSED, OPEN, or AUTO (default) | |
--pdbfilepath=dir | Download directory for new structures [default tmp folder]. Can also be set with the PDB_DIR environmental variable. | |
--threads=int | Number of threads | |
--maxgapsize=float | This parameter configures the maximum gap size G, that is applied during the AFP extension. The larger the value, the longer the calculation time can become, Default value is 30. Set to 0 for no limit. | |
--scoringstrategy=str | Which scoring function to use: CA_SCORING, SIDE_CHAIN_SCORING, SIDE_CHAIN_ANGLE_SCORING, CA_AND_SIDE_CHAIN_ANGLE_SCORING, or SEQUENCE_CONSERVATION | |
--winsize=int | This configures the fragment size m of Aligned Fragment Pairs (AFPs). | |
--maxrmsd=float | The maximum RMSD at which to stop alignment optimization. (default: unlimited=99) | |
--nointernalgaps | Force alignment to include a residue from all repeats. (By default only 50% of repeats must be aligned in each column.) | |
--gapopen=float | Gap opening penalty during alignment optimization [default: 5.0]. | |
--gapextension=float | Gap extension penalty during alignment optimization [default: 0.5]. | |
--symmlevels=int | Run iteratively the algorithm to find multiple symmetry levels. The parameter controls the maximum symmetry levels allowed. 0 means unbounded. [default: 0]. | |
--noopt | Disable optimization of the resulting symmetry alignment. | |
--unrefinedscorethreshold=float | The TM-score threshold before refinement of the self-alignment. TM-scores above this value will be considered significant results [default: 0.4, interval [0.0,1.0]]. | |
--refinedscorethreshold=float | The TM-score threshold of the multiple alignment of repeats after refinement. TM-scores above this value will be considered significant results [default: 0.36, interval [0.0,1.0]]. | |
--ssethrehold=int | The minimum number of secondary structure elements (SSE) for each symmetric subunit, for the result to be significant [default: 2]. | |
--maxorder=int | The maximum number of symmetric subunits [default: 8]. | |
--rndseed=int | The random seed used in optimization, for reproducibility of the results [default: 0]. | |
--minlen=int | The minimum length, expressed in number of core aligned residues, of a symmetric subunit [default: 15]. | |
--dcutoff=float | The maximum distance, in A, allowed between any two aligned residue positions [default: 7.0]. | |
--scopversion=version | Version of SCOP or SCOPe to use when resolving SCOP identifiers [defaults to latest SCOPe] |
By default, running CE-Symm will enter interactive mode. It will prompt you for an input structure and display the result in a Jmol window.
runCESymm.sh
You can also pass one or more structures to the program to bypass the input dialog.
runCESymm.sh 1HIV
runCESymm.sh 1GEN.A d1tl2a_ PDP:1RI6Aa
To run CE-Symm on many structures, specify input and output files. The input should contain a list of whitespace separated structure identifiers. Lines beginning with '#' are ignored. Note that using --input disables Jmol display by default. Use --jmol to override this behavior.
runCESymm.sh --input=queries.txt --xml=output.xml
An important feature in the batch mode is the multithreading implementation.
By default, all the allowed CPUs are used to run the symmetry analyses in
parallel. The option -threads N
can be used to set the number N of threads
to use. Note that the scaling efficiency of the multithreading is not perfect,
and the thread overhead becomes significant with more than 8 threads.
CE-Symm accepts a wide variety of ways to specify structures. Some examples:
- PDB Code:
1TGH
- Single chain:
1ITB.A
- Residue Range:
2NWX.A:255-416
- Multiple ranges can be concatenated by commas if needed
- SCOP/CATH/ECOD/PDP domains:
d1u6dx_
,e2j5aA1
- Files:
file.cif
,file:///path/to/file.pdb?chainId=A
- URLs:
http://files.rcsb.org/download/1TIM.pdb
- Biological Assemblies:
BIO:3HDP:2
- Biological Assembly support is currently experimental. Some visualization features may not behave properly.
CE-Symm can output results in a number of formats. Format options may be followed
by a filename (--stats=out.txt
). If the filename is empty or '-', output will
be sent to the terminal. To prevent structure names from being interpreted as
filenames, it is recommended to use the hyphen explicitly (e.g. --stats=-
,
although --stats
may work alone in some contexts).
If no format is specified, CE-Symm will default to printing the simple format to
standard out, although this can be suppressed with the -q
option.
The following formats are supported. Most formats represent the symmetry as an alignment from the structure to itself.
- Simple: Intended to be a simple human-readable summary of results. Lists one line per query, giving the symmetry determination and information about how that conclusion was reached.
- Stats: Provides more detailed statistics about each result. See below for a description of the fields.
- Axes: Provides detailed information about the axes of symmetry found.
- See below for a description of the fields.
- TSV: A list of aligned residues for each structure, with a line containing only '//' to separate records.
- FATCAT: A single file containing all alignments in FATCAT's traditional output format, with a line containing only '//' to separate records.
- XML: All the alignments in a custom XML format suitable for machine parsing.
The --stats
option outputs a tab-delimited file, where each row corresponds to an input structure, with the following columns:
- Name Name of the structure
- NumRepeats Total number of repeats detected by CE-Symm, including multiple levels of symmetry if detected
- SymmGroup Symmetry Group of the top level of symmetry. This includes point group symmetry (Cn or Dn), helical symmetry (H), and translational repeats (R).
- Refined 'true' or 'false', indicating whether refinement was successful
- SymmLevels Number of symmetry levels detected
- SymmType OPEN or CLOSED for symmetry or NONE for asymmetry. Multiple levels of symmetry are comma separated, in order from the first level to the last
- RotationAngle Angle of rotation at the principal axis (degrees). Closed symmetry may deviate from ideal values due to the superposition procedure. Multiple axes of symmetry are comma separated, in order from the first symmetry level to the last
- ScrewTranslation Translation parallel the principal axis (Å). Multiple axes of symmetry are comma separated, in order from the first symmetry level to the last
- UnrefinedTMscore TM-Score of the self-alignment prior to refinement
- UnrefinedRMSD RMSD of the self-alignment prior to refinement
- SymmTMscore Average pairwise TM-Score of all repeats in the alignment after refinement
- SymmRMSD Average RMSD of all repeats in the alignment after refinement
- RepeatLength Total number of alignment positions in the multiple repeats alignment
- CoreLength Number of ungapped columns in the multiple alignment. Equivalent to (RepeatLength*NumRepeats)-Gaps
- Length Total length of the protein
- Coverage Fraction of the protein aligned
- Repeats Substructure identifiers of each repeat range, semicolon separated. The numbers correspond to residue numbers in the original structure
Information about the rotation axes can be extracted from the --axes
output format. This option also outputs a tab-delimited file, where each row corresponds to one axis of symmetry. In the case of simple cyclic or open symmetry, only one axis is output. For more complex cases (dihedral symmetry or axes that don't form a point group), a hierarchy of axes is constructed that align smaller and smaller portions of the structure. Axes with the same SymmLevel
are related to each other by combinations of lower-level axes. Note that the axes generated by CE-Symm (displayed in yellow in Jmol) are optimized independently, so for dihedral cases they may differ slightly from the blue point-group axes.
- Name Name of the structure
- SymmLevel Level of symmetry of the axis
- SymmType CLOSED or OPEN depending on whether applying the axis to the last repeat superimposes with the first repeat
- SymmOrder Number of repeats in the structure to which the axis applies. For CLOSED symmetry axes, this is equivalent to 360/RotationAngle.
- RotationAngle Angle of rotation at the principal axis (degrees). Closed symmetry may deviate from ideal values due to the superposition procedure.
- ScrewTranslation Translation parallel the principal axis (Å)
- Point1 One end of the axis
- Point2 Other end of the axis
- AlignedRepeats Defines which repeats are aligned. Each parenthetical group gives a set of aligned repeats. This is similar to the cycle notation used in group theory. For example,
(A;C;E)(B;D;F)
indicates a three-fold rotation superimposingA
andB
ontoC
andD
,C
andD
ontoE
andF
, and (if closed)E
andF
ontoA
andB
.
The XML output is primarily suitable for loading results into BioJava. Other formats are generally preferable for common uses. However, a few clarifications are in order for advanced users:
- The
<Structures>
element gives residue ranges for each repeat of the multiple alignment. This information is also available in the--simple
format. - The
<eqrX>
elements give the indices of aligned residues for each repeat. Use of the--tsv
output is strongly recommended rather than parsing these, since tsv uses standard PDB residue numbers. - The
<Matrix4d>
elements give the 4D transformation matrices required to superimpose each repeat onto the first repeat (thus the first entry is always the identity matrix). It is possible to express these matrices in terms of the rotation axes from the--axes
output, but doing so requires careful tracking of which repeats are aligned by each axis.
CE-Symm is licensed under LGPL 2.1 (see LICENSE, which should have come bundled with the executable).
The source code is available at https://github.com/rcsb/symmetry and as part of the BioJava library. A webserver is also provided at http://source.rcsb.org/jfatcatserver/symmetry.jsp
# Get help
runCESymm.sh -h
# Full interactive mode. Prompts for an input structure & runs CE-Symm,
# displaying the result using Jmol
runCESymm.sh
# Run CE-Symm on a single structure or a short list of structures
# and display the results using Jmol
runCESymm.sh 1HIV
runCESymm.sh 1GEN.A d1tl2a_ PDP:1RI6Aa
runCESymm.sh ./myprotein.pdb
# Print result summary to standard out, repressing Jmol display
runCESymm.sh --noshow3d 1GEN.A 1TL2.A 1RI6.A
# Print detailed alignment information
runCESymm.sh --stats=- 1GEN.A
# Output XML file with full alignments and detailed results
runCESymm.sh --xml=output.xml -J 1GEN.A 1TL2.A 1RI6.A
# Run in batch mode.
# The input file should list one structure per line.
# Accepts PDB IDs ("4hhb"), SCOP identifiers ("d4hhba_"), or ranges ("4hhb.A","4hhb.A_1-141")
# Lines beginning with '#' are ignored.
runCESymm.sh --input=queries.txt --xml=output.xml