Version: v1.1dev
The macromolecular transmission format (MMTF) is a binary encoding of biological structures. It includes the coordinates, the topology and associated data. Specifically, a large subset of the data in mmCIF or PDB files can be represented. Pronounced goals are a reduced file size for efficient transmission over the Internet or from hard disk to memory and fast decoding/parsing speed. Additionally, the format aims to be easily understood and implemented to facilitate its wide dissemination. For testing encoder and decoder implementations a test suite is available.
This specification describes a set of required and optional fields representing molecular structures and associated data. The fields are limited to six primitive types for efficient serialization and deserialization using the binary MessagePack format. The fields in MMTF are stored in a binary container format. The top-level of the container contains the field names as keys and field data as values. To describe the layout of data in MMTF we use the JSON notation throughout this document.
The first step of decoding MMTF is decoding the MessagePack-encoded container. Many of the resulting MMTF fields do not need to be decoded any further. However, to allow for custom compression some fields are given as binary data and must be decoded using the strategies described below. For maximal size savings the binary MMTF data can be compressed using general purpose algorithms like gzip or brotli.
The fields in the MMTF format group data of the same type together to create a flat data-structure, for instance, the coordinates of all atoms are stored together, instead of in atom objects with other atom-related data. This avoids imposing a deeply-nested hierarchical structure on consuming programs, while still allowing efficient traversal of models, chains, groups, and atoms.
In principle any serialization format that supports the types described below can be used to store the above fields. MMTF files (specifically files with the .mmtf
extension) use the binary MessagePack serialization format.
The MessagePack format (version 5) is used as the binary container format of MMTF. The MessagePack specification describes the data types and the data layout. Encoding and decoding libraries for MessagePack are available in many languages, see the MessagePack website.
The test suite will additionally provide files representing the MMTF fields as JSON to help validating implementations of this specification.
The following types are used for the fields in this specification.
String
An UTF-8 encoded string.Float
A 32-bit floating-point number.Integer
A 32-bit signed integer.Map
A data structure of key-value pairs where each key is unique. Also known as "dictionary", "hash".- In some fields we restrict the key-value types to specific types. Currently utilized restrictions are:
- <String, Array|Binary>: used by
atomProperties
,bondProperties
,groupProperties
,chainProperties
, andmodelProperties
- <String, String|Float|Integer|Map|Array|Binary>: used by
extraProperties
- <String, Array|Binary>: used by
- In some fields we restrict the key-value types to specific types. Currently utilized restrictions are:
Array
A sequence of elements that have the same type.Binary
An array of unsigned 8-bit integer numbers representing binary data.
The Binary
type is used here to store encoded data as described in the Codecs section. When the encoded data is to be interpreted as a multi-byte type (e.g. 32-bit integers) it must be represented in big-endian format.
Note that the MessagePack format limits the String
, Map
, Array
and Binary
type to (2^32)-1 entries per instance.
This section describes the binary layout of the header and the encoded data as well as the available en/decoding strategies.
- Bytes 0 to 3: 32-bit signed integer specifying the codec type
- Bytes 4 to 7: 32-bit signed integer specifying the length of the resulting array
- Bytes 8 to 11: 4 bytes containing codec-specific parameter data
- Bytes 12 to N: bytes containing the encoded array data
Type 1
Signature byte[] -> float32[]
Description Interpret bytes as array of 32-bit floating-point numbers.
Type 2
Signature byte[] -> int8[]
Description Interpret bytes as array of 8-bit signed integers.
Type 3
Signature byte[] -> int16[]
Description Interpret bytes as array of 16-bit signed integers.
Type 4
Signature byte[] -> int32[]
Description Interpret bytes as array of 32-bit signed integers.
Type 5
Parameter byte[4] -> int32
denoting the string length
Signature byte[] -> uint8[] -> string<length>[]
Description Interpret bytes as array of 8-bit unsigned integers, then iteratively consume length
many bytes to form a string array.
Type 6
Signature byte[] -> int32[] -> char[]
Description Interpret bytes as array of 32-bit signed integers, then run-length decode into array of characters.
Type 7
Signature byte[] -> int32[] -> int32[]
Description Interpret bytes as array of 32-bit signed integers, then run-length decode into array of 32-bit signed integers.
Type 8
Signature byte[] -> int32[] -> int32[] -> int32[]
Description Interpret bytes as array of 32-bit signed integers, then run-length decode into array of 32-bit signed integers, then delta decode into array of 32-bit signed integers.
Type 9
Parameter byte[4] -> int32
denoting the divisor
Signature byte[] -> int32[] -> int32[] -> float32[]
Description Interpret bytes as array of 32-bit signed integers, then run-length decode into array of 32-bit signed integers, then integer decode into array of 32-bit floating-point numbers using the divisor
parameter.
Type 10
Parameter byte[4] -> int32
denoting the divisor
Signature byte[] -> int16[] -> int32[] -> int32[] -> float32[]
Description Interpret bytes as array of 16-bit signed integers, then unpack into array of 32-bit integers, then delta decode into array of 32-bit integers, then integer decode into array of 32-bit floating-point numbers using the divisor
parameter.
Type 11
Parameter byte[4] -> int32
denoting the divisor
Signature byte[] -> int16[] -> float32[]
Description Interpret bytes as array of 16-bit signed integers, then integer decode into array of 32-bit floating-point numbers using the divisor
parameter.
Type 12
Parameter byte[4] -> int32
denoting the divisor
Signature byte[] -> int16[] -> int32[] -> float32[]
Description Interpret bytes as array of 16-bit signed integers, then unpack into array of 32-bit signed integers, then integer decode into array of 32-bit floating-point numbers using the divisor
parameter.
Note Useful for arrays where a small amount of values may be slightly larger than two bytes. However, note that with many values larger than that the packing becomes inefficient.
Type 13
Parameter byte[4] -> int32
denoting the divisor
Signature byte[] -> int8[] -> int32[] -> float32[]
Description Interpret array of bytes as array of 8-bit signed integers, then unpack into array of 32-bit signed integers, then integer decode into array of 32-bit floating-point numbers using the divisor
parameter.
Note Useful for arrays where a small amount of values may be slightly larger than one bytes. However, note that with many values larger than that the packing becomes inefficient.
Type 14
Signature byte[] -> int16[] -> int32[]
Description Interpret bytes as array of 16-bit signed integers, then unpack into array of 32-bit signed integers.
Note Useful for arrays where a small amount of values may be slightly larger than two bytes. However, note that with many values larger than that the packing becomes inefficient.
Type 15
Signature byte[] -> int8[] -> int32[]
Description Interpret bytes as array of 8-bit signed integers, then unpack into array of 32-bit signed integers.
Note Useful for arrays where a small amount of values may be slightly larger than one bytes. However, note that with many values larger than that the packing becomes inefficient.
Type 16
Signature byte[] -> int32[] -> int8[]
Description Interpret bytes as array of 32-bit signed integers, then run-length decode into array of 8-bit integers.
The following general encoding strategies are used to compress the data contained in MMTF files.
Run-length encoding can generally be used to compress arrays that contain stretches of equal values. Instead of storing each value itself, stretches of equal values are represented by the value itself and the occurrence count, that is a value/count pair.
Example:
Starting with the encoded array of value/count pairs. In the following example there are three pairs 1, 10
, 2, 1
and 1, 4
. The first entry in a pair is the value to be repeated and the second entry denotes how often the value must be repeated.
[ 1, 10, 2, 1, 1, 4 ]
Applying run-length decoding by repeating, for each pair, the value as often as denoted by the count entry.
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1 ]
Delta encoding is used to store an array of numbers. Instead of storing the numbers themselves, the differences (deltas) between the numbers are stored. When the values of the deltas are smaller than the numbers themselves they can be more efficiently packed to require less space.
Note that arrays in which the values change by an identical amount for a range of consecutive values lend themselves to subsequent run-length encoding.
Example:
Starting with the encoded array of delta values:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1 ]
Applying delta decoding. The first entry in the array is left as is, the second is calculated as the sum of the first and the second (not decoded) value, the third as the sum of the second (decoded) and third (not decoded) value and so forth.
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16 ]
Packing/Recursive indexing encodes values such that the encoded values lie within the open interval (MIN, MAX). This allows to create a more compact representation of a 32-bit signed integer array when the majority of values in the array fit into 16-bit (or 8-bit). To encode each value in the input array the method stores the value itself if it lies within the open interval (MIN, MAX), otherwise the MAX (or MIN if the number is negative) interval endpoint is stored and subtracted from the input value. This process of storing and subtracting is repeated recursively until the remainder lies within the interval.
Example:
Starting with the array of 8-bit integer values, so the open interval is (127, -128):
[ 127, 41, 34, 1, 0, -50, -128, 0, 7, 127, 0, 127, 127, 14 ]
Unpacking/Applying recursive indexing decoding. Values that lie within the interval are copied over to the output array. Values that are equal to an interval endpoint are added to the subsequent value while the subsequent value is equal to an interval endpoint, e.g. the sequence 127, 127, 14
becomes 268
:
[ 168, 34, 1, 0, -50, -128, 7, 127, 268 ]
In integer encoding, floating point numbers are converted to integer values by multiplying with a factor and discard everything after the decimal point. Depending on the multiplication factor this can change the precision but with a sufficiently large factor it is lossless. The integer values can then often be compressed with delta encoding which is the main motivation for it.
Example:
Starting with the array of integer values:
[ 100, 100, 100, 100, 50, 50 ]
Applying integer decoding with a divisor of 100
:
[ 1.00, 1.00, 1.00, 1.00, 0.50, 0.50 ]
For dictionary encoding an Array
is created to store values. Indices as references to the values can then be used instead of repeating the values over and over again. Arrays of indices can afterwards be compressed with delta and run-length encoding.
Example:
First create a Array
to hold values that are referable by indices. In the following example the are two indices, 0
and 1
with some values associated.
[
{
"groupName": "ASP",
"singleLetterCode": "D",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C", "O", "CB", "CG", "OD1", "OD2" ],
"elementList": [ "N", "C", "C", "O", "C", "C", "O", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4, 6, 5, 7, 5 ],
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ],
"bondResonanceList": [ 0, 0, 1, 0, 0, 1, 1 ]
},
{
"groupName": "SER",
"singleLetterCode": "S",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C", "O", "CB", "OG" ],
"elementList": [ "N", "C", "C", "O", "C", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4 ],
"bondOrderList": [ 1, 1, 2, 1, 1 ],
"bondResonanceList": [ 0, 0, 1, 0, 0 ]
}
{
"groupName": "PHE",
"singleLetterCode": "F",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "H", "HA", "1HB", "2HB", "HD1", "HD2", "HE1", "HE2", "HZ"],
"elementList": ["N", "C", "C", "O", "C", "C", "C", "C", "C", "C", "C", "H", "H", "H", "H", "H", "H", "H", "H", "H"],
"formalChargeList": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"bondAtomList": [0, 1, 1, 2, 2, 3, 1, 4, 4, 5, 5, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 10, 0, 11, 1, 12, 4, 13, 4, 14, 6, 15, 7, 16, 8, 17, 9, 18, 10, 19],
"bondOrderList": [1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"bondResonanceList": [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
]
The indices can then be used to reference the values as often as needed:
[ 0, 1, 1, 0, 1 ]
The following table lists all top level fields, including their type and whether they are required or optional. The top-level fields themselves are stores as a Map
.
Required field
Type: String.
Description: The version number of the specification the file adheres to. The specification follows a semantic versioning scheme. In a version number MAJOR.MINOR
, the MAJOR
part is incremented when specification changes are incompatible with previous versions. The MINOR
part is changed for additions to the specification that are backwards compatible.
Examples:
The current, unreleased, in development specification:
"0.1"
A future version with additions backwards compatible to versions "1.0" and "1.1":
"1.2"
Required field
Type: String.
Description: The name and version of the software used to produce the file. For development versions it can be useful to also include the checksum of the commit. The main purpose of this field is to identify the software that has written a file, for instance because it has format errors.
Examples:
A software name and the checksum of a commit:
"RCSB PDB mmtf-java-encoder---version: 6b8635f8d319beea9cd7cc7f5dd2649578ac01a0"
Another software name and its version number:
"NGL mmtf exporter v1.2"
Optional field
Type: String.
Description: A short description of the structural data included in the file.
Example:
"CRAMBIN"
Optional field
Type: String.
Description: An ID for the structure, for example the PDB ID if applicable. If not in conflict with the format of the ID, it must be given in uppercase.
Example:
"1CRN"
Optional field
Type: String with the format YYYY-MM-DD
, where YYYY
stands for the year in the Gregorian calendar, MM
is the month of the year between 01 (January) and 12 (December), and DD
is the day of the month between 01 and 31.
Description: A date that relates to the deposition of the structure in a database, e.g. the wwPDB archive.
Example:
For example, the second day of October in the year 2005 is written as:
"2005-10-02"
Optional field
Type: String with the format YYYY-MM-DD
, where YYYY
stands for the year in the Gregorian calendar, MM
is the month of the year between 01 (January) and 12 (December), and DD
is the day of the month between 01 and 31.
Description: A date that relates to the release of the structure in a database, e.g. the wwPDB archive.
Example:
For example, the third day of December in the year 2013 is written as:
"2013-12-03"
Required field
Type: Integer.
Description: The overall number of bonds. This number must reflect both the bonds given in bondAtomList
and the bonds given in the groupType
entries in groupList
.
Example:
1142
Required field
Type: Integer.
Description: The overall number of atoms in the structure. This also includes atoms at alternate locations.
Example:
1023
Required field
Type: Integer.
Description: The overall number of groups in the structure. This also includes extra groups due to micro-heterogeneity.
Example:
302
Required field
Type: Integer.
Description: The overall number of chains in the structure.
Example:
4
Required field
Type: Integer.
Description: The overall number of models in the structure.
Example:
1
Optional field
Type: String.
Description: The Hermann-Mauguin space-group symbol.
Example:
"P 1 21 1"
Optional field
Type: Array of six Float values.
Description: Array of six values defining the unit cell. The first three entries are the length of the sides a
, b
, and c
in Å. The last three angles are the alpha
, beta
, and gamma
angles in degree.
Example:
[ 80.37, 96.12, 57.67, 90.00, 90.00, 90.00 ]
Optional field
Type: Array of Arrays of 16 Float values.
Description: Array of arrays representing 4x4 transformation matrices that are stored linearly in row major order. Thus, the translational component comprises the 4th, 8th, and 12th element. The transformation matrices describe noncrystallographic symmetry operations needed to create all molecules in the unit cell.
Example:
[
[
0.5, -0.809, -0.309, 128.875,
0.809, 0.309, 0.5, -208.524,
-0.309, -0.5, 0.809, 79.649,
0.0, 0.0, 0.0, 1.0
],
[
-0.5, 0.809, -0.309, 386.625,
0.809, 0.309, -0.5, -208.524,
-0.309, -0.5, -0.809, 79.649,
0.0, 0.0, 0.0, 1.0
]
]
Optional field
Type: Array
of assembly objects with the following fields:
Name | Type | Description |
---|---|---|
transformList | Array | Array of transform objects |
name | String | Name of the biological assembly |
Fields in a transform
object:
Name | Type | Description |
---|---|---|
chainIndexList | Array | Pointers into chain data fields, Integers |
matrix | Array | 4x4 transformation matrix, Floats |
The entries of chainIndexList
are indices into the chainIdList and chainNameList fields.
The elements of the 4x4 transformation matrix
are stored linearly in row major order. Thus, the translational component comprises the 4th, 8th, and 12th element.
Description: Array of instructions on how to transform coordinates for an array of chains to create (biological) assemblies. The translational component is given in Å.
Example:
The following example shows two transform objects from PDB ID 4OPJ. The transformation matrix of the first object performs no rotation and a translation of 42.387 Å in dimension x. The second one translates -42.387 Å in dimension x.
[
{
"transformList": [
{
"chainIndexList": [ 0, 4, 6 ],
"matrix": [
1.0, 0.0, 0.0, 42.387,
0.0, 1.0, 0.0, 0.000,
0.0, 0.0, 1.0, 0.000,
0.0, 0.0, 0.0, 1.000
]
}
]
},
{
"transformList": [
{
"chainIndexList": [ 0, 4, 6 ],
"matrix": [
1.0, 0.0, 0.0, -42.387,
0.0, 1.0, 0.0, 0.000,
0.0, 0.0, 1.0, 0.000,
0.0, 0.0, 0.0, 1.000
]
}
]
}
]
Optional field
Type: Array of entity objects with the following fields:
Name | Type | Description |
---|---|---|
chainIndexList | Array | Pointers into chain data fields, Integers |
description | String | Description of the entity |
type | String | Name of the entity type |
sequence | String | Sequence of the full construct in one-letter-code |
The entries of chainIndexList
are indices into the chainIdList and chainNameList fields.
The sequence
string contains the full construct, not just the resolved residues. Its characters are referenced by the entries of the sequenceIndexList field. Further, characters follow the IUPAC single letter code for protein or DNA/RNA residues, otherwise the character 'X'.
Description: Array of unique molecular entities within the structure. Each entry in chainIndexList
represents an instance of that entity in the structure.
Vocabulary: Known values for the entity field type
from the mmCIF dictionary are macrolide
, non-polymer
, polymer
, water
.
Example:
[
{
"description": "BROMODOMAIN ADJACENT TO ZINC FINGER DOMAIN PROTEIN 2B",
"type": "polymer",
"chainIndexList": [ 0 ],
"sequence": "SMSVKKPKRDDSKDLALCSMILTEMETHEDAWPFLLPVNLKLVPGYKKVIKKPMDFSTIREKLSSGQYPNLETFALDVRLVFDNCETFNEDDSDIGRAGHNMRKYFEKKWTDTFKVS"
},
{
"description": "4-FLUOROBENZAMIDOXIME",
"type": "non-polymer",
"chainIndexList": [ 1 ],
"sequence": ""
},
{
"description": "METHANOL",
"type": "non-polymer",
"chainIndexList": [ 2, 3, 4 ],
"sequence": ""
},
{
"description": "water",
"type": "water",
"chainIndexList": [ 5 ],
"sequence": ""
}
]
Optional field
Type: Float.
Description: The experimental resolution in Angstrom. If not applicable the field must be omitted.
Examples:
2.3
Optional field
Type: Float.
Description: The R-free value. If not applicable the field must be omitted.
Examples:
0.203
Optional field
Type: Float.
Description: The R-work value. If not applicable the field must be omitted.
Examples:
0.176
Optional field
Description: The array of experimental methods employed for structure determination.
Vocabulary: Known values from the mmCIF dictionary are ELECTRON CRYSTALLOGRAPHY
, ELECTRON MICROSCOPY
, EPR
, FIBER DIFFRACTION
, FLUORESCENCE TRANSFER
, INFRARED SPECTROSCOPY
, NEUTRON DIFFRACTION
, POWDER DIFFRACTION
, SOLID-STATE NMR
, SOLUTION NMR
, SOLUTION SCATTERING
, THEORETICAL MODEL
, X-RAY DIFFRACTION
.
Example:
[ "X-RAY DIFFRACTION" ]
Optional field
Type: Binary data that decodes into an array of 32-bit signed integers.
Description: Pairs of values represent indices of covalently bonded atoms. The indices point to the Atom data arrays. Only covalent bonds may be given. Only inter-groups bonds are expected here: intra-groups bonds are defined later in the groupList section. Note: This is an optional field in that if your mmtf file contains no bonds, the field is not required to exist (for decoding purposes). If bonds exist this must be defined.
Example:
Using the 'Pass-through: 32-bit signed integer array' encoding strategy (type 4).
In the following example there are three bonds, one between the atoms with the indices 0 and 61, one between the atoms with the indices 2 and 4, as well as one between the atoms with the indices 6 and 12.
[ 0, 61, 2, 4, 6, 12 ]
Optional field If it exists bondAtomList must also be present.
Type: Binary data that decodes into an array of 8-bit signed integers.
Description: Array of bond orders for bonds in bondAtomList
. Must be values -1, 1, 2, 3, or 4, defining unknown, single, double, triple, and quadruple bonds.
Note: This field is optional, it is not required that you know the order of the bonds when writing mmtf files. However if you have the information, we encourage
you to include it!
Example:
Using the 'Pass-through: 8-bit signed integer array' encoding strategy (type 2).
In the following example there are bond orders given for three bonds. The first and third bond have a bond order of 1 while the second bond has a bond order of 2.
[ 1, 2, 1 ]
Optional field If it exists bondAtomList and bondOrderList must also be present.
Type: Binary data that decodes into an array of 8-bit signed integers (type 16).
Description: Array of bond Resonances for bonds in bondAtomList
. Must be -1: resonance is unknown, 0: no resonance, or 1: resonance exists.
Note: This field is optional, it is not required that you know the resonance of bonds when writing mmtf files. However if you have the information, we encourage
you to include it!
Possible pairings between the bondResonanceList
and bondOrderList
are shown below.
If this field exists there should never be a non-resonating bond with an unknown bond order
Bond-order | Resonance | Explanation |
---|---|---|
-1 | 1 | kekulized form is unavailable, but resonance is known |
1(or 2,3,4) | 1 | kekulized form is known, and resonance is known and exists |
1(or 2,3,4) | 0 | kekulized form is known, but resonance is nonexistant |
1(or 2,3,4) | -1 | kekulized form is known, but resonance is not known |
Example
Using the Run-length encoded 8-bit array
encoding strategy (type 16).
In the following example there are bond Resonances given for three bonds. The first and thrid bonds are resonating bonds, while the second bond has no resonance.
[ 1, 0, 1]
An example for not knowing bond order would be a scenario where we don't have the kekulized form for resonating bonds.
{
"groupName": "PHE",
"singleLetterCode": "F",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "H", "HA", "1HB", "2HB", "HD1", "HD2", "HE1", "HE2", "HZ"],
"elementList": ["N", "C", "C", "O", "C", "C", "C", "C", "C", "C", "C", "H", "H", "H", "H", "H", "H", "H", "H", "H"],
"formalChargeList": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"bondAtomList": [0, 1, 1, 2, 2, 3, 1, 4, 4, 5, 5, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 10, 0, 11, 1, 12, 4, 13, 4, 14, 6, 15, 7, 16, 8, 17, 9, 18, 10, 19],
"bondOrderList": [1, 1, -1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"bondResonanceList": [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
The number of models in a structure is equal to the length of the chainsPerModel field. The chainsPerModel
field also defines which chains belong to each model.
Required field
Type: Array of Integer numbers. The number of models is thus equal to the length of the chainsPerModel
field.
Description: Array of the number of chains in each model. The array allows looping over all models:
# initialize index counter
set modelIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
increment modelIndex by one
Examples:
In the following example there are 2 models. The first model has 5 chains and the second model has 8 chains. This also means that the chains with indices 0 to 4 belong to the first model and that the chains with indices 5 to 12 belong to the second model.
[ 5, 8 ]
For structures with homogeneous models the number of chains per model is identical for all models. In the following example there are five models, each with four chains.
[ 4, 4, 4, 4, 4 ]
The number of chains in a structure is equal to the length of the groupsPerChain field. The groupsPerChain
field also defines which groups belong to each chain.
Required field
Type: Array of Integer numbers.
Description: Array of the number of groups (aka residues) in each chain. The number of chains is thus equal to the length of the groupsPerChain
field. In conjunction with chainsPerModel
, the array allows looping over all chains:
# initialize index counters
set modelIndex to 0
set chainIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
# traverse chains
for 1 to modelChainCount
print chainIndex
set offset to chainIndex * 4
print chainIdList[ offset : offset + 4 ]
print chainNameList[ offset : offset + 4 ]
increment chainIndex by 1
increment modelIndex by 1
Example:
In the following example there are 3 chains. The first chain has 73 groups, the second 59 and the third 1. This also means that the groups with indices 0 to 72 belong to the first chain, groups with indices 73 to 131 to the second chain and the group with index 132 to the third chain.
[ 73, 59, 1 ]
Required field
Type: Binary data that decodes into an array of 4-character strings.
Description: Array of chain IDs. For storing data from mmCIF files the chainIdList
field should contain the value from the label_asym_id
mmCIF data item and the chainNameList
the auth_asym_id
mmCIF data item. In PDB files there is only a single name/identifier for chains that corresponds to the auth_asym_id
item. When there is only a single chain identifier available it must be stored in the chainIdList
field.
Note: The character strings must be left aligned and unused characters must be represented by 0 bytes.
Example:
Using the 'UTF8/ASCII fixed-length string array' encoding strategy (type 5).
Starting with the array of 8-bit unsigned integers:
[ 65, 0, 0, 0, 66, 0, 0, 0, 67, 0, 0, 0 ]
Decoding the ASCII characters:
[ "A", "", "", "", "B", "", "", "", "C", "", "", "" ]
Creating the array of chain IDs:
[ "A", "B", "C" ]
Optional field
Type: Binary data that decodes into an array of 4-character strings.
Description: Array of chain names. This field allows to specify an additional set of labels/names for chains. For example, it can be used to store both, the label_asym_id
(in chainIdList
) and the auth_asym_id
(in chainNameList
) from mmCIF files.
Example:
Using the 'UTF8/ASCII fixed-length string array' encoding strategy (type 5).
Starting with the array of 8-bit unsigned integers:
[ 65, 0, 0, 0, 68, 65, 0, 0 ]
Decoding the ASCII characters:
[ "A", "", "", "", "DA", "", "", "" ]
Creating the array of chain IDs:
[ "A", "DA" ]
The fields in the following sections hold group-related data.
The mmCIF format allows for so-called micro-heterogeneity on the group-level. For groups (residues) with micro-heterogeneity there are two or more entries given that have the same sequence index, group id (and insertion code) but are of a different group type. The defining property is their identical sequence index.
Note: There is no compression applied to elements in Group data
.
Required field
Type: Array of groupType
objects with the following fields:
Name | Type | Description | Required |
---|---|---|---|
formalChargeList | Array | Array of formal charges as Integers | Y |
atomNameList | Array | Array of atom names, 0 to 5 character Strings | Y |
elementList | Array | Array of elements, 0 to 3 character Strings | Y |
bondAtomList | Array | Array of bonded atom indices, Integers | |
bondOrderList | Array | Array of bond orders as Integers either -1, 1, 2, 3 or 4 | |
bondResonanceList | Array | Array of bond resonance as Integers either -1, 0, or 1 | |
groupName | String | The name of the group, 0 to 5 characters | Y |
singleLetterCode | String | The single letter code, 1 character | Y |
chemCompType | String | The chemical component type | Y |
The element name must follow the IUPAC standard where only the first character is capitalized and the remaining ones are lower case, for instance Cd
for Cadmium.
Two consecutive entries in bondAtomList
representing indices of covalently bound atoms. The indices point into the formalChargeList
, atomNameList
, and elementList
fields.
The singleLetterCode
is the IUPAC single letter code for protein or DNA/RNA residues, otherwise the character 'X' for polymer groups or '?' for non-polymer groups.
Description: Common group (residue) data that is referenced via the groupType
key by group entries.
Vocabulary: Known values for the groupType field chemCompType
from the mmCIF dictionary are D-beta-peptide, C-gamma linking
, D-gamma-peptide, C-delta linking
, D-peptide COOH carboxy terminus
, D-peptide NH3 amino terminus
, D-peptide linking
, D-saccharide
, D-saccharide 1,4 and 1,4 linking
, D-saccharide 1,4 and 1,6 linking
, DNA OH 3 prime terminus
, DNA OH 5 prime terminus
, DNA linking
, L-DNA linking
, L-RNA linking
, L-beta-peptide, C-gamma linking
, L-gamma-peptide, C-delta linking
, L-peptide COOH carboxy terminus
, L-peptide NH3 amino terminus
, L-peptide linking
, L-saccharide
, L-saccharide 1,4 and 1,4 linking
, L-saccharide 1,4 and 1,6 linking
, RNA OH 3 prime terminus
, RNA OH 5 prime terminus
, RNA linking
, non-polymer
, other
, peptide linking
, peptide-like
, saccharide
.
Example:
[
{
"groupName": "GLY",
"singleLetterCode": "G",
"chemCompType": "PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C", "O" ],
"elementList": [ "N", "C", "C", "O" ],
"formalChargeList": [ 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2 ],
"bondOrderList": [ 1, 1, 2 ],
"bondResonanceList": [ 0, 0, 1 ],
},
{
"groupName": "ASP",
"singleLetterCode": "D",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C", "O", "CB", "CG", "OD1", "OD2" ],
"elementList": [ "N", "C", "C", "O", "C", "C", "O", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4, 6, 5, 7, 5 ],
"bondOrderList": [ 1, 1, 2, 1, 1, 2, 1 ],
"bondResonanceList": [ 0, 0, 1, 0, 0, 1, 1 ]
},
{
"groupName": "SER",
"singleLetterCode": "S",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C", "O", "CB", "OG" ],
"elementList": [ "N", "C", "C", "O", "C", "O" ],
"formalChargeList": [ 0, 0, 0, 0, 0, 0 ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2, 4, 1, 5, 4 ],
"bondOrderList": [ 1, 1, 2, 1, 1 ],
"bondResonanceList": [ 0, 0, 1, 0, 0 ]
},
{
"groupName": "PHE",
"singleLetterCode": "F",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "H", "HA", "1HB", "2HB", "HD1", "HD2", "HE1", "HE2", "HZ"],
"elementList": ["N", "C", "C", "O", "C", "C", "C", "C", "C", "C", "C", "H", "H", "H", "H", "H", "H", "H", "H", "H"],
"formalChargeList": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"bondAtomList": [0, 1, 1, 2, 2, 3, 1, 4, 4, 5, 5, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 10, 0, 11, 1, 12, 4, 13, 4, 14, 6, 15, 7, 16, 8, 17, 9, 18, 10, 19],
"bondOrderList": [1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"bondResonanceList": [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
]
Required field
Type: Binary data that decodes into an array of 32-bit signed integers.
Description: Array of pointers to groupType
entries in groupList
by their keys. One entry for each residue, thus the number of residues is equal to the length of the groupTypeList
field.
Example:
Using the 'Pass-through: 32-bit signed integer array' encoding strategy (type 4).
In the following example there are 5 groups. The 1st, 4th and 5th reference the groupType
with index 2
, the 2nd references index 0
and the third references index 1
. So using the data from the groupList
example this describes the polymer SER-GLY-ASP-SER-SER
.
[ 2, 0, 1, 2, 2 ]
Required field
Type: Binary data that decodes into an array of 32-bit signed integers.
Description: Array of group (residue) numbers. One entry for each group/residue.
Example:
Using the 'Delta & run-length encoded 32-bit signed integer array' encoding strategy (type 8).
Starting with the array of 32-bit signed integers:
[ 1, 10, -10, 1, 1, 4 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -10, 1, 1, 1, 1 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5 ]
Optional field
Type: Binary data that decodes into an array of 8-bit signed integers.
Description: Array of secondary structure assignments coded according to the following table, which shows the eight different types of secondary structure the DSSP algorithm distinguishes. If the field is included there must be an entry for each group (residue) either in all models or only in the first model.
Code | Name |
---|---|
0 | pi helix |
1 | bend |
2 | alpha helix |
3 | extended |
4 | 3-10 helix |
5 | bridge |
6 | turn |
7 | coil |
-1 | undefined |
Example:
Using the 'Pass-through: 8-bit signed integer array' encoding strategy (type 2).
Starting with the array of 8-bit signed integers:
[ 7, 7, 2, 2, 2, 2, 2, 2, 2, 7 ]
Optional field
Type: Binary data that decodes into an array of characters.
Description: Array of insertion codes, one for each group (residue). The lack of an insertion code must be denoted by a 0 byte.
Example:
Using the 'Run-length encoded character array' encoding strategy (type 6).
Starting with the array of 32-bit signed integers:
[ 0, 5, 65, 3, 66, 2 ]
Applying run-length decoding:
[ 0, 0, 0, 0, 0, 65, 65, 65, 66, 66 ]
If needed the ASCII codes can be converted to an Array
of String
s with the zeros as zero-length String
s:
[ "", "", "", "", "", "A", "A", "A", "B", "B" ]
Optional field
Type: Binary data that decodes into an array of 32-bit signed integers.
Description: Array of indices that point into the sequence
property of an entity object in the entityList field that is associated with the chain the group belongs to (i.e. the index of the chain is included in the chainIndexList
of the entity). There is one entry for each group (residue). It must be set to -1
when a group entry has no associated entity (and thus no sequence), for example water molecules.
Example:
Using the 'Delta & run-length encoded 32-bit signed integer array' encoding strategy (type 8).
Starting with the array of 32-bit signed integers:
[ 1, 10, -10, 1, 1, 4 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -10, 1, 1, 1, 1 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4 ]
The fields in the following sections hold atom-related data.
The mmCIF format allows for alternate locations of atoms. Such atoms have multiple entries in the atom-level fields (including the fields in the groupList entries). They can be identified and distinguished by their distinct values in the altLocList field.
Optional field
Type: Binary data that decodes into an array of 32-bit signed integers.
Description: Array of atom serial numbers. One entry for each atom.
Example:
Using the 'Delta & run-length encoded 32-bit signed integer array' encoding strategy (type 8).
Starting with the array of 32-bit signed integers:
[ 1, 7, 2, 1 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 2 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 9 ]
Optional field
Type: Binary data that decodes into an array of characters.
Description: Array of alternate location labels, one for each atom. The lack of an alternate location label must be denoted by a 0 byte.
Example:
Using the 'Run-length encoded character array' encoding strategy (type 6).
Starting with the array of 32-bit signed integers:
[ 0, 5, 65, 3, 66, 2 ]
Applying run-length decoding:
[ 0, 0, 0, 0, 0, 65, 65, 65, 66, 66 ]
If needed the ASCII codes can be converted to an Array
of String
s with the zeros as zero-length String
s:
[ "", "", "", "", "", "A", "A", "A", "B", "B" ]
Optional fields
Type: Binary data that decodes into an array of 32-bit floating-point numbers.
Description: Array of atom B-factors in in Å^2. One entry for each atom.
Example:
Using the 'Integer & delta encoded & two-byte-packed 32-bit floating-point number array' encoding strategy (type 10) with a divisor of 100.
Starting with the packed array of 16-bit signed integers:
[ 18200, 0, 2, -1, 100, -3, 5 ]
Unpacking/applying recursive indexing decoding to create an array of 32-bit signed integers (note, only the array type changed as the values all fitted into 16-bit signed integers):
[ 18200, 0, 2, -1, 100, -3, 5 ]
Applying delta decoding to create an array of 32-bit signed integers:
[ 18200, 18200, 18202, 18201, 18301, 18298, 18303 ]
Applying integer decoding with a divisor of 100
to create an array of 32-bit floating-point numbers:
[ 182.00, 182.00, 182.02, 182.01, 183.01, 182.98, 183.03 ]
Required fields
Type: Binary data that decodes into an array of 32-bit floating-point numbers.
Description: Array of x, y, and z atom coordinates, respectively, in Å. One entry for each atom and coordinate.
Note: To clarify, the data for each coordinate is stored in a separate array.
Example:
Using the 'Integer & delta encoded & two-byte-packed 32-bit floating-point number array' encoding strategy (type 10) with a divisor of 1000.
Starting with the packed array of 16-bit signed integers:
[ 32767, 32767, 32767, 6899, 0, 2, -1, 100, -3, 5 ]
Unpacking/Applying recursive indexing decoding to create an array of 32-bit signed integers:
[ 105200, 0, 2, -1, 100, -3, 5 ]
Applying delta decoding to create an array of 32-bit signed integers:
[ 105200, 105200, 105202, 105201, 105301, 105298, 105303 ]
Applying integer decoding with a divisor of 1000
to create an array of 32-bit floating-point values:
[ 100.000, 105.200, 105.202, 105.201, 105.301, 105.298, 105.303 ]
Optional field
Description: Array of atom occupancies, one for each atom.
Type: Binary data that decodes into an array of 32-bit floating-point numbers.
Example:
Using the 'Integer & run-length encoded 32-bit floating-point number array' encoding strategy (type 9) with a divisor of 100.
Starting with the array of 32-bit signed integers:
[ 100, 4, 50, 2 ]
Applying run-length decoding:
[ 100, 100, 100, 100, 50, 50 ]
Applying integer decoding with a divisor of 100
to create an array of 32-bit floating-point values:
[ 1.00, 1.00, 1.00, 1.00, 0.50, 0.50 ]
The following are fields that are not supplied by the RCSB, and are provided for applications to transfer non-molecular data along with the molecular data.
Each field is a map, and there are 6 possible fields bondProperties
, atomProperties
, groupProperties
, chainProperties
,
modelProperties
, and extraProperties
. The first 5 fields are restricted to holding data that is relevant to the field's prefix, while
any other data can be stored in the extraProperties
field.
As mentioned in Types in some fields we restrict the key-value types to specific types. The current type format the sepc follows is:
- <String, Array|Binary>
atomProperties
bondProperties
groupProperties
chainProperties
modelProperties
- <String, String|Float|Integer|Map|Array|Binary>
extraProperties
For each of the fields above, we do reserve some key-value pair formats to help ensure cross-application-compatibility. If you feel a certain key-value pair should be reserved and made a part of the spec, please let us know!
We encourage you to apply our encoding techniques to your application data to reduce file sizes!
The restrictions on each of the lengths of the values in these fields are not enforced at decode-time but are reserved to help guide applications.
Name | length-restrictions |
---|---|
bondProperties | length of numBonds |
atomProperties | length of numAtoms |
groupProperties | length of numGroups |
chainProperties | length of numChains |
modelProperties | length of numModels |
extraProperties | None |
Example
data = {
"mmtfVersion": "1.1",
"numAtoms": 999,
"numModels": 2,
"numChains": 4,
...
"xCoordList": [1.2, 3.4, ...],
"yCoordList": [5.6, 7.8, ...],
"zCoordList": [9.0, 1.2, ...],
...
"modelProperties": {
# lists have len numModels=2
"foo_rmsdList": [0.5, 0.8],
"foo_scoreList": [1.2, 3.4],
},
"chainProperties": {
# lists have len numChains=4
"foo_uniprotIdList": ["HBB_HUMAN", "HBA_HUMAN", "HBB_HUMAN", "HBA_HUMAN"],
"foo_chainColorList": [0xFF0000, 0x00FF00, 0xFF0000, 0x00FF00],
},
"groupProperties": {
# lists have len numGroups
"stride_secStructList": [7, 7, 7, ...],
"sst_secStructList": [7, 7, 7, ...],
},
"atomProperties": {
# lists have len numAtoms=999
"pymol_colorList": [1, 2, 3, ...],
"pymol_repsList": [1, 1, 1, ...],
"apbs_chargeList": [0.1, -0.4, 0.7, ...],
"apbs_radiusList": [1.2, 1.8, 1.5, ...],
},
"bondProperties": {
# lists have len numBonds
"pymol_bondTypeList": [1, 1, 1, 4, 4, 4, 4, 4, 4, 1, ...],
},
"extraProperties": {
"pymol_bondTypes": {0: "metal", 1: "single", 2: "double", 3: "triple", 4: "aromatic"},
"foo_id": "ABC",
}
}
Optional field
Type: Map<String, Array|Binary>
Description: A field meant to store information on specific bonds. Must have the same length as numBonds.
Reserved key:value pairs
Key | Value-description |
---|---|
colorList | color hex code |
transparancyList | float (0-1) |
representationTypeList | (0: "lines", 1: "spheres", 2: "surface", 3: "ball and stick", 4: "cartoon") |
Optional field
Type: Map<String, Array|Binary>
Description: A field meant to store information on specific atoms. Must have the same length as numAtoms
Reserved key:value pairs
Key | Value-description |
---|---|
colorList | color hex code |
transparancyList | float (0-1) |
representationTypeList | (0: "lines", 1: "spheres", 2: "surface", 3: "ball and stick", 4: "cartoon") |
Optional field
Type: Map<String, Array|Binary>
Description: A field meant to store information on specific groups. Must have the same length as numGroups
Reserved key:value pairs
Key | Value-description |
---|---|
colorList | color hex code |
transparancyList | float (0-1) |
representationTypeList | (0: "lines", 1: "spheres", 2: "surface", 3: "ball and stick", 4: "cartoon") |
Optional field
Type: Map<String, Array|Binary>
Description: A field meant to store information on specific chains. Must have the same length as numChains
Reserved key:value pairs
Key | Value-description |
---|---|
colorList | color hex code |
transparancyList | float (0-1) |
representationTypeList | (0: "lines", 1: "spheres", 2: "surface", 3: "ball and stick", 4: "cartoon") |
Optional field
Type: Map<String, Array|Binary>
Description: A field meant to store information on specific models. Must have the same length as numModels
Reserved key:value pairs
Key | Value-description |
---|---|
colorList | color hex code |
transparancyList | float (0-1) |
representationTypeList | (0: "lines", 1: "spheres", 2: "surface", 3: "ball and stick", 4: "cartoon") |
rmsdList | list[floats] |
gdtList | list[floats] |
Optional field
Type: Map<String, String|Float|Integer|Map|Array|Binary>
Description: A field meant to store any information at all. There are no restrictions on second level data structures like the inner Map in Map<String, Map>.
Example:
# Ok!
data = {
"mmtfVersion": "1.1",
...
"extraProperties": {
"pymol_bondTypes": {0: "metal", 1: "single", 2: "double", 3: "triple", 4: "aromatic"},
}
}
# Bad!
data = {
"mmtfVersion": "1.1",
...
"extraProperties": {
1: {0: "false", 1: "true", 2: "true", 3: "false", 4: "false"},
}
}
Reserved key:value pairs
Key | Value-description |
---|---|
cameraPosition | cartesian (x,y,z) |
cameraDirectionVector | cartesian vector [(x,y,z), (x,y,z)] |
The following traversal pseudo code assumes that all fields have been decoded.
# initialize index counters
set modelIndex to 0
set chainIndex to 0
set groupIndex to 0
set atomIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
# traverse chains
for 1 to modelChainCount
print chainIndex
set offset to chainIndex * 4
print chainIdList[ offset : offset + 4 ]
print chainNameList[ offset : offset + 4 ]
set chainGroupCount to groupsPerChain[ chainIndex ]
# traverse groups
for 1 to chainGroupCount
print groupIndex
print groupIdList[ groupIndex ]
print insCodeList[ groupIndex ]
print secStructList[ groupIndex ]
print sequenceIndexList[ groupIndex ]
print groupTypeList[ groupIndex ]
set group to groupList[ groupTypeList[ groupIndex ] ]
print group.groupName
print group.singleLetterCode
print group.chemCompType
set atomOffset to atomIndex
set groupBondCount to group.bondAtomList.length / 2
for i in 1 to groupBondCount
print atomOffset + group.bondAtomList[ i * 2 ] # atomIndex1
print atomOffset + group.bondAtomList[ i * 2 + 1 ] # atomIndex2
print group.bondOrderList[ i ]
print group.bondResonanceList[ i ]
set groupAtomCount to group.atomNameList.length
# traverse atoms
for i in 1 to groupAtomCount
print atomIndex
print xCoordList[ atomIndex ]
print yCoordList[ atomIndex ]
print zCoordList[ atomIndex ]
print bFactorList[ atomIndex ]
print atomIdList[ atomIndex ]
print altLocList[ atomIndex ]
print occupancyList[ atomIndex ]
print group.formalChargeList[ i ]
print group.atomNameList[ i ]
print group.elementList[ i ]
increment atomIndex by 1
increment groupIndex by 1
increment chainIndex by 1
increment modelIndex by 1
# traverse inter-group bonds
for i in 1 to bondAtomList.length / 2
print bondAtomList[ i * 2 ] # atomIndex1
print bondAtomList[ i * 2 + 1 ] # atomIndex2
print bondOrderList[ i ]
print bondResonanceList[ i ]