NCC data format
The main NCC output format for contact data takes the form of whitespace-separated lines, where each line represents a different chromosomal contact that pairs two chromosomal regions. This format specifies more than just the chromosome contact map: it also records the original read locations within the (ligated) primary RE digest fragments, strand information, ambiguity information and information to relate the data back to the original FASTQ input files. Accordingly, this format can be used for data filtering and validation.
The columns of NCC files correspond to:
- Name of chromosome A
- First base position of sequence read A
- Last base position of sequence read A
- 5' base position of primary RE fragment containing read A
- 3' base position of primary RE fragment containing read A
- The strand of sequence read A
- Name of chromosome B
- First base position of sequence read B
- Last base position of sequence read B
- 5' base position of primary RE fragment containing read B
- 3' base position of primary RE fragment containing read B
- The strand of sequence read B
- Ambiguity group (see below)
- The ID number of the read pair in the original FASTQ files
- Whether read pairs are swapped relative to original FASTQ files
i.e.:
chr_a start_a end_a re1_a_start re1_a_end strand_a chr_b start_b end_b re1_b_start re1_b_end strand_b ambig_group pair_id swap_pair
For example, two lines could be:
chr10 100002828 100002899 100002733 100003107 + chr10 100015771 100015700 100015676 100016001 - 1.1 1534464 0
chr10 100007729 100007658 100007551 100008354 - chr13 107185630 107185700 107184984 107185698 + 1.1 1032357 1
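The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of nuc_processing: the `NccContact` type and the `parse_ncc_line` function are hypothetical names, with fields named after the column labels listed above. The ambiguity group is kept as text because its meaning depends on the format version.

```python
from typing import NamedTuple


class NccContact(NamedTuple):
    """One NCC line; a hypothetical container, fields follow the column labels."""
    chr_a: str
    start_a: int
    end_a: int
    re1_a_start: int
    re1_a_end: int
    strand_a: str
    chr_b: str
    start_b: int
    end_b: int
    re1_b_start: int
    re1_b_end: int
    strand_b: str
    ambig_group: str  # kept as text, not converted to a number
    pair_id: int
    swap_pair: int


def parse_ncc_line(line: str) -> NccContact:
    """Split one whitespace-separated NCC line into its 15 typed columns."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 columns, got {len(fields)}")
    (chr_a, start_a, end_a, re1_a_start, re1_a_end, strand_a,
     chr_b, start_b, end_b, re1_b_start, re1_b_end, strand_b,
     ambig_group, pair_id, swap_pair) = fields
    return NccContact(
        chr_a, int(start_a), int(end_a), int(re1_a_start), int(re1_a_end), strand_a,
        chr_b, int(start_b), int(end_b), int(re1_b_start), int(re1_b_end), strand_b,
        ambig_group, int(pair_id), int(swap_pair),
    )


# First example line from the text above.
contact = parse_ncc_line(
    "chr10 100002828 100002899 100002733 100003107 + "
    "chr10 100015771 100015700 100015676 100016001 - 1.1 1534464 0"
)
print(contact.chr_a, contact.start_a, contact.strand_b, contact.pair_id)
```

Note that a read on the "-" strand has a first base position larger than its last (as for read B here), so the sketch does not reorder the start and end values.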
The NCC format has changed since the initial release of nuc_processing regarding how ambiguity is recorded. Here ambiguity group refers to the alternative mappings for the same DNA sequence read pair, for example where a read sequence maps to multiple genome positions.
Originally the ambiguity group was an integer identity number for the group; it was repeated for all mapped pairs within that group, but otherwise unique. Given that this information was effectively already present in the next (pair_id) NCC column, the file format was changed so that NCC files could be concatenated. Also, the new format allows mapping possibilities to be activated or deactivated without loss of the original underlying information. In this manner the resolution of ambiguous contacts can be reversed or reconsidered, e.g. during a genome structure calculation as better 3D models are generated. One caveat, however, is that the lines of NCC files are now strictly ordered and must not be sorted.
The NCC ambiguity group is now specified via a {group_size}.{is_active} sub-format. Here {group_size} is the number of alternative pairs, but it is only present on the first line of the group; on all other lines of the group it is "0". The {is_active} part is either "1" or "0" depending on whether that line (contact pair) is active or not.
For example, in the following three lines the first ambiguity code is "2.1": there are two pairs in this group and this line is active. The second code is "0.1": this line is in the same group as the first and is also active. The third code is "1.0": this line is in a new group, with only one pair, and the line is inactive.
chr10.b 25126043 25125933 25125805 25126302 - chr10.b 25736466 25736315 25736194 25736671 - 2.1 20 1
chr10.a 25863031 25862880 25862753 25863236 - chr10.b 25126043 25125933 25125805 25126302 - 0.1 20 0
chr14.a 18522852 18522701 18522632 18523275 - chr15.a 18773961 18773893 18773149 18773895 - 1.0 70 0
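Because a non-zero {group_size} appears only on the first line of a group, the groups can be recovered in a single ordered pass over the file. The sketch below illustrates this under stated assumptions: `split_ambig_code` and `iter_ambig_groups` are hypothetical helper names, the input uses the newer {group_size}.{is_active} codes (not the original integer IDs), and no check is made that the declared group size matches the number of lines that follow.

```python
from typing import Iterable, Iterator, List, Tuple

AMBIG_COLUMN = 12  # zero-based index of the ambiguity group column


def split_ambig_code(code: str) -> Tuple[int, bool]:
    """Split a code such as "2.1" into (group_size, is_active)."""
    size_text, active_text = code.split(".")
    return int(size_text), active_text == "1"


def iter_ambig_groups(lines: Iterable[str]) -> Iterator[List[Tuple[str, bool]]]:
    """Yield one list of (line, is_active) tuples per ambiguity group."""
    group: List[Tuple[str, bool]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size, is_active = split_ambig_code(line.split()[AMBIG_COLUMN])
        if size > 0 and group:
            yield group  # a non-zero size opens a new group
            group = []
        group.append((line, is_active))
    if group:
        yield group  # emit the final group


# The three example lines from the text above.
example_lines = [
    "chr10.b 25126043 25125933 25125805 25126302 - chr10.b 25736466 25736315 25736194 25736671 - 2.1 20 1",
    "chr10.a 25863031 25862880 25862753 25863236 - chr10.b 25126043 25125933 25125805 25126302 - 0.1 20 0",
    "chr14.a 18522852 18522701 18522632 18523275 - chr15.a 18773961 18773893 18773149 18773895 - 1.0 70 0",
]

groups = list(iter_ambig_groups(example_lines))
group_sizes = [len(group) for group in groups]
active_flags = [[active for _, active in group] for group in groups]
print(group_sizes)   # [2, 1]
print(active_flags)  # [[True, True], [False]]
```

This single-pass approach relies on the line order being preserved, which is why NCC files in this format must not be sorted.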
The reason to have zeros instead of group sizes for the second and subsequent pairs is for human readability; this makes it visually obvious what the extent of the group is within the lines of the file.
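Activating or deactivating a contact only changes the {is_active} part of the code, so the group structure and all other columns are left untouched. A minimal sketch of this, assuming a hypothetical helper named `set_active` that rewrites a single line (it rejoins the columns with single spaces, which is still valid whitespace-separated NCC):

```python
def set_active(line: str, active: bool) -> str:
    """Return the NCC line with only the is_active flag changed."""
    fields = line.split()
    # Column 13 (zero-based index 12) holds the {group_size}.{is_active} code.
    group_size, _old_flag = fields[12].split(".")
    fields[12] = f"{group_size}.{1 if active else 0}"
    return " ".join(fields)


# Third example line from the text above: a single-pair group, inactive.
inactive_line = (
    "chr14.a 18522852 18522701 18522632 18523275 - "
    "chr15.a 18773961 18773893 18773149 18773895 - 1.0 70 0"
)
reactivated_line = set_active(inactive_line, True)
print(reactivated_line.split()[12])  # 1.1
```

Since the group size is preserved, a later call with `active=False` restores the original code, which matches the idea that ambiguity resolution can be reversed.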