This Python Jupyter notebook processes the PacBio circular consensus sequences (CCSs) to extract barcodes and call mutations in the gene. It then builds consensus sequences for barcoded variants from those mutations, yielding a barcode-variant lookup table.
First, process the PacBio CCSs to extract barcodes and call mutations in the gene.
Import Python modules
Plotting is done with plotnine, which uses ggplot2-like syntax.
The analysis uses the Bloom lab's alignparse and dms_variants packages.
import collections
import math
import os
import re
import time
import warnings
import alignparse
import alignparse.ccs
from alignparse.constants import CBPALETTE
import alignparse.minimap2
import alignparse.targets
import alignparse.consensus
import dms_variants
import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.utils
from IPython.display import display, HTML
import numpy
import pandas as pd
from plotnine import *
import yaml
Set plotnine theme to the one defined in dms_variants:
theme_set(dms_variants.plotnine_themes.theme_graygrid())
Versions of key software:
print(f"Using alignparse version {alignparse.__version__}")
print(f"Using dms_variants version {dms_variants.__version__}")
Using alignparse version 0.2.4
Using dms_variants version 0.8.9
Ignore warnings that clutter output:
warnings.simplefilter('ignore')
Read the configuration file:
with open('config.yaml') as f:
config = yaml.safe_load(f)
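For orientation, here is a sketch of the config.yaml entries this notebook reads. The paths can be inferred from the outputs shown below, but the entries marked as assumed (pacbio_runs, seqdata_source, max_cpus, max_error_rate) are illustrative guesses, not necessarily the repo's actual settings:
ccs_dir: results/ccs
process_ccs_dir: results/process_ccs
figs_dir: results/figures
variants_dir: results/variants
amplicon: data/PacBio_amplicon.gb
feature_parse_specs: data/feature_parse_specs.yaml
primary_target: CGG_naive
processed_ccs_file: results/process_ccs/processed_ccs.csv
nt_variant_table_file: results/variants/nucleotide_variant_table.csv
codon_variant_table_file: results/variants/codon_variant_table.csv
pacbio_runs: data/PacBio_runs.csv  # assumed path
seqdata_source: HutchServer  # or SRA
max_cpus: 8  # assumed
max_error_rate: 0.0001  # assumed accuracy threshold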
Make output directory for figures:
os.makedirs(config['process_ccs_dir'], exist_ok=True)
Get the amplicons sequenced by PacBio as the alignment target along with the specs on how to parse the features:
print(f"Reading amplicons from {config['amplicon']}")
print(f"Reading feature parse specs from {config['feature_parse_specs']}")
targets = alignparse.targets.Targets(
seqsfile=config['amplicon'],
feature_parse_specs=config['feature_parse_specs'])
Reading amplicons from data/PacBio_amplicon.gb
Reading feature parse specs from data/feature_parse_specs.yaml
Draw the target amplicons:
fig = targets.plot(ax_width=7,
plots_indexing='biopython', # numbering starts at 0
ax_height=2, # height of each plot
hspace=1.2, # vertical space between plots
)
plotfile = os.path.join(config['process_ccs_dir'], 'amplicons.pdf')
print(f"Saving plot to {plotfile}")
fig.savefig(plotfile, bbox_inches='tight')
Saving plot to results/process_ccs/amplicons.pdf
Write out the specs used to parse the features (these are the same specs provided as feature_parse_specs when initializing targets, but with defaults filled in):
print(targets.feature_parse_specs('yaml'))
CGG_naive:
query_clip5: 4
query_clip3: 4
termini5:
filter:
clip5: 4
mutation_nt_count: 1
mutation_op_count: null
clip3: 0
return: []
gene:
filter:
mutation_nt_count: 90
mutation_op_count: null
clip5: 0
clip3: 0
return:
- mutations
- accuracy
spacer:
filter:
mutation_nt_count: 1
mutation_op_count: null
clip5: 0
clip3: 0
return: []
barcode:
filter:
mutation_nt_count: 0
mutation_op_count: null
clip5: 0
clip3: 0
return:
- sequence
- accuracy
termini3:
filter:
clip3: 4
mutation_nt_count: 1
mutation_op_count: null
clip5: 0
return: []
Read data frame with information on PacBio runs:
pacbio_runs = (
pd.read_csv(config['pacbio_runs'], dtype=str)
.drop(columns=['subreads'])
.assign(name=lambda x: x['library'] + '_' + x['run'],
fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
)
)
# we only have report files on the Hutch server, not for SRA download
if config['seqdata_source'] == 'HutchServer':
pacbio_runs = (
pacbio_runs
.assign(report=lambda x: config['ccs_dir'] + '/' + x['name'] + '_report.txt')
)
report_col = 'report'
elif config['seqdata_source'] == 'SRA':
report_col = None
else:
raise ValueError(f"invalid `seqdata_source` {config['seqdata_source']}")
display(HTML(pacbio_runs.to_html(index=False)))
library | run | name | fastq | report |
---|---|---|---|---|
lib1 | 210619 | lib1_210619 | results/ccs/lib1_210619_ccs.fastq.gz | results/ccs/lib1_210619_report.txt |
lib2 | 210619 | lib2_210619 | results/ccs/lib2_210619_ccs.fastq.gz | results/ccs/lib2_210619_report.txt |
Create an object that summarizes the ccs runs:
ccs_summaries = alignparse.ccs.Summaries(pacbio_runs,
report_col=report_col,
ncpus=config['max_cpus'],
)
If available, plot statistics on the number of ZMWs for each run:
if ccs_summaries.has_zmw_stats():
p = ccs_summaries.plot_zmw_stats()
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
else:
print('No ZMW stats available.')
Plot statistics on the generated CCSs: their length, number of subread passes, and accuracy (as reported by the ccs program):
for variable in ['length', 'passes', 'accuracy']:
if ccs_summaries.has_stat(variable):
p = ccs_summaries.plot_ccs_stats(variable, maxcol=7, bins=25)
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
else:
print(f"No {variable} statistics available.")
We now align the CCSs to the amplicon and parse features from the resulting alignments using the specs above.
First, we initialize an alignparse.minimap2.Mapper to align the reads to SAM files:
mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)
print(f"Using `minimap2` {mapper.version} with these options:\n" +
' '.join(mapper.options))
Using `minimap2` 2.18-r1015 with these options:
-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs
Next, we use Targets.align_and_parse to create the alignments and parse them:
readstats, aligned, filtered = targets.align_and_parse(
df=pacbio_runs,
mapper=mapper,
outdir=config['process_ccs_dir'],
name_col='run',
group_cols=['name', 'library'],
queryfile_col='fastq',
overwrite=True,
ncpus=config['max_cpus'],
)
First, examine the read stats from the alignment / parsing, extracting the alignment target name from each category and flagging which CCSs aligned validly:
readstats = (
readstats
.assign(category_all_targets=lambda x: x['category'].str.split().str[0],
target=lambda x: x['category'].str.split(None, 1).str[1],
valid=lambda x: x['category_all_targets'] == 'aligned')
)
Now plot the read stats by run (combining all targets and libraries within a run):
ncol = 2
p = (
ggplot(readstats
.groupby(['name', 'category_all_targets', 'valid'])
.aggregate({'count': 'sum'})
.reset_index(),
aes('category_all_targets', 'count', fill='valid')) +
geom_bar(stat='identity') +
facet_wrap('~ name', ncol=ncol) +
theme(axis_text_x=element_text(angle=90),
figure_size=(1.85 * min(ncol, len(pacbio_runs)),
2 * math.ceil(len(pacbio_runs) / ncol)),
panel_grid_major_x=element_blank(),
legend_position='none',
) +
scale_fill_manual(values=CBPALETTE)
)
_ = p.draw()
And the read stats by library (combining all targets and runs within a library):
p = (
ggplot(readstats
.groupby(['library', 'category_all_targets', 'valid'])
.aggregate({'count': 'sum'})
.reset_index(),
aes('category_all_targets', 'count', fill='valid')) +
geom_bar(stat='identity') +
facet_wrap('~ library', nrow=1) +
theme(axis_text_x=element_text(angle=90),
figure_size=(1.5 * pacbio_runs['library'].nunique(), 2),
panel_grid_major_x=element_blank(),
legend_position='none',
) +
scale_fill_manual(values=CBPALETTE)
)
_ = p.draw()
And the number of reads by target (combining all libraries and runs for a target):
p = (
ggplot(readstats
.groupby(['target'])
.aggregate({'count': 'sum'})
.reset_index(),
aes('target', 'count')) +
geom_point(stat='identity', size=3) +
theme(axis_text_x=element_text(angle=90),
figure_size=(0.3 * readstats['target'].nunique(), 2),
panel_grid_major_x=element_blank(),
) +
scale_y_log10(name='number of reads')
)
_ = p.draw()
And read stats by target (combining all libraries and runs for a target):
p = (
ggplot(readstats
.groupby(['target', 'valid'])
.aggregate({'count': 'sum'})
.reset_index()
.assign(total=lambda x: x.groupby('target')['count'].transform('sum'),
frac=lambda x: x['count'] / x['total'],
),
aes('target', 'frac', fill='valid')) +
geom_bar(stat='identity') +
theme(axis_text_x=element_text(angle=90),
figure_size=(0.5 * readstats['target'].nunique(), 2),
panel_grid_major_x=element_blank(),
) +
scale_fill_manual(values=CBPALETTE)
)
_ = p.draw()
Now let's see why we filtered the reads.
First, we do some transformations on the filtered dict returned by Targets.align_and_parse.
Then we count up the number of CCSs filtered for each reason, and group together "unusual" reasons that represent less than some fraction of all filtering.
For now, we group together all targets so the stats represent all targets combined:
other_cutoff = 0.02 # group as "other" reasons with <= this frac
filtered_df = (
pd.concat(df.assign(target=target) for target, df in filtered.items())
.groupby(['library', 'name', 'run', 'filter_reason'])
.size()
.rename('count')
.reset_index()
.assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
.transform('sum')) / x['count'].sum(),
filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
x['filter_reason'],
'other')
)
)
Now plot the filtering reason for all runs:
ncol = 7
nreasons = filtered_df['filter_reason'].nunique()
p = (
ggplot(filtered_df, aes('filter_reason', 'count')) +
geom_bar(stat='identity') +
facet_wrap('~ name', ncol=ncol) +
theme(axis_text_x=element_text(angle=90),
figure_size=(0.25 * nreasons * min(ncol, len(pacbio_runs)),
2 * math.ceil(len(pacbio_runs) / ncol)),
panel_grid_major_x=element_blank(),
)
)
_ = p.draw()
Finally, we take the successfully parsed alignments and read them into a data frame, keeping track of the target that each CCS aligns to. We also drop the pieces of information we won't use going forward, and rename a few columns:
aligned_df = (
pd.concat(df.assign(target=target) for target, df in aligned.items())
.drop(columns=['query_clip5', 'query_clip3', 'run', 'name'])
.rename(columns={'barcode_sequence': 'barcode'})
)
print("First few lines of information on the parsed alignments:")
display(HTML(aligned_df.head().to_html(index=False)))
First few lines of information on the parsed alignments:
library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
---|---|---|---|---|---|---|
lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/43/ccs | | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/83/ccs | | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |
Write the processed CCSs to a file:
aligned_df.to_csv(config['processed_ccs_file'], index=False)
print("Barcodes and mutations for valid processed CCSs "
f"have been written to {config['processed_ccs_file']}.")
Barcodes and mutations for valid processed CCSs have been written to results/process_ccs/processed_ccs.csv.
Next, we analyze these processed CCSs to build the variants.
This part of the notebook builds consensus sequences for barcoded variants from the mutations called in the processed PacBio CCSs, then uses those consensus sequences to build a codon variant table.
Make output directories if needed:
os.makedirs(config['variants_dir'], exist_ok=True)
os.makedirs(config['figs_dir'], exist_ok=True)
Read the CSV file with the processed CCSs into a data frame, display first few lines:
processed_ccs = pd.read_csv(config['processed_ccs_file'], na_filter=None)
nlibs = processed_ccs['library'].nunique() # number of unique libraries
ntargets = processed_ccs['target'].nunique() # number of unique targets
print(f"Read {len(processed_ccs)} CCSs from {nlibs} libraries and {ntargets} targets.")
display(HTML(processed_ccs.head().to_html(index=False)))
Read 1443261 CCSs from 2 libraries and 1 targets.
library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
---|---|---|---|---|---|---|
lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/43/ccs | | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
lib1 | m64272e_210802_074641/83/ccs | | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |
Overall statistics on number of total CCSs and number of unique barcodes:
display(HTML(
processed_ccs
.groupby(['target', 'library'])
.aggregate(total_CCSs=('barcode', 'size'),
unique_barcodes=('barcode', 'nunique'))
.assign(avg_CCSs_per_barcode=lambda x: x['total_CCSs'] / x['unique_barcodes'])
.round(2)
.to_html()
))
target | library | total_CCSs | unique_barcodes | avg_CCSs_per_barcode |
---|---|---|---|---|
CGG_naive | lib1 | 802675 | 108569 | 7.39 |
CGG_naive | lib2 | 640586 | 118042 | 5.43 |
We have the PacBio ccs program's estimated "accuracy" for both the barcode and the gene sequence for each processed CCS.
We will filter the CCSs to only keep ones of sufficiently high accuracy.
First, we want to plot the accuracies. It is actually visually easier to look at the error rate, which is one minus the accuracy. Because we want to plot on a log scale (which can't show error rates of zero), we define an error_rate_floor and set all error rates less than this to that value:
error_rate_floor = 1e-7 # error rates < this set to this
if error_rate_floor >= config['max_error_rate']:
raise ValueError('error_rate_floor must be < max_error_rate')
processed_ccs = (
processed_ccs
.assign(barcode_error=lambda x: numpy.clip(1 - x['barcode_accuracy'],
error_rate_floor, None),
gene_error=lambda x: numpy.clip(1 - x['gene_accuracy'],
error_rate_floor, None)
)
)
Now plot the error rates, drawing a dashed vertical line at the threshold separating the CCSs we retain for consensus building versus those that we discard:
_ = (
ggplot(processed_ccs
.melt(value_vars=['barcode_error', 'gene_error'],
var_name='feature_type', value_name='error rate'),
aes('error rate')) +
geom_histogram(bins=25) +
geom_vline(xintercept=config['max_error_rate'],
linetype='dashed',
color=CBPALETTE[1]) +
facet_wrap('~ feature_type') +
theme(figure_size=(4.5, 2)) +
ylab('number of CCSs') +
scale_x_log10()
).draw()
Flag the CCSs to retain, and indicate how many we are retaining and purging due to the accuracy filter:
processed_ccs = (
processed_ccs
.assign(retained=lambda x: ((x['gene_error'] < config['max_error_rate']) &
(x['barcode_error'] < config['max_error_rate'])))
)
Here are number of retained CCSs:
_ = (
ggplot(processed_ccs.assign(xlabel=lambda x: x['target'] + ', ' + x['library'])
.groupby(['xlabel', 'retained'])
.size()
.rename('count')
.reset_index(),
aes('xlabel', 'count', color='retained', label='count')) +
geom_point(size=3) +
geom_text(va='bottom', size=7, ha='center', format_string='{:.3g}', nudge_y=0.2) +
theme(figure_size=(0.5 * nlibs * ntargets, 3),
panel_grid_major_x=element_blank(),
axis_text_x=element_text(angle=90),
) +
scale_y_log10(name='number of CCSs') +
xlab('') +
scale_color_manual(values=CBPALETTE[1:])
).draw()
How many times is each barcode sequenced? This is useful to know for thinking about building the barcode consensus.
Plot the distribution of the number of times each barcode is observed among the retained CCSs:
max_count = 8 # in plot, group all barcodes with >= this many counts
p = (
ggplot(
processed_ccs
.query('retained')
.groupby(['library', 'barcode'])
.size()
.rename('nseqs')
.reset_index()
.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_count)),
aes('nseqs')) +
geom_bar() +
facet_wrap('~ library', nrow=1) +
theme(figure_size=(1.75 * nlibs, 2),
panel_grid_major_x=element_blank(),
) +
ylab('number of barcodes') +
xlab('CCSs for barcode')
)
_ = p.draw()
We want to directly estimate the accuracy of the gene-barcode link rather than relying on the PacBio ccs accuracy, which doesn't include inaccuracies due to things like strand exchange or the same barcode on different sequences.
One way to do this is to examine instances when we have multiple sequences for the same barcode.
We can calculate the empirical accuracy of the sequences by looking at all instances of multiple sequences of the same barcode and determining how often they are identical.
This calculation is performed by alignparse.consensus.empirical_accuracy using the equations described in the docs for that function.
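The core idea can be sketched simply. If each CCS is entirely correct with probability a, and errors in different CCSs essentially never coincide, then two CCSs with the same barcode are identical with probability about a^2, so a is roughly the square root of the fraction of identical within-barcode pairs. Below is a toy back-of-the-envelope illustration of that idea (not the actual alignparse implementation, which handles group sizes and indels more carefully as described in its docs):
import itertools
import math

def toy_empirical_accuracy(barcode_groups):
    """Toy estimate of per-CCS accuracy from pairwise concordance.

    `barcode_groups` is a list of lists of mutation strings, one inner
    list per barcode. If each CCS is fully correct with probability a and
    errors never coincide, two CCSs of the same barcode match with
    probability ~a**2, so a ~ sqrt(identical fraction among pairs).
    """
    n_pairs = n_identical = 0
    for mut_strs in barcode_groups:
        for s1, s2 in itertools.combinations(mut_strs, 2):
            n_pairs += 1
            n_identical += (s1 == s2)
    return math.sqrt(n_identical / n_pairs)

# toy data: gene_mutations strings grouped by barcode
print(toy_empirical_accuracy([['A1G', 'A1G', 'A1G C9T'],  # one discordant CCS
                              ['', ''],                    # identical unmutated CCSs
                              ['T5C', 'T5C']]))            # identical mutated CCSs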
We will do this for four sets of sequences:
- All of the CCSs retained above.
- CCSs retained by applying a PacBio ccs accuracy filter 10-fold more stringent than the one above. The rationale is that if this substantially improves the concordance (real accuracy) of the CCSs, then maybe we should make the accuracy filter more stringent.
- Like (1) but excluding all CCSs with indels. The rationale is that we only really care about substitutions, and will exclude sequences with indels anyway.
- Like (2) but excluding all CCSs with indels.
First, we annotate the sequences with the number of indels and whether they have an indel to enable categorization into the aforementioned sets:
processed_ccs = processed_ccs.reset_index(drop=True)
processed_ccs = alignparse.consensus.add_mut_info_cols(processed_ccs,
mutation_col='gene_mutations',
n_indel_col='n_indels')
processed_ccs = processed_ccs.assign(has_indel=lambda x: x['n_indels'] > 0)
Plot how many sequences have indels:
_ = (
ggplot(processed_ccs,
aes('retained', fill='has_indel')) +
geom_bar(position='dodge') +
geom_text(aes(label='..count..'), stat='count', va='bottom', size=7,
position=position_dodge(width=0.9), format_string='{:.2g}') +
theme(figure_size=(2.5 * nlibs, 3),
panel_grid_major_x=element_blank(),
) +
ylab('number of CCSs') +
scale_fill_manual(values=CBPALETTE[1:]) +
facet_wrap('~ library', nrow=1)
).draw()
Now get the empirical accuracy for each of the CCS groups mentioned above:
high_acc = config['max_error_rate'] / 10
empirical_acc = []
for desc, query_str in [
('retained', 'retained'),
('retained, no indel', 'retained and not has_indel'),
('10X accuracy',
f"(gene_error < {high_acc}) and (barcode_error < {high_acc})"),
('10X accuracy, no indel',
f"(gene_error < {high_acc}) and (barcode_error < {high_acc}) and not has_indel")
]:
# get just CCSs in that category
df = processed_ccs.query(query_str)
# compute empirical accuracy
empirical_acc.append(
alignparse.consensus.empirical_accuracy(df,
mutation_col='gene_mutations')
.assign(description=desc)
.merge(df
.groupby('library')
.size()
.rename('number_CCSs')
.reset_index()
)
)
# make description categorical to preserve order, and annotate as "actual"
# the category ("retained, no indel") that we will use for building variants.
empirical_acc = (
pd.concat(empirical_acc, ignore_index=True, sort=False)
.assign(description=lambda x: pd.Categorical(x['description'],
x['description'].unique(),
ordered=True),
actual=lambda x: numpy.where(x['description'] == 'retained, no indel',
True, False),
)
)
Display table of the empirical accuracies:
display(HTML(empirical_acc.to_html(index=False)))
library | accuracy | description | number_CCSs | actual |
---|---|---|---|---|
lib1 | 0.956602 | retained | 724908 | False |
lib2 | 0.949559 | retained | 577257 | False |
lib1 | 0.987664 | retained, no indel | 696647 | True |
lib2 | 0.982219 | retained, no indel | 553582 | True |
lib1 | 0.967650 | 10X accuracy | 681243 | False |
lib2 | 0.961755 | 10X accuracy | 541970 | False |
lib1 | 0.989080 | 10X accuracy, no indel | 661332 | False |
lib2 | 0.984277 | 10X accuracy, no indel | 525460 | False |
Plot the empirical accuracies, using a different color to show the category that we will actually use:
p = (
ggplot(empirical_acc,
aes('description', 'accuracy', color='actual', label='accuracy')
) +
geom_point(size=3) +
geom_text(va='bottom', size=9, format_string='{:.3g}', nudge_y=0.003) +
facet_wrap('~ library', ncol=8) +
theme(figure_size=(1.75 * nlibs, 2.25),
axis_text_x=element_text(angle=90),
panel_grid_major_x=element_blank(),
) +
xlab('') +
scale_y_continuous(name='empirical accuracy', limits=(0.95, 1.005)) +
scale_color_manual(values=CBPALETTE, guide=False)
)
plotfile = os.path.join(config['figs_dir'], 'empirical_CCS_accuracy.pdf')
print(f"Saving plot to {plotfile}")
_ = p.draw()
Saving plot to results/figures/empirical_CCS_accuracy.pdf
The above analysis shows that if we exclude sequences with indels (which we plan to do among our consensus sequences), then the accuracy of each CCS is around 99%.
We do not get notably higher empirical accuracy by imposing a more stringent filter from the PacBio ccs
program, indicating that the major sources of error are due to processes that are not modeled in this program's accuracy filter (perhaps strand exchange or barcode sharing).
Note that this empirical accuracy is for a single CCS. When we build the consensus sequences for each barcode below, we will take the consensus of CCSs within a barcode. So for barcodes with multiple CCSs, the actual accuracy of the consensus sequences will be higher than the empirical accuracy above due to capturing information from multiple CCSs.
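As a rough illustration of why consensus accuracy improves with CCS count, consider a simple majority-rule model (a back-of-the-envelope sketch assuming ~1% per-CCS error and independent errors; not the actual alignparse calculation):
from math import ceil, comb

def consensus_error_upper_bound(n_ccs, per_ccs_error=0.01):
    """Crude upper bound on the chance a majority-rule consensus is wrong:
    at least ceil(n_ccs / 2) CCSs must carry an error (and they would also
    have to agree on it, which we ignore, so this over-estimates)."""
    k_min = ceil(n_ccs / 2)
    return sum(comb(n_ccs, k) * per_ccs_error**k * (1 - per_ccs_error)**(n_ccs - k)
               for k in range(k_min, n_ccs + 1))

for n in [1, 3, 5]:
    print(f"{n} CCSs: consensus wrong with probability < {consensus_error_upper_bound(n):.1g}")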
We call the consensus sequence for each barcode using the simple method implemented in alignparse.consensus.simple_mutconsensus. The documentation for that function explains the method in detail, but basically it works like this:
- When there is just one CCS per barcode, the consensus is just that sequence.
- When there are multiple CCSs per barcode, they are used to build a consensus. However, the entire barcode is discarded if the CCSs have many differences from one another or high-frequency non-consensus mutations, since such patterns suggest errors like barcode collisions or strand exchange. A simplified sketch of this logic follows the list.
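Here is a simplified sketch of that logic for a single barcode (hypothetical thresholds for illustration; the actual alignparse.consensus.simple_mutconsensus handles substitutions and indels separately, with the defaults described in its docs):
import collections

def toy_mutconsensus(mut_strs, max_diffs=4, max_minor_frac=0.25):
    """Toy consensus over mutation strings for one barcode.

    Returns the consensus mutation string, or None if the barcode
    should be dropped as unreliable."""
    if len(mut_strs) == 1:
        return mut_strs[0]  # single CCS: consensus is that sequence
    counts = collections.Counter(m for s in mut_strs for m in s.split())
    # consensus mutations are those in a majority of CCSs
    consensus = [m for m, n in counts.items() if n / len(mut_strs) > 0.5]
    # drop the barcode if any CCS differs too much from the consensus...
    for s in mut_strs:
        if len(set(s.split()).symmetric_difference(consensus)) > max_diffs:
            return None
    # ...or if any non-consensus mutation appears at high frequency
    for m, n in counts.items():
        if m not in consensus and n / len(mut_strs) > max_minor_frac:
            return None
    return ' '.join(sorted(consensus))

print(toy_mutconsensus(['A1G C9T', 'A1G C9T', 'A1G']))  # 'A1G C9T'
print(toy_mutconsensus(['A1G', 'T5C G7A C9T T12A A15C']))  # None (CCSs too discordant)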
First, call the consensus for each barcode including all retained sequences, even those with indels:
consensus, dropped = alignparse.consensus.simple_mutconsensus(
processed_ccs.query('retained'),
group_cols=('library', 'barcode', 'target'),
mutation_col='gene_mutations',
)
Here are the first few lines of the data frame of consensus sequences for each barcode. In addition to giving the library, barcode, target, and mutations, it also has a column indicating how many CCSs support the variant call:
display(HTML(consensus.head().to_html(index=False)))
library | barcode | target | gene_mutations | variant_call_support |
---|---|---|---|---|
lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 |
lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 |
lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 |
lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 |
lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 |
Since we retain variants with substitutions but ignore those with indels, add information about substitution mutations and number of indels:
consensus = alignparse.consensus.add_mut_info_cols(
consensus,
mutation_col='gene_mutations',
sub_str_col='substitutions',
n_indel_col='number_of_indels',
overwrite_cols=True)
display(HTML(consensus.head().to_html(index=False)))
library | barcode | target | gene_mutations | variant_call_support | substitutions | number_of_indels |
---|---|---|---|---|---|---|
lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 | C357T T598A T599C A600T | 0 |
lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 | C46T A47G | 0 |
lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 | G466C T467A G468T | 0 |
lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 | T329G G330T | 0 |
lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 | T139C A140C C141A | 0 |
Plot distribution of number of CCSs supporting each variant call (consensus), indicating whether or not there is an indel:
max_variant_call_support = 6 # group variants with >= this much support
_ = (
ggplot(consensus
.assign(variant_call_support=lambda x: numpy.clip(x['variant_call_support'],
None,
max_variant_call_support),
indel_state=lambda x: numpy.where(x['number_of_indels'] > 0,
'has indel', 'no indel')
),
aes('variant_call_support')) +
geom_bar() +
ylab('number of variants') +
facet_grid('indel_state ~ library') +
theme(figure_size=(1.75 * nlibs, 3.5),
panel_grid_major_x=element_blank(),
)
).draw()
We see that most variant consensus sequences do not have indels, especially if we limit to the more "accurate" ones that have multiple CCSs supporting them.
We will ignore all consensus sequences with indels in the variant-barcode lookup table. We do this for two reasons:
- When there is just one CCS supporting a consensus, it is less likely to be accurate as indels are the main mode of PacBio error.
- For the purposes of our studies, we are interested in point mutations rather than indels anyway.
Here are the numbers of valid consensus sequences (no indels) for each library and target:
consensus = consensus.query('number_of_indels == 0')
lib_target_counts = (
consensus
.groupby(['library', 'target'])
.size()
.rename('consensus sequences')
.reset_index()
)
display(HTML(lib_target_counts.to_html(index=False)))
p = (ggplot(lib_target_counts.assign(xlabel=lambda x: x['target'] + ', ' + x['library']),
aes('xlabel', 'consensus sequences')) +
geom_point(size=3) +
theme(figure_size=(0.5 * nlibs * ntargets, 1.75),
axis_text_x=element_text(angle=90)) +
xlab('') +
scale_y_log10()
)
_ = p.draw()
library | target | consensus sequences |
---|---|---|
lib1 | CGG_naive | 93059 |
lib2 | CGG_naive | 99370 |
Below we write the retained consensus sequences to a CSV file that links the nucleotide mutations to the barcodes. (The next section analyzes this variant table in detail, and provides more precise information on the number of variants and relevant statistics):
print(f"Writing nucleotide variants to {config['nt_variant_table_file']}")
(consensus
[['target', 'library', 'barcode', 'substitutions', 'variant_call_support']]
.to_csv(config['nt_variant_table_file'], index=False)
)
print('Here are the first few lines of this file:')
display(HTML(
pd.read_csv(config['nt_variant_table_file'], na_filter=None)
.head()
.to_html(index=False)
))
Writing nucleotide variants to results/variants/nucleotide_variant_table.csv
Here are the first few lines of this file:
target | library | barcode | substitutions | variant_call_support |
---|---|---|---|---|
CGG_naive | lib1 | AAAAAAAAAACACCGG | C357T T598A T599C A600T | 6 |
CGG_naive | lib1 | AAAAAAAAAACATGAG | C46T A47G | 1 |
CGG_naive | lib1 | AAAAAAAAAAGCGACG | G466C T467A G468T | 1 |
CGG_naive | lib1 | AAAAAAAAAAGGAAAG | T329G G330T | 6 |
CGG_naive | lib1 | AAAAAAAAAATATAGA | T139C A140C C141A | 1 |
What happened to the barcodes that we "dropped" because we could not construct a reliable consensus?
The dropped data frame from alignparse.consensus.simple_mutconsensus has this information:
display(HTML(dropped.head().to_html(index=False)))
library | barcode | target | drop_reason | nseqs |
---|---|---|---|---|
lib1 | AAAAAAAAAAATAATA | CGG_naive | subs diff too large | 10 |
lib1 | AAAAAAAAAATCCGAG | CGG_naive | subs diff too large | 18 |
lib1 | AAAAAAAACAACGTAA | CGG_naive | minor subs too frequent | 6 |
lib1 | AAAAAAAATCTCAAAC | CGG_naive | subs diff too large | 73 |
lib1 | AAAAAAACACAACTCA | CGG_naive | subs diff too large | 7 |
Summarize the information in this data frame on dropped barcodes with the plot below, which shows several things. First, the total number of barcodes dropped is modest (just a few thousand per library) relative to the total number of barcodes per library (seen above to be roughly a hundred thousand). Second, the main reason barcodes are dropped is that CCSs within the same barcode have suspiciously large numbers of mutations relative to the consensus; we use this as a filter to discard the entire barcode, as it could indicate strand exchange or some other issue. In any case, the modest number of dropped barcodes indicates there probably isn't much need to worry:
max_nseqs = 8 # plot together all barcodes with >= this many sequences
_ = (
ggplot(
dropped.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_nseqs)),
aes('nseqs')) +
geom_bar() +
scale_x_continuous(limits=(1, None)) +
xlab('number of sequences for barcode') +
ylab('number of barcodes') +
facet_grid('library ~ drop_reason') +
theme(figure_size=(10, 1.5 * nlibs),
panel_grid_major_x=element_blank(),
)
).draw()
We now create a CodonVariantTable that stores and processes all the information about the variant consensus sequences. Below we initialize such a table, and then analyze information about its composition.
In order to initialize the codon variant table, we need two pieces of information:
- The wildtype gene sequence.
- The list of nucleotide mutations for each variant as determined in the consensus calling above.
Read the "wildtype" gene sequence to which we made the alignments (to do this, initialize an alignparse.targets.Targets object and get the gene sequence from it):
targets = alignparse.targets.Targets(seqsfile=config['amplicon'],
feature_parse_specs=config['feature_parse_specs'])
geneseq = targets.get_target(config['primary_target']).get_feature('gene').seq
print(f"Read gene of {len(geneseq)} nts for {config['primary_target']} from {config['amplicon']}")
Read gene of 705 nts for CGG_naive from data/PacBio_amplicon.gb
Now initialize the codon variant table using this wildtype sequence and our list of nucleotide mutations for each variant:
variants = dms_variants.codonvarianttable.CodonVariantTable(
barcode_variant_file=config['nt_variant_table_file'],
geneseq=geneseq,
primary_target=config['primary_target'],
)
We now will analyze the variants. In this call and in the plots below, we set samples=None as we aren't looking at variant counts in specific samples, but are simply looking at properties of the variants in the table.
Here are the number of variants for each target:
display(HTML(
variants
.n_variants_df(samples=None)
.pivot_table(index=['target'],
columns='library',
values='count')
.to_html()
))
target | lib1 | lib2 | all libraries |
---|---|---|---|
CGG_naive | 93059 | 99370 | 192429 |
Plot the number of variants supported by each number of CCSs:
max_support = 10 # group variants with >= this much support
p = variants.plotVariantSupportHistogram(max_support=max_support,
widthscale=1.1,
heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
Plot the number of barcoded variants with each number of amino-acid and codon mutations. This is for the primary target only, and doesn't include the spiked-in secondary targets:
max_muts = 7 # group all variants with >= this many mutations
for mut_type in ['aa', 'codon']:
p = variants.plotNumMutsHistogram(mut_type, samples=None, max_muts=max_muts,
widthscale=1.1,
heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], f"n_{mut_type}_muts_per_variant.pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/n_aa_muts_per_variant.pdf
Saving plot to results/figures/n_codon_muts_per_variant.pdf
Plot the frequencies of different codon mutation types among all variants (any number of mutations), again only for primary target:
p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
ylabel='mutations per variant',
heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], 'avg_muts_per_variant.pdf')
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/avg_muts_per_variant.pdf
Variants supported by multiple PacBio CCSs should have fewer spurious mutations, since sequencing errors are very unlikely to occur at the same site in two independent CCSs.
Below we plot the number of codon mutations per variant among variants with at least two CCSs supporting their call.
The difference between the mutation rates here and in the plot above (which does not apply the min_support=2 filter) gives some estimate of the frequency of spurious mutations in our variants.
In fact, we see the numbers are very similar, indicating that few of the mutations are spurious:
p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
ylabel='mutations per variant',
min_support=2, heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank()) # no vertical grid lines
_ = p.draw()
We examine how completely amino-acid mutations are sampled by the variants for the primary target, looking at single-mutant variants only and all variants. The plot below shows that virtually every mutation is found in a variant in each library, even if we just look among the single mutants. Things look especially good if we aggregate across libraries:
for variant_type in ['all', 'single']:
p = variants.plotCumulMutCoverage(variant_type, mut_type='aa', samples=None)
_ = p.draw()
plotfile = os.path.join(config['figs_dir'],
f"variant_cumul_{variant_type}_mut_coverage.pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/variant_cumul_all_mut_coverage.pdf
Saving plot to results/figures/variant_cumul_single_mut_coverage.pdf
To get more quantitative information like that plotted above, we determine how many mutations are found 0, 1, or >1 times both among single and all mutants for the primary target:
count_dfs = []
for variant_type in ['all', 'single']:
i_counts = (variants.mutCounts(variant_type, mut_type='aa', samples=None)
.assign(variant_type=variant_type)
)
count_dfs += [i_counts.assign(include_stops=True),
i_counts
.query(r'not mutation.str.contains("\*")', engine='python')
.assign(include_stops=False)
]
display(HTML(
pd.concat(count_dfs)
.assign(count=lambda x: (numpy.clip(x['count'], None, 2)
.map({0: '0', 1: '1', 2:'>1'}))
)
.groupby(['variant_type', 'include_stops', 'library', 'count'])
.aggregate(number_of_mutations=pd.NamedAgg(column='mutation', aggfunc='count'))
.to_html()
))
variant_type | include_stops | library | count | number_of_mutations |
---|---|---|---|---|
all | False | lib1 | 0 | 240 |
all | False | lib1 | 1 | 25 |
all | False | lib1 | >1 | 4200 |
all | False | lib2 | 0 | 233 |
all | False | lib2 | 1 | 23 |
all | False | lib2 | >1 | 4209 |
all | False | all libraries | 0 | 225 |
all | False | all libraries | 1 | 8 |
all | False | all libraries | >1 | 4232 |
all | True | lib1 | 0 | 414 |
all | True | lib1 | 1 | 44 |
all | True | lib1 | >1 | 4242 |
all | True | lib2 | 0 | 399 |
all | True | lib2 | 1 | 39 |
all | True | lib2 | >1 | 4262 |
all | True | all libraries | 0 | 378 |
all | True | all libraries | 1 | 32 |
all | True | all libraries | >1 | 4290 |
single | False | lib1 | 0 | 275 |
single | False | lib1 | 1 | 35 |
single | False | lib1 | >1 | 4155 |
single | False | lib2 | 0 | 283 |
single | False | lib2 | 1 | 26 |
single | False | lib2 | >1 | 4156 |
single | False | all libraries | 0 | 257 |
single | False | all libraries | 1 | 22 |
single | False | all libraries | >1 | 4186 |
single | True | lib1 | 0 | 477 |
single | True | lib1 | 1 | 57 |
single | True | lib1 | >1 | 4166 |
single | True | lib2 | 0 | 477 |
single | True | lib2 | 1 | 57 |
single | True | lib2 | >1 | 4166 |
single | True | all libraries | 0 | 434 |
single | True | all libraries | 1 | 60 |
single | True | all libraries | >1 | 4206 |
We plot the frequencies of mutations along the gene among the variants for the primary target. Ideally, this would be uniform. We make the plot for both all variants and single-mutant / wildtype variants:
for variant_type in ['all', 'single']:
p = variants.plotMutFreqs(variant_type, mut_type='codon', samples=None)
p.draw()
We can also use heat maps to examine the extent to which specific amino-acid or codon mutations are over-represented. These heat maps are large, so we make them just for all variants and the merge of all libraries:
for mut_type in ['aa', 'codon']:
p = variants.plotMutHeatmap('all', mut_type, samples=None, #libraries='all_only',
widthscale=2)
p.draw()
Finally, we write the codon variant table to a CSV file. The table looks like this:
display(HTML(
variants.barcode_variant_df
.head()
.to_html(index=False)
))
target | library | barcode | variant_call_support | codon_substitutions | aa_substitutions | n_codon_substitutions | n_aa_substitutions |
---|---|---|---|---|---|---|---|
CGG_naive | lib1 | AAAAAAAAAACACCGG | 6 | GGC119GGT TTA200ACT | L200T | 2 | 1 |
CGG_naive | lib1 | AAAAAAAAAACATGAG | 1 | CAG16TGG | Q16W | 1 | 1 |
CGG_naive | lib1 | AAAAAAAAAAGCGACG | 1 | GTG156CAT | V156H | 1 | 1 |
CGG_naive | lib1 | AAAAAAAAAAGGAAAG | 6 | GTG110GGT | V110G | 1 | 1 |
CGG_naive | lib1 | AAAAAAAAAATATAGA | 1 | TAC47CCA | Y47P | 1 | 1 |
Note how this table differs from the nucleotide variant table we generated above and used to initialize the CodonVariantTable in that it gives codon substitutions and associated amino-acid substitutions.
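To make the mapping concrete, here is a small sketch of how a string of nucleotide substitutions is grouped into the codon and amino-acid substitutions shown above (a hypothetical helper for illustration; the actual conversion happens inside CodonVariantTable). It assumes biopython is available for translation, which is plausible since the amplicon is read from a GenBank file:
from Bio.Seq import Seq

def nt_subs_to_codon_subs(geneseq, sub_str):
    """Convert 1-based nt substitutions (e.g. 'C46T A47G') to codon / aa substitutions."""
    by_codon = {}
    for sub in sub_str.split():
        pos, mut_nt = int(sub[1:-1]), sub[-1]
        assert geneseq[pos - 1] == sub[0], f"wildtype mismatch at {pos}"
        by_codon.setdefault((pos - 1) // 3, []).append((pos, mut_nt))  # 0-based codon index
    codon_subs, aa_subs = [], []
    for icodon, muts in sorted(by_codon.items()):
        wt_codon = geneseq[3 * icodon: 3 * icodon + 3]
        mut_codon = list(wt_codon)
        for pos, mut_nt in muts:
            mut_codon[(pos - 1) % 3] = mut_nt  # apply each nt change within the codon
        mut_codon = ''.join(mut_codon)
        codon_subs.append(f"{wt_codon}{icodon + 1}{mut_codon}")
        wt_aa, mut_aa = str(Seq(wt_codon).translate()), str(Seq(mut_codon).translate())
        if wt_aa != mut_aa:
            aa_subs.append(f"{wt_aa}{icodon + 1}{mut_aa}")
    return ' '.join(codon_subs), ' '.join(aa_subs)

# hypothetical gene with codon 16 = CAG, matching the example in the tables above
toy_geneseq = 'A' * 45 + 'CAG' + 'A' * 12
print(nt_subs_to_codon_subs(toy_geneseq, 'C46T A47G'))  # ('CAG16TGG', 'Q16W')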
Write it to CSV file:
print(f"Writing codon-variant table to {config['codon_variant_table_file']}")
variants.barcode_variant_df.to_csv(config['codon_variant_table_file'], index=False)
Writing codon-variant table to results/variants/codon_variant_table.csv