
This Python Jupyter notebook processes the PacBio circular consensus sequences (CCSs) to extract barcodes and call mutations in the gene. It then builds consensus sequences for barcoded variants from the mutations, yielding a barcode-variant lookup table.

Process CCSs

First, process the PacBio CCSs to extract barcodes and call mutations in the gene.

Setup

Import Python modules

Plotting is done with plotnine, which uses ggplot2-like syntax.

The analysis uses the Bloom lab's alignparse and dms_variants packages.

import collections
import math
import os
import re
import time
import warnings

import alignparse
import alignparse.ccs
from alignparse.constants import CBPALETTE
import alignparse.minimap2
import alignparse.targets
import alignparse.consensus

import dms_variants
import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.utils

from IPython.display import display, HTML

import numpy

import pandas as pd

from plotnine import *

import yaml

Set plotnine theme to the one defined in dms_variants:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Versions of key software:

print(f"Using alignparse version {alignparse.__version__}")
print(f"Using dms_variants version {dms_variants.__version__}")
Using alignparse version 0.2.4
Using dms_variants version 0.8.9

Ignore warnings that clutter output:

warnings.simplefilter('ignore')

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Make output directory for figures:

os.makedirs(config['process_ccs_dir'], exist_ok=True)

PacBio reads

Read the data frame with information on the PacBio runs:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['subreads'])
    .assign(name=lambda x: x['library'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )

display(HTML(pacbio_runs.to_html(index=False)))
| library | run | name | fastq |
|---------|-----|------|-------|
| lib1 | 210619 | lib1_210619 | results/ccs/lib1_210619_ccs.fastq.gz |
| lib2 | 210619 | lib2_210619 | results/ccs/lib2_210619_ccs.fastq.gz |

PacBio amplicons

Get the amplicons sequenced by PacBio as the alignment target along with the specs on how to parse the features:

print(f"Reading amplicons from {config['amplicon']}")
print(f"Reading feature parse specs from {config['feature_parse_specs']}")

targets = alignparse.targets.Targets(
                seqsfile=config['amplicon'],
                feature_parse_specs=config['feature_parse_specs'])
Reading amplicons from data/PacBio_amplicon.gb
Reading feature parse specs from data/feature_parse_specs.yaml

Draw the target amplicons:

fig = targets.plot(ax_width=7,
                   plots_indexing='biopython',  # numbering starts at 0
                   ax_height=2,  # height of each plot
                   hspace=1.2,  # vertical space between plots
                   )

plotfile = os.path.join(config['process_ccs_dir'], 'amplicons.pdf')
print(f"Saving plot to {plotfile}")
fig.savefig(plotfile, bbox_inches='tight')
Saving plot to results/process_ccs/amplicons.pdf


Write out the specs used to parse the features (these are the same specs provided as feature_parse_specs when initializing targets, but with defaults filled in):

print(targets.feature_parse_specs('yaml'))
CGG_naive:
  query_clip5: 4
  query_clip3: 4
  termini5:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip3: 0
    return: []
  gene:
    filter:
      mutation_nt_count: 90
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - mutations
    - accuracy
  spacer:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return: []
  barcode:
    filter:
      mutation_nt_count: 0
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - sequence
    - accuracy
  termini3:
    filter:
      clip3: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
    return: []

CCS stats for PacBio runs

Read data frame with information on PacBio runs:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['subreads'])
    .assign(name=lambda x: x['library'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )

# we only have report files on the Hutch server, not for SRA download
if config['seqdata_source'] == 'HutchServer':
    pacbio_runs = (
        pacbio_runs
        .assign(report=lambda x: config['ccs_dir'] + '/' + x['name'] + '_report.txt')
        )
    report_col = 'report'
elif config['seqdata_source'] == 'SRA':
    report_col = None
else:
    raise ValueError(f"invalid `seqdata_source` {config['seqdata_source']}")

display(HTML(pacbio_runs.to_html(index=False)))
| library | run | name | fastq | report |
|---------|-----|------|-------|--------|
| lib1 | 210619 | lib1_210619 | results/ccs/lib1_210619_ccs.fastq.gz | results/ccs/lib1_210619_report.txt |
| lib2 | 210619 | lib2_210619 | results/ccs/lib2_210619_ccs.fastq.gz | results/ccs/lib2_210619_report.txt |

Create an object that summarizes the ccs runs:

ccs_summaries = alignparse.ccs.Summaries(pacbio_runs,
                                         report_col=report_col,
                                         ncpus=config['max_cpus'],
                                         )

If available, plot statistics on the number of ZMWs for each run:

if ccs_summaries.has_zmw_stats():
    p = ccs_summaries.plot_zmw_stats()
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
else:
    print('No ZMW stats available.')


Plot statistics on generated CCSs: their length, number of subread passes, and accuracy (as reported by the ccs program):

for variable in ['length', 'passes', 'accuracy']:
    if ccs_summaries.has_stat(variable):
        p = ccs_summaries.plot_ccs_stats(variable, maxcol=7, bins=25)
        p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
        _ = p.draw()
    else:
        print(f"No {variable} statistics available.")


Align CCSs to amplicons

We now align the CCSs to the amplicon and parse features from the resulting alignments using the specs above.

First, we initialize an alignparse.minimap2.Mapper to align the reads to SAM files:

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)

print(f"Using `minimap2` {mapper.version} with these options:\n" +
      ' '.join(mapper.options))
Using `minimap2` 2.18-r1015 with these options:
-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs

Next, we use Targets.align_and_parse to create the alignments and parse them:

readstats, aligned, filtered = targets.align_and_parse(
        df=pacbio_runs,
        mapper=mapper,
        outdir=config['process_ccs_dir'],
        name_col='run',
        group_cols=['name', 'library'],
        queryfile_col='fastq',
        overwrite=True,
        ncpus=config['max_cpus'],
        )

First, examine the read stats from the alignment / parsing, extracting the alignment target name from each category and flagging which reads aligned validly:

readstats = (
    readstats
    .assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')
    )

Now plot the read stats by run (combining all targets and libraries within a run):

ncol = 2
p = (
    ggplot(readstats
           .groupby(['name', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(),
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.85 * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


And the read stats by library (combining all targets and runs within a library):

p = (
    ggplot(readstats
           .groupby(['library', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ library', nrow=1) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.5 * pacbio_runs['library'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


And the number of reads by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('target', 'count')) +
    geom_point(stat='identity', size=3) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.3 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_y_log10(name='number of reads')
    )
_ = p.draw()


And read stats by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index()
           .assign(total=lambda x: x.groupby('target')['count'].transform('sum'),
                   frac=lambda x: x['count'] / x['total'],
                   ), 
           aes('target', 'frac', fill='valid')) +
    geom_bar(stat='identity') +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.5 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


Now let's see why we filtered the reads. First, we do some transformations on the filtered dict returned by Targets.align_and_parse. Then we count up the number of CCSs filtered for each reason, and group together "unusual" reasons that represent less than some fraction of all filtering. For now, we group together all targets so that the stats represent all targets combined:

other_cutoff = 0.02  # group as "other" reasons with <= this frac

filtered_df = (
    pd.concat(df.assign(target=target) for target, df in filtered.items())
    .groupby(['library', 'name', 'run', 'filter_reason'])
    .size()
    .rename('count')
    .reset_index()
    .assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other')
            )
    )

Now plot the filtering reason for all runs:

ncol = 7
nreasons = filtered_df['filter_reason'].nunique()

p = (
    ggplot(filtered_df, aes('filter_reason', 'count')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.25 * nreasons * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          )
    )
_ = p.draw()


Finally, we take the successfully parsed alignments and read them into a data frame, keeping track of the target that each CCS aligns to. We also drop the pieces of information we won't use going forward, and rename a few columns:

aligned_df = (
    pd.concat(df.assign(target=target) for target, df in aligned.items())
    .drop(columns=['query_clip5', 'query_clip3', 'run','name'])
    .rename(columns={'barcode_sequence': 'barcode'})
    )

print(f"First few lines of information on the parsed alignments:")
display(HTML(aligned_df.head().to_html(index=False)))
First few lines of information on the parsed alignments:
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
|---------|------------|----------------|---------------|---------|------------------|--------|
| lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/43/ccs |  | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/83/ccs |  | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |

Write valid CCSs

Write the processed CCSs to a file:

aligned_df.to_csv(config['processed_ccs_file'], index=False)

print("Barcodes and mutations for valid processed CCSs "
      f"have been written to {config['processed_ccs_file']}.")
Barcodes and mutations for valid processed CCSs have been written to results/process_ccs/processed_ccs.csv.

Next, we analyze these processed CCSs to build the variants.

Build barcode variant table

This section builds consensus sequences for barcoded variants from the mutations called in the processed PacBio CCSs, then uses these consensus sequences to build a codon variant table.

Make output directories if needed:

os.makedirs(config['variants_dir'], exist_ok=True)
os.makedirs(config['figs_dir'], exist_ok=True)

Read the CSV file with the processed CCSs into a data frame, display first few lines:

processed_ccs = pd.read_csv(config['processed_ccs_file'], na_filter=None)

nlibs = processed_ccs['library'].nunique()  # number of unique libraries

ntargets = processed_ccs['target'].nunique()  # number of unique targets

print(f"Read {len(processed_ccs)} CCSs from {nlibs} libraries and {ntargets} targets.")

display(HTML(processed_ccs.head().to_html(index=False)))
Read 1443261 CCSs from 2 libraries and 1 targets.
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
|---------|------------|----------------|---------------|---------|------------------|--------|
| lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/43/ccs |  | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/83/ccs |  | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |

Overall statistics on number of total CCSs and number of unique barcodes:

display(HTML(
    processed_ccs
    .groupby(['target', 'library'])
    .aggregate(total_CCSs=('barcode', 'size'),
               unique_barcodes=('barcode', 'nunique'))
    .assign(avg_CCSs_per_barcode=lambda x: x['total_CCSs'] / x['unique_barcodes'])
    .round(2)
    .to_html()
    ))
| target | library | total_CCSs | unique_barcodes | avg_CCSs_per_barcode |
|--------|---------|------------|-----------------|----------------------|
| CGG_naive | lib1 | 802675 | 108569 | 7.39 |
| CGG_naive | lib2 | 640586 | 118042 | 5.43 |

Filter processed CCSs

We have the PacBio ccs program's estimated "accuracy" for both the barcode and the gene sequence for each processed CCS. We will filter the CCSs to only keep ones of sufficiently high accuracy.

First, we want to plot the accuracies. It is actually visually easier to look at the error rate, which is one minus the accuracy. Because we want to plot on a log scale (which can't show error rates of zero), we define an error_rate_floor and set all error rates less than this to that value:

error_rate_floor = 1e-7  # error rates < this set to this
if error_rate_floor >= config['max_error_rate']:
    raise ValueError('error_rate_floor must be < max_error_rate')

processed_ccs = (
    processed_ccs
    .assign(barcode_error=lambda x: numpy.clip(1 - x['barcode_accuracy'],
                                               error_rate_floor, None),
            gene_error=lambda x: numpy.clip(1 - x['gene_accuracy'],
                                            error_rate_floor, None)
            )
    )

Now plot the error rates, drawing a dashed vertical line at the threshold separating the CCSs we retain for consensus building versus those that we discard:

_ = (
 ggplot(processed_ccs
        .melt(value_vars=['barcode_error', 'gene_error'],
              var_name='feature_type', value_name='error rate'),
        aes('error rate')) +
 geom_histogram(bins=25) +
 geom_vline(xintercept=config['max_error_rate'],
            linetype='dashed',
            color=CBPALETTE[1]) +
 facet_wrap('~ feature_type') +
 theme(figure_size=(4.5, 2)) +
 ylab('number of CCSs') +
 scale_x_log10()
 ).draw()


Flag the CCSs to retain, and indicate how many we are retaining and purging due to the accuracy filter:

processed_ccs = (
    processed_ccs
    .assign(retained=lambda x: ((x['gene_error'] < config['max_error_rate']) &
                                (x['barcode_error'] < config['max_error_rate'])))
    )

Here are the numbers of retained CCSs:

_ = (
 ggplot(processed_ccs.assign(xlabel=lambda x: x['target'] + ', ' + x['library'])
                     .groupby(['xlabel', 'retained'])
                     .size()
                     .rename('count')
                     .reset_index(),
        aes('xlabel', 'count', color='retained', label='count')) +
 geom_point(size=3) +
 geom_text(va='bottom', size=7, ha='center',format_string='{:.3g}', nudge_y=0.2) +
 theme(figure_size=(0.5 * nlibs * ntargets, 3),
       panel_grid_major_x=element_blank(),
       axis_text_x=element_text(angle=90),
       ) +
 scale_y_log10(name='number of CCSs') +
 xlab('') +
 scale_color_manual(values=CBPALETTE[1:])
 ).draw()


Sequences per barcode

How many times is each barcode sequenced? This is useful to know for thinking about building the barcode consensus.

Plot the distribution of the number of times each barcode is observed among the retained CCSs:

max_count = 8 # in plot, group all barcodes with >= this many counts

p = (
 ggplot(
    processed_ccs
     .query('retained')
     .groupby(['library', 'barcode'])
     .size()
     .rename('nseqs')
     .reset_index()
     .assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_count)),
    aes('nseqs')) +
 geom_bar() +
 facet_wrap('~ library', nrow=1) +
 theme(figure_size=(1.75 * nlibs, 2),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of barcodes') +
 xlab('CCSs for barcode')
 )

_ = p.draw()


Empirical accuracy of CCSs

We want to directly estimate the accuracy of the gene-barcode link rather than relying on the PacBio ccs accuracy, which doesn't include inaccuracies due to things like strand exchange or the same barcode on different sequences.

One way to do this is to examine instances when we have multiple sequences for the same barcode. We can calculate the empirical accuracy of the sequences by looking at all instances of multiple sequences of the same barcode and determining how often they are identical. This calculation is performed by alignparse.consensus.empirical_accuracy using the equations described in the docs for that function.
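The intuition can be sketched for the simplest case of pairwise comparisons (a toy simplification, not the actual alignparse implementation, which handles larger groups and uses the full equations in its docs): if each CCS is independently correct with probability a, and two erroneous CCSs essentially never match, then a pair of CCSs with the same barcode is identical with probability roughly a**2, so a is roughly the square root of the fraction of identical pairs:

import itertools
import math

def sketch_empirical_accuracy(mutation_strings):
    # toy estimate of per-CCS accuracy from CCSs sharing one barcode
    pairs = list(itertools.combinations(mutation_strings, 2))
    frac_identical = sum(m1 == m2 for m1, m2 in pairs) / len(pairs)
    return math.sqrt(frac_identical)

# three CCSs for one barcode, two of which agree: 1 of 3 pairs identical
print(sketch_empirical_accuracy(['A5G', 'A5G', 'A5G T20C']))  # ~0.58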

We will do this for four sets of sequences:

  1. All of the CCSs retained above.
  2. CCSs retained by applying a PacBio ccs accuracy filter 10-fold more stringent than the one above. The rationale is that if this improves the concordance (real accuracy) of the CCSs substantially then maybe we should make the accuracy filter more stringent.
  3. Like (1) but excluding all CCSs with indels. The rationale is that we only really care about substitutions, and will exclude sequences with indels anyway.
  4. Like (2) but excluding all CCSs with indels.

First, we annotate the sequences with the number of indels and whether they have an indel to enable categorization into the aforementioned sets:

processed_ccs = processed_ccs.reset_index(drop=True)

processed_ccs = alignparse.consensus.add_mut_info_cols(processed_ccs,
                                                       mutation_col='gene_mutations',
                                                       n_indel_col='n_indels')

processed_ccs = processed_ccs.assign(has_indel=lambda x: x['n_indels'] > 0)

Plot how many sequences have indels:

_ = (
 ggplot(processed_ccs,
        aes('retained', fill='has_indel')) +
 geom_bar(position='dodge') +
 geom_text(aes(label='..count..'), stat='count', va='bottom', size=7,
           position=position_dodge(width=0.9), format_string='{:.2g}') +
 theme(figure_size=(2.5 * nlibs, 3),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of CCSs') +
 scale_fill_manual(values=CBPALETTE[1:]) +
 facet_wrap('~ library', nrow=1)
 ).draw()


Now get the empirical accuracy for each of the CCS groups mentioned above:

high_acc = config['max_error_rate'] / 10
empirical_acc = []

for desc, query_str in [
        ('retained', 'retained'),
        ('retained, no indel', 'retained and not has_indel'),
        ('10X accuracy',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc})"),
        ('10X accuracy, no indel',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc}) and not has_indel")
        ]:
    # get just CCSs in that category
    df = processed_ccs.query(query_str)
    
    # compute empirical accuracy
    empirical_acc.append(
        alignparse.consensus.empirical_accuracy(df,
                                                mutation_col='gene_mutations')
        .assign(description=desc)
        .merge(df
               .groupby('library')
               .size()
               .rename('number_CCSs')
               .reset_index()
               )
        )

# make description categorical to preserve order, and annotate as "actual"
# the category ("retained, no indel") that we will use for building variants.
empirical_acc = (
    pd.concat(empirical_acc, ignore_index=True, sort=False)
    .assign(description=lambda x: pd.Categorical(x['description'],
                                                 x['description'].unique(),
                                                 ordered=True),
            actual=lambda x: numpy.where(x['description'] == 'retained, no indel',
                                         True, False),
            )
    )

Display table of the empirical accuracies:

display(HTML(empirical_acc.to_html(index=False)))
| library | accuracy | description | number_CCSs | actual |
|---------|----------|-------------|-------------|--------|
| lib1 | 0.956602 | retained | 724908 | False |
| lib2 | 0.949559 | retained | 577257 | False |
| lib1 | 0.987664 | retained, no indel | 696647 | True |
| lib2 | 0.982219 | retained, no indel | 553582 | True |
| lib1 | 0.967650 | 10X accuracy | 681243 | False |
| lib2 | 0.961755 | 10X accuracy | 541970 | False |
| lib1 | 0.989080 | 10X accuracy, no indel | 661332 | False |
| lib2 | 0.984277 | 10X accuracy, no indel | 525460 | False |

Plot the empirical accuracies, using a different color to show the category that we will actually use:

p = (
    ggplot(empirical_acc,
           aes('description', 'accuracy', color='actual', label='accuracy')
           ) +
    geom_point(size=3) +
    geom_text(va='bottom', size=9, format_string='{:.3g}', nudge_y=0.003) +
    facet_wrap('~ library', ncol=8) +
    theme(figure_size=(1.75 * nlibs, 2.25),
          axis_text_x=element_text(angle=90),
          panel_grid_major_x=element_blank(),
          ) +
    xlab('') +
    scale_y_continuous(name='empirical accuracy', limits=(0.95, 1.005)) +
    scale_color_manual(values=CBPALETTE, guide=False)
    )

plotfile = os.path.join(config['figs_dir'], 'empirical_CCS_accuracy.pdf')
print(f"Saving plot to {plotfile}")
_ = p.draw()
Saving plot to results/figures/empirical_CCS_accuracy.pdf


The above analysis shows that if we exclude sequences with indels (which we plan to do among our consensus sequences), then the accuracy of each CCS is around 99%. We do not get notably higher empirical accuracy by imposing a more stringent filter from the PacBio ccs program, indicating that the major sources of error are due to processes that are not modeled in this program's accuracy filter (perhaps strand exchange or barcode sharing).

Note that this empirical accuracy is for a single CCS. When we build the consensus sequences for each barcode below, we will take the consensus of CCSs within a barcode. So for barcodes with multiple CCSs, the actual accuracy of the consensus sequences will be higher than the empirical accuracy above due to capturing information from multiple CCSs.
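As a rough back-of-envelope (a simplified sketch assuming independent CCS errors, not a calculation from the notebook): with a per-CCS accuracy of about 0.99, a majority consensus of three CCSs is wrong only if at least two of them err in the same way, so the consensus error rate should be far below the per-CCS error rate:

# per-CCS error rate ~1%; a majority consensus over 3 CCSs errs only
# when >= 2 CCSs share the same error, which for independent errors is
# bounded above by (number of pairs) * (per-CCS error rate)**2
per_ccs_accuracy = 0.99
n_pairs = 3  # number of pairs among 3 CCSs
print(f"~{n_pairs * (1 - per_ccs_accuracy) ** 2:.0e}")  # ~3e-04 vs ~1e-02 per CCS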

Consensus sequences for barcodes

We call the consensus sequence for each barcode using the simple method implemented in alignparse.consensus.simple_mutconsensus. The documentation for that function explains the method in detail, but basically it works like this:

  1. When there is just one CCS per barcode, the consensus is just that sequence.
  2. When there are multiple CCSs per barcode, they are used to build a consensus. However, the entire barcode is discarded if there are many differences between the CCSs for the barcode, or high-frequency non-consensus mutations, since such discordance suggests errors like barcode collisions or strand exchange (see the toy sketch after this list).
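To make that concrete, here is a toy sketch on made-up data (the barcodes and mutation strings are invented for illustration; with the function's default thresholds, the discordant barcode should end up in the dropped data frame with a reason like 'subs diff too large'):

import pandas as pd

import alignparse.consensus

toy = pd.DataFrame({
    'library': ['lib1'] * 5,
    'barcode': ['AAACCC'] * 3 + ['GGGTTT'] * 2,
    'gene_mutations': ['A5G', 'A5G', 'A5G',       # concordant barcode
                       'A5G', 'C20T G50A T60C'],  # discordant barcode
    })

toy_consensus, toy_dropped = alignparse.consensus.simple_mutconsensus(
        toy,
        group_cols=('library', 'barcode'),
        mutation_col='gene_mutations',
        )
# expect: barcode AAACCC yields consensus 'A5G' supported by 3 CCSs,
# while barcode GGGTTT is discarded into `toy_dropped`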

First, call the consensus for each barcode including all retained sequences, even those with indels:

consensus, dropped = alignparse.consensus.simple_mutconsensus(
                        processed_ccs.query('retained'),
                        group_cols=('library', 'barcode', 'target'),
                        mutation_col='gene_mutations',
                        )

Here are the first few lines of the data frame of consensus sequences for each barcode. In addition to giving the library, barcode, target, and mutations, it also has a column indicating how many CCSs support the variant call:

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support |
|---------|---------|--------|----------------|----------------------|
| lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 |
| lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 |
| lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 |
| lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 |
| lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 |

Since we retain variants with substitutions but ignore those with indels, add information about substitution mutations and number of indels:

consensus = alignparse.consensus.add_mut_info_cols(
                    consensus,
                    mutation_col='gene_mutations',
                    sub_str_col='substitutions',
                    n_indel_col='number_of_indels',
                    overwrite_cols=True)

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support | substitutions | number_of_indels |
|---------|---------|--------|----------------|----------------------|---------------|------------------|
| lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 | C357T T598A T599C A600T | 0 |
| lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 | C46T A47G | 0 |
| lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 | G466C T467A G468T | 0 |
| lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 | T329G G330T | 0 |
| lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 | T139C A140C C141A | 0 |

Plot distribution of number of CCSs supporting each variant call (consensus), indicating whether or not there is an indel:

max_variant_call_support = 6  # group variants with >= this much support

_ = (
 ggplot(consensus
        .assign(variant_call_support=lambda x: numpy.clip(x['variant_call_support'],
                                                          None,
                                                          max_variant_call_support),
                indel_state=lambda x: numpy.where(x['number_of_indels'] > 0,
                                                  'has indel', 'no indel')
                ),
        aes('variant_call_support')) +
 geom_bar() +
 ylab('number of variants') +
 facet_grid('indel_state ~ library') +
 theme(figure_size=(1.75 * nlibs, 3.5),
       panel_grid_major_x=element_blank(),
       ) 
 ).draw()


We see that most variant consensus sequences do not have indels, especially if we limit to the more "accurate" ones that have multiple CCSs supporting them.

We will ignore all consensus sequences with indels in the variant-barcode lookup table. We do this for two reasons:

  1. When there is just one CCS supporting a consensus, it is less likely to be accurate as indels are the main mode of PacBio error.
  2. For the purposes of our studies, we are interested in point mutations rather than indels anyway.

Here are the numbers of valid consensus sequences (no indels) for each library and target:

consensus = consensus.query('number_of_indels == 0')

lib_target_counts = (
    consensus
    .groupby(['library', 'target'])
    .size()
    .rename('consensus sequences')
    .reset_index()
    )

display(HTML(lib_target_counts.to_html(index=False)))

p = (ggplot(lib_target_counts.assign(xlabel=lambda x: x['target'] + ', ' + x['library']),
            aes('xlabel', 'consensus sequences')) +
     geom_point(size=3) +
     theme(figure_size=(0.5 * nlibs * ntargets, 1.75),
           axis_text_x=element_text(angle=90)) +
     xlab('') +
     scale_y_log10()
     )

_ = p.draw()
| library | target | consensus sequences |
|---------|--------|---------------------|
| lib1 | CGG_naive | 93059 |
| lib2 | CGG_naive | 99370 |


Below we write the retained consensus sequences to a CSV file that links the nucleotide mutations to the barcodes. (The next section analyzes this variant table in detail, and provides more precise information on the number of variants and relevant statistics):

print(f"Writing nucleotide variants to {config['nt_variant_table_file']}")
      
(consensus
 [['target', 'library', 'barcode', 'substitutions', 'variant_call_support']]
 .to_csv(config['nt_variant_table_file'], index=False)
 )
      
print('Here are the first few lines of this file:')
display(HTML(
    pd.read_csv(config['nt_variant_table_file'], na_filter=None)
    .head()
    .to_html(index=False)
    ))
Writing nucleotide variants to results/variants/nucleotide_variant_table.csv
Here are the first few lines of this file:
| target | library | barcode | substitutions | variant_call_support |
|--------|---------|---------|---------------|----------------------|
| CGG_naive | lib1 | AAAAAAAAAACACCGG | C357T T598A T599C A600T | 6 |
| CGG_naive | lib1 | AAAAAAAAAACATGAG | C46T A47G | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGCGACG | G466C T467A G468T | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGGAAAG | T329G G330T | 6 |
| CGG_naive | lib1 | AAAAAAAAAATATAGA | T139C A140C C141A | 1 |

What happened to the barcodes that we "dropped" because we could not construct a reliable consensus? The dropped data frame from alignparse.consensus.simple_mutconsensus has this information:

display(HTML(dropped.head().to_html(index=False)))
| library | barcode | target | drop_reason | nseqs |
|---------|---------|--------|-------------|-------|
| lib1 | AAAAAAAAAAATAATA | CGG_naive | subs diff too large | 10 |
| lib1 | AAAAAAAAAATCCGAG | CGG_naive | subs diff too large | 18 |
| lib1 | AAAAAAAACAACGTAA | CGG_naive | minor subs too frequent | 6 |
| lib1 | AAAAAAAATCTCAAAC | CGG_naive | subs diff too large | 73 |
| lib1 | AAAAAAACACAACTCA | CGG_naive | subs diff too large | 7 |

Summarize the information in this data frame on dropped barcodes with the plot below. This plot shows several things. First, we see that the total number of barcodes dropped is modest (just a few thousand per library) relative to the total number of barcodes per library (seen above to be roughly a hundred thousand). Second, the main reason that barcodes are dropped is that there are CCSs within the same barcode with suspiciously large numbers of mutations relative to the consensus; we use this as a filter to discard the entire barcode, since it could indicate strand exchange or some other issue. In any case, the modest number of dropped barcodes indicates that there probably isn't much of a need to worry:

max_nseqs = 8  # plot together all barcodes with >= this many sequences

_ = (
 ggplot(
    dropped.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_nseqs)),
    aes('nseqs')) + 
 geom_bar() + 
 scale_x_continuous(limits=(1, None)) +
 xlab('number of sequences for barcode') +
 ylab('number of barcodes') +
 facet_grid('library ~ drop_reason') +
 theme(figure_size=(10, 1.5 * nlibs),
       panel_grid_major_x=element_blank(),
       )
 ).draw()


Create barcode-variant table

We now create a CodonVariantTable that stores and processes all the information about the variant consensus sequences. Below we initialize such a table, and then analyze information about its composition.

Initialize codon variant table

In order to initialize the codon variant table, we need two pieces of information:

  1. The wildtype gene sequence.
  2. The list of nucleotide mutations for each variant as determined in the consensus calling above.

Read "wildtype" gene sequence to which we made the alignments (in order to do this, initialize an alignparse.Targets and get the gene sequence from it):

targets = alignparse.targets.Targets(seqsfile=config['amplicon'],
                                     feature_parse_specs=config['feature_parse_specs'])
geneseq = targets.get_target(config['primary_target']).get_feature('gene').seq

print(f"Read gene of {len(geneseq)} nts for {config['primary_target']} from {config['amplicon']}")
Read gene of 705 nts for CGG_naive from data/PacBio_amplicon.gb

Now initialize the codon variant table using this wildtype sequence and our list of nucleotide mutations for each variant:

variants = dms_variants.codonvarianttable.CodonVariantTable(
                barcode_variant_file=config['nt_variant_table_file'],
                geneseq=geneseq,
                primary_target=config['primary_target'],
                )

Basic stats on variants

We now will analyze the variants. In this call and in the plots below, we set samples=None as we aren't looking at variant counts in specific samples, but are simply looking at properties of the variants in the table.

Here are the number of variants for each target:

display(HTML(
    variants
    .n_variants_df(samples=None)
    .pivot_table(index=['target'],
                 columns='library',
                 values='count')
    .to_html()
    ))
| target | lib1 | lib2 | all libraries |
|--------|------|------|---------------|
| CGG_naive | 93059 | 99370 | 192429 |

Plot the number of variants supported by each number of CCSs:

max_support = 10  # group variants with >= this much support

p = variants.plotVariantSupportHistogram(max_support=max_support,
                                         widthscale=1.1,
                                         heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()


Mutations per variant

Plot the number of barcoded variants with each number of amino-acid and codon mutations. This is for the primary target only, and doesn't include the spiked-in secondary targets:

max_muts = 7  # group all variants with >= this many mutations

for mut_type in ['aa', 'codon']:
    p = variants.plotNumMutsHistogram(mut_type, samples=None, max_muts=max_muts,
                                      widthscale=1.1,
                                      heightscale=0.9)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'], f"n_{mut_type}_muts_per_variant.pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/n_aa_muts_per_variant.pdf
Saving plot to results/figures/n_codon_muts_per_variant.pdf


Plot the frequencies of different codon mutation types among all variants (any number of mutations), again only for primary target:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant',
                                    heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], f"avg_muts_per_variant.pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/avg_muts_per_variant.pdf


Variants supported by multiple PacBio CCSs should have fewer spurious mutations, since sequencing errors are very unlikely to occur at the same position in two independent CCSs. Below we plot the number of codon mutations per variant among variants with at least two CCSs supporting their call. The difference between the mutation rates here and in the plot above (which does not apply the min_support=2 filter) gives some estimate of the frequency of spurious mutations in our variants. In fact, we see the numbers are very similar, indicating that few of the mutations are spurious:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant', 
                                    min_support=2, heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()


Completeness of mutation sampling

We examine how completely amino-acid mutations are sampled by the variants for the primary target, looking at single-mutant variants only and all variants. The plot below shows that virtually every mutation is found in a variant in each library, even if we just look among the single mutants. Things look especially good if we aggregate across libraries:

for variant_type in ['all', 'single']:
    p = variants.plotCumulMutCoverage(variant_type, mut_type='aa', samples=None)
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'],
                            f"variant_cumul_{variant_type}_mut_coverage.pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/variant_cumul_all_mut_coverage.pdf
Saving plot to results/figures/variant_cumul_single_mut_coverage.pdf


To get more quantitative information like that plotted above, we determine how many mutations are found 0, 1, or >1 times both among single and all mutants for the primary target:

count_dfs = []
for variant_type in ['all', 'single']:
    i_counts = (variants.mutCounts(variant_type, mut_type='aa', samples=None)
                .assign(variant_type=variant_type)
                )
    count_dfs += [i_counts.assign(include_stops=True),
                  i_counts
                  .query('not mutation.str.contains("\*")', engine='python')
                  .assign(include_stops=False)
                  ]
    
display(HTML(
    pd.concat(count_dfs)
    .assign(count=lambda x: (numpy.clip(x['count'], None, 2)
                             .map({0: '0', 1: '1', 2:'>1'}))
            )
    .groupby(['variant_type', 'include_stops', 'library', 'count'])
    .aggregate(number_of_mutations=pd.NamedAgg(column='mutation', aggfunc='count'))
    .to_html()
    ))
| variant_type | include_stops | library | count | number_of_mutations |
|--------------|---------------|---------|-------|---------------------|
| all | False | lib1 | 0 | 240 |
| all | False | lib1 | 1 | 25 |
| all | False | lib1 | >1 | 4200 |
| all | False | lib2 | 0 | 233 |
| all | False | lib2 | 1 | 23 |
| all | False | lib2 | >1 | 4209 |
| all | False | all libraries | 0 | 225 |
| all | False | all libraries | 1 | 8 |
| all | False | all libraries | >1 | 4232 |
| all | True | lib1 | 0 | 414 |
| all | True | lib1 | 1 | 44 |
| all | True | lib1 | >1 | 4242 |
| all | True | lib2 | 0 | 399 |
| all | True | lib2 | 1 | 39 |
| all | True | lib2 | >1 | 4262 |
| all | True | all libraries | 0 | 378 |
| all | True | all libraries | 1 | 32 |
| all | True | all libraries | >1 | 4290 |
| single | False | lib1 | 0 | 275 |
| single | False | lib1 | 1 | 35 |
| single | False | lib1 | >1 | 4155 |
| single | False | lib2 | 0 | 283 |
| single | False | lib2 | 1 | 26 |
| single | False | lib2 | >1 | 4156 |
| single | False | all libraries | 0 | 257 |
| single | False | all libraries | 1 | 22 |
| single | False | all libraries | >1 | 4186 |
| single | True | lib1 | 0 | 477 |
| single | True | lib1 | 1 | 57 |
| single | True | lib1 | >1 | 4166 |
| single | True | lib2 | 0 | 477 |
| single | True | lib2 | 1 | 57 |
| single | True | lib2 | >1 | 4166 |
| single | True | all libraries | 0 | 434 |
| single | True | all libraries | 1 | 60 |
| single | True | all libraries | >1 | 4206 |

Mutation frequencies along gene

We plot the frequencies of mutations along the gene among the variants for the primary target. Ideally, this would be uniform. We make the plot for both all variants and single-mutant / wildtype variants:

for variant_type in ['all', 'single']:
    p = variants.plotMutFreqs(variant_type, mut_type='codon', samples=None)
    p.draw()


We can also use heat maps to examine the extent to which specific amino-acid or codon mutations are over-represented. These heat maps are large, so we make them just for all variants and the merge of all libraries:

for mut_type in ['aa', 'codon']:
    p = variants.plotMutHeatmap('all', mut_type, samples=None, #libraries='all_only',
                                widthscale=2)
    p.draw()


Write codon-variant table

We write the codon variant table to a CSV file. This table looks like this:

display(HTML(
    variants.barcode_variant_df
    .head()
    .to_html(index=False)
    ))
| target | library | barcode | variant_call_support | codon_substitutions | aa_substitutions | n_codon_substitutions | n_aa_substitutions |
|--------|---------|---------|----------------------|---------------------|------------------|-----------------------|--------------------|
| CGG_naive | lib1 | AAAAAAAAAACACCGG | 6 | GGC119GGT TTA200ACT | L200T | 2 | 1 |
| CGG_naive | lib1 | AAAAAAAAAACATGAG | 1 | CAG16TGG | Q16W | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGCGACG | 1 | GTG156CAT | V156H | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGGAAAG | 6 | GTG110GGT | V110G | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAATATAGA | 1 | TAC47CCA | Y47P | 1 | 1 |

Note how this table differs from the nucleotide variant table we generated above and used to initialize the CodonVariantTable in that it gives codon substitutions and associated amino-acid substitutions.
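As a worked illustration using the first row above (a sanity check, not part of the notebook's pipeline): the nucleotide substitutions C357T T598A T599C A600T fall in codon 119 (nucleotides 355-357, where C357T changes GGC to the synonymous GGT) and codon 200 (nucleotides 598-600, where T598A T599C A600T change TTA to ACT, the amino-acid substitution L200T):

# map 1-based nucleotide positions to 1-based codon numbers
for nt_pos in [357, 598, 599, 600]:
    codon_number = (nt_pos - 1) // 3 + 1
    print(f"nucleotide {nt_pos} -> codon {codon_number}")
# nucleotide 357 -> codon 119; nucleotides 598-600 -> codon 200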

Write it to CSV file:

print(f"Writing codon-variant table to {config['codon_variant_table_file']}")

variants.barcode_variant_df.to_csv(config['codon_variant_table_file'], index=False)
Writing codon-variant table to results/variants/codon_variant_table.csv