
This Python Jupyter notebook processes the PacBio circular consensus sequences (CCSs) to extract barcodes and call mutations in the gene. It then builds consensus sequences for barcoded variants from the mutations, yielding a barcode-variant lookup table.

Process CCSs

First, process the PacBio CCSs to extract barcodes and call mutations in the gene.

Setup

Import Python modules

Plotting is done with plotnine, which uses ggplot2-like syntax.

The analysis uses the Bloom lab's alignparse and dms_variants packages.

import collections
import math
import os
import re
import time
import warnings

import alignparse
import alignparse.ccs
from alignparse.constants import CBPALETTE
import alignparse.minimap2
import alignparse.targets
import alignparse.consensus

import dms_variants
import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.utils

from IPython.display import display, HTML

import numpy

import pandas as pd

from plotnine import *

import yaml

Set plotnine theme to the one defined in dms_variants:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Versions of key software:

print(f"Using alignparse version {alignparse.__version__}")
print(f"Using dms_variants version {dms_variants.__version__}")
Using alignparse version 0.2.4
Using dms_variants version 0.8.9

Ignore warnings that clutter output:

warnings.simplefilter('ignore')

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Make output directory for figures:

os.makedirs(config['process_ccs_dir'], exist_ok=True)

PacBio reads

Read the data frame with information on the PacBio runs:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['subreads'])
    .assign(name=lambda x: x['library'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )

display(HTML(pacbio_runs.to_html(index=False)))
| library | run | name | fastq |
|---------|-----|------|-------|
| lib1 | 210619 | lib1_210619 | results/ccs/lib1_210619_ccs.fastq.gz |
| lib2 | 210619 | lib2_210619 | results/ccs/lib2_210619_ccs.fastq.gz |

PacBio amplicons

Get the amplicons sequenced by PacBio as the alignment target along with the specs on how to parse the features:

print(f"Reading amplicons from {config['amplicon']}")
print(f"Reading feature parse specs from {config['feature_parse_specs']}")

targets = alignparse.targets.Targets(
                seqsfile=config['amplicon'],
                feature_parse_specs=config['feature_parse_specs'])
Reading amplicons from data/PacBio_amplicon.gb
Reading feature parse specs from data/feature_parse_specs.yaml

Draw the target amplicons:

fig = targets.plot(ax_width=7,
                   plots_indexing='biopython',  # numbering starts at 0
                   ax_height=2,  # height of each plot
                   hspace=1.2,  # vertical space between plots
                   )

plotfile = os.path.join(config['process_ccs_dir'], 'amplicons.pdf')
print(f"Saving plot to {plotfile}")
fig.savefig(plotfile, bbox_inches='tight')
Saving plot to results/process_ccs/amplicons.pdf


Write out the specs used to parse the features (these are the same specs provided as feature_parse_specs when initializing targets, but with defaults filled in):

print(targets.feature_parse_specs('yaml'))
CGG_naive:
  query_clip5: 4
  query_clip3: 4
  termini5:
    filter:
      clip5: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip3: 0
    return: []
  gene:
    filter:
      mutation_nt_count: 90
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - mutations
    - accuracy
  spacer:
    filter:
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return: []
  barcode:
    filter:
      mutation_nt_count: 0
      mutation_op_count: null
      clip5: 0
      clip3: 0
    return:
    - sequence
    - accuracy
  termini3:
    filter:
      clip3: 4
      mutation_nt_count: 1
      mutation_op_count: null
      clip5: 0
    return: []

CCS stats for PacBio runs

Read data frame with information on PacBio runs:

pacbio_runs = (
    pd.read_csv(config['pacbio_runs'], dtype=str)
    .drop(columns=['subreads'])
    .assign(name=lambda x: x['library'] + '_' + x['run'],
            fastq=lambda x: config['ccs_dir'] + '/' + x['name'] + '_ccs.fastq.gz'
            )
    )

# we only have report files on the Hutch server, not for SRA download
if config['seqdata_source'] == 'HutchServer':
    pacbio_runs = (
        pacbio_runs
        .assign(report=lambda x: config['ccs_dir'] + '/' + x['name'] + '_report.txt')
        )
    report_col = 'report'
elif config['seqdata_source'] == 'SRA':
    report_col = None
else:
    raise ValueError(f"invalid `seqdata_source` {config['seqdata_source']}")

display(HTML(pacbio_runs.to_html(index=False)))
| library | run | name | fastq | report |
|---------|-----|------|-------|--------|
| lib1 | 210619 | lib1_210619 | results/ccs/lib1_210619_ccs.fastq.gz | results/ccs/lib1_210619_report.txt |
| lib2 | 210619 | lib2_210619 | results/ccs/lib2_210619_ccs.fastq.gz | results/ccs/lib2_210619_report.txt |

Create an object that summarizes the ccs runs:

ccs_summaries = alignparse.ccs.Summaries(pacbio_runs,
                                         report_col=report_col,
                                         ncpus=config['max_cpus'],
                                         )

If available, plot statistics on the number of ZMWs for each run:

if ccs_summaries.has_zmw_stats():
    p = ccs_summaries.plot_zmw_stats()
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
else:
    print('No ZMW stats available.')


Plot statistics on generated CCSs: their length, number of subread passes, and accuracy (as reported by the ccs program):

for variable in ['length', 'passes', 'accuracy']:
    if ccs_summaries.has_stat(variable):
        p = ccs_summaries.plot_ccs_stats(variable, maxcol=7, bins=25)
        p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
        _ = p.draw()
    else:
        print(f"No {variable} statistics available.")


Align CCSs to amplicons

We now align the CCSs to the amplicon and parse features from the resulting alignments using the specs above.

First, we initialize an alignparse.minimap2.Mapper to align the reads to SAM files:

mapper = alignparse.minimap2.Mapper(alignparse.minimap2.OPTIONS_CODON_DMS)

print(f"Using `minimap2` {mapper.version} with these options:\n" +
      ' '.join(mapper.options))
Using `minimap2` 2.18-r1015 with these options:
-A2 -B4 -O12 -E2 --end-bonus=13 --secondary=no --cs

Next, we use Targets.align_and_parse to create the alignments and parse them:

readstats, aligned, filtered = targets.align_and_parse(
        df=pacbio_runs,
        mapper=mapper,
        outdir=config['process_ccs_dir'],
        name_col='run',
        group_cols=['name', 'library'],
        queryfile_col='fastq',
        overwrite=True,
        ncpus=config['max_cpus'],
        )

First, examine the read stats from the alignment / parsing, extracting the alignment target name from each category and flagging which reads aligned validly:

readstats = (
    readstats
    .assign(category_all_targets=lambda x: x['category'].str.split().str[0],
            target=lambda x: x['category'].str.split(None, 1).str[1],
            valid=lambda x: x['category_all_targets'] == 'aligned')
    )

Now plot the read stats by run (combining all targets and libraries within a run):

ncol = 2
p = (
    ggplot(readstats
           .groupby(['name', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(),
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.85 * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


And the read stats by library (combining all targets and runs within a library):

p = (
    ggplot(readstats
           .groupby(['library', 'category_all_targets', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('category_all_targets', 'count', fill='valid')) +
    geom_bar(stat='identity') +
    facet_wrap('~ library', nrow=1) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(1.5 * pacbio_runs['library'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          legend_position='none',
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


And the number of reads by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target'])
           .aggregate({'count': 'sum'})
           .reset_index(), 
           aes('target', 'count')) +
    geom_point(stat='identity', size=3) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.3 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_y_log10(name='number of reads')
    )
_ = p.draw()


And read stats by target (combining all libraries and runs for a target):

p = (
    ggplot(readstats
           .groupby(['target', 'valid'])
           .aggregate({'count': 'sum'})
           .reset_index()
           .assign(total=lambda x: x.groupby('target')['count'].transform('sum'),
                   frac=lambda x: x['count'] / x['total'],
                   ), 
           aes('target', 'frac', fill='valid')) +
    geom_bar(stat='identity') +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.5 * readstats['target'].nunique(), 2),
          panel_grid_major_x=element_blank(),
          ) +
    scale_fill_manual(values=CBPALETTE)
    )
_ = p.draw()


Now let's see why we filtered the reads. First, we do some transformations on the filtered dict returned by Targets.align_and_parse. Then we count up the number of CCSs filtered for each reason, and group together "unusual" reasons that represent less than some fraction of all filtering. For now, we group together all targets so that the stats represent all targets combined:

other_cutoff = 0.02  # group as "other" reasons with <= this frac

filtered_df = (
    pd.concat(df.assign(target=target) for target, df in filtered.items())
    .groupby(['library', 'name', 'run', 'filter_reason'])
    .size()
    .rename('count')
    .reset_index()
    .assign(tot_reason_frac=lambda x: (x.groupby('filter_reason')['count']
                                       .transform('sum')) / x['count'].sum(),
            filter_reason=lambda x: numpy.where(x['tot_reason_frac'] > other_cutoff,
                                                x['filter_reason'],
                                                'other')
            )
    )

Now plot the filtering reason for all runs:

ncol = 7
nreasons = filtered_df['filter_reason'].nunique()

p = (
    ggplot(filtered_df, aes('filter_reason', 'count')) +
    geom_bar(stat='identity') +
    facet_wrap('~ name', ncol=ncol) +
    theme(axis_text_x=element_text(angle=90),
          figure_size=(0.25 * nreasons * min(ncol, len(pacbio_runs)),
                       2 * math.ceil(len(pacbio_runs) / ncol)),
          panel_grid_major_x=element_blank(),
          )
    )
_ = p.draw()


Finally, we take the successfully parsed alignments and read them into a data frame, keeping track of the target that each CCS aligns to. We also drop the pieces of information we won't use going forward, and rename a few columns:

aligned_df = (
    pd.concat(df.assign(target=target) for target, df in aligned.items())
    .drop(columns=['query_clip5', 'query_clip3', 'run','name'])
    .rename(columns={'barcode_sequence': 'barcode'})
    )

print(f"First few lines of information on the parsed alignments:")
display(HTML(aligned_df.head().to_html(index=False)))
First few lines of information on the parsed alignments:
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
|---------|------------|----------------|---------------|---------|------------------|--------|
| lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/43/ccs |  | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/83/ccs |  | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |

Write valid CCSs

Write the processed CCSs to a file:

aligned_df.to_csv(config['processed_ccs_file'], index=False)

print("Barcodes and mutations for valid processed CCSs "
      f"have been written to {config['processed_ccs_file']}.")
Barcodes and mutations for valid processed CCSs have been written to results/process_ccs/processed_ccs.csv.

Next, we analyze these processed CCSs to build the variants.

Build barcode variant table

This section builds consensus sequences for barcoded variants from the mutations called in the processed PacBio CCSs, then uses these consensus sequences to build a codon variant table.

Make output directories if needed:

os.makedirs(config['variants_dir'], exist_ok=True)
os.makedirs(config['figs_dir'], exist_ok=True)

Read the CSV file with the processed CCSs into a data frame, display first few lines:

processed_ccs = pd.read_csv(config['processed_ccs_file'], na_filter=None)

nlibs = processed_ccs['library'].nunique()  # number of unique libraries

ntargets = processed_ccs['target'].nunique()  # number of unique targets

print(f"Read {len(processed_ccs)} CCSs from {nlibs} libraries and {ntargets} targets.")

display(HTML(processed_ccs.head().to_html(index=False)))
Read 1443261 CCSs from 2 libraries and 1 targets.
| library | query_name | gene_mutations | gene_accuracy | barcode | barcode_accuracy | target |
|---------|------------|----------------|---------------|---------|------------------|--------|
| lib1 | m64272e_210802_074641/13/ccs | C40T C41T T42G ins166T | 0.999347 | TATCCACACGAAGCGT | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/14/ccs | G155A | 1.000000 | CAAACACACTCCTCAG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/43/ccs |  | 1.000000 | TTCGACAGAATCGATG | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/71/ccs | A539G T540G | 1.000000 | ATATGGTCACTTATTC | 1.0 | CGG_naive |
| lib1 | m64272e_210802_074641/83/ccs |  | 0.999977 | GAATTCCAAAGAGTCA | 1.0 | CGG_naive |

Overall statistics on number of total CCSs and number of unique barcodes:

display(HTML(
    processed_ccs
    .groupby(['target', 'library'])
    .aggregate(total_CCSs=('barcode', 'size'),
               unique_barcodes=('barcode', 'nunique'))
    .assign(avg_CCSs_per_barcode=lambda x: x['total_CCSs'] / x['unique_barcodes'])
    .round(2)
    .to_html()
    ))
| target | library | total_CCSs | unique_barcodes | avg_CCSs_per_barcode |
|--------|---------|------------|-----------------|----------------------|
| CGG_naive | lib1 | 802675 | 108569 | 7.39 |
| CGG_naive | lib2 | 640586 | 118042 | 5.43 |

Filter processed CCSs

We have the PacBio ccs program's estimated "accuracy" for both the barcode and the gene sequence for each processed CCS. We will filter the CCSs to only keep ones of sufficiently high accuracy.

First, we want to plot the accuracies. It is actually visually easier to look at the error rate, which is one minus the accuracy. Because we want to plot on a log scale (which can't show error rates of zero), we define an error_rate_floor and set all error rates less than this to that value:

error_rate_floor = 1e-7  # error rates < this set to this
if error_rate_floor >= config['max_error_rate']:
    raise ValueError('error_rate_floor must be < max_error_rate')

processed_ccs = (
    processed_ccs
    .assign(barcode_error=lambda x: numpy.clip(1 - x['barcode_accuracy'],
                                               error_rate_floor, None),
            gene_error=lambda x: numpy.clip(1 - x['gene_accuracy'],
                                            error_rate_floor, None)
            )
    )

Now plot the error rates, drawing a dashed vertical line at the threshold separating the CCSs we retain for consensus building versus those that we discard:

_ = (
 ggplot(processed_ccs
        .melt(value_vars=['barcode_error', 'gene_error'],
              var_name='feature_type', value_name='error rate'),
        aes('error rate')) +
 geom_histogram(bins=25) +
 geom_vline(xintercept=config['max_error_rate'],
            linetype='dashed',
            color=CBPALETTE[1]) +
 facet_wrap('~ feature_type') +
 theme(figure_size=(4.5, 2)) +
 ylab('number of CCSs') +
 scale_x_log10()
 ).draw()


Flag the CCSs to retain, and indicate how many we are retaining and purging due to the accuracy filter:

processed_ccs = (
    processed_ccs
    .assign(retained=lambda x: ((x['gene_error'] < config['max_error_rate']) &
                                (x['barcode_error'] < config['max_error_rate'])))
    )

Here are the numbers of retained CCSs:

_ = (
 ggplot(processed_ccs.assign(xlabel=lambda x: x['target'] + ', ' + x['library'])
                     .groupby(['xlabel', 'retained'])
                     .size()
                     .rename('count')
                     .reset_index(),
        aes('xlabel', 'count', color='retained', label='count')) +
 geom_point(size=3) +
 geom_text(va='bottom', size=7, ha='center',format_string='{:.3g}', nudge_y=0.2) +
 theme(figure_size=(0.5 * nlibs * ntargets, 3),
       panel_grid_major_x=element_blank(),
       axis_text_x=element_text(angle=90),
       ) +
 scale_y_log10(name='number of CCSs') +
 xlab('') +
 scale_color_manual(values=CBPALETTE[1:])
 ).draw()


Sequences per barcode

How many times is each barcode sequenced? This is useful to know for thinking about building the barcode consensus.

Plot the distribution of the number of times each barcode is observed among the retained CCSs:

max_count = 8 # in plot, group all barcodes with >= this many counts

p = (
 ggplot(
    processed_ccs
     .query('retained')
     .groupby(['library', 'barcode'])
     .size()
     .rename('nseqs')
     .reset_index()
     .assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_count)),
    aes('nseqs')) +
 geom_bar() +
 facet_wrap('~ library', nrow=1) +
 theme(figure_size=(1.75 * nlibs, 2),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of barcodes') +
 xlab('CCSs for barcode')
 )

_ = p.draw()


Empirical accuracy of CCSs

We want to directly estimate the accuracy of the gene-barcode link rather than relying on the PacBio ccs accuracy, which doesn't include inaccuracies due to things like strand exchange or the same barcode on different sequences.

One way to do this is to examine instances when we have multiple sequences for the same barcode. We can calculate the empirical accuracy of the sequences by looking at all instances of multiple sequences of the same barcode and determining how often they are identical. This calculation is performed by alignparse.consensus.empirical_accuracy using the equations described in the docs for that function.
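The intuition can be sketched for the simplest case of pairwise comparisons (a toy simplification, not the actual alignparse implementation, which handles larger groups and uses the full equations in its docs): if each CCS is independently correct with probability a, and two erroneous CCSs essentially never match, then a pair of CCSs with the same barcode is identical with probability roughly a**2, so a is roughly the square root of the fraction of identical pairs:

import itertools
import math

def sketch_empirical_accuracy(mutation_strings):
    # toy estimate of per-CCS accuracy from CCSs sharing one barcode
    pairs = list(itertools.combinations(mutation_strings, 2))
    frac_identical = sum(m1 == m2 for m1, m2 in pairs) / len(pairs)
    return math.sqrt(frac_identical)

# three CCSs for one barcode, two of which agree: 1 of 3 pairs identical
print(sketch_empirical_accuracy(['A5G', 'A5G', 'A5G T20C']))  # ~0.58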

We will do this for four sets of sequences:

  1. All of the CCSs retained above.
  2. CCSs retained by applying a PacBio ccs accuracy filter 10-fold more stringent than the one above. The rationale is that if this improves the concordance (real accuracy) of the CCSs substantially then maybe we should make the accuracy filter more stringent.
  3. Like (1) but excluding all CCSs with indels. The rationale is that we only really care about substitutions, and will exclude sequences with indels anyway.
  4. Like (2) but excluding all CCSs with indels.

First, we annotate the sequences with the number of indels and whether they have an indel to enable categorization into the aforementioned sets:

processed_ccs = processed_ccs.reset_index(drop=True)

processed_ccs = alignparse.consensus.add_mut_info_cols(processed_ccs,
                                                       mutation_col='gene_mutations',
                                                       n_indel_col='n_indels')

processed_ccs = processed_ccs.assign(has_indel=lambda x: x['n_indels'] > 0)

Plot how many sequences have indels:

_ = (
 ggplot(processed_ccs,
        aes('retained', fill='has_indel')) +
 geom_bar(position='dodge') +
 geom_text(aes(label='..count..'), stat='count', va='bottom', size=7,
           position=position_dodge(width=0.9), format_string='{:.2g}') +
 theme(figure_size=(2.5 * nlibs, 3),
       panel_grid_major_x=element_blank(),
       ) +
 ylab('number of CCSs') +
 scale_fill_manual(values=CBPALETTE[1:]) +
 facet_wrap('~ library', nrow=1)
 ).draw()


Now get the empirical accuracy for each of the CCS groups mentioned above:

high_acc = config['max_error_rate'] / 10
empirical_acc = []

for desc, query_str in [
        ('retained', 'retained'),
        ('retained, no indel', 'retained and not has_indel'),
        ('10X accuracy',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc})"),
        ('10X accuracy, no indel',
         f"(gene_error < {high_acc}) and (barcode_error < {high_acc}) and not has_indel")
        ]:
    # get just CCSs in that category
    df = processed_ccs.query(query_str)
    
    # compute empirical accuracy
    empirical_acc.append(
        alignparse.consensus.empirical_accuracy(df,
                                                mutation_col='gene_mutations')
        .assign(description=desc)
        .merge(df
               .groupby('library')
               .size()
               .rename('number_CCSs')
               .reset_index()
               )
        )

# make description categorical to preserve order, and annotate as "actual"
# the category ("retained, no indel") that we will use for building variants.
empirical_acc = (
    pd.concat(empirical_acc, ignore_index=True, sort=False)
    .assign(description=lambda x: pd.Categorical(x['description'],
                                                 x['description'].unique(),
                                                 ordered=True),
            actual=lambda x: numpy.where(x['description'] == 'retained, no indel',
                                         True, False),
            )
    )

Display table of the empirical accuracies:

display(HTML(empirical_acc.to_html(index=False)))
| library | accuracy | description | number_CCSs | actual |
|---------|----------|-------------|-------------|--------|
| lib1 | 0.956602 | retained | 724908 | False |
| lib2 | 0.949559 | retained | 577257 | False |
| lib1 | 0.987664 | retained, no indel | 696647 | True |
| lib2 | 0.982219 | retained, no indel | 553582 | True |
| lib1 | 0.967650 | 10X accuracy | 681243 | False |
| lib2 | 0.961755 | 10X accuracy | 541970 | False |
| lib1 | 0.989080 | 10X accuracy, no indel | 661332 | False |
| lib2 | 0.984277 | 10X accuracy, no indel | 525460 | False |

Plot the empirical accuracies, using a different color to show the category that we will actually use:

p = (
    ggplot(empirical_acc,
           aes('description', 'accuracy', color='actual', label='accuracy')
           ) +
    geom_point(size=3) +
    geom_text(va='bottom', size=9, format_string='{:.3g}', nudge_y=0.003) +
    facet_wrap('~ library', ncol=8) +
    theme(figure_size=(1.75 * nlibs, 2.25),
          axis_text_x=element_text(angle=90),
          panel_grid_major_x=element_blank(),
          ) +
    xlab('') +
    scale_y_continuous(name='empirical accuracy', limits=(0.95, 1.005)) +
    scale_color_manual(values=CBPALETTE, guide=False)
    )

plotfile = os.path.join(config['figs_dir'], 'empirical_CCS_accuracy.pdf')
print(f"Saving plot to {plotfile}")
_ = p.draw()
Saving plot to results/figures/empirical_CCS_accuracy.pdf


The above analysis shows that if we exclude sequences with indels (which we plan to do among our consensus sequences), then the accuracy of each CCS is around 99%. We do not get notably higher empirical accuracy by imposing a more stringent filter from the PacBio ccs program, indicating that the major sources of error are due to processes that are not modeled in this program's accuracy filter (perhaps strand exchange or barcode sharing).

Note that this empirical accuracy is for a single CCS. When we build the consensus sequences for each barcode below, we will take the consensus of CCSs within a barcode. So for barcodes with multiple CCSs, the actual accuracy of the consensus sequences will be higher than the empirical accuracy above due to capturing information from multiple CCSs.
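As a rough back-of-envelope (a simplified sketch assuming independent CCS errors, not a calculation from the notebook): with a per-CCS accuracy of about 0.99, a majority consensus of three CCSs is wrong only if at least two of them err in the same way, so the consensus error rate should be far below the per-CCS error rate:

# per-CCS error rate ~1%; a majority consensus over 3 CCSs errs only
# when >= 2 CCSs share the same error, which for independent errors is
# bounded above by (number of pairs) * (per-CCS error rate)**2
per_ccs_accuracy = 0.99
n_pairs = 3  # number of pairs among 3 CCSs
print(f"~{n_pairs * (1 - per_ccs_accuracy) ** 2:.0e}")  # ~3e-04 vs ~1e-02 per CCS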

Consensus sequences for barcodes

We call the consensus sequence for each barcode using the simple method implemented in alignparse.consensus.simple_mutconsensus. The documentation for that function explains the method in detail, but basically it works like this:

  1. When there is just one CCS per barcode, the consensus is just that sequence.
  2. When there are multiple CCSs per barcode, they are used to build a consensus. However, the entire barcode is discarded if there are many differences between the CCSs for the barcode, or high-frequency non-consensus mutations, since such discordance suggests errors like barcode collisions or strand exchange (see the toy sketch after this list).
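To make that concrete, here is a toy sketch on made-up data (the barcodes and mutation strings are invented for illustration; with the function's default thresholds, the discordant barcode should end up in the dropped data frame with a reason like 'subs diff too large'):

import pandas as pd

import alignparse.consensus

toy = pd.DataFrame({
    'library': ['lib1'] * 5,
    'barcode': ['AAACCC'] * 3 + ['GGGTTT'] * 2,
    'gene_mutations': ['A5G', 'A5G', 'A5G',       # concordant barcode
                       'A5G', 'C20T G50A T60C'],  # discordant barcode
    })

toy_consensus, toy_dropped = alignparse.consensus.simple_mutconsensus(
        toy,
        group_cols=('library', 'barcode'),
        mutation_col='gene_mutations',
        )
# expect: barcode AAACCC yields consensus 'A5G' supported by 3 CCSs,
# while barcode GGGTTT is discarded into `toy_dropped`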

First, call the consensus for each barcode including all retained sequences, even those with indels:

consensus, dropped = alignparse.consensus.simple_mutconsensus(
                        processed_ccs.query('retained'),
                        group_cols=('library', 'barcode', 'target'),
                        mutation_col='gene_mutations',
                        )

Here are the first few lines of the data frame of consensus sequences for each barcode. In addition to giving the library, barcode, target, and mutations, it also has a column indicating how many CCSs support the variant call:

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support |
|---------|---------|--------|----------------|----------------------|
| lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 |
| lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 |
| lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 |
| lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 |
| lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 |

Since we retain variants with substitutions but ignore those with indels, add information about substitution mutations and number of indels:

consensus = alignparse.consensus.add_mut_info_cols(
                    consensus,
                    mutation_col='gene_mutations',
                    sub_str_col='substitutions',
                    n_indel_col='number_of_indels',
                    overwrite_cols=True)

display(HTML(consensus.head().to_html(index=False)))
| library | barcode | target | gene_mutations | variant_call_support | substitutions | number_of_indels |
|---------|---------|--------|----------------|----------------------|---------------|------------------|
| lib1 | AAAAAAAAAACACCGG | CGG_naive | C357T T598A T599C A600T | 6 | C357T T598A T599C A600T | 0 |
| lib1 | AAAAAAAAAACATGAG | CGG_naive | C46T A47G | 1 | C46T A47G | 0 |
| lib1 | AAAAAAAAAAGCGACG | CGG_naive | G466C T467A G468T | 1 | G466C T467A G468T | 0 |
| lib1 | AAAAAAAAAAGGAAAG | CGG_naive | T329G G330T | 6 | T329G G330T | 0 |
| lib1 | AAAAAAAAAATATAGA | CGG_naive | T139C A140C C141A | 1 | T139C A140C C141A | 0 |

Plot distribution of number of CCSs supporting each variant call (consensus), indicating whether or not there is an indel:

max_variant_call_support = 6  # group variants with >= this much support

_ = (
 ggplot(consensus
        .assign(variant_call_support=lambda x: numpy.clip(x['variant_call_support'],
                                                          None,
                                                          max_variant_call_support),
                indel_state=lambda x: numpy.where(x['number_of_indels'] > 0,
                                                  'has indel', 'no indel')
                ),
        aes('variant_call_support')) +
 geom_bar() +
 ylab('number of variants') +
 facet_grid('indel_state ~ library') +
 theme(figure_size=(1.75 * nlibs, 3.5),
       panel_grid_major_x=element_blank(),
       ) 
 ).draw()


We see that most variant consensus sequences do not have indels, especially if we limit to the more "accurate" ones that have multiple CCSs supporting them.

We will ignore all consensus sequences with indels in the variant-barcode lookup table. We do this for two reasons:

  1. When there is just one CCS supporting a consensus, it is less likely to be accurate as indels are the main mode of PacBio error.
  2. For the purposes of our studies, we are interested in point mutations rather than indels anyway.

Here are the numbers of valid consensus sequences (no indels) for each library and target:

consensus = consensus.query('number_of_indels == 0')

lib_target_counts = (
    consensus
    .groupby(['library', 'target'])
    .size()
    .rename('consensus sequences')
    .reset_index()
    )

display(HTML(lib_target_counts.to_html(index=False)))

p = (ggplot(lib_target_counts.assign(xlabel=lambda x: x['target'] + ', ' + x['library']),
            aes('xlabel', 'consensus sequences')) +
     geom_point(size=3) +
     theme(figure_size=(0.5 * nlibs * ntargets, 1.75),
           axis_text_x=element_text(angle=90)) +
     xlab('') +
     scale_y_log10()
     )

_ = p.draw()
| library | target | consensus sequences |
|---------|--------|---------------------|
| lib1 | CGG_naive | 93059 |
| lib2 | CGG_naive | 99370 |


Below we write the retained consensus sequences to a CSV file that links the nucleotide mutations to the barcodes. (The next section analyzes this variant table in detail, and provides more precise information on the number of variants and relevant statistics):

print(f"Writing nucleotide variants to {config['nt_variant_table_file']}")
      
(consensus
 [['target', 'library', 'barcode', 'substitutions', 'variant_call_support']]
 .to_csv(config['nt_variant_table_file'], index=False)
 )
      
print('Here are the first few lines of this file:')
display(HTML(
    pd.read_csv(config['nt_variant_table_file'], na_filter=None)
    .head()
    .to_html(index=False)
    ))
Writing nucleotide variants to results/variants/nucleotide_variant_table.csv
Here are the first few lines of this file:
| target | library | barcode | substitutions | variant_call_support |
|--------|---------|---------|---------------|----------------------|
| CGG_naive | lib1 | AAAAAAAAAACACCGG | C357T T598A T599C A600T | 6 |
| CGG_naive | lib1 | AAAAAAAAAACATGAG | C46T A47G | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGCGACG | G466C T467A G468T | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGGAAAG | T329G G330T | 6 |
| CGG_naive | lib1 | AAAAAAAAAATATAGA | T139C A140C C141A | 1 |

What happened to the barcodes that we "dropped" because we could not construct a reliable consensus? The dropped data frame from alignparse.consensus.simple_mutconsensus has this information:

display(HTML(dropped.head().to_html(index=False)))
| library | barcode | target | drop_reason | nseqs |
|---------|---------|--------|-------------|-------|
| lib1 | AAAAAAAAAAATAATA | CGG_naive | subs diff too large | 10 |
| lib1 | AAAAAAAAAATCCGAG | CGG_naive | subs diff too large | 18 |
| lib1 | AAAAAAAACAACGTAA | CGG_naive | minor subs too frequent | 6 |
| lib1 | AAAAAAAATCTCAAAC | CGG_naive | subs diff too large | 73 |
| lib1 | AAAAAAACACAACTCA | CGG_naive | subs diff too large | 7 |

Summarize the information in this data frame on dropped barcodes with the plot below. This plot shows several things. First, we see that the total number of barcodes dropped is modest (just a few thousand per library) relative to the total number of barcodes per library (seen above to be roughly a hundred thousand). Second, the main reason that barcodes are dropped is that there are CCSs within the same barcode with suspiciously large numbers of mutations relative to the consensus; we use this as a filter to discard the entire barcode, since it could indicate strand exchange or some other issue. In any case, the modest number of dropped barcodes indicates that there probably isn't much of a need to worry:

max_nseqs = 8  # plot together all barcodes with >= this many sequences

_ = (
 ggplot(
    dropped.assign(nseqs=lambda x: numpy.clip(x['nseqs'], None, max_nseqs)),
    aes('nseqs')) + 
 geom_bar() + 
 scale_x_continuous(limits=(1, None)) +
 xlab('number of sequences for barcode') +
 ylab('number of barcodes') +
 facet_grid('library ~ drop_reason') +
 theme(figure_size=(10, 1.5 * nlibs),
       panel_grid_major_x=element_blank(),
       )
 ).draw()


Create barcode-variant table

We now create a CodonVariantTable that stores and processes all the information about the variant consensus sequences. Below we initialize such a table, and then analyze information about its composition.

Initialize codon variant table

In order to initialize the codon variant table, we need two pieces of information:

  1. The wildtype gene sequence.
  2. The list of nucleotide mutations for each variant as determined in the consensus calling above.

Read "wildtype" gene sequence to which we made the alignments (in order to do this, initialize an alignparse.Targets and get the gene sequence from it):

targets = alignparse.targets.Targets(seqsfile=config['amplicon'],
                                     feature_parse_specs=config['feature_parse_specs'])
geneseq = targets.get_target(config['primary_target']).get_feature('gene').seq

print(f"Read gene of {len(geneseq)} nts for {config['primary_target']} from {config['amplicon']}")
Read gene of 705 nts for CGG_naive from data/PacBio_amplicon.gb

Now initialize the codon variant table using this wildtype sequence and our list of nucleotide mutations for each variant:

variants = dms_variants.codonvarianttable.CodonVariantTable(
                barcode_variant_file=config['nt_variant_table_file'],
                geneseq=geneseq,
                primary_target=config['primary_target'],
                )

Basic stats on variants

We now will analyze the variants. In this call and in the plots below, we set samples=None as we aren't looking at variant counts in specific samples, but are simply looking at properties of the variants in the table.

Here are the number of variants for each target:

display(HTML(
    variants
    .n_variants_df(samples=None)
    .pivot_table(index=['target'],
                 columns='library',
                 values='count')
    .to_html()
    ))
| target | lib1 | lib2 | all libraries |
|--------|------|------|---------------|
| CGG_naive | 93059 | 99370 | 192429 |

Plot the number of variants supported by each number of CCSs:

max_support = 10  # group variants with >= this much support

p = variants.plotVariantSupportHistogram(max_support=max_support,
                                         widthscale=1.1,
                                         heightscale=0.9)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()


Mutations per variant

Plot the number of barcoded variants with each number of amino-acid and codon mutations. This is for the primary target only, and doesn't include the spiked-in secondary targets:

max_muts = 7  # group all variants with >= this many mutations

for mut_type in ['aa', 'codon']:
    p = variants.plotNumMutsHistogram(mut_type, samples=None, max_muts=max_muts,
                                      widthscale=1.1,
                                      heightscale=0.9)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'], f"n_{mut_type}_muts_per_variant.pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/n_aa_muts_per_variant.pdf
Saving plot to results/figures/n_codon_muts_per_variant.pdf


Plot the frequencies of different codon mutation types among all variants (any number of mutations), again only for primary target:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant',
                                    heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()
plotfile = os.path.join(config['figs_dir'], f"avg_muts_per_variant.pdf")
print(f"Saving plot to {plotfile}")
p.save(plotfile)
Saving plot to results/figures/avg_muts_per_variant.pdf


Variants supported by multiple PacBio CCSs should have fewer spurious mutations, since sequencing errors are very unlikely to occur at the same position in two independent CCSs. Below we plot the number of codon mutations per variant among variants with at least two CCSs supporting their call. The difference between the mutation rates here and in the plot above (which does not apply the min_support=2 filter) gives some estimate of the frequency of spurious mutations in our variants. In fact, we see the numbers are very similar, indicating that few of the mutations are spurious:

p = variants.plotNumCodonMutsByType(variant_type='all', samples=None,
                                    ylabel='mutations per variant', 
                                    min_support=2, heightscale=0.8)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw()


Completeness of mutation sampling

We examine how completely amino-acid mutations are sampled by the variants for the primary target, looking at single-mutant variants only and all variants. The plot below shows that virtually every mutation is found in a variant in each library, even if we just look among the single mutants. Things look especially good if we aggregate across libraries:

for variant_type in ['all', 'single']:
    p = variants.plotCumulMutCoverage(variant_type, mut_type='aa', samples=None)
    _ = p.draw()
    plotfile = os.path.join(config['figs_dir'],
                            f"variant_cumul_{variant_type}_mut_coverage.pdf")
    print(f"Saving plot to {plotfile}")
    p.save(plotfile)
Saving plot to results/figures/variant_cumul_all_mut_coverage.pdf
Saving plot to results/figures/variant_cumul_single_mut_coverage.pdf


To get more quantitative information like that plotted above, we determine how many mutations are found 0, 1, or >1 times both among single and all mutants for the primary target:

count_dfs = []
for variant_type in ['all', 'single']:
    i_counts = (variants.mutCounts(variant_type, mut_type='aa', samples=None)
                .assign(variant_type=variant_type)
                )
    count_dfs += [i_counts.assign(include_stops=True),
                  i_counts
                  .query('not mutation.str.contains("\*")', engine='python')
                  .assign(include_stops=False)
                  ]
    
display(HTML(
    pd.concat(count_dfs)
    .assign(count=lambda x: (numpy.clip(x['count'], None, 2)
                             .map({0: '0', 1: '1', 2:'>1'}))
            )
    .groupby(['variant_type', 'include_stops', 'library', 'count'])
    .aggregate(number_of_mutations=pd.NamedAgg(column='mutation', aggfunc='count'))
    .to_html()
    ))
| variant_type | include_stops | library | count | number_of_mutations |
|--------------|---------------|---------|-------|---------------------|
| all | False | lib1 | 0 | 240 |
| all | False | lib1 | 1 | 25 |
| all | False | lib1 | >1 | 4200 |
| all | False | lib2 | 0 | 233 |
| all | False | lib2 | 1 | 23 |
| all | False | lib2 | >1 | 4209 |
| all | False | all libraries | 0 | 225 |
| all | False | all libraries | 1 | 8 |
| all | False | all libraries | >1 | 4232 |
| all | True | lib1 | 0 | 414 |
| all | True | lib1 | 1 | 44 |
| all | True | lib1 | >1 | 4242 |
| all | True | lib2 | 0 | 399 |
| all | True | lib2 | 1 | 39 |
| all | True | lib2 | >1 | 4262 |
| all | True | all libraries | 0 | 378 |
| all | True | all libraries | 1 | 32 |
| all | True | all libraries | >1 | 4290 |
| single | False | lib1 | 0 | 275 |
| single | False | lib1 | 1 | 35 |
| single | False | lib1 | >1 | 4155 |
| single | False | lib2 | 0 | 283 |
| single | False | lib2 | 1 | 26 |
| single | False | lib2 | >1 | 4156 |
| single | False | all libraries | 0 | 257 |
| single | False | all libraries | 1 | 22 |
| single | False | all libraries | >1 | 4186 |
| single | True | lib1 | 0 | 477 |
| single | True | lib1 | 1 | 57 |
| single | True | lib1 | >1 | 4166 |
| single | True | lib2 | 0 | 477 |
| single | True | lib2 | 1 | 57 |
| single | True | lib2 | >1 | 4166 |
| single | True | all libraries | 0 | 434 |
| single | True | all libraries | 1 | 60 |
| single | True | all libraries | >1 | 4206 |

Mutation frequencies along gene

We plot the frequencies of mutations along the gene among the variants for the primary target. Ideally, this would be uniform. We make the plot for both all variants and single-mutant / wildtype variants:

for variant_type in ['all', 'single']:
    p = variants.plotMutFreqs(variant_type, mut_type='codon', samples=None)
    p.draw()


We can also use heat maps to examine the extent to which specific amino-acid or codon mutations are over-represented. These heat maps are large, so we make them just for all variants and the merge of all libraries:

for mut_type in ['aa', 'codon']:
    p = variants.plotMutHeatmap('all', mut_type, samples=None, #libraries='all_only',
                                widthscale=2)
    p.draw()


Write codon-variant table

We write the codon variant table to a CSV file. This table looks like this:

display(HTML(
    variants.barcode_variant_df
    .head()
    .to_html(index=False)
    ))
| target | library | barcode | variant_call_support | codon_substitutions | aa_substitutions | n_codon_substitutions | n_aa_substitutions |
|--------|---------|---------|----------------------|---------------------|------------------|-----------------------|--------------------|
| CGG_naive | lib1 | AAAAAAAAAACACCGG | 6 | GGC119GGT TTA200ACT | L200T | 2 | 1 |
| CGG_naive | lib1 | AAAAAAAAAACATGAG | 1 | CAG16TGG | Q16W | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGCGACG | 1 | GTG156CAT | V156H | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAAGGAAAG | 6 | GTG110GGT | V110G | 1 | 1 |
| CGG_naive | lib1 | AAAAAAAAAATATAGA | 1 | TAC47CCA | Y47P | 1 | 1 |

Note how this table differs from the nucleotide variant table we generated above and used to initialize the CodonVariantTable in that it gives codon substitutions and associated amino-acid substitutions.
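As a worked illustration using the first row above (a sanity check, not part of the notebook's pipeline): the nucleotide substitutions C357T T598A T599C A600T fall in codon 119 (nucleotides 355-357, where C357T changes GGC to the synonymous GGT) and codon 200 (nucleotides 598-600, where T598A T599C A600T change TTA to ACT, the amino-acid substitution L200T):

# map 1-based nucleotide positions to 1-based codon numbers
for nt_pos in [357, 598, 599, 600]:
    codon_number = (nt_pos - 1) // 3 + 1
    print(f"nucleotide {nt_pos} -> codon {codon_number}")
# nucleotide 357 -> codon 119; nucleotides 598-600 -> codon 200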

Write it to CSV file:

print(f"Writing codon-variant table to {config['codon_variant_table_file']}")

variants.barcode_variant_df.to_csv(config['codon_variant_table_file'], index=False)
Writing codon-variant table to results/variants/codon_variant_table.csv