Releases: fedarko/strainFlye

v0.2.0

02 Nov 23:15

This release makes strainFlye essentially "feature-complete"—it adds in support for the final few features we had planned.

As with the previous release: the code is mostly well-tested, although some sections of the align, fdr, and matrix modules in particular are not yet covered by tests. There are also a few open issues for improving certain commands' performance.

New features

  • Implemented commands for creating link graphs: strainFlye link nt and strainFlye link graph.
  • Implemented commands for creating mutation matrices: strainFlye matrix count and strainFlye matrix fill.
  • Implemented a command for comparing coverage and skew: strainFlye dynam covskew.

Documentation

  • Added information to the tutorial describing how to use the aforementioned new commands and visualize their outputs.
  • Various other improvements to the tutorial, and to the AnalyzingDiversityIndices tutorial notebook.
  • Various improvements to the README (e.g. added a "Quick descriptions of each strainFlye command" section).
  • Removed the PlottingFDRCurves tutorial notebook, in favor of just linking to the up-to-date notebook used in the fedarko/sheepgut repository.
  • Various improvements to the CLI.

Bug fixes

  • Fixed a bug in the overlapping-supplementary-alignment (OSA) filter used in strainFlye align. The problem was fixed in commit fd2c5bd.

    • The problem: strainFlye's get_coords() function, which returns the coordinates on a contig that a given linear alignment spans, returned coordinate ranges whose end was 2 nucleotides too large. Rather than returning [start (0-indexed), end (0-indexed)], it would return [start (0-indexed), end (0-indexed) + 2]. The cause was using + 1 rather than - 1 when computing the end coordinate. Sorry for the trouble!
    • This problem caused overlap detection to be slightly overzealous, making the OSA filter slightly too strict about which reads it would filter out. From testing, this impacted a nonzero but relatively small number of reads in the SheepGut and ChickenGut datasets used in the paper. (So we'll either rerun things for the final paper version, or add a note about this bug to the paper.)
    • More details, from an email I sent about this:

      To understand the impacts of this bug on the paper, I analyzed the SheepGut and ChickenGut alignments before doing any filtering in order to see which reads would have been erroneously filtered out. I found that the percentage of reads that were incorrectly filtered out is very small (in SheepGut: 27,211 reads, or 0.12% of the total number of reads in the dataset; in ChickenGut: 1,596 reads, or 0.08%). These impacted reads seem to mostly be strange cases where minimap2 would create two distinct linear alignments of a read directly next to each other (but not overlapping) on the same contig: in one example, part of a read is aligned to the inclusive interval of positions [504,493 bp, 515,426 bp] on a contig, and another part of this same read is aligned to the inclusive interval of positions [515,427 bp, 516,076 bp] on this same contig. So, for some reason what would have been a single linear alignment becomes two.

      The three selected MAGs (CAMP, BACT1, BACT2) are barely affected: there were 16, 9, and 3 reads erroneously filtered for these MAGs, respectively. So, the bug should have a minimal impact on the paper's conclusions.

  • Gracefully handle the case where no gaps in a contig were long enough (aa2871d).
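
As a rough illustration of the OSA filter bug described above, here is a minimal sketch (not strainFlye's actual code). It assumes the aligner reports each alignment's end as a half-open coordinate, one past the last aligned position; the function names inclusive_coords, buggy_coords, and overlaps are hypothetical. The two intervals are taken from the example in the email and treated as 0-indexed for illustration.

```python
def inclusive_coords(start, end_exclusive):
    """Return (start, end) as 0-indexed inclusive coordinates."""
    return (start, end_exclusive - 1)


def buggy_coords(start, end_exclusive):
    """Reproduce the off-by-two error: + 1 where - 1 was intended."""
    return (start, end_exclusive + 1)


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two adjacent, non-overlapping alignments of the same read, written as
# (start, end_exclusive). Inclusive, these are [504493, 515426] and
# [515427, 516076].
first = (504493, 515427)
second = (515427, 516077)

correct = overlaps(inclusive_coords(*first), inclusive_coords(*second))
buggy = overlaps(buggy_coords(*first), buggy_coords(*second))
print("overlap with correct coordinates:", correct)
print("overlap with buggy coordinates:", buggy)
```

With the inflated end coordinates, the first alignment appears to extend into the second one, so a read like this would be flagged as having overlapping supplementary alignments even though its alignments only sit next to each other.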

Miscellaneous

  • Renamed the P_Value column in the strainFlye spot cold-gaps output to LongestGap_P_Value, to avoid confusion.
  • In the strainFlye spot hot-features output, the values in the PercentMutatedPositions column no longer end with % signs. This makes it easier to load these TSV files with Pandas.
  • Moved matplotlib to an optional dependency—as of writing, it's only used in the tutorial (so if a user doesn't want to do any plotting, or would prefer to use another package like ggplot in R to do plotting, then there's no need to install matplotlib).
  • Added SciPy as a setup requirement. (We could maybe have it as just an "install requirement" instead, but I remember that having NumPy as an install requirement can fail in some cases, so I figure this is safer.)
  • Explicitly set a maximum Python version (< 3.8, at the moment).
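
To show why dropping the % signs from PercentMutatedPositions helps, here is a small sketch using Pandas. The two in-memory TSV excerpts are made up for illustration: the Feature column and all values are hypothetical, and only the PercentMutatedPositions column name comes from these notes.

```python
import io

import pandas as pd

# Hypothetical two-row excerpts of a hot-features TSV, before and after
# this release. Only the PercentMutatedPositions name is from the notes.
old_tsv = "Feature\tPercentMutatedPositions\ngene_1\t1.25%\ngene_2\t0.50%\n"
new_tsv = "Feature\tPercentMutatedPositions\ngene_1\t1.25\ngene_2\t0.50\n"

old_df = pd.read_csv(io.StringIO(old_tsv), sep="\t")
new_df = pd.read_csv(io.StringIO(new_tsv), sep="\t")

# With trailing % signs, the column is read as text, so it has to be
# cleaned up by hand before any numeric work.
old_is_numeric = pd.api.types.is_numeric_dtype(old_df["PercentMutatedPositions"])
converted = old_df["PercentMutatedPositions"].str.rstrip("%").astype(float)

# Without the % signs, the column is parsed as numbers directly.
new_is_numeric = pd.api.types.is_numeric_dtype(new_df["PercentMutatedPositions"])

print("old column numeric:", old_is_numeric)
print("new column numeric:", new_is_numeric)
```

Files written by v0.1.0 can still be loaded with the rstrip("%") conversion shown above.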

🦠 🛩️

Full Changelog: v0.1.0...v0.2.0

v0.1.0: Initial release

20 Sep 21:41

Includes implementations of the following commands:

  • align
  • call (p-mutation, r-mutation)
  • fdr (estimate, fix)
  • spot (hot-features, cold-gaps)
  • smooth (create, assemble)
  • utils (gfa-to-fasta)

These are mostly derived from the ad hoc analysis code in the sheepgut repository. These commands are covered by a reasonably large set of tests, although some commands should ideally be better covered (there are a decent number of uncovered lines in the align, fdr estimate, and fdr fix commands).

The planned commands to implement next are link (for constructing link graphs), matrix (for computing codon / amino acid mutation matrices), and covskew (for plotting coverage vs. skew, and computing peak-to-trough ratios).

This also includes some work-in-progress Jupyter notebooks demonstrating how to use strainFlye.

There isn't anything super special about this particular commit; I just wanted to tag a release as a starting point.