Releases: fedarko/strainFlye
v0.2.0
This release makes strainFlye essentially "feature-complete": it adds support for the final few features we had planned.
As with the previous release: the code is mostly well-tested, although there are some uncovered sections of the align, fdr, and matrix modules in particular. There are also a few open issues for improving certain commands' performance.
New features
- Implemented commands for creating link graphs: strainFlye link nt and strainFlye link graph.
- Implemented commands for creating mutation matrices: strainFlye matrix count and strainFlye matrix fill.
- Implemented a command for comparing coverage and skew: strainFlye dynam covskew.
Documentation
- Added information to the tutorial describing how to use the aforementioned new commands and visualize their outputs.
- Various other improvements to the tutorial, and to the AnalyzingDiversityIndices tutorial notebook.
- Various improvements to the README (e.g. added a "Quick descriptions of each strainFlye command" section).
- Removed the PlottingFDRCurves tutorial notebook, in favor of just linking to the up-to-date notebook used in the fedarko/sheepgut repository.
- Various improvements to the CLI.
Bug fixes
- Fixed a bug in the overlapping-supplementary-alignment (OSA) filter used in strainFlye align. The problem was fixed in commit fd2c5bd.
  - The problem: strainFlye's get_coords() function, which returns the coordinates on a contig that a given linear alignment spans, returned coordinate ranges that were off by 2 nucleotides. Rather than returning [start (0-indexed), end (0-indexed)], it would return [start (0-indexed), end (0-indexed) + 2]. The problem was caused by using + 1 rather than - 1. Sorry for the trouble!
  - This problem caused overlap detection to be slightly overzealous, making the OSA filter slightly too strict about which reads it would filter out. From testing, this impacted a nonzero but relatively small number of reads in the SheepGut and ChickenGut datasets used in the paper. (So we'll either rerun things for the final paper version, or add a note about this bug to the paper.)
  - More details, from an email I sent about this:
To understand the impacts of this bug on the paper, I analyzed the SheepGut and ChickenGut alignments before doing any filtering in order to see which reads would have been erroneously filtered out. I found that the percentage of reads that were incorrectly filtered out is very small (in SheepGut: 27,211 reads, or 0.12% of the total number of reads in the dataset; in ChickenGut: 1,596 reads, or 0.08%). These impacted reads seem to mostly be strange cases where minimap2 would create two distinct linear alignments of a read directly next to each other (but not overlapping) on the same contig: in one example, part of a read is aligned to the inclusive interval of positions [504,493 bp, 515,426 bp] on a contig, and another part of this same read is aligned to the inclusive interval of positions [515,427 bp, 516,076 bp] on this same contig. So, for some reason what would have been a single linear alignment becomes two.
The three selected MAGs (CAMP, BACT1, BACT2) are barely affected: there were 16, 9, and 3 reads erroneously filtered for these MAGs, respectively. So, the bug should have a minimal impact on the paper's conclusions.
- Gracefully handle the case where no gaps in a contig were long enough (aa2871d).
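To make the get_coords() off-by-two issue above concrete, here is a minimal sketch of how such an error can turn two adjacent alignments into an apparent overlap. The helper names are hypothetical and this is not strainFlye's actual code. It also assumes the aligner reports a half-open end coordinate (one past the last aligned position), which is one way that + 1 in place of - 1 would yield an end that is 2 too large. The intervals are the ones quoted in the email above.

```python
# Illustrative sketch only: these helpers are hypothetical, not strainFlye's code.
# Assumption: the input end coordinate is exclusive (one past the last position).

def coords_fixed(start, end_exclusive):
    """Return the inclusive 0-indexed (start, end) span of an alignment."""
    return (start, end_exclusive - 1)


def coords_buggy(start, end_exclusive):
    """Reproduce the off-by-two span: + 1 where - 1 was needed."""
    return (start, end_exclusive + 1)


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two adjacent, non-overlapping alignments of the same read on one contig,
# written as (start, exclusive end). Inclusive spans: [504493, 515426] and
# [515427, 516076].
first = (504493, 515427)
second = (515427, 516077)

fixed_overlap = overlaps(coords_fixed(*first), coords_fixed(*second))
buggy_overlap = overlaps(coords_buggy(*first), coords_buggy(*second))

print(fixed_overlap, buggy_overlap)  # prints: False True
```

With the corrected coordinates the two alignments are merely adjacent, so a filter that only removes overlapping alignments would leave the read alone. With the inflated end coordinate they appear to overlap, which is consistent with the filter having been slightly too strict.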
Miscellaneous
- Renamed the P_Value column in the strainFlye spot cold-gaps output to LongestGap_P_Value, to avoid confusion.
- In the strainFlye spot hot-features output, the values in the PercentMutatedPositions column no longer end with % signs. This makes it easier to load these TSV files with Pandas.
- Moved matplotlib to an optional dependency. As of writing, it's only used in the tutorial (so if a user doesn't want to do any plotting, or would prefer to use another package like ggplot in R to do plotting, then there's no need to install matplotlib).
- Added SciPy as a setup requirement. (We could maybe have it as just an "install requirement" instead, but I remember that having NumPy as an install requirement can fail in some cases, so I figure this is safer.)
- Explicitly set a maximum Python version (< 3.8, at the moment).
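As a small illustration of why dropping the trailing % signs from PercentMutatedPositions helps, the sketch below parses a made-up two-row TSV in both the old and the new style. The Feature column, the row names, and the values are hypothetical; only the PercentMutatedPositions column name comes from the notes above. It uses the standard library csv module to stay dependency-free. The same idea applies to Pandas, where a column holding values like 12.5% is typically read as strings rather than floats.

```python
import csv
import io

# Hypothetical excerpts of a hot-features TSV (old style vs. new style).
OLD_STYLE = "Feature\tPercentMutatedPositions\ngene_1\t12.5%\ngene_2\t3.0%\n"
NEW_STYLE = "Feature\tPercentMutatedPositions\ngene_1\t12.5\ngene_2\t3.0\n"


def read_percentages(tsv_text):
    """Parse the PercentMutatedPositions column of a TSV string as floats."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [float(row["PercentMutatedPositions"]) for row in reader]


# New-style values convert to floats directly.
new_values = read_percentages(NEW_STYLE)

# Old-style values fail, because float("12.5%") raises ValueError.
try:
    read_percentages(OLD_STYLE)
    old_style_parses = True
except ValueError:
    old_style_parses = False

print(new_values, old_style_parses)  # prints: [12.5, 3.0] False
```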
🦠 🛩️
Full Changelog: v0.1.0...v0.2.0
v0.1.0: Initial release
Includes implementations of the following commands:
- align
- call (p-mutation, r-mutation)
- fdr (estimate, fix)
- spot (hot-features, cold-gaps)
- smooth (create, assemble)
- utils (gfa-to-fasta)
These are mostly derived from the ad hoc analysis code in the sheepgut repository. These commands are covered by a reasonably large set of tests, although some commands should ideally be better covered (there are a fair number of uncovered lines in the align, fdr estimate, and fdr fix commands).
The planned commands to implement next are link (for constructing link graphs), matrix (for computing codon / amino acid mutation matrices), and covskew (for plotting coverage vs. skew, and computing peak-to-trough ratios).
This also includes some work-in-progress Jupyter notebooks demonstrating how to use strainFlye.
There isn't anything super special about this particular commit; I just wanted to tag a release as a starting point.