Io to bio (#1)
* Moved io to bio

* fixed io imports

* Add more generic definitions to bio

* Update bio/fastq/fastq.go

Co-authored-by: Willow Carretero Chavez <[email protected]>

* update fasta

* add fasta updates and parser

* made readability improvements

* changed ParseWithHeader

* removed int64 in reads

* add more example tests

* gotta update this for this tests!

* integrate slow5

* have examples covering most of changes

* removed interfaces

* updated with NewXXXParser

* added 3 parsers

* added pileup

* add concurrent functions plus better documentation

* moved svb to ioToBio

* Improve tests

* make better docs for header

* Update bio/fasta/fasta_test.go

Co-authored-by: Willow Carretero Chavez <[email protected]>

* changed name of LowLevelParser to parserInterface

* zw -> zipWriter

* remove a identifier from pileup

* genbank parser now compatible

* writeTo interface now fulfilled

* make linter happy :)

* convert all types to io.WriterTo

* fixed linter issues

* handle EOF better

* fixed tutorial

* fix genbank read error

* remove io.WriterTo checks

* fix with cmp.Equal

* genbank tests merged

* sample merge

* make linter happy

* Added generic collections module

* Switched Feature.Attributes to use multimap

* Fixed tests

* Ran linter

* Added copy methods

* Adds new functional test that addresses case where there is a partial on an implicit range, standardizes parse error format for clarity

* Ran linter

* Add capability to compute sequence features and marshal en masse

* Add methods to convert polyjson -> genbank

* Removed generic collections library in favor of hand-rolled multimap, with the added benefit of better cmp interop

* Propagate handrolled multimap to test files

* Responded to more comments

* Reduced new example genbank file

* Resolved lint errors, added test StoreFeatureSequences and fixed uncovered bug

* Added multimap.go file doc

* Responded to more comments

* Fixed deref issue

* Fixed tests, moved genbank files

* Fixed fasta docs

* Added changelog

* added to changelog

* changed

* fix linter issues

---------

Co-authored-by: Willow Carretero Chavez <[email protected]>
Co-authored-by: Alex <[email protected]>
3 people authored Dec 7, 2023
1 parent f76bf05 commit 1c7c3bc
Showing 114 changed files with 3,030 additions and 2,190 deletions.
24 changes: 23 additions & 1 deletion CHANGELOG.md
@@ -7,5 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Generic parser is now implemented across all parsers for consistent interactions. [(#339)](https://github.com/TimothyStiles/poly/issues/339)
- Alternative start codons can now be used in the `synthesis/codon` DNA -> protein translation package [(#305)](https://github.com/TimothyStiles/poly/issues/305)
- Added a parser and writer for the `pileup` sequence alignment format [(#329)](https://github.com/TimothyStiles/poly/issues/329)
- Created copy methods for Feature and Location to address concerns raised by [(#342)](https://github.com/TimothyStiles/poly/issues/342)
- Created new methods to convert polyjson -> genbank.
- Created new `Feature.StoreSequence` method to enable [(#388)](https://github.com/TimothyStiles/poly/issues/388)

### Changed
- **Breaking**: Genbank parser uses new custom multimap for `Feature.Attributes`, which allows for duplicate keys. This changes the type of `Feature.Attributes` from `map[string]string` to `MultiMap[string, string]`, an alias for `map[string][]string` defined in `multimap.go`. [(#383)](https://github.com/TimothyStiles/poly/issues/383)
- Improves error reporting for genbank parse errors via a new `ParseError` struct.
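
To make the breaking `Feature.Attributes` change concrete, here is a minimal sketch of a multimap as a map from keys to slices of values. The `MultiMap` name comes from the entry above; the `Put` helper, the `V any` constraint, and the example qualifier values are assumptions for illustration and may not match the API in `multimap.go`.

```go
package main

import "fmt"

// MultiMap maps each key to a slice of values, so one key can carry
// several values. Hypothetical sketch; the real type lives in multimap.go.
type MultiMap[K comparable, V any] map[K][]V

// Put appends value to the values already stored under key.
// Hypothetical helper, named here only for illustration.
func Put[K comparable, V any](m MultiMap[K, V], key K, value V) {
	m[key] = append(m[key], value)
}

func main() {
	attributes := MultiMap[string, string]{}
	// A plain map[string]string would keep only the second value.
	Put(attributes, "db_xref", "GeneID:1")
	Put(attributes, "db_xref", "taxon:2")
	fmt.Println(len(attributes["db_xref"]), attributes["db_xref"])
}
```

Code that previously read `feature.Attributes["key"]` as a single string would, under this shape, receive a slice and need to pick or iterate over its elements.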

### Fixed
- Fixed bug that produced wrong overhang in linear, non-directional, single cut reactions. #408
- `fastq` parser no longer becomes de-aligned when reading [(#325)](https://github.com/TimothyStiles/poly/issues/325)
- `fastq` now handles optionals correctly [(#323)](https://github.com/TimothyStiles/poly/issues/323)
- Adds functional test and fix for [(#313)](https://github.com/TimothyStiles/poly/issues/313).
- In addition to expanding the set of genbank files which can be validly parsed, the parser is more vocal when it encounters unusual syntax in the "feature" section. This "fail fast" approach is better as there were cases where inputs triggered a codepath which would neither return a valid Genbank object nor an error, and should help with debugging.

## [0.26.0] - 2023-07-22
Oops, we weren't keeping a changelog before this tag!

[unreleased]: https://github.com/TimothyStiles/poly/compare/v0.26.0...main
[0.26.0]: https://github.com/TimothyStiles/poly/releases/tag/v0.26.0
12 changes: 12 additions & 0 deletions CONTRIBUTING.md
@@ -91,6 +91,18 @@ In order to simplify the development experience, and environment setup, the poly

Whether you're a beginner with Go or an experienced developer, you should see the suggestions pop up automatically when you go to the *Plugins* tab in VSCode. Using these plugins can help accelerate the development experience and also allow you to work more collaboratively with other poly developers.

## Local Checks

Poly runs numerous CI/CD checks via GitHub Actions before a PR can be merged. To be mergeable, your PR must pass all of these checks.

A quick way to check your PR will pass is to run:

```sh
gofmt -s -w . && go test ./...
```

Additionally, you may want to [install](https://golangci-lint.run/usage/install/#local-installation) and run the linter.

# How to report a bug

### Security disclosures
29 changes: 9 additions & 20 deletions README.md
@@ -1,50 +1,39 @@
# (Poly)merase <img align="right" src="https://cdn.discordapp.com/attachments/766785755305213953/777596834734145546/ProfileFrameArtboard_1.png" width="100">

[![PkgGoDev](https://pkg.go.dev/badge/github.com/TimothyStiles/poly)](https://pkg.go.dev/github.com/TimothyStiles/poly)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/TimothyStiles/poly/blob/main/LICENSE)
![Tests](https://github.com/TimothyStiles/poly/workflows/Test/badge.svg)
![Test Coverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/TimothyStiles/e58f265655ac0acacdd1a38376ccd32a/raw/coverage.json)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/koeng101/poly/blob/main/LICENSE)
![Tests](https://github.com/koeng101/poly/workflows/Test/badge.svg)
![Test Coverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/koeng101/e58f265655ac0acacdd1a38376ccd32a/raw/coverage.json)

Poly is a Go package for engineering organisms.
Poly is a Go package for engineering organisms. This is a fork of the main poly project incorporating more features and bug fixes.

* **Fast:** Poly is fast and scalable.

* **Modern:** Poly tackles issues that other libraries and utilities just don't. From general codon optimization and primer design to circular sequence hashing. All written in a language that was designed to be fast, scalable, and easy to develop in and maintain. Did we say it was fast?

* **Reproducible:** Poly is well tested and designed to be used in industrial, academic, and hobbyist settings. No more copy and pasting strings into random websites to process the data you need.

* **Ambitious:** Poly's goal is to be the most complete, open, and well used collection of computational synthetic biology tools ever assembled. If you like our dream and want to support us please star this repo, request a feature, open a pull request, or [sponsor the project](https://github.com/sponsors/TimothyStiles).
* **Ambitious:** Poly's goal is to be the most complete, open, and well used collection of computational synthetic biology tools ever assembled. If you like our dream and want to support us please star this repo, request a feature, or open a pull request.


## Install

`go get github.com/TimothyStiles/poly@latest`
`go get github.com/koeng101/poly@latest`

## Documentation


* **[Library](https://pkg.go.dev/github.com/TimothyStiles/poly#pkg-examples)**
* **[Library](https://pkg.go.dev/github.com/koeng101/poly#pkg-examples)**

* **[Tutorials](https://github.com/TimothyStiles/poly/tree/main/tutorials)**

* **[Learning Synbio](https://github.com/TimothyStiles/how-to-synbio)**

## Community

* **[Discord](https://discord.gg/Hc8Ncwt):** Chat about Poly and join us for game nights on our discord server!
* **[Tutorials](https://github.com/koeng101/poly/tree/main/tutorials)**

## Contributing

* **[Code of conduct](CODE_OF_CONDUCT.md):** Please read the full text so you can understand what we're all about and remember to be excellent to each other!

* **[Contributor's guide](CONTRIBUTING.md):** Please read through it before you start hacking away and pushing contributions to this fine codebase.

## Sponsor

* **[Sponsor](https://github.com/sponsors/TimothyStiles):** 🤘 Thanks for your support 🤘

## License

* [MIT](LICENSE)

* Copyright (c) 2023 Timothy Stiles
* Copyright (c) 2023 Keoni Gandall, Timothy Stiles
275 changes: 275 additions & 0 deletions bio/bio.go
@@ -0,0 +1,275 @@
/*
Package bio provides utilities for reading and writing sequence data.
There are many different biological file formats for different applications.
The poly bio package provides a consistent way to work with each of these file
formats. The underlying data returned by each parser is as raw as we can return
while still being easy to use for downstream applications, and should be
immediately recognizable as the original format.
*/
package bio

import (
"bufio"
"context"
"errors"
"io"
"math"

"github.com/TimothyStiles/poly/bio/fasta"
"github.com/TimothyStiles/poly/bio/fastq"
"github.com/TimothyStiles/poly/bio/genbank"
"github.com/TimothyStiles/poly/bio/pileup"
"github.com/TimothyStiles/poly/bio/slow5"
"golang.org/x/sync/errgroup"
)

// Format is an enum of different parser formats.
type Format int

const (
Fasta Format = iota
Fastq
Genbank
Slow5
Pileup
)

// DefaultMaxLengths are defined for performance reasons. While parsing,
// reading byte-by-byte takes far, far longer than reading many bytes into a
// buffer. In Go, the default bufio scanner buffer is 64kB. However, many
// files, especially slow5 files, can contain single lines that are much
// longer. We use the default max token size from bufio unless there is a
// particular reason to use a different number.
const defaultMaxLineLength int = bufio.MaxScanTokenSize // 64kB is a magic number often used by the Go stdlib for parsing.
var DefaultMaxLengths = map[Format]int{
Fasta: defaultMaxLineLength,
Fastq: 8 * 1024 * 1024, // The longest single nanopore sequencing read so far is 4Mb. An 8MB buffer should be large enough for any sequencing.
Genbank: defaultMaxLineLength,
Slow5: 128 * 1024 * 1024, // 128mb is used because slow5 lines can be massive, since a single read can be many millions of base pairs.
Pileup: defaultMaxLineLength,
}

/******************************************************************************
Aug 30, 2023
Lower level interfaces
******************************************************************************/

// parserInterface is a generic interface that all parsers must support. It is
// very simple, only requiring two functions, Header() and Next(). Header()
// returns the header of the file if there is one: most files, like fasta,
// fastq, and pileup do not contain headers, while others like sam and slow5 do
// have headers. Next() returns a record/read/line from the file format, and
// terminates on an io.EOF error.
//
// Next() may terminate with an io.EOF error with a nil Data or with a
// full Data, depending on where the EOF is in the actual file. A check
// for this is needed at the last Next(), when it returns an io.EOF error. A
// pointer is used to represent the difference between a null Data and an
// empty Data.
type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
Header() (Header, error)
Next() (Data, error)
}

/******************************************************************************
Higher level parse
******************************************************************************/

// Parser is a generic bioinformatics file parser. It contains a
// parserInterface and implements useful functions on top of it, such as
// Parse(), ParseToChannel(), and ParseWithHeader().
type Parser[Data io.WriterTo, Header io.WriterTo] struct {
parserInterface parserInterface[Data, Header]
}

// NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}

// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}

// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}

// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
}

// NewGenbankParser initiates a new Genbank parser from an io.Reader.
func NewGenbankParser(r io.Reader) (*Parser[*genbank.Genbank, *genbank.Header], error) {
return NewGenbankParserWithMaxLineLength(r, DefaultMaxLengths[Genbank])
}

// NewGenbankParserWithMaxLineLength initiates a new Genbank parser from an
// io.Reader and a user-given maxLineLength.
func NewGenbankParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*genbank.Genbank, *genbank.Header], error) {
return &Parser[*genbank.Genbank, *genbank.Header]{parserInterface: genbank.NewParser(r, maxLineLength)}, nil
}

// NewSlow5Parser initiates a new SLOW5 parser from an io.Reader.
func NewSlow5Parser(r io.Reader) (*Parser[*slow5.Read, *slow5.Header], error) {
return NewSlow5ParserWithMaxLineLength(r, DefaultMaxLengths[Slow5])
}

// NewSlow5ParserWithMaxLineLength initiates a new SLOW5 parser from an
// io.Reader and a user-given maxLineLength.
func NewSlow5ParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*slow5.Read, *slow5.Header], error) {
parser, err := slow5.NewParser(r, maxLineLength)
return &Parser[*slow5.Read, *slow5.Header]{parserInterface: parser}, err
}

// NewPileupParser initiates a new Pileup parser from an io.Reader.
func NewPileupParser(r io.Reader) (*Parser[*pileup.Line, *pileup.Header], error) {
return NewPileupParserWithMaxLineLength(r, DefaultMaxLengths[Pileup])
}

// NewPileupParserWithMaxLineLength initiates a new Pileup parser from an
// io.Reader and a user-given maxLineLength.
func NewPileupParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*pileup.Line, *pileup.Header], error) {
return &Parser[*pileup.Line, *pileup.Header]{parserInterface: pileup.NewParser(r, maxLineLength)}, nil
}

/******************************************************************************
Parser higher-level functions
******************************************************************************/

// Next is a parsing primitive that should be used when low-level control is
// needed. It returns the next record/read/line from the parser. On EOF, it
// returns an io.EOF error, though the returned FASTA/FASTQ/SLOW5/Pileup may or
// may not be nil, depending on where the io.EOF is. This should be checked by
// downstream software. Next can only be called as many times as there are
// records in a file, as the parser reads the underlying io.Reader in a
// straight line.
func (p *Parser[Data, Header]) Next() (Data, error) {
return p.parserInterface.Next()
}

// Header is a parsing primitive that should be used when low-level control is
// needed. It returns the header of the parser, which is usually parsed prior
// to the parser being returned by the "NewXXXParser" functions. Unlike the
// Next() function, Header() can be called as many times as needed. Sometimes
// files have useful headers, while other times they do not.
//
// The following file formats do not have a useful header:
//
// FASTA
// FASTQ
// Pileup
//
// The following file formats do have a useful header:
//
// SLOW5
func (p *Parser[Data, Header]) Header() (Header, error) {
return p.parserInterface.Header()
}

// ParseN returns a countN number of records/reads/lines from the parser.
func (p *Parser[Data, Header]) ParseN(countN int) ([]Data, error) {
var records []Data
for counter := 0; counter < countN; counter++ {
record, err := p.Next()
if err != nil {
if errors.Is(err, io.EOF) {
err = nil // EOF not treated as parsing error.
}
return records, err
}
records = append(records, record)
}
return records, nil
}

// Parse returns all records/reads/lines from the parser, but does not include
// the header. It can only be called once on a given parser because it will
// read all the input from the underlying io.Reader before exiting.
func (p *Parser[Data, Header]) Parse() ([]Data, error) {
return p.ParseN(math.MaxInt)
}

// ParseWithHeader returns all records/reads/lines, plus the header, from the
// parser. It can only be called once on a given parser because it will read
// all the input from the underlying io.Reader before exiting.
func (p *Parser[Data, Header]) ParseWithHeader() ([]Data, Header, error) {
header, headerErr := p.Header()
data, err := p.Parse()
if headerErr != nil {
return data, header, headerErr
}
if err != nil {
return data, header, err
}
return data, header, nil
}

/******************************************************************************
Concurrent higher-level functions
******************************************************************************/

// ParseToChannel pipes all records/reads/lines from a parser into a channel,
// then optionally closes that channel. If parsing a single file,
// "keepChannelOpen" should be set to false, which will close the channel once
// parsing is complete. If many files are being parsed to a single channel,
// keepChannelOpen should be set to true, so that an external function can
// close the channel once all parsers are done.
//
// Context can be used to close the parser in the middle of parsing - for
// example, if an error is found in another parser elsewhere and all files
// need to close.
func (p *Parser[Data, Header]) ParseToChannel(ctx context.Context, channel chan<- Data, keepChannelOpen bool) error {
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
record, err := p.Next()
if err != nil {
if errors.Is(err, io.EOF) {
err = nil // EOF not treated as parsing error.
}
if !keepChannelOpen {
close(channel)
}
return err
}
channel <- record
}
}
}

// ManyToChannel is a generic function that implements the ManyXXXToChannel
// functions. It properly does concurrent parsing of many parsers to a
// single channel, then closes that channel. If any of the files fail to
// parse, the entire pipeline exits and returns.
func ManyToChannel[Data io.WriterTo, Header io.WriterTo](ctx context.Context, channel chan<- Data, parsers ...*Parser[Data, Header]) error {
errorGroup, ctx := errgroup.WithContext(ctx)
// For each parser, start a new goroutine to parse data to the channel
for _, p := range parsers {
parser := p // Copy to local variable to avoid loop variable scope issues
errorGroup.Go(func() error {
// Use the parser's ParseToChannel function, but keep the channel open
return parser.ParseToChannel(ctx, channel, true)
})
}
// Wait for all parsers to complete
err := errorGroup.Wait()
close(channel)
return err
}
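
The fan-in pattern behind `ParseToChannel` and `ManyToChannel` can be tried without the poly module. The sketch below is a standard-library-only approximation, not the code from this commit: `lineParser`, `parseToChannel`, `manyToChannel`, and `collect` are hypothetical names, a `sync.WaitGroup` stands in for `errgroup` (so one failing parser does not cancel the others), and the send itself is placed in the `select` so that cancellation is also honored while waiting on a slow consumer.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"io"
	"sort"
	"strings"
	"sync"
)

// lineParser is a hypothetical stand-in for a format parser: Next yields
// one line per call and io.EOF once the input is exhausted.
type lineParser struct{ scanner *bufio.Scanner }

func newLineParser(input string) *lineParser {
	return &lineParser{scanner: bufio.NewScanner(strings.NewReader(input))}
}

func (p *lineParser) Next() (string, error) {
	if p.scanner.Scan() {
		return p.scanner.Text(), nil
	}
	if err := p.scanner.Err(); err != nil {
		return "", err
	}
	return "", io.EOF
}

// parseToChannel sends every record to out until EOF or cancellation.
// EOF is not treated as a parsing error.
func parseToChannel(ctx context.Context, p *lineParser, out chan<- string) error {
	for {
		record, err := p.Next()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case out <- record:
		}
	}
}

// manyToChannel runs one goroutine per parser, closes out once all of them
// have finished, and returns the first non-nil error it finds, if any.
func manyToChannel(ctx context.Context, out chan<- string, parsers ...*lineParser) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(parsers)) // buffered so workers never block
	for _, p := range parsers {
		wg.Add(1)
		go func(p *lineParser) {
			defer wg.Done()
			errs <- parseToChannel(ctx, p, out)
		}(p)
	}
	wg.Wait()
	close(out)
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// collect parses all inputs concurrently and returns the records sorted,
// because arrival order across parsers is not deterministic.
func collect(inputs ...string) ([]string, error) {
	parsers := make([]*lineParser, len(inputs))
	for i, input := range inputs {
		parsers[i] = newLineParser(input)
	}
	out := make(chan string)
	done := make(chan error, 1)
	go func() { done <- manyToChannel(context.Background(), out, parsers...) }()
	var records []string
	for record := range out {
		records = append(records, record)
	}
	sort.Strings(records)
	return records, <-done
}

func main() {
	records, err := collect("a\nb\n", "c\n")
	fmt.Println(records, err)
}
```

The producer runs in its own goroutine while the caller drains the channel. With an unbuffered channel, calling the fan-in function and only then reading from the channel on the same goroutine would deadlock.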