io->bio #339

Closed
wants to merge 67 commits
Changes from 43 commits
Commits
67 commits
b87947b
Moved io to bio
Koeng101 Aug 24, 2023
5c887a3
fixed io imports
Koeng101 Aug 24, 2023
d8f4b38
Add more generic definitions to bio
Koeng101 Aug 31, 2023
4fb41ff
Update bio/fastq/fastq.go
Koeng101 Sep 1, 2023
2452282
update fasta
Koeng101 Sep 1, 2023
6dda2b9
Merge branch 'ioToBio' of github.com:TimothyStiles/poly into ioToBio
Koeng101 Sep 1, 2023
16fbcbd
add fasta updates and parser
Koeng101 Sep 1, 2023
382a014
made readability improvements
Koeng101 Sep 2, 2023
0bbd05e
changed ParseWithHeader
Koeng101 Sep 2, 2023
eb68f81
removed int64 in reads
Koeng101 Sep 2, 2023
344220c
add more example tests
Koeng101 Sep 2, 2023
03f8b68
gotta update this for this tests!
Koeng101 Sep 2, 2023
6199c43
integrate slow5
Koeng101 Sep 5, 2023
65f0539
have examples covering most of changes
Koeng101 Sep 5, 2023
8ff6da4
removed interfaces
Koeng101 Sep 5, 2023
00732a4
updated with NewXXXParser
Koeng101 Sep 7, 2023
3ce8109
added 3 parsers
Koeng101 Sep 7, 2023
630bd88
added pileup
Koeng101 Sep 7, 2023
df98fe3
add concurrent functions plus better documentation
Koeng101 Sep 9, 2023
fa4d29a
moved svb to ioToBio
Koeng101 Sep 11, 2023
f80b317
Improve tests
Koeng101 Sep 11, 2023
37859a8
make better docs for header
Koeng101 Sep 11, 2023
e24801b
Update bio/fasta/fasta_test.go
Koeng101 Sep 12, 2023
584b73e
changed name of LowLevelParser to parserInterface
Koeng101 Sep 12, 2023
da7118a
Merge branch 'main' into ioToBio
Koeng101 Sep 12, 2023
5e6204f
zw -> zipWriter
Koeng101 Sep 12, 2023
90316d3
remove a identifier from pileup
Koeng101 Sep 12, 2023
7b2cd52
genbank parser now compatible
Koeng101 Sep 13, 2023
9b55fda
writeTo interface now fulfilled
Koeng101 Sep 13, 2023
6655565
make linter happy :)
Koeng101 Sep 13, 2023
11972ae
convert all types to io.WriterTo
Koeng101 Sep 14, 2023
12a4b48
fixed linter issues
Koeng101 Sep 14, 2023
4b50625
handle EOF better
Koeng101 Sep 14, 2023
f44721c
fixed tutorial
Koeng101 Sep 14, 2023
b192fda
fix genbank read error
Koeng101 Sep 14, 2023
3eab1f9
remove io.WriterTo checks
Koeng101 Sep 14, 2023
0edfd1c
fix with cmp.Equal
Koeng101 Sep 16, 2023
34de749
Merge pull request #341 from TimothyStiles/slow5StreamVByte2
Koeng101 Sep 16, 2023
6abe0cd
Merge branch 'main' into ioToBio
Koeng101 Oct 28, 2023
4c61c22
genbank tests merged
Koeng101 Oct 28, 2023
1d23668
sample merge
Koeng101 Oct 28, 2023
56772bb
Merge branch 'main' of github.com:TimothyStiles/poly into ioToBio
Koeng101 Oct 28, 2023
956d26e
make linter happy
Koeng101 Oct 28, 2023
158fcf1
Added generic collections module
abondrn Oct 30, 2023
8862a6c
Switched Feature.Attributes to use multimap
abondrn Oct 30, 2023
ef07e94
Fixed tests
abondrn Oct 30, 2023
cac1e55
Ran linter
abondrn Oct 30, 2023
8025bc2
Added copy methods
abondrn Oct 30, 2023
1f49f9d
Adds new functional test that addresses case where there is a partial…
abondrn Oct 30, 2023
8112866
Ran linter
abondrn Oct 30, 2023
fec8796
Add capability to compute sequence features and marshal en masse
abondrn Oct 30, 2023
9ce9f4f
Add methods to convert polyjson -> genbank
abondrn Oct 30, 2023
89a2ba4
Removed generic collections library in favor of hand-rolled multimap,…
abondrn Oct 30, 2023
b88d7b8
Propogate handrolled multimap to test files
abondrn Oct 30, 2023
b4c3a37
Responded to more comments
abondrn Oct 30, 2023
8b82d7b
Reduced new example genbank file
abondrn Oct 31, 2023
f523651
Resolved lint errors, added test StoreFeatureSequences and fixed unco…
abondrn Oct 31, 2023
1270ec8
Added multimap.go file doc
abondrn Oct 31, 2023
9c322f6
Responded to more comments
abondrn Oct 31, 2023
f124fae
First merge attempt
abondrn Nov 4, 2023
98b6984
Fixed deref issue
abondrn Nov 4, 2023
fc2ca75
Merged updated branch
abondrn Nov 4, 2023
25e0f61
Fixed tests, moved genbank files
abondrn Nov 4, 2023
60abf6d
Fixed fasta docs
abondrn Nov 4, 2023
7e3c812
Added changelog
abondrn Nov 5, 2023
35a5492
Merge pull request #394 from abondrn/ioToBio-genbank
Koeng101 Nov 5, 2023
433df00
added to changelog
Koeng101 Nov 10, 2023

bio/bio.go: 275 additions & 0 deletions
@@ -0,0 +1,275 @@
/*
Package bio provides utilities for reading and writing sequence data.

There are many different biological file formats for different applications.
The poly bio package provides a consistent way to work with each of these file
formats. The underlying data returned by each parser is as raw as we can return
while still being easy to use for downstream applications, and should be
immediately recognizable as the original format.
*/
package bio

import (
"bufio"
"context"
"errors"
"io"
"math"

"github.com/TimothyStiles/poly/bio/fasta"
"github.com/TimothyStiles/poly/bio/fastq"
"github.com/TimothyStiles/poly/bio/genbank"
"github.com/TimothyStiles/poly/bio/pileup"
"github.com/TimothyStiles/poly/bio/slow5"
"golang.org/x/sync/errgroup"
)

// Format is an enum of different parser formats.
type Format int

const (
	Fasta Format = iota
	Fastq
	Genbank
	Slow5
	Pileup
)

// defaultMaxLineLength and DefaultMaxLengths are defined for performance
// reasons. While parsing, reading byte-by-byte takes far, far longer than
// reading many bytes into a buffer at once. In Go, this bufio buffer is
// usually 64kB. However, many files, especially slow5 files, can contain
// single lines that are much longer. We use the default maxLineLength from
// bufio unless there is a particular reason to use a different number.
const defaultMaxLineLength int = bufio.MaxScanTokenSize // 64kB is a magic number often used by the Go stdlib for parsing.

var DefaultMaxLengths = map[Format]int{
	Fasta:   defaultMaxLineLength,
	Fastq:   8 * 1024 * 1024, // The longest single nanopore sequencing read so far is 4Mb. An 8MB buffer should be large enough for any sequencing read.
	Genbank: defaultMaxLineLength,
	Slow5:   128 * 1024 * 1024, // 128MB is used because slow5 lines can be massive, since a single read can be many millions of base pairs.
	Pileup:  defaultMaxLineLength,
}
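
// A minimal sketch of overriding the default: a caller who expects unusually
// long slow5 lines can pass their own maxLineLength to the WithMaxLineLength
// constructor defined further down. The file name and the 256MB figure are
// purely illustrative assumptions, not recommendations.
//
//	file, err := os.Open("reads.slow5") // hypothetical path
//	if err != nil {
//		log.Fatal(err)
//	}
//	defer file.Close()
//	parser, err := bio.NewSlow5ParserWithMaxLineLength(file, 256*1024*1024)
//	if err != nil {
//		log.Fatal(err)
//	}
//	_ = parser // used as in the examples further down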

/******************************************************************************
Aug 30, 2023

Lower level interfaces

******************************************************************************/

// parserInterface is a generic interface that all parsers must support. It is
// very simple, only requiring two functions, Header() and Next(). Header()
// returns the header of the file if there is one: most files, like fasta,
// fastq, and pileup do not contain headers, while others like sam and slow5 do
// have headers. Next() returns a record/read/line from the file format, and
// terminates on an io.EOF error.
//
// Next() may terminate with an io.EOF error with a nil Data or with a
// full Data, depending on where the EOF is in the actual file. A check
// for this is needed at the last Next(), when it returns an io.EOF error. A
// pointer is used to represent the difference between a null Data and an
// empty Data.
type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}
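
// As a sketch of what a format package needs to provide to plug into Parser,
// the hypothetical types below satisfy parserInterface: both the record and
// the header implement io.WriterTo, Header() returns the (possibly empty)
// header, and Next() ends the stream with io.EOF. None of these names exist
// in poly; they only illustrate the shape of the contract.
//
//	type Header struct{ Comment string }
//
//	func (h *Header) WriteTo(w io.Writer) (int64, error) {
//		n, err := w.Write([]byte(h.Comment))
//		return int64(n), err
//	}
//
//	type Record struct{ Line string }
//
//	func (r *Record) WriteTo(w io.Writer) (int64, error) {
//		n, err := w.Write([]byte(r.Line + "\n"))
//		return int64(n), err
//	}
//
//	type LineParser struct{ scanner *bufio.Scanner }
//
//	func (p *LineParser) Header() (*Header, error) { return &Header{}, nil }
//
//	func (p *LineParser) Next() (*Record, error) {
//		if !p.scanner.Scan() {
//			if err := p.scanner.Err(); err != nil {
//				return nil, err
//			}
//			return nil, io.EOF
//		}
//		return &Record{Line: p.scanner.Text()}, nil
//	}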

/******************************************************************************

Higher level parse

******************************************************************************/
Koeng101 marked this conversation as resolved.
Show resolved Hide resolved

// Parser is a generic bioinformatics file parser. It wraps a parserInterface
// and implements useful functions on top of it, such as Parse(),
// ParseToChannel(), and ParseWithHeader().
type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

// NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
	return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}

// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
	return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}
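
// A minimal usage sketch for the constructors above, assuming a hypothetical
// "sequences.fasta" file on disk and the os/log/fmt packages on the caller's
// side:
//
//	file, err := os.Open("sequences.fasta")
//	if err != nil {
//		log.Fatal(err)
//	}
//	defer file.Close()
//	parser, err := bio.NewFastaParser(file)
//	if err != nil {
//		log.Fatal(err)
//	}
//	records, err := parser.Parse()
//	if err != nil {
//		log.Fatal(err)
//	}
//	fmt.Println(len(records), "records parsed")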

// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
	return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}

// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
	return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
}

// NewGenbankParser initiates a new Genbank parser from an io.Reader.
func NewGenbankParser(r io.Reader) (*Parser[*genbank.Genbank, *genbank.Header], error) {
	return NewGenbankParserWithMaxLineLength(r, DefaultMaxLengths[Genbank])
}

// NewGenbankParserWithMaxLineLength initiates a new Genbank parser from an
// io.Reader and a user-given maxLineLength.
func NewGenbankParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*genbank.Genbank, *genbank.Header], error) {
	return &Parser[*genbank.Genbank, *genbank.Header]{parserInterface: genbank.NewParser(r, maxLineLength)}, nil
}

// NewSlow5Parser initiates a new SLOW5 parser from an io.Reader.
func NewSlow5Parser(r io.Reader) (*Parser[*slow5.Read, *slow5.Header], error) {
	return NewSlow5ParserWithMaxLineLength(r, DefaultMaxLengths[Slow5])
}

// NewSlow5ParserWithMaxLineLength initiates a new SLOW5 parser from an
// io.Reader and a user-given maxLineLength.
func NewSlow5ParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*slow5.Read, *slow5.Header], error) {
	parser, err := slow5.NewParser(r, maxLineLength)
	return &Parser[*slow5.Read, *slow5.Header]{parserInterface: parser}, err
}

// NewPileupParser initiates a new Pileup parser from an io.Reader.
func NewPileupParser(r io.Reader) (*Parser[*pileup.Line, *pileup.Header], error) {
	return NewPileupParserWithMaxLineLength(r, DefaultMaxLengths[Pileup])
}

// NewPileupParserWithMaxLineLength initiates a new Pileup parser from an
// io.Reader and a user-given maxLineLength.
func NewPileupParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*pileup.Line, *pileup.Header], error) {
	return &Parser[*pileup.Line, *pileup.Header]{parserInterface: pileup.NewParser(r, maxLineLength)}, nil
}

/******************************************************************************

Parser higher-level functions

******************************************************************************/

// Next is a parsing primitive that should be used when low-level control is
// needed. It returns the next record/read/line from the parser. On EOF, it
// returns an io.EOF error, though the returned FASTA/FASTQ/SLOW5/Pileup may or
// may not be nil, depending on where the io.EOF is. This should be checked by
// downstream software. Next can only be called as many times as there are
// records in a file, as the parser reads the underlying io.Reader in a
// straight line.
func (p *Parser[Data, Header]) Next() (Data, error) {
	return p.parserInterface.Next()
}
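
// A sketch of the low-level loop Next() is designed for. The parser is assumed
// to be built with one of the NewXXXParser constructors above, and process()
// stands in for whatever the caller does with each record:
//
//	for {
//		record, err := parser.Next()
//		if err != nil {
//			if errors.Is(err, io.EOF) {
//				if record != nil {
//					process(record) // the last record can arrive together with io.EOF
//				}
//				break
//			}
//			log.Fatal(err)
//		}
//		process(record)
//	}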

// Header is a parsing primitive that should be used when low-level control is
// needed. It returns the header of the parser, which is usually parsed prior
// to the parser being returned by the "NewXXXParser" functions. Unlike the
// Next() function, Header() can be called as many times as needed. Sometimes
// files have useful headers, while other times they do not.
//
// The following file formats do not have a useful header:
//
// FASTA
// FASTQ
// Pileup
//
// The following file formats do have a useful header:
//
// SLOW5
func (p *Parser[Data, Header]) Header() (Header, error) {
	return p.parserInterface.Header()
}

// ParseN returns a countN number of records/reads/lines from the parser.
func (p *Parser[Data, Header]) ParseN(countN int) ([]Data, error) {
	var records []Data
	for counter := 0; counter < countN; counter++ {
		record, err := p.Next()
		if err != nil {
			if errors.Is(err, io.EOF) {
				err = nil // EOF not treated as parsing error.
			}
			return records, err
		}
		records = append(records, record)
	}
	return records, nil
}
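
// A sketch of reading a large file in fixed-size batches with ParseN; the
// batch size of 1000 is an arbitrary illustrative value and the parser is
// assumed to be built as above:
//
//	for {
//		batch, err := parser.ParseN(1000)
//		if err != nil {
//			log.Fatal(err)
//		}
//		if len(batch) == 0 {
//			break // nothing left to read
//		}
//		// process the batch here before requesting the next one
//	}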

// Parse returns all records/reads/lines from the parser, but does not include
// the header. It can only be called once on a given parser because it will
// read all the input from the underlying io.Reader before exiting.
func (p *Parser[Data, Header]) Parse() ([]Data, error) {
	return p.ParseN(math.MaxInt)
}

// ParseWithHeader returns all records/reads/lines, plus the header, from the
// parser. It can only be called once on a given parser because it will read
// all the input from the underlying io.Reader before exiting.
func (p *Parser[Data, Header]) ParseWithHeader() ([]Data, Header, error) {
	header, headerErr := p.Header()
	data, err := p.Parse()
	if headerErr != nil {
		return data, header, headerErr
	}
	if err != nil {
		return data, header, err
	}
	return data, header, nil
}
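
// A sketch of ParseWithHeader with a SLOW5 file, where the header carries real
// information; "reads.slow5" is a hypothetical path:
//
//	file, err := os.Open("reads.slow5")
//	if err != nil {
//		log.Fatal(err)
//	}
//	defer file.Close()
//	parser, err := bio.NewSlow5Parser(file)
//	if err != nil {
//		log.Fatal(err)
//	}
//	reads, header, err := parser.ParseWithHeader()
//	if err != nil {
//		log.Fatal(err)
//	}
//	_ = header // the header satisfies io.WriterTo, so it can be written back out
//	fmt.Println(len(reads), "reads parsed")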

/******************************************************************************

Concurrent higher-level functions

******************************************************************************/

// ParseToChannel pipes all records/reads/lines from a parser into a channel,
// then optionally closes that channel. If parsing a single file,
// "keepChannelOpen" should be set to false, which will close the channel once
// parsing is complete. If many files are being parsed to a single channel,
// keepChannelOpen should be set to true, so that an external function can
// close the channel once all parsers are done.
//
// Context can be used to close the parser in the middle of parsing - for
// example, if an error is found in another parser elsewhere and all files
// need to close.
func (p *Parser[Data, Header]) ParseToChannel(ctx context.Context, channel chan<- Data, keepChannelOpen bool) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			record, err := p.Next()
			if err != nil {
				if errors.Is(err, io.EOF) {
					err = nil // EOF not treated as parsing error.
				}
				if !keepChannelOpen {
					close(channel)
				}
				return err
			}
			channel <- record
		}
	}
}
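
// A sketch of consuming ParseToChannel from another goroutine; the channel
// capacity of 100 and the FASTQ parser choice are illustrative, and the parser
// is assumed to be built as above:
//
//	channel := make(chan *fastq.Read, 100)
//	go func() {
//		if err := parser.ParseToChannel(context.Background(), channel, false); err != nil {
//			log.Println(err)
//		}
//	}()
//	for read := range channel {
//		_ = read // process each read as it arrives
//	}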

// ManyToChannel is a generic function that implements the ManyXXXToChannel
// functions. It properly does concurrent parsing of many parsers to a
// single channel, then closes that channel. If any of the files fail to
// parse, the entire pipeline exits and returns.
func ManyToChannel[Data io.WriterTo, Header io.WriterTo](ctx context.Context, channel chan<- Data, parsers ...*Parser[Data, Header]) error {
	errorGroup, ctx := errgroup.WithContext(ctx)
	// For each parser, start a new goroutine to parse data to the channel
	for _, p := range parsers {
		parser := p // Copy to local variable to avoid loop variable scope issues
		errorGroup.Go(func() error {
			// Use the parser's ParseToChannel function, but keep the channel open
			return parser.ParseToChannel(ctx, channel, true)
		})
	}
	// Wait for all parsers to complete
	err := errorGroup.Wait()
	close(channel)
	return err
}
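
// A sketch of merging two FASTQ files into one channel with ManyToChannel; the
// file names are hypothetical and error handling is abbreviated:
//
//	file1, _ := os.Open("run1.fastq")
//	file2, _ := os.Open("run2.fastq")
//	parser1, _ := bio.NewFastqParser(file1)
//	parser2, _ := bio.NewFastqParser(file2)
//	channel := make(chan *fastq.Read, 100)
//	result := make(chan error, 1)
//	go func() {
//		result <- bio.ManyToChannel(context.Background(), channel, parser1, parser2)
//	}()
//	for read := range channel {
//		_ = read // reads from both files are interleaved here
//	}
//	if err := <-result; err != nil {
//		log.Fatal(err)
//	}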