io->bio #339

Closed
wants to merge 67 commits into from
b87947b
Moved io to bio
Koeng101 Aug 24, 2023
5c887a3
fixed io imports
Koeng101 Aug 24, 2023
d8f4b38
Add more generic definitions to bio
Koeng101 Aug 31, 2023
4fb41ff
Update bio/fastq/fastq.go
Koeng101 Sep 1, 2023
2452282
update fasta
Koeng101 Sep 1, 2023
6dda2b9
Merge branch 'ioToBio' of github.com:TimothyStiles/poly into ioToBio
Koeng101 Sep 1, 2023
16fbcbd
add fasta updates and parser
Koeng101 Sep 1, 2023
382a014
made readability improvements
Koeng101 Sep 2, 2023
0bbd05e
changed ParseWithHeader
Koeng101 Sep 2, 2023
eb68f81
removed int64 in reads
Koeng101 Sep 2, 2023
344220c
add more example tests
Koeng101 Sep 2, 2023
03f8b68
gotta update this for this tests!
Koeng101 Sep 2, 2023
6199c43
integrate slow5
Koeng101 Sep 5, 2023
65f0539
have examples covering most of changes
Koeng101 Sep 5, 2023
8ff6da4
removed interfaces
Koeng101 Sep 5, 2023
00732a4
updated with NewXXXParser
Koeng101 Sep 7, 2023
3ce8109
added 3 parsers
Koeng101 Sep 7, 2023
630bd88
added pileup
Koeng101 Sep 7, 2023
df98fe3
add concurrent functions plus better documentation
Koeng101 Sep 9, 2023
fa4d29a
moved svb to ioToBio
Koeng101 Sep 11, 2023
f80b317
Improve tests
Koeng101 Sep 11, 2023
37859a8
make better docs for header
Koeng101 Sep 11, 2023
e24801b
Update bio/fasta/fasta_test.go
Koeng101 Sep 12, 2023
584b73e
changed name of LowLevelParser to parserInterface
Koeng101 Sep 12, 2023
da7118a
Merge branch 'main' into ioToBio
Koeng101 Sep 12, 2023
5e6204f
zw -> zipWriter
Koeng101 Sep 12, 2023
90316d3
remove a identifier from pileup
Koeng101 Sep 12, 2023
7b2cd52
genbank parser now compatible
Koeng101 Sep 13, 2023
9b55fda
writeTo interface now fulfilled
Koeng101 Sep 13, 2023
6655565
make linter happy :)
Koeng101 Sep 13, 2023
11972ae
convert all types to io.WriterTo
Koeng101 Sep 14, 2023
12a4b48
fixed linter issues
Koeng101 Sep 14, 2023
4b50625
handle EOF better
Koeng101 Sep 14, 2023
f44721c
fixed tutorial
Koeng101 Sep 14, 2023
b192fda
fix genbank read error
Koeng101 Sep 14, 2023
3eab1f9
remove io.WriterTo checks
Koeng101 Sep 14, 2023
0edfd1c
fix with cmp.Equal
Koeng101 Sep 16, 2023
34de749
Merge pull request #341 from TimothyStiles/slow5StreamVByte2
Koeng101 Sep 16, 2023
6abe0cd
Merge branch 'main' into ioToBio
Koeng101 Oct 28, 2023
4c61c22
genbank tests merged
Koeng101 Oct 28, 2023
1d23668
sample merge
Koeng101 Oct 28, 2023
56772bb
Merge branch 'main' of github.com:TimothyStiles/poly into ioToBio
Koeng101 Oct 28, 2023
956d26e
make linter happy
Koeng101 Oct 28, 2023
158fcf1
Added generic collections module
abondrn Oct 30, 2023
8862a6c
Switched Feature.Attributes to use multimap
abondrn Oct 30, 2023
ef07e94
Fixed tests
abondrn Oct 30, 2023
cac1e55
Ran linter
abondrn Oct 30, 2023
8025bc2
Added copy methods
abondrn Oct 30, 2023
1f49f9d
Adds new functional test that addresses case where there is a partial…
abondrn Oct 30, 2023
8112866
Ran linter
abondrn Oct 30, 2023
fec8796
Add capability to compute sequence features and marshal en masse
abondrn Oct 30, 2023
9ce9f4f
Add methods to convert polyjson -> genbank
abondrn Oct 30, 2023
89a2ba4
Removed generic collections library in favor of hand-rolled multimap,…
abondrn Oct 30, 2023
b88d7b8
Propogate handrolled multimap to test files
abondrn Oct 30, 2023
b4c3a37
Responded to more comments
abondrn Oct 30, 2023
8b82d7b
Reduced new example genbank file
abondrn Oct 31, 2023
f523651
Resolved lint errors, added test StoreFeatureSequences and fixed unco…
abondrn Oct 31, 2023
1270ec8
Added multimap.go file doc
abondrn Oct 31, 2023
9c322f6
Responded to more comments
abondrn Oct 31, 2023
f124fae
First merge attempt
abondrn Nov 4, 2023
98b6984
Fixed deref issue
abondrn Nov 4, 2023
fc2ca75
Merged updated branch
abondrn Nov 4, 2023
25e0f61
Fixed tests, moved genbank files
abondrn Nov 4, 2023
60abf6d
Fixed fasta docs
abondrn Nov 4, 2023
7e3c812
Added changelog
abondrn Nov 5, 2023
35a5492
Merge pull request #394 from abondrn/ioToBio-genbank
Koeng101 Nov 5, 2023
433df00
added to changelog
Koeng101 Nov 10, 2023
171 changes: 162 additions & 9 deletions bio/bio.go
@@ -4,20 +4,173 @@ Package bio provides utilities for reading and writing sequence data.
package bio

import (
	"bufio"
	"io"
	"os"

	"github.com/TimothyStiles/poly/bio/fasta"
)

-/*
-This package is supposed to be empty and only exists to provide a doc string.
-Otherwise its namespace would collide with Go's native IO package.
-*/
type Format int

const (
	Fasta Format = iota
	Fastq
	Gff
	Genbank
	Slow5
	Pileup
	Uniprot
	Rebase
)

// The DefaultMaxLineLength variables exist for performance reasons. While
// parsing, reading byte-by-byte takes far longer than reading many bytes
// into a buffer. In Go, the default bufio.Scanner token limit is 64 KiB
// (bufio.MaxScanTokenSize). However, many files, especially slow5 files,
// can contain single lines that are much longer. We use the default maximum
// from bufio unless there is a particular reason to use a different number.
const defaultMaxLineLength int = bufio.MaxScanTokenSize // 64 KiB is a magic number often used by the Go stdlib for parsing.
var (
	Fasta_DefaultMaxLineLength   = defaultMaxLineLength
	Fastq_DefaultMaxLineLength   = 8 * 1024 * 1024 // The longest single nanopore sequencing read so far is 4Mb. An 8 MiB buffer should be large enough for any sequencing read.
	Gff_DefaultMaxLineLength     = defaultMaxLineLength
	Genbank_DefaultMaxLineLength = defaultMaxLineLength
	Slow5_DefaultMaxLineLength   = 128 * 1024 * 1024 // 128 MiB, because slow5 lines can be massive: a single read can span many millions of base pairs.
	Pileup_DefaultMaxLineLength  = defaultMaxLineLength
	Uniprot_DefaultMaxLineLength = defaultMaxLineLength
	Rebase_DefaultMaxLineLength  = defaultMaxLineLength
)
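
Note on the buffer sizes above: a minimal sketch, not code from this PR, of why a per-format maximum line length matters. `bufio.Scanner` stops with `bufio.ErrTooLong` when a line does not fit in its buffer, and `Scanner.Buffer` raises that limit. The helper name `countLines` and the sizes passed to it are illustrative assumptions.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// countLines scans input line by line with the given maximum line length.
// Hypothetical helper for illustration only; it is not part of poly.
func countLines(input string, maxLineLength int) (int, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Start small and let the scanner grow the buffer up to maxLineLength.
	// The effective limit is the larger of maxLineLength and the initial
	// capacity (4096 here).
	scanner.Buffer(make([]byte, 0, 4096), maxLineLength)
	count := 0
	for scanner.Scan() {
		count++
	}
	return count, scanner.Err()
}

func main() {
	// One line slightly longer than the 64 KiB default limit.
	long := strings.Repeat("A", bufio.MaxScanTokenSize+1) + "\n"

	// With the default limit the scan fails.
	_, err := countLines(long, bufio.MaxScanTokenSize)
	fmt.Println(errors.Is(err, bufio.ErrTooLong))

	// With an 8 MiB limit the same line is read as one record.
	n, err := countLines(long, 8*1024*1024)
	fmt.Println(n, err)
}
```

The trade-off is memory: the scanner only grows its buffer when a long line requires it, so a large maximum costs nothing for files with short lines.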

-type Parser[T any, TH any] interface {
-	Header() (TH, error)
-	Next() (T, error)
-	MaxLineCount() int64
/******************************************************************************
Aug 30, 2023

Lower level interfaces

******************************************************************************/

type LowLevelParser[DataType fasta.Fasta, DataTypeHeader fasta.Header] interface {
	Header() (DataTypeHeader, int64, error)
	Next() (DataType, int64, error)
}

type Writer interface {
-	Write(io.Writer) error
	Write(io.Writer) (int, error)
}

/******************************************************************************

Higher level parse

******************************************************************************/

type Parser[DataType fasta.Fasta, DataTypeHeader fasta.Header] struct {
	LowLevelParser LowLevelParser[DataType, DataTypeHeader]
}

var emptyParser Parser[fasta.Fasta, fasta.Header] = Parser[fasta.Fasta, fasta.Header]{}

func NewParser(format Format, r io.Reader) (*Parser[fasta.Fasta, fasta.Header], error) {
	var maxLineLength int
	switch format {
	case Fasta:
		maxLineLength = Fasta_DefaultMaxLineLength
	}
	return NewParserWithMaxLine(format, r, maxLineLength)
}

func NewParserWithMaxLine(format Format, r io.Reader, maxLineLength int) (*Parser[fasta.Fasta, fasta.Header], error) {
	switch format {
	case Fasta:
		return &Parser[fasta.Fasta, fasta.Header]{LowLevelParser: fasta.NewParser(r, maxLineLength)}, nil
	}
	return nil, nil
}

/******************************************************************************

Parser read functions

******************************************************************************/

func ReadWithMaxLine(format Format, path string, maxLineLength int) (*Parser[fasta.Fasta, fasta.Header], error) {
	file, err := os.Open(path)
	if err != nil {
		return &emptyParser, err
	}
	return NewParserWithMaxLine(format, file, maxLineLength)
}

func Read(format Format, path string) (*Parser[fasta.Fasta, fasta.Header], error) {
	file, err := os.Open(path)
	if err != nil {
		return &emptyParser, err
	}
	return NewParser(format, file)
}

/******************************************************************************

Parser higher-level functions

******************************************************************************/

func (p *Parser[DataType, DataTypeHeader]) Next() (DataType, int64, error) {
	return p.LowLevelParser.Next()
}

func (p *Parser[DataType, DataTypeHeader]) Header() (DataTypeHeader, int64, error) {
	return p.LowLevelParser.Header()
}

func (p *Parser[DataType, DataTypeHeader]) ParseN(countN int) ([]DataType, error) {
	return nil, nil
}

func (p *Parser[DataType, DataTypeHeader]) Parse() ([]DataType, error) {
	return nil, nil
}

func (p *Parser[DataType, DataTypeHeader]) ParseAll() ([]DataType, DataTypeHeader, error) {
	header, _, err := p.Header()
	if err != nil {
		return []DataType{}, DataTypeHeader{}, err
	}
	data, err := p.Parse()
	if err != nil {
		return []DataType{}, DataTypeHeader{}, err
	}
	return data, header, nil
}

func (p *Parser[DataType, DataTypeHeader]) ParseConcurrent(channel chan<- DataType) error {
	return nil
}
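
Note on `ParseConcurrent`: in this commit it is a stub that returns `nil`. One way such a function could work, sketched under assumptions, is to call `Next` until `io.EOF` and send each record to the channel, closing the channel at the end. The names `sendAll` and `sliceNext` are hypothetical and do not appear in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// sendAll calls next until it returns io.EOF, sending every record to out.
// It closes out when done so that receivers can range over the channel.
// Hypothetical helper: the diff leaves ParseConcurrent unimplemented.
func sendAll[T any](next func() (T, error), out chan<- T) error {
	defer close(out)
	for {
		record, err := next()
		if err != nil {
			if errors.Is(err, io.EOF) {
				return nil // a clean end of input is not an error
			}
			return err
		}
		out <- record
	}
}

// sliceNext returns a next function that yields items in order, then io.EOF.
// It stands in for a real parser's Next method.
func sliceNext[T any](items []T) func() (T, error) {
	i := 0
	return func() (T, error) {
		var zero T
		if i >= len(items) {
			return zero, io.EOF
		}
		item := items[i]
		i++
		return item, nil
	}
}

func main() {
	out := make(chan string)
	errs := make(chan error, 1)
	// The producer runs in its own goroutine; the consumer ranges over out.
	go func() { errs <- sendAll(sliceNext([]string{"ATGC", "GGCC"}), out) }()
	for seq := range out {
		fmt.Println(seq)
	}
	fmt.Println(<-errs)
}
```

Closing the channel inside the producer lets the consumer use a plain `range` loop, while the separate error channel keeps the parse error out of the data stream.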

/******************************************************************************

Writer functions

******************************************************************************/

func WriteAll(data []Writer, header Writer, w io.Writer) error {
	_, err := header.Write(w)
	if err != nil {
		return err
	}
	for _, datum := range data {
		_, err = datum.Write(w)
		if err != nil {
			return err
		}
	}
	return nil
}

func WriteFile(data []Writer, header Writer, path string) error {
	// os.Create opens the file for writing; os.Open would open it read-only.
	file, err := os.Create(path)
	if err != nil {
		return err
	}
	defer file.Close()
	return WriteAll(data, header, file)
}
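
Note on `ParseN` and `Parse`: both are stubs in this commit. A minimal sketch of how a wrapper around a low-level parser could collect records, assuming a simplified interface with no header and no `int64` return value. The types `lowLevelParser`, `lineParser`, and `parser`, and the helper `newLineParser`, are hypothetical stand-ins, not the PR's implementation.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"math"
	"strings"
)

// lowLevelParser mirrors the shape of the diff's LowLevelParser, simplified.
type lowLevelParser[T any] interface {
	Next() (T, error)
}

// lineParser is a toy low-level parser that yields one line per record.
type lineParser struct {
	scanner *bufio.Scanner
}

// Next returns the next line, or io.EOF once the input is exhausted.
func (p *lineParser) Next() (string, error) {
	if p.scanner.Scan() {
		return p.scanner.Text(), nil
	}
	if err := p.scanner.Err(); err != nil {
		return "", err
	}
	return "", io.EOF
}

// parser wraps a low-level parser and adds collection helpers.
type parser[T any] struct {
	low lowLevelParser[T]
}

// ParseN reads up to n records; it stops early, without error, at io.EOF.
func (p *parser[T]) ParseN(n int) ([]T, error) {
	var records []T
	for len(records) < n {
		record, err := p.low.Next()
		if err != nil {
			if errors.Is(err, io.EOF) {
				return records, nil
			}
			return records, err
		}
		records = append(records, record)
	}
	return records, nil
}

// Parse reads every remaining record.
func (p *parser[T]) Parse() ([]T, error) {
	return p.ParseN(math.MaxInt)
}

// newLineParser builds a parser over an in-memory string.
func newLineParser(input string) *parser[string] {
	scanner := bufio.NewScanner(strings.NewReader(input))
	return &parser[string]{low: &lineParser{scanner: scanner}}
}

func main() {
	p := newLineParser("ATGC\nGGCC\nTTAA\n")
	first, _ := p.ParseN(2)
	rest, _ := p.Parse()
	fmt.Println(first, rest)
}
```

Treating `io.EOF` as a normal end of input keeps callers from having to special-case the last record, which appears to be the concern behind the later "handle EOF better" commit.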
8 changes: 4 additions & 4 deletions bio/bio_test.go
@@ -1,13 +1,13 @@
-package bio
package bio_test

import (
	"testing"

	"github.com/TimothyStiles/poly/bio"
	"github.com/TimothyStiles/poly/bio/fasta"
-	"github.com/TimothyStiles/poly/bio/fastq"
)

func TestWriter(t *testing.T) {
-	var _ Writer = &fastq.Fastq{}
-	var _ Writer = &fasta.Fasta{}
	var _ bio.LowLevelParser[fasta.Fasta, fasta.Header] = &fasta.Parser{}
	var _ bio.Writer = &fasta.Fasta{}
}
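
Note on the test above: `var _ Interface = &Type{}` is a compile-time assertion. It creates no runtime behavior; the build simply fails if the type stops satisfying the interface. A self-contained sketch of the idiom follows, where `writer`, `record`, and `render` are hypothetical names chosen for illustration and only the `Write(io.Writer) (int, error)` signature is taken from the diff.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// writer mirrors the single-method Writer interface from the diff.
type writer interface {
	Write(io.Writer) (int, error)
}

// record is a hypothetical FASTA-like type used only for illustration.
type record struct {
	Name     string
	Sequence string
}

// Write emits the record in FASTA layout and reports the bytes written.
func (r *record) Write(w io.Writer) (int, error) {
	return fmt.Fprintf(w, ">%s\n%s\n", r.Name, r.Sequence)
}

// Compile-time check: the build fails if *record stops satisfying writer.
var _ writer = &record{}

// render writes a record into a string and returns the byte count.
func render(r *record) (string, int, error) {
	var sb strings.Builder
	n, err := r.Write(&sb)
	return sb.String(), n, err
}

func main() {
	out, n, err := render(&record{Name: "gene1", Sequence: "ATGC"})
	fmt.Printf("%q %d %v\n", out, n, err)
}
```

Because the assertion is checked by the compiler, it can also live at package level in non-test code; placing it in a test function, as the PR does, keeps it out of the library's own files.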
5 changes: 2 additions & 3 deletions bio/example_test.go
@@ -1,7 +1,6 @@
package bio_test

import (
-	"github.com/TimothyStiles/poly/bio/fasta"
	"github.com/TimothyStiles/poly/bio/genbank"
	"github.com/TimothyStiles/poly/bio/gff"
	"github.com/TimothyStiles/poly/bio/polyjson"
@@ -16,13 +15,13 @@ func Example() {

	gffInput, _ := gff.Read("../data/ecoli-mg1655-short.gff")
	gbkInput, _ := genbank.Read("../data/puc19.gbk")
-	fastaInput, _ := fasta.Read("fasta/data/base.fasta")
	//fastaInput, _ := fasta.Read("fasta/data/base.fasta")
	jsonInput, _ := polyjson.Read("../data/cat.json")

	// Poly can also output these file formats. Every file format has a corresponding Write function.
	_ = gff.Write(gffInput, "test.gff")
	_ = genbank.Write(gbkInput, "test.gbk")
-	_ = fasta.WriteFile(fastaInput, "test.fasta")
	//_ = fasta.WriteFile(fastaInput, "test.fasta")
	_ = polyjson.Write(jsonInput, "test.json")

	// Extra tips: