Design #1

Open
Avsecz opened this issue Oct 1, 2018 · 17 comments

Avsecz commented Oct 1, 2018

Which dataloaders to use:

  • DNA sequence based
  • bigwig-based (useful for DNA + accessibility)
  • Efficient bed-file reader

API

  • use the torch.utils.data.Dataset API (see the sketch at the end of this comment)

Nice to have

  • transforms
    • reverse-complementation

Integration with Kipoi

  • see how to expand default_dataloader
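
For concreteness, here is a minimal sketch of what a torch.utils.data.Dataset-based interval dataloader could look like. Class and file names are placeholders, not the final API; it only assumes pyfaidx and plain bed files:

# Minimal sketch of a bed + fasta interval dataset (illustrative names only).
import pyfaidx
from torch.utils.data import Dataset


class IntervalSeqDataset(Dataset):
    """Return the DNA string for each interval in a bed file."""

    def __init__(self, bed_file, fasta_file):
        # bed columns: chrom, start (0-based), end
        self.intervals = []
        with open(bed_file) as f:
            for line in f:
                chrom, start, end = line.rstrip().split("\t")[:3]
                self.intervals.append((chrom, int(start), int(end)))
        self.fasta = pyfaidx.Fasta(fasta_file)

    def __len__(self):
        return len(self.intervals)

    def __getitem__(self, idx):
        chrom, start, end = self.intervals[idx]
        # pyfaidx record slicing is 0-based and end-exclusive
        seq = str(self.fasta[chrom][start:end])
        return {"inputs": seq,
                "metadata": {"chrom": chrom, "start": start, "end": end}}
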
Avsecz self-assigned this Oct 1, 2018
lzamparo commented Oct 1, 2018

Two suggestions:

  • from our previous thread (#330 in kipoi/kipoi) re: DeepBind dataloaders, I would +1 pyfaidx for dealing with fasta indexing. It has fewer dependencies than GenomeLake (sp?).
  • the actual encoding from concise was good, but maybe it could be factored out of the package? Installing the whole concise package just to encode DNA was a bit overkill.

Avsecz commented Oct 1, 2018

Agree. We should also add proper handling of variable widths etc.

I will indeed factor out the one-hot encoder from concise for the beginning to keep things simple. It will not be as fast as genomelake, which uses Cython to compile the one-hot-encode function.
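
A pure-Python/NumPy version of that encoder could look roughly like this (just a sketch, not the concise or genomelake implementation):

import numpy as np

# Rough pure-NumPy one-hot encoder sketch.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq, alphabet=ALPHABET):
    """Encode a DNA string as a (len(seq), 4) float array; unknown bases (e.g. N) stay all-zero."""
    arr = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        idx = alphabet.get(base)
        if idx is not None:
            arr[i, idx] = 1
    return arr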

Do you think one should install all the dependencies for all the dataloaders in the dataloaders package or should these be installed on the fly?

Avsecz commented Oct 1, 2018

@lzamparo what was your issue once again with genomelake - pysam?

Avsecz commented Oct 1, 2018

Migrating from pysam to pyfaidx: kundajelab/genomelake#19

lzamparo commented Oct 2, 2018

@Avsecz my issue was that genomelake/pysam could not handle a fasta file that contains a collection of intervals with their corresponding subsequences. It expected each record to be an entire chromosome from which the subsequences could be extracted.

lzamparo commented Oct 2, 2018

I will indeed factor out the one-hot encoder from concise for the beginning to keep things simple. It will not be as fast as genomelake, which uses Cython to compile the one-hot-encode function.

Did you do profiling of cythonized versus pure python one-hot encoding? How much of a speedup did it provide?

Do you think one should install all the dependencies for all the dataloaders in the dataloaders package or should these be installed on the fly?

I'm not sure. I think for users, it would be better to have dependencies installed on an as-needed basis. However, this might be more work for you, so I understand if you'd want to bundle all dependencies together when they install a kipoi/dataloaders package (via conda I imagine?).

Avsecz commented Oct 2, 2018

Did you do profiling of cythonized versus pure python one-hot encoding? How much of a speedup did it provide?

Yup, I did the benchmarks. For 1kb it's 300us vs 50us, and for 10kb it's 3ms vs 0.5ms, hence a 6x speedup. I really want to make these dataloaders as fast as possible so that they also become the default for people training models.
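
For reference, that kind of micro-benchmark can be reproduced along these lines, reusing the one_hot_encode sketch from the earlier comment (a Cython variant would be timed the same way; this is illustrative only, not the original benchmark script):

import random
import timeit

# Rough micro-benchmark sketch for one-hot encoding speed.
seq_1kb = "".join(random.choice("ACGT") for _ in range(1000))
seq_10kb = "".join(random.choice("ACGT") for _ in range(10000))

for name, seq in [("1kb", seq_1kb), ("10kb", seq_10kb)]:
    t = timeit.timeit(lambda: one_hot_encode(seq), number=100) / 100
    print("{}: {:.0f} us per call".format(name, t * 1e6))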

@Avsecz my issue was that genomelake/pysam could not handle a fasta file that contains a collection of intervals with their corresponding subsequences. It expected each record to be an entire chromosome from which the subsequences could be extracted.

I migrated genomelake to pyfaidx. I need to fix the unit tests and then it should work. Could you please send me a slimmed-down version of your fasta file (containing a single sequence of length ~100) and the expected result so that I can add it to the unit tests? Migrating to pyfaidx will also solve the installation issues - if the conda channels were in the wrong order, an (old) pysam from anaconda was chosen.

I'm not sure. I think for users, it would be better to have dependencies installed on an as-needed basis. However, this might be more work for you, so I understand if you'd want to bundle all dependencies together when they install a kipoi/dataloaders package (via conda I imagine?).

OK. I think that's a good idea, especially if the set of dataloaders and their dependencies starts growing. It's actually not more work; we need to annotate each dataloader as in dataloader.yaml anyway.

krrome commented Oct 2, 2018

Thanks for the discussion - I am currently implementing a first draft of the dataloaders repo, and I think my suggestion agrees with many of your points. I have opted to:

  • write my own FastaExtractor class and call it FastaReader to avoid confusion. This is necessary as some models (e.g. lsgkm-SVM) need string inputs, others one-hot encoded input, etc. It is essentially a wrapper around pyfaidx that operates on pybedtools.Interval objects.
  • the string sequence output can be converted by transformer methods, one of which is onehot_trafo; another will be char_list_trafo:
    • onehot_trafo is at the moment a three-liner which can at any point be replaced with the onehot method in genomelake. I didn't do that to avoid a genomelake dependency just for this one function.
    • char_list_trafo converts the sequence string returned by FastaReader into a list of characters, as required by some models.
  • For one-hot encoded sequences I define a Reshaper class that can add "dummy" dimensions and swap the order of the sequence and alphabet axes as required by the model.

Using these definitions, I get a neat and compact definition of dataloader classes that can cater to one-hot encoded models with all sorts of different input shapes and axis orders, string-sequence models, etc. See https://github.com/kipoi/dataloaders/blob/first_draft/kipoi_dataloaders/sequence_based.py. Please feel free to take a look at the branch https://github.com/kipoi/dataloaders/tree/first_draft and give feedback :)
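
For readers who don't want to open the branch, the idea is roughly the following (a simplified sketch with hypothetical signatures, not the actual first_draft code):

import numpy as np

# Simplified sketch of the transformer/reshaper idea.

def char_list_trafo(seq):
    """Turn 'ACGT' into ['A', 'C', 'G', 'T'] for models that expect token lists."""
    return list(seq)

def onehot_trafo(seq, alphabet="ACGT"):
    """String -> (len(seq), 4) one-hot array; could later be swapped for genomelake's encoder."""
    arr = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = alphabet.find(base)
        if j >= 0:
            arr[i, j] = 1
    return arr

class Reshaper:
    """Add 'dummy' dimensions and swap the sequence/alphabet axes as required by the model."""

    def __init__(self, dummy_axis=None, seq_axis_last=False):
        self.dummy_axis = dummy_axis
        self.seq_axis_last = seq_axis_last

    def __call__(self, arr):
        if self.seq_axis_last:
            arr = arr.swapaxes(-1, -2)  # (seq, alphabet) -> (alphabet, seq)
        if self.dummy_axis is not None:
            arr = np.expand_dims(arr, self.dummy_axis)
        return arr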

krrome commented Oct 3, 2018

@lzamparo btw, I am more than happy to use a fasta-reading package that has the functionality you requested, but pyfaidx does not have it - e.g. requesting the sequence for the genomic interval chr15:98503515-98503517 given the fasta example file in your gist in kipoi/kipoi#330. pyfaidx interprets the entire string after the ">" as the sequence/chromosome name. If you can point me to the fasta reader you have in mind that can handle this the way you specified, and if that reader complies with the fasta standard, then I am more than happy to replace pyfaidx with it.

This is what happens when I use pyfaidx:

import pyfaidx
ff = pyfaidx.Fasta("tests/data/subseq_sample.fasta")
ff.get_seq("chr15", 98503515, 98503517)

Traceback (most recent call last):
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 628, in from_file
    i = self.index[rname]
KeyError: 'chr15'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 631, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr15 does not exist! Please check your FASTA file.

on this fasta file:

>chr15:98503514-98503814
TGGCTCCTGCAAAGTCCTTACGCTGCTAGAAACCCCGCCCCCGTTCGTAACTATGGGAACGGCTGACACGCTGAGAGCCG

lzamparo commented Oct 5, 2018

@Avsecz Here's a sample .fasta file in a gist, along with the test interval and expected output

lzamparo commented Oct 5, 2018

@krrome I've not used get_seq before, but I used the dict interface of pyfaidx.Fasta to do slicing on fasta records from a file organized like the one in my original gist you reference.

There is one snag: you need to convert from the absolute indexing of the interval to relative indexing of the bases within the record, but that's pretty fast. Have a look at the TruncSeqDataset dataloader I made; you can see how I'm taking subsequences in __getitem__.
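
In other words, something along these lines (a sketch of the approach only, not the actual TruncSeqDataset code; it assumes pyfaidx keeps the full "chrom:start-end" header as the record name, as noted above, and glosses over 0- vs 1-based conventions):

import pyfaidx

fasta = pyfaidx.Fasta("subseq_sample.fasta")

# the record name encodes its own coordinates
record_name = "chr15:98503514-98503814"
record_start = int(record_name.split(":")[1].split("-")[0])

# absolute genomic interval we actually want
query_start, query_end = 98503515, 98503517

# convert absolute coordinates to coordinates relative to the record
rel_start = query_start - record_start
rel_end = query_end - record_start
subseq = str(fasta[record_name][rel_start:rel_end])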

krrome commented Oct 5, 2018

@lzamparo thanks for the link. I could add your way of querying in case the default get_seq throws a FetchError, but the problem I can see there is that it will only work if the bed query is matched by a fasta file entry. In any case, if your suggestion solves your problem, we can add it to the default dataloader, if @Avsecz agrees.
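
That fallback could look roughly like this (just a sketch of the idea; fetch_seq is a hypothetical helper, not a committed implementation):

import pyfaidx

def fetch_seq(fasta, chrom, start, end):
    """Try normal chromosome-level fetching first, then fall back to an interval-named record."""
    try:
        # 1-based, inclusive coordinates, as pyfaidx get_seq expects
        return str(fasta.get_seq(chrom, start, end))
    except pyfaidx.FetchError:
        # only works if the bed query exactly matches a fasta entry,
        # e.g. a record named "chr15:98503515-98503517"
        return str(fasta["{}:{}-{}".format(chrom, start, end)])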

lzamparo commented Oct 5, 2018

... but the problem I can see there is that it will only work if the bed query is matched by a fasta file entry.

True, I ensured there would be correspondence by construction. Though I think it can handle the other common use case, where you have one fasta file per chromosome and the intervals can be taken as a subset without any conversion to relative coordinates. It's up to you to decide how to handle the case where an interval has no match in the provided fasta.

Avsecz commented Oct 5, 2018

Great! Thanks @lzamparo. I'll have a closer look at your implementation tomorrow and add the provided files to the unit tests.

Update

  • I'm almost done writing the transforms module that contains all the useful pre-processing functions like one-hot encoding, tokenizing, or sequence trimming. I re-factored the one-hot-encoding functions from concise and also imported the fast one-hot encoding from genomelake (for the DNA case).
  • @krrome @lzamparo do you frequently use any other functions in addition to those in https://github.com/kipoi/kipoiseq/blob/master/kipoiseq/transforms/functional.py?
    • I was thinking of also adding the following two transforms (rough sketches after this list):
      • k-mer or gapped k-mer counter
      • simple reverse complementation function
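
Rough sketches of those two transforms, for illustration only (not the kipoiseq implementations):

from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA string; characters outside ACGT pass through unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_counts(seq, k=6):
    """Count (ungapped) k-mer occurrences; a gapped variant would additionally mask positions."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))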

lzamparo commented Oct 9, 2018

@Avsecz +1 for both transforms you suggest above; I can think of some handy use cases for both. A random substring transform might also be useful: I can imagine a scenario where you are working with longer sequences, the property you're trying to predict is valid over the whole length of the sequence, and you want to use this for data augmentation.
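
For example, something like this random-crop transform (hypothetical, just to illustrate the augmentation idea):

import random

def random_substring(seq, length):
    """Return a random contiguous substring of the given length (or the whole sequence if it is shorter)."""
    if len(seq) <= length:
        return seq
    start = random.randint(0, len(seq) - length)
    return seq[start:start + length]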

Avsecz commented Oct 9, 2018

@lzamparo what's the motivation for storing chromosome subsets in the fasta file instead of using intervals to query the sequence directly?

>chr15:98503514-98503614
TGGCTCCTGCAAAGTCCTTACGCTGCTAGAAACCCCGCCCCCGTTCGTAACTATGGGAACGGCTGACACGCTGAGAGCCG
GAGATAACCCTGTTCCGCCC

Avsecz commented Oct 9, 2018

I've started a new issue to keep track of the additional dataloaders: #8. The package is ready and on PyPI (pip install kipoiseq). I also created a small colab notebook: link.
