Error with AnchoredGTFDl #102

PelFritz · 2021-08-31T13:04:02Z

Hi,
I am using AnchoredGTFDl to extract promoter sequences. However, when I call load_all() I get the following error:
"ValueError: all input arrays must have the same shape".

My assumption is that some gene coordinates are too close to the end of the chromosome, so the extracted sequence is shorter than the requested length. My code is below:

```python
import numpy as np
from kipoiseq.dataloaders import AnchoredGTFDl

fasta_path = 'Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna.toplevel.fa'
gtf_path = 'Zea_mays.Zm-B73-REFERENCE-NAM-5.0.51.gtf'

dl = AnchoredGTFDl(gtf_path, fasta_path, num_upstream=1000, num_downstream=500,
                   gtf_filter='gene_biotype == "protein_coding"')

data = dl.load_all()
```

As a workaround I used the code below, but I don't know whether this is okay or whether there is a function that checks the extracted sequence length automatically.

```python
sequence = []
gene_id = []
for seq in dl:
    # keep only full-length extractions (1000 upstream + 500 downstream)
    if len(seq['inputs']) == 1500:
        gene_id.append(seq['metadata']['gene_id'])
        sequence.append(seq['inputs'])

sequence = np.array(sequence)
print(sequence.shape)
```

Is there some way to assert sequence length to be the same?

Hoeze · 2021-09-02T13:44:27Z

Hi @PelFritz, thanks for posting this issue.

I think there are two valid solutions for this issue:

1. Pad sequences with N's
2. Filter too short sequences

This should be done here, I guess:

kipoiseq/kipoiseq/dataloaders/sequence.py, line 512 in b5c9b5f:

`sequence = self._fa.extract(interval)`

@Karollus What's your opinion?

Karollus · 2021-10-13T09:32:44Z

Sorry for the very late reply, I somehow missed this.

I think the most feasible solution (without changing the behaviour of the extract function, which would impact a lot of dataloaders) is to pad or filter afterwards. This is a bit ugly, since it runs a check for every extracted sequence, but I see no other real solution. I think N-padding is probably better than filtering: the user can filter relatively easily by excluding genes too close to the chromosome end, whereas padding is harder for the user to achieve because it requires editing the fasta. So providing padding seems more useful. @Hoeze, does that sound reasonable?

I can try to prototype it in the next few days.
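For illustration, a minimal sketch of what padding after extraction could look like, assuming the extracted sequence is a plain Python string. `pad_with_n` and its `side` parameter are hypothetical names, not part of kipoiseq. Which side to pad depends on where the truncation happened (and on strand), so the sketch leaves that choice to the caller.

```python
def pad_with_n(seq: str, expected_len: int, side: str = "right") -> str:
    """Pad a too-short sequence with N's up to expected_len.

    Hypothetical helper, not part of kipoiseq.
    """
    if len(seq) > expected_len:
        raise ValueError(
            f"sequence is longer than expected: {len(seq)} > {expected_len}"
        )
    if side == "right":
        # truncation at the 3' end of the returned sequence
        return seq.ljust(expected_len, "N")
    if side == "left":
        # truncation at the 5' end of the returned sequence
        return seq.rjust(expected_len, "N")
    raise ValueError(f"side must be 'left' or 'right', got {side!r}")
```

With the settings from the original post, this would be called as `pad_with_n(sequence, 1000 + 500)` on each extracted sequence; full-length sequences pass through unchanged.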

Hoeze · 2021-10-14T21:04:58Z

@Karollus maybe we can have an additional flag in the dataloader that chooses the behavior:
a) raise an error
b) log a warning and just remove the sequence
c) silently pad the sequence
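The three options could be sketched roughly as follows. This is only an illustration under assumptions: `enforce_length` and the `on_short` flag are hypothetical names, not an existing kipoiseq API, sequences are treated as plain strings, and padding is applied on the right for simplicity.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def enforce_length(seq: str, expected_len: int,
                   on_short: str = "raise") -> Optional[str]:
    """Apply one of three policies to a sequence shorter than expected_len.

    Hypothetical helper, not part of kipoiseq.
    Returns the (possibly padded) sequence, or None if it should be dropped.
    """
    if on_short not in ("raise", "drop", "pad"):
        raise ValueError(f"unknown on_short value: {on_short!r}")
    if len(seq) > expected_len:
        raise ValueError(f"sequence longer than expected: {len(seq)}")
    if len(seq) == expected_len:
        return seq
    if on_short == "raise":
        # a) raise an error
        raise ValueError(
            f"extracted {len(seq)} bases, expected {expected_len}"
        )
    if on_short == "drop":
        # b) log a warning and signal that the sequence should be removed
        logger.warning("dropping sequence of length %d (expected %d)",
                       len(seq), expected_len)
        return None
    # c) silently pad with N's
    return seq.ljust(expected_len, "N")
```

A caller iterating over the dataloader would skip records for which the helper returns `None`, which keeps all remaining sequences at the same length so they can be stacked.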

haimasree assigned Hoeze and Karollus Aug 31, 2021
