cannot get accurate protein sequences from the gff file #37

Open
ATPs opened this issue Oct 24, 2021 · 3 comments

ATPs commented Oct 24, 2021

I tried to extract the CDS sequences from the GFF file:

gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3

However, when I translate the CDS to proteins, the open reading frame is incorrect for many sequences. Is there a way to download the predicted protein sequences?
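For anyone who wants to quantify the problem, here is a minimal Python sketch that flags CDS sequences that do not form a complete ORF. It assumes the standard nuclear genetic code (so mitochondrial genes may be misreported) and treats ATG as the only valid start codon. The function name `orf_problems` and the problem labels are hypothetical, chosen only for illustration.

```python
# Stop codons of the standard nuclear genetic code (assumption: not mitochondrial).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_problems(cds):
    """Return a list of reasons why `cds` is not a complete, proper ORF.

    An empty list means the sequence starts with ATG, has a length that is a
    multiple of three, ends with a stop codon, and has no in-frame stop
    before the end.
    """
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    # Complete codons read in frame 0 from the first base.
    codons = [seq[3 * i:3 * i + 3] for i in range(len(seq) // 3)]
    # The last codon is only exempt when it can be the terminal stop.
    inner = codons[:-1] if len(seq) % 3 == 0 else codons
    if any(codon in STOP_CODONS for codon in inner):
        problems.append("internal_stop")
    return problems
```

Running this over each record in `cds.fa` would give a rough count of how many transcripts are affected and in what way, for example whether the start or the end of the CDS is missing.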

mhaukness-ucsc commented Dec 7, 2021

Hi @ATPs ,

I created a file with the predicted protein sequences here that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta

@mhaukness-ucsc

These incorrect open reading frames are expected from the GENCODE annotation; they are not errors. For example, many GENCODE transcripts carry tags such as cds_end_NF and cds_start_NF, which mark fragments that were annotated (probably from ESTs) but lack sufficient evidence. These are propagated into our gene annotations. If you want to include only transcripts with full, proper ORFs, ignore any transcript tagged proper_orf=False in the gff3.
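As a rough illustration of that filtering step, the Python sketch below drops every feature tagged proper_orf=False, together with any feature whose Parent is such a feature. It assumes the tag sits in column 9 of the transcript rows and that child features (exons, CDS) point to their transcript through the usual GFF3 ID/Parent attributes; I have not checked this against the actual file layout. The function names `parse_attributes` and `filter_proper_orf` are hypothetical.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def filter_proper_orf(lines):
    """Return GFF3 lines without proper_orf=False features and their children."""
    lines = list(lines)

    # First pass: collect IDs of features tagged proper_orf=False.
    rejected = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("proper_orf") == "False" and "ID" in attrs:
            rejected.add(attrs["ID"])

    # Second pass: keep comments and every feature not tied to a rejected ID.
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            kept.append(line)
            continue
        attrs = parse_attributes(cols[8])
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("proper_orf") == "False":
            continue
        if any(parent in rejected for parent in parents):
            continue
        kept.append(line)
    return kept
```

The filtered lines could be written to a new gff3 file and passed to the same gffread command as before. Note that gene rows are kept even when all of their transcripts are removed, which should be harmless when only CDS sequences are extracted.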
