cannot get accurate protein sequences from the gff file #37

ATPs · 2021-10-24T08:45:21Z

I tried to extract the CDS sequences from the gff file.

gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3

However, when I translate the CDS to proteins, the open reading frame is not correct for quite a few sequences. Is there a way to download the predicted protein sequences?

mhaukness-ucsc · 2021-12-07T05:34:46Z

Hi @ATPs ,

I created a file with the predicted protein sequences here that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta

mhaukness-ucsc · 2021-12-07T21:31:56Z

These incorrect open reading frames are to be expected from the GENCODE annotation; they aren't errors. For example, many GENCODE transcripts carry tags such as cds_end_NF and cds_start_NF, which mark fragments that are annotated (probably from ESTs) but lack sufficient evidence. These are propagated down into our gene annotations. If you want only transcripts with full, proper ORFs, you can ignore any transcript tagged proper_orf=False in the gff3.
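A minimal sketch of that filtering step in Python, in case it is useful. It assumes the proper_orf=False tag appears in the attributes column of transcript-level lines and that child features (exons, CDS) point at their transcript through the standard GFF3 Parent attribute; both are assumptions about the file layout, not something confirmed here. The function names are hypothetical.

```python
def parse_attributes(field):
    """Parse a GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def improper_ids(gff_lines):
    """Collect the IDs of features tagged proper_orf=False (assumed tag location)."""
    ids = set()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("proper_orf") == "False" and "ID" in attrs:
            ids.add(attrs["ID"])
    return ids


def filter_proper_orf(gff_lines):
    """Drop features tagged proper_orf=False and any feature whose Parent is one of them."""
    lines = list(gff_lines)
    bad = improper_ids(lines)
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) == 9:
            attrs = parse_attributes(cols[8])
            parents = attrs.get("Parent", "").split(",")
            if (
                attrs.get("proper_orf") == "False"
                or attrs.get("ID") in bad
                or any(p in bad for p in parents)
            ):
                continue
        # Comment lines, directives, and all remaining features are kept as-is.
        kept.append(line)
    return kept
```

To use it, read the gff3 into a list of lines, write the result of filter_proper_orf to a new file, and pass that file to the same gffread command from the original post. Gene-level lines are left in place even if every transcript under them was dropped.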

skoren mentioned this issue Apr 8, 2022

Problems in gff #52

Open
