No HTT repeat in adotto_TRregions_v1.2 ? #6

Open
davidlougheed opened this issue Jun 17, 2024 · 2 comments


davidlougheed commented Jun 17, 2024

Hello,
I'm working on benchmarking some tandem repeat genotypers and I was looking for the well-known pathogenic CAG HTT repeat in the adotto_TRregions_v1.2.bed BED file.
However, there doesn't seem to be an entry for the CAG repeat in the BED file. Is this just because the repeat doesn't differ from hg38 in HG002?

ACEnglish (Owner) commented

Hello,

The region we have documented for HTT from our Patho.csv is at

chr4 3074876 3074966 HTT CAG

This intersects the following region in adotto_TRregions_v1.2:

chr4 3074938 3075085

That region is included as a Tier1 region in the GIAB TR v1.0:

chr4 3074938 3075085 Tier1 TN_TN_TN 14 0.7933 -4 0 28

So there may be a 62 bp discrepancy between the region boundaries in the catalog and in Patho.csv (the catalog region starts at 3074938, the Patho.csv region at 3074876). Could that be breaking the overlap between your query region and the catalog?
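To make the boundary difference concrete, here is a minimal sketch of the overlap arithmetic for the two intervals quoted above. It assumes both are BED-style half-open intervals on the same chromosome; the helper name `overlap_bp` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# Coordinates copied from the comment above.
patho = ("chr4", 3074876, 3074966)    # HTT CAG entry from Patho.csv
catalog = ("chr4", 3074938, 3075085)  # adotto_TRregions_v1.2 region

shared = overlap_bp(patho[1], patho[2], catalog[1], catalog[2])
start_offset = catalog[1] - patho[1]
patho_len = patho[2] - patho[1]

print(f"start offset: {start_offset} bp")
print(f"shared: {shared} of {patho_len} bp in the Patho.csv interval")
```

Under that assumption the two regions do intersect, but only part of the Patho.csv interval lies inside the catalog region, so a query that requires full containment or a high reciprocal overlap could still miss it.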

davidlougheed (Author) commented

Thanks for the quick response! I do see that region in adotto_TRregions_v1.2, as the following line:

chr4	3074938	3075085	5	0	25	10	3	2	2	93	98	.	HTT	.	25	protein_coding	"[{""chrom"": ""chr4"", ""start"": 3074938, ""end"": 3075048, ""period"": 3.0, ""copies"": 37.0, ""score"": 143, ""entropy"": 1.5, ""ovl_flag"": 4, ""motif"": ""GCC"", ""purity"": 91}, {""chrom"": ""chr4"", ""start"": 3075050, ""end"": 3075060, ""period"": 5.0, ""copies"": 2.2, ""score"": 33, ""entropy"": 0.95, ""ovl_flag"": 1, ""motif"": ""CCCGG"", ""purity"": 95}]"

However, I was expecting to see the CAG repeat as an entry in the anno column of the BED file, since it's the best-known expansion location and motif for HTT repeat variation, with the GCC coming after.

My goal is to produce a BED file with motifs to feed to various STR genotypers. Extracting the chr4:3074938-3075048 region from hg38 gives a more heterogeneous repeat than I would expect, and it is missing a repeat I'd like to have. What I expected was two STR entries: CAG in chr4:3074877-3074940, and GCC in ~chr4:3074941-3075052.
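For reference, this is roughly how the per-motif entries can be pulled out of that catalog line. It is only a sketch: it assumes the last tab-separated column holds the JSON annotation list, and that the doubled quotes are CSV-style escaping as the line appears above. The function name `motif_bed_rows` is hypothetical.

```python
import csv
import json

# Catalog line as quoted in this thread (tab-separated; the last column is
# assumed to hold the JSON annotation list with CSV-style doubled quotes).
line = "\t".join([
    "chr4", "3074938", "3075085", "5", "0", "25", "10", "3", "2", "2",
    "93", "98", ".", "HTT", ".", "25", "protein_coding",
    '"[{""chrom"": ""chr4"", ""start"": 3074938, ""end"": 3075048, '
    '""period"": 3.0, ""copies"": 37.0, ""score"": 143, ""entropy"": 1.5, '
    '""ovl_flag"": 4, ""motif"": ""GCC"", ""purity"": 91}, '
    '{""chrom"": ""chr4"", ""start"": 3075050, ""end"": 3075060, '
    '""period"": 5.0, ""copies"": 2.2, ""score"": 33, ""entropy"": 0.95, '
    '""ovl_flag"": 1, ""motif"": ""CCCGG"", ""purity"": 95}]"',
])

def motif_bed_rows(catalog_line):
    """Return (chrom, start, end, motif) tuples from the annotation column."""
    # csv.reader un-escapes the doubled quotes inside the quoted last field.
    fields = next(csv.reader([catalog_line], delimiter="\t"))
    annotations = json.loads(fields[-1])
    return [(a["chrom"], a["start"], a["end"], a["motif"]) for a in annotations]

rows = motif_bed_rows(line)
for row in rows:
    print("\t".join(str(value) for value in row))
```

For this line the only motifs that come out are GCC and CCCGG, with no CAG entry, which is the gap described above.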
