User request CHM13 libs #50

Closed
simoncchu opened this issue Jun 22, 2022 · 4 comments

Comments

@simoncchu
Collaborator

simoncchu commented Jun 22, 2022

Question from #20

Hi, I tried using >5900 bp as the cutoff for full-length L1. I ran hg38 first to see whether I could reproduce the provided hg38 rep_lib_annotation data. The result I got was much larger than the provided annotation file: for example, my hg38_FL_L1_flanks.fa file is 53 MB (using -e 100), while hg38_FL_L1_flanks_3k.fa in the provided rep_lib_annotation is 2 MB. My code is attached below; any idea what is incorrect? The hg38 reference genome and RepeatMasker output file are both from UCSC.

#########
grep "LINE1" hg38.fa.out > hg38.fa_L1.out
cat hg38.fa_L1.out | while read line
do
eval $(echo ${line}|awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}')
if [ $var_9 == "C" ];then
i_length=$(($var_13 - $var_14))
else
i_length=$(($var_13 - $var_12))
fi
if [ $i_length -gt 5900 ];then
echo "$line"
fi
done >hg38.fa_L1_full_length.out ### this is to select out the LINE1 >5900bp

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100
#########
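
For comparison, the same length filter can be sketched as a single awk pass, which avoids the per-line eval. This is only a sketch under stated assumptions: it assumes the standard 15-column RepeatMasker .out layout (strand in column 9, class/family in column 11, consensus positions in columns 12 to 14) and that L1 records carry the class/family label "LINE/L1". The function name filter_full_length_l1 is hypothetical and not part of xTea.

```shell
# Hypothetical helper (not part of xTea): print RepeatMasker .out records
# whose L1 consensus span exceeds a cutoff.
# Usage: filter_full_length_l1 <repeatmasker.out> [min_length]
filter_full_length_l1() {
    awk -v min_len="${2:-5900}" '
        # Assumption: column 11 is the class/family and L1 is written as "LINE/L1".
        $11 == "LINE/L1" {
            # Column 9 is the strand ("C" for complement, "+" otherwise).
            # On "+", consensus start/end are columns 12/13;
            # on "C", they are columns 14/13 (column 12 holds the "(left)" value).
            span = ($9 == "C") ? $13 - $14 : $13 - $12
            if (span > min_len) print
        }
    ' "$1"
}
```

With the file names from the post, it could be called as `filter_full_length_l1 hg38.fa.out 5900 > hg38.fa_L1_full_length.out`. Matching on column 11 rather than grepping the whole line keeps the filter from depending on how the label happens to appear elsewhere in a record.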

Also, is it reasonable to set the full-length cutoffs for Alu, SVA, and HERV to 250 bp, 1900 bp, and 8900 bp, respectively?

It would be super helpful if you could kindly add CHM13 to the rep_lib_annotation data. Thank you!

@anderswe

Also interested in CHM13 in rep_lib_annotation! Thank you!

@zhuxf-lab

Any chance the CHM13 lib will be out soon? We are stuck on the lib preparation. Thank you!

@mikecuoco

Hi @simoncchu, have you had any luck generating the libraries for the T2T-CHM13v2.0 reference? I tried to follow your instructions, but it looks like you have additional custom files for each human TE type, so I'm worried my custom implementation will be suboptimal.

UCSC recently published the build and RepeatMasker output on the genome browser FTP server here. Let me know if I can do anything to help!

@simoncchu
Collaborator Author

Added CHM13v2 support. Please give it a try.

4 participants