Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory effeciency of split_cpg_groups.py #20

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

shangshanzhizhe
Copy link

Hi,

Awesome tool. However, when I use the pipeline to detect methylation with nanopolish, I found the memory consumption of split_cpg_groups.py is really huge. I believe it functions as a line-to-line reformat of the input file. I did a few slight modifications, changed it to instant output. Hope METEORE can be better~

Shangzhe Zhang

Copy link

@hrrsjeong hrrsjeong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. this code did not write the first CpG if there's CpG > 1. you should add another out.write in line 45 of your code.

@shangshanzhizhe
Copy link
Author

done.

@Toinax
Copy link

Toinax commented Jul 11, 2024

Hey,

Thanks for your script, it seems to be effectively less memory consuming.

However, i found some errors, so here is a corrected version to be used :

#!/usr/bin/env python3

import sys
import pandas as pd
import re

"""
	split_cpg_groups.py
	Author: Zaka Yuen, JCSMR, ANU
	Created on Feb 18 2020

        Modified by Shangzhe Zhang Mars 2 2022
        Modified by Antoine Quoniam_Barre on July 11 2024
    
	The output of the nanopolish calling procedure is a log-likelihood ratio, 
	where a positive log-likelihood ratio indicates evidence for methylation.
	Nanopolish groups nearby CpG sites together and calls the group jointly,
	assigning the same scores to each site in the group.

	This script is to:
	-use after running nanopolish call-methylation
	-allow per-site comparison to other datasets
	-split up the CpG group into its constituent CpG sites and assign the same log-likelihood ratio
	-keeping strandedness and readID information
"""


target ="CG"
out = open(snakemake.output[0],'w')
out.write("\t".join(['Chr', 'Pos','Strand', 'Log.like.ratio', 'Read_ID']) + "\n")
with open(snakemake.input[0],'r') as fh:
    next(fh)
    for line in fh:
        fields = line.rstrip().split()
        chrom = str(fields[0])
        strand = str(fields[1])
        start = int(fields[2]) + 1 # fix the position
        read_name = str(fields[4]).split('_', 1)[0]
        logRatio = float(fields[5])
        cpg_num = int(fields[9])
        seq = str(fields[10])

        if cpg_num == 1:
            out.write(chrom + "\t" + str(start) + "\t" + strand + "\t" + str(logRatio) + "\t" + read_name + "\n")

        elif cpg_num > 1:
            index = []
            for match in  re.finditer(target, seq):
                out.write(chrom + "\t" + str(start) + "\t" + strand + "\t" + str(logRatio) + "\t" + read_name + "\n")
            for match in re.finditer(target, seq):
                index.append(match.start())
                length = len(index)
            for i in range(1, length):
                new_start = start + (index[i] - index[0])
                out.write(chrom + "\t" + str(new_start) + "\t" + strand + "\t" + str(logRatio) + "\t" + read_name + "\n")
                

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants