No such file or directory #19

Open
number-25 opened this issue Feb 5, 2020 · 4 comments

Comments

@number-25

Hi Dana

I am having an issue with one particular file in my data set when I attempt to run it through TC.

Below are the command and its output:

python $TC --threads 6 --sam $SSAM --genome $REF --spliceJns $SPLICE --deleteTmp --outprefix EXT2_TC

Reading genome ..............................
Reading genome ..............................
Reading genome ..............................
cat: 'TC_tmp//.sam': No such file or directory
cat: 'TC_tmp//.fa': No such file or directory
cat: 'TC_tmp//.log': No such file or directory
cat: 'TC_tmp//.TElog': No such file or directory
Took 0:00:00 to combine all outputs.

I cleared the /tmp/ directory before trying again (I notice pybedtools creates many files there), but it didn't help. After restarting my PC, in case that would clear temporary files causing issues, I got further, yet the output files don't appear to be correct. This replicate has the largest input file of all, yet TC produced an output file substantially smaller than all the rest. It was also processed very quickly, whereas the rest took 3+ hours. To be sure, I remapped the original file with minimap2 and tried once more. I have also tried without the --deleteTmp option.

Cheers
Dean

@dewyman
Member

dewyman commented Feb 5, 2020

Hi Dean,
I'll try the trivial solution first: have you double-checked that the path you are providing in the $SSAM bash variable is correct?
If that doesn't help, feel free to send me a sample of the SAM file in question at [email protected] and I'd be happy to take a look.
Cheers,
Dana

@number-25
Author

Hi Dana,

The SAM file path is all good. I tried tweaking a few parameters, on the theory that the program exits if the computational load is too high for the computer. I reduced the threads from 6 to 4, and the program appears to have run to completion. My main confusion now is that the output file sizes are quite different, and counter-intuitively so. One replicate had a starting file size of 5 GB, and its output SAM from TC was 3.4 GB. My second replicate had a starting file size of 23 GB, yet ended up with an output SAM of only 2.5 GB, smaller than even the first replicate. I am confused as to what may be happening here. Is TC culling upwards of ~90% of the data? Most replicates also have mapping rates of ~85%.

Could it be the indel size in my data, or the number of mismatches?

Cheers
Dean

@dewyman
Member

dewyman commented Feb 7, 2020

Hi Dean,
One of the drawbacks of the multithreading is that it results in higher memory usage, which, as you remarked, can lead to a crash on large inputs. One way to mitigate this might be to pre-filter your SAM file and run TC on only the primary alignments (i.e. keep reads where the second column is 0 or 16).
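A minimal sketch of that pre-filtering step, assuming a plain-text (uncompressed) SAM file. The function name `filter_primary` and the file names in the usage comment are hypothetical and not part of TranscriptClean; the function simply keeps header lines plus records whose FLAG (second column) is exactly 0 or 16.

```python
def filter_primary(sam_in, sam_out):
    """Copy header lines and primary alignments (FLAG 0 or 16).

    sam_in and sam_out are open text file objects.
    Returns the number of alignment records written.
    """
    kept = 0
    for line in sam_in:
        # Header lines start with '@' and are always passed through.
        if line.startswith("@"):
            sam_out.write(line)
            continue
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # FLAG 0 = mapped, forward strand; FLAG 16 = mapped, reverse strand.
        if fields[1] in ("0", "16"):
            sam_out.write(line)
            kept += 1
    return kept


# Example usage (hypothetical file names):
# with open("input.sam") as src, open("primary_only.sam", "w") as dst:
#     print(filter_primary(src, dst), "primary alignments kept")
```

The file is streamed line by line, so memory use should stay small even for a 23 GB input.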
TranscriptClean doesn't correct the unmapped reads or the non-primary alignments, so it is possible that the file size difference you are seeing is related to that. In particular, the --canonOnly and --primaryOnly command-line options would be expected to decrease the size of the final output. For instance, if your mapping rate is 85% and you ran TC with one of these options enabled, then at least 15% of the reads in your input file would not be found in the output. Of course, I would not typically expect a 23G to 2.5G drop unless the mapping rate or multimapping rate was really bad; that is a big reduction. Do you still have the tmp files from that run?
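To see how much of an input file falls into each of these categories, the FLAG column can be tallied directly. This is only a rough diagnostic sketch, assuming a plain-text SAM file; `classify_flag` and `count_alignment_classes` are hypothetical helper names, not TranscriptClean functions. The bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM specification.

```python
from collections import Counter


def classify_flag(flag):
    """Classify a SAM FLAG value using the standard SAM bit definitions."""
    if flag & 0x4:
        return "unmapped"
    if flag & 0x100:
        return "secondary"
    if flag & 0x800:
        return "supplementary"
    return "primary"


def count_alignment_classes(sam_lines):
    """Count SAM records per class from an iterable of lines, skipping headers."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t", 2)[1])
        counts[classify_flag(flag)] += 1
    return counts


# Example usage (hypothetical file name):
# with open("input.sam") as handle:
#     counts = count_alignment_classes(handle)
#     total = sum(counts.values())
#     for name, n in counts.most_common():
#         print(f"{name}\t{n}\t{n / total:.1%}")
```

Comparing these fractions between the 5 GB and 23 GB replicates may show whether the size drop is explained by unmapped and non-primary records or by something else.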
Best,
Dana

@number-25
Author

Hi Dana,

Just some more info - the large replicate has ~26% non-primary alignments. The first smaller one has ~28%.

The .TE.log file for the large replicate indicates:
61907421 Corrected
658870 Uncorrected

I've still got the tmp files; see the contents of the /TC_tmp/split/un_corr_sams directory below.

Also, I sent you the .sam via email.

[Image: uncorrected]

Greatly appreciate the help!
Dean
