split results #19
Comments
Ok, so I just dug into the code for the first time, and as far as I understand, the program generates a subset of the big '${TAXID}_Diamond_results.bout' for each query. I am thinking of a way to do it with 2 GB of RAM: would that be compatible with the general functioning of genEra?
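(For context, a low-memory split of this kind can be done in a single streaming pass with awk, writing each line to a per-query file keyed on the first column. This is only a minimal sketch of the idea under discussion, not a command taken from genEra; the tmp_per_gene/ directory name is made up for illustration, and it assumes the Diamond output keeps all hits of a query on consecutive lines, which is the usual output order.)

```bash
# Minimal sketch: split a huge Diamond tabular output by query ID (column 1)
# in one streaming pass, so RAM use stays roughly constant.
# Assumes hits are grouped by query, so each per-query file can be closed
# as soon as the next query starts (avoids "too many open files").
# "tmp_per_gene/" is a hypothetical directory used only for illustration.
mkdir -p tmp_per_gene
awk -F"\t" '
  $1 != prev { if (prev != "") close("tmp_per_gene/tmp_" prev ".bout"); prev = $1 }
  { print >> ("tmp_per_gene/tmp_" $1 ".bout") }
' "${TAXID}_Diamond_results.bout"
```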
Dear Paul, Thanks again for your useful comments! The first enhancement you propose seems quite interesting, but I suspect it would not be compatible with some of the additional options of the pipeline (e.g., adding proteins from organisms that are absent from the NR database). I may be wrong, though! Best,
About the first approach, I thought the additional genomes were blasted separately and then merged into the big '.bout' results file. Anyway, if the second suggestion works, the first one is not necessary, but for now I can just try the first one without having to change the code. (Still in progress, but it took 1 h 36 min to parse a 262 GB ${TAXID}_Diamond_results.bout, and it should be faster or about the same with the second suggestion.) Also, if the latter suggestion is fine, we could try to deal with parallel writing conflicts and use multiple CPUs to speed it up.
Dear @josuebarrera, the proposition of @Proginski is very interesting. Mixing the "split all per-gene results first" approach of FASTSTEP3R with this command should avoid the large RAM usage of data.table when importing ${TAXID}_Diamond_results.bout into R. So far I have run some tests (with ~63 GB and ~170 GB ${TAXID}_Diamond_results.bout files), and FASTSTEP3R is still faster (vs. the single-fashion awk command; the parallel fashion was not checked), but the runtime difference between FASTSTEP3R and the awk command is in the range of a few hours (the maximum difference did not reach 24 hours for files of these sizes). I think this runtime difference, at least for outputs of these sizes, would be acceptable to most users and would therefore make the pipeline widely usable for people without that level of resources. Additionally, I have compared the runtimes of a parallel-fashion (Erassignment-like) awk command, using a ${TAXID}_Diamond_results.bout split as in FASTSTEP3R, against the single-fashion awk command.
Single-fashion awk command:
Parallel-fashion awk command (Erassignment-like), with ${NTHREADS} = 10:
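(Neither benchmarked command is reproduced here; as an illustration only, the single-fashion variant is essentially a one-pass split like the sketch earlier in this thread, while a parallel-fashion split could look like the hypothetical sketch below. It assumes GNU coreutils split and uses plain background jobs rather than GNU parallel; the chunk and directory names are made up.)

```bash
# Hypothetical sketch of a parallel-fashion split (not the benchmarked command):
# cut the Diamond output into ${NTHREADS} line-based chunks, split each chunk
# by query ID into its own directory so no two workers write to the same file,
# then merge the per-gene files.
split -n l/"${NTHREADS}" -d "${TAXID}_Diamond_results.bout" chunk_
i=0
for chunk in chunk_*; do
  i=$((i + 1))
  mkdir -p "tmps_${i}"
  # gawk manages its own output file handles; other awks may hit open-file limits
  awk -F"\t" -v dir="tmps_${i}" '{ print >> (dir "/tmp_" $1 ".bout") }' "$chunk" &
done
wait
# A gene whose hits span a chunk boundary ends up with one file per chunk; merge them.
mkdir -p tmps
for f in tmps_*/tmp_*.bout; do
  cat "$f" >> "tmps/$(basename "$f")"
done
```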
I have checked that both approaches seem to produce tmp_${GENE}.bout files containing the same results, so parallel writing conflicts may not be a big issue. The parallel-fashion awk command (Erassignment-like) is ~1.4 times faster than the single-fashion awk command for a ~63 GB ${TAXID}_Diamond_results.bout. Of course, these are very rough tests with plenty of room for improvement, and everything needs to be checked further, but it looks very promising! If all works fine, the -F argument could be turned off by default, given the huge amount of resources it uses, and the default run could use the parallel-fashion awk commands instead. If you finally decide to implement it and need some help, I will be completely available to code, so let me know! 👨‍💻 Cheers, Víctor
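(A quick way to sanity-check that two split strategies produce identical per-gene files, as described above, could be something like the following; tmps_single/ and tmps_parallel/ are hypothetical directory names for the outputs of the two runs, and both runs are assumed to produce the same set of files.)

```bash
# Compare per-gene outputs of two runs; sort first so that a different line
# order within a file (e.g. from parallel appends) does not count as a diff.
for f in tmps_single/tmp_*.bout; do
  g="tmps_parallel/$(basename "$f")"
  if ! diff <(sort "$f") <(sort "$g") > /dev/null; then
    echo "MISMATCH: $(basename "$f")"
  fi
done
```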
Hi, in order to deal with potential writing conflicts, we could do the following, with results=${TMP_PATH}/${NCBITAX}_Diamond_prefiltered_results.bout and n=${NTHREADS}:
cat firsts/* | awk -F"\t" '{ print $0 >> "tmps/tmp_"$1".bout" }'
Now I need to have a look at how to integrate that into the current program. If it is easy for you to integrate that code and if it works as expected, just let me know; otherwise, I'll try myself ;) (Help! It does not seem easy.) Paul
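(A guess at how the full scheme leading to that final cat firsts/* step might look, assuming each worker quarantines the lines of its chunk's first query ID into a per-chunk file under firsts/, so that a query whose hits straddle a chunk boundary is never written by two workers at once, with those boundary lines appended serially at the end. Everything except the firsts/ and tmps/ names and the final command is hypothetical.)

```bash
# Hypothetical reconstruction of the boundary-safe parallel split:
# 1) cut the prefiltered Diamond results into n line-based chunks;
# 2) each worker splits its chunk by query ID into tmps/, EXCEPT the lines of
#    the chunk's first query, which go to firsts/<chunk>; assuming hits are
#    grouped by query, a query spanning a chunk boundary is then only ever
#    written to tmps/ by a single worker;
# 3) append the quarantined boundary lines serially (the final step above).
results=${TMP_PATH}/${NCBITAX}_Diamond_prefiltered_results.bout
n=${NTHREADS}
mkdir -p tmps firsts
split -n l/"$n" -d "$results" chunk_
for chunk in chunk_*; do
  awk -F"\t" -v firsts="firsts/$chunk" '
    NR == 1 { first = $1 }
    $1 == first { print >> firsts; next }
    { print >> ("tmps/tmp_" $1 ".bout") }
  ' "$chunk" &
done
wait
cat firsts/* | awk -F"\t" '{ print $0 >> "tmps/tmp_"$1".bout" }'
```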
Dear @Proginski,
Just forget about my last proposition:
Hi,
Is your feature request related to a problem? Please describe.
As the latest release of the human genome, with its ~145k CDS, produces ~630 GB of results, and as the help of v1.4.0 says that one needs around 200 GB of RAM for 180 GB of results, it seems one would need ~700 GB of RAM (630 × 200/180 ≈ 700) to complete the analysis with the -F option.
Describe the solution you'd like
Once step 1 (+/- 2) is completed, would it be possible to manually split the input fasta and the Diamond results so that each chunk runs faster and needs less memory? (I'm not saying it will not require a lot of RAM also ;) )
Describe alternatives you've considered
I just tried something like
The chunk has 87 CDS and of course, it went turbo-fast.
The ages assigned to the CDS were the same as when the entire original fasta was used.
So is it possible to do so, and could it be of any interest?
Paul
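(The exact commands tried above are not shown; as an illustration only, a manual split of this kind could look like the sketch below, assuming the query/CDS IDs sit in column 1 of the Diamond tabular output and match the fasta headers up to the first whitespace. The file names chunk1.fasta, chunk1.ids and chunk1.bout are hypothetical.)

```bash
# Hypothetical sketch of the manual split: take a subset of the query fasta,
# then pull only the matching queries' hits out of the big Diamond output.
grep '^>' chunk1.fasta | sed 's/^>//; s/[[:space:]].*//' > chunk1.ids
awk -F"\t" 'NR == FNR { keep[$1]; next } $1 in keep' \
  chunk1.ids "${TAXID}_Diamond_results.bout" > chunk1.bout
# chunk1.fasta and chunk1.bout would then stand in for the full inputs in the
# later steps (this is the idea being asked about, not a documented genEra feature).
```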