Question about the Parallelization #358
-
Dear all, when I perform an NPT calculation via SSCHA+QE, I run into some trouble with the parallelization. First, I followed the tutorial for automatic submission to a cluster, but it failed because of the protection set up on the HPC cluster. So instead I requested several nodes directly in the slurm script and launched the calculation with mpirun. See the following script for details:
where the npt_relax.py contains the following command:
After submitting the job, 4 nodes are allocated successfully and SSCHA creates 4 input and output files in the calculation folder. Unfortunately, only one output file is actually being updated; the others stay at 0 bytes. When I ssh into the other nodes and check the processes, I find no running process on the other 3 nodes. Why does this happen? By the way, I checked the same operation on the login node, and it works well with the following code:
Besides, I also tried submitting 4 python jobs on a single node: again 4 input and output files are created, but only one output file is updated. So this seems to indicate a problem with the slurm scheduling system? How can I fix it? In addition, since sscha creates the input files correctly with mpirun -np NPROC python *.py, can I run the scf calculations on these input files separately, then collect the results for the sscha step and create the input files for the next iteration? If that is possible, how should the sscha calculation be set up? I have already done some minimisations in a similar way, but I am curious whether this approach can also be applied to NPT calculations. Any suggestions and discussions will be greatly appreciated. Thanks and Regards!
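For reference, the kind of manual split I have in mind is roughly the ensemble save/load pattern from the python-sscha tutorials; the dynamical-matrix prefix, nqirr, temperature and number of configurations below are just placeholders, and I am not sure whether the same pattern also covers the cell degrees of freedom of an NPT run:

```python
import cellconstructor as CC
import cellconstructor.Phonons
import sscha, sscha.Ensemble

# Load the current dynamical matrices (placeholder prefix / nqirr)
dyn = CC.Phonons.Phonons("dyn_start", nqirr=4)

# Generate the stochastic ensemble and dump the configurations to disk
ens = sscha.Ensemble.Ensemble(dyn, T0=300, supercell=dyn.GetSupercell())
ens.generate(N=100)
ens.save("data_ensemble_manual", population=1)

# ... run the pw.x scf calculations for these configurations separately ...

# Reload the ensemble together with the computed energies and forces,
# then feed it to the minimizer to produce the dyn for the next population
ens.load("data_ensemble_manual", population=1, N=100)
```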
-
Dear Jianguo Si,
We recently implemented a workaround to the protection from the cluster. (This is available only from the repository's master branch; it has not yet been released to PyPI, so you must clone the latest development version from github. We will release it in version 1.5 in the next few months.)
Instead of using the Cluster class, you can replace it with LocalCluster. You initialize my_cluster exactly as done in the standard Cluster calculation. You can find here an example of how it works. You need to readapt the script a bit (changing the modules to load quantum espresso and the keywords for submitting jobs...). In this way you can submit your SSCHA calculation as a long serial job, and it will automatically submit subjobs within the cluster for higher parallelization. This correctly exploits MPI and all parallelization.
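Roughly, the setup could look like the sketch below, assuming LocalCluster is exposed from sscha.Cluster and accepts the same attributes as the standard Cluster class; the workdir, account, modules, resources and binary line are placeholders to adapt to your machine (double-check everything against the linked example):

```python
import sscha, sscha.Cluster

# LocalCluster instead of Cluster: jobs are submitted from the machine where
# this script runs, so no ssh connection is needed (assumed constructor).
my_cluster = sscha.Cluster.LocalCluster()

my_cluster.workdir = "/scratch/myuser/sscha_run"   # placeholder scratch path
my_cluster.account_name = "my_account"             # placeholder slurm account
my_cluster.n_nodes = 1
my_cluster.n_cpu = 40
my_cluster.time = "02:00:00"
my_cluster.job_number = 10     # placeholder: jobs submitted at the same time
my_cluster.batch_size = 10     # placeholder: configurations grouped per job
my_cluster.load_modules = """
module load quantum-espresso
"""
# Command used for each scf run; prepend your mpirun/srun launcher as needed
my_cluster.binary = "pw.x -npool NPOOL -i PREFIX.pwi > PREFIX.pwo"
my_cluster.setup_workdir()

# Then pass it to the relax object exactly as with the standard Cluster, e.g.
# relax = sscha.Relax.SSCHA(minim, ase_calculator=espresso_calc,
#                           N_configs=N_CONFIGS, max_pop=MAX_POP,
#                           cluster=my_cluster)
# relax.vc_relax(target_press=0)   # variable-cell (NPT) relaxation
```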
-
Re-post!
-
Dear Dr. Lorenzo Monacelli, I have uploaded the python script and the log files as attachments (named 1113.zip), please check them. Besides, I used the Cluster module to define the HPC-related information. For me, there are two HPC machines available. I submit the python script at the login node of HPC-1 by
Thanks & Regards!