Wrong stats in multi-node local executor #297

Open
jordane95 opened this issue Oct 15, 2024 · 1 comment

Comments

@jordane95
Contributor

Because there is no node-level communication, the stats at the end of each pipeline step can only aggregate results from the current node. What gets written to disk is the stats of the last worker to finish, rather than the global info.

@hynky1999
Contributor

hynky1999 commented Oct 17, 2024

Hi, this is resolved for the slurm executor by running a stats merger after all subtasks have finished. I don't think there is a way to get the same behavior in the local executor, because in multi-node mode the global orchestration is not done by datatrove. So the responsibility for launching the merge script can't be handled by datatrove.

If you log all stats into one folder, you can use this script: https://github.com/huggingface/datatrove/blob/main/src/datatrove/tools/merge_stats.py, which is exactly the script that the slurm executor runs after all tasks have finished.
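
To illustrate the idea of merging per-task stats from one shared folder, here is a minimal Python sketch. It is not the code in `merge_stats.py`: the function name `merge_stat_files` is hypothetical, and it assumes each task wrote a flat JSON file mapping counter names to numbers, which is a simplification and not necessarily datatrove's actual stats format. For real runs, the linked script is the one to use.

```python
import json
from collections import Counter
from pathlib import Path


def merge_stat_files(stats_dir: str) -> dict:
    """Sum numeric counters across every *.json file in stats_dir.

    Assumption: each file is a flat JSON object of {counter_name: number}.
    Non-numeric values are ignored.
    """
    totals: Counter = Counter()
    for path in sorted(Path(stats_dir).glob("*.json")):
        with path.open() as f:
            task_stats = json.load(f)
        for name, value in task_stats.items():
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[name] += value
    return dict(totals)
```

Run once after every node has finished, this gives totals over all tasks instead of the numbers from whichever worker wrote last.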
