datatrove fails to handle tasks >1k with slurm job arrays #238
I think this is due to the number of available dataset shards/files (possibly lower than 7113, the rank in the logs). Can you share how you instantiated HuggingFaceReader? Was it something like this:
I did:
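For context, a typical instantiation of this reader looks roughly like the following - a sketch only, where the dataset id, config and argument values are assumptions rather than the snippets from this exchange (the class is exposed as HuggingFaceDatasetReader in datatrove):

```python
from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Sketch of a streamed HF dataset reader; dataset/config values are placeholders.
reader = HuggingFaceDatasetReader(
    dataset="HuggingFaceFW/fineweb",                                 # assumed dataset id
    dataset_options={"name": "CC-MAIN-2024-10", "split": "train"},   # forwarded to load_dataset (assumed)
    streaming=True,    # stream instead of downloading the full dataset
    text_key="text",   # column containing the document text
)
```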
I suspect the problem comes from me trying to use 10k tasks: when this gets submitted to slurm it generates 10 job arrays with 1k jobs each. And then is it possible that each array restarts its task ids from 0? So with 10k tasks what we have here is SLURM with job ids as follows:
so if it just looks at the array task id, it'd compute the wrong rank for the later arrays. I could be wrong of course and the cause of the issue is something else.
Is there a way to tell datatrove to figure out the needed number of tasks on its own, e.g. from the number of input files?
It should assign the correct rank - you can check the first few lines of the log files: they should say "rank=number" and it should be >=1000 for the other job arrays.
Not currently, as you can have multiple readers stacked one after the other with different file sizes each, or some pipelines without readers at all. Your approach with 10k should work (extra tasks should just not do anything). For now, until we fix this bug, I recommend using the ParquetReader to stream and process FineWeb data; there's an example snippet for this on the FineWeb dataset page.
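That snippet is roughly along these lines (the dump path and the limit value are placeholders to adjust):

```python
from datatrove.pipeline.readers import ParquetReader

# `limit` caps how many documents are read - remove/raise it for a real run.
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2024-10", limit=1000)

for document in data_reader():
    # do something with each document
    print(document)
```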
This should fix the bug/crash: e01bd0a
Thank you for working with me on this, @guipenedo - I'm currently testing your fix. The other weird thing about this splitting into 1k job arrays is that it won't start the next job array till it finishes the first one. I guess this is normal because of job dependency. Just a peculiarity I noticed - if even one job is lagging behind, it won't expand to the full number of workers. So if one or more of the jobs crashes - how does it recover?
The dependencies for these arrays are "afterany:" and not "afterok:", so if one crashes the next job array should still launch. If you want them to run concurrently you can set max_array_launch_parallel=True on the executor.
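A minimal sketch of where that flag goes on the executor - the parameter name is from memory and worth checking against your datatrove version, and the reader/writer/paths are placeholders:

```python
from datatrove.executor import SlurmPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = SlurmPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2024-10"),  # placeholder input
        JsonlWriter("/path/to/output"),                                             # placeholder output
    ],
    tasks=10_000,                     # >1k, so datatrove splits this into several 1k-job arrays
    workers=128,                      # cap on concurrently running jobs
    time="24:00:00",
    partition="your-partition",       # placeholder
    logging_dir="/path/to/logs",      # placeholder
    max_array_launch_parallel=True,   # submit all job arrays at once instead of chaining them
)
executor.run()
```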
I'm not sure what you mean by only 250 files available. I'm passing:
so it should take a lot more tasks than 250. Unless I don't understand the purpose of limit: I thought limit helps the user set how many samples to feed per slurm job. Otherwise, if you can't control that, how would you deal with time-limited slurm jobs? Say your job can't be longer than 24h and it'd take more than 24h to process one file - it'll then get cancelled by slurm and it'd be impossible to finish data processing.
oh! that's very cool! Thank you! Shouldn't that be the default setting? It makes sense that the limitation imposed by slurm of only 1k jobs per job array should ideally be transparent to the user - if resources are available, shouldn't it utilize all of them at all times?
Yes, it drops the rest; we use it mostly for testing/debugging (a small limit just to check everything runs fine before scaling up). For the HuggingFaceReader in particular you can actually get in-file parallelism to work if you don't use streaming (but then obviously it would take up some storage), as you can scale the total number of tasks as much as you want and it will split everything; but for JsonlReader, ParquetReader, etc. it's still just a planned feature.
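A sketch of the two knobs mentioned here - argument names are from memory and the dataset id is a placeholder: `limit` as a small cap for a smoke test, and `streaming=False` if you want in-file parallelism at the cost of local storage:

```python
from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Smoke-test read: `limit` caps how many documents each task reads;
# everything past the cap is simply dropped.
test_reader = HuggingFaceDatasetReader(
    dataset="HuggingFaceFW/fineweb",        # assumed dataset id
    dataset_options={"split": "train"},
    streaming=True,
    limit=1_000,
)

# Non-streaming read: the dataset is cached locally, which lets the reader
# split work inside files, so the task count can exceed the number of shards.
full_reader = HuggingFaceDatasetReader(
    dataset="HuggingFaceFW/fineweb",
    dataset_options={"split": "train"},
    streaming=False,
)
```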
That's a good point, I might change the default.
Re: examples of datatrove+fineweb - it'd be very useful to have more of those. Incidentally, any idea why datatrove wasn't used for fineweb-edu and HF Trainer was used instead? https://github.com/huggingface/cosmopedia/tree/main/classification It appears that they somehow managed to run the whole thing in 24h: https://github.com/huggingface/cosmopedia/blob/main/classification/train_edu_bert.slurm#L11
I think the slurm file you linked is just to train the classifier, which only takes a relatively small number of annotated samples. Actually classifying all of FineWeb took quite a lot longer. Datatrove could have been used, but this was originally the work of a separate team that wasn't too familiar with it.
oh, and the other thing I noticed: it creates one job array too many, which never gets satisfied:
so with 2k items it created an extra job array. To validate, I have just repeated with 1k items and again got one job array too many:
and later it'll become DependencyNeverSatisfied.
This one task will take all the individual job stats and merge them into a global stats.json file, showing how many documents were processed and dropped on each step. This one actually has an afterok: dependency, so it only runs if all the actual jobs succeed. I often use it as a quick way to check if my job is fully finished or not: if there's a DependencyNeverSatisfied job, then I have to rerun or fix something.
oops, linked to the wrong file, the inference is here: https://github.com/huggingface/cosmopedia/blob/main/classification/run_edu_bert.slurm#L13C1-L14C26
so it was running on a slurm env w/o effective time limits. So I still would like to know - how does datatrove handle failed jobs? Do you need to restart the same process a few times, so it rechecks which jobs didn't complete and then re-does them from scratch, w/o taking into account any partial completion? Is there a way to have some finalizer script that can report to the operator whether all jobs were successful or whether it needs to re-run some? And of course it'd be silly to launch 1k jobs if only one failed and needs repeating, albeit most will exit immediately... so probably ok. edit: I see you already replied partially to my questions in your previous comment.
ok, so I launched again with no limit:
That is, how do I know the actual progress? It doesn't tell me how many iterations will be run, or whether the input will be smaller than what I can process in 24h, since that's the longest job I can run and I don't want to lose data. I'm using streaming. Is this because of streaming? Should I not use streaming? It felt much slower w/o streaming when I tried it out first.
Also, this might be of interest to you:
It says it can't find
Yes exactly, each individual task was processing 1/128th (a job array with 128 jobs, passing each job's position in the array to the script).
Yes, it will only relaunch the incomplete tasks. We don't really have any "checkpointing", as it would depend a lot on when the output files were last flushed (it could often even just corrupt the files if the final data before a task crashed was incomplete), and also to simplify some other logic in general.
We have two commands that you can use to track job status:
We do not really check the total number of documents as some formats don't have it (jsonl, etc), even though in theory for a dataset on the hub we could find a way to fetch it. When you're processing multiple files
re:
I tried all the tricks discussed at #180 but none of them work. Tried all 3: |
may I suggest adding a prefix to these utils? as in:
otherwise the names are too generic and may collide with other pre-existing scripts/aliases.
@guipenedo, so if I submit 2k tasks and I set 128 workers, it'd run 256 workers!
And another thing I can't figure out. The 250 files job was fine, but then I ran the same on the full dataset. So I first did 1k items and noticed none of them finished early - they all had data to process - so I raised it to 2k and launched it again and it did another 1k items, hitting rank 1999. I said OK, then there is even more than 2k to do, so I raised it to 3k and launched again... Should there be as many tasks as shards or files?
so 2264 tasks? clearly it is not 350 shards |
it just keeps on going:
how do I know how many tasks I need? I wonder if it's because I had to switch from streaming to cached and then it had more shards? But even then it's more than what it said when loading the dataset - it's past 3k tasks now
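One rough way to size the task count - a sketch using huggingface_hub rather than anything built into datatrove, with the dataset id and path prefix as assumptions - is to count the parquet shards the reader is actually pointed at:

```python
from huggingface_hub import HfApi

# Count the parquet shards under the dump being read; with a streamed reader,
# tasks beyond roughly this count typically have nothing left to process.
# Dataset id and path prefix are placeholders.
files = HfApi().list_repo_files("HuggingFaceFW/fineweb", repo_type="dataset")
shards = [f for f in files if f.startswith("data/CC-MAIN-2024-10/") and f.endswith(".parquet")]
print(f"{len(shards)} parquet shards")
```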
If I have more tasks than 1k, datatrove splits them into multiple job arrays of 1k each.
The first job array of 1k runs fine; the subsequent ones all fail.
This failing behavior is consistent.
This is with datatrove@main - I can't use the official release as it doesn't support datasets streaming. This is just doing a slightly modified FineWeb filter from the example.
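A hedged sketch of the kind of setup described here - not the actual script from this report; the reader arguments, filters and paths are placeholders standing in for "a slightly modified FineWeb filter from the example":

```python
from datatrove.executor import SlurmPipelineExecutor
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = SlurmPipelineExecutor(
    job_name="fineweb-filter",
    pipeline=[
        HuggingFaceDatasetReader(
            dataset="HuggingFaceFW/fineweb",     # assumed dataset id
            dataset_options={"split": "train"},
            streaming=True,                      # streaming needs datatrove@main, per the report
        ),
        LanguageFilter(),
        GopherQualityFilter(),
        JsonlWriter("/path/to/output"),          # placeholder
    ],
    tasks=10_000,        # >1k, so datatrove submits ten 1k-job slurm arrays
    workers=128,
    time="24:00:00",
    partition="your-partition",                  # placeholder
    logging_dir="/path/to/logs",                 # placeholder
)
executor.run()
```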