
Monitoring of Spark emissions via Spark plugin #600

Open
tvial opened this issue Jul 1, 2024 · 2 comments
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments


tvial commented Jul 1, 2024

Hi,

I am working on a prototype of a Spark plugin to report the energy consumption of executors. The logic behind it is similar to CodeCarbon's, although the computation method differs slightly: the executor process scheduling is sampled regularly, converted to Wh with the TDP (provided or inferred), and aggregated by the driver. The total energy is published as a Spark metric, accessible via the REST API.
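Roughly, the per-interval conversion comes down to something like this (a simplified sketch; the names are mine for illustration, not the plugin's actual API):

```python
# Convert one sampled CPU load ratio into energy for that sampling interval.
# load_ratio: fraction of CPU time attributed to the executor process
# tdp_watts:  the provided or inferred TDP
# interval_s: the sampling period in seconds
def interval_energy_wh(load_ratio: float, tdp_watts: float, interval_s: float) -> float:
    power_watts = load_ratio * tdp_watts      # estimated draw over the interval
    return power_watts * interval_s / 3600.0  # watt-seconds -> Wh

# e.g. 40% load on a 95 W CPU over a 10 s interval:
# interval_energy_wh(0.4, 95.0, 10.0) -> ~0.106 Wh
```

The driver then just sums these per-executor values into the total.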

I wanted to know if you'd be interested in integrating it with CodeCarbon, for example with a Spark cluster as a new type of resource alongside CPU, GPU, and RAM. It would let CC factor in the energy mix and cloud provider data, which can be cumbersome to access from a private Spark cluster (it's better not to assume internet connectivity). And it would benefit from CC's ease of use, which is a strong driver of adoption.

In any case, it's a prototype: it needs more testing and validation, and it only handles CPUs for now (but many data engineering pipelines don't use GPUs anyway). Here it is: https://github.com/tvial/ccspark (Apache 2.0 license). Note that it embeds your CPU database for the TDPs; I'm open to removing it if you think that's a bad idea :)

Let me know if it can be of any help.
Thanks!

SaboniAmine (Collaborator) commented

Hello Thomas, that's a great idea!
Thanks for this proposal; it would be appreciated by a lot of potential users.
I'll take a deeper look at the implementation, but here are some initial questions:

  • Are the binaries executed by the executors to sample energy consumption launched as root? Otherwise, the RAPL interface exposing this data might not be accessible.
  • It seems that your implementation is already working; what would be a good test setup? Do you think it could be reproduced in the CI, with some TestContainers, for instance? (A rough sketch of what I have in mind follows below.)

I'm happy to hear from you, especially through this channel!
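For illustration, a CI test could spin up a Spark container with testcontainers-python along these lines (a minimal sketch; the image, port, and wiring to ccspark are assumptions, not something I've tested):

```python
# Minimal sketch: bring up a Spark master in Docker for an integration test.
# The bitnami/spark image and the web UI port are assumptions; a real test
# would also mount the plugin jar and submit a job with the plugin enabled.
from testcontainers.core.container import DockerContainer

def test_spark_master_comes_up():
    spark = (
        DockerContainer("bitnami/spark:3.5")
        .with_env("SPARK_MODE", "master")
        .with_exposed_ports(8080)  # Spark master web UI
    )
    with spark:
        host = spark.get_container_host_ip()
        port = spark.get_exposed_port(8080)
        # A real test would then query the driver's REST API for the
        # published energy metric instead of this placeholder assertion.
        assert host and port
```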


tvial commented Jul 1, 2024

Hi Amine, it's been a while :) Glad to hear from you as well!

Thanks for the encouraging feedback.

It works as is: I tested it locally and on a small dedicated Databricks cluster on Azure, both with very simple jobs (no real usage as of yet). I see no obstacle to making it work in a CI or any other environment, as it has no dependencies of its own.

Regarding the measurements, it does not use RAPL, for the reason you mention. I think I read somewhere that some Databricks configurations let you run executors as root, but I would not make that a requirement; maybe RAPL could become an alternative method? The approach here is to read scheduled jiffies from /proc/stat and /proc/[pid]/stat, take the difference between two samplings, and compute the ratio as the load attributed to the process over the sampling period (sketched below). It should be reviewed by someone with more expertise in Linux and Spark's execution model.
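In spirit, the sampling works like this (a simplified sketch of the idea, not the plugin's actual code):

```python
import time

def cpu_jiffies() -> int:
    # First line of /proc/stat: "cpu  user nice system idle iowait irq ..."
    # Summing all fields gives the total jiffies spent by the machine.
    with open("/proc/stat") as f:
        return sum(int(x) for x in f.readline().split()[1:])

def process_jiffies(pid: int) -> int:
    # In /proc/[pid]/stat, fields 14 and 15 (1-indexed) are utime and stime.
    # Split after the ")" closing the command name to avoid spaces inside it.
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])  # utime + stime

def sample_load(pid: int, interval_s: float = 10.0) -> float:
    # Load attributed to the process = its jiffies delta over the total delta.
    total0, proc0 = cpu_jiffies(), process_jiffies(pid)
    time.sleep(interval_s)
    total1, proc1 = cpu_jiffies(), process_jiffies(pid)
    return (proc1 - proc0) / max(total1 - total0, 1)
```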

@SaboniAmine SaboniAmine added the enhancement label on Jul 2, 2024
@benoit-cty benoit-cty added the help wanted label on Jul 14, 2024