
Add WebGPU Support #319

Open
jrissman opened this issue Apr 29, 2023 · 9 comments

@jrissman

WebGPU is a new API that lets web applications leverage the power of modern hardware for general purpose computation. After six years of development, it was just released in beta on Chrome for Windows, Mac, and ChromeOS. All major browsers are currently working on adding support (it can already be enabled in Safari, Firefox, Edge, etc. with a flag), and Android and iOS support are coming as well. For more, see the beta release announcement here: https://developer.chrome.com/blog/webgpu-release/

WebGPU might enable much faster model run speeds on modern devices. This might allow models to perform more calculations (such as full 8760-hour electricity dispatch) while still running the model quickly enough to feel interactive. It would also be useful for optimization and genetic algorithms, where a model may need to be run many times, with the computer varying the settings each time, to home in on an optimal solution defined by a goal function.

A modern GPU can have over 1,000 cores, vastly more than a CPU. Assuming WebGPU abstracts the process of distributing work among GPU cores, runspeed might increase by orders of magnitude.

WebGPU can be invoked by WebAssembly code generated by Emscripten. Here's a repo I found where someone built tools to do this: https://github.com/juj/wasm_webgpu . Therefore, it appears that WebGPU is not a replacement for WebAssembly but rather could be added on for further speed gains.
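
For a concrete sense of the API, here is a minimal, hedged sketch (not tied to SDEverywhere's code generator; all names are placeholders) of a WebGPU compute dispatch driven directly from TypeScript/JavaScript, doubling an array of floats on the GPU:

```ts
// Minimal WebGPU compute sketch (illustrative only): doubles an array of f32 values on the GPU.
// All names here are placeholders; nothing below is tied to SDEverywhere's code generator.
async function runDoubleKernel(input: Float32Array): Promise<Float32Array> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('WebGPU not available');
  const device = await adapter.requestDevice();

  // WGSL shader: one invocation per array element.
  const shader = device.createShaderModule({
    code: /* wgsl */ `
      @group(0) @binding(0) var<storage, read_write> data: array<f32>;
      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
        if (gid.x < arrayLength(&data)) {
          data[gid.x] = data[gid.x] * 2.0;
        }
      }`
  });

  // One storage buffer for the data, plus a staging buffer for reading results back.
  const byteLength = input.byteLength;
  const storage = device.createBuffer({
    size: byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST
  });
  const readback = device.createBuffer({
    size: byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
  });
  device.queue.writeBuffer(storage, 0, input);

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: shader, entryPoint: 'main' }
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: storage } }]
  });

  // Encode one compute pass, then copy the results into the mappable buffer.
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(input.length / 64));
  pass.end();
  encoder.copyBufferToBuffer(storage, 0, readback, 0, byteLength);
  device.queue.submit([encoder.finish()]);

  await readback.mapAsync(GPUMapMode.READ);
  return new Float32Array(readback.getMappedRange().slice(0));
}
```

The wasm_webgpu project linked above presumably exposes essentially the same objects and calls to C code compiled with Emscripten, so a similar structure should apply to WebAssembly-generated code.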

It would be useful to investigate how easy or difficult it would be to use the WebGPU API and see what sort of runspeed improvement it provides. If the runspeed improvement is dramatic, it might be worthwhile to plan to migrate to WebGPU as the standard method of running models once the WebGPU API is broadly supported across browsers and operating systems, as it likely will be in a couple years.

@ToddFincannonEI
Collaborator

This is something that could have a big payoff for large models. Developing a demo with Emscripten could be a quick project. I think it would need to be optional for the code generator until it matures. Jeff's proposal contains two use cases — interactive speedup on a personal computer, which would have a modest number of GPU cores; and a big speedup when running models on a cloud server with ~1000 GPU cores. I don't think anyone has used SDEverywhere the second way, but it is certainly a possibility for either C or WebAssembly code.

@jrissman
Author

jrissman commented May 1, 2023

My comment on 1000+ cores was not for cloud servers. A modern graphics card from Nvidia or AMD has over 1000 cores. This chart of Nvidia graphics cards, for example, says that the Nvidia RTX 4090 has 16,384 cores, and the last-generation 3000 series has between 2,560 and 10,752 cores depending on the graphics card model.

I'm not sure how this translates to integrated graphics rather than a distinct GPU card, but even integrated graphics probably offers a vast speed boost over relying only on a CPU.

I was still thinking of the model running locally on people's devices, but taking advantage of the highly optimized graphics hardware in essentially all modern devices.

@jrissman
Author

jrissman commented May 1, 2023

Another way to look at it is in floating point operations per second, where GPUs have an order-of-magnitude edge over CPUs. Here's a chart from Cornell University:

[Chart: peak FLOPS over time, CPU vs. GPU]

That chart only goes up through 2016 or so. The current-generation GeForce 4000 series has between 23 and 83 teraflops of computing power depending on the graphics card model (single precision), according to this Wikipedia chart.

Cornell says: "CPUs are typically designed for multitasking and fast serial processing, while GPUs are designed to produce high computational throughput using their massively parallel architectures... One can ask: given a device that delivers such impressive floating-point performance, is it possible to use it for something besides graphics?"

@jrissman
Author

jrissman commented May 1, 2023

This chart from SETI (Berkeley) puts Intel CPU performance at roughly 20-80 GFLOPs for most computers, about 1,000 times less than the Nvidia 4000-series GPUs at 20-80 TFLOPs. But the Cornell graph above suggests something closer to a 10x difference, not a 1,000x difference. I'm not sure how to reconcile these sources.

We'd probably need to run a test like the Emscripten demo Todd suggested to see what sort of performance we actually get. Much may depend on how much overhead there is, or whether there are bottlenecks we don't encounter today that would become relevant if the model computation were 10-1000x faster.

@chrispcampbell
Contributor

Thanks for filing. I've been interested for some time in making a small prototype of SDE compilation using WebGPU on the backend to see where it can be beneficial, if at all.

I'm hesitant because the highly sequential nature of typical SD model calculations doesn't align well with what GPUs are best at (parallel operations without interdependencies). It's possible that there are some kinds of SD models that can benefit, for example, a model that has equations operating on dimensions with a large number of subscripts (where vectorization can help).

In a past life my day job was in accelerating imaging and graphics operations on the GPU, and a lot of the development time goes into the tedious parts of putting data into optimal layouts in buffers and keeping overhead low when shuttling those buffers to/from the GPU.

One such task for SDE (that's likely a prerequisite for a WebGPU backend, but also might be helpful for C or JS backends for other reasons) is to allow code gen to use a single buffer for input/intermediate/output values (accessed by named indices) instead of global variables. I have a few other recommendations for changes in SDE that would be beneficial to implement before embarking on WebGPU support.
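
To make that idea concrete, here is a hedged sketch (with made-up variable names, not actual SDE code-gen output) of the single-buffer layout: each model variable gets a fixed offset into one flat array, which could later map directly onto a GPU storage buffer:

```ts
// Hedged sketch of the "single buffer, named indices" idea; all names are hypothetical.
// Each model variable gets a fixed offset into one flat Float64Array instead of
// living in its own global. The same layout could later back a GPU storage buffer.
const OFFSETS = {
  _time: 0,
  _initial_population: 1,
  _population: 2,
  _births: 3,
  _deaths: 4
} as const;

const NUM_SLOTS = Object.keys(OFFSETS).length;
const buf = new Float64Array(NUM_SLOTS);

// Generated equation code reads and writes through the offsets rather than globals.
function evalAux(): void {
  buf[OFFSETS._births] = buf[OFFSETS._population] * 0.02;
  buf[OFFSETS._deaths] = buf[OFFSETS._population] * 0.01;
}
```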

@jrissman
Author

jrissman commented May 1, 2023

That's good to know, Chris. Thanks.

I do not have the background to get into technical detail about WebGPU or SDE, but I can at least give a sense of where/how the EPS and similar SD models might take advantage of this capability:

The EPS has many non-time subscripts and does a lot of operations in parallel for many subscript elements that don't depend on each other. For instance, everything that happens in the transportation sector happens in parallel for 12 vehicle types, so if all 12 could be done simultaneously rather than in sequence, that could help. Similarly, in the Industry sector, there are around 25 industry categories, each of which is calculated independently of the others.

The electricity sector is a more extreme example, where the model may make separate calculations for up to 8,760 hours per annual timestep. No knowledge of any past hour is used in calculating the next hour. We moved away from simulating all 8,760 hours because the runspeed impacts were too severe, but with a parallel architecture, maybe it could be fast enough. As it is, we are still doing a few hundred hours (per timestep) sequentially, and this accounts for the majority of the total time required to run the EPS.

Each economic sector also runs independently of the others, so we could be doing the transportation sector calculations and the electricity sector calculations simultaneously. Even if we can only accelerate operations by parallelizing calculations for non-interacting subscript elements or non-interacting variables, that would be enough to allow the EPS to run many times faster.

Note that there is usually a SUM function at the end, such as adding up the emissions or energy use from all 12 vehicle types or from all 25 industry categories. So although the subscripts are calculated independently and could be done in parallel, there are operations at the end that need the results of all the subscripts to be available, usually summing things to make totals that will be displayed on the graph.
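
To make that shape concrete, here is a hedged sketch (the emissions formula and names are invented, not the actual EPS equations): each subscript element maps onto one GPU invocation, and the final SUM is a simple reduction over the per-element results read back on the CPU:

```ts
// Hedged sketch: one GPU invocation per vehicle type, then a CPU-side SUM.
// The emissions formula and variable names are invented for illustration only.
const NUM_VEHICLE_TYPES = 12;

const perSubscriptShader = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> vehicleKm: array<f32>;
  @group(0) @binding(1) var<storage, read> emissionsPerKm: array<f32>;
  @group(0) @binding(2) var<storage, read_write> emissions: array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x < arrayLength(&emissions)) {
      // Each vehicle type is computed independently, in parallel.
      emissions[gid.x] = vehicleKm[gid.x] * emissionsPerKm[gid.x];
    }
  }`;

// After dispatching the shader and reading `emissions` back (plumbing as in the
// earlier sketch), the SUM across subscript elements is a trivial CPU reduction:
function totalEmissions(perType: Float32Array): number {
  return perType.reduce((sum, x) => sum + x, 0);
}
```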

There are also cases where the elements of a subscript only interact at one or two points per timestep during the calculation flow. For example, there is a moment when new electric generating capacity to be constructed is allocated across 20 power plant types, and another moment when electricity demand is met by dispatching a mix of those 20 plant types. In these two allocation operations, each plant type influences the results for the other plant types. But apart from these two allocation operations, a lot of computation happens independently for each of the 20 plant types. So if it's possible to accelerate the portions of the code in between the allocation steps using a GPU, rather than requiring that the subscript elements never interact with each other, that would expand the relevance of WebGPU to cover essentially all non-time subscripts in almost any SD model.
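
A hedged sketch of how that structure might be encoded in WebGPU follows (the pipelines, shaders, and bind group are placeholders, not anything from SDE or the EPS): the independent per-plant-type work becomes parallel dispatches, and the serial allocation step in between can run as a single-invocation dispatch in the same compute pass, so nothing has to round-trip through the CPU between phases:

```ts
// Hedged sketch: parallel -> serial allocation -> parallel, all in one submission.
// The pipelines and bind group are assumed to have been created elsewhere (with a
// shared explicit pipeline layout so one bind group works for all three); the
// shaders themselves are placeholders.
function encodePlantTypeStep(
  device: GPUDevice,
  prePipeline: GPUComputePipeline,
  allocatePipeline: GPUComputePipeline,
  postPipeline: GPUComputePipeline,
  bindGroup: GPUBindGroup,
  numPlantTypes: number
): void {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();

  // Phase 1: independent per-plant-type calculations, one invocation each.
  pass.setPipeline(prePipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(numPlantTypes / 64));

  // Serial step: capacity allocation across all plant types, done by a single
  // invocation (a @workgroup_size(1) shader) since the elements interact here.
  pass.setPipeline(allocatePipeline);
  pass.dispatchWorkgroups(1);

  // Phase 2: back to independent per-plant-type calculations.
  pass.setPipeline(postPipeline);
  pass.dispatchWorkgroups(Math.ceil(numPlantTypes / 64));

  pass.end();
  device.queue.submit([encoder.finish()]);
}
```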

Anyway, I opened this issue just as a suggestion for consideration of a powerful new capability on a roadmap. It's not something I expected us to work on in the short term. It sounds like you want to do some other improvements to SDE that would facilitate eventual WebGPU support. If you do end up making that "small prototype of SDE compilation using WebGPU on the backend," I'd be very curious to learn what you discover.

@chrispcampbell
Contributor

@jrissman: Thanks for that detailed context, Jeff. It will be interesting to see what improvement WebGPU can provide for that use case. If you happen to have a standalone model available (or one that replicates the use case) that would be helpful, otherwise I'll get in touch with you and Todd later down the road (maybe for private access to that EPS variant) if I find time to look into this.

@jrissman
Author

@chrispcampbell: Yes, there is a standalone model available that reflects this use case. It's the Energy Policy Simulator. There is no need to give you private access because the model is open-source. I recommend testing it with the U.S. national model, using the latest version that is available when you work on WebGPU support. The latest version will always be available at this link:

https://docs.energypolicy.solutions/models/us#model-download

It should be compatible with SDE, since Todd helps add any features the Energy Policy Simulator needs to SDE, so it will be ready for testing WebGPU capabilities whenever you have time to work on this.

@jrissman
Author

Or if you want a version of the EPS that runs too slowly without WebGPU, you can use this old commit, which is from when we were trying to simulate all 8,760 hours per year, before we moved to representative timeslices as a speed optimization:

EnergyInnovation/eps-us@2e3acf8
