Might get significantly better performance using scanx instead of image/vector #334

Closed
rcoreilly opened this issue Jan 31, 2025 · 7 comments

@rcoreilly

I benchmarked a modified version of your rasterizer and found that, for our GUI-based workload with lots of rounded rectangles for buttons, the https://pkg.go.dev/golang.org/x/image/vector rasterizer that you use was 500 to 1000 times(!) slower than https://github.com/srwiley/scanx. I replicated this using srwiley's ScanGV as well. I don't know if there is something fundamentally wrong in how we're using that rasterizer, but the performance gap was really surprising.

Full details are here: cogentcore/core#1453 (comment)

FWIW my impl of the scanx rasterizer is here: https://github.com/cogentcore/core/blob/87b8f776975cbc126dcbbd2922eb82083c0cebc1/paint/renderers/_canvasrast/rasterizer.go

Meanwhile, thank you for all the amazing work you've done on this package and Go graphics more generally! We are refactoring our rendering framework for the Cogent Core GUI to try to get better performance on the web and mobile devices, and somehow only recently came across this package. We have translated your extensive and impressive path library into float32 and our math32 library, and plan on adapting the other back ends and some of the text formatting code as well. Our need for math32 and other differences make it impossible to use your packages directly, but we give you full credit in the relevant code! One thing I found useful was making the Path type a []float32 directly, instead of having it be a struct wrapper around that.

@tdewolff
Owner

tdewolff commented Feb 1, 2025

Thank you for your extensive work and benchmarks, and thanks for raising the issue here. I'm always looking to improve the performance of canvas. My previous tests did show that the image/vector implementation was the fastest (depending on the hardware for the ASM), and it implements a relatively new paper on rasterization performance. I'd be very surprised if an implementation exists that is significantly faster, let alone 1000 times. But it's worth checking the performance of scanx again, though I suspect something else is at play here.

My first instinct would be to check if you reuse the rasterizer object to reuse its memory.
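To make the reuse point concrete, here is a minimal sketch of what "reusing the rasterizer's memory" can look like. The `scratch` type and its `Reset` method are hypothetical stand-ins for a rasterizer's per-pixel buffer, not the API of image/vector, scanx, or canvas; the only assumption is that a rasterizer keeps some w×h scratch buffer between draws.

```go
package main

import "fmt"

// scratch is a hypothetical stand-in for a rasterizer's per-pixel
// accumulation buffer. It is not the API of any real rasterizer.
type scratch struct {
	w, h int
	buf  []float32
}

// Reset prepares the buffer for a w×h draw. It allocates only when the
// existing capacity is too small; otherwise it reslices and zeroes in place.
func (s *scratch) Reset(w, h int) {
	n := w * h
	if n > cap(s.buf) {
		s.buf = make([]float32, n)
	} else {
		s.buf = s.buf[:n]
		for i := range s.buf {
			s.buf[i] = 0
		}
	}
	s.w, s.h = w, h
}

func main() {
	var s scratch
	s.Reset(256, 256) // first call allocates
	before := cap(s.buf)
	s.buf[0] = 1

	s.Reset(64, 32) // smaller draw: same backing array, cleared
	fmt.Println(cap(s.buf) == before, len(s.buf), s.buf[0]) // true 2048 0
}
```

Creating a fresh object per button would instead allocate and zero a new buffer on every draw, which is the cost this check is meant to rule out.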

Secondly, stroking involves "fixing" the resulting path using Settle. This is needed when the stroke starts overlapping itself (inner bend of corners or with other parts of the path), which is particularly the case for closed paths (since the middle will be a hole) and more urgent with the EvenOdd fill-rule. This relies on the path intersections code, which was a nightmare to implement and which I'm pretty sure is somewhat novel (there are very few robust public implementations, maybe 3?). That code is O(n log n) and can cause a significant slowdown for long paths, even though the performance is hitting a practical limit of what is possible under the constraints. I'm confident this step does not happen if you use scanx directly, but skipping it might give bad results for problematic paths / large stroke widths.

Thirdly, maybe you should compare lines only since all other path commands are flattened to that, unless your benchmark notices slow performance for those operations?

Fourthly, maybe the tolerance/precision for flattening is tuned for more accuracy in one of the benchmarks, which creates significantly more line segments. It would be nice to see if we're comparing apples to apples by checking if the input paths have the exact same number of commands. You might check out SimplifyVisvalingamWhyatt to remove excess detail, or increase the tolerance for path flattening to begin with, to let's say 1/10th of a pixel. That assumes the path is entirely inside the image; otherwise use FastClip to clip away a large part of the path before doing anything else. Using a combination of those techniques allowed me to draw vector tiles of the earth in about 20 min (looks like a record), but granted that's without the rasterizer.

I'm sure you've already ruled out various problems as the benchmarks looked sophisticated, but would be nice to get your perspective before diving in.


Regarding the float32 point, I agree that removing the struct would be better. I was looking at improving the way that paths are stored and have an idea that would save about 50% of memory for storage. A second idea could be using generics to set the underlying type to float64 or float32 and save another ~50% of memory for low-precision applications. I'm not sure what the implications are for the many algorithms, some of which are tricky regarding floating point accuracy, but it's worth looking into.
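A minimal sketch of the generics idea, assuming a simplified path that is just a flat slice of x,y pairs. The `Float` constraint, the `Path[T]` type, and `Length` are hypothetical and are not the canvas API; the real path encoding also stores commands, which are left out here.

```go
package main

import (
	"fmt"
	"math"
	"unsafe"
)

// Float is the set of element types a path could be stored in.
type Float interface {
	~float32 | ~float64
}

// Path is a hypothetical, simplified path: a bare slice of x,y pairs
// forming a polyline, with no struct wrapper around it.
type Path[T Float] []T

// Length sums the segment lengths. The standard math package only works on
// float64, so each difference is converted up and the result converted back.
func (p Path[T]) Length() T {
	var sum float64
	for i := 2; i+1 < len(p); i += 2 {
		dx := float64(p[i] - p[i-2])
		dy := float64(p[i+1] - p[i-1])
		sum += math.Hypot(dx, dy)
	}
	return T(sum)
}

func main() {
	p32 := Path[float32]{0, 0, 3, 4, 3, 10}
	p64 := Path[float64]{0, 0, 3, 4, 3, 10}
	fmt.Println(p32.Length(), p64.Length())                 // 11 11
	fmt.Println(unsafe.Sizeof(p32[0]), unsafe.Sizeof(p64[0])) // 4 8
}
```

The storage halves with float32, but every call into `math` goes through float64, so a generic version trades memory for conversions unless a type-specific math library is plugged in.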

And regarding text use, I believe the current implementation is quite complete. The only thing lacking is handling fall-back fonts for missing glyphs. Is that what you're looking for, or do you believe something else is missing?

Thanks again for the great work!

EDIT: perhaps we should create a FastStroke global variable that skips the Settle operation. I'm pretty sure a large part of the 1204 ms is spent on that. Did you check the pprof profiles?

@rcoreilly
Author

> My first instinct would be to check if you reuse the rasterizer object to reuse its memory.

Yep, we reuse it.

> Secondly, stroking involves "fixing" the resulting path using Settle...

Adding a FastStroke would probably be good. In terms of the overall strategy here: it is probably faster to simply paint a wider path around a central line than to figure out how to make a well-formed closed path out of it as you are doing. Your algorithm would be awesome for a vector drawing program for turning a path into a shape, but I think it is overkill for just rendering.

> Thirdly, maybe you should compare lines only since all other path commands are flattened to that, unless your benchmark notices slow performance for those operations?

Our goal was to test "real world" rendering for our GUI, so this is that. It really is just a ton of calls to RoundedRect, which is just lines plus arcs for the corners. We implement shadows using alpha-blended versions of these rects. There may be something entirely perverse, from a rasterizing perspective, about how that works.

> Fourthly, maybe the tolerance/precision for flattening is tuned for more accuracy in one of the benchmarks..

I had those at default levels: 0.1 for tolerance. I did play around with those and didn't see much difference.

> A second idea could be using generics to set the underlying type to float64 or float32 and save another ~50% of memory for low-precision applications. I'm not sure what the implications are with the many algorithms, some of which are tricky regarding floating point accuracy, but worth looking into.

The tricky thing with generics would be requiring a fully generic math library, and presumably then forgoing the ASM optimizations that are in place for those, which I assume are specific to 32 vs 64.

All your existing tests pass OK with float32, but I did have to loosen the testing tolerances a bit. If Skia can get away with float32, then it seems like it might be reasonable enough overall.

Regarding the basic issues with image/vector: the key point is that we got the same terrible performance using the ScanGV backend for rasterx as with your rasterizer, so it really seems to be something about that and not about the upstream inputs. I've attached the cpu.prof profile for each case.

We set the clip boundaries for each render path to contain the thing being rendered, but all of the time seems to be in the rasterizeDstRGBASrcUniformOpOver function that goes over the entire rect area. Anyway, one could presumably optimize whatever is going wrong there, but for our purposes, scanx is working well enough and we've detected no issues with its output on a range of svg outputs.

profile_canvas_scanx.pdf
profile_canvas_vector.pdf
profile_rasterx_scangv.pdf

tdewolff added a commit that referenced this issue Feb 2, 2025
tdewolff added a commit that referenced this issue Feb 2, 2025
@tdewolff
Owner

tdewolff commented Feb 2, 2025

You're right, I'm seeing a significant speedup for a test case while generating the exact same result (see commits above). In fact, running with various output image sizes, it looks like image/vector is linear in execution time with the number of pixels, while scanx is sub-linear, looks like log n. For 44 million pixels, image/vector takes 493 ms while scanx takes only 52 ms; for 0.1 million pixels it's 1.5 ms vs 0.8 ms respectively.

This is really surprising, given that scanx doesn't even use ASM. Does it use a smarter algorithm? It would be nice to merge that code and provide fast ASM paths for RGBA and NRGBA images.

EDIT: it would've been nice if there was a paper it was based on or some other reference...

tdewolff added a commit that referenced this issue Feb 3, 2025
@tdewolff
Owner

tdewolff commented Feb 3, 2025

I've added the FastStroke option, but you may need to test it.

@tdewolff
Owner

tdewolff commented Feb 6, 2025

I've switched to the scanx rasterizer. In the future we should improve its implementation and integrate it better with canvas, perhaps adding SIMD versions for common architectures and image types (both RGBA and NRGBA), which would improve performance even more. Thanks again for the investigation.

@rcoreilly
Author

Great! SIMD versions would be amazing! BTW, at some point I will also try to implement rasterization for WebGPU, which is our GPU backend in Cogent Core, perhaps going so far as to implement this framework in Go: https://github.com/linebender/vello. I'll let you know when we get to it, so you can perhaps port it back to canvas as another backend.

@tdewolff
Owner

tdewolff commented Feb 9, 2025

Great, vello in Go would be an amazing addition!

I did some preliminary work on this some years ago. What is needed is a tessellator to divide the filled regions into triangles. That should not be too hard to do? I also wanted to merge signed-field shaders for quadratic/cubic Béziers to make fonts look great at any size; I imagine this may be somewhat more difficult.

Happy to hear about any progress you make!
