Might get significantly better performance using scanx instead of image/vector #334

rcoreilly · 2025-01-31T19:11:11Z

I ran some benchmarks of a modified version of your rasterizer and found that, for our GUI-based workload with lots of rounded rectangles for buttons, the image/vector rasterizer that you use (https://pkg.go.dev/golang.org/x/image/vector) was 500 to 1000 times(!) slower than scanx (https://github.com/srwiley/scanx). I replicated this using srwiley's ScanGV as well. I don't know if there is something fundamentally wrong in how we're using that rasterizer, but the performance was surprisingly bad.

Full details are here: cogentcore/core#1453 (comment)

FWIW my impl of the scanx rasterizer is here: https://github.com/cogentcore/core/blob/87b8f776975cbc126dcbbd2922eb82083c0cebc1/paint/renderers/_canvasrast/rasterizer.go

Meanwhile, thank you for all the amazing work you've done on this package and Go graphics more generally! We are refactoring our rendering framework for the Cogent Core GUI to try to get better performance on the web and mobile devices, and somehow only recently came across this package. We have translated your extensive and impressive path library to float32 and our math32 library, and plan on adapting the other back ends and some of the text formatting code as well. Our need for math32 and other differences make it impossible to use your packages directly, but we give you full credit in the relevant code! One thing I found useful was making the Path type a []float32 directly, instead of having it be a struct wrapper around that.

tdewolff · 2025-02-01T12:36:16Z

Thank you for your extensive work and benchmarks, and thanks for raising the issue here. I'm always looking to improve the performance of canvas. My previous tests showed that the image/vector implementation was the fastest (depending on the hardware for the ASM), and it implements a relatively new paper on rasterization performance. I'd be very surprised if an implementation exists that is significantly faster, let alone 1000 times. It's worth checking the performance of scanx again, though I suspect something else is at play here.

My first instinct would be to check if you reuse the rasterizer object to reuse its memory.
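As a minimal sketch of what "reusing the rasterizer object" buys, the snippet below uses a hypothetical `coverageBuffer` type (a stand-in for a rasterizer's per-pixel accumulation buffer, not the actual type in canvas, image/vector, or scanx). Its `Reset` method keeps the backing array whenever the capacity suffices, so drawing many same-sized shapes allocates only once.

```go
package main

import "fmt"

// coverageBuffer is a hypothetical stand-in for the per-pixel accumulation
// buffer that a rasterizer keeps between shapes.
type coverageBuffer struct {
	w, h   int
	acc    []float32
	allocs int // how often the backing array had to be (re)allocated
}

// Reset prepares the buffer for a w×h shape. The backing array is reused
// (and zeroed) when it is large enough, and reallocated otherwise.
func (b *coverageBuffer) Reset(w, h int) {
	n := w * h
	if n > cap(b.acc) {
		b.acc = make([]float32, n)
		b.allocs++
	} else {
		b.acc = b.acc[:n]
		for i := range b.acc {
			b.acc[i] = 0
		}
	}
	b.w, b.h = w, h
}

func main() {
	var buf coverageBuffer
	for i := 0; i < 1000; i++ {
		buf.Reset(120, 32) // one button-sized shape per iteration
	}
	fmt.Println("allocations:", buf.allocs)
}
```

Creating a fresh buffer inside the loop instead would allocate on every iteration, which is the pattern this check is meant to rule out.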

Secondly, stroking involves "fixing" the resulting path using Settle. This happens when the stroke starts overlapping itself (inner bend of corners or with other parts of the path), which is particularly the case for closed paths (since the middle will be a hole) and more urgent with the EvenOdd fill-rule. This relies on the path intersections code which was a nightmare to implement and I'm pretty sure is somewhat novel (there are very few robust implementations in public, maybe 3?). That code is O(n log n) and can cause a significant slowdown for long paths, even though the performance is hitting a practical limit of what is possible under the constraints. I'm confident this does not happen if you use scanx directly, but might give bad results for problematic paths / large stroke widths.

Thirdly, maybe you should compare lines only since all other path commands are flattened to that, unless your benchmark notices slow performance for those operations?

Fourthly, maybe the tolerance/precision for flattening is tuned for more accuracy in one of the benchmarks, which creates significantly more line segments. It would be nice to see if we're comparing apples to apples by checking that the input paths have exactly the same number of commands. You might check out SimplifyVisvalingamWhyatt to remove excess detail, or increase the flattening tolerance to begin with, to let's say 1/10th of a pixel. That assumes the path is entirely inside the image; otherwise, use FastClip to clip away a great part of the paths before doing anything else. Using a combination of those techniques allowed me to draw vector tiles of the earth in about 20 min (looks like a record), but granted that's without the rasterizer.
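To make the tolerance point concrete, here is a rough sketch of how the segment count of a flattened rounded rectangle grows as the tolerance shrinks. It uses the textbook chord-deviation (sagitta) bound for circular arcs, which is an assumption for illustration and not necessarily the flattening rule that canvas or scanx applies; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// arcSegments returns how many straight segments are needed to approximate a
// circular arc of the given radius and sweep angle (radians) so that no chord
// deviates from the arc by more than tol.
func arcSegments(radius, sweep, tol float64) int {
	if tol >= radius {
		return 1 // a single chord is already within tolerance
	}
	// Largest angle whose chord stays within tol of the circle.
	maxAngle := 2 * math.Acos(1-tol/radius)
	return int(math.Ceil(sweep / maxAngle))
}

// roundedRectSegments counts the line segments of a flattened rounded
// rectangle: four quarter-circle corners plus four straight edges.
func roundedRectSegments(radius, tol float64) int {
	return 4*arcSegments(radius, math.Pi/2, tol) + 4
}

func main() {
	for _, tol := range []float64{0.5, 0.1, 0.01} {
		fmt.Printf("tolerance %.2f: %d segments\n", tol, roundedRectSegments(10, tol))
	}
}
```

Under this bound the segment count grows roughly with the inverse square root of the tolerance, so comparing two benchmarks is only fair when both flatten with the same tolerance.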

I'm sure you've already ruled out various problems as the benchmarks looked sophisticated, but would be nice to get your perspective before diving in.

Regarding the float32 point, I agree that removing the struct would be better. I was looking at improving the way that paths are stored and have an idea that would save about 50% of the memory used for storage. A second idea could be using generics to set the underlying type to float64 or float32 and save another ~50% of memory for low-precision applications. I'm not sure what the implications are for the many algorithms, some of which are tricky regarding floating point accuracy, but it's worth looking into.

And regarding text use, I believe the current implementation is quite complete. The only thing lacking is handling fall-back fonts for missing glyphs. Is that what you're looking for, or do you believe something else is missing?

Thanks again for the great work!

EDIT: perhaps we should create a FastStroke global variable that skips the Settle operation. I'm pretty sure a great part of the 1204 ms is spent on that. Did you check the pprof profiles?

rcoreilly · 2025-02-01T19:59:15Z

> My first instinct would be to check if you reuse the rasterizer object to reuse its memory.

Yep, we reuse it.

> Secondly, stroking involves "fixing" the resulting path using Settle...

Adding a FastStroke would probably be good. In terms of the overall strategy here: it is probably faster to simply paint a wider path around a central line than to figure out how to make a well-formed closed path out of it, as you are doing. Your algorithm would be awesome for a vector drawing program for turning a path into a shape, but I think it is overkill for just rendering.

> Thirdly, maybe you should compare lines only since all other path commands are flattened to that, unless your benchmark notices slow performance for those operations?

Our goal was to test "real world" rendering for our GUI, so this is that. It really is just a ton of calls to the RoundedRect which is just lines and arcs for the corners. We implement shadows using alpha blended versions of these rects. There may be something entirely perverse from a rasterizing perspective for how that works.

> Fourthly, maybe the tolerance/precision for flattening is tuned for more accuracy in one of the benchmarks..

I had those at default levels: 0.1 for tolerance. I did play around with those and didn't see much difference.

> A second idea could be using generics to set the underlying type to float64 or float32 and save another ~50% of memory for low-precision applications. I'm not sure what the implications are with the many algorithms, some of which are tricky regarding floating point accuracy, but worth looking into.

The tricky thing with generics would be requiring a fully generic math library, and presumably then forgoing the ASM optimizations that are in place for those, which I assume are specific to 32 vs 64.
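A small sketch of that friction, assuming a hypothetical `Float` constraint and `Point` type (neither is part of canvas or math32): the standard `math` package only accepts float64, so a generic routine has to round-trip every value through float64 unless a precision-specific math library is supplied per type.

```go
package main

import (
	"fmt"
	"math"
)

// Float is a hypothetical constraint covering both precisions.
type Float interface {
	~float32 | ~float64
}

// Point is a 2D point whose precision is chosen by the caller.
type Point[T Float] struct{ X, Y T }

// PolylineLength sums the segment lengths of a polyline. math.Hypot only
// takes float64, so each call converts to float64 and back regardless of T.
func PolylineLength[T Float](pts []Point[T]) T {
	var sum T
	for i := 1; i < len(pts); i++ {
		dx := float64(pts[i].X - pts[i-1].X)
		dy := float64(pts[i].Y - pts[i-1].Y)
		sum += T(math.Hypot(dx, dy))
	}
	return sum
}

func main() {
	p32 := []Point[float32]{{0, 0}, {3, 4}, {3, 10}}
	p64 := []Point[float64]{{0, 0}, {3, 4}, {3, 10}}
	fmt.Println(PolylineLength(p32), PolylineLength(p64)) // 11 11
}
```

The storage saving for float32 still applies with this approach; what is lost is the ability to call 32-bit-specific math routines directly from the generic code.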

All your existing tests pass OK with float32, but I did have to lower the testing tolerances a bit. But if Skia can get away with float32, then it seems like it might be reasonable enough overall.

Regarding the basic issues with image/vector: the key point is that we got the same terrible performance using the ScanGV backend for rasterx as with your rasterizer, so it really seems to be something about that and not about the upstream inputs. I've attached the cpu.prof profile for each case.

We set the clip boundaries for each render path to contain the thing being rendered, but all of the time seems to be in the rasterizeDstRGBASrcUniformOpOver function that goes over the entire rect area. Anyway, one could presumably optimize whatever is going wrong there, but for our purposes, scanx is working well enough and we've detected no issues with its output on a range of svg outputs.

profile_canvas_scanx.pdf
profile_canvas_vector.pdf
profile_rasterx_scangv.pdf

tdewolff · 2025-02-02T13:16:25Z

You're right, I'm seeing a significant speedup for a test case while generating the exact same result (see commits above). In fact, running it for various output image sizes, it looks like the execution time of image/vector is linear in the number of pixels, while scanx scales sublinearly (it looks like log n). For 44 million pixels, image/vector takes 493 ms while scanx takes only 52 ms; for 0.1 million pixels it's 1.5 ms vs 0.8 ms respectively.

This is really surprising, given that scanx doesn't even use ASM. Does it use a smarter algorithm? It would be nice to merge that code and provide fast ASM paths for RGBA and NRGBA images.

EDIT: it would've been nice if there was a paper it was based on or some other reference...
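As a quick way to quantify "linear" versus "sublinear" from the timings quoted in this comment, the sketch below fits a power law t ∝ n^k through the two measurements for each rasterizer. Two data points cannot distinguish between growth laws, so treat the exponents as a rough summary only; `scalingExponent` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// scalingExponent returns k such that t is proportional to n^k, estimated
// from two (pixel count, time) measurements.
func scalingExponent(n1, t1, n2, t2 float64) float64 {
	return math.Log(t2/t1) / math.Log(n2/n1)
}

func main() {
	// Timings from the thread: 0.1 and 44 million pixels, in milliseconds.
	vec := scalingExponent(0.1e6, 1.5, 44e6, 493)
	scx := scalingExponent(0.1e6, 0.8, 44e6, 52)
	fmt.Printf("image/vector: k = %.2f\n", vec)
	fmt.Printf("scanx:        k = %.2f\n", scx)
}
```

An exponent close to 1 matches the linear behaviour described for image/vector, while a clearly smaller exponent for scanx matches the observation that it scales better than linearly with pixel count.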

tdewolff · 2025-02-03T20:13:09Z

I've added the FastStroke option, but you may need to test it.

tdewolff · 2025-02-06T13:57:35Z

I've switched to the scanx rasterizer. In the future we should improve its implementation and integrate better with canvas, perhaps adding SIMD versions for common architectures and image types (both for RGBA and NRGBA) which would enhance performance even more. Thanks again for the investigation.

rcoreilly · 2025-02-07T19:47:20Z

Great! SIMD versions would be amazing! BTW, at some point I will also try to implement rasterization for WebGPU, which is our GPU backend in Cogent Core, perhaps going so far as to implement the vello framework (https://github.com/linebender/vello) in Go. I'll let you know when we get to it, so you can perhaps port it back to canvas as another backend.

tdewolff · 2025-02-09T14:08:26Z

Great, vello in Go would be an amazing addition!

I did some preliminary work on this some years ago. What is needed is a tessellator to divide up the filled regions into triangles. Should not be too hard to do? I also wanted to merge signed-field shaders for quadratic/cubic beziers to make fonts look great at any size; I imagine this may be somewhat more difficult.

Happy to hear about any progress you make!

tdewolff added a commit that referenced this issue Feb 2, 2025

Add scanx benchmarks, see #334

24ed619

tdewolff added a commit that referenced this issue Feb 2, 2025

Fix benchmark, see #334

740de25

tdewolff added a commit that referenced this issue Feb 3, 2025

Add FastStroke option to skip Settle on Path.Offset and Path.Stroke, …

59be125

…see #334

tdewolff closed this as completed in e419d36 Feb 6, 2025

This was referenced Feb 7, 2025

canvas.pathIntersections index out of range #280

Closed

Rasterization performance #47

Closed
