Fast Function Approximations lowering. #8566

Open · mcourteaux wants to merge 53 commits into base: main

Conversation

@mcourteaux (Contributor) commented Feb 8, 2025

The big transcendental lowering update! Replaces #8388.

TODO

I still have to do:

  • Validate that everything is fine on the build bots.

Overview:

  • Fast transcendentals implemented for: sin, cos, tan, atan, exp, log, tanh.

  • Simple API to specify precision requirements. A default-initialised precision (AUTO without constraints) means "don't care about precision, as long as it's reasonable and fast", which gives you the highest chance of selecting a high-performance implementation based on hardware instructions. The optimization objectives MULPE (max ULP error) and MAE (max absolute error) are available. Compared to the previous PR, I removed MULPE_MAE as I didn't see a good purpose for it.

  • Tabulated info on the precision and speed of intrinsics and native functions, used to select a lowering that is definitely not slower while still satisfying the precision requirements.

    • OpenCL: lower to native_cos, native_exp, etc...
    • Metal: lower to fast::cos, fast::exp, etc...
    • CUDA: lower to dedicated PTX instructions
    • When no fast hardware versions are available: polynomial approximations.
  • Performance tests validating that:

    • the AUTO versions are, at the very least, always faster.
    • all functions that are known to be faster are indeed faster.
  • Accuracy tests validating that:

    • the AUTO versions are at least reasonably precise (error of at most 1e-4).
    • all polynomials satisfy the precision they advertise on their non-range-reduced interval (a sketch of such a check follows this list).
  • Drive-by fix for adding libOpenCL.so.1 to the list of tested sonames for the OpenCL runtime.
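For concreteness, here is a minimal sketch (not the PR's actual test harness) of the kind of accuracy check meant above: it measures the maximum absolute error of fast_sin against sin. The sampled interval [-pi/2, pi/2] and the 1e-5 MAE constraint are assumptions chosen for illustration.

#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main() {
    const int N = 1 << 20;
    Var i;

    // Sample what is assumed here to be the non-range-reduced interval of
    // the sine approximation, roughly [-pi/2, pi/2].
    Expr x = (i / float(N) - 0.5f) * 3.14159265f;

    // Error of the fast approximation against the precise implementation.
    Func err;
    err(i) = abs(sin(x) - fast_sin(x, {ApproximationPrecision::MAE, 0, 1e-5f}));

    // Reduce to the maximum absolute error over all samples.
    RDom r(0, N);
    Func max_err;
    max_err() = maximum(err(r));

    Buffer<float> result = max_err.realize();
    printf("max absolute error of fast_sin: %g\n", result());
    return 0;
}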

Review guide

  • I pass the ApproximationPrecision parameters as a Call::make_struct node with 4 arguments (see API below). This approximation-precision Call node survives until the lowering pass in which the transcendentals are lowered; there, the values are extracted again from the Call node's arguments. Conceptually I like that, this way, they are bundled and clearly not at the same level as the actual mathematical arguments. Is this a good approach? For this to work, I had to stop CSE from extracting those precision arguments, and StrictifyFloat from recursing down into that struct and littering strict_float on those numbers. I have seen the Call::bundle intrinsic; perhaps that one is better suited for this purpose? (A rough sketch of the bundling approach follows this list.) @abadams
  • I tried to design the API such that it would also be compatible with Float(16) and Float(64), but those are not yet implemented or tested. The polynomial approximations should work correctly (although untested) for these other data types.
  • The intrinsics table and the behavior of those intrinsics (MULPE/MAE precision) were measured on devices I have available (and on the build bots). On some backends (such as OpenCL and Vulkan) these intrinsics have implementation-defined behavior. This probably means it's AMD or NVIDIA that gets to implement them and determine the precision. I do not have any AMD GPU available to test the OpenCL and Vulkan backends on to see how these functions behave. I have realized that, for example, Vulkan's native_tan() compiles to the same three instructions as I implemented on CUDA: sin.approx.f32, cos.approx.f32, div.approx.f32. I haven't investigated AMD's documentation on available hardware instructions.
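A rough sketch of the bundling approach described in the first bullet above; the helper names are invented for illustration and the fields follow the API below, so this shows the shape of the idea rather than the PR's actual code:

#include "Halide.h"

using namespace Halide;
using namespace Halide::Internal;

// Pack the four precision fields into a single make_struct Call node that
// rides along as an extra argument of the fast_* call until lowering.
Expr bundle_precision(const ApproximationPrecision &p) {
    return Call::make(type_of<void *>(), Call::make_struct,
                      {make_const(Int(32), (int)p.optimized_for),
                       make_const(Int(32), p.constraint_max_ulp_error),
                       make_const(Float(32), p.constraint_max_absolute_error),
                       make_const(Int(32), p.force_halide_polynomial)},
                      Call::PureIntrinsic);
}

// In the transcendental-lowering pass, the fields are recovered from the
// nested Call node's arguments (error handling omitted).
ApproximationPrecision unbundle_precision(const Call *bundle) {
    ApproximationPrecision p;
    p.optimized_for = (ApproximationPrecision::OptimizationObjective)(*as_const_int(bundle->args[0]));
    p.constraint_max_ulp_error = (int)(*as_const_int(bundle->args[1]));
    p.constraint_max_absolute_error = (float)(*as_const_float(bundle->args[2]));
    p.force_halide_polynomial = (int)(*as_const_int(bundle->args[3]));
    return p;
}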

API

struct ApproximationPrecision {
    enum OptimizationObjective {
        AUTO,   //< No preference, but favor speed.
        MAE,    //< Optimized for Max Absolute Error.
        MULPE,  //< Optimized for Max ULP Error. ULP is "Units in Last Place", when represented in IEEE 32-bit floats.
    } optimized_for{AUTO};

    /**
     * Most function approximations have a range where the approximation works
     * natively (typically close to zero), without any range reduction tricks
     * (e.g., exploiting symmetries, repetitions). You may specify a maximal
     * absolute error or maximal units in last place error, which will be
     * interpreted as the maximal absolute error within this native range of the
     * approximation. This will be used as a hint as to which implementation to
     * use.
     */
    // @{
    int constraint_max_ulp_error{0};
    float constraint_max_absolute_error{0.0f};
    // @}

    /**
     * For most functions, Halide has a built-in table of polynomial
     * approximations. However, some targets have specialized instructions or
     * intrinsics available that produce an even faster approximation.
     * Setting this integer to a non-zero value will force Halide to use a
     * polynomial with at least this many terms, instead of specialized
     * device-specific code. This is still combinable with the other
     * constraints.
     * This is mostly useful for testing and benchmarking.
     */
    int force_halide_polynomial{0};
};

Expr fast_sin(const Expr &x, ApproximationPrecision precision = {});
Expr fast_cos(const Expr &x, ApproximationPrecision precision = {});
Expr fast_tan(const Expr &x, ApproximationPrecision precision = {});
Expr fast_atan(const Expr &x, ApproximationPrecision precision = {});
Expr fast_atan2(const Expr &y, const Expr &x, ApproximationPrecision precision = {});
Expr fast_log(const Expr &x, ApproximationPrecision precision = {});
Expr fast_exp(const Expr &x, ApproximationPrecision precision = {});
Expr fast_pow(Expr x, Expr y, ApproximationPrecision precision = {});
Expr fast_tanh(const Expr &x, ApproximationPrecision precision = {});

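A hypothetical usage sketch of the API above; the specific constraint values (50 ULP, 1e-5, 4 polynomial terms) are made up for illustration:

Func f;
Var x;
Expr t = x / 1000.0f;

// Default precision: AUTO without constraints. This is eligible for the
// fastest implementation, including native/hardware instructions.
Expr a = fast_exp(t);

// Constrain the maximum ULP error (on the non-range-reduced interval) to 50;
// the lowering picks the cheapest implementation that satisfies it.
Expr b = fast_exp(t, {ApproximationPrecision::MULPE, 50});

// Constrain the maximum absolute error to 1e-5 and force a Halide polynomial
// of at least 4 terms instead of device-specific intrinsics.
Expr c = fast_atan(t, {ApproximationPrecision::MAE, 0, 1e-5f, 4});

f(x) = a + b + c;
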
@mcourteaux mcourteaux marked this pull request as draft February 8, 2025 21:36
@mcourteaux mcourteaux requested a review from abadams February 10, 2025 18:11
@mcourteaux mcourteaux marked this pull request as ready for review February 10, 2025 18:12
@@ -38,6 +38,7 @@ extern "C" WEAK void *halide_opencl_get_symbol(void *user_context, const char *n
"opencl.dll",
#else
"libOpenCL.so",
"libOpenCL.so.1",

Member commented:
I wonder if it's the case that libOpenCL.so.1 should rather replace libOpenCL.so? The latter is a namelink that's only present when -dev packages are installed. It should always point to the former.

@mcourteaux (Contributor, Author) commented Feb 10, 2025:
I was thinking the same. I can fix it later, but I needed this on my local machine, so, to avoid being too destructive without consensus, I did it this way.

@mcourteaux (Contributor, Author) commented:
I removed it. We'll see what the build bots do.

@mcourteaux added the enhancement, performance, gpu, and release_notes labels Feb 10, 2025
…optimization. Greatly improved accuracy testing framework.
…st not touching input: prevents constant folding.
@mcourteaux added the skip_buildbots (synonym for buildbot_test_nothing) label Feb 11, 2025