Fast Function Approximations lowering. #8566

Open · mcourteaux wants to merge 53 commits into base: main

Conversation

@mcourteaux (Contributor) commented Feb 8, 2025

The big transcendental lowering update! Replaces #8388.

TODO

I still have to do:

  • Validate that everything is fine on the build bots.

Overview:

  • Fast transcendentals implemented for: sin, cos, tan, atan, exp, log, tanh.

  • Simple API to specify precision requirements. A default-initialised precision (AUTO without constraints) means "don't care about precision, as long as it's reasonable and fast", which gives you the highest chance of selecting a high-performance implementation based on hardware instructions. The optimization objectives MULPE (max ULP error) and MAE (max absolute error) are available. Compared to the previous PR, I removed MULPE_MAE as I didn't see a good purpose for it.

  • Tabulated info on the precision and speed of intrinsics and native functions, used to select a lowering that is definitely not slower while still satisfying the precision requirements.

    • OpenCL: lower to native_cos, native_exp, etc...
    • Metal: lower to fast::cos, fast::exp, etc...
    • CUDA: lower to dedicated PTX instructions
    • When no fast hardware versions are available: polynomial approximations.
  • Performance tests validating that:

    • the AUTO versions are, at the very least, always faster.
    • all functions that are known to be faster are indeed faster.
  • Accuracy tests validating that:

    • the AUTO versions are at least reasonably precise (error of at most 1e-4).
    • all polynomials satisfy the precision they advertise on their non-range-reduced interval (a sketch of such a check follows this list).
  • Drive-by fix for adding libOpenCL.so.1 to the list of tested sonames for the OpenCL runtime.
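For concreteness, here is a minimal sketch (not the PR's actual test harness) of the kind of accuracy check meant above: it measures the maximum absolute error of fast_sin against sin. The sampled interval [-pi/2, pi/2] and the 1e-5 MAE constraint are assumptions chosen for illustration.

#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main() {
    const int N = 1 << 20;
    Var i;

    // Sample what is assumed here to be the non-range-reduced interval of
    // the sine approximation, roughly [-pi/2, pi/2].
    Expr x = (i / float(N) - 0.5f) * 3.14159265f;

    // Error of the fast approximation against the precise implementation.
    Func err;
    err(i) = abs(sin(x) - fast_sin(x, {ApproximationPrecision::MAE, 0, 1e-5f}));

    // Reduce to the maximum absolute error over all samples.
    RDom r(0, N);
    Func max_err;
    max_err() = maximum(err(r));

    Buffer<float> result = max_err.realize();
    printf("max absolute error of fast_sin: %g\n", result());
    return 0;
}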

Review guide

  • I pass the ApproximationPrecision parameters as a Call::make_struct node with 4 arguments (see API below). This approximation-precision Call node survives until the lowering pass in which the transcendentals are lowered; there, the values are extracted again from the Call node's arguments. Conceptually I like that, this way, they are bundled and clearly not at the same level as the actual mathematical arguments. Is this a good approach? For this to work, I had to stop CSE from extracting those precision arguments, and StrictifyFloat from recursing down into that struct and littering strict_float on those numbers. I have seen the Call::bundle intrinsic; perhaps that one is better suited for this purpose? (A rough sketch of the bundling approach follows this list.) @abadams
  • I tried to design the API such that it would also be compatible with Float(16) and Float(64), but those are not yet implemented or tested. The polynomial approximations should work correctly (although untested) for these other data types.
  • The intrinsics table and the behavior of those intrinsics (MULPE/MAE precision) were measured on devices I have available (and on the build bots). On some backends (such as OpenCL and Vulkan) these intrinsics have implementation-defined behavior. This probably means it's AMD or NVIDIA that gets to implement them and determine the precision. I do not have any AMD GPU available to test the OpenCL and Vulkan backends on to see how these functions behave. I have realized that, for example, Vulkan's native_tan() compiles to the same three instructions as I implemented on CUDA: sin.approx.f32, cos.approx.f32, div.approx.f32. I haven't investigated AMD's documentation on available hardware instructions.
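A rough sketch of the bundling approach described in the first bullet above; the helper names are invented for illustration and the fields follow the API below, so this shows the shape of the idea rather than the PR's actual code:

#include "Halide.h"

using namespace Halide;
using namespace Halide::Internal;

// Pack the four precision fields into a single make_struct Call node that
// rides along as an extra argument of the fast_* call until lowering.
Expr bundle_precision(const ApproximationPrecision &p) {
    return Call::make(type_of<void *>(), Call::make_struct,
                      {make_const(Int(32), (int)p.optimized_for),
                       make_const(Int(32), p.constraint_max_ulp_error),
                       make_const(Float(32), p.constraint_max_absolute_error),
                       make_const(Int(32), p.force_halide_polynomial)},
                      Call::PureIntrinsic);
}

// In the transcendental-lowering pass, the fields are recovered from the
// nested Call node's arguments (error handling omitted).
ApproximationPrecision unbundle_precision(const Call *bundle) {
    ApproximationPrecision p;
    p.optimized_for = (ApproximationPrecision::OptimizationObjective)(*as_const_int(bundle->args[0]));
    p.constraint_max_ulp_error = (int)(*as_const_int(bundle->args[1]));
    p.constraint_max_absolute_error = (float)(*as_const_float(bundle->args[2]));
    p.force_halide_polynomial = (int)(*as_const_int(bundle->args[3]));
    return p;
}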

API

struct ApproximationPrecision {
    enum OptimizationObjective {
        AUTO,   //< No preference, but favor speed.
        MAE,    //< Optimized for Max Absolute Error.
        MULPE,  //< Optimized for Max ULP Error. ULP is "Units in Last Place", when represented in IEEE 32-bit floats.
    } optimized_for{AUTO};

    /**
     * Most function approximations have a range where the approximation works
     * natively (typically close to zero), without any range reduction tricks
     * (e.g., exploiting symmetries, repetitions). You may specify a maximal
     * absolute error or maximal units in last place error, which will be
     * interpreted as the maximal absolute error within this native range of the
     * approximation. This will be used as a hint as to which implementation to
     * use.
     */
    // @{
    int constraint_max_ulp_error{0};
    float constraint_max_absolute_error{0.0f};
    // @}

    /**
     * For most functions, Halide has a built-in table of polynomial
     * approximations. However, some targets have specialized instructions or
     * intrinsics available that produce an even faster approximation.
     * Setting this integer to a non-zero value will force Halide to use a
     * polynomial with at least this many terms, instead of specialized
     * device-specific code. This is still combinable with the other
     * constraints.
     * This is mostly useful for testing and benchmarking.
     */
    int force_halide_polynomial{0};
};

Expr fast_sin(const Expr &x, ApproximationPrecision precision = {});
Expr fast_cos(const Expr &x, ApproximationPrecision precision = {});
Expr fast_tan(const Expr &x, ApproximationPrecision precision = {});
Expr fast_atan(const Expr &x, ApproximationPrecision precision = {});
Expr fast_atan2(const Expr &y, const Expr &x, ApproximationPrecision precision = {});
Expr fast_log(const Expr &x, ApproximationPrecision precision = {});
Expr fast_exp(const Expr &x, ApproximationPrecision precision = {});
Expr fast_pow(Expr x, Expr y, ApproximationPrecision precision = {});
Expr fast_tanh(const Expr &x, ApproximationPrecision precision = {});

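A hypothetical usage sketch of the API above; the specific constraint values (50 ULP, 1e-5, 4 polynomial terms) are made up for illustration:

Func f;
Var x;
Expr t = x / 1000.0f;

// Default precision: AUTO without constraints. This is eligible for the
// fastest implementation, including native/hardware instructions.
Expr a = fast_exp(t);

// Constrain the maximum ULP error (on the non-range-reduced interval) to 50;
// the lowering picks the cheapest implementation that satisfies it.
Expr b = fast_exp(t, {ApproximationPrecision::MULPE, 50});

// Constrain the maximum absolute error to 1e-5 and force a Halide polynomial
// of at least 4 terms instead of device-specific intrinsics.
Expr c = fast_atan(t, {ApproximationPrecision::MAE, 0, 1e-5f, 4});

f(x) = a + b + c;
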
@mcourteaux mcourteaux marked this pull request as draft February 8, 2025 21:36
@mcourteaux mcourteaux requested a review from abadams February 10, 2025 18:11
@mcourteaux mcourteaux marked this pull request as ready for review February 10, 2025 18:12
@@ -38,6 +38,7 @@ extern "C" WEAK void *halide_opencl_get_symbol(void *user_context, const char *n
"opencl.dll",
#else
"libOpenCL.so",
"libOpenCL.so.1",

Member commented:
I wonder if it's the case that libOpenCL.so.1 should rather replace libOpenCL.so? The latter is a namelink that's only present when -dev packages are installed. It should always point to the former.

@mcourteaux (Contributor, Author) commented Feb 10, 2025:
I was thinking the same. I can fix it later, but I needed this on my local machine, so, to avoid being too destructive without consensus, I did it this way.

@mcourteaux (Contributor, Author) commented:
I removed it. We'll see what the build bots do.

@mcourteaux added the enhancement, performance, gpu, and release_notes labels Feb 10, 2025
…optimization. Greatly improved accuracy testing framework.
…st not touching input: prevents constant folding.
@mcourteaux added the skip_buildbots (synonym for buildbot_test_nothing) label Feb 11, 2025