Improve docs #234

Merged · 1 commit · Apr 29, 2024
10 changes: 5 additions & 5 deletions DifferentiationInterface/README.md
@@ -72,15 +72,15 @@ julia> Pkg.add(

```julia
using DifferentiationInterface
import ForwardDiff, Enzyme, Zygote # import automatic differentiation backends you want to use
import ForwardDiff, Enzyme, Zygote # AD backends you want to use

f(x) = sum(abs2, x)

x = [1.0, 2.0, 3.0]
x = [1.0, 2.0]

value_and_gradient(f, AutoForwardDiff(), x) # returns (14.0, [2.0, 4.0, 6.0]) using ForwardDiff.jl
value_and_gradient(f, AutoEnzyme(), x) # returns (14.0, [2.0, 4.0, 6.0]) using Enzyme.jl
value_and_gradient(f, AutoZygote(), x) # returns (14.0, [2.0, 4.0, 6.0]) using Zygote.jl
value_and_gradient(f, AutoForwardDiff(), x) # returns (5.0, [2.0, 4.0]) with ForwardDiff.jl
value_and_gradient(f, AutoEnzyme(), x) # returns (5.0, [2.0, 4.0]) with Enzyme.jl
value_and_gradient(f, AutoZygote(), x) # returns (5.0, [2.0, 4.0]) with Zygote.jl
```

For more performance, take a look at the [DifferentiationInterface tutorial](https://gdalle.github.io/DifferentiationInterface.jl/DifferentiationInterface/stable/tutorial/).
16 changes: 9 additions & 7 deletions DifferentiationInterface/docs/src/backends.md
@@ -47,34 +47,36 @@ backend_table = Markdown.parse(String(take!(io)))

## Types

We support all dense backend choices from [ADTypes.jl](https://github.com/SciML/ADTypes.jl), as well as their sparse wrapper `AutoSparse`.
We support all dense backend choices from [ADTypes.jl](https://github.com/SciML/ADTypes.jl), as well as their sparse wrapper [`AutoSparse`](@ref).

For sparse backends, only the Jacobian and Hessian operators are implemented differently; the other operators behave the same as for the corresponding dense backend.
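
A minimal construction sketch, assuming `AutoSparse` is re-exported alongside the dense backend types and keeping its keyword options (sparsity detector, coloring algorithm) at their defaults:

```julia
using DifferentiationInterface
import ForwardDiff

dense_backend = AutoForwardDiff()
sparse_backend = AutoSparse(dense_backend)  # Jacobian and Hessian operators can now exploit sparsity
```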

```@example backends
backend_table #hide
```

## Availability
## Checks

### Availability

You can use [`check_available`](@ref) to verify whether a given backend is loaded.

## Support for two-argument functions
### Support for two-argument functions

All backends are compatible with one-argument functions `f(x) = y`.
Only some are compatible with two-argument functions `f!(y, x) = nothing`.
You can check this compatibility using [`check_twoarg`](@ref).

## Hessian support
### Support for Hessian

Only some backends are able to compute Hessians.
You can use [`check_hessian`](@ref) to check this feature.
You can use [`check_hessian`](@ref) to check this feature (beware that it will try to compute a small Hessian, so it is not instantaneous).
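
A minimal sketch of these three checks, assuming ForwardDiff.jl is installed:

```julia
using DifferentiationInterface
import ForwardDiff

backend = AutoForwardDiff()

check_available(backend)  # true once ForwardDiff.jl is imported
check_twoarg(backend)     # whether two-argument functions f!(y, x) are supported
check_hessian(backend)    # whether Hessians are supported (computes a small test Hessian)
```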

## API reference

!!! warning
The following documentation has been re-exported from [ADTypes.jl](https://github.com/SciML/ADTypes.jl).
Refer to the ADTypes documentation for more information.
The following documentation has been borrowed from ADTypes.jl.
Refer to the [ADTypes documentation](https://sciml.github.io/ADTypes.jl/stable/) for more information.

```@docs
ADTypes
37 changes: 20 additions & 17 deletions DifferentiationInterface/docs/src/overview.md
@@ -11,16 +11,16 @@ We provide the following high-level operators:
| [`derivative`](@ref) | 1 | `Number` | `Number` or `AbstractArray` | same as `y` | `size(y)` |
| [`second_derivative`](@ref) | 2 | `Number` | `Number` or `AbstractArray` | same as `y` | `size(y)` |
| [`gradient`](@ref) | 1 | `AbstractArray` | `Number` | same as `x` | `size(x)` |
| [`hvp`](@ref) | 2 | `AbstractArray` | `Number` | same as `x` | `size(x)` |
| [`hessian`](@ref) | 2 | `AbstractArray` | `Number` | `AbstractMatrix` | `(length(x), length(x))` |
| [`jacobian`](@ref) | 1 | `AbstractArray` | `AbstractArray` | `AbstractMatrix` | `(length(y), length(x))` |

They can all be derived from two low-level operators:
They can be derived from lower-level operators:

| operator | order | input `x` | output `y` | result type | result shape |
| :----------------------------- | :---- | :--------- | :----------- | :---------- | :----------- |
| [`pushforward`](@ref) (or JVP) | 1 | `Any` | `Any` | same as `y` | `size(y)` |
| [`pullback`](@ref) (or VJP) | 1 | `Any` | `Any` | same as `x` | `size(x)` |
| operator | order | input `x` | output `y` | seed `v` | result type | result shape |
| :----------------------------- | :---- | :-------------- | :----------- | :------- | :---------- | :----------- |
| [`pushforward`](@ref) (or JVP) | 1 | `Any` | `Any` | `dx` | same as `y` | `size(y)` |
| [`pullback`](@ref) (or VJP) | 1 | `Any` | `Any` | `dy` | same as `x` | `size(x)` |
| [`hvp`](@ref) | 2 | `AbstractArray` | `Number` | `dx` | same as `x` | `size(x)` |

Luckily, most backends have custom implementations, which we reuse if possible instead of relying on fallbacks.
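
A minimal sketch of the seeded operators, following the signatures documented above, with ForwardDiff.jl as an arbitrary backend choice:

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
backend = AutoForwardDiff()

x = [1.0, 2.0, 3.0]
dx = [1.0, 0.0, 0.0]  # seed in input space, for pushforward and hvp
dy = 1.0              # seed in output space, for pullback (y is a scalar here)

pushforward(f, backend, x, dx)  # JVP: directional derivative of f along dx
pullback(f, backend, x, dy)     # VJP: here simply the gradient scaled by dy
hvp(f, backend, x, dx)          # Hessian-vector product along dx
```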

@@ -36,26 +36,25 @@ Several variants of each operator are defined:
| [`derivative`](@ref) | [`derivative!`](@ref) | [`value_and_derivative`](@ref) | [`value_and_derivative!`](@ref) |
| [`second_derivative`](@ref) | [`second_derivative!`](@ref) | NA | NA |
| [`gradient`](@ref) | [`gradient!`](@ref) | [`value_and_gradient`](@ref) | [`value_and_gradient!`](@ref) |
| [`hvp`](@ref) | [`hvp!`](@ref) | NA | NA |
| [`hessian`](@ref) | [`hessian!`](@ref) | NA | NA |
| [`jacobian`](@ref) | [`jacobian!`](@ref) | [`value_and_jacobian`](@ref) | [`value_and_jacobian!`](@ref) |
| [`pushforward`](@ref) | [`pushforward!`](@ref) | [`value_and_pushforward`](@ref) | [`value_and_pushforward!`](@ref) |
| [`pullback`](@ref) | [`pullback!`](@ref) | [`value_and_pullback`](@ref) | [`value_and_pullback!`](@ref) |
| [`hvp`](@ref) | [`hvp!`](@ref) | NA | NA |

## Mutation and signatures

In order to ensure symmetry between one-argument functions `f(x) = y` and two-argument functions `f!(y, x) = nothing`, we define the same operators for both cases.
However, they have different signatures:

| signature | out-of-place | in-place |
| :--------- | :--------------------------------- | :------------------------------------------ |
| `f(x)` | `operator(f, backend, x, ...)` | `operator!(f, result, backend, x, ...)` |
| `f!(y, x)` | `operator(f!, y, backend, x, ...)` | `operator!(f!, y, result, backend, x, ...)` |
| signature | out-of-place | in-place |
| :--------- | :------------------------------------------- | :---------------------------------------------------- |
| `f(x)` | `operator(f, backend, x, [v], [extras])` | `operator!(f, result, backend, x, [v], [extras])` |
| `f!(y, x)` | `operator(f!, y, backend, x, [v], [extras])` | `operator!(f!, y, result, backend, x, [v], [extras])` |

!!! warning
Our mutation convention is that all positional arguments between `f`/`f!` and `backend` are mutated (the `extras` as well, see below).
This convention holds regardless of the bang `!` in the operator name, because we assume that a user passing a two-argument function `f!(y, x)` anticipates mutation anyway.

Still, be careful with two-argument functions, because every variant of the operator will mutate `y`, even if it does not have a `!` in its name (see the bottom left cell in the table).
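
A minimal sketch of this convention, assuming a backend for which `check_twoarg` returns `true`:

```julia
using DifferentiationInterface
import ForwardDiff

f!(y, x) = (y .= abs2.(x); nothing)  # two-argument (in-place) function

backend = AutoForwardDiff()
x = [1.0, 2.0, 3.0]
y = zeros(3)

J = jacobian(f!, y, backend, x)  # no bang in the name, yet y is mutated to hold f(x)
```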

## Preparation
@@ -78,8 +77,8 @@ Unsurprisingly, preparation syntax depends on the number of arguments:

| signature | preparation signature |
| :--------- | :----------------------------------------- |
| `f(x)` | `prepare_operator(f, backend, x, ...)` |
| `f!(y, x)` | `prepare_operator(f!, y, backend, x, ...)` |
| `f(x)` | `prepare_operator(f, backend, x, [v])` |
| `f!(y, x)` | `prepare_operator(f!, y, backend, x, [v])` |

The preparation `prepare_operator(f, backend, x)` will create an object called `extras` containing the necessary information to speed up `operator` and its variants.
This information is specific to `backend` and `f`, as well as the _type and size_ of the input `x` and the _control flow_ within the function, but it should work with different _values_ of `x`.
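
A minimal sketch of the preparation workflow, using the gradient operator as an example:

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
backend = AutoForwardDiff()
x = [1.0, 2.0, 3.0]

extras = prepare_gradient(f, backend, x)  # backend- and size-specific setup
grad = similar(x)
gradient!(f, grad, backend, x, extras)    # reuses extras; valid for other values of x with the same type and size
```
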
@@ -102,6 +101,9 @@ We offer two ways to perform second-order differentiation (for [`second_derivati
At the moment, trial and error is your best friend.
Usually, the most efficient approach for Hessians is forward-over-reverse, i.e. a forward-mode outer backend and a reverse-mode inner backend.

!!! warning
Preparation does not yet work for the inner differentiation step of a `SecondOrder`, only the outer differentiation is prepared.
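
A minimal sketch of the forward-over-reverse combination, assuming the `SecondOrder` constructor takes the outer backend first and the inner backend second:

```julia
using DifferentiationInterface
import ForwardDiff, Zygote

f(x) = sum(abs2, x)
x = [1.0, 2.0, 3.0]

backend = SecondOrder(AutoForwardDiff(), AutoZygote())  # forward-mode outer, reverse-mode inner
hessian(f, backend, x)
```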

## Experimental

!!! danger
Expand All @@ -125,9 +127,10 @@ We make this available for all backends with the following operators:

### Translation

The wrapper [`DifferentiateWith`](@ref) allows you to take a function and specify that it should be differentiated with the backend of your choice.
In other words, when you try to differentiate `dw = DifferentiateWith(f, backend1)` with `backend2`, then `backend1` steps in and `backend2` does nothing.
At the moment it only works when `backend2` supports [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl).
The wrapper [`DifferentiateWith`](@ref) allows you to translate between AD backends.
It takes a function `f` and specifies that `f` should be differentiated with the backend of your choice, instead of whatever other backend the code is trying to use.
In other words, when someone tries to differentiate `dw = DifferentiateWith(f, backend1)` with `backend2`, then `backend1` steps in and `backend2` does nothing.
At the moment, `DifferentiateWith` only works when `backend2` supports [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl).
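
A minimal sketch of this translation, with Zygote.jl assumed in the role of `backend2` because it relies on ChainRules.jl:

```julia
using DifferentiationInterface
import ForwardDiff, Zygote

f(x) = sum(abs2, x)
dw = DifferentiateWith(f, AutoForwardDiff())  # force ForwardDiff.jl for this function

gradient(dw, AutoZygote(), [1.0, 2.0, 3.0])   # Zygote defers to ForwardDiff under the hood
```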

## Going further

61 changes: 26 additions & 35 deletions DifferentiationInterface/docs/src/tutorial.md
@@ -8,15 +8,8 @@ We present a typical workflow with DifferentiationInterface.jl and showcase its

```@example tuto
using DifferentiationInterface

import ForwardDiff, Enzyme # ⚠️ import the backends you want to use ⚠️
```

!!! tip
Importing backends with `import` instead of `using` avoids name conflicts and makes sure you are using operators from DifferentiationInterface.jl.
This is useful since most backends also export operators like `gradient` and `jacobian`.


## Computing a gradient

A common use case of automatic differentiation (AD) is optimizing real-valued functions with first- or second-order methods.
@@ -25,21 +18,26 @@ Let's define a simple objective and a random input vector
```@example tuto
f(x) = sum(abs2, x)

x = [1.0, 2.0, 3.0]
nothing # hide
x = collect(1.0:5.0)
```

To compute its gradient, we need to choose a "backend", i.e. an AD package that DifferentiationInterface.jl will call under the hood.
To compute its gradient, we need to choose a "backend", i.e. an AD package to call under the hood.
Most backend types are defined by [ADTypes.jl](https://github.com/SciML/ADTypes.jl) and re-exported by DifferentiationInterface.jl.

[ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl) is very generic and efficient for low-dimensional inputs, so it's a good starting point:

```@example tuto
import ForwardDiff

backend = AutoForwardDiff()
nothing # hide
```

Now you can use DifferentiationInterface.jl to get the gradient:
!!! tip
To avoid name conflicts, load AD packages with `import` instead of `using`.
Indeed, most AD packages also export operators like `gradient` and `jacobian`, but you only want to use the ones from DifferentiationInterface.jl.

Now you can use the following syntax to compute the gradient:

```@example tuto
gradient(f, backend, x)
@@ -48,15 +46,10 @@ gradient(f, backend, x)
Was that fast?
[BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) helps you answer that question.

```@repl tuto
```@example tuto
using BenchmarkTools
@btime gradient($f, $backend, $x);
```

More or less what you would get if you just used the API from ForwardDiff.jl:

```@repl tuto
@btime ForwardDiff.gradient($f, $x);
@benchmark gradient($f, $backend, $x)
```

Not bad, but you can do better.
@@ -69,19 +62,18 @@ Some backends get a speed boost from this trick.
```@example tuto
grad = similar(x)
gradient!(f, grad, backend, x)

grad # has been mutated
```

The bang indicates that one of the arguments of `gradient!` might be mutated.
More precisely, our convention is that _every positional argument between the function and the backend is mutated (and the `extras` too, see below)_.

```@repl tuto
@btime gradient!($f, _grad, $backend, $x) evals=1 setup=(_grad=similar($x));
```@example tuto
@benchmark gradient!($f, _grad, $backend, $x) evals=1 setup=(_grad=similar($x))
```

For some reason the in-place version is not much better than your first attempt.
However, it has one less allocation, which corresponds to the gradient vector you provided.
However, it makes fewer allocations, thanks to the gradient vector you provided.
Don't worry, you can get even more performance.

## Preparing for multiple gradients
@@ -100,31 +92,31 @@ You don't need to know what this object is, you just need to pass it to the grad
```@example tuto
grad = similar(x)
gradient!(f, grad, backend, x, extras)

grad # has been mutated
```

Preparation makes the gradient computation much faster, and (in this case) allocation-free.

```@repl tuto
@btime gradient!($f, _grad, $backend, $x, _extras) evals=1 setup=(
```@example tuto
@benchmark gradient!($f, _grad, $backend, $x, _extras) evals=1 setup=(
_grad=similar($x);
_extras=prepare_gradient($f, $backend, $x)
);
)
```

Beware that the `extras` object is nearly always mutated by differentiation operators, even though it is given as the last positional argument.

## Switching backends

The whole point of DifferentiationInterface.jl is that you can easily experiment with different AD solutions.
Typically, for gradients, reverse mode AD might be a better fit.
So let's try the state-of-the-art [Enzyme.jl](https://github.com/EnzymeAD/Enzyme.jl)!
Typically, for gradients, reverse mode AD might be a better fit, so let's try [ReverseDiff.jl](https://github.com/JuliaDiff/ReverseDiff.jl)!

For this one, the backend definition is slightly more involved, because you need to feed the "mode" to the object from ADTypes.jl:
For this one, the backend definition is slightly more involved, because you can specify whether the tape needs to be compiled:

```@example tuto
backend2 = AutoEnzyme(; mode=Enzyme.Reverse)
import ReverseDiff

backend2 = AutoReverseDiff(; compile=true)
nothing # hide
```

@@ -134,16 +126,15 @@ But once it is done, things run smoothly with exactly the same syntax:
gradient(f, backend2, x)
```

And you can run the same benchmarks:
And you can run the same benchmarks to see what you gained (although such a small input may not be realistic):

```@repl tuto
@btime gradient!($f, _grad, $backend2, $x, _extras) evals=1 setup=(
```@example tuto
@benchmark gradient!($f, _grad, $backend2, $x, _extras) evals=1 setup=(
_grad=similar($x);
_extras=prepare_gradient($f, $backend2, $x)
);
)
```

Not only is it blazingly fast, you achieved this speedup without looking at the docs of either ForwardDiff.jl or Enzyme.jl!
In short, DifferentiationInterface.jl allows for easy testing and comparison of AD backends.
If you want to go further, check out the [DifferentiationInterfaceTest.jl tutorial](https://gdalle.github.io/DifferentiationInterface.jl/DifferentiationInterfaceTest/dev/tutorial/).
It provides benchmarking utilities to compare backends and help you select the one that is best suited for your problem.