Unification of submodels and distributions

@mhauru @willtebbutt and I discussed submodels this evening (10 Feb). The present issue is that our support for submodels is currently halfway done - we are able to extract the _return value_ of a submodel, but not its latent variables.

(Note that this has always been true, even with the old `@submodel` macro; https://github.com/TuringLang/DynamicPPL.jl/pull/696 merely changed the syntax we used to achieve this.)

## (1) Overview

After a fair bit of back and forth, the summary of the interface we would like is something along these lines:

```julia
using DynamicPPL, Distributions

@model function inner()
    x ~ Normal()
    y ~ Normal()
    return "my string"
end

@model function outer()
    a ~ Normal()
    b ~ inner()
    @show b      # Should be a NamedTuple{:x, :y}
    @show b.x    # Should be a float
    c ~ inner() {OP} retval
    @show c      # Should also be a NamedTuple{:x, :y}
    @show retval # Should be "my string"
end

# Conditioning on submodel variables should work
outer() | (@varname(c.x) => 1.0)
# This should ideally work too
outer() | (c = (x = 1.0,),)
```

for some infix operator `{OP}` (see section 3.2 below for some possible options).

Note that there are several changes with respect to the current behaviour (as of 10/2/2025):

1. No need to wrap in `to_submodel` if possible (I am not totally sure if this is doable)
2. Manual prefixing should not be needed and may be disallowed
3. Prefixing should occur not by prepending directly to the symbol (as is currently done), but rather by making the submodel's variables be a field of the parent model's variable. Thus, we can write `@show c.x` instead of `@show var"c.x"`.
4. The lhs of a tilde should capture the submodel's random variables instead of its return value.
5. The return value, if desired, can be extracted by placing a further operator on the _right-hand side_ of the submodel.

Although we are collectively in favour of this interface, this is not meant to be set in stone yet, and there are several further points of discussion, which are detailed below.

## (2) Motivation

Turing models in general have two types of 'useful information' that one might want to extract:

1. The values of the random variables inside. This is best represented by the model trace, i.e., VarInfo that is used during execution.
2. Since `@model function ... end` itself expands into a function definition (the so-called 'model evaluation function'), this function will itself also have a return value.

This return value may be constructed from the random variables' values, and in many of the DynamicPPL/Turing docs, this is indeed the case; however, this is not mandatory and in general the return value can contain arbitrary information.

With models, these two pieces of information are obtained respectively using `rand()` and function calls:

```julia
julia> using Distributions

julia> using DynamicPPL, Distributions

julia> @model function f()
           x ~ Normal()
           return "hello, world"
       end
f (generic function with 2 methods)

julia> model = f()
Model{typeof(f), (), (), (), Tuple{}, Tuple{}, DefaultContext}(f, NamedTuple(), NamedTuple(), DefaultContext())

julia> rand(model)
(x = 0.12314369056401028,)

julia> model()
"hello, world"
```

Currently, `x ~ to_submodel(inner())` does not assign the random variables in `inner()` to `x`, but rather the return value. This means that there are several inconsistencies between the behaviour of submodels and distributions:

1. The obvious difference is that with a distribution on the rhs, the value of `x` is sampled by calling `rand(dist)`. With a submodel on the rhs, the value of `x` is obtained by calling `inner()()`.
2. It is not possible to calculate the logpdf of a submodel `inner()` evaluated at `x`. This is because the return value `x`, in general, has no relationship to the random variables contained inside `inner()`, and indeed there is no guarantee that a well-defined 'distribution' of return values exists.
3. In `x ~ to_submodel(inner())`, although the variables of `inner()` are added to the VarInfo and the resulting chains from sampling, `x` itself is not.

This proposal therefore seeks to unify the behaviour of submodels and distributions in a way that is internally consistent and thus easier for users to intuit. In particular, it is proposed that:

1. The syntax `lhs ~ rhs` is reserved for the results of sampling from a submodel or distribution using `rand()`. The result of sampling from a model should be some kind of data structure (a NamedTuple, struct, or dictionary) which allows for indexing. The variable `lhs` (or its subvariables) should always be part of the VarInfo and it should be possible to condition on them.

2. We adopt new syntax, in the form of `lhs ~ submodel {OP} retval` where `{OP}` is an infix operator, to extract the return value of a submodel (if so desired). Because distributions do not have return values, this syntax would only be accepted when used with submodels in the middle. The `{OP} retval` section may be omitted, in which case the return value is simply discarded.

3. Running a submodel without extracting its random values (i.e. just writing `submodel {OP} retval`) should be forbidden, because in such a case, users should refactor their code to use a plain Julia function instead of a submodel.

## (3) Concrete steps

1. Decide if the general idea makes sense.

2. Decide on the infix operator `{OP}`. We would probably like the operator to (1) be ASCII-compatible; (2) resemble a rightwards arrow.
   - I originally proposed `~>`, but this is not allowed by the [Julia parser](https://github.com/JuliaLang/julia/blob/master/src/julia-parser.scm).
   - The best boring option I see is `-->`
   - `>>=` is also possible, and I have a Haskell bias towards it, but it technically conflicts with right-bit-shift-and-assign.
   - The simpler `->` and `=>` are probably best avoided because they are already used for anonymous functions and Pair respectively.

3. Figure out the data structure that should be obtained when sampling from a submodel. Right now, `rand(model)` returns a NamedTuple. To me, this feels like the most natural interface to use; it 'makes sense' that if `t` is a random variable in a submodel, `c ~ submodel` should allow us to access `c.t`. It is possible that we may want to use a different type of data structure that retains more information (i.e. is closer to a varinfo) but still has an interface that allows field access.

4. Figure out how to obtain this data structure when sampling from a submodel. My original proposal was to evaluate submodels with a special wrapper context, say `SubmodelContext`, which would collect sampled variables and their values in a NamedTuple as each `assume` statement was hit. (Note, the behaviour of this would be [very similar to `ValuesAsInModelContext`](https://github.com/TuringLang/DynamicPPL.jl/blob/bca344cdcb3977080b6600234c72924c0b858759/src/values_as_in_model.jl).) However, it seems quite plausible that this could be obtained simply by subsetting the global varinfo.

5. Implement this in the DynamicPPL compiler. Note that this may require special attention to e.g. operator precedence / associativity which may in turn place more restrictions on the possible operators used. Some extra abstract type machinery will likely also be needed if we plan to not wrap submodels in a new type; my suspicion is that this might actually be the hardest part of it.

6. Iron out the odd bits of conditioning submodels. I actually suspect that all the infrastructure necessary for this is already in place, and it's mostly a matter of making sure that writing a comprehensive set of tests to make sure that everything behaves 'as expected'.

7. Iron out the expected behaviour when varnames conflict, e.g. if we have `c ~ submodel()` then we should probably not allow the identifier `c` to be reused on the lhs of another tilde.

8. Write tests. And more tests. And more tests. Even with as elegant an implementation as we can come up with, my gut feeling is that there are bound to be many awkward edge cases!

9. Turn the contents of this issue into documentation. (I wrote it up, so the hard bit's already done 😉)

## (4) Alternatives considered.

The main alternative considered was to use two different operators for extracting the random variables and the return value, plus one for extracting both, so something like:

```julia
@model function inner()
    x ~ Normal()
    y ~ Normal()
    return "my string"
end

@model function outer()
    a ~ Normal()
    b ~ inner()
    @show b       # Should be a NamedTuple{:x, :y}
    retval {OP1} inner()
    @show retval  # Should be "my string"
    c, retval2 {OP2} inner()
    @show c       # Should be a NamedTuple{:x, :y}
    @show retval2 # Should be "my string"
end
```

for some infix operators `{OP1}`and `{OP2}`.

We also considered having a single statement `b ~ submodel` return some data structure from which the random variables could be accessed using `b.vars` and the return value with `b.retval`.

However, we all agreed that the main proposal here is better, because its syntax is more elegant and it also does not introduce any extra layers of indirection.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unification of submodels and distributions #2485

(1) Overview

(2) Motivation

(3) Concrete steps

(4) Alternatives considered.

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Unification of submodels and distributions #2485

Description

(1) Overview

(2) Motivation

(3) Concrete steps

(4) Alternatives considered.

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions