Unboxed types proposal #10
Conversation
> It would be nice to also allow the syntax `x.bar.#y <- y'`, although this is a little trickier to understand and implement, since it is `x.bar` rather than `x.bar.y` which is the mutable field.
@gasche says
I don't think this is a good idea, due to the awkward confusion with the meaning of x.bar.y <- y' in the non-boxed case.
@stedolan says
I agree, I'm not terribly happy with this either. It would be nice to have a more compact syntax than x.bar <- { x.bar with foo = 1 } for updating a single field of an inlined record, though.
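To make the verbosity concrete, here is a small sketch in today's OCaml (no unboxed types; the point/state names are made up for illustration). The same shape of code is what the discussion above would like to abbreviate:

```ocaml
(* When an immutable record sits in a mutable field, updating one
   component means rebuilding the whole record. *)
type point = { px : int; py : int }
type state = { mutable pos : point }

let bump_px (s : state) =
  (* the verbose form discussed above; a compact per-field update
     syntax would replace this *)
  s.pos <- { s.pos with px = s.pos.px + 1 }
```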
> defines `M.t` but not `M.#t`. (It may be useful to allow exporting `M.#t` despite `t` being abstract, perhaps with syntax like `type t : value with #t : int32 * int32`).
@gasche says
If this exposes the fact that box(#t) = t, then is t still abstract in any reasonable sense? This seems a bit fragile.
@stedolan says
Right, I'm not sure how useful this is. The motivation is this: perhaps I'd like to expose a type in both boxed and unboxed form (t and #t), so that a client of the library can freely box and unbox the type (perhaps storing it inlined inside another data structure), but cannot otherwise introduce or eliminate t except by using the library's interface.
To let the client box and unbox t, the client must know that #t : int32 * int32. This lets the client compile code that passes around #t (knowing that it takes two integer registers, 8 bytes of memory, and should not be registered with the GC), but the client cannot do anything else with it.
@gasche says
That makes perfect sense -- you already explain how to have incompatible types of layout int32, so this is just an abstract type at layout int32. Thanks for the clarification.
@lpw25 says
Note that in the current proposal, type t with #t : l does not actually expose that box(#t) = t in any meaningful sense because there is no box type constructor and no builtin operations that allow you to make a t from a #t. The syntax merely allows the type #t to be exposed abstractly in the interface.
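As a rough analogue of what @stedolan describes, in today's OCaml one can approximate "the client may convert between the two forms but do nothing else" with explicit conversion functions. Everything below (the module M, the repr type, the box/unbox names) is made up for illustration; a real #t would be a flat pair of int32s rather than a boxed tuple:

```ocaml
(* Sketch: [repr] stands in for the exposed layout of #t, and the two
   conversions stand in for boxing/unboxing.  Clients can move between
   [t] and [repr] but cannot otherwise build or inspect [t]. *)
module M : sig
  type t
  type repr = int32 * int32
  val box : repr -> t
  val unbox : t -> repr
end = struct
  type t = { a : int32; b : int32 }
  type repr = int32 * int32
  let box (a, b) = { a; b }
  let unbox { a; b } = (a, b)
end
```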
@gasche says
Wouldn't it be useful to think of immediate values as boxed in the same sense (so type t [@@immediate] actually means type t : value with #t : immediate rather than type t : immediate)? Can we get rid of subtyping that way?
More precisely, I think that #t in that context would naturally represent the untagged form of the immediate value, rather than just the immediate value itself.
@stedolan says
Reusing this syntax to allow explicit tagging/untagging for immediate types is a nice idea. I don't see how this replaces the immediate <= value subtyping, though: that's not about adding and removing tagging but about the inclusion of tagged integers into values.
@gasche says
The idea would be to never ask "are the values of this type immediate values?", but instead "is this type known to be the tagging of an untagged type?", when making optimization decisions. (Similarly to the fact that exporting type t with #t : int32 should let consumers deduce that the values of t are pointers.) If that is enough to make the same optimization decisions, then we can drop the current concept of "immediate type" -- which could remove the need for layout subtyping.
@stedolan says
Right, that makes sense now. The idea is that rather than allowing kindings t : int, you can instead express t : value = tag(#t) and #t : untagged_int. I think this would work, but it seems more awkward than the subkinding approach for a couple of reasons:
It's a bit more awkward to quantify over immediate types this way. This is useful when writing records with mutable fields that guarantee no uses of caml_modify:
type ('a : immediate) r = { mutable x : 'a }
becomes
type ('a : untagged_int) r = { mutable x : tag('a) }
It's no longer possible to instantiate this definition as int r: instead, the user must know that the builtin int type is defined as tag(#int) and write #int r (even though there are no uses of #int in the program). Similarly, the user must know about the untagged versions of variant types that have only constant constructors.
This trick stops working once we have more than one property of interest. For instance, if we implement efficient nullable types (as discussed below), then we also want to track whether a type may include a null value. We cannot define int as both tag(#int) and not_null(#int') unless we do some quite hairy type system work to commute tag and not_null.
I was hoping to make all compilation decisions about how a type should be stored (how many registers it takes, whether it needs to be registered with the GC, etc.) by examining the type's layout. However, this approach would mean a mixture of examining the type's layout (to see if it's value) and then matching on the type itself (to see if it's tag(#foo)) in order to make these decisions.
It sounds like the motivation for this instead of an immediate kind is the perceived "fishiness" of the algorithm for unification with layouts and subkinding. My informal description of the algorithm below is full of mutation, because that's how unification is implemented in OCaml. I'll write up a syntactic presentation of the unification algorithm, and see if you find that less fishy.
@gasche says
What about
type 'a r = { mutable x : 'a } constraint 'a = tag('i)
? Just kidding. I think your second point is interesting and the third is convincing.
I was critical of the algorithm description (and I do think it could be improved), but that was not really the point of my remark above; I just think that subtyping tends to introduce complexity, and that it's interesting to seriously check that we want it in the design. (I realize you are an interesting choice of person to complain to about subtyping.)
In the big picture I'm fairly confident that you can propose a good design even in the presence of subtyping, because the problem domain is constrained: we don't necessarily need to do a lot of inference, and there is no interaction between subtyping and polymorphism, given that there is no kind polymorphism for now.
@stedolan says
Hah! I didn't think of using constraints here, nice.
Here's a syntactic version of the unification algorithm with subkinding, hopefully it seems less fishy than before: https://gist.github.com/stedolan/349b146b4744549fdb25088c4501de73. (Just the basic algorithm for now, I'll write up the details of handling type aliases later. Again, the story is the same as for standard unification: in principle, you can expand aliases beforehand, but in practice you're better off expanding them only when needed and caching the results)
@gasche says
Thanks! The description is very clear.
In the Instr rule, you check Ahat : kappa, using the kind of the alpha variable. Don't you rather want to lower the kind of alpha if Ahat can be given a smaller kind? That's what I understood from the informal description.
@gasche says
(And: I agree that having the rules for parametrized type aliases would be nice.)
@gasche says
Thinking aloud some more: if you implement kind lowering during unification based on kind checking (and not just the intersection of the variables' declared kinds as you do now), you probably want to have a more algorithmic presentation of kind checking Gamma |- A => kappa, which computes a kind that is minimal with respect to the declarative check Gamma |- A : kappa. (On a finite kind domain, we can always compute this by taking the lub of all admissible kinds, but this is less informative than an algorithmic judgment that is close to the implementation.)
@stedolan says
> In the Instr rule, you check Ahat : kappa, using the kind of the alpha variable. Don't you rather want to lower the kind of alpha if Ahat can be given a smaller kind?
OK, there's some misunderstanding here. (Which shows it was useful to poke me into writing a more precise description!) We don't lower the kind of alpha if Ahat can be given a smaller kind, because alpha is removed (substituted away). Only uninstantiated variables have kinds, because only uninstantiated variables range over anything. Once a variable is instantiated, we have neither a variable nor a kind any more.
When I talk about mutating the kind in the informal description, I'm describing the rule Eq, where variables alpha and beta are removed and replaced with a new variable gamma, ranging over the intersection of the layouts. (In an actual implementation, it is convenient to choose the name gamma = alpha, so all you need to do is mutate alpha's kind and delete beta).
But you're right that the rules should specify the kinds of the variables that are uninstantiated. I've currently written the output of unification as a list of variable assignments, but it should be a pair of a list of variable assignments and a list of unassigned variables (with kinds for the latter). That is, for each variable in Gamma to begin with, the output either gives a definition of that variable (if it was instantiated), or a new kind for that variable (if it was not).
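For what it's worth, here is a toy rendition of the mutation-based view described above; this is my own sketch, not the gist's algorithm, and all the names are illustrative. Uninstantiated variables carry a layout, and the Eq case keeps one variable, narrows its layout to the intersection, and deletes the other:

```ocaml
(* Toy layout lattice: Immediate <= Value <= Any, and Int64 <= Any. *)
type layout = Any | Value | Immediate | Int64

let leq a b =
  match a, b with
  | _, Any -> true
  | Immediate, (Immediate | Value) -> true
  | Value, Value | Int64, Int64 -> true
  | _, _ -> false

(* Intersection (meet) of two layouts, if one exists. *)
let meet a b =
  if leq a b then Some a
  else if leq b a then Some b
  else None                      (* e.g. Value vs Int64: unification fails *)

type tvar = {
  id : int;
  mutable layout : layout;       (* meaningful only while uninstantiated *)
  mutable link : tvar option;    (* set when the variable is merged away *)
}

(* The Eq case: keep [a], narrow its layout, delete [b] by linking it. *)
let unify_vars a b =
  match meet a.layout b.layout with
  | None -> Error `Incompatible_layouts
  | Some l -> a.layout <- l; b.link <- Some a; Ok ()
```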
> of `L`, then the check passes. However, if the check fails and `r` is an alias, we must expand `r` and try again. We only report a unification failure once the check fails on a type which is not an alias.
@gasche says
This design sounds fairly fishy to me:
- the any_layout kind doesn't really make sense to me
- the way kinds/layout are mutated during inference is fishy
For example, consider the two definitions
type ('a : value) id = 'a
type ('a : immediate) immediate = 'a
And then the following sequences of introduction of new variables and unifications:
some 'b in (* new: 'b: any_layout *)
some 'c in (* new: 'c: any_layout *)
'b = 'c id; (* changed: 'b: value, 'c : value *)
'c = immediate int; (* changed: 'c: immediate *)
(* final state: 'b: value, 'c: immediate *)
some 'b in (* new: 'b: any_layout *)
some 'c in (* new: 'c: any_layout *)
'c = immediate int; (* changed: 'c: immediate *)
'b = 'c id; (* changed: 'b: immediate *)
(* final state: 'b: immediate, 'c: immediate *)
The fact that the final states of those two sequences differ feels wrong to me. I may have misread your algorithm (it's not clear what "the check passes" means in terms of kind updates to the solution), but I think that there are non-principality issues here. The lowering of 'c from any_value to immediate may require the recomputation of any kind decision based on observations of 'c.
Is there a less-convenient but safe proposal that can be used as a first step, to be refined later? Some ideas:
- could we consider that only value is ever inferred, and all others must be put explicitly?
- what is the design like if we drop immediate, so that there isn't any subtyping of layouts anymore?
@gasche says
If we solve all inference problems in the same way that we are currently doing, I think that we can read back from the solution a minimal classification of type variables into layouts -- or fail with an error if no such assignment exists. Is that right?
The other natural approach, I think, would be to be fully explicit about layout variables and their resolution during inference -- much like the explicit handling of universe levels in dependently-typed programming languages. At the level of types, we use unification/equality as the base mechanism, but you are proposing to do inference with subtyping at the level of layouts. I would like to see a more explicit description of how that works (or another approach that avoids this sophistication) than the fairly imprecise description in the current document.
@lpw25 says
> the any_layout kind doesn't really make sense to me
I believe an equivalent proposal is as follows:
- Have a notion of "sort" that classifies the layouts by compatibility -- one sort for value and immediate, one sort for int64, etc.
- Add a family of layouts Top of sort which are the top of the subtyping order within each sort. For most sorts this will be the only layout in the lattice (e.g. Top of the int64 sort would be equal to the int64 layout).
- Allow sort unification variables, which cannot be generalized.
- Then you can replace all uses of any_layout with Top 's (where 's is a sort unification variable).
- When generalizing a type, only allow it to be generalized if its layout does not contain any sort variables.
So maybe that helps to clarify things. Stephen and I somewhat disagree over this part: I think that using sort variables and then removing the last restriction -- which makes layout inference principal -- is a cleaner system. However, since my version is backwards compatible with his, I think his version is a reasonable one for an initial prototype.
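For concreteness, here is one possible toy reading of that scheme in OCaml; this is my own modelling of it, not necessarily exactly what @lpw25 has in mind:

```ocaml
(* Sorts classify layouts by compatibility.  [Top s] is the largest
   layout of sort [s]; for most sorts it is the only one, and only the
   value sort has something strictly below its Top. *)
type sort = Value_sort | Int64_sort | Float64_sort

type layout =
  | Top of sort       (* Top Value_sort plays the role of "value" *)
  | Immediate         (* strictly below Top Value_sort *)

let sort_of = function
  | Top s -> s
  | Immediate -> Value_sort

(* Subtyping holds only within a sort: Immediate <= Top Value_sort,
   plus reflexivity. *)
let leq a b =
  a = b || (match a, b with Immediate, Top Value_sort -> true | _ -> false)
```

Under this reading, any_layout becomes Top 's for a sort unification variable 's, and a type may only be generalised once its layout mentions no sort variables.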
@stedolan says
@gasche I think you did misread the algorithm! I'll see if I can find some time to write it out more clearly and precisely. The algorithm is really a small change to classical unification!
Type variables have layouts, describing what they range over. Once a type variable is instantiated (= substituted away), we no longer have either a variable or its layout, just the result of substituting. In your example:
some 'b in (* new: 'b: any_layout *)
some 'c in (* new: 'c: any_layout *)
'c id (* to form the type 'c id, 'c is restricted to value *)
'b = 'c id (* substitute 'c id for 'b *)
(* state is: 'c : value; 'b = 'c id *)
'c = int (* substitute int for 'c *)
(* final state is 'b = 'c id; 'c = int *)
or the other way around:
some 'b in (* new: 'b: any_layout *)
some 'c in (* new: 'c: any_layout *)
'c = int (* substitute int for 'c *)
(* state is: 'b: any_layout; 'c = int *)
'b = 'c id
(* final state is 'c = int; 'b = 'c id *)
The design is indeed somewhat simpler if we drop immediate: we can solve layouts by standard unification, with a restriction like the any_layout restriction here on when they can be generalised.
I think we need to consider immediate now, though: there's clearly a demand for such features ([@@immediate] is part of the language today), and grafting subtyping onto an inference system afterwards does not tend to be straightforward.
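For reference, the existing attribute mentioned here looks like this; the colour type is made up, and the comment about the write barrier is my reading of the motivation given for the quantified r example above:

```ocaml
(* [@@immediate] asserts (and the compiler checks) that every value of
   the type is an unboxed, tagged integer. *)
type colour = Red | Green | Blue [@@immediate]

(* This is the kind of guarantee the quantified example above wants:
   storing a [colour] in a mutable field needs no GC write barrier. *)
type cell = { mutable current : colour }
```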
@Drup says
I much prefer @lpw25's version, which I suspect is a stripped-down version of what I did for linear types. I feel that having sort variables will serve us far better in the larger context.
"Uncompilable" types like `diag : ('a : any_value) . 'a -> 'a * 'a` | ||
can actually be compiled, by compiling a separate version of `diag` | ||
for each layout. In fact, we need fewer layouts than that: we can | ||
share the same version for `value` and `immediate`. Generally, |
@gasche says
Note: another possible interpretation of any_value (other than specialization) would be similar to F#'s "statically resolved type parameters", where the declaration does not result in any code being generated, but has to be inlined at each callsite (it is an error if it is used in a non-inlinable way somewhere).
That said, I'm not personally fond of either proposal, as I think that both specialization and static resolution are somewhat orthogonal to sorts/layouts; we may want to specialize a definition according to the specialization abilities of polymorphic comparison operators (int, string, float...), and to statically resolve something in order to statically determine a modular-implicit parameter, rather than a type.
@lpw25 says
I think that F#'s thing is pretty much a less expressive version of the "Macros" section below. It's probably not clear from the description, but what that section has in mind is the macros work described in the "modular macros" talk and here. With that you can allow sort polymorphic macros producing quotations. Splicing these in at a particular sort is basically the same as calling a function with "statically resolved type parameters" in F#.
Of course, as the section below says, this stuff should really be left for a later project. I think Stephen only included it here because, if he didn't, people would only ask about it anyway.
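To make the quoted point about sharing concrete: every OCaml value of layout value, boxed or immediate, occupies a single uniform word, so one compiled copy of diag serves both cases; only genuinely different layouts (say, a flat pair of int32s) would need their own copy. A trivial illustration in today's OCaml:

```ocaml
(* One compiled copy works for both an immediate and a boxed argument,
   because both are represented as one uniform word at runtime. *)
let diag x = (x, x)

let _pair_of_ints    = diag 3         (* immediate argument *)
let _pair_of_strings = diag "hello"   (* boxed-value argument *)
```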
> Type aliases do not automatically bind an unboxed variant, because the scoping of type aliases should not depend on whether they expand to records or not. So, if you write:
@Drup says
I do not understand this. # is pretty much an operator on types that yields their unboxed representation, if available. Why shouldn't it be allowed to follow type aliases?
@lpw25 says
The key here is that # is not actually an operator on types -- #foo is just the name that we give to an automatically created type; as far as the type system is concerned, it has no actual relationship to foo. So #foo is a fully fledged member of its containing module -- a separate entity from foo.
Whether or not a type alias expands to a record is not stable under substitution -- it can change if you e.g. apply a functor -- whereas you really want the items in a signature to be stable under substitution.
@Drup says
That doesn't really explain why # is not an operator, though.
If we use the proposed syntax type t : value with #t = ..., then there is no reason to make #t a name; we can deduce everything we need from the environment.
@lpw25 says
Well, a proper type operator needs to make sense on all types (of the appropriate kind), whereas this # is only well defined on some datatypes.
Now you could make a box operator that would work on anything with an appropriate kind, and indeed I think that would be good to add later. But you probably still end up wanting to provide access to the unboxed version of all existing record definitions -- so you'd probably still want the # pseudo operator such that:
type foo = { bar : int }
would secretly be something like:
type foo = #foo box = { bar : int }
@lpw25 says
It might be worth noting that this # is actually pretty well aligned with how # types behave for classes.
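To illustrate @lpw25's earlier point about substitution with a small made-up example: whether a type is an alias of a record can change when a functor is applied, so "binds an unboxed version" cannot be a property of the alias itself:

```ocaml
(* Whether F(X).t stands for a record depends entirely on the argument. *)
module type S = sig type t end

module F (X : S) = struct
  type t = X.t    (* just an alias; nothing here says "record" *)
end

module A = struct type t = { field : int } end   (* a record type *)
module B = struct type t = int end               (* not a record type *)

module FA = F (A)   (* FA.t expands to a record *)
module FB = F (B)   (* FB.t does not *)
```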
@gasche says
(I like the choice of reflecting the runtime limitations in the type theory as proposed, instead of starting a bike-shedding discussion on using tag bits for this or that in the runtime to get a nicer type theory. It's good to separate the two.)
@gasche says (somewhat later)
(I've copy-pasted across some comments that predate the public RFCs repo)
I was originally confused by the avoidance of the terminology "kinds" and wondered about the relationship with Eisenberg and Peyton Jones' "Levity Polymorphism" paper, so @stedolan provided a helpful clarification.
I am happy that @Drup finds some similarity between one proposal and his work on integrating linear types in ML. Since @gasche says that the last part could be left out, and there is an obvious overlap with kinds for linearity, I think I understand (commenting as an interested outsider) that one important task will be finding a design for a kinds & constraints system that will prove forwards-compatible with other kind-based language extension proposals (linearity, resources, arities). From that point of view, I disagree that the "Variants" part of the "Beyond records" section should be left out: there is going to be (AFAIU) the constraint that components of variants must have the same layout, probably treated using implicit padding (another topic of that section), and I would like to be sure I understand a forward-compatible type system design to deal with variants later on. (This has much in common with objections @lpw25 shared privately about my proposal for static allocation, and I suspect that there is going to be the same solution to both; also one should be careful that experience with linear types is not enough here, since they alone are unlikely to display this issue.) I was also a bit confused about the proposal to unbox option types.
I have stumbled on this idea before, not knowing what to do with it, and I am happy to see it considered seriously. The context in which it arose for me is mutable cells holding linear values, which can only be accessed by swapping. (This appears naturally in any language extension with linearity/resources, but could appear before then as a programming pattern for safe concurrency in multicore.) One common idiom in Rust (the "take" idiom) is to downgrade such unique mutable cells into "at most one" mutable cells, where "taking" the contents means swapping it with a dummy value. Extending a type with a dummy value is in general costless thanks to layout optimisations similar to what @stedolan proposes. (More generally, studying language designs for safe resources/concurrency has encouraged me to be open-minded about the use-cases for nullable values not being entirely covered by the option type from functional programming.) The proposal also quickly alludes to return-value unboxing at the end of the Generalisation section, noticing that type variables could be generalised in positive position. I suspect this could be leveraged more automatically as an optimisation similar to C++-style RVO, likely related also to tail-recursion modulo cons / destination-passing style.
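A plain-OCaml rendition of the "take" idiom mentioned above, with option as the explicit dummy that the layout optimisations being discussed would make free:

```ocaml
(* "Taking" the contents swaps in the dummy value [None]. *)
type 'a cell = { mutable contents : 'a option }

let take (c : 'a cell) : 'a option =
  let v = c.contents in
  c.contents <- None;
  v
```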
It would be nice to consider a layout for types whose values carry no information at runtime and so take up no space at all.
Applications might include erasure for equality types or static capabilities.
I really like the idea of the unboxed types presented. One thing that puzzles me is how the GC distinguishes between something of kind value and something of an unboxed kind on the stack. My understanding is that the GC traverses the whole stack and distinguishes pointers from immediates by the tag bit. Are you adding bit masks to stack frames? (I think GHC does this, if I'm not mistaken.) Does OCaml use stack frames at all?
OCaml actually uses precise stack frames for the whole stack already, at least in native-code mode. (There's no type information in these frames beyond 'is it a value', and scanning of heap blocks does use the last bit to distinguish immediates.) In bytecode mode, the whole stack is scanned as you describe, but the plan for bytecode is to implement unboxed locals with implicit boxing, since performance isn't so important in that context.
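A small check of the existing tag-bit convention mentioned above; the Obj module is used purely for illustration:

```ocaml
(* An OCaml int is stored with its low bit set, so the runtime can tell
   immediates and pointers apart when scanning a heap block. *)
let () =
  assert (Obj.is_int (Obj.repr 42));
  assert (Obj.is_block (Obj.repr "hello"))
```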
Thanks for the clarification @stedolan! This, however, raises another question. Why not do the same thing in heap blocks, using a bitmap in the header to record which words are pointers and which are unboxed values, and ditch the pointer tagging altogether? It seems to me that both approaches have the same goal, but bitmaps are more robust because one can store 32/64-bit integers and floats instead of only 31/63-bit immediates.
This has been superseded by an updated proposal, #34, and this older PR can be closed, I think.