
Change the way that we carry data in the AST #19

Merged · 21 commits merged into main from nathan/ast_ids on Jul 28, 2023
Conversation

sgpthomas (Collaborator)

State is carried in the AST as mutable fields. This introduces a class of bugs where state is not properly initialized at the places where it is used.

I propose that we give each AST node a unique id at creation time. Then, we can carry state in a separate Map data structure that maps id -> state. This has several benefits.

  1. We can require that every node has some piece of state by using the type id -> state instead of id -> state | None.
  2. We can enforce an ordering over passes at the type level. Suppose we have some pass Infer that annotates every node in the AST with a type. This pass will produce a Map[Id, Type]. If some later pass P requires this information, we can require that, in order to construct a visitor implementing pass P, we have to pass in a Map[Id, Type].

We will need a mechanism for mutating / creating new nodes. What ids should these new nodes use? Sometimes they probably want to inherit the id of the node they are replacing. If we are adding a genuinely new node, we may need to mint a new id.
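To make the proposal concrete, here is a minimal sketch of the idea. The names (`Node`, `Lit`, `infer`, `LaterPass`) are hypothetical, not the actual solvent code: nodes get a fresh id at creation time, a pass produces a total side table keyed by id, and a later pass can only be constructed from that table.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterable

# Global counter: every node created gets the next fresh id.
_fresh = itertools.count()

@dataclass(frozen=True)
class Node:
    node_id: int = field(default_factory=lambda: next(_fresh))

@dataclass(frozen=True)
class Lit(Node):
    value: int = 0

def infer(nodes: Iterable[Node]) -> Dict[int, str]:
    # A pass like Infer produces a *total* map id -> type, so every
    # node is guaranteed to have an entry (no `state | None`).
    return {n.node_id: "int" for n in nodes}

class LaterPass:
    # Requiring Infer's output in the constructor enforces pass
    # ordering at the type level.
    def __init__(self, types: Dict[int, str]):
        self.types = types
```

Because `LaterPass` cannot exist without a `Map[Id, Type]`, running it before `infer` is a construction error rather than a runtime surprise.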

@sgpthomas sgpthomas added this to the Get `norm` working milestone Jun 29, 2023
@sgpthomas sgpthomas added enhancement New feature or request low priority labels Jun 29, 2023
@sgpthomas (Collaborator, Author) commented Jun 29, 2023

Thinking about the ordering of passes. Not all of these passes exist yet; I'll make the ones that don't. Putting this here, because I think it will be useful to have this picture in mind when thinking about how exactly to pass information around between passes.

Passes

Parsing (exists)

We start with the raw text of a Python program (either from a function annotator using inspect, or from an entire file via the solvent CLI). We then use the Python ast library to parse this text into a Python AST. Finally, we translate the Python AST into our internal AST (syntax.py).

During this process, each node is given a unique ID.

A-Normalization (exists)

This pass converts our program into A-normal form. A-normal form is SSA-lite: every sub-computation is named, but existing variables are not renamed. So if there are two writes to the same variable in a program, we leave them alone.
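As a hypothetical illustration (not solvent's actual output), A-normalization names each sub-computation while leaving existing variables untouched:

```python
# Before A-normalization: nested sub-expressions.
def before(x):
    return (x + 1) * (x + 2)

# After A-normalization: every sub-computation gets a name (tmp0,
# tmp1), but `x` itself is not renamed -- this is not full SSA.
def after(x):
    tmp0 = x + 1
    tmp1 = x + 2
    return tmp0 * tmp1
```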

Forward type annotation (doesn't exist)

Mark all the types that we can glean from doing a forward pass.

Concretely, we can mark the types of literals and of variables that refer to objects or imports. Maybe there are other things in this list?

There's a question of how we should mark things whose types we don't know. Maybe with a fresh type variable.
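A minimal sketch of that idea; the names (`fresh_tyvar`, `annotate`, the node-kind strings) are illustrative, not the real pass:

```python
import itertools

_ctr = itertools.count()

def fresh_tyvar() -> str:
    # Mint a fresh unification variable for nodes whose type the
    # forward pass cannot determine.
    return f"'t{next(_ctr)}"

def annotate(node_kind: str) -> str:
    # Literals get concrete types; everything else gets a fresh
    # type variable to be solved later by the HM backward pass.
    known = {"int_lit": "int", "bool_lit": "bool", "str_lit": "str"}
    return known.get(node_kind, fresh_tyvar())
```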

Make type calls explicit (exists)

At this point we should know enough about the program to insert explicit type applications.

If we have this type,

pd.Series: forall T . List[Refine[T]] -> pd.Series[T]{..}

and this program:

def fut():
  s = pd.Series([1, 2, 3])
  return s / -1

we can make the type applications explicit:

def fut():
  s = pd.Series[int]([1, 2, 3])
  return s / -1

If we can't figure out int precisely at this phase, we should replace it with a TypeVar.fresh(). This instantiates the abstracted variable with something that will be filled in during HM.

Hindley-Milner (backward phase of type inference)

For the types that we can't figure out during the forward pass, generate constraints from how a variable is used, then use unification to solve these constraints.
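For reference, here is a toy unification sketch of the kind of solver this pass would use: type variables are strings beginning with `'`, constructors are tuples, and the occurs check is omitted. Illustrative only, not solvent's implementation:

```python
def unify(t1, t2, subst=None):
    """Return a substitution (dict) making t1 and t2 equal."""
    subst = dict(subst or {})

    def resolve(t):
        # Chase a variable through the substitution to its binding.
        while isinstance(t, str) and t.startswith("'") and t in subst:
            t = subst[t]
        return t

    t1, t2 = resolve(t1), resolve(t2)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.startswith("'"):
        subst[t1] = t2
        return subst
    if isinstance(t2, str) and t2.startswith("'"):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        # Unify constructors component-wise, threading the substitution.
        for a, b in zip(t1, t2):
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")
```

Constraints generated from usages (e.g. "this list is used as a `List[int]`") would be fed through `unify` to fill in the fresh variables from the forward pass.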

Liquid Template

Generate fresh liquid variables for every type in the program.
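A sketch of minting fresh liquid variables, assuming a `Refine[base, K]` surface syntax like the one used in the examples later in this thread (the names here are hypothetical):

```python
import itertools

_k = itertools.count()

def fresh_k() -> str:
    # Mint a fresh liquid (refinement) variable: "K0", "K1", ...
    return f"K{next(_k)}"

def templatize(base: str) -> str:
    # Wrap a base type in a refinement template with a fresh liquid
    # variable, e.g. int -> Refine[int, "K0"].
    return f'Refine[{base}, "{fresh_k()}"]'
```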

Liquid Constraint Generation

Generate liquid constraints by walking the program.

Liquid Predicate Abstraction

Solve for the liquid type constraints by using predicate abstraction.
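A hedged sketch of the weakening loop behind predicate abstraction: start from all candidate predicates for each liquid variable, and drop any candidate not implied by a constraint's hypotheses, iterating to a fixpoint. The `implies` callback stands in for a real SMT validity check; everything here is illustrative:

```python
def solve(candidates, constraints, implies):
    """candidates: {kvar: set of predicate strings}
    constraints: list of (hypotheses, kvar) pairs
    implies(hyps, pred): stand-in for an SMT validity query
    """
    sol = {k: set(ps) for k, ps in candidates.items()}
    changed = True
    while changed:
        changed = False
        for hyps, k in constraints:
            # Keep only candidates the hypotheses actually entail.
            keep = {p for p in sol[k] if implies(hyps, p)}
            if keep != sol[k]:
                sol[k] = keep
                changed = True
    return sol
```

The surviving conjunction of candidates for each K is the strongest solution expressible over the candidate set.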

@sgpthomas (Collaborator, Author)

How should we handle self parameters?

Consider the following example:

# in context
s: pd.Series[int] = pd.Series([1, 2, 3])

val = s.max()

The max method on s has type max(self) -> T, but when we call it, the self parameter is implicit.

Option 1: Rewrite the program

The first option is to rewrite the program to make the self parameter explicit. For the above example, we would do:

val = s.max()

# becomes
val = pd.Series.max(s)

This is nice because we then don't need special logic to handle the self case whenever we handle function calls.

However, I think we would have to implement this during forward type annotation. That pass is starting to feel like it's doing too many things. Maybe that's OK for now.

Option 2: Special Handling

Another option is to check whether the first parameter is named self and has a SelfType, and to deal with that case separately whenever we handle a Call.

Conclusion (for now)

I think that for now, I'm going to implement Option 1. It seems like it will be simpler.
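For contrast, Option 2 could look roughly like this sketch (a hypothetical helper, not solvent's real code): instead of rewriting the AST, prepend the receiver to the argument list when the callee's first parameter is `self`:

```python
def check_call(fn_params, receiver, args):
    """Return the effective argument list for a Call.

    fn_params: parameter names of the callee, e.g. ["self", "other"]
    receiver: the object expression for a method call, or None
    """
    if fn_params and fn_params[0] == "self" and receiver is not None:
        # Method call: the receiver becomes the implicit first argument.
        return [receiver] + list(args)
    return list(args)
```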

@sgpthomas (Collaborator, Author)

Ended up going with Option 2. Option 1 was hard because I couldn't implement the transformation in end_GetAttr: I needed to produce a Call, which complicates the visiting order. I could probably get this to work with enough fiddling, but I decided to just handle this case specially when type-checking the arguments of functions. I don't think it ended up too messy.

@sgpthomas (Collaborator, Author) commented Jul 18, 2023

I wanted to put a derivation explaining how el works somewhere, and this thread, already being a place for notes about this rewrite, seemed like a fine place.

pd.Series supports multiplication by a scalar. We want the returned series to have a type that reflects that every element in the series is multiplied by b. Something like this type:

def __mul__(self: pd.Series[int], other: int) -> pd.Series[{int | V == self * other}]:
  ...

However, the return type here is not well typed in our predicate language: we don't have a notion of multiplication between a series and an int. We could hard-code these semantics, but that just pushes the logic into the SMT translation.

Instead, I think that it is simpler to have a "canonical" element function.

def el(self: pd.Series[{int | K}]) -> {int | K}:
  ...

The implementation of this function doesn't matter for our purposes. It would be reasonable to just hard-code this function I think.

Here's a derivation of how using this function would work during liquid inference.

# we are trying to find some K that generalizes V == el(a) * b
pd.Series[{int | V == el(a) * b}] <: pd.Series[{int | K}]

# we have the following things in our context
a: pd.Series[{int | V > 0}]
b: {int | V < 0}
|- {int | V == el(a) * b} <: {int | K}

# the use of el(a) effectively introduces a new element into the context
# we have the following things in our context
a: pd.Series[{int | V > 0}]
el(a): {int | V > 0}
b: {int | V < 0}
|- {int | V == el(a) * b} <: {int | K}

# from the normal liquid inference algo, we subst to get concrete assumptions
a: pd.Series[{int | V > 0}]
{int | el(a) > 0}
{int | b < 0}
|- {int | V == el(a) * b} <: {int | K}

# now we can perform predicate abstraction
el(a) > 0 && b < 0 && V == el(a) * b ==> V < 0
...

# K := V < 0 is a valid solution

@sgpthomas (Collaborator, Author)

Thinking about how we want to templatize pd.Series.

For this running series example, we generate this program after the type annotation + inference phase.

def fut() -> pd.Series[int]:
  tmp0: List[int] = [1, 2, 3]
  s: pd.Series[int] = pd.Series[int](tmp0)
  tmp1: int = -1
  return s * tmp1

We want to generate this program:

def fut() -> pd.Series[Refine[int, "K2"]]:
    tmp0: py.List[Refine[int, "K0"]] = [1, 2, 3]
    s: pd.Series[Refine[int, "K1"]] = pd.Series[Refine[int, "K1"]](tmp0)
    tmp1: Refine[int, V == -1] = -(1)
    return s * tmp1

with the following set of constraints:

V == 1 <: K0
V == 2 <: K0
V == 3 <: K0
K0 <: K1
pd.Series[int, V == el(s) * tmp1] <: pd.Series[int, K2]

This seems straightforward enough. Let's think about using the pd.Series.max function. I think this has a more interesting templating requirement.

def fut() -> pd.Series[int]:
  tmp0: List[int] = [1, 2, 3]
  s: pd.Series[int] = pd.Series[int](tmp0)
  tmp1: int = s.max()
  tmp2: int = -1
  return tmp1 * tmp2

We should templatize that program into:

def fut() -> pd.Series[Refine[int, K2]]:
  tmp0: List[Refine[int, K0]] = [1, 2, 3]
  s: pd.Series[Refine[int, K1]] = pd.Series[Refine[int, K1]](tmp0)
  tmp1: Refine[int, K1] = s.max()
  tmp2: Refine[int, V == -1] = -1
  return tmp1 * tmp2

Constraints:

V == 1 <: K0
V == 2 <: K0
V == 3 <: K0
K0 <: K1
V == tmp1 * tmp2 <: K2

The interesting bit of that program is that the refinement for tmp1 should use exactly the refinement that s uses. I'm not quite sure how to implement that, but I think it will become clear once I get there.
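One plausible way to get tmp1 to share s's refinement is to instantiate the method's bound liquid variable with the receiver's liquid variable when typing the call. This is purely a sketch; the substitution below is a crude string replace standing in for a proper substitution over types:

```python
# max's declared type: (self: pd.Series[{int | K}]) -> {int | K}
# Substituting the receiver's liquid variable (here K1) for the
# bound K makes tmp1's refinement exactly the one s carries.
def instantiate_ret(method_ret: str, bound_kvar: str, receiver_kvar: str) -> str:
    return method_ret.replace(bound_kvar, receiver_kvar)
```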

@sgpthomas (Collaborator, Author)

Thinking through how to templatize functions / object constructors.

We have the type forall T, List[T] -> pd.Series[T], which is a constructor for pd.Series. During type annotation, we find some concrete type for T. In the following example, T = int,

s = pd.Series([1, 2, 3])

because we are passing in a List[int]. The question at hand is how we should insert template variables. The first naive implementation is to just templatize the args and the return separately. But then we end up with this type:

List[{int | K0}] -> pd.Series[{int | K1}]

This could be correct if we also generated the constraint K1 <: K0. But this is not true in general. For example, the following implementation of the pd.Series constructor violates it:

def __init__(self, l: List[int]):
  self.data = [1, 2]

The l parameter is simply ignored, and so their refinements bear no relation.

My initial instinct was that we should always templatize List[T] -> Object[T] as List[{T | K}] -> Object[{T | K}]. However, I'm now seeing that because of the above example, this is not always true. I think this implies that we need to know upfront, in the type, that the refinement variables are the same. That is, the type of this constructor has to be List[{T | K}] -> pd.Series[{T | K}] up-front (rather than templatizing things later).

@dijkstracula (Owner)

Can I ask what is probably a stupid question about the first example you gave? In the case where l is ignored in the constructor, the value of l is ignored but the subtyping relationship still has to be proven. So, if the input list is List[int | el() > 0], then assigning the field [1,2] would be satisfied (since of course the base type is the same, too). Is it ignored because there's no path constraint tying the parameter to the field?

If so, would something like a where K1 <: K0 clause, or -> pd.Series[{int | ? super K0}], let us express this even in the case where no path constraint is generated?

@sgpthomas (Collaborator, Author)

If I understand your question correctly, I think the answer is yes, though I'm not really sure what you are saying about path constraints.

Suppose you know ahead of time (because you annotated or something), that the type of the constructor is forall T, List[{T | K}] -> pd.Series[{T | K}].

Then the question is: does the following type check?

def __init__(self, l: List[{int | K}]):
  self.data = [1, 2]

Python constructors aren't allowed to return anything. So we need some way of tying the types of input parameters to object fields.

Maybe something like this:

class pd.Series(Generic[K]):
  data: List[{int | K}]
  
  def __init__(self, l: List[{int | K}]):
    self.data = l

Here, the class is the thing abstracting over refinement variables. So when you instantiate it with some list like so: pd.Series([1, 2, 3]), we constrain the refinement of the class to be the same as the refinement of the passed in list. And we would check / generate this constraint when we are checking the body of __init__.

Ok, then back to the example where we ignore the input list l. Whether this is good depends on whether we are living in a closed world or an open world. In a closed world, this would type check if we only ever instantiate pd.Series with lists of type List[{int | V > 0}]. In an open world, this wouldn't type check, because we have to prove this for every possible list, some of which have negative elements.

And finally, about your thought of having some kind of where clause. I think that this would be useful. It certainly increases the expressiveness of our type system. Something like this would be cool (making up syntax of course):

class pd.Series(Generic[K0, K1]) where K0 <: K1:
  data: List[{int | K1}]

  def __init__(self, l: List[{int | K0}]):
    self.data = l

This was rather rambly, but I hope this maybe answered your question?

@sgpthomas (Collaborator, Author)

Finally have this in a state that I think is mergeable. Tests are working, and there are minimal hacks. By mainlining this now, I think it'll be easier to work in parallel and in smaller chunks.

Also, this is the most poorly named issue+PR+branch in existence. This started out as a simple move away from mutating state in our internal AST and ended up as close to a complete reworking of most of the core code.

@sgpthomas sgpthomas merged commit 73cf84d into main Jul 28, 2023
2 checks passed
@sgpthomas sgpthomas deleted the nathan/ast_ids branch July 28, 2023 17:54