
Change the way that we carry data in the AST #19

Merged · 21 commits merged into main from nathan/ast_ids on Jul 28, 2023
Conversation

sgpthomas (Collaborator)

State is carried in the AST as mutable fields. This introduces a class of bugs where state is not properly initialized at the places where it is used.

I propose that we give each AST node a unique id at creation time. Then, we can carry state in a separate Map data structure that maps id -> state. This has several benefits.

  1. We can require that every node has some piece of state by using the type id -> state instead of id -> state | None.
  2. We can enforce an ordering over passes at the type level. Suppose we have some pass Infer that annotates every node in the AST with a type. This pass will produce a Map[Id, Type]. If some later pass P requires this information, we can require that, in order to construct a visitor implementing pass P, we have to pass in a Map[Id, Type].

We will need a mechanism for mutating / creating new nodes. What ids should these new nodes use? Sometimes they probably want to inherit the id of the node they are replacing. If we are adding a genuinely new node, we may need to mint a new id.
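To make the proposal concrete, here is a minimal sketch of the idea. The names (`Node`, `Lit`, `infer`, `LaterPass`) are hypothetical, not the actual solvent code: nodes get a fresh id at creation time, a pass produces a total side table keyed by id, and a later pass can only be constructed from that table.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterable

# Global counter: every node created gets the next fresh id.
_fresh = itertools.count()

@dataclass(frozen=True)
class Node:
    node_id: int = field(default_factory=lambda: next(_fresh))

@dataclass(frozen=True)
class Lit(Node):
    value: int = 0

def infer(nodes: Iterable[Node]) -> Dict[int, str]:
    # A pass like Infer produces a *total* map id -> type, so every
    # node is guaranteed to have an entry (no `state | None`).
    return {n.node_id: "int" for n in nodes}

class LaterPass:
    # Requiring Infer's output in the constructor enforces pass
    # ordering at the type level.
    def __init__(self, types: Dict[int, str]):
        self.types = types
```

Because `LaterPass` cannot exist without a `Map[Id, Type]`, running it before `infer` is a construction error rather than a runtime surprise.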

@sgpthomas sgpthomas added this to the Get `norm` working milestone Jun 29, 2023
@sgpthomas sgpthomas added enhancement New feature or request low priority labels Jun 29, 2023
@sgpthomas (Collaborator, Author) commented Jun 29, 2023

Thinking about the ordering of passes. Not all of these passes exist yet; I'll make the ones that don't. Putting this here, because I think it will be useful to have this picture in mind when thinking about how exactly to pass information around between passes.

Passes

Parsing (exists)

We start with the raw text of a Python program (either from a function annotator using inspect, or from an entire file via the solvent CLI). We then use the Python ast library to parse this text into a Python AST. Finally, we translate the Python AST into our internal AST (syntax.py).

During this process, each node is given a unique ID.

A-Normalization (exists)

This pass converts our program into A-normal form. A-normal form is SSA-lite: every sub-computation is named, but existing variables are not renamed. So if there are two writes to the same variable in a program, we leave them alone.
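As a hypothetical illustration (not solvent's actual output), A-normalization names each sub-computation while leaving existing variables untouched:

```python
# Before A-normalization: nested sub-expressions.
def before(x):
    return (x + 1) * (x + 2)

# After A-normalization: every sub-computation gets a name (tmp0,
# tmp1), but `x` itself is not renamed -- this is not full SSA.
def after(x):
    tmp0 = x + 1
    tmp1 = x + 2
    return tmp0 * tmp1
```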

Forward type annotation (doesn't exist)

Mark all the types that we can glean from doing a forward pass.

Concretely, we can mark the types of literals and of variables that refer to objects or imports. Maybe there are other things in this list?

There's a question of how we should mark things whose types we don't know. Maybe with a fresh type variable.
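A minimal sketch of that idea; the names (`fresh_tyvar`, `annotate`, the node-kind strings) are illustrative, not the real pass:

```python
import itertools

_ctr = itertools.count()

def fresh_tyvar() -> str:
    # Mint a fresh unification variable for nodes whose type the
    # forward pass cannot determine.
    return f"'t{next(_ctr)}"

def annotate(node_kind: str) -> str:
    # Literals get concrete types; everything else gets a fresh
    # type variable to be solved later by the HM backward pass.
    known = {"int_lit": "int", "bool_lit": "bool", "str_lit": "str"}
    return known.get(node_kind, fresh_tyvar())
```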

Make type calls explicit (exists)

At this point we should know enough about the program to insert explicit type applications.

If we have this type,

pd.Series: forall T . List[Refine[T]] -> pd.Series[T]{..}

and this program:

def fut():
  s = pd.Series([1, 2, 3])
  return s / -1

we can make the type applications explicit:

def fut():
  s = pd.Series[int]([1, 2, 3])
  return s / -1

If we can't figure out int precisely at this phase, we should replace it with a TypeVar.fresh(). This instantiates the abstracted variable with something that will be filled in during HM.

Hindley-Milner (backward phase of type inference)

For the types that we can't figure out during the forward pass, generate constraints from how a variable is used, then use unification to solve these constraints.
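For reference, here is a toy unification sketch of the kind of solver this pass would use: type variables are strings beginning with `'`, constructors are tuples, and the occurs check is omitted. Illustrative only, not solvent's implementation:

```python
def unify(t1, t2, subst=None):
    """Return a substitution (dict) making t1 and t2 equal."""
    subst = dict(subst or {})

    def resolve(t):
        # Chase a variable through the substitution to its binding.
        while isinstance(t, str) and t.startswith("'") and t in subst:
            t = subst[t]
        return t

    t1, t2 = resolve(t1), resolve(t2)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.startswith("'"):
        subst[t1] = t2
        return subst
    if isinstance(t2, str) and t2.startswith("'"):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        # Unify constructors component-wise, threading the substitution.
        for a, b in zip(t1, t2):
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")
```

Constraints generated from usages (e.g. "this list is used as a `List[int]`") would be fed through `unify` to fill in the fresh variables from the forward pass.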

Liquid Template

Generate fresh liquid variables for every type in the program.
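A sketch of minting fresh liquid variables, assuming a `Refine[base, K]` surface syntax like the one used in the examples later in this thread (the names here are hypothetical):

```python
import itertools

_k = itertools.count()

def fresh_k() -> str:
    # Mint a fresh liquid (refinement) variable: "K0", "K1", ...
    return f"K{next(_k)}"

def templatize(base: str) -> str:
    # Wrap a base type in a refinement template with a fresh liquid
    # variable, e.g. int -> Refine[int, "K0"].
    return f'Refine[{base}, "{fresh_k()}"]'
```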

Liquid Constraint Generation

Generate liquid constraints by walking the program.

Liquid Predicate Abstraction

Solve for the liquid type constraints by using predicate abstraction.
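A hedged sketch of the weakening loop behind predicate abstraction: start from all candidate predicates for each liquid variable, and drop any candidate not implied by a constraint's hypotheses, iterating to a fixpoint. The `implies` callback stands in for a real SMT validity check; everything here is illustrative:

```python
def solve(candidates, constraints, implies):
    """candidates: {kvar: set of predicate strings}
    constraints: list of (hypotheses, kvar) pairs
    implies(hyps, pred): stand-in for an SMT validity query
    """
    sol = {k: set(ps) for k, ps in candidates.items()}
    changed = True
    while changed:
        changed = False
        for hyps, k in constraints:
            # Keep only candidates the hypotheses actually entail.
            keep = {p for p in sol[k] if implies(hyps, p)}
            if keep != sol[k]:
                sol[k] = keep
                changed = True
    return sol
```

The surviving conjunction of candidates for each K is the strongest solution expressible over the candidate set.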

@sgpthomas (Collaborator, Author)

How should we handle self parameters?

Consider the following example:

# in context
s: pd.Series[int] = pd.Series([1, 2, 3])

val = s.max()

The max method on s has type max(self) -> T, but when we call it, the self parameter is implicit.

Option 1: Rewrite the program

The first option is to rewrite the program to make the self parameter explicit. For the above example, we would do:

val = s.max()

# becomes
val = pd.Series.max(s)

This is nice because we then don't need special logic to handle the self case whenever we handle function calls.

However, I think we would have to implement this during forward type annotation. That pass is starting to feel like it's doing too many things. Maybe that's OK for now.

Option 2: Special Handling

Another option is to check whether the first parameter is named self and has a SelfType, and to deal with that case separately whenever we handle a Call.

Conclusion (for now)

I think that for now, I'm going to implement Option 1. It seems like it will be simpler.
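For contrast, Option 2 could look roughly like this sketch (a hypothetical helper, not solvent's real code): instead of rewriting the AST, prepend the receiver to the argument list when the callee's first parameter is `self`:

```python
def check_call(fn_params, receiver, args):
    """Return the effective argument list for a Call.

    fn_params: parameter names of the callee, e.g. ["self", "other"]
    receiver: the object expression for a method call, or None
    """
    if fn_params and fn_params[0] == "self" and receiver is not None:
        # Method call: the receiver becomes the implicit first argument.
        return [receiver] + list(args)
    return list(args)
```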

@sgpthomas (Collaborator, Author)

Ended up going with Option 2. Option 1 was hard because I couldn't implement the transformation in end_GetAttr: I needed to produce a Call, which complicates the visiting order. I could probably get this to work with enough fiddling, but I decided to just handle this case specially when type-checking the arguments of functions. I don't think it ended up too messy.

@sgpthomas (Collaborator, Author) commented Jul 18, 2023

I wanted to put a derivation explaining how el works somewhere, and this thread, already being a place for notes about this rewrite, seemed like a fine place.

pd.Series supports multiplication by a scalar. We want the returned series to have a type that reflects that every element in the series is multiplied by b. Something like this type:

def __mul__(self: pd.Series[int], other: int) -> pd.Series[{int | V == self * other}]:
  ...

However, the return type here is not well typed in our predicate language: we don't have a notion of multiplication between a series and an int. We could hard-code these semantics, but that just pushes the logic into the SMT translation.

Instead, I think that it is simpler to have a "canonical" element function.

def el(self: pd.Series[{int | K}]) -> {int | K}:
  ...

The implementation of this function doesn't matter for our purposes. It would be reasonable to just hard-code this function I think.

Here's a derivation of how using this function would work during liquid inference.

# we are trying to find some K that generalizes V == el(a) * b
pd.Series[{int | V == el(a) * b}] <: pd.Series[{int | K}]

# we have the following things in our context
a: pd.Series[{int | V > 0}]
b: {int | V < 0}
|- {int | V == el(a) * b} <: {int | K}

# the use of el(a) effectively introduces a new element into the context
# we have the following things in our context
a: pd.Series[{int | V > 0}]
el(a): {int | V > 0}
b: {int | V < 0}
|- {int | V == el(a) * b} <: {int | K}

# from the normal liquid inference algo, we subst to get concrete assumptions
a: pd.Series[{int | V > 0}]
{int | el(a) > 0}
{int | b < 0}
|- {int | V == el(a) * b} <: {int | K}

# now we can perform predicate abstraction
el(a) > 0 && b < 0 && V == el(a) * b ==> V < 0
...

# K := V < 0 is a valid solution

@sgpthomas (Collaborator, Author)

Thinking about how we want to templatize pd.Series.

For this running series example, we generate this program after the type annotation + inference phase.

def fut() -> pd.Series[int]:
  tmp0: List[int] = [1, 2, 3]
  s: pd.Series[int] = pd.Series[int](tmp0)
  tmp1: int = -1
  return s * tmp1

We want to generate this program:

def fut() -> pd.Series[Refine[int, "K2"]]:
    tmp0: py.List[Refine[int, "K0"]] = [1, 2, 3]
    s: pd.Series[Refine[int, "K1"]] = pd.Series[Refine[int, "K1"]](tmp0)
    tmp1: Refine[int, V == -1] = -(1)
    return s * tmp1

with the following set of constraints:

V == 1 <: K0
V == 2 <: K0
V == 3 <: K0
K0 <: K1
pd.Series[int, V == el(s) * tmp1] <: pd.Series[int, K2]

This seems straightforward enough. Let's think about using the pd.Series.max function. I think this has a more interesting templating requirement.

def fut() -> pd.Series[int]:
  tmp0: List[int] = [1, 2, 3]
  s: pd.Series[int] = pd.Series[int](tmp0)
  tmp1: int = s.max()
  tmp2: int = -1
  return tmp1 * tmp2

We should templatize that program into:

def fut() -> pd.Series[Refine[int, K2]]:
  tmp0: List[Refine[int, K0]] = [1, 2, 3]
  s: pd.Series[Refine[int, K1]] = pd.Series[Refine[int, K1]](tmp0)
  tmp1: Refine[int, K1] = s.max()
  tmp2: Refine[int, V == -1] = -1
  return tmp1 * tmp2

Constraints:

V == 1 <: K0
V == 2 <: K0
V == 3 <: K0
K0 <: K1
V == tmp1 * tmp2 <: K2

The interesting bit of that program is that the refinement for tmp1 should use exactly the refinement that s uses. I'm not quite sure how to implement that, but I think it will become clear once I get there.
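One plausible way to get tmp1 to share s's refinement is to instantiate the method's bound liquid variable with the receiver's liquid variable when typing the call. This is purely a sketch; the substitution below is a crude string replace standing in for a proper substitution over types:

```python
# max's declared type: (self: pd.Series[{int | K}]) -> {int | K}
# Substituting the receiver's liquid variable (here K1) for the
# bound K makes tmp1's refinement exactly the one s carries.
def instantiate_ret(method_ret: str, bound_kvar: str, receiver_kvar: str) -> str:
    return method_ret.replace(bound_kvar, receiver_kvar)
```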

@sgpthomas (Collaborator, Author)

Thinking through how to templatize functions / object constructors.

We have the type forall T, List[T] -> pd.Series[T], which is a constructor for pd.Series. During type annotation, we find some concrete type for T. In the following example, T = int,

s = pd.Series([1, 2, 3])

because we are passing in a List[int]. The question at hand is how we should insert template variables. The first naive implementation is to just templatize the args and the return separately. But then we end up with this type:

List[{int | K0}] -> pd.Series[{int | K1}]

This could be correct if we also generated the constraint K1 <: K0. But this is not true in general. For example, the following implementation of the pd.Series constructor violates it:

def __init__(self, l: List[int]):
  self.data = [1, 2]

The l parameter is simply ignored, and so their refinements bear no relation.

My initial instinct was that we should always templatize List[T] -> Object[T] as List[{T | K}] -> Object[{T | K}]. However, I'm now seeing that because of the above example, this is not always true. I think this implies that we need to know upfront, in the type, that the refinement variables are the same. That is, the type of this constructor has to be List[{T | K}] -> pd.Series[{T | K}] up-front (rather than templatizing things later).

@dijkstracula (Owner)

Can I ask what is probably a stupid question about the first example you gave? In the case where l is ignored in the constructor, the value of l is ignored but the subtyping relationship still has to be proven. So, if the input list is List[int | el() > 0], then assigning the field [1,2] would be satisfied (since of course the base type is the same, too). Is it ignored because there's no path constraint tying the parameter to the field?

If so, would something like a where K1 <: K0 clause, or -> pd.Series[{int | ? super K0}], let us express this even in the case where no path constraint is generated?

@sgpthomas (Collaborator, Author)

If I understand your question correctly, I think the answer is yes, though I'm not really sure what you are saying about path constraints.

Suppose you know ahead of time (because you annotated or something), that the type of the constructor is forall T, List[{T | K}] -> pd.Series[{T | K}].

Then the question is: does the following type check?

def __init__(self, l: List[{int | K}]):
  self.data = [1, 2]

Python constructors aren't allowed to return anything. So we need some way of tying the types of input parameters to object fields.

Maybe something like this:

class pd.Series(Generic[K]):
  data: List[{int | K}]
  
  def __init__(self, l: List[{int | K}]):
    self.data = l

Here, the class is the thing abstracting over refinement variables. So when you instantiate it with some list like so: pd.Series([1, 2, 3]), we constrain the refinement of the class to be the same as the refinement of the passed in list. And we would check / generate this constraint when we are checking the body of __init__.

Ok, then back to the example where we ignore the input list l. Whether this is good depends on whether we are living in a closed world or an open world. In a closed world, this would type check if we only ever instantiate pd.Series with lists of type List[{int | V > 0}]. In an open world, this wouldn't type check, because we have to prove this for every possible list, some of which have negative elements.

And finally, about your thought of having some kind of where clause. I think that this would be useful. It certainly increases the expressiveness of our type system. Something like this would be cool (making up syntax of course):

class pd.Series(Generic[K0, K1]) where K0 <: K1:
  data: List[{int | K1}]

  def __init__(self, l: List[{int | K0}]):
    self.data = l

This was rather rambly, but I hope this maybe answered your question?

@sgpthomas (Collaborator, Author)

Finally have this in a state that I think is mergeable. Tests are working, and there are minimal hacks. By mainlining this now, I think it'll be easier to work in parallel and in smaller chunks.

Also, this is the most poorly named issue+PR+branch in existence. This started out as a simple move away from mutating state in our internal AST and ended up as close to a complete reworking of most of the core code.

@sgpthomas sgpthomas merged commit 73cf84d into main Jul 28, 2023
2 checks passed
@sgpthomas sgpthomas deleted the nathan/ast_ids branch July 28, 2023 17:54