Calyx-Opt Lab Notebook #2202
Replies: 6 comments 14 replies
-
I'll start this notebook entry by repeating the same update I gave at today's Calyx-Opt meeting.

### Recap of Problem

Suppose we have the following control structure:
FSMs are currently implemented like this:
As you can see, because the parent does not stop counting when it offloads to the child, this can drastically increase the size of the parent register. I implemented FSMs so that the parent pauses when it offloads to the child:
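To make the register-size effect concrete, here is a minimal back-of-the-envelope sketch in Python (the state counts and latencies below are invented for illustration, not taken from the benchmarks):

```python
import math

def parent_fsm_bits(parent_states, child_latency, pause_during_offload):
    """Bits needed for the parent FSM register.

    Old scheme: the parent keeps counting while the child runs, so its
    register must count up to parent_states + child_latency.
    New scheme: the parent pauses during the offload, so its register
    only needs to distinguish the parent's own states.
    """
    max_count = parent_states if pause_during_offload else parent_states + child_latency
    return max(1, math.ceil(math.log2(max_count)))

# Hypothetical example: 5 parent states offloading to a 1000-cycle child.
print(parent_fsm_bits(5, 1000, pause_during_offload=False))  # old: 10 bits
print(parent_fsm_bits(5, 1000, pause_during_offload=True))   # new: 3 bits
```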
### Synthetic Results

I wrote a synthetic benchmark, i.e., a benchmark specifically written to showcase the advantage of this new approach. Since it's handwritten, it's a pretty small design, but here are the numbers:
I view these results as a fairly promising upper bound on the kind of results achievable with the new technique.

### Real Results

Here are some results on real benchmarks (blue is new, orange is old; also, I will work on making the graphs look nicer).

LUT usage:

Register usage:

Worst slack:

The tl;dr is that the trends follow the synthetic results (LUT and worst slack are better, registers are worse), just not as strongly.
-
Here are some graphs on varying the dynamic FSM duplication parameter, with some one-hot encoding. In the following graphs,
-
### Description

Here are the graphs for the "smart-seq" split technique. Here's the description, copied from the PR (#2217):

Similar to duplication, the goal with splitting a
into:
The children schedules generated control registers like so:
which was the goal. However, with this approach, the following groups and assignments would be generated:
Lines 1 and 2 are pretty similar; they each check the current state of the FSM register and ensure some other group has finished, and then they each update their register's value with a new value (the same value for each register!). Once we got synthesis results from this "new-fsm-insertion" method, we saw that WS decreased and LUT usage increased; we suspected it had something to do with the fact that we were duplicating the logic to transition FSM states, since both

So, we decided to open up an option in TDCC that lets a
In short, the idea is duplication, but with an emphasis on making sure the registers reuse logic to update themselves. Benchmarking + synthesis results are in progress.

### Graphs

LUT Comparison

Register Comparison

CLB Register Comparison

Max. Frequency Estimate
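As a rough illustration of the shared-logic idea, here is a behavioral Python sketch (not the actual TDCC output; names are invented). The point is that the duplicated registers load the same next-state value, so that value only needs to be computed once:

```python
def step_duplicated(fsms, done):
    """Old scheme: each copy of the FSM register has its own copy of the
    transition logic, evaluated independently per register."""
    for i in range(len(fsms)):
        if done:                  # guard re-evaluated for each register
            fsms[i] = fsms[i] + 1 # next-state logic duplicated per register
    return fsms

def step_shared(fsms, done):
    """New scheme: evaluate the transition logic once and fan the same
    next-state value out to every register copy."""
    if done:
        nxt = fsms[0] + 1         # single shared next-state computation
        fsms = [nxt] * len(fsms)
    return fsms

# Both schemes keep the copies in agreement; the shared one just
# computes the update once.
assert step_duplicated([3, 3], True) == step_shared([3, 3], True) == [4, 4]
```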
-
Using the technique described in #2207, I ran results with one-hot encoding. Blue line is OHE (with new technique), orange line is binary (with new technique), red line is what is on the
For the others, I'll have to poke around to see why resource usage is getting worse... I suspect it may have something to do with sharing FSMs, but I'm not sure. The good news is that worst slack gets better pretty much universally for OHE.
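For intuition on why OHE tends to help worst slack, here is a small sketch (not the actual Calyx lowering) contrasting state queries under the two encodings: a binary check compares every bit of the register, while a one-hot check reads a single bit.

```python
def binary_in_state(fsm_val, state, width):
    """Binary encoding: all `width` bits of the register feed the
    equality comparator."""
    return (fsm_val & ((1 << width) - 1)) == state

def onehot_in_state(fsm_val, state):
    """One-hot encoding: only bit `state` of the register is read."""
    return (fsm_val >> state) & 1 == 1

# State 5 as a 3-bit binary value vs. an 8-bit one-hot value:
assert binary_in_state(0b101, 5, 3)
assert onehot_in_state(1 << 5, 5)
assert not onehot_in_state(1 << 4, 5)
```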
-
### Summer Wrap-Up Summary

Going to record the conclusion for our end-of-summer meeting right here. I'm mainly going to talk about static FSMs; @parthsarkar can give an update on dynamic FSMs.

The main thing I did this summer was (a) creating a way to pause the parent FSM when offloading computations during static repeat compilation (PR here) and (b) coming up with this "tree" abstraction to be able to do this more easily (issue here). @rachitnigam mentioned an analogy to parallel runtime systems (e.g., Cilk, NESL, MPL). Also, we identified two broad types of optimizations we could perform: (1) determining what a good schedule is and (2) given a schedule, determining the best way to implement it.

### Given a schedule, how do we optimize its implementation?

There are two different paths we talked about.

#### How to represent the integer

We have implemented OHE and binary, but are wondering if there could be a middle ground between the two that gives the best of both worlds (in particular, with OHE you only have to check a single bit to see whether you're in a given state, and you only need a shifter to count up, but the register you use costs more bits). Perhaps you could represent integers in the following way: choose a base b, and then represent each digit of the number using a one-hot-encoded b-bit register. For example, if you choose base 5, then you would have two 5-bit registers: one would count up to 5 and then reset, and the other would count how many times the first register has counted to 5 (if you needed to count above 25, then obviously you would need another register).

#### Tree/Control-flow level optimizations

The second avenue for optimizations takes a step back and thinks about how we can better use register(s) to implement the entire control flow of a given program. Currently, we greedily share FSMs (in particular, since each node represents an FSM, we perform a greedy coloring after inserting conflicts between nodes of each tree).
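The base-b idea above can be sketched in Python as a two-digit, base-5 one-hot counter (a behavioral model for intuition, not a proposed implementation):

```python
B = 5  # chosen base; each digit is a B-bit one-hot register

def tick(low, high):
    """Advance a two-digit base-B one-hot counter by one step.

    `low` and `high` are B-bit one-hot values (exactly one bit set).
    Counting up within a digit is just a left shift, which is the
    appeal of OHE; the high digit only moves when the low one wraps."""
    if low == 1 << (B - 1):                       # low digit about to wrap
        low = 1                                   # reset to one-hot "0"
        high = 1 if high == 1 << (B - 1) else high << 1
    else:
        low <<= 1
    return low, high

# Count 7 steps from zero: the counter should read high digit 1,
# low digit 2, i.e. 1 * 5 + 2 = 7.
low, high = 1, 1
for _ in range(7):
    low, high = tick(low, high)
assert (low, high) == (1 << 2, 1 << 1)
```

This counts up to B^2 = 25 states using 2B = 10 register bits, sitting between pure binary (5 bits) and pure OHE (25 bits) for the same range.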
One point @sampsyo mentioned in the meeting is that there is some redundancy in checking FSM states. Another, even more vague idea inspired by this thought was to have some sort of full-program representation of the control program we are implementing so we can perform FSM optimizations on it (the trees can handle the static islands, but we need a representation for the dynamic FSM that triggers these static islands).

### Some maintenance stuff for me to do
-
### Summer Dynamic FSM Optimizations Recap

A dynamic schedule in Calyx can be represented as a directed graph, in which nodes are "enables" (or groups to be executed) and edges specify the order in which these enables are executed. As Caleb mentioned, we're looking into opening up better, equivalent schedules (i.e., creating a schedule that remains faithful to the original schedule's output for a given input), but our work this summer was mainly focused on exploring how a specific schedule (or finite state machine) is represented in hardware. The broad goal was to represent the FSM such that, when synthesized, FPGA resource (e.g., LUT, register) usage would go down, while the maximum frequency of the overall design would increase.

We had rough ideas of why our optimizations would impact these results. In particular, increased LUT usage was synonymous with more complicated internal logic (such as a complex transition-assignment guard for the FSM register), and decreased frequency might be the result of a high fan-out wire (i.e., one source feeding into many destinations). Our optimization ideas were focused on addressing these physical constraints, and we often ran into important tradeoffs, expected and unexpected.

### Integer Representation

We added the option for dynamic schedules to also be one-hot encoded. The thought here was: since dynamic schedules only require equality checks on the current state (and not whether the current state lies within a range of values, as may be required for static schedules), we would only need to read from one bit of the FSM register. This would minimize internal transition logic (thereby targeting LUT usage), since you'd no longer need all the bits of the register to compute the guard, while also reducing fan-out on each register wire. Across our benchmarks, with

The important thing will be to figure out exactly where our assumptions deviated from what the synthesis results show.
Understanding the true relationship between guard logic complexity and LUT usage will lead to more insights on what other integer representations could be explored (as Caleb mentioned, something like having a parameter that sets "how much" in between binary and OHE we are), or even help us fix our methods if we have implementation errors. A good next step is to dig into the designs with higher LUT usage and figure out which control-flow constructs prevented OHE from having the benefits we expected (this is part of a higher-order point of making sure we know exactly what goes on when we open up options). One last note for dynamic integer representations: since many schedules have transitions like

### Spreading Queries

This next set of optimizations was focused on reducing the fan-out that a single FSM register could experience if its output is used in every enable and transition guard. The idea is, if we could spread these queries across two registers that agree at each cycle, then we could trade increased register usage for an increase in frequency that might come from having each register drive fewer wires.

#### Duplication

This method simply creates an arbitrary number of identical FSMs. If we duplicate once, and if an FSM has

#### Splitting

This method splits a large

#### @new_fsm Insertion

This is a pass that inserts the

#### Shared Logic Splitting

This builds on

### Future Work on Dynamic FSMs
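A toy sketch of the query-spreading idea described above (parameter names invented): with k lockstep copies of the FSM register, round-robin the state queries across the copies so each one drives fewer destinations.

```python
def assign_queries(num_queries, num_copies):
    """Map each FSM state query to one register copy, round-robin.

    All copies hold the same value every cycle, so any copy can answer
    any query; spreading them divides the fan-out per register."""
    return [q % num_copies for q in range(num_queries)]

# 12 guards spread over 2 duplicated registers: each copy now feeds
# 6 destinations instead of 12, at the cost of one extra register.
owners = assign_queries(12, 2)
assert owners.count(0) == owners.count(1) == 6
```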
-
@parthsarkar17 and I will post our updates here.