Calyx-Opt Lab Notebook #2202
Replies: 6 comments 14 replies
-
I'll start this notebook entry by repeating the same update I gave at today's Calyx-Opt meeting.

### Recap of Problem

Suppose we have the following control structure:
FSMs are currently implemented like this:
As you can see, because the parent does not stop counting when it offloads to the child, this can drastically increase the size of the parent register. I implemented FSMs so that the parent pauses when it offloads to the child:
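To make the register-size effect concrete, here is a minimal back-of-the-envelope sketch in Python (the state counts and latencies below are invented for illustration, not taken from the benchmarks):

```python
import math

def parent_fsm_bits(parent_states, child_latency, pause_during_offload):
    """Bits needed for the parent FSM register.

    Old scheme: the parent keeps counting while the child runs, so its
    register must count up to parent_states + child_latency.
    New scheme: the parent pauses during the offload, so its register
    only needs to distinguish the parent's own states.
    """
    max_count = parent_states if pause_during_offload else parent_states + child_latency
    return max(1, math.ceil(math.log2(max_count)))

# Hypothetical example: 5 parent states offloading to a 1000-cycle child.
print(parent_fsm_bits(5, 1000, pause_during_offload=False))  # old: 10 bits
print(parent_fsm_bits(5, 1000, pause_during_offload=True))   # new: 3 bits
```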
### Synthetic Results

I wrote a synthetic benchmark, i.e., a benchmark specifically written to showcase the advantage of this new approach. Since it's handwritten, it's a pretty small design, but here are the numbers:
I view these results as a fairly promising upper bound on the kind of results achievable with the new technique.

### Real Results

Here are some results on real benchmarks (blue is new, orange is old; also, I will work on making the graphs look nicer).

LUT usage:

Register usage:

Worst slack:

The tl;dr is that the trends follow the synthetic results (LUT and worst slack are better, registers are worse), just not as strongly.
-
Here are some graphs on varying the dynamic FSM duplication parameter, with some one-hot encoding. In the following graphs,
-
### Description

Here are the graphs for the "smart-seq" split technique. Here's the description, copied from the PR (#2217):

Similar to duplication, the goal with splitting a
into:
The children schedules generated control registers like so:
which was the goal. However, with this approach, the following groups and assignments would be generated:
Lines 1 and 2 are pretty similar; they each check the current state of the FSM register and ensure some other group has finished, and then they each update their register's value with a new value (the same value for each register!). Once we got synthesis results from this "new-fsm-insertion" method, we saw that WS decreased and LUT usage increased; we suspected it had something to do with the fact that we were duplicating the logic to transition FSM states, since both

So, we decided to open up an option in TDCC that lets a
In short, the idea is duplication, but with an emphasis on making sure the registers reuse logic to update themselves. Benchmarking + synthesis results are in progress.

### Graphs

LUT Comparison

Register Comparison

CLB Register Comparison

Max. Frequency Estimate
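As a rough illustration of the shared-logic idea, here is a behavioral Python sketch (not the actual TDCC output; names are invented). The point is that the duplicated registers load the same next-state value, so that value only needs to be computed once:

```python
def step_duplicated(fsms, done):
    """Old scheme: each copy of the FSM register has its own copy of the
    transition logic, evaluated independently per register."""
    for i in range(len(fsms)):
        if done:                  # guard re-evaluated for each register
            fsms[i] = fsms[i] + 1 # next-state logic duplicated per register
    return fsms

def step_shared(fsms, done):
    """New scheme: evaluate the transition logic once and fan the same
    next-state value out to every register copy."""
    if done:
        nxt = fsms[0] + 1         # single shared next-state computation
        fsms = [nxt] * len(fsms)
    return fsms

# Both schemes keep the copies in agreement; the shared one just
# computes the update once.
assert step_duplicated([3, 3], True) == step_shared([3, 3], True) == [4, 4]
```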
-
Using the technique described in #2207, I ran results with one-hot encoding. Blue line is OHE (with new technique), orange line is binary (with new technique), red line is what is on the
For the others, I'll have to poke around to see why resource usage is getting worse... I suspect it may have something to do with sharing FSMs, but I'm not sure. The good news is that worst slack gets better pretty much universally for OHE.
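For intuition on why OHE tends to help worst slack, here is a small sketch (not the actual Calyx lowering) contrasting state queries under the two encodings: a binary check compares every bit of the register, while a one-hot check reads a single bit.

```python
def binary_in_state(fsm_val, state, width):
    """Binary encoding: all `width` bits of the register feed the
    equality comparator."""
    return (fsm_val & ((1 << width) - 1)) == state

def onehot_in_state(fsm_val, state):
    """One-hot encoding: only bit `state` of the register is read."""
    return (fsm_val >> state) & 1 == 1

# State 5 as a 3-bit binary value vs. an 8-bit one-hot value:
assert binary_in_state(0b101, 5, 3)
assert onehot_in_state(1 << 5, 5)
assert not onehot_in_state(1 << 4, 5)
```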
-
### Summer Wrap-Up Summary

Going to record the conclusion for our end-of-summer meeting right here. I'm mainly going to talk about static FSMs; @parthsarkar can give an update on dynamic FSMs.

The main thing I did this summer was (a) creating a way to pause the parent FSM when offloading computations during static repeat compilation (PR here) and (b) coming up with this "tree" abstraction to be able to do this more easily (issue here). @rachitnigam mentioned an analogy to parallel runtime systems (e.g., Cilk, NESL, MPL). Also, we identified two broad types of optimizations we could perform: (1) determining what a good schedule is and (2) given a schedule, determining the best way to implement it.

### Given a schedule, how do we optimize its implementation?

There are two different paths we talked about.

#### How to represent the integer

We have implemented OHE and binary, but are wondering if there could be a middle ground between the two that gives the best of both worlds (in particular, with OHE you only have to check a single bit to see whether you're in a given state, and you only need a shifter to count up, but the register you use costs more bits). Perhaps you could represent integers in the following way: choose a base b, and then represent each digit of the number using a one-hot-encoded b-bit register. For example, if you choose base 5, then you would have two 5-bit registers: one would count up to 5 and then reset, and the other would count how many times the first register has counted to 5 (if you needed to count above 25, then obviously you would need another register).

#### Tree/Control-flow level optimizations

The second avenue for optimizations takes a step back and thinks about how we can better use register(s) to implement the entire control flow of a given program. Currently, we greedily share FSMs (in particular, since each node represents an FSM, we perform a greedy coloring after inserting conflicts between nodes of each tree).
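The base-b idea above can be sketched in Python as a two-digit, base-5 one-hot counter (a behavioral model for intuition, not a proposed implementation):

```python
B = 5  # chosen base; each digit is a B-bit one-hot register

def tick(low, high):
    """Advance a two-digit base-B one-hot counter by one step.

    `low` and `high` are B-bit one-hot values (exactly one bit set).
    Counting up within a digit is just a left shift, which is the
    appeal of OHE; the high digit only moves when the low one wraps."""
    if low == 1 << (B - 1):                       # low digit about to wrap
        low = 1                                   # reset to one-hot "0"
        high = 1 if high == 1 << (B - 1) else high << 1
    else:
        low <<= 1
    return low, high

# Count 7 steps from zero: the counter should read high digit 1,
# low digit 2, i.e. 1 * 5 + 2 = 7.
low, high = 1, 1
for _ in range(7):
    low, high = tick(low, high)
assert (low, high) == (1 << 2, 1 << 1)
```

This counts up to B^2 = 25 states using 2B = 10 register bits, sitting between pure binary (5 bits) and pure OHE (25 bits) for the same range.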
One point @sampsyo mentioned in the meeting is that there is some redundancy in checking FSM states. Another, even more vague idea inspired by this thought was to have some sort of full-program representation of the control program we are implementing so we can perform FSM optimizations on it (the trees can handle the static islands, but we need a representation for the dynamic FSM that triggers these static islands).

### Some maintenance stuff for me to do
-
### Summer Dynamic FSM Optimizations Recap

A dynamic schedule in Calyx can be represented as a directed graph, in which nodes are "enables" (or groups to be executed) and edges specify the order in which these enables are executed. As Caleb mentioned, we're looking into opening up better, equivalent schedules (i.e., creating a schedule that remains faithful to the original schedule's output for a given input), but our work this summer was mainly focused on exploring how a specific schedule (or finite state machine) is represented in hardware. The broad goal was to represent the FSM such that, when synthesized, FPGA resource (e.g., LUT, register) usage would go down, while the maximum frequency of the overall design would increase.

We had rough ideas of why our optimizations would impact these results. In particular, increased LUT usage was synonymous with more complicated internal logic (such as a complex transition-assignment guard for the FSM register), and decreased frequency might be the result of a high fan-out wire (i.e., one source feeding into many destinations). Our optimization ideas were focused on addressing these physical constraints, and we often ran into important tradeoffs, expected and unexpected.

### Integer Representation

We added the option for dynamic schedules to also be one-hot encoded. The thought here was: since dynamic schedules only require equality checks on the current state (and not whether the current state lies within a range of values, as may be required for static schedules), we would only need to read from one bit of the FSM register. This would minimize internal transition logic (thereby targeting LUT usage), since you'd no longer need all the bits of the register to compute the guard, while also reducing fan-out on each register wire. Across our benchmarks, with

The important thing will be to figure out exactly where our assumptions deviated from what the synthesis results show.
Understanding the true relationship between guard logic complexity and LUT usage will lead to more insights on what other integer representations could be explored (as Caleb mentioned, something like having a parameter that sets "how much" in between binary and OHE we are), or even help us fix our methods if we have implementation errors. A good next step is to dig into the designs with higher LUT usage and figure out which control-flow constructs prevented OHE from having the benefits we expected (this is part of a higher-order point of making sure we know exactly what goes on when we open up options). One last note for dynamic integer representations: since many schedules have transitions like

### Spreading Queries

This next set of optimizations was focused on reducing the fan-out that a single FSM register could experience if its output is used in every enable and transition guard. The idea is, if we could spread these queries across two registers that agree at each cycle, then we could trade increased register usage for an increase in frequency that might come from having each register drive fewer wires.

#### Duplication

This method simply creates an arbitrary number of identical FSMs. If we duplicate once, and if an FSM has

#### Splitting

This method splits a large

#### @new_fsm Insertion

This is a pass that inserts the

#### Shared Logic Splitting

This builds on

### Future Work on Dynamic FSMs
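A toy sketch of the query-spreading idea described above (parameter names invented): with k lockstep copies of the FSM register, round-robin the state queries across the copies so each one drives fewer destinations.

```python
def assign_queries(num_queries, num_copies):
    """Map each FSM state query to one register copy, round-robin.

    All copies hold the same value every cycle, so any copy can answer
    any query; spreading them divides the fan-out per register."""
    return [q % num_copies for q in range(num_queries)]

# 12 guards spread over 2 duplicated registers: each copy now feeds
# 6 destinations instead of 12, at the cost of one extra register.
owners = assign_queries(12, 2)
assert owners.count(0) == owners.count(1) == 6
```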
-
@parthsarkar17 and I will post our updates here.