Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test case for duplicate FSM optimization #2204

Merged
merged 13 commits into from
Jul 14, 2024
Merged

Conversation

calebmkim
Copy link
Contributor

@calebmkim calebmkim commented Jul 11, 2024

@parthsarkar17 The runt test for duplicate FSMs was passing in the incorrect parameter. What probably happened was that it was using an old pass input parameter name that was updated but then the runt command was not updated.

This is one of the reasons why we should have at least one snapshot test for the FSM optimizations, so things like this are easier to catch (not a big deal at all, but I think it's good practice from now on).

fsm0.write_en = fsm0.out == 2'd2 & C[done] ? 1'd1;
fsm1.in = fsm0.out == 2'd2 & C[done] ? 2'd3;
fsm1.write_en = fsm0.out == 2'd2 & C[done] ? 1'd1;
tdcc[done] = fsm0.out == 2'd3 ? 1'd1;
Copy link
Contributor Author

@calebmkim calebmkim Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these internal transitions use fsm0. Is that intentional?

I think it might be worth doing the following: all fsm==0../n2 guards should read from fsm0 and all fsm==n/2....n guards should read from fsm1.
It seems like right now, we only do this for certain queries.

@parthsarkar17 lmk if you want to take a shot at this. I think I could also probably do this as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I'm impatient so I just did it myself.

@parthsarkar17 please take a sec to review though to see if the changes look good to you.

Copy link
Contributor

@parthsarkar17 parthsarkar17 Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is intentional. The enables (e.g. B[go] = !B[done] & fsm{x}.out == 2'd2 ? 1'd1;) are the ones that are split evenly between the two registers. The transitions for each of the registers are based on the logic for fsm0. I thought that was fine because the second register can just update based off of the first register's logic. If we were to split this logic, then that might just add more complexity. Is that right?

On second thought, I can see how all transition logic coming from fsm0 may also contribute to excessive fanout from the first register. If we split those guards between the two registers, that might help. I originally thought we are simply using the transition values computed for fsm0 to update fsm1, which is true, but it might be even better to split that logic also.

Copy link
Contributor Author

@calebmkim calebmkim Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So your idea is there are two choices:

  1. register A and register B both read exclusively from register A
  2. register A and register B both read from both registers A and B.

And your idea is that (1) will use more resources/have a worse slack than (2). That may be the case: it might be simpler for the synthesis tools to place everything if the dataflow graph is simpler.

Here's the reason I think it might actually be more profitable to do (2).
The number of queries (i.e., fsm == i) you have to do is actually fewer:
In case (1) fsm0 will probably have to perform N queries since it's in charge of all the transitions. fsm1 will probably have to perform N/2 queries (for the 1/2 of the enable queries). Which leads to 1.5 N queries total.
In case (2) fsm0 and fsm1 will both have to perform N/2 queries (since all queries are split), which leads to N queries total.

I'm not sure which is correct: it might be worth running some experiments to see. We can test this commit vs. a previous commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah those are also the options as I see them. I think resources wise, they should be the same (because the logic for each state transition needs to come from one of the registers, doesn't matter from which, right?), but worst slack might be worse if we read exclusively from A.

And yeah, I totally didn't think about the unevenness that could come from all transition reads coming from A. The math sounds good to me!

// num_states+1 is needed to prevent error (the done condition needs
// to check past the number of states, i.e., will check fsm == 3 when
// num_states == 3).
let reg_to_query: usize = (state * num_registers / (num_states + 1))
Copy link
Contributor

@parthsarkar17 parthsarkar17 Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I had a whole back and forth with myself about this +1, and I eventually got rid of it for the following reason. Please lmk if my reasoning is off:

When you use reg_to_query, you pass in fsm_rep.last_state, which is itself computed by sch.last_state(). This function looks through all the transitions, which are tuples of u64s, and returns the maximum u64 value along the second element of the tuple. As you pointed out in the comment, this includes the state above the index of the true last state, meaning sch.last_state() returns the number of states, and not the index of the last state. I think the way it currently is, with the + 1, only the final state will query from the second register because that's the way the integer division works out if the denominator num_states is too high. Lmk if that checks out.

Copy link
Contributor Author

@calebmkim calebmkim Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I did +1 was only pragmatic:
If you have a single register, then you always want reg_to_query to return 0. The only way to do this for the last state (i.e., when state==num_state is if you do num_states +1).

The only thing i don't quite understand is this:

I think the way it currently is, with the + 1, only the final state will query from the second register because that's the way the integer division works out if the denominator num_states is too high.

Won't it query from the second register any time state > (num_states + 1)/2 which will be basically half of the states?

Copy link
Contributor

@parthsarkar17 parthsarkar17 Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I forgot to mention that the biggest thing is that state will never equal num_state from what I can tell. The transition guards are computed by mapping over each (s, e, g) element in sch.transitions with respect to s. This s is always < num_states.

As for your second point, if there are n states, then the final state is n-1. This means that, if state = n-2, then (state * num_registers / (num_states + 1)) is equal to ((n-2)* 2)/(n + 1), which, for n >=2 results in integer division of 0.

That was wrong, sorry. It doesn't matter one way or another.

@parthsarkar17 parthsarkar17 merged commit 894e221 into main Jul 14, 2024
18 checks passed
@parthsarkar17 parthsarkar17 deleted the duplicate-fixes branch July 14, 2024 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants