For your first foray into chisel you will design a matrix matrix multiplication unit. Matrix multiplication is fairly straight forward, however on hardware it’s a little trickier than the standard for loops normally employed..
Important You will be working with skeleton code. Every component you implement has a corresponding skeleton source file found in src/main/scala/, so for Vector you’re looking for src/main/scala/Vector.scala
The first component you should implement is a register bank for storing a vector.
In the file Vector.scala you will find the skeleton code for this component. Unlike the standard Chisel.Vec our custom vector has a read enable which means that the memory pointed to by idx will only be overWritten when writeEnable is true. There are many hints commented out in the source code if you should get stuck.
Implement the vector and test that it works by running
testOnly Ex0.VectorSpec
in your sbt console.
The matrix works just like the vector only in two dimensions.
In the skeleton code a row of the vectors you have created in task 1 has been instantiated:
val rows = Vec.fill(rowsDim)(Module(new Vector(colsDim)).io)
rows
represents a matrix where each element of rows
corresponds to a matrix row, thus
in order to select an element in the matrix you must first select the correct row, then
the correct column.
As an example, consider the following matrix:
| 2, 5, 5, 0 |
A = | 7, -1, X, 3 |
| 0, 4, -5, 0 |
In order to select X we would first select the correct row (1) and drive the index select of that row with the desired column (2).
rows(1.U).idx := 2.U
io.dataOut := rows(1.U).dataOut
(Keep in mind that this is just to examplify, it is very far from correct and your code should not look like this)
If we wanted to set the value of X to 3 we would have to do something like this:
rows(1.U).idx := 2.U
rows(1.U).writeEnable := true.B
rows(1.U).dataIn := 3.B
Keep in mind that these code snippets show manually indexing, but for your circuit you want to generalize, which is why we use a for loop. For each row, ensure that
- Write enable is toggled only if the row is currently selected
- row dataIn is driven by matrix dataIn
If you’re unsure how to do this with a for loop you can try doing it manually first for some fixed size, then generalize later with a loop.
Since a vector will only overwrite data when writeEnable is toggled you can drive dataIn for all the vectors with the matrix dataIn.
Run the tests with testOnly Ex0.MatrixSpec
Important: You cannot index rows
as if it was a 2D array!
// This does not compile, to see why consider the fact that the result of rows(1.U) is a
// Bundle{val idx: UInt, val dataIn: UInt, val writeEnable: Boolean, val dataOut: UInt}
// which means the only accessors you can use are idx, dataIn, writeEnable and dataOut.
rows(1.U)(1.U) := 2.U
Hint: If you’re unsure what your circuit is actually doing, change the Driver invocation
in MatrixSpec.scala
to output .vcd files as explained in the Waveform debugging guide.
This component differs from the two previous in that it has no explicit control input, which might at first be rather confusing.
With only two inputs for data, how do we know when the dotproduct has been calculated?
The answer to this is the elements
argument that the DotProd
class takes in, which tells
the dot product calculator the size of the input vectors.
Consequently, the resulting hardware can only (at least on its own) compute dotproducts for one size of vector, which is fine in our circuit.
To get a better understanding we can model this behavior in regular scala:
case class DotProdCalculator(vectorLen: Int, timeStep: Int, accumulator: Int){
def update(inputA: Int, inputB: Int): (Int, Boolean, DotProdCalculator) = {
val product = inputA * inputB
if(((timeStep + 1) % vectorLen) == 0)
(accumulator + product, true, this.copy(timeStep = 0, accumulator = 0))
else
(accumulator + product, false, this.copy(timeStep = this.timeStep + 1, accumulator = accumulator + product))
}
}
To see it in action run testOnly Examples.SoftwareDotProdSpec
in your sbt console.
As with the previous tasks, the dot product calculator must pass the tests with
testOnly Ex0.DotProdSpec
As shown in the timing diagram below, the dot product calculator should deliver the result as soon as possible. This means you will have to drive the output with the sum of the accumulator and the product of the two inputs. If you choose to drive the output only by the value of the accumulator your circuit will lag behind by one cycle, which while good for pipelining purposes is not good for passing the test purposes.
Doing an action for a set amount of timesteps is a very common task in hardware design, so this functionality is included in chisel via the Counter class.
Keep in mind that a class and an object are not the same in Chisel. This can be very confusing when new to scala, but it is simply convention: When a class and an object share name this is just a convenience for keeping static methods, such as constructors, separated from the non-static methods.
In the Counter object there is an apply method:
def apply(cond: Bool, n: Int): (UInt, Bool)
The type signature tells you that the input is a regular scala integer, and a chisel boolean
(scala booleans are of type Boolean
, rather than Bool
) and the output is a UInt and a chisel
boolean.
This means that upon instantiating a Counter with its apply method you only get the outputs from
the counter, not the object itself.
The result is a convenient method of making a counter, simply supply how many ticks it takes for the
counter to roll over, as well as an input signal for enabling the clock, and receive a tuple with the
signal for the counters value, as well as a boolean signal that toggles whenever the clock rolls over.
A special property of apply methods are that they can be called directly on the object.
Counter.apply(cCond, 10)
does the same as Counter(cCond, 10)
.
To call the class constructor, use the new
keyword.
Important: If your circuit gets correct output shifted by one cycle it is likely because you are using the output of the accumulator register to drive dataOut which delays everything by one cycle. Use the signal that drives the accumulator register to drive dataOut instead and this should resolve itself.
With our matrix modules and dot product calculators we have every piece needed to implement the matrix multiplier.
When performing matrix multiplication on a computer transposing the second matrix can help us reduce complexity by quite a lot. To examplify, consider
| 2, 5 |
A = | 7, -1 |
| 0, 4 |
B = | 1, 1, 2 |
| 0, 4, 0 |
It would be much simpler to just have two modules with the same dimensions, and we can do this by transposing B so we get
| 2, 5 |
A = | 7, -1 |
| 0, 4 |
| 1, 0 |
BT = | 1, 4 |
| 2, 0 |
Now we need to do is calculate the dot products for the final matrix:
if A*B = C then
| A[0] × BT[0], A[0] × BT[1], A[0] × BT[2] |
C = | A[1] × BT[0], ... , ... |
| ... , ... , A[2] × BT[2] |
where
A[0] × BT[0] is the dot product of [2, 5] and [1, 0]
and
A[0] × BT[1] is the dot product of [2, 5] and [1, 4]
and so forth..
Because of this, the input for matrix B will be supplied transposed, thus you do not have to worry about this. For B the input would be [1, 0, 1, 4, 2, 0].
The skeleton code for the matrix multiplier is less detailed, with only one test. You’re encouraged to write your own tests to make this easier.
It is very easy to get bogged down with details in this exercise, so it’s useful to take a few moments to plan ahead.
A natural way to break down the task is to split it into two phases: setup and execution. For setup you simply want to shuffle data from the input signals to your two matrix modules.
The next task is to actually perform the calculation. This is a little more complex, seeing as the read patterns are different from matrix A and B.
To make this simpler a good idea is to introduce a control module. This module should keep track of which state the multiplier is in, setup or execution, and provide the appropriate row and column select signals.
You may also choose to split the control module into an init controller and an execution controller if you see fit.
A suggested design is shown underneath:
The timing for your matrix multiplier is straight forward. For a 3x4 matrix it takes 12 cycles to input data (cycles 0 to 11), and execution should proceed on cycle 12. While you can technically start execution sooner than this the tests expect you to not start executing before all data is loaded. As long as you start executing just as data has been loaded your dot prod design will take care of the rest.
In order to make testing easier, consider testing your row and column select signals first. The actual values stored in the matrixes are just noise, the important part is that you select the correct rows and columns at the correct times for the correct matrixes, and if you do this the rest is comparatively easy.
This last exercise has no deliverable, but you should spend some time thinking about where you spent most of your efforts.
A common saying is “A few hours of work can save you from several minutes of planning”, and this holds especially true for writing chisel!!