Multithreaded Operation
Versions 0.2.6 of this project and higher support multithreading. The information in the following sections may be critical to making sense of it.
DyNet is designed for single-threaded operation. This can be observed from messages such as
- "WARNING: Attempting to initialize dynet twice. Ignoring duplicate initialization." and
- "Memory allocator assumes only a single ComputationGraph at a time."
Internally, DyNet is implemented with several mutable global variables and singleton objects in either the C++ layer or Java and Scala layers built on top of it. Together, these make it difficult or impossible to train multiple models at the same time or to run multiple inputs (e.g., sentences) through a single model in parallel.
That is, unless DyNet is modified to make such things possible, and that is exactly what has been done here. "Such things" does not mean everything, but one specific thing: the forward pass needed for inference. Again, this is not the training phase but the testing phase of operation. It can be used when there is a large set of data that needs to be classified or otherwise processed by whatever the neural network has been trained to do.
Standard operation of DyNet should not be affected by any of the changes made. To access the extended functionality, one needs to include two arguments with the initialization: `forward-only` and `dynamic-mem`. The first argument is one already known to DyNet and is specified as an integer (0 or usually 1), while the second is a newcomer specified as a boolean. In the Scala implementation, which is the focus of this documentation, initialization code might look like this, taken from the Xor test case:
```scala
def initialize(train: Boolean = true): Unit = {
  val map = Map(
    Initializer.RANDOM_SEED -> 411865951L,
    Initializer.DYNET_MEM -> "2048",
    Initializer.FORWARD_ONLY -> { if (train) 0 else 1 },
    Initializer.DYNAMIC_MEM -> !train
  )
  Initializer.initialize(map)
}
```
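For inference, such an initialize method would simply be called once from the main thread, before any worker threads are started, for example:

```scala
// Hypothetical usage: initialize once for forward-only, dynamic-memory operation
// before spawning any worker threads.
initialize(train = false)
```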
Please keep in mind these limitations:
- DyNet can still only be initialized once. If it is to be run in forward-only, dynamic-memory mode with multithreaded operation like this, it should not also be used for training. Except for the random seed, re-initializations of DyNet are ignored and the mode will not change.
- The multithreaded forward pass is not intended for operation on the GPU and so far this project has not been built with both multithreading and GPU operation enabled. Dynamic memory allocation is performed in a CPU-centric way.
There are (at least) two major components of DyNet which need special consideration in multithreaded environments: the computation graph and builders. They are handled in slightly different ways. There is another minor concern with random number generation. Please file issues if you notice problems with any other components.
The C++ layer in DyNet has enforced a one-computation-graph-at-a-time policy even though much of DyNet is able to deal with multiple instances. That enforcement has been removed and the (dynamic) memory allocator in particular modified to match. At the Java layer, the policy was enforced with a singleton, which is now being sidestepped. The Scala layer makes such extensive use of the expected Java singleton that it is infeasible to allow explicit `ComputationGraph` objects to be constructed and passed around. Instead, the multithreaded implementation makes a compromise, one which strongly influences how code can be structured.
Rather than there being a single `ComputationGraph` per program, there is now an (implicit) single `ComputationGraph` per thread. This means that if you write `ComputationGraph.renew()` or `builder.newGraph()`, you will not be renewing or building the `ComputationGraph`, but rather the `ComputationGraph` of the current thread. It also means that there needs to be enough memory to account for one computation graph per thread even though it looks like there is only one total. Neither the Scala nor Java computation graphs should be created directly from their constructors in client code.
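As a minimal sketch of what this enables, assuming the `edu.cmu.dynet` Scala bindings and a hypothetical `classify` method that builds whatever expressions the model requires, forward passes might be dispatched to a thread pool like this:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

import edu.cmu.dynet.ComputationGraph

// Hypothetical per-input forward pass.  Because graphs are now per-thread,
// renewing here only affects the graph of the thread running this call.
def classify(input: Seq[Float]): Int = {
  ComputationGraph.renew()
  // ... build expressions from the input, call ComputationGraph.forward, pick a label ...
  0 // placeholder label
}

implicit val executionContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val inputs = Seq(Seq(0f, 0f), Seq(0f, 1f), Seq(1f, 0f), Seq(1f, 1f))
// Each Future may run on a different thread and therefore uses a different graph.
val labels = inputs.map(input => Future(classify(input))).map(Await.result(_, Duration.Inf))
```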
If one can follow these rules, there should be little code related to computation graphs that needs to change in order to achieve multithreaded operation. There are examples in Xor, which uses a `ComputationGraph` directly, and in Lstm, which manipulates one through a `VanillaLstmBuilder`. Both are able to operate in single- and multithreaded modes.
The builders such as `LstmBuilder` (`VanillaLstmBuilder`, `CoupledLstmBuilder`, `CompactVanillaLSTMBuilder`, `FastLstmBuilder`), `GruBuilder`, `RnnBuilder`, etc. have fewer constraints and are free to be (mis)used in ways that should be avoided. The following information should help you avoid that.
Builders have mutable state that only one thread at a time should manipulate. Many can be created but they should not be shared across threads. They have access to a parameter collection, sometimes called the model, which can be very large. It is therefore advantageous for the separate builders to share the same model. Luckily, the builders do not seem to modify the parameter collections but only the computation graphs described above. Taken together, these observations lead to the following strategy:
- Create one model (a `ParameterCollection`, lookup parameters, parameters, and builders) in the main thread and then populate the model by reading it from disk. (This is for the forward pass, so the values have to have been calculated earlier.) Call this collection the reference parameters.
- When it is time for a forward pass which is to take place in a thread of its own, clone the elements of these reference parameters which need to be specific to a thread. This should include any builders.
- One way to do this is to place the reference parameters into a case class so that they can be automatically copied.
- Next, add a `get()` method to the class which, while copying the parameters, replaces the critical parts like the builder with clones. This method supports the `Supplier[T]` trait that provides parameters to new threads (see the sketch after this list).
- The new builder will run in its thread and automatically access that thread's computation graph when it calls `newGraph()`.
- Let Java know about the thread-local reference parameters and their initial values.
- It is then necessary to call `get` on the thread-local parameters in places where the parameters would otherwise be used directly. The `copy()` in `get` could be replaced by `this` if the environment is known to be single threaded or nothing needs to be cloned.
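Putting those pieces together, a minimal sketch of such a case class and its thread-local wiring might look like the following. The class and field names are illustrative rather than taken from the project, and the `builder.clone()` call is an assumption standing in for whatever copy mechanism the builders actually provide:

```scala
import java.util.function.Supplier

import edu.cmu.dynet.{LookupParameter, Parameter, ParameterCollection, VanillaLstmBuilder}

// Hypothetical reference parameters.  The parameter collection and its contents
// are shared read-only during the forward pass; the builder is mutable and is
// therefore the part that must be replaced with a clone for each thread.
case class ModelParameters(
  collection: ParameterCollection,
  embeddings: LookupParameter,
  output: Parameter,
  builder: VanillaLstmBuilder
) extends Supplier[ModelParameters] {

  // Copy the case class, replacing the builder with a per-thread clone.  The
  // clone() call is an assumption; in a single-threaded environment, or when
  // nothing needs to be cloned, this could simply return `this`.
  override def get(): ModelParameters = copy(builder = builder.clone())
}

// The reference parameters are created and populated once in the main thread.
// Because the case class is itself a Supplier, ThreadLocal.withInitial can hand
// each new thread its own clone the first time that thread asks for it.
def mkThreadLocalParameters(reference: ModelParameters): ThreadLocal[ModelParameters] =
  ThreadLocal.withInitial(reference)
```

Each worker thread then calls `get()` on the `ThreadLocal` wherever it would otherwise have used the reference parameters directly; the clone it receives calls `newGraph()` against that thread's own computation graph.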
This strategy has the advantage of working whether or not there are multiple threads so that code does not need to change. Most of it is unaware of the multithreaded environment at runtime. However, it is also just an example and a different strategy might be called for under different circumstances.
DyNet uses a single random number generator. Access to it has been protected by mutexes so that operation is safe. However, in multithreaded environments the threads may not access the generator in the same order on consecutive runs, so the sequence of random numbers that a particular thread sees can differ from run to run. This has been observed especially with uninitialized parameter collections used by builders. Since models should already have been trained before use in the forward pass and populated with the trained values, there should be nothing random about them and results should be consistent. Please report any evidence to the contrary.