Memory Management
Three approaches are available for memory management. Technique 1 is the default and simplest way to program. It is the recommended starting point, but as your code grows in size and complexity, and as your training data sets move in the direction of "huge," it will become important to go beyond that. Even if you decide to go with technique 3 right away, it is useful to read through the entire document to understand the motivation for it.
- If you are having trouble with CPU memory, you may have to resort to technique 2 or 3.
- If you are having trouble with GPU memory, you may have to resort to technique 2 or 3.
In both cases, you may want to experiment with using a smaller batch size -- temporary tensor values produced by computation on the training data are the main memory problem, and they are in most cases proportional to the batch size. A smaller batch size means more batches per epoch, so each epoch takes longer, but it can often speed up training convergence.
Note: DiffSharp (which uses TorchSharp) relies on technique 1.
Most of the examples included will use technique #1, doing frequent explicit calls to GC.Collect() in the training code -- if not after each batch in the training loop, at least after each epoch.
In technique 1, all tensors (CPU and GPU) are implicitly disposed via .NET finalizers. Just allocate new tensors to your heart's content, and let the GC take care of them. This only works for small models that do not require a lot of memory. If you do use this approach, you may want to place a call to GC.Collect() after each mini-batch of data. It is generally not sufficient to do it at the end of each epoch.
👍 Simple
👎 The .NET GC doesn't know about the memory pressure from CPU tensors, so failure may happen if large tensors can't be allocated
👎 The .NET GC doesn't know about GPU resources.
👎 Native operations that allocate temporaries, whether on CPU or GPU, may fail -- the GC scheme implemented by TorchSharp only works when the allocation is initiated by .NET code.
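As a rough illustration of technique 1, the sketch below trains a toy model without any explicit disposal and calls GC.Collect() after each mini-batch. The model, the random data, and the hyperparameters are placeholders invented for this example; only the placement of the GC.Collect() call reflects the advice above.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Hypothetical toy setup: a single linear layer fit to random data.
var model = nn.Linear(10, 1);
var optimizer = optim.SGD(model.parameters(), 0.01);
var lossFn = nn.MSELoss();

int batches = 0;
for (int i = 0; i < 100; i++)
{
    // No using statements: every tensor created below is left to the finalizer.
    var x = randn(32, 10);
    var y = randn(32, 1);

    optimizer.zero_grad();
    var loss = lossFn.forward(model.forward(x), y);
    loss.backward();
    optimizer.step();
    batches++;

    // Prompt the GC to find the unreachable tensors from this mini-batch.
    GC.Collect();
}
```

Because the GC cannot see native memory pressure, this pattern is only likely to hold up for small models like the one assumed here.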
Technique 2, explicit disposal, is more cumbersome, but results in better performance and has a higher memory ceiling. For many non-trivial models, it is more or less required in order to train on a GPU.
👍 Specific lifetime management of all resources.
👎 Cumbersome, requiring lots of using statements in your code.
👎 You must know when to dispose.
👎 Temporaries are not covered by this approach, so to maximize the benefit, you may have to store all temporaries in variables and dispose of them.
Note: Even with this approach, it is a good idea to place a call to GC.Collect() after each mini-batch of data. There may be temporaries that were overlooked, or inconvenient to pull out, or ones where the lifetime was unclear; calling GC.Collect() will catch them.
It is important to understand that all TorchSharp "tensors" (type Tensor) are actually "tensor aliases", each referring to a C++ tensor. When a C++ tensor is created and returned to .NET as a tensor alias, the reference count on the C++ tensor is incremented. When you call Dispose() on the TorchSharp tensor alias (that is, type Tensor), it is decremented. If the tensor alias is finalized instead, the decrement happens implicitly.
To enable this technique, all operations that return one or more TorchSharp Tensors should return "fresh" Tensor aliases (though that doesn't always mean freshly copied C++ tensors).
This used to be true even for in-place, destructive operations like add_(), which overwrites the underlying native tensor with data. However, starting with v0.100.5, in-place operations are expected to return the same .NET instance as 'this'. The problem with the old regime was that it had the potential of incrementing the reference count and keeping tensors alive longer than necessary. After the change, the old behavior can be enforced by adding a call to alias():
var t = a.add_(b).alias();
Thus, when you write methods and functions that take and produce type Tensor, for example in the forward() method of a model, you should always make sure to return a fresh alias. Most of the time, this happens automatically, because the last action of your code will normally be to call another tensor function, which itself will be returning a fresh alias, but there are cases when it's not, especially when returning input tensors or tensors stored in some lookaside table.
For example, consider a function that returns its input if it's one-dimensional, and otherwise returns a reshaped version:
Tensor flatten(Tensor input) {
    if (input.shape.Length == 1)
        return input.alias();
    else
        return input.reshape(input.numel());
}
The alias() function avoids doing a clone of the tensor, but still returns a fresh tensor instance (the .NET instance, not the underlying storage). If you simply return input, the caller won't know whether both input and output should be disposed, so the protocol is to always return a fresh tensor.
In order to manage native storage, in particular GPU storage, it is necessary to do some explicit memory management for all temporaries, especially ones that are involved in a model's computation chain.
Here are the simple guidance rules:
- Create a variable for each computed Tensor.
- Use the using (C#) or use (F#) syntax to declare the variable.
- Don't call Dispose on the input of a function. Let the caller handle its lifetime.
For example, consider this expression from the 'TextClassification' example:
total_acc += (predicted_labels.argmax(1) == labels).sum().to(torch.CPU).item<long>();
There are lots of hidden temporaries in this relatively innocuous expression. In this particular case, it's involved in figuring out whether a prediction was accurate or not, so it's not going to be super-impactful on memory (the tensors are small), but it's still illustrative. A version where all temporaries are pulled out looks like this:
using var am = predicted_labels.argmax(1);
using var eq = am == labels;
using var sum = eq.sum();
using var moved = sum.to(torch.CPU);
total_acc += moved.item<long>();
The most essential places to do explicit memory management are functions involved in data preparation or the model computation, since the tensors there are big and repeatedly used.
Some additional examples, in F# this time:
let myTensorFunction0(input: Tensor) =
    input.alias()

let myTensorFunction1() =
    if today then
        table[4].alias()
    else
        table[5].alias()

let myTensorFunction2(input: Tensor) =
    input.add(tensor(1))

let myTensorFunction3(input: Tensor) =
    use tmp = input.add(tensor(1))
    tmp.add(tensor(1))

let myTensorFunction4(input: Tensor) =
    use tmp1 = input.add(tensor(1))
    use tmp2 = input.add(tensor(1))
    tmp2.add(tensor(1))

let myTensorFunction5(go: bool, input: Tensor) =
    if go then
        use tmp1 = input.add(tensor(1))
        use tmp2 = input.add(tensor(1))
        tmp2.add(tensor(1))
    else
        input.alias()

let myTensorFunction6(go: bool, input: Tensor) =
    if go then
        // NOTE: since v0.100.5, add_() returns the same instance as input, so take a
        // fresh alias before binding with 'use'; otherwise input itself would be disposed.
        use tmp1 = input.add_(tensor(1)).alias()
        use tmp2 = input.add_(tensor(1)).alias()
        tmp2.add(tensor(1))
    else
        input.alias()
Rather than passing tensor arguments between neural network layers inside a custom module's forward(), you should rely on the 'Sequential' layer collection, which will be efficient at memory management.
It may not be ideal when first experimenting with a model and trying to debug it, but once you are done with that and move on to a full training data set, it is advisable.
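A minimal sketch of what relying on Sequential can look like is shown below. The layer names and sizes are placeholders chosen for illustration, not taken from any TorchSharp example.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Hypothetical three-layer network; names and sizes are placeholders.
var seq = nn.Sequential(
    ("lin1", nn.Linear(100, 50)),
    ("relu1", nn.ReLU()),
    ("lin2", nn.Linear(50, 10)));

using var input = randn(32, 100);

// Sequential hands the intermediate activations from layer to layer itself,
// so the calling code only has to manage the input and the final output.
using var output = seq.forward(input);

Console.WriteLine(string.Join("x", output.shape));
```

Compared with a hand-written forward() that names each intermediate, the only tensors the caller has to dispose here are input and output.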
Technique 3, dynamic dispose scopes, was added in TorchSharp 0.95.4. It makes it easier to dispose of tensors, without the non-determinism of technique 1 or the many temporary variables of technique 2. It has most of the advantages of technique 2, and the code elegance of technique 1.
👍 Specific lifetime management of all resources, but in groups.
👍 Temporaries are covered by this approach.
👎 You don't have fine-grained control over when each tensor is reclaimed. All tensors created while a scope is in effect are disposed at once. (This is only partially true -- you can detach a tensor from the dispose scope and explicitly call Dispose(), but that's an odd thing to do in most situations.)
👎 It's a new code pattern, not widely used with other libraries.
Let's look at an example, similar to the earlier, technique 2 example:
using (var d = torch.NewDisposeScope()) {
    total_acc += (predicted_labels.argmax(1) == labels).sum().cpu().item<long>();
    ...
}
What happens here is that all tensors that are created while d is alive will be disposed when d is disposed. This includes temporaries, so you don't have to do anything special to get them to be disposed. (There's a problem with this simplistic example, which will be discussed later.)
In F#, it would look like:
use d = torch.NewDisposeScope()
// F# has no built-in == operator, so the element-wise comparison uses Tensor.eq.
total_acc <- total_acc + predicted_labels.argmax(1).eq(labels).sum().cpu().item<long>()
If you need to dispose some tensors before the scope is disposed, you can use DisposeEverything(), or DisposeEverythingBut(...) if you want to exclude a few tensors from disposal. These can be useful when tensor lifetimes aren't cleanly nested in dynamic scopes.
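As a sketch of how DisposeEverythingBut(...) might be used, the loop below keeps a running total alive across iterations while releasing everything else the scope has collected. The data is made up for illustration, and the example assumes that DisposeEverythingBut accepts the tensors to keep directly as arguments.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

using var d = torch.NewDisposeScope();

var total = zeros(1);
for (int i = 0; i < 10; i++)
{
    var x = ones(1000);          // placeholder data for this iteration
    total = total + x.sum();     // the previous total is now garbage, but still registered with d

    // Release x, the sum temporary, and the previous total; keep only the new total.
    d.DisposeEverythingBut(total);
}

Console.WriteLine(total.item<float>());
```

The kept tensor is still disposed when d itself is disposed at the end of the enclosing block.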
NOTE: It is absolutely essential for the proper functioning of dynamic dispose scopes that the scope is created with a 'using' statement (C#) or 'use' expression (F#).
It's important to note that these scopes are dynamic -- if any functions are called, the tensors inside them are also registered and disposed, unless there's a nested scope within those functions.
It is advisable to place a dispose scope around your training and test code, and in any library code that can be called from contexts that do not have dispose scopes.
That said, you should use dispose scopes carefully: having too few scopes raises the pressure on native memory, which is particularly bad for GPUs. Having too many scopes, each managing too few temporaries, will add runtime overhead to computations. For example, it may be better to put a scope outside an inner loop that contains multiple computations than to place it inside the loop. There is no single best answer.
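One common compromise is a single scope per mini-batch in the training loop, sketched below. The model, data, and hyperparameters are hypothetical placeholders; the point is only where the scope sits relative to the loop.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Hypothetical toy setup, created outside any per-batch scope so it survives the loop.
var model = nn.Linear(10, 1);
var optimizer = optim.SGD(model.parameters(), 0.01);
var lossFn = nn.MSELoss();

double totalLoss = 0.0;
for (int i = 0; i < 100; i++)
{
    // One scope per mini-batch: every tensor created in the body, including
    // unnamed temporaries, is disposed when d is disposed.
    using (var d = torch.NewDisposeScope())
    {
        var x = randn(32, 10);
        var y = randn(32, 1);

        optimizer.zero_grad();
        var loss = lossFn.forward(model.forward(x), y);
        loss.backward();
        optimizer.step();

        // Copy the scalar out as a .NET value before the scope disposes the loss tensor.
        totalLoss += loss.item<float>();
    }
}
```

Whether a per-batch scope is the right granularity depends on how much work each iteration does; a body with only one or two small tensor operations may not justify its own scope.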
Any tensor that needs to survive the dynamic dispose scope must be either removed from management completely, or promoted to a nesting (outer) scope.
For example, if a tensor variable tensor is overwritten in a scope, there are two problems:
- The tensor held in tensor will be overwritten. Since it was not created within the scope, the scope will not dispose of it. A nesting scope must be added to manage the lifetime of all tensors kept in tensor:
using (var d0 = torch.NewDisposeScope()) {
    var tensor = torch.zeros(...);
    for ( ... ) {
        ...
        using (var d1 = torch.NewDisposeScope()) {
            var x = ...;
            tensor += x.log();
            ...
        }
    }
}
- The new tensor that is placed in tensor will be disposed when the scope is exited. Since the static scope of tensor is not the same as the dynamic scope where it is created, there's a problem.
This is probably not what you intended, so the tensor needs to be either detached, or moved to an outer scope (if one exists).
For example:
using (var d0 = torch.NewDisposeScope()) {
    var tensor = torch.zeros(...);
    for ( ... ) {
        ...
        using (var d1 = torch.NewDisposeScope()) {
            var x = ...;
            tensor = (tensor + x.log()).MoveToOuterDisposeScope();
            ...
        }
    }
}
Sometimes, less is more -- a simple solution is to have fewer nested scopes:
using (var d0 = torch.NewDisposeScope()) {
    var tensor = torch.zeros(...);
    for ( ... ) {
        ...
        var x = ...;
        tensor += x.log();
        ...
    }
}
but sometimes, you still have to move the tensor out, for example when you return a tensor from a method:
public Tensor foo() {
    using (var d0 = torch.NewDisposeScope()) {
        var tensor = torch.zeros(...);
        foreach ( ... ) {
            ...
            var x = ...;
            tensor += x.log();
            ...
        }
        return tensor.MoveToOuterDisposeScope();
    }
}
These examples show how to move a tensor up one level in the stack of scopes. To completely remove a tensor from scoped management, use DetachFromDisposeScope() instead of MoveToOuterDisposeScope().
Even with this technique, it is a good practice to use Sequential when possible.
A convenience method was added in 0.97.3 -- it is useful for wrapping a complex expression with multiple temporaries without having to set up a scope explicitly.
It is defined as:
public static Tensor WrappedTensorDisposeScope(Func<Tensor> expr)
{
    using var scope = torch.NewDisposeScope();
    var result = expr();
    return result.MoveToOuterDisposeScope();
}
This is particularly useful in one-line functions and properties, such as in this example from the Pareto distribution class:
public override Tensor entropy() =>
    torch.WrappedTensorDisposeScope(() => ((scale / alpha).log() + (1 + alpha.reciprocal())));