
Support inference of large models such as gpt-3 in storage calculation #16

Open · graykode opened this issue Aug 23, 2020 · 1 comment
Labels: enhancement (New feature or request)


graykode commented Aug 23, 2020

In deep learning, large models (gpt-3, T5, Megatron-LM) are growing in popularity. However, this trend is also concentrating AI capability in the hands of those who can afford the hardware.

As a concrete example, take gpt-3, which is currently a very hot topic. gpt-2 had 1.5B parameters and was about 6GB on disk. Since gpt-3 has 175B parameters, its weights alone are estimated to occupy about 700GB.
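
As a rough sanity check on that figure, assuming the weights are stored as 32-bit floats (4 bytes per parameter):

# Back-of-the-envelope estimate of gpt-3's weight size, assuming fp32 storage
n_params = 175e9            # 175B parameters
bytes_per_param = 4         # 32-bit float
print(n_params * bytes_per_param / 1e9)   # 700.0 (GB)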

To train or run inference with existing frameworks, all of the weights must be loaded into memory. In the case of gpt-3, however, it is difficult to provide 700GB of memory on an ordinary PC.

matorage can solve this problem. The philosophy of matorage's model storage is not to store a model as a single file, but to store it layer by layer. matorage can therefore fetch only the sub-model weights that fit on the PC, load them into memory, and write the computed values back to file storage. This is a philosophy similar to pydata/numexpr.
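
A minimal sketch of this idea in plain PyTorch (this is not the actual matorage API; the per-layer weight files 'block_%d.pt' and the surrounding loop are hypothetical and assume the model was previously saved layer by layer):

import torch
from transformers.configuration_gpt2 import GPT2Config
from transformers.modeling_gpt2 import Block

# gpt-3-scale settings: 96 layers, 12288 hidden size, 96 heads, 2048-token context
config = GPT2Config(n_embd=12288, n_head=96, n_layer=96)

# Hypothetical starting activations, e.g. the embedded input tokens
hidden = torch.ones([1, 2048, 12288])

with torch.no_grad():
    for i in range(config.n_layer):
        # Fetch only this layer's weights from storage (~7GB instead of ~700GB)
        block = Block(n_ctx=2048, config=config)
        block.load_state_dict(torch.load('block_%d.pt' % i))   # hypothetical per-layer file
        block.eval()
        hidden = block(hidden)[0]   # Block returns a list; the first item is the hidden states
        del block                   # free this layer's weights before loading the next one

# Persist the computed activations back to file storage
torch.save(hidden, 'hidden_states.pt')

Only one layer's weights ever live in memory at a time, so the peak footprint is bounded by the largest single layer rather than by the whole model.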

The implementation of this feature is planned for 0.3.0. We will implement the forward pass only (not backward), and it will be released first in the PyTorch version.
Once again, I hope that the future of AI will not be centralized by wealth, but decentralized by collective intelligence.

If you want to know more, please refer to these issues:
openai/gpt-3#1
huggingface/transformers#4658

Note
This issue does not use the official gpt-3 weights. The test is run with a randomly initialized model configured as shown in the image below.
[image: model configuration used for the test]

graykode (Owner, Author) commented:

The following code checks the inference time of a single transformer layer:

import torch
from transformers.configuration_gpt2 import GPT2Config   # transformers 3.x module layout
from transformers.modeling_gpt2 import Block, GPT2Model

def count_parameters(model):
    # Number of trainable parameters in the (sub)model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

if __name__ == '__main__':
    # gpt-3 175B-scale settings: 2048-token context, 12288 hidden size, 96 attention heads
    n_ctx = 2048
    n_embd = 12288
    config = GPT2Config(n_embd=n_embd, n_head=96)

    # Build a single transformer block (one of gpt-3's 96 layers), randomly initialized
    model = Block(n_ctx=n_ctx, config=config)
    print('count_parameters', count_parameters(model))
    # model = GPT2Model(config)
    model.eval()

    # Time one forward pass over a dummy input of shape [batch, n_ctx, n_embd]
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    y = model(torch.ones([1, n_ctx, n_embd]))
    end.record()
    torch.cuda.synchronize()
    print(start.elapsed_time(end))   # elapsed time in milliseconds

However, this takes about 44 seconds for a single layer, which extrapolates to roughly an hour (44s × 96 ≈ 70 minutes) for all 96 layers.
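
For reference, a rough extrapolation of those numbers (assuming fp32 weights and linear scaling with depth):

# Extrapolating the single-layer measurement to the full 96-layer model
per_layer_sec = 44
n_layers = 96
print(per_layer_sec * n_layers / 60)   # ~70 minutes for one full forward pass

# Per-layer weight footprint when streaming layer by layer (fp32)
total_weight_gb = 700
print(total_weight_gb / n_layers)      # ~7.3 GB per layer, which fits in ordinary RAM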
