
Support inference of large models such as gpt-3 in storage calculation #16

Open · graykode opened this issue Aug 23, 2020 · 1 comment
Labels: enhancement (New feature or request)


graykode commented Aug 23, 2020

In deep learning, large models (gpt-3, T5, Megatron-LM) are growing in popularity. However, this trend is also concentrating AI capability in the hands of those who can afford the hardware.

As a concrete example, take gpt-3, which is currently a very hot topic. gpt-2 had 1.5B parameters and was about 6GB on disk. Since gpt-3 has 175B parameters, its weights alone are estimated to occupy about 700GB.
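
As a rough sanity check on that figure, assuming the weights are stored as 32-bit floats (4 bytes per parameter):

# Back-of-the-envelope estimate of gpt-3's weight size, assuming fp32 storage
n_params = 175e9            # 175B parameters
bytes_per_param = 4         # 32-bit float
print(n_params * bytes_per_param / 1e9)   # 700.0 (GB)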

To train or run inference with existing frameworks, all of the weights must be loaded into memory. In the case of gpt-3, however, it is difficult to provide 700GB of memory on an ordinary PC.

matorage can solve this problem. The philosophy of matorage's model storage is not to store a model as a single file, but to store it layer by layer. matorage can therefore fetch only the sub-model weights that fit on the PC, load them into memory, and write the computed values back to file storage. This is a philosophy similar to pydata/numexpr.
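
A minimal sketch of this idea in plain PyTorch (this is not the actual matorage API; the per-layer weight files 'block_%d.pt' and the surrounding loop are hypothetical and assume the model was previously saved layer by layer):

import torch
from transformers.configuration_gpt2 import GPT2Config
from transformers.modeling_gpt2 import Block

# gpt-3-scale settings: 96 layers, 12288 hidden size, 96 heads, 2048-token context
config = GPT2Config(n_embd=12288, n_head=96, n_layer=96)

# Hypothetical starting activations, e.g. the embedded input tokens
hidden = torch.ones([1, 2048, 12288])

with torch.no_grad():
    for i in range(config.n_layer):
        # Fetch only this layer's weights from storage (~7GB instead of ~700GB)
        block = Block(n_ctx=2048, config=config)
        block.load_state_dict(torch.load('block_%d.pt' % i))   # hypothetical per-layer file
        block.eval()
        hidden = block(hidden)[0]   # Block returns a list; the first item is the hidden states
        del block                   # free this layer's weights before loading the next one

# Persist the computed activations back to file storage
torch.save(hidden, 'hidden_states.pt')

Only one layer's weights ever live in memory at a time, so the peak footprint is bounded by the largest single layer rather than by the whole model.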

The implementation of this feature is planned for 0.3.0. We will implement the forward pass only (not backward), and it will be released first in the PyTorch version.
Once again, I hope that the future of AI will not be centralized by wealth, but decentralized by collective intelligence.

If you want to know more, please refer to these issues:
openai/gpt-3#1
huggingface/transformers#4658

Note
This issue does not use the official gpt-3 weights. The test is run with a randomly initialized model configured as shown in the image below.
[image: model configuration used for the test]

graykode (Owner, Author) commented:

The following code checks the inference time of a single transformer layer:

import torch
from transformers.configuration_gpt2 import GPT2Config   # transformers 3.x module layout
from transformers.modeling_gpt2 import Block, GPT2Model

def count_parameters(model):
    # Number of trainable parameters in the (sub)model
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

if __name__ == '__main__':
    # gpt-3 175B-scale settings: 2048-token context, 12288 hidden size, 96 attention heads
    n_ctx = 2048
    n_embd = 12288
    config = GPT2Config(n_embd=n_embd, n_head=96)

    # Build a single transformer block (one of gpt-3's 96 layers), randomly initialized
    model = Block(n_ctx=n_ctx, config=config)
    print('count_parameters', count_parameters(model))
    # model = GPT2Model(config)
    model.eval()

    # Time one forward pass over a dummy input of shape [batch, n_ctx, n_embd]
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    y = model(torch.ones([1, n_ctx, n_embd]))
    end.record()
    torch.cuda.synchronize()
    print(start.elapsed_time(end))   # elapsed time in milliseconds

However, this takes about 44 seconds for a single layer, which extrapolates to roughly an hour (44s × 96 ≈ 70 minutes) for all 96 layers.
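
For reference, a rough extrapolation of those numbers (assuming fp32 weights and linear scaling with depth):

# Extrapolating the single-layer measurement to the full 96-layer model
per_layer_sec = 44
n_layers = 96
print(per_layer_sec * n_layers / 60)   # ~70 minutes for one full forward pass

# Per-layer weight footprint when streaming layer by layer (fp32)
total_weight_gb = 700
print(total_weight_gb / n_layers)      # ~7.3 GB per layer, which fits in ordinary RAM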
