ipex backend enhancements #272

Merged 1 commit on Sep 24, 2024

Conversation

yao-matrix (Contributor)

  1. Add a feature-extraction task mapping for the ipex backend, to support benchmarking embedding models (see the sketch after this list).
  2. Change the examples' no_weights to false. no_weights allocates weight buffers and randomly initializes them, which ruins performance in NUMA cases: a 2x perf drop in the decoding phase.
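
For point 1, a minimal sketch of what such a task mapping could look like; the dict name and the loader classes below are illustrative assumptions, not necessarily the identifiers optimum-benchmark uses:

# Hypothetical task -> model-loader mapping for the ipex backend.
TASKS_TO_IPEX_MODEL_LOADERS = {
    "text-generation": "IPEXModelForCausalLM",
    "text-classification": "IPEXModelForSequenceClassification",
    # Added: route feature-extraction to an IPEX model class so that
    # embedding models can be benchmarked through the ipex backend.
    "feature-extraction": "IPEXModel",
}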

@yao-matrix (Contributor, Author)

@IlyasMoutawwakil, please help review, thanks.

@IlyasMoutawwakil (Member)

Hi! Are you sure about this: "no_weights allocates weight buffers and randomly initializes them"? no_weights only interferes with the random generators used inside the model; instead of using these methods

import torch

TORCH_INIT_FUNCTIONS = {
    "normal_": torch.nn.init.normal_,
    "uniform_": torch.nn.init.uniform_,
    "trunc_normal_": torch.nn.init.trunc_normal_,
    "xavier_normal_": torch.nn.init.xavier_normal_,
    "xavier_uniform_": torch.nn.init.xavier_uniform_,
    "kaiming_normal_": torch.nn.init.kaiming_normal_,
    "kaiming_uniform_": torch.nn.init.kaiming_uniform_,
    "normal": torch.nn.init.normal,
    "uniform": torch.nn.init.uniform,
    "xavier_normal": torch.nn.init.xavier_normal,
    "xavier_uniform": torch.nn.init.xavier_uniform,
    "kaiming_normal": torch.nn.init.kaiming_normal,
    "kaiming_uniform": torch.nn.init.kaiming_uniform,
}

it'll use the fastest of them:

from typing import Any

def fast_random_tensor(tensor: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    return torch.nn.init.uniform_(tensor)

How does that ruin performance?
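
A minimal sketch of the mechanism described above, assuming no_weights materializes the model under a context manager that swaps every torch.nn.init entry for fast_random_tensor; the names and shape here are illustrative, not the exact optimum-benchmark code:

from contextlib import contextmanager
from typing import Any

import torch

# Keep a direct reference so the fast initializer still works after patching.
_uniform_ = torch.nn.init.uniform_

def fast_random_tensor(tensor: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    # Ignore the requested distribution and do a cheap in-place uniform fill.
    return _uniform_(tensor)

@contextmanager
def fast_weights_init(init_names=("normal_", "uniform_", "xavier_uniform_", "kaiming_uniform_")):
    # Save the original torch.nn.init functions, patch them all to the fast
    # initializer, and restore them on exit.
    originals = {name: getattr(torch.nn.init, name) for name in init_names}
    for name in init_names:
        setattr(torch.nn.init, name, fast_random_tensor)
    try:
        yield
    finally:
        for name, fn in originals.items():
            setattr(torch.nn.init, name, fn)

Under such a patch, instantiating a model inside the with block resolves every initializer to a single cheap in-place uniform fill, regardless of the distribution the model requested.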

Commit: …mb3dding models benchmark

2. Change examples' no_weights to false. no_weights allocates weight buffers and randomly initializes them, which ruins performance in NUMA cases: a 2x perf drop in the decoding phase.

Signed-off-by: Yao, Matrix <[email protected]>
@yao-matrix (Contributor, Author) commented Sep 24, 2024

> Hi! Are you sure about this: "no_weights allocates weight buffers and randomly initializes them"? no_weights only interferes with the random generators used inside the model [...] How does that ruin performance?

It seems the random init functions are mostly single-threaded (e.g. here), and the NUMA memory allocation policy roughly follows "allocate on first write" (first touch). So in the random-initialization case, the weight memory gets allocated near the core executing the random-initialization logic, which may, for example, all happen in NUMA domain 0. Then when the model forward computation runs, which may spread across NUMA domains 0 and 1, the compute in domain 1 must fetch the data remotely from domain 0 (since the data was already placed during random initialization), which brings far-memory access cost.
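
To make the first-touch effect concrete, a toy sketch (an illustration only; it assumes a multi-socket NUMA machine, and on a single NUMA node the placement difference disappears):

import torch

# The in-place fill is (per the argument above) single-threaded, so one core
# "first-touches" every page of w and the OS places all of w's pages on that
# core's NUMA node.
w = torch.empty(4096, 4096)
torch.nn.init.uniform_(w)

# The matmul runs multi-threaded; threads running on the other NUMA node must
# pull w's pages across the interconnect, which is the far-memory access cost
# described above.
x = torch.randn(4096, 4096)
y = x @ w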

Data evidence, using a GCP c4-standard-96 instance to run meta-llama/Meta-Llama-3-8B with bs=1, in_seq=256, out_seq=64:

no_weights=false: decoding throughput 16.37
no_weights=true: decoding throughput 8.69

@IlyasMoutawwakil (Member)

Very interesting behavior! Thanks for investigating it.

@IlyasMoutawwakil merged commit b502dde into huggingface:main on Sep 24, 2024