ipex backend enhancements #272

Merged 1 commit on Sep 24, 2024

Conversation

yao-matrix (Contributor)

  1. Add a feature-extraction task mapping for the ipex backend, to support benchmarking embedding models (see the sketch after this list).
  2. Change the examples' no_weights to false. no_weights allocates weight buffers and randomly initializes them, which ruins performance in NUMA cases: a 2x perf drop in the decoding phase.
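
For point 1, a minimal sketch of what such a task mapping could look like; the dict name and the loader classes below are illustrative assumptions, not necessarily the identifiers optimum-benchmark uses:

# Hypothetical task -> model-loader mapping for the ipex backend.
TASKS_TO_IPEX_MODEL_LOADERS = {
    "text-generation": "IPEXModelForCausalLM",
    "text-classification": "IPEXModelForSequenceClassification",
    # Added: route feature-extraction to an IPEX model class so that
    # embedding models can be benchmarked through the ipex backend.
    "feature-extraction": "IPEXModel",
}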

@yao-matrix (Contributor, Author)

@IlyasMoutawwakil, please help review, thanks.

@IlyasMoutawwakil (Member)

Hi! Are you sure about this: "no_weights allocates weight buffers and randomly initializes them"? no_weights only interferes with the random generators used inside the model; instead of using these methods

import torch

TORCH_INIT_FUNCTIONS = {
    "normal_": torch.nn.init.normal_,
    "uniform_": torch.nn.init.uniform_,
    "trunc_normal_": torch.nn.init.trunc_normal_,
    "xavier_normal_": torch.nn.init.xavier_normal_,
    "xavier_uniform_": torch.nn.init.xavier_uniform_,
    "kaiming_normal_": torch.nn.init.kaiming_normal_,
    "kaiming_uniform_": torch.nn.init.kaiming_uniform_,
    "normal": torch.nn.init.normal,
    "uniform": torch.nn.init.uniform,
    "xavier_normal": torch.nn.init.xavier_normal,
    "xavier_uniform": torch.nn.init.xavier_uniform,
    "kaiming_normal": torch.nn.init.kaiming_normal,
    "kaiming_uniform": torch.nn.init.kaiming_uniform,
}

it'll use the fastest of them:

from typing import Any

def fast_random_tensor(tensor: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    return torch.nn.init.uniform_(tensor)

How does that ruin performance?
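
A minimal sketch of the mechanism described above, assuming no_weights materializes the model under a context manager that swaps every torch.nn.init entry for fast_random_tensor; the names and shape here are illustrative, not the exact optimum-benchmark code:

from contextlib import contextmanager
from typing import Any

import torch

# Keep a direct reference so the fast initializer still works after patching.
_uniform_ = torch.nn.init.uniform_

def fast_random_tensor(tensor: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    # Ignore the requested distribution and do a cheap in-place uniform fill.
    return _uniform_(tensor)

@contextmanager
def fast_weights_init(init_names=("normal_", "uniform_", "xavier_uniform_", "kaiming_uniform_")):
    # Save the original torch.nn.init functions, patch them all to the fast
    # initializer, and restore them on exit.
    originals = {name: getattr(torch.nn.init, name) for name in init_names}
    for name in init_names:
        setattr(torch.nn.init, name, fast_random_tensor)
    try:
        yield
    finally:
        for name, fn in originals.items():
            setattr(torch.nn.init, name, fn)

Under such a patch, instantiating a model inside the with block resolves every initializer to a single cheap in-place uniform fill, regardless of the distribution the model requested.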

Commit: …mb3dding models benchmark

2. Change examples' no_weights to false. no_weights allocates weight buffers and randomly initializes them, which ruins performance in NUMA cases: a 2x perf drop in the decoding phase.

Signed-off-by: Yao, Matrix <[email protected]>
@yao-matrix (Contributor, Author) commented Sep 24, 2024

> Hi! Are you sure about this: "no_weights allocates weight buffers and randomly initializes them"? no_weights only interferes with the random generators used inside the model [...] How does that ruin performance?

It seems the random init functions are mostly single-threaded (e.g. here), and the NUMA memory allocation policy roughly follows "allocate on first write" (first touch). So in the random-initialization case, the weight memory gets allocated near the core executing the random-initialization logic, which may, for example, all happen in NUMA domain 0. Then when the model forward computation runs, which may spread across NUMA domains 0 and 1, the compute in domain 1 must fetch the data remotely from domain 0 (since the data was already placed during random initialization), which brings far-memory access cost.
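
To make the first-touch effect concrete, a toy sketch (an illustration only; it assumes a multi-socket NUMA machine, and on a single NUMA node the placement difference disappears):

import torch

# The in-place fill is (per the argument above) single-threaded, so one core
# "first-touches" every page of w and the OS places all of w's pages on that
# core's NUMA node.
w = torch.empty(4096, 4096)
torch.nn.init.uniform_(w)

# The matmul runs multi-threaded; threads running on the other NUMA node must
# pull w's pages across the interconnect, which is the far-memory access cost
# described above.
x = torch.randn(4096, 4096)
y = x @ w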

Data evidence, using a GCP c4-standard-96 instance to run meta-llama/Meta-Llama-3-8B with bs=1, in_seq=256, out_seq=64:

no_weights=false: decoding throughput 16.37
no_weights=true: decoding throughput 8.69

@IlyasMoutawwakil (Member)

Very interesting behavior! Thanks for investigating it.

@IlyasMoutawwakil merged commit b502dde into huggingface:main on Sep 24, 2024