omnitrace-python errors with OMNITRACE_USE_ROCM_SMI = true #330

Open
anupambhatnagar opened this issue Jan 24, 2024 · 12 comments

@anupambhatnagar

anupambhatnagar commented Jan 24, 2024

Hi, I'm profiling a Triton kernel on MI300 with ROCm 6.0.0.

  1. When I set OMNITRACE_USE_ROCM_SMI to true, the trace fails to collect events from ROCM_SMI. The backtrace is available here.
  2. There is no HIP Activity Device track in the generated trace.

The omnitrace config I use is here.

How can I enable the collection of events from rocm-smi and view the HIP Activity Device?

Thanks!

P.S. I installed omnitrace using omnitrace-1.11.0-rhel-9.3-ROCm-60000-PAPI-OMPT-Python3.sh from the releases page.

@jrmadsen
Collaborator

Based on the error message here, it looks like rocm-smi doesn't support getting the temperature on MI300, so omnitrace disables rocm-smi sampling, which is why you don't see any activity.

How can I enable the collection of events from rocm-smi and view the HIP Activity Device?

It looks like you won't be able to collect ROCm-SMI data until either omnitrace is patched to selectively collect only the queries that are supported, or rocm-smi adds full support for MI300.
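
If you want to confirm that outside of omnitrace, a minimal sketch along these lines (assuming the rocm-smi CLI is on your PATH; the "not supported" check is just a heuristic on the output) will show whether the temperature query works on your MI300:

# Minimal sketch: check whether rocm-smi can report a GPU temperature on this machine.
# Assumes the rocm-smi CLI is installed and on PATH; the string check below is only a heuristic.
import subprocess

result = subprocess.run(["rocm-smi", "--showtemp"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0 or "not supported" in result.stdout.lower():
    print("temperature query appears to be unsupported on this GPU")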

@anupambhatnagar
Author

anupambhatnagar commented Jan 25, 2024

I saw the error logs that you highlighted. I just wanted to confirm whether this is expected behavior or not.

  1. Do you plan to add support (via an omnitrace patch) for MI300 in the near future?
  2. Why is the HIP Activity Device Track missing? I'm using rocm-6.0.0-91.

@anupambhatnagar
Author

anupambhatnagar commented Jan 25, 2024

For my current use case, the HIP Activity Device track is much more important than the metrics provided by rocm-smi. I collected a trace of the following toy program and it does show the device activity track.

Toy HIP kernel

#include <cstdio>
#include <hip/hip_runtime.h>

#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)
inline static void check(hipError_t error_code, const char *file, int line)
{
    if (error_code != hipSuccess)
    {
        fprintf(stderr, "HIP Error %d %s: %s. In file '%s' on line %d\n", error_code, hipGetErrorName(error_code), hipGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

__global__ void dummy_kernel(int a)
{
    printf("I am dummy kernel %d\n", a);
}

int main()
{
    printf("AAA\n");
    CHECK(hipDeviceSynchronize());
    printf("BBB\n");
    dummy_kernel<<< 1,1 >>>(1);
    printf("CCC\n");
    CHECK(hipDeviceSynchronize());
    printf("DDD\n");
    CHECK(hipStreamAddCallback(0, [](hipStream_t stream_, hipError_t status_, void * arg){
        printf("I am host function\n");
    }, nullptr, 0));
    printf("EEE\n");
    CHECK(hipDeviceSynchronize());
    printf("FFF\n");
    dummy_kernel<<< 1,1 >>>(2);
    printf("GGG\n");
    CHECK(hipDeviceSynchronize());
    printf("HHH\n");

    return 0;
}

Command used to profile the above example: omnitrace-run -c config-file.cfg -- ./binary_name

omnitrace config

# auto-generated by omnitrace-avail (version 1.11.0) on 2024-01-24 @ 18:12

OMNITRACE_CONFIG_FILE               =
OMNITRACE_TRACE                     =
OMNITRACE_TRACE_DELAY               = 0
OMNITRACE_TRACE_DURATION            = 0
OMNITRACE_TRACE_PERIOD_CLOCK_ID     = CLOCK_REALTIME
OMNITRACE_TRACE_PERIODS             =
OMNITRACE_PROFILE                   = false
OMNITRACE_USE_SAMPLING              = true
OMNITRACE_USE_PROCESS_SAMPLING      = true
OMNITRACE_USE_ROCTRACER             = true
OMNITRACE_USE_ROCM_SMI              = false 
OMNITRACE_USE_KOKKOSP               = false
OMNITRACE_USE_CAUSAL                = false
OMNITRACE_USE_MPIP                  = true
OMNITRACE_USE_PID                   = true
OMNITRACE_USE_RCCLP                 = false
OMNITRACE_USE_ROCPROFILER           = true 
OMNITRACE_USE_ROCTX                 = false
OMNITRACE_OUTPUT_PATH               = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX             =
OMNITRACE_CAUSAL_BACKEND            = auto
OMNITRACE_CAUSAL_BINARY_EXCLUDE     =
OMNITRACE_CAUSAL_BINARY_SCOPE       = %MAIN%
OMNITRACE_CAUSAL_DELAY              = 0
OMNITRACE_CAUSAL_DURATION           = 0
OMNITRACE_CAUSAL_FUNCTION_EXCLUDE   = 
OMNITRACE_CAUSAL_FUNCTION_SCOPE     = 
OMNITRACE_CAUSAL_MODE               = function
OMNITRACE_CAUSAL_RANDOM_SEED        = 0
OMNITRACE_CAUSAL_SOURCE_EXCLUDE     = 
OMNITRACE_CAUSAL_SOURCE_SCOPE       = 
OMNITRACE_CRITICAL_TRACE            = false
OMNITRACE_PAPI_EVENTS               = 
OMNITRACE_PERFETTO_BACKEND          = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB   = 1024000
OMNITRACE_PERFETTO_FILL_POLICY      = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ     = 0
OMNITRACE_ROCM_EVENTS               = 
OMNITRACE_SAMPLING_CPUS             = none
OMNITRACE_SAMPLING_DELAY            = 0.5
OMNITRACE_SAMPLING_DURATION         = 0
OMNITRACE_SAMPLING_FREQ             = 300
OMNITRACE_SAMPLING_GPUS             = all
OMNITRACE_SAMPLING_OVERFLOW_EVENT   = perf::PERF_COUNT_HW_CACHE_REFERENCES
OMNITRACE_TIME_OUTPUT               = true
OMNITRACE_TIMEMORY_COMPONENTS       = wall_clock
OMNITRACE_VERBOSE                   = 1
OMNITRACE_ENABLED                   = true
OMNITRACE_SUPPRESS_CONFIG           = false 
OMNITRACE_SUPPRESS_PARSING          = false 

Triton code

import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr,  n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output


def main():
    torch.manual_seed(0)
    size = 98432
    x = torch.rand(size, device='cuda')
    y = torch.rand(size, device='cuda')

    output_triton = add(x, y)

if __name__ == "__main__":
    main()

My setup: MI300X, omnitrace 1.11, rocm-6.0.0-91

Any idea why the HIP activity trace doesn't render with the same config when I profile the Triton kernel?

@jrmadsen
Collaborator

Try:

omnitrace-run -c config-file.cfg -- python -m omnitrace <triton-python-script>

@anupambhatnagar
Author

omnitrace-run -c config-file.cfg -- python -m omnitrace <triton-python-script>

That didn't help; same result as before.

@jrmadsen
Collaborator

There may be some issues regardless, which will require a detailed explanation. I've got a full docket today, so I'll try to provide that once I've got some time.

@jrmadsen
Collaborator

jrmadsen commented Jan 25, 2024

But in the meantime, I'll just let you know that you'll probably want to play with LD_LIBRARY_PATH to get Omnitrace to use the same ROCm libraries as PyTorch. That may not be possible if PyTorch doesn't ship/use ROCm libraries with SOVERSIONs (e.g. only libroctracer.so instead of libroctracer.so.4). It's something we have a solution for in the new rocprofiler, but until that's released there's very little Omnitrace can do.
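
If you want to check whether that applies to your install, a sketch like the one below lists the ROCm libraries bundled inside the PyTorch wheel and whether they carry a versioned suffix (the library name patterns are illustrative guesses; look at whatever torch/lib actually contains on your system):

# Minimal sketch: list ROCm-related libraries shipped inside the PyTorch wheel and
# flag whether any carry an SOVERSION suffix (e.g. libroctracer64.so.4 vs libroctracer64.so).
# The name patterns below are illustrative; inspect torch/lib yourself for the full picture.
import os
import torch

torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
for name in sorted(os.listdir(torch_lib)):
    if any(tag in name for tag in ("roctracer", "amdhip", "hsa-runtime")):
        has_soversion = not name.endswith(".so")
        print(f"{name}  (SOVERSION suffix: {'yes' if has_soversion else 'no'})")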

@jrmadsen
Collaborator

Well actually, I've probably got enough time now. The fundamental problem I've seen with some PyTorch apps in the past is that PyTorch has an RPATH to the ROCm libraries it installs, and those libs do not have SOVERSIONs. Omnitrace sets an env variable, HSA_TOOLS_LIB, which causes the HSA runtime to call an OnLoad function when it initializes (which is triggered on the first HIP call). When that happens, Omnitrace makes the appropriate calls into roctracer to set up tracing. But roctracer is linked to the HSA and HIP runtimes with SOVERSIONs. My theory (not fully confirmed, but empirical evidence from experimenting with LD_PRELOAD and making soft links in PyTorch installs to emulate SOVERSIONs does suggest it) is that roctracer ends up communicating with different runtime libraries and effectively enables instrumentation of a different HIP/HSA runtime than the one PyTorch uses. Thus, from Omnitrace's perspective, HIP tracing is enabled, but the application simply never calls the HIP API or launches any kernels.

@jrmadsen
Collaborator

Could you do me a favor and run your app normally (without Omnitrace) and, before it exits, print out /proc/self/maps? If there aren't any ROCm libs loaded, the previous comment is probably true, and the ROCm libs I see in the maps printout of the backtrace are only there because of omnitrace.
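
For example, registering something like this in the Python script should capture the loaded shared objects just before exit (no Omnitrace involved):

# Minimal sketch: print the shared objects mapped into this process right before exit,
# so we can see which (if any) ROCm libraries the application loads on its own.
import atexit

def dump_loaded_libs():
    seen = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and ".so" in fields[-1]:
                seen.add(fields[-1])
    for path in sorted(seen):
        print(path)

atexit.register(dump_loaded_libs)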

@jrmadsen
Collaborator

And for the record, the way we are addressing this issue in the new rocprofiler (which combines the capabilities of roctracer and rocprofiler) is that rocprofiler doesn't link to the runtimes; instead, each runtime passes a table of function pointers into rocprofiler when it initializes, guaranteeing that the calls rocprofiler needs to make (via the function pointers in that table) to enable profiling are applied to that specific runtime instance. Once this is released and Omnitrace uses the new rocprofiler API, you could have 20 different HIP runtimes and Omnitrace would be able to trace any/all of them.
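
As a loose illustration of that registration pattern (this is not the rocprofiler API, just the general idea sketched in Python): each runtime instance hands the tool its own table at init, so the tool's calls always land on exactly the instance that registered.

# Illustrative sketch only, not the rocprofiler API: each runtime instance passes its own
# table of callables to the tool at init, so the tool's calls hit that specific instance.
class Tool:
    def __init__(self):
        self.tables = []

    def on_runtime_init(self, api_table):
        self.tables.append(api_table)   # remember this runtime's table
        api_table["enable_tracing"]()   # enable tracing on that exact runtime

class Runtime:
    def __init__(self, name):
        self.name = name

    def initialize(self, tool):
        # the runtime, not the tool, decides which function pointers go into the table
        tool.on_runtime_init({"enable_tracing": lambda: print(f"tracing enabled for {self.name}")})

tool = Tool()
for label in ("hip-runtime-A", "hip-runtime-B"):
    Runtime(label).initialize(tool)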

@anupambhatnagar
Author

Thanks for the detailed answer. I'll try to get the maps and share them with you.

Could you please join the ext-amd-meta slack channel? I sent a request to Weijun Jiang yesterday to add you to it. It would make collaborating on this easier. Thanks!

@ppanchad-amd

Hi @anupambhatnagar. Has your issue been resolved? If so, please close the ticket. Thanks!
