Dictionary
General Meanings
Load Balancing: Adjusting the distribution ratios of the total work among multiple workers so that more of it is completed at the same time.
Streaming: A continuous, steady data flow between two contexts. Requires one side to generate the data and the other side to receive it, automatically.
Pipeline: An assembly of stages, each connected to its neighbors and running concurrently with them, to maximize the overall throughput of the data flow.
Global Range: Total number of workitems. Must be an integer multiple of the local range.
Local Range: Number of workitems per workgroup. Must be an integer divisor of the global range.
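For example, a global range of 1024 with a local range of 64 launches 16 workgroups. A minimal sketch using this library's ClNumberCruncher/ClArray pattern (the kernel and sizes are illustrative):

```csharp
using Cekirdekler;

class RangeExample
{
    static void Main()
    {
        // Kernel source is illustrative; any OpenCL kernel works here.
        ClNumberCruncher cruncher = new ClNumberCruncher(AcceleratorType.GPU, @"
            __kernel void add(__global float *a)
            {
                int i = get_global_id(0);
                a[i] += 1.0f;
            }");
        ClArray<float> a = new ClArray<float>(1024);

        // Global range 1024, local range 64 => 1024 / 64 = 16 workgroups.
        a.compute(cruncher, 1, "add", 1024, 64);
    }
}
```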
Workgroup: A group of workitems that share the resources of a compute unit (generally, an AMD compute unit has 64 cores, an Nvidia compute unit has 128 cores, and an Intel compute unit has 8 cores).
Workitem: The smallest part of the work; it is given a global id, a local id and a group id so it can be identified in the kernel function. It is similar to one iteration of a C# Parallel.For loop.
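Inside an OpenCL kernel (written as a C# string, as in this library's samples), the three ids are queried with the standard built-ins; the kernel below is purely illustrative:

```csharp
string idKernel = @"
    __kernel void ids(__global int *globalIds,
                      __global int *localIds,
                      __global int *groupIds)
    {
        int g = get_global_id(0);         // unique across the whole global range
        globalIds[g] = g;
        localIds[g]  = get_local_id(0);   // position inside this item's workgroup
        groupIds[g]  = get_group_id(0);   // which workgroup this item belongs to
    }";
```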
Grain: The unbreakable unit of the total work. Load balancing operates in multiples of this size, which can be equal to or a multiple of the local range.
Cekirdekler API
Array Read: The host reads data from the array and writes it to device buffers; the kernel then reads from the buffer and computes. Meant for the inputs of a compute operation.
Array Write: The kernel computes and writes data to the device buffer; the host then reads from the buffer and writes it to the host array. Meant for the outputs of a compute operation.
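A sketch of how these map onto per-array flags, reusing the `cruncher` from the earlier sketch; the `read`/`write` fields and the `nextParam` chaining are recalled from this library's samples, so treat them as assumptions if they differ:

```csharp
ClArray<float> input  = new ClArray<float>(1024);
ClArray<float> output = new ClArray<float>(1024);

input.read   = true;   // host -> device before the kernel runs (array read)
input.write  = false;  // no copy back afterwards
output.read  = false;  // no host -> device copy needed
output.write = true;   // device -> host after the kernel runs (array write)

// Arrays bind to kernel parameters in declaration order.
input.nextParam(output).compute(cruncher, 1, "process", 1024, 64);
```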
Partial Read: Each device reads only its load-balanced part of an array (parallel to its workitem range) instead of the whole array; when disabled, every device reads all of the array, which is needed when the kernel accesses elements outside its own range.
Write All: The device writes all of an array instead of just its load-balanced part (parallel to its workitem range).
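By analogy with the read/write flags above, these behaviors also toggle per array; `partialRead` and `writeAll` are the flag names as recalled from this library and should be treated as assumptions:

```csharp
ClArray<float> matrix = new ClArray<float>(1024);

matrix.read        = true;
matrix.partialRead = false; // every device reads the whole array (random-access kernels)
matrix.write       = true;
matrix.writeAll    = true;  // the device writes the whole array, not just its own part
```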
Load Balancing: For parallel multi-GPU workloads, minimizes compute latency by trading workitems between GPUs over repeated compute() calls.
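A minimal multi-GPU sketch: building the cruncher with all GPUs lets repeated compute() calls rebalance automatically (the kernel and sizes are illustrative):

```csharp
string kernels = @"
    __kernel void heavyKernel(__global float *a)
    {
        int i = get_global_id(0);
        a[i] = a[i] * a[i] + 1.0f;
    }";

// Uses every GPU in the system; workitems are traded between them.
ClNumberCruncher cruncher = new ClNumberCruncher(AcceleratorType.GPU, kernels);
ClArray<float> data = new ClArray<float>(1024 * 1024);

// Repeated calls let the balancer converge on per-GPU workitem shares.
for (int i = 0; i < 100; i++)
    data.compute(cruncher, 1, "heavyKernel", 1024 * 1024, 64);
```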
Streaming Data: Zero-copy access between device and host. Used in both multi-GPU and single-GPU setups.
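If memory serves, streaming is enabled per array; the flag name below is hypothetical and only illustrates the intent (the device works on the host-side buffer in place instead of bulk-copying), reusing the `cruncher` from the sketch above:

```csharp
ClArray<float> streamed = new ClArray<float>(1024);
streamed.zeroCopy = true; // hypothetical flag name; the real toggle may differ

// No separate read/write copies: the device accesses the host buffer directly.
streamed.compute(cruncher, 1, "heavyKernel", 1024, 64);
```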
Event driven pipeline:
- overlaps data reads (of all input arrays) with the first kernel execution in the list of kernel names.
- computes all intermediate kernels.
- overlaps data writes (of all output arrays) with the last kernel execution in the list of kernel names.
- works with both multi-GPU and single-GPU setups (a sketch follows this list).
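A sketch of enabling the event-driven pipeline on the earlier `data`/`cruncher`; the trailing pipeline arguments to compute() are recalled loosely from this library's samples and should be treated as assumptions:

```csharp
// Three kernels run back to back; reads overlap "stage1", writes overlap "stage3".
data.compute(cruncher, 1, "stage1 stage2 stage3",
             1024 * 1024, 64,  // global range, local range
             0,                // global range offset
             true,             // enable pipelining
             4,                // number of blobs the work is split into
             ClArray<float>.PIPELINE_EVENT); // event-driven variant (constant name assumed)
```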
Driver controlled pipeline:
- divides all of the work into smaller read+compute+write operations.
- sends them all concurrently to the GPU, which yields driver-controlled overlapping behavior.
- works with both multi-GPU and single-GPU setups (a sketch follows this list).
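By analogy with the event-driven case above, selecting the driver-controlled variant should only change the pipeline-type argument; the constant name is assumed:

```csharp
data.compute(cruncher, 1, "stage1 stage2 stage3",
             1024 * 1024, 64, 0,
             true,             // enable pipelining
             16,               // more, smaller blobs suit driver-controlled overlap
             ClArray<float>.PIPELINE_DRIVER); // driver-controlled variant (constant name assumed)
```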
Device to device pipeline:
- only a single GPU per pipeline stage is assumed.
- data flows through the pipeline one stage at a time.
- data exits an N-stage pipeline after N steps.
- GPU compute and GPU-to-GPU data transfers are overlapped; host-to-GPU and GPU-to-host transfers are not overlapped and are serialized with GPU compute.
- higher latency than the other pipelines, but potentially higher throughput (a sketch follows this list).
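A rough sketch of building such a pipeline from single-GPU stages; every class and method name below is a hypothetical reconstruction of the device-to-device pipeline API, so consult the library source for the exact names:

```csharp
// gpu0/gpu1 stand for single-GPU device selections (placeholders).
ClPipelineStage stage1 = new ClPipelineStage();   // hypothetical class
stage1.addDevices(gpu0);                          // hypothetical method
stage1.addKernels(kernels, "produce");            // hypothetical method

ClPipelineStage stage2 = new ClPipelineStage();
stage2.addDevices(gpu1);
stage2.addKernels(kernels, "consume");

stage2.prependToStage(stage1);                    // chain: stage1 feeds stage2
ClPipeline pipeline = stage1.makePipeline();      // hypothetical

// Each run() moves data forward by one stage; with N stages,
// the first results exit after N runs.
for (int i = 0; i < 10; i++)
    pipeline.run();
```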
Enqueue Mode:
- Meant to optimize single-GPU scenarios. Uses a single command queue for all work.
- Much less accumulation of API overhead over thousands of compute() calls.
- Async mode enables multiple command queues for different compute() groups in a single enqueue-mode batch, overlapping the queues' work to save time.
- Used to build custom single-GPU pipelines (a sketch follows this list).
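A sketch of a batch on the earlier `data`/`cruncher`, assuming the boolean `enqueueMode` flag on ClNumberCruncher described in this wiki (the exact synchronization behavior is an assumption):

```csharp
cruncher.enqueueMode = true;   // start batching: compute() calls only enqueue work

for (int i = 0; i < 1000; i++)
    data.compute(cruncher, 1, "heavyKernel", 1024 * 1024, 64);

cruncher.enqueueMode = false;  // end of batch: flushes the queue and synchronizes
```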