Dictionary

General Meanings

Load Balancing: Adjusting the distribution ratios of the total work among multiple workers so that more of it is completed in the same amount of time.

Streaming: Continuous, steady data flow between two contexts. One side must generate the data while the other receives it, automatically.

Pipeline: An assembly of stages, each connected to its neighbors and running concurrently with them, to maximize the overall throughput of the data flow.

Global Range: Total number of workitems. Must be an integer multiple of the local range.

Local Range: Number of workitems per workgroup. Must be an integer divisor of the global range (see the sketch below).
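
The multiple/divisor rule can be checked with simple arithmetic; a minimal sketch (plain C#, no library calls):

```csharp
using System;

int globalRange = 1024; // total workitems
int localRange = 64;    // workitems per workgroup

// global range must be an integer multiple of local range
if (globalRange % localRange != 0)
    throw new ArgumentException("global range must be a multiple of local range");

int workgroups = globalRange / localRange;
Console.WriteLine(workgroups); // 16 workgroups
```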

Workgroup: A group of workitems that shares all the resources of a compute unit (generally, an AMD compute unit has 64 cores, an Nvidia compute unit has 128 cores, and an Intel compute unit has 8 cores).

Workitem: The smallest part of the work; it is given a global id, a local id and a group id so it can be identified in the kernel function. It is similar to one iteration of a C# Parallel.For loop (see the sketch below).
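
To make the analogy concrete, a minimal sketch: the Parallel.For index plays the same role as a workitem's global id (the kernel shown in the comment is illustrative OpenCL C, not taken from the project):

```csharp
using System;
using System.Threading.Tasks;

float[] a = new float[1024];

// Each iteration behaves like one workitem; `i` corresponds to get_global_id(0).
Parallel.For(0, a.Length, i => { a[i] = i * 2.0f; });

// Illustrative kernel doing the same work, one workitem per element:
// __kernel void twice(__global float* a)
// { int i = get_global_id(0); a[i] = i * 2.0f; }

Console.WriteLine(a[3]); // 6
```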

Grain: An unbreakable part of the total work. Load balancing works at this granularity; the grain size can be equal to, or a multiple of, the local range (see the sketch below).
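
A toy sketch of grain-based balancing (not the library's actual balancer): the global range is carved into grain-sized chunks, and the chunks are dealt out according to each device's performance ratio:

```csharp
using System;

int globalRange = 1024;
int grain = 64;                        // unbreakable chunk; equal to or a multiple of the local range
int totalGrains = globalRange / grain; // 16 grains to distribute

double[] ratios = { 0.75, 0.25 };      // e.g. a fast GPU and a slow GPU
int grainsFast = (int)Math.Round(totalGrains * ratios[0]); // 12 grains = 768 workitems
int grainsSlow = totalGrains - grainsFast;                 //  4 grains = 256 workitems

Console.WriteLine($"fast GPU: {grainsFast * grain} workitems, slow GPU: {grainsSlow * grain} workitems");
```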


Cekirdekler API

Array Read: The host reads data from the array and writes it to device buffers; the kernel then reads the data from the buffer and computes. Meant to be used for the inputs of a compute operation.

Array Write: The kernel computes and writes data to the device buffer; the host then reads the data from the buffer and writes it to the host array. Meant to be used for the outputs of a compute operation.

Partial Read: Each device reads only its load-balanced part of an array (parallel to its workitem range) instead of the whole array; when disabled, all devices read the entire array.

Write All: The device writes all of an array instead of just its load-balanced part (parallel to its workitem range). These four behaviors are set per array (see the sketch below).
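
A minimal sketch of how these four behaviors might be configured, assuming boolean flags named after the entries above (the ClArray<T> type appears in the project README; the exact flag names here are assumptions, not confirmed API):

```csharp
using Cekirdekler;

ClArray<float> input = new ClArray<float>(1024);
input.read = true;        // Array Read: host array -> device buffer before the kernel runs (assumed name)
input.partialRead = true; // Partial Read: each device reads only its own share (assumed name)
input.write = false;      // input only; never copied back to the host

ClArray<float> output = new ClArray<float>(1024);
output.read = false;      // no upload needed for a pure output
output.write = true;      // Array Write: device buffer -> host array after the kernel (assumed name)
output.writeAll = false;  // only the load-balanced part is written back (assumed name)
```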

Load Balancing: For parallel multi-GPU work, minimizes compute latency by trading workitems between GPUs.

Streaming Data: Zero-copy access between device and host. Used in both multi-GPU and single-GPU setups.

Event-driven pipeline:

  • overlaps the data read (of all input arrays) with the first kernel execution in the list of kernel names.
  • computes all the intermediate kernels.
  • overlaps the data write (of all output arrays) with the last kernel execution in the list of kernel names.

Works with both multi-GPU and single-GPU setups; a sketch follows below.
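
A hedged sketch of a compute that this pipeline can overlap. The ClNumberCruncher/ClArray usage follows the project README; the space-separated kernel list matches "the list of kernel names" above, while the extra compute() parameters that actually switch pipelining on are omitted because their exact names are not confirmed here:

```csharp
using Cekirdekler;

// Three kernels in one compute() call: with the event-driven pipeline,
// input reads overlap "stage1" and output writes overlap "stage3".
ClNumberCruncher gpu = new ClNumberCruncher(AcceleratorType.GPU, @"
    __kernel void stage1(__global float* a){ int i = get_global_id(0); a[i] += 1.0f; }
    __kernel void stage2(__global float* a){ int i = get_global_id(0); a[i] *= 2.0f; }
    __kernel void stage3(__global float* a){ int i = get_global_id(0); a[i] -= 3.0f; }");

ClArray<float> data = new ClArray<float>(1024);
data.compute(gpu, 1, "stage1 stage2 stage3", 1024, 64); // kernels run in list order
```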

Driver-controlled pipeline:

  • divides all the work into smaller read+compute+write operations
  • sends them all to the GPU concurrently, which yields driver-controlled overlapping behavior (a sketch follows below)

Works with both multi-GPU and single-GPU setups.
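
A toy sketch of the division itself (plain C#; the library performs the equivalent enqueueing internally):

```csharp
using System;

int globalRange = 1024, blobs = 4;
int blobRange = globalRange / blobs;

for (int b = 0; b < blobs; b++)
{
    int offset = b * blobRange;
    // Each blob becomes its own read+compute+write chain on the queue,
    // so the driver can overlap blob b's write with blob b+1's read/compute.
    Console.WriteLine($"blob {b}: workitems [{offset}, {offset + blobRange})");
}
```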

Device-to-device pipeline:

  • only a single GPU per pipeline stage is assumed
  • data flows through the pipeline only one stage at a time
  • data exits the pipeline after N steps, where N is the number of stages
  • GPU compute and GPU-to-GPU data transitions are overlapped; host-to-GPU and GPU-to-host transitions are not overlapped and are serialized with GPU compute
  • higher latency than the other pipelines, but potentially higher throughput (a toy model follows below)
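
A toy model of the timing trade-off (plain C#, not library code): each item needs N steps to exit an N-stage pipeline (higher latency), but once the pipeline is full one item exits per step (higher throughput):

```csharp
using System;

int stages = 3, items = 5;
int?[] pipe = new int?[stages]; // pipe[s] holds the item currently in stage s

for (int step = 0, fed = 0; step < items + stages; step++)
{
    if (pipe[stages - 1] is int done)                           // an item leaves the last stage
        Console.WriteLine($"step {step}: item {done} exits");
    for (int s = stages - 1; s > 0; s--) pipe[s] = pipe[s - 1]; // shift one stage forward
    pipe[0] = fed < items ? fed++ : (int?)null;                 // host feeds the first stage
}
```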

Enqueue Mode:

  • Meant to optimize single-GPU scenarios. Uses a single command queue for all work.
  • Accumulates much less API overhead over thousands of compute() calls.
  • Async mode enables multiple command queues for different compute() groups within a single enqueue-mode batch, overlapping the work of all queues to save time.
  • Used to build custom single-GPU pipelines (a sketch follows below).
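
A minimal sketch of the intended usage pattern, assuming enqueueMode is a boolean toggle on the number cruncher (the member name follows this wiki's terminology and should be treated as an assumption):

```csharp
using Cekirdekler;

ClNumberCruncher gpu = new ClNumberCruncher(AcceleratorType.GPU, @"
    __kernel void step(__global float* a){ int i = get_global_id(0); a[i] += 1.0f; }");
ClArray<float> data = new ClArray<float>(1024);

gpu.enqueueMode = true;                 // assumed flag: start batching commands in one queue
for (int i = 0; i < 1000; i++)          // thousands of small compute() calls accumulate cheaply
    data.compute(gpu, 1, "step", 1024, 64);
gpu.enqueueMode = false;                // assumed: flush the queue and synchronize results
```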