-
Notifications
You must be signed in to change notification settings - Fork 50
/
Copy pathReadme - About OpenCL.txt
248 lines (176 loc) · 11 KB
/
Readme - About OpenCL.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
OpenCL is a framework for writing programs that can be run specialized compute devices like GPU.
The way OpenCL works is like this :
Programmer writes a small program written OpenCL C, usually in a .cl file
This program contains at least 1 function that has the attribute 'kernel'
A sofware can then read the source of the OpenCL C and give it to OpenCL.
OpenCL then gives the source to the device driver which compiles it for the desired device.
Afterward, the software asks OpenCL to send data to the device memory.
Then the software asks OpenCL to run the kernel with the data.
After that, the software asks for the result of the computation to be transfered back to the host (to the CPU).
Some terminology :
device The device used by OpenCL to run the kernels, usually a GPU
host The environment using OpenCL, usually a software running on a CPU
kernel A function marked with 'kernel' in OpenCL C
Must be return void
Usually takes at least one pointer as argument
Compiled by the driver
Executed in parralel on the device
program Code written in OpenCL C
OpenCL C Language based on C (version C99 + additions)
Global range Total number of work items to start for one kernel execution
(number of time the kernel will be executed)
work-item One instance of execution of a kernel, also called a 'thread' or a 'worker'
work-group Work items are launched in groups.
Each group is completely independant from other groups (no synchronization, no data sharing)
A common work-group size 256
The maximum work-group size is dependent on the device used
work-items in the same workgroup can be synchonized
Local range Number of work items per work-group
NDRange A list of numbers representing the size of ranges (Global range and Local range)
ND is for 'N' dimentions, meaning the 'problem size' can have multiple dimentions
The NDRange is usually made of 3 values, so 3 dimentions
The size of the range is each of the value multiplied together
For image processing, we use 2 dimentions, which give a range similar to this :
Global : 512, 512, 1 - Which gives a total of 262 144 work items
Local : 16, 16, 1 - Which gives 256 work-items per workgroup
buffer A segment of memory on the device
Viewed as a sequence of bytes by the hardware of the device
image An image in the memory of the device (1D, 2D or 3D)
Viewed as a 'texture' by the hardware of the device
The device knows the type of data and number of channels in the image
global A type of device memory that all work-items can read and write
This is the slowest type of memory in the device
constant A type of device memory that work-items can only read from
Similar to global but faster because it is usually cached
A limited amount of constant memory is available (sizes depend on the device used)
local A type of device memory that is local to one work-group
Each work-group has one instance of its local memory
Work-items in one work-group can read and write in their work-group's local memory
local memory is much faster than global memory
local memory can be used to share data between items in the same workgroup
A very limited amount of constant memory is available (sizes depend on the device used, can be as low as 16KB)
private A type of device memory that is independent for each work-item
It is the fastest type of memory
All variables declared without 'global, constant or local' will be private
Only a very limited amount is available (usually a few KB)
read_only An image that is given as input to a kernel
Can only be read
Faster than global memory and usually as fast as constant memory
Uses texture hardware on GPUs
write_only An image that is given as output to a kernel
Can only be written
sampler An objects that tells the device how to read from a read_only image
Can set :
What to return when reading outside of image
How to interpret in-between pixel coordinates
barrier A point in a kernel that all work-items in a work-group must reach
before any work-item in the work-group can continue.
Used to synchronize data exchanges between work-items inside a work-group
platform Environment provided a the device vendor that implements OpenCL
Usual platforms are : AMD, nVidia, Intel, etc...
context Object representing a collection of devices from the same platform
Command queue Object representing a list of actions to do on one device of a context
These actions are mainly kernel executions and memory transfers
Actions are asyncronous, they will be done when the device(s) is available
The command queue can either be set 'in-order' (the default) or 'out-of-order'
When 'in-order', each action is started after the previous action is done
When 'out-of-order', multiple actions can be executed at the same time and
not necessarily in the order they were issued
'out-of-order' can give more performance trough increased device utilisation,
at the cost of more complex programming
event A syncronization object used to order execution of dependent actions in 'out-of-order' command queues
C API for OpenCL
OpenCL is provided with a C API
The API is Object Oriented
Because we use C++ and C++ handled Object Oriented programming better than C, we won't use the C API directly
C++ bindings for OpenCL
There exist an official C++ bindings for the OpenCL API, which is well made and reduces the amount of work needed to make a C++ software that uses OpenCL.
How to make a working OpenCL C++ program :
#include <cl/cl.hpp>
#include <vector>
using namespace cl;
using namespace std;
int main(int, char**)
{
// Create a platform
Platform platform = Platform::getDefault();
// List devices for this platform
vector<Device> devices;
platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
// Create a context for the first device
Context context(devices[0]);
// Create a command queue
CommandQueue queue(context, devices[0]);
// Create a program
Program program(context, "OpenCL C code goes here...");
// Build the program
program.build();
// Create an object representing our kernel
auto kernel = make_kernel<Buffer, Buffer>(program, "example_kernel");
// Prepare
const static int Size = 1000000;
vector<int> inputData(Size, 0), outputData(Size, 0);
// Create the buffers in the device
Buffer inputBuffer(context, CL_MEM_READ_ONLY, Size * sizeof(int)); // Can only be read by the kernel
Buffer outputBuffer(context, CL_MEM_WRITE_ONLY, Size * sizeof(int)); // Can only be written by the kernel
// Send the inputData to the device memory
queue.enqueueWriteBuffer(inputBuffer, false, 0, Size * sizeof(int), inputData.data());
// Enqueue execution of the kernel
// With a Global size of 1000000 and an automatic local size
// The kernel will be executed 1000000 times
kernel(EnqueueArgs(queue, NDRange(Size)), inputBuffer, outputBuffer);
// Read outputData from the device memory
queue.enqueueReadBuffer(outputBuffer, false, 0, Size * sizeof(int), outputData.data());
// Wait for all queued actions to complete
queue.finish();
// outputData now contains the result of the computation
}
How to write an OpenCL C program :
// This program would work if given to the above example where "OpenCL C code goes here..."
kernel void example_kernel(global int * input, global int * output)
{
int worker_id = get_global_id(0); // Retreives the id of the current work-item
// Each work item will have a different id, starting from 0 to the first value given to NDRange
output[worker_id] = input[worker_id] + 10 + worker_id; // Example computation
}
// OpenCL C has the C pre-processor :
#define MACRO(a, b) a + b
// More advanced
bool function(int a)
{
float4 vector_type(0, 1, 2, 3); // OpenCL C adds vector data types
// These vector data types simplify 3D calculations, pixel color manipulations and other similar tasks
// They also ease the task of the compiler when trying to generate SIMD code
vector_type *= 2; // They work just like normal basic types
float v = vector_type.x; // Individual members can be accessed with .x, .y, .z and .w or .S0, .S1, .S2 and.S3
float2 v2; float8 v8; float16 v16 // Multiple sizes are available
uchar uc; uint ui; // unsigned types are simpler to write
local bool local_buffer[256]; // in local memory - can't have an initializer
int lid = get_local_id(0); // worker no in the workgroup
if (lid < 256)
local_buffer[lid] = (uc8.S1 == uc);
barrier(CLK_LOCAL_MEM_FENCE); // synch work-items in the workgroup and issue all local memory operations
if (lid < 256 && lid > 1)
return local_buffer[lid - 1]; // Read result from another work item in the same work group
return false;
}
Performance notes about OpenCL :
Program build time
Programs can take a long time to build (100+ms), make sure to compile the program only once and
keep the built program ready for when it is needed
Memory transfer
Transferering memory from host to device and back takes time (1+ms), make sure to transfer only needed data and transfer them only when needed
Global data access
Devices generally have more compute power than they have global memory bandwith
kernels that read multiple values from global memory can be accelerated by copying the data in a local array
Amount of computation per kernel
A very simple kernel that read data, do few calculations and then write the result do not use the full power of a GPU.
A sequence of simple kernels will perform less than one kernel doing all the calculations at once.
But a very big kernel can suffer from private or local memory exhaustion on some devices and will have less performance
(this is usually not a problem except for very complex algorithms).
More reading :
C++ Wrapper for OpenCL : http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf
OpenCL Reference : http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/
Quick reference card : http://www.khronos.org/registry/cl/sdk/1.2/docs/OpenCL-1.2-refcard.pdf
Complete OpenCL tutorial : http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=41&Itemid=75
(uses .Net for host software but it still a very good and complete tutorial)