function specialization Archives - GPU programming environment: Gepura

The Three Function Types of Quasar from a Technical Perspective

March 20, 2018

To give programmers ultimate control on the acceleration of functions on compute devices (GPU, multi-core CPU), Quasar has three distinct function types:

Host Functions: host functions are the default functions in Quasar. Depending on the compilation target, they are either interpreted or compiled to binary CPU code. Host functions run on the CPU and form a glue for the code that runs on other devices. When host functions contain for-loops, these loops are (when possible) automatically parallelized by the Quasar compiler. The parallelized version then runs on multi-core CPU or GPU; what happens in the background is that a kernel function is generated. Within the Quasar Redshift IDE, host functions are interpreted, which also allows stepping through the code and putting breakpoints.
Kernel Functions: given certain data dimensions, kernel functions specify what needs to be done to each element of the data. The task can be either a parallel task or a serial task. Correspondingly, kernel functions are launched either using the parallel_do function or using the serial_do function:
```
function [] = __kernel__ filter(image, pos)
    ...
endfunction
parallel_do(size(image),image,filter)
```
In the above example, the kernel function filter is applied in parallel to each pixel component of the image image. The first parameter of parallel_do/serial_do is always the data dimension (i.e., the size of the data). The last parameter is the kernel function to be called, and in the middle are the parameters to be passed to the kernel function. pos is a special kernel function parameter that indicates the position within the data (several other parameters exist, such as blkpos, blkdim etc., see the documentation for a detailed list).

Kernel functions can call other kernel functions, but only through serial_do and/or parallel_do. The regular function calling syntax (direct call, i.e, without serial_do/parallel_do) on kernel functions is not permitted. This is because kernel functions have other infrastructure (e.g., shared memory, device-specific parameters) which would cause conflicts when kernel functions would call each other.
Device functions: in contrast to host functions, device functions (indicated with __device__) are intended to be natively compiled using the back-end compiler. They can be seen as auxiliary functions that can be called from either kernel functions, device functions or even host functions (note that a kernel function cannot call a host function). The compiler back-end may generate C++, CUDA or OpenCL code that is then further compiled to binary code to be executed on CPU and/or GPU. Currently, apart from the CPU there are no accelerators that can run a device function directly when called from a host function. This means, for a GPU, the only way to call a device function is through a kernel function.
```
function y = __device__ norm(vector : vec)
    y = sqrt(sum(vector.^2))
endfunction
```

Functions in Quasar are first-class variables. Correspondingly, they have a data type. Below are some example variable declarations:

host : [(cube, cube) -> scalar]
filter : [__kernel__ (cube) -> ()]
norm : [__device__ (vec) -> scalar]

This allows passing functions to other functions, where the compiler can statically check the types. Additionally, to prevent functions from being accessed unintendedly, functions can be nested. For example, a gamma correction operation can be implemented using a host function containing a kernel function.

    function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
        function [] = __kernel__ my_kernel(gamma_fn : [__device__ scalar->scalar], x : cube, y : cube, pos : ivec2)    
           y[pos] = gamma_fn(x[pos])
        endfunction
        y = uninit(size(x)) % Allocation of uninitialized memory
        parallel_do(size(x), gamma_fn, x, y, my_kernel)
    endfunction

Because functions support closures (i.e., they capture the variable values of the surround contexts at the moment the function is defined), the above code can be simplified to:

    function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
        function [] = __kernel__ my_kernel(pos : ivec2)    
           y[pos] = gamma_fn(x[pos])
        endfunction
        y = uninit(size(x))
        parallel_do(size(x), my_kernel)
    endfunction

Or even in shorter form, using lambda expression syntax:

function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
    y = uninit(size(x))
    parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma(x[pos]))
end

In many cases, the type of the variables can safely be omitted (e.g., pos in the above example) because the compiler can deduce it from the context.

Now, experienced GPU programmers may wonder about the performance cost of a function pointer call from a kernel function. The good news is that the function pointer call is avoided, by automatic function specialization (a bit similar to template instantiation in C++). With this mechanism, device functions passed via the function parameters or closure can automatically be inlined in a kernel function.

function y = gamma_correction(x, gamma_fn)
    y = uninit(size(x))
    parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma(x[pos]))
end
gamma = __device__ (x) -> 255*(x/255)^2.4
im_out = gamma_correction(im_in, gamma)

This will cause the following kernel function to be generated, behind the hood:

function [] = __kernel__ opt__gamma_correction(pos:int,y:vec,x:vec)
      % Inlining of function 'gamma' with parameters 'x'=x[pos]
      y[pos]=(255*((x[pos]/255)^2.4))
endfunction

Another optimization was performed in the background: kernel flattening. Because each element of the data is accessed separately, the data is processed as a vector (using raster scanning) rather than as a matrix (or cube) representation. This entirely eliminates costly index calculations.

Summarizing, three function types offer the flexibility of controlling on which target devices (CPU or GPU) certain code fragments might run, while offering a unified programming interface: the code written for GPU can be executed on CPU as well, thereby triggering other target-specific optimizations.