New feature in Quasar: fixed size data types

In the past, small matrices such as ones(3,3) required dynamic memory allocation, which incurs additional overhead. This allocation could be avoided by reverting to the fixed-length vector types (vec4 etc.), but matrix code written using this technique is harder to read and often error-prone (for example, for a 4×4 matrix one would write A[5] instead of A[1,1]). The use of high-level expressions such as zeros(N,N) is very natural in a high-level language such as Quasar, but zeros(N,N) does not lead to the most efficient code generation for a hardware architecture such as a GPU. Solving this problem therefore represents a major milestone in the conversion from high-level code to low-level code.

For this purpose, fixed-length vectors have now been generalized to fixed-sized matrices, cubes and hypercubes, with full compiler support (type inference and code generation). We will show that this successfully solves the code generation problem. With no or limited code changes, existing code can even benefit from the fixed-sized matrix optimizations. Additionally, it is useful to guide the compiler by annotating variables and parameters with the appropriate fixed-sized matrix types (e.g., to catch matrix dimension mismatches early in development). The following types have been defined:

Quasar type   Parametric alias   Meaning
vec3          vec(3)             A vector of length 3
mat2x2        mat(2,2)           A matrix of dimensions 2 x 2
cube2x3x2     cube(2,3,2)        A cube with dimensions 2 x 3 x 2
cube2x2x2x2   cube(2,2,2,2)      A hypercube with dimensions 2 x 2 x 2 x 2


In the above table, all types exist in a fixed form and a parametric form. In code, the fixed form is often slightly more readable. The parametric form allows generic parametrization, which will be explained in one of the next sections.
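
For illustration, the two notations below declare the same parameter type (a minimal sketch; the function names are hypothetical):

function y = scale_fixed(A : mat3x3, s : scalar)   % fixed form
    y = s * A
endfunction

function y = scale_param(A : mat(3,3), s : scalar) % parametric form, same type
    y = s * A
endfunction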

Fixed sized data types allow the compiler to generate more efficient code, especially when the dimensions are known at compile-time. In some cases, this also allows out-of-bounds indices to be detected earlier:

function y = transform(A : mat2x2'checked, b : vec2)
    y = A[4,4] * b[0] % Index out-of-bounds
endfunction

Also, incompatibilities in matrix/vector operations can be detected at compile-time in some cases, for example:

y = [1,2,3] * eye(4)  % Error: Matrix dimensions (1x3 and 4x4) do not match for multiplication!

The explicit matrix dimensions can be seen as a type constraint (i.e., a constraint that is applicable to every instance of the type and that can be handled by the truth-value system to enable inference).

For the generated CUDA/CPU code, the fixed sized data types are stored in registers (GPU) or stack memory (CPU), respectively. This entirely avoids dynamic kernel memory and offers significant performance benefits, often a tenfold speedup. Because the dimensions are known at compile-time, the indexing calculations are also faster.

All fixed sized data types are by default passed by reference (note: this is the same for generic vectors/matrices and even fixed-length vectors). The fixed dimensions should be seen as a constraint that applies to every instance of the type. Internally, when safe, the compiler may generate code that passes the vector/matrix by value (for example, when the corresponding variable is constant). Except for class/mutable class objects, the fixed sized data types are stored in-place (i.e., without a reference).

To take advantage of fixed sized data types, no changes are required to existing Quasar programs, other than occasional changes to the function parameter types (e.g., from mat to mat3x3 for a 3×3 matrix). Whenever possible, the compiler attempts to derive the data dimensions by type inference. For example, the following defines a 2×3 matrix and its transpose:

A = [[1,2,3],[2,3,4]]
B = transpose(A)

The compiler will automatically determine the type of A as mat2x3 and B as mat3x2. Moreover, when the following builtin functions are used within a device/kernel function, they will automatically take advantage of the fixed sized data types:

zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim,
cumsum, cumprod, cummin, cummax

The same holds for arithmetic operations (addition, subtraction, multiplication, division, etc.), matrix multiplication, and mathematical functions (sin, cos, tan, etc.).
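
For instance, a device function along the following lines would operate entirely on registers (a minimal sketch; the name local_contrast is illustrative):

function y = __device__ local_contrast(A : mat3x3)
    y = max(A) - min(A)   % reductions over a fixed sized matrix: no dynamic memory
endfunction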

Some less obvious performance considerations: passing a mat(N,N) to a device function accepting mat does not require dynamic kernel memory! Internally, the runtime will cast the fixed sized matrix to a generic matrix. Similarly, a device function returning mat(N,N) does not use dynamic kernel memory. However, a device function returning mat (e.g., function [y:mat] = __device__ X()) does, even when the function returns a fixed sized matrix such as mat(4,4). The reason is that device functions need to respect their type signature so that they can be used as function variables. Therefore, for device functions returning mat/cube types, it is best to update the type signature, especially if the dimensions of the return value are known beforehand.
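
The difference can be sketched as follows (hypothetical device functions; only the return type annotation differs):

function [y : mat] = __device__ make_transform_generic()
    y = eye(4)   % returns a 4x4 matrix, but the generic mat signature forces dynamic kernel memory
endfunction

function [y : mat4x4] = __device__ make_transform_fixed()
    y = eye(4)   % fixed sized return type: registers/stack memory only
endfunction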

Examples

Below are a few useful examples of using fixed sized data types.

1 Built-in function optimizations

function [y1,y2] = __device__ func (x : mat4x4'safe, y : vec4'circular)
    y1 = sum(x,1) + x[4,0]
    y2 = cumsum(x,1) + y[6]
endfunction

Because the matrix and vector dimensions are known at compile-time, the above function will only use stack memory/registers to calculate the results y1 and y2, i.e., there is no dynamic kernel memory involved.

2 Matrix extraction and processing

Another use case that can now be optimized is given below:

im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))

for m=0..size(im,0)-1
    for n=0..size(im,1)-1
        window = im[m+(-1..1),n+(-1..1)]
        im_out[m,n] = max(window) - min(window)
    endfor
endfor

Here, the compiler will determine the type of window to be mat3x3. In previous versions of Quasar, the same effect would be obtained by writing:

window = [[im[m-1,n-1],im[m-1,n],im[m-1,n+1]],
          [im[m  ,n-1],im[m  ,n],im[m  ,n+1]],
          [im[m+1,n-1],im[m+1,n],im[m+1,n+1]]]

The fixed-sized matrix approach is clearly more versatile and also allows two-dimensional addressing. One caveat is that the programmer needs to ensure that the dimensions of window can be determined at compile-time. If this is not the case, the compiler will generate a warning mentioning that the kernel function consumes dynamic memory. In case the dimensions are parametric, a way around this is to use generically size-parametrized types (see the section on generic dimension/size-parametrized array types below) or generic specialization:

im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))

function [] = generic_nlfilter(K : int)
    {!auto_specialize}
    for m=0..size(im,0)-1
        for n=0..size(im,1)-1
            window = im[m+(-K..K),n+(-K..K)]
            im_out[m,n] = max(window) - min(window)
        endfor
    endfor
endfunction

generic_nlfilter(2)

3 Matrix processing in a loop

The following code benefits substantially from the fixed-sized matrix optimizations: efficient code is generated because 1) the matrix dimensions are known inside the kernel function, which enables additional optimization techniques under the hood (such as loop unrolling), and 2) no dynamic kernel memory is required:

% (code courtesy of Michiel Vlaminck)
{!parallel for}
for i=0..size(covs,2)-1
    D = vec(3)          % Type: vec3   or in parametric form vec(3)
    V = zeros(3,3)      % Type: mat3x3 or in parametric form mat(3,3)
    unused = jacobi(covs[0..2,0..2,i], 3, D, V)

    % Avoids matrices near singularities
    mineig : scalar = 100 * D[2]
    D[0] = D[0] < mineig ? mineig : D[0]
    D[1] = D[1] < mineig ? mineig : D[1]

    covs[0..2,0..2,i] = V * diagm(D) * transpose(V)

    for j=0..2
        D[j] = 1.0/D[j]
    endfor

    icovs[0..2,0..2,i] = transpose(V) * diagm(D) * V
endfor

In particular, the type inference engine can determine that type(diagm(D)) is mat(3,3), so that the matrix products V * diagm(D) * transpose(V) and transpose(V) * diagm(D) * V can also be calculated with fixed dimensions.

Generic dimension/size-parametrized array types

Often, it is desirable to specify vector and matrix dimensions without restricting the functions to one particular choice of dimensions. In this case, dimension/size-parametrized types come to the rescue.

The following example demonstrates a circshift function that can be used in an arbitrary dimension (1D, 4D, 8D, etc.):

function [y] = circshift[N : int](x : cube{N}, shift : vec(N))
    function [] = __kernel__ kernel (x : cube{N}'circular, y : cube{N}, shift : vec(N), pos : vec(N))
        y[pos] = x[pos + shift]
    endfunction
    y = zeros(size(x,0..N-1))
    parallel_do(size(x), x, y, shift, kernel)
endfunction

This is called a dimension-parametric function (the function works for any dimension N>0). The specialization of this function is done at the caller side. For example, if the function is called as circshift(randn(64,64,3),[10,10,0]), the compiler will deduce N=3 and will generate a specialized version (as if the function had been written specifically for N=3). Alternatively, one may also call circshift[3](randn(64,64,3),[10,10,0]) to pass the parameter explicitly (e.g., in order to ensure type correctness).

Compare this to the non-generic counterpart (note that cube{:} denotes a (hyper)cube of unspecified dimension):

function [y] = circshift(x : cube{:}, shift : vec)
    function [] = __kernel__ kernel (x : cube{:}'circular, y : cube{:}, shift : vec, pos)
        y[pos] = x[pos + shift]
    endfunction
    y = zeros(size(x))
    parallel_do(size(x), x, y, shift, kernel)
endfunction

Sadly, the non-generic counterpart does not work as expected. For it to work, the length of the shift vector needs to match the maximal dimension of cube{:}. This is because the compiler will adjust the length of pos to match the maximal dimensionality (currently 16). If the shift vector has a different length, a runtime error results. Moreover, it is desirable to write code that does not depend on built-in constants such as the maximal dimensionality.

The parametric version entirely solves this problem and also allows the exact value of N to be captured. It is possible to declare types like mat(N,N) and even vec(2*N) and mat(2,N); the type definitions currently allow simple linear arithmetic (for example, 2*N-1).
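
For example, a vector-duplication function could be declared with a size-arithmetic return type (a sketch; the name duplicate is hypothetical):

function y : vec(2*N) = duplicate[N : int](x : vec(N))
    y = zeros(2*N)
    y[0..N-1] = x       % first copy
    y[N..2*N-1] = x     % second copy
endfunction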

Next, consider defining a matrix of type mat(2,N) and a cube of type cube(N,2*N,3):

A = mat(2,N)
B = cube(N,2*N,3)
Y = A[0,:]
C = B[:,:,1]

The compiler can now infer that Y has type vec(N) and that C has type mat(N,2*N). After specialization with N=2, C will have type mat(2,4) and the resulting variables enjoy the benefits of fixed sized data types.

Partially parametrized array types

Existing code (e.g., for image processing) often relies on hard-coded conventions with respect to the cube dimensions, for example, a number of color channels that is fixed to 3. To make code more easily extensible and also to facilitate compiler analysis (for loop parallelization, type inference), partially parametrized array types can be used.

Example:

img : cube(:,:,3) = imread("lena_big.tif")

indicates that img is a cube with three dimensions, for which the first two dimensions have unspecified length (:) and the last dimension has length 3. When a for-loop is written over the third dimension, either explicitly or implicitly via matrix slices:

img[m,n,:]

the compiler can determine the length numel(img[m,n,:])==3, so that the vector type vec3 can be assigned. In previous Quasar versions, the compiler would infer vec as the type (not being able to make any assumption about the size), possibly resulting in the use of dynamic kernel memory.
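
To illustrate (a sketch in the style of the earlier examples; the weights are the usual RGB-to-luma coefficients):

img : cube(:,:,3) = imread("lena_big.tif")
im_gray = zeros(size(img,0),size(img,1))

{!parallel for}
for m=0..size(img,0)-1
    for n=0..size(img,1)-1
        v = squeeze(img[m,n,:])   % inferred as vec3: registers, no dynamic kernel memory
        im_gray[m,n] = dotprod(v,[0.3,0.59,0.11])
    endfor
endfor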

For imread, the type inference of cube(:,:,3) is now automatic. For other functions, it may again be useful to parametrize generically on the number of color channels:

function [y] = circshift2D[N](x : cube(:,:,N), shift : vec(2))
    function [] = __kernel__ kernel (x : cube(:,:,N)'circular, y : cube(:,:,N), shift : vec(2), pos)
        y[...pos,:] = x[...(pos + shift),:]
    endfunction
    y = zeros(size(x))
    parallel_do(size(x), x, y, shift, kernel)
endfunction

Note that here, we are combining the technique from the previous section with partially parametrized array types (since we are using cube(:,:,N) instead of cube(:,:,3)). The body of the kernel function will then be implemented entirely with fixed-length vectors vec(N). The above function can be called with an arbitrary number of color channels, although this number needs to be known at compile-time:

circshift2D(randn(100,100,3), [4,4])
circshift2D(randn(100,100,9), [-2,-2])

Another example is a color transform for an image with unspecified number of channels:

function img_out = colortransform[N](C : mat(N,N), img : cube(:,:,N))
    img_out = zeros(size(img,0),size(img,1),N)
    {!parallel for}
    for m=0..size(img,0)-1
        for n=0..size(img,1)-1
            img_out[m,n,:] = C * squeeze(img[m,n,:])
        endfor
    endfor
endfunction

Here, calling the function with a mat3x3 color transform matrix ensures that the parameter N=3 is substituted at compile-time, leading to efficient calculation of the matrix-vector product C * squeeze(img[m,n,:]). The squeeze function is required here because img[m,n,:] returns a vector of dimensions 1x1xN, which is multiplied with matrix C. The squeeze function removes the singleton dimensions and converts the vector to Nx1x1 (or simply N, since trailing singleton dimensions do not matter in Quasar). For the assignment, this conversion is done automatically. In practice, the squeeze function has zero overhead; it is only required to keep the compiler happy.
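
A call could then look as follows (sketch; the matrix entries are illustrative RGB-to-YUV coefficients):

C : mat3x3 = [[0.299,0.587,0.114],[-0.147,-0.289,0.436],[0.615,-0.515,-0.100]]
img_out = colortransform(C, imread("lena_big.tif"))   % N=3 is deduced from C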

Again, the type inference of the following builtin functions has been updated to work with these parametrized types:

zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim,
cumsum, cumprod, cummin, cummax

such that repmat(img, [1,1,3]) : cube(:,:,9), etc. Another use case is summation along a given dimension:

A : cube(3,3,:)
B = sum(A,2)

Here the result B has the fixed sized matrix type mat(3,3).

Conclusion

In Quasar, fixed sized vectors, matrices and cubes offer the same code generation advantages as fixed sized arrays in C, but because they are actually “concealed” regular matrices with special type constraints, they can be used with the same conventions as regular matrices (e.g., by-reference parameter passing) and are therefore much more flexible. The type inference has been extended to automatically find matrices/cubes with fixed dimensions; once found, the code generation is adjusted appropriately. The fixed sized constraint then not only allows catching size errors at compile-time, but also ensures that loops with fixed-sized matrices are converted to efficient implementations that no longer rely on dynamic kernel memory.

New Feature in Quasar: Cooperative Thread Groups

The latest release of Quasar now offers support for CUDA 9 cooperative groups. Below, we describe how cooperative groups can be used from Quasar. Overall, cooperative threading brings some interesting optimization possibilities for Quasar kernel functions.

1 Synchronization granularity

The keyword syncthreads now accepts a parameter that indicates which threads are being synchronized. This allows finer-grained control over the synchronization.

Keyword                   Description
syncthreads(warp)         performs synchronization across the current (possibly diverged) warp (32 threads)
syncthreads(block)        performs synchronization across the current block
syncthreads(grid)         performs synchronization across the entire grid
syncthreads(multi_grid)   performs synchronization across the entire multi-grid (multi-GPU)
syncthreads(host)         synchronizes all host (CPU) and GPU threads


The first statement, syncthreads(warp), allows divergent threads to synchronize at any time (it is also useful in the context of Volta’s independent thread scheduling). syncthreads(block) is equivalent to syncthreads in previous versions of Quasar. The grid synchronization primitive syncthreads(grid) is particularly interesting: it is a feature of CUDA 9 that was not available before, and it allows placing barriers inside a kernel function that synchronize all blocks. The following function:

function y = gaussian_filter_separable(x, fc, n)
    function [] = __kernel__ gaussian_filter_hor(x : cube, y : cube, fc : vec, n : int, pos : vec3)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [0,i-n,0]] * fc[i]
        endfor
        y[pos] = sum
    endfunction
    function [] = __kernel__ gaussian_filter_ver(x : cube, y : cube, fc : vec, n : int, pos : vec3)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [i-n,0,0]] * fc[i]
        end
        y[pos] = sum
    end

    z = uninit(size(x))
    y = uninit(size(x))
    parallel_do (size(y), x, z, fc, n, gaussian_filter_hor)
    parallel_do (size(y), z, y, fc, n, gaussian_filter_ver)
endfunction

can now be simplified to:

function y = gaussian_filter_separable(x, fc, n)
    function [] = __kernel__ gaussian_filter_separable(x : cube, y : cube, z : cube, fc : vec, n : int, pos : vec3)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [0,i-n,0]] * fc[i]
        endfor
        z[pos] = sum
        syncthreads(grid)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + z[pos + [i-n,0,0]] * fc[i]
        end
        y[pos] = sum
    endfunction
    z = uninit(size(x))
    y = uninit(size(x))
    parallel_do (size(y), x, y, z, fc, n, gaussian_filter_separable)
endfunction

The advantage is not only the improved readability of the code; the number of kernel function calls can also be reduced, which further increases the performance. Note: this feature requires at least a Pascal GPU.

2 Interwarp communication

New special kernel function parameters (similar to blkpos, pos etc.) will be added to control the GPU threads.

Parameter           Type           Description
coalesced_threads   thread_block   a thread block of coalesced threads
this_thread_block   thread_block   describes the current thread block
this_grid           thread_block   describes the current grid
this_multi_grid     thread_block   describes the current multi-GPU grid


The new thread_block class will have the following methods.

Method                            Description
sync()                            Synchronizes all threads within the thread block
partition(size : int)             Allows partitioning a thread block into smaller blocks
shfl(var, src_thread_idx : int)   Direct copy from another thread
shfl_up(var, delta : int)         Direct copy from another thread, with the index specified relative to the current thread
shfl_down(var, delta : int)       Direct copy from another thread, with the index specified relative to the current thread
shfl_xor(var, mask : int)         Direct copy from another thread, with the index specified by a XOR relative to the current thread index
all(predicate)                    Returns true if the predicate evaluates to non-zero for all threads within the thread block
any(predicate)                    Returns true if the predicate evaluates to non-zero for any thread within the thread block
ballot(predicate)                 Evaluates the predicate for all threads within the thread block and returns a mask in which every bit corresponds to the predicate of one thread
match_any(value)                  Returns a mask of all threads that have the same value
match_all(value)                  Returns a mask of all threads if all threads share the same value; otherwise returns 0


In principle, the above functions allow threads to communicate with each other. The shuffle operations take values from other active threads (active meaning not disabled due to thread divergence). all, any, ballot, match_any and match_all allow determining whether the threads have reached a given state.

The warp shuffle operations require at least a Kepler GPU and allow avoiding the use of shared memory (register access is faster than shared memory). This may again bring performance benefits for computationally intensive kernels such as convolutions and parallel reductions (sum, min, max, prod, etc.).

Using this functionality requires the CUDA target to be specified explicitly (i.e., the functionality cannot easily be simulated on the CPU). This may be achieved by placing the following code attribute inside the kernel: {!kernel target="nvidia_cuda"}. For CPU execution, a separate kernel needs to be written. Luckily, several of the warp shuffling optimizations can be integrated in the compiler optimization stages, so that only one single kernel needs to be written.
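
As an illustration, a warp-level sum reduction might be sketched as follows (hypothetical code: it assumes the coalesced_threads parameter and the shfl_down method introduced above, and a launch over a multiple of 32 elements; it has not been validated against a specific Quasar release):

function [] = __kernel__ warp_sum(x : vec, y : vec, pos : int, coalesced_threads : thread_block)
    {!kernel target="nvidia_cuda"}
    val = x[pos]
    % butterfly reduction within the warp: each step halves the active range
    val += coalesced_threads.shfl_down(val, 16)
    val += coalesced_threads.shfl_down(val, 8)
    val += coalesced_threads.shfl_down(val, 4)
    val += coalesced_threads.shfl_down(val, 2)
    val += coalesced_threads.shfl_down(val, 1)
    if mod(pos,32) == 0    % lane 0 now holds the warp total
        y[int(pos/32)] = val
    endif
endfunction

x = randn(1024)
y = zeros(32)   % one partial sum per warp
parallel_do(numel(x), x, y, warp_sum)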

The Three Function Types of Quasar from a Technical Perspective

To give programmers ultimate control over the acceleration of functions on compute devices (GPU, multi-core CPU), Quasar has three distinct function types:

  1. Host Functions: host functions are the default functions in Quasar. Depending on the compilation target, they are either interpreted or compiled to binary CPU code. Host functions run on the CPU and form the glue for the code that runs on other devices. When host functions contain for-loops, these loops are (when possible) automatically parallelized by the Quasar compiler; the parallelized version then runs on a multi-core CPU or GPU (in the background, a kernel function is generated). Within the Quasar Redshift IDE, host functions are interpreted, which also allows stepping through the code and setting breakpoints.
  2. Kernel Functions: given certain data dimensions, kernel functions specify what needs to be done to each element of the data. The task can be either a parallel task or a serial task. Correspondingly, kernel functions are launched either using the parallel_do function or using the serial_do function:
    function [] = __kernel__ filter(image, pos)
        ...
    endfunction
    parallel_do(size(image),image,filter)

    In the above example, the kernel function filter is applied in parallel to each pixel component of image. The first parameter of parallel_do/serial_do is always the data dimensions (i.e., the size of the data). The last parameter is the kernel function to be called, and in between are the parameters that are passed to the kernel function. pos is a special kernel function parameter that indicates the position within the data (several other special parameters exist, such as blkpos, blkdim, etc.; see the documentation for a detailed list).

    Kernel functions can call other kernel functions, but only through serial_do and/or parallel_do. The regular function calling syntax (a direct call, i.e., without serial_do/parallel_do) is not permitted on kernel functions. This is because kernel functions have additional infrastructure (e.g., shared memory, device-specific parameters) that would cause conflicts if kernel functions called each other directly.

  3. Device functions: in contrast to host functions, device functions (indicated with __device__) are intended to be natively compiled using the back-end compiler. They can be seen as auxiliary functions that can be called from kernel functions, device functions, or even host functions (note that a kernel function cannot call a host function). The compiler back-end may generate C++, CUDA or OpenCL code that is then further compiled to binary code to be executed on the CPU and/or GPU. Currently, apart from the CPU, there are no accelerators that can run a device function directly when called from a host function; this means that, for a GPU, the only way to call a device function is through a kernel function.
    function y = __device__ norm(vector : vec)
        y = sqrt(sum(vector.^2))
    endfunction

Functions in Quasar are first-class variables. Correspondingly, they have a data type. Below are some example variable declarations:

host : [(cube, cube) -> scalar]
filter : [__kernel__ (cube) -> ()]
norm : [__device__ (vec) -> scalar]

This allows passing functions to other functions, where the compiler can statically check the types. Additionally, to prevent functions from being accessed unintentionally, functions can be nested. For example, a gamma correction operation can be implemented using a host function containing a kernel function.

    function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
        function [] = __kernel__ my_kernel(gamma_fn : [__device__ scalar->scalar], x : cube, y : cube, pos : ivec2)    
           y[pos] = gamma_fn(x[pos])
        endfunction
        y = uninit(size(x)) % Allocation of uninitialized memory
        parallel_do(size(x), gamma_fn, x, y, my_kernel)
    endfunction

Because functions support closures (i.e., they capture the variable values of the surrounding context at the moment the function is defined), the above code can be simplified to:

    function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
        function [] = __kernel__ my_kernel(pos : ivec2)    
           y[pos] = gamma_fn(x[pos])
        endfunction
        y = uninit(size(x))
        parallel_do(size(x), my_kernel)
    endfunction

Or even in shorter form, using lambda expression syntax:

function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
    y = uninit(size(x))
    parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma_fn(x[pos]))
end

In many cases, the type of the variables can safely be omitted (e.g., pos in the above example) because the compiler can deduce it from the context.

Now, experienced GPU programmers may wonder about the performance cost of a function pointer call from a kernel function. The good news is that the function pointer call is avoided by automatic function specialization (somewhat similar to template instantiation in C++). With this mechanism, device functions passed via function parameters or closures can automatically be inlined into a kernel function.

function y = gamma_correction(x, gamma_fn)
    y = uninit(size(x))
    parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma_fn(x[pos]))
end
gamma = __device__ (x) -> 255*(x/255)^2.4
im_out = gamma_correction(im_in, gamma)

This causes the following kernel function to be generated under the hood:

function [] = __kernel__ opt__gamma_correction(pos:int,y:vec,x:vec)
      % Inlining of function 'gamma' with parameters 'x'=x[pos]
      y[pos]=(255*((x[pos]/255)^2.4))
endfunction

Another optimization was performed in the background: kernel flattening. Because each element of the data is accessed separately, the data is processed as a vector (using raster scanning) rather than as a matrix (or cube) representation. This entirely eliminates costly index calculations.

Summarizing, the three function types offer the flexibility to control on which target devices (CPU or GPU) certain code fragments run, while offering a unified programming interface: code written for the GPU can be executed on the CPU as well, triggering other target-specific optimizations.

Real-time High Dynamic Range Upconversion of Video

The following video demonstrates a real-time standard dynamic range (SDR) to high dynamic range (HDR) upconversion algorithm that was developed entirely in Quasar. The purpose of this algorithm is to display movies stored in SDR format on an HDR television. A user interface, also developed in Quasar, gives the user a few sliders to choose how the conversion is done (e.g., which specular highlights are boosted and to what extent).

"Raw upscaling" indicates direct scaling of the SDR to the HDR range, without additional processing of the video content. "Proposed method" refers to the SDR to HDR conversion algorithm.

(note: this video is played at a higher speed for viewing purposes)

Quasar Autonomous Vehicles Demo

The following demo illustrates the use of Quasar in autonomous driving applications. A lidar is mounted on an electric bus, together with two RGB cameras. The incoming video signals are fused and processed in real-time using algorithms developed in Quasar.

Autonomous vehicles making use of Quasar

Simultaneous localization and mapping (SLAM)

Martin Dimitrievski and his colleagues propose a novel real-time method for SLAM in autonomous vehicles. The environment is mapped using a probabilistic occupancy map model, and ego-motion is estimated within the same environment by using a feedback loop. Input data are provided by a rotating laser scanner as 3D measurements of the current environment, which are projected on the ground plane. The local ground plane is estimated in real-time from the actual point cloud data using a robust plane fitting scheme. The computed occupancy map is then registered against the previous map in order to estimate the translation and rotation of the vehicle. Experimental results demonstrate that the method produces high quality occupancy maps, and the measured translation and rotation errors of the trajectories are lower compared to other 6-degrees-of-freedom methods. The entire SLAM system runs on a mid-range GPU and keeps up with the data from the sensor, which leaves more computational power for the other tasks of the autonomous vehicle.

“Many of the Autonomous Vehicles sub-systems are massively parallel and this is where Quasar can speed things up. From pre-processing of LIDAR point cloud data to odometry, object detection, tracking and route planning, Quasar made all of these components possible to run on a mid-range GPU in real-time. When you are done prototyping, you can consult the profiler to easily spot any areas for improved execution of the code.” – ir. Martin Dimitrievski


Example: SLAM for autonomous vehicles

Reference

“Robust matching of occupancy maps for odometry in autonomous vehicles”, Martin Dimitrievski, David Van Hamme, Peter Veelaert, Wilfried Philips, in proceedings of the Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), 2016.

Acknowledgements

The work was financially supported by the Flanders Make ICON project 140647 “Environmental Modelling for automated Driving and Active Safety (EMDAS)”.

MRI reconstruction speedups up to x20 using GPUs

Magnetic Resonance Imaging (MRI)

MRI is a very powerful and safe medical diagnostic tool, but it is prohibitively expensive to use frequently. Hence, a technique that speeds up MRI acquisition would not only be helpful for patients, as it requires them to lie still for shorter periods of time, but would also be of great benefit to doctors, as it leads to a higher patient throughput and reduced susceptibility of the images to motion artifacts. Using Quasar, Jan Aelterman and his colleagues developed a reconstruction algorithm that handles acquisition speedup correctly.

Speeding up MRI is done by reducing the amount of acquired data. However, signal processing theory states that it is impossible to do this beyond the Nyquist limit without losing information. Still, it is possible to lose only superfluous image information; this is called compressive sensing (CS). The danger is that a naive reconstruction technique inadvertently corrupts the image while filling in the lost information.

Proposed Technique

An MRI image is constructed from so-called Fourier, or k-space, coefficients. Due to acceleration techniques, the number of Fourier coefficients is smaller than the number of image pixels, resulting in an infinite number of possible images that correspond with the acquired Fourier coefficients (an infinite number of ways to fill in the missing, superfluous information). Therefore, we impose an additional constraint: the reconstructed image is the one with the lowest possible number of shearlet coefficients. The shearlet transform can represent natural images with few coefficients, but not noise and corruptions; it is optimal in this respect. Hence, its use forces a noise- and corruption-free reconstruction. The optimization of the proposed problem requires iteratively applying the non-uniform fast Fourier transform (NUFFT), which after profiling turned out to be a major bottleneck. By accelerating the NUFFT on the GPU, we are able to obtain significant speedups.

COMPASS reconstruction experiment using 4 acquisition coils and 10% of the Nyquist sampling rate on a k-space spiral.

The implementation was done using the new programming language Quasar, which allows fast and hardware-agnostic development while still using the computational power of the GPU. Without requiring long development cycles, we were able to achieve speedups of up to a factor of 20 on an NVIDIA GeForce GTX 770. The speedup achieved by the GPU acceleration opens the path for new and innovative MRI reconstruction research: e.g., auto-calibration, 3D reconstruction, advanced regularization parameters, etc.

“It took experts using CUDA/C++ three months to implement our MRI reconstruction algorithm; a developer using Quasar for the very first time achieved the same numerical results at the same computational performance in less than a single development week.” – Dr. ir. Jan Aelterman

Example

Reference

“COMPASS: a joint framework for parallel imaging and compressive sensing in MRI”, Jan Aelterman, Quang Luong, Bart Goossens, Aleksandra Pizurica, Wilfried Philips, in proceedings of the IEEE International Conference on Image Processing (ICIP), 2010.

Focal Black & White Effect

A well-known Google Picasa effect is the Focal Black & White effect. This effect preserves the color within a focal region and converts pixels outside this region to grayscale.

The algorithm is surprisingly simple: it consists of calculating a weighting factor (that depends on the focal radius), converting the pixel RGB values at each position to grayscale, and calculating a weighted average of this grayscale value with the original RGB value. Let us see how this can be achieved in Quasar:

function [] = __kernel__ focus_bw(x, y, focus_spot, falloff, radius, pos)

    % Calculation of the weight
    p = (pos - focus_spot) / max(size(x,0..1))
    weight = exp(-1.0 / (0.05 + radius * (2 * dotprod(p,p)) ^ falloff))

    % Conversion to grayscale & averaging
    rgbval = x[pos[0],pos[1],0..2]
    grayval = dotprod(rgbval,[0.3,0.59,0.11])*[1,1,1]
    y[pos[0],pos[1],0..2] = lerp(grayval, rgbval, weight)

end

Code explanation

First, the kernel function focus_bw is defined. __kernel__ is a special function qualifier that identifies a kernel function (similar to OpenCL’s __kernel or kernel qualifier). Kernel functions are natively compiled to any target architecture that you have in mind: this can be multi-core CPU x86/64 ELF, or even ARM with Neon instructions, up to NVidia PTX code. Next, a parameter list follows; the types of the parameters are (in this case) not specified.

Note the special parameter pos. In Quasar, pos is a parameter name that is reserved for kernel functions to obtain the current position in the image.

The kernel function contains two major blocks:

  • In the weight calculation step, first the position relative to the focus spot coordinates is computed. This relative position is normalized by dividing by the maximum size in the first two dimensions (note that Quasar uses base-0 indexing), i.e., the maximum of the height and the width of the image. Next, the weight is computed so as to be inversely proportional to the distance to the focal spot. A special built-in function dotprod, which can also be found in high-level shader languages (HLSL/GLSL) and which calculates the dot product between two vectors, is used for this purpose.
  • For extracting the RGB value at position pos, we use a matrix slice indexer: 0..2 constructs a vector of length 3 (actually [0,1,2]), and this vector is used for indexing. In fact:
    x[pos[0],pos[1],0..2] = [x[pos[0],pos[1],0],x[pos[0],pos[1],1],x[pos[0],pos[1],2]]
    

    Which form do you prefer, the left-hand side or the right-hand side? You can choose. Note that it is not possible to write x[pos[0..1],0..2], because this expression would construct a matrix of size 2 x 3.

  • The gray value is calculated by taking the dot product of [0.3,0.59,0.11] with the original RGB value. Finally, the gray value is mixed with the original RGB value using the lerp “linear interpolation” function. In fact, lerp is nothing more than the function:
    lerp = (x,y,a) -> (1-a) * x + a * y
    

    The resulting RGB value is written to the output image y. That’s it!

Finally, we still need to call the kernel function. For this, we use the parallel_do construct:

img_in = imread("quasar.jpg")
img_out = zeros(size(img_in))

parallel_do(size(img_out,0..1),img_in,img_out,[256,128],0.5,10,focus_bw)
imshow(img_out)

First, an input image img_in is loaded using the function imread “image read”. Then, an output image is allocated with the same size as the input image.

The parallel_do function is called with some parameters. The first parameter specifies the dimensions of the “work items” that can run in parallel. Here, each pixel of the image can be processed in parallel; hence, the dimensions are the size (i.e., height × width) of the output image. The following parameters are argument values that are passed to the kernel function and that are declared in the kernel function definition. Finally, the kernel function to be called is passed.

Note that, in contrast to scripting languages that are dynamically typed, the Quasar language is (mostly) statically typed and the Quasar compiler performs type inference in order to derive the data types of all the parameters. This is done based on the surrounding context. Here, Quasar will find out that img_in is a cube data type (a 3D array) and it will derive all other missing data types based on that. Consequently, efficient parallel code can be generated in a manner that is independent of the underlying platform.

Now: the complete code again:

function [] = __kernel__ focus_bw(x, y, focus_spot, falloff, radius, pos)  
    p = (pos - focus_spot) / max(size(x,0..1))
    weight = exp(-1.0 / (0.05 + radius * (2 * dotprod(p,p)) ^ falloff))
    rgbval = x[pos[0],pos[1],0..2]
    grayval = dotprod(rgbval,[0.3,0.59,0.11])*[1,1,1]
    y[pos[0],pos[1],0..2] = lerp(grayval, rgbval, weight)                    
end

img_in = imread("flowers.jpg")
img_out = zeros(size(img_in))

parallel_do(size(img_out,0..1),img_in,img_out,[256,128],0.5,10,focus_bw)
imshow(img_out)

Example

With eleven lines of code, you have a beautifully shining Focal Black & White effect:

focal-bnw-effect

Overview of some new features in Quasar

This document lists a number of new features that were introduced in Quasar in January 2014.

Object-oriented programming

The implementation of object-oriented programming in Quasar is far from complete; however, there are a number of new concepts:

  1. Unification of static and dynamic classes: Before, there existed static class types (type myclass : {mutable} class) and dynamic object types (myobj = object()). In many cases, the set of properties (and corresponding types) for object() is known in advance. To enjoy the advantages of type inference, there are now also dynamic class types:
    type Bird : dynamic class
        name : string
        color : vec3
    end 

    The dynamic class types are similar to classes in Python. At run-time, it is possible to add fields or methods:

    bird = Bird()
    bird.position = [0, 0, 10]
    bird.speed = [1, 1, 0]
    bird.is_flying = false
    bird.start_flying = () -> bird.is_flying = true

    Alternatively, member functions can be implemented statically (similar to mutable or immutable classes):

    function [] = start_flying(self : Bird)
        self.is_flying = true
    end

    Dynamic classes are also useful for interoperability with other languages, particularly when the program is run within the Quasar interpreter. The dynamic classes implement Mono/.NET dynamic typing, which means that imported libraries (e.g., through import "lib.dll") can now use and inspect the object properties more easily.

    Dynamic classes are also frequently used by the UI library (Quasar.UI.dll). Thanks to the static typing for the predefined members, efficient code can be generated.

    One limitation is that dynamic classes cannot be used from within __kernel__ or __device__ functions. As a compensation, the dynamic classes are also a bit lighter (in terms of run-time overhead), because there is no multi-device (CPU/GPU/…) management overhead. It is known a priori that the dynamic objects will “live” in the CPU memory.

    Also see Github issue #88 for some earlier thoughts.

  2. Parametric types: In earlier versions of Quasar, generic types could be obtained by not specifying the types of the members of a class:
    type stack : mutable class
        tab
        pointer
    end

    However, this limits the type inference, because the compiler cannot make any assumptions w.r.t. the type of tab or pointer. When objects of the type stack are used within a for-loop, the automatic loop parallelizer will complain that insufficient information is available on the types of tab and pointer.

    To solve this issue, types can now be parametric:

    type stack[T] : mutable class
        tab : vec[T]
        pointer : int
    end

    An object of the type stack can then be constructed as follows:

    obj = stack[int]()
    obj = stack[stack[cscalar]]()

    Parametric classes are similar to template classes in C++. For the Quasar back-ends, the implementation of parametric types is completely analogous to C++: for each instantiation of the parametric type, a struct is generated.

    It is also possible to define methods for parametric classes:

    function [] = __device__ push[T](self : stack[T], item : T)
        cnt = (self.pointer += 1) % atomic add for thread safety
        self.tab[cnt - 1] = item
    end

    Methods for parametric classes can be __device__ functions as well, so that they can be used on both the CPU and the GPU. In the future, this will allow us to create thread-safe and lock-free implementations of common data types, such as sets, lists, stacks, dictionaries etc. within Quasar.
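
    Analogously, a pop method could be sketched as follows (hypothetical code; empty-stack handling is omitted and it is assumed that -= is atomic, like += above):

    function item : T = __device__ pop[T](self : stack[T])
        cnt = (self.pointer -= 1)  % atomic decrement (assumed)
        item = self.tab[cnt]
    end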

    The internal implementation of parametric types and methods in Quasar (i.e. the runtime) uses a combination of erasure and reification.

  3. Inheritance: Inherited classes can be defined as follows:
    type bird : class
        name : string
        color : vec3
    end
    
    type duck : bird
        ...
    end 

    Inheritance is allowed on all three class types (mutable, immutable and dynamic).

    Note: multiple inheritance is currently not supported (multiple inheritance has the problem that special “precedence rules” are required to determine which method is used when multiple base classes define a certain method; in a dynamic context, this would create substantial overhead).

  4. Constructors: Defining a constructor is based on the same pattern that we used to define methods. For the above stack class, we have:
    % Default constructor
    function y = stack[T]()
        y = stack[T](tab:=vec[T](100), pointer:=0)
    end
    
    % Constructor with int parameter
    function y = stack[T](capacity : int)
        y = stack[T](tab:=vec[T](capacity), pointer:=0)
    end
    
    % Constructor with vec[T] parameter
    function y = stack[T](items : vec[T])
        y = stack[T](tab:=copy(items), pointer:=0)
    end

    Note that the constructor itself creates an instance of the type, rather than this being done automatically. Consequently, it is possible to return a null value as well.

    function y : ^stack[T] = stack[T](capacity : int)
        if capacity > 1024
            y = null % Capacity too large, no can do...
        else
            y = stack[T](tab:=vec[T](capacity), pointer:=0)
        endif
    end

    In C++ / Java this is not possible: the constructor always returns the this-object. This is often seen as a disadvantage.

    A constructor that is intended to be used on the GPU (or CPU in native mode), can then simply be defined by adding the __device__ modifier:

    function y = __device__ stack[T](items : vec[T])
        y = stack[T](tab:=copy(items), pointer:=0)
    end

    Note #1: instead of stack[T](), we could have used any other name, such as make_stack[T](). Using the type name to identify the constructor has two advantages:

    • the compiler will know that this method is intended to be used to create objects of this class
    • non-uniformity (new_stack[T](), make_stack[T](), create_stack()…) is avoided.

    Note #2: there are no destructors (yet). Because of the automatic memory management, this is not a big issue right now.

Type inference enhancements

  1. Looking ‘through’ functions (type reconstruction): In earlier releases, the compiler could not determine the return types of functions very well. This could lead to some problems with the automatic loop parallelizer:
    function y = imfilter(x, kernel)
        ...
    end  % Warning - type of y unknown
    
    y = imfilter(imread("x.png")[:,:,1], ones(3,3)/9)  % (filter mask added for completeness)
    assert(type(y,"mat"))  % Gives compilation error!

    Here, the compiler cannot determine the type of y, even though it is known that imread("x.png")[:,:,1] is a matrix.

    In the newest version, the compiler attempts to perform type inference for the imfilter function, knowing the types of the arguments at the call site. This does not allow determining the return type of imfilter in general, but it does for this specific case.

    Note that type reconstruction can create some additional burden for the compiler (especially when the function contains a lot of calls that require recursive type reconstruction). However, type reconstruction is only used when the type of at least one of the output parameters of a function could not be determined.

  2. Members of dynamic objects: The members of many dynamic objects (e.g., qform, qplot) are now statically typed. This also greatly improves the type inference in a number of places.

High-level operations inside kernel functions

Automatic memory management on the computation device is a new feature that greatly improves the expressiveness of Quasar programs. Typically, the programmer intends to use (non-fixed-length) vector or matrix expressions within a for-loop (or a kernel function). Up till now, this resulted in a compilation error (“function cannot be used within the context of a kernel function” or “loop parallelization not possible because of function XX”). The transparent handling of vector or matrix expressions within kernel functions requires some special (and sophisticated) handling on the Quasar compiler and runtime sides. In particular, what is needed is dynamic kernel memory: memory that is allocated on the GPU (or CPU) during the operation of the kernel. The dynamic memory is disposed (freed) either when the kernel function terminates or at a later point.

There are a few use cases for dynamic kernel memory:

  • When the algorithm requires processing many small-sized (3x3) to medium-sized (e.g., 64x64) matrices, for example a kernel function that performs matrix operations for every pixel in the image. The size of the matrices may or may not be known in advance.
  • Efficient handling of multivariate functions that are applied to (non-overlapping or overlapping) image blocks.
  • When the algorithm works with dynamic data structures such as linked lists or trees, it is often necessary to allocate “nodes” on the fly.
  • To use some sort of “scratch” memory that does not fit into the GPU shared memory (note: the GPU shared memory is 32K, but this needs to be shared between all threads – for 1024 threads this gives 32 bytes of private memory per thread). Dynamic memory does not have such a stringent limitation. Moreover, dynamic memory is not shared between threads and is disposed either 1) immediately when the memory is not needed anymore or 2) when a GPU/CPU thread exits. Correspondingly, when 1024 threads each use 32K, this requires less than 32MB, because the threads are logically, but not physically, in parallel.

In all these cases, dynamic memory can be used, simply by calling the zeros, ones, eye or uninit functions. One may also use slicing operators (A[0..9, 2]) in order to extract a sub-matrix. The slicing operations then take the current boundary access mode (e.g. mirroring, circular) into account.

Examples

The following program transposes 16x16 blocks of an image, creating a cool tiling effect. Firstly, a kernel function version is given and secondly a loop version. Both versions are equivalent: in fact, the second version is internally converted to the first version.

Kernel version

function [] = __kernel__ kernel (x : mat, y : mat, B : int, pos : ivec2)
    r1 = pos[0]*B..pos[0]*B+B-1   % creates a dynamically allocated vector
    r2 = pos[1]*B..pos[1]*B+B-1   % creates a dynamically allocated vector

    y[r1, r2] = transpose(x[r1, r2]) % matrix transpose 
                                     % creates a dynamically allocated vector 
end

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size    
parallel_do(size(x,0..1) / B,x,y,B,kernel)

Loop version

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size

#pragma force_parallel
for m = 0..B..size(x,0)-1
    for n = 0..B..size(x,1)-1
        A = x[m..m+B-1,n..n+B-1]   % creates a dynamically allocated vector
        y[m..m+B-1,n..n+B-1] = transpose(A)   % matrix transpose  
    end
end

Memory models

To accommodate the widest range of algorithms, two memory models are currently provided (more may be added in the future).

  1. Concurrent memory model: In the concurrent memory model, the computation device (e.g., GPU) autonomously manages a separate memory heap that is reserved for dynamic objects. The size of the heap can be configured in Quasar and is typically 32MB. The concurrent memory model is extremely efficient when all threads (e.g., > 512) request dynamic memory at the same time. The memory allocation is done by a specialized parallel allocation algorithm that differs significantly from traditional sequential allocators.

    For efficiency, there are some internal limitations on the size of the allocated blocks:

    • The minimum size is 1024 bytes (everything smaller is rounded up to 1024 bytes)
    • The maximum size is 32768 bytes

    For larger allocations, please see the cooperative memory model below. The minimum size also limits the number of objects that can be allocated.
  2. Cooperative memory model: In the cooperative memory model, the kernel function requests memory directly from the Quasar allocator. This way, there are no limitations on the size of the allocated memory. Also, the allocated memory is automatically garbage collected.

    Because the GPU cannot launch callbacks to the CPU, this memory model requires the kernel function to be executed on the CPU.

    Advantages:

    • The maximum block size and the total amount of allocated memory only depend on the available system resources.

    Limitations:

    • The Quasar memory allocator uses locking (to a limited extent), so simultaneous memory allocations on all processor cores may be expensive.
    • The memory is disposed only when the kernel function exits (this limits the number of callbacks from kernel function code to host code). Suppose that you have a 1024x1024 grayscale image that allocates 256 bytes per thread: this would require 256MB of RAM! In this case, you should use the concurrent memory model (which does not have this problem, because memory is freed as soon as it is no longer needed).

Selection between the memory models.

Features

  • Device functions can also use dynamic memory. The functions may even return objects that are dynamically allocated.
  • The following built-in functions are supported and can now be used from within kernel and device functions:
    zeros, czeros, ones, uninit,
    eye, copy, reshape, repmat, shuffledims, seq, linspace,
    real, imag, complex,
    mathematical functions
    matrix/matrix multiplication
    matrix/vector multiplication

Performance considerations

  • Global memory access: code relying on dynamic memory may be slow (for linear filters on a GPU: 4x-8x slower), not because of the allocation algorithms, but because of the global memory accesses. However, it all depends on what you want to do: for example, for non-overlapping block-based processing (e.g., blocks of a fixed size), dynamic kernel memory is an excellent choice.
  • Static vs. dynamic allocation: when the size of the matrices is known in advance, static allocation (e.g., outside the kernel function) may be used as well. The dynamic allocation approach relieves the programmer from writing code to pre-allocate memory and to calculate its size as a function of the data dimensions. The cost of calling the functions uninit, zeros is negligible compared to the global memory access times (one memory allocation is comparable to 4-8 memory accesses on average, and 16-32 bytes is still small compared to the typical sizes of allocated memory blocks). Because dynamic memory is disposed whenever possible when a particular thread exits, the maximum amount of dynamic memory in use at any time is much smaller than the amount of memory required for pre-allocation.
  • Use vecX types for vectors of length 2 to 16 whenever your algorithm allows it. This completely avoids using global memory, by using registers instead. Once a vector of length 17 or more is created, the vector is allocated as dynamic kernel memory.
  • Avoid writing code that leads to thread divergence: in CUDA, instructions execute in warps of 32 threads, and a group of 32 threads must execute every instruction together. Control flow instructions (if, match, repeat, while) can negatively affect the performance by causing threads of the same warp to diverge, that is, to follow different execution paths. The different execution paths must then be serialized, because all of the threads of a warp share a program counter; consequently, the total number of instructions executed for this warp is increased. When all the different execution paths have completed, the threads converge back to the same execution path.

To obtain best performance in cases where the control flow depends on the block position (blkpos), the controlling condition should be written so as to minimize the number of divergent warps.
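
To make this concrete, the following sketch contrasts two branching patterns (illustrative kernels; the names are hypothetical):

function [] = __kernel__ divergent (y : vec, pos : int)
    if mod(pos,2) == 0           % neighboring threads take different paths: the warp diverges
        y[pos] = 1.0
    else
        y[pos] = -1.0
    endif
endfunction

function [] = __kernel__ coherent (y : vec, pos : int)
    if mod(int(pos/32),2) == 0   % whole warps take the same path: no divergence
        y[pos] = 1.0
    else
        y[pos] = -1.0
    endif
endfunction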

Nested parallelism

It is desirable to specify parallelism in all stages of the computation. For example, within a parallel loop, it must be possible to declare another parallel loop, etc. Up till now, parallel loops could only be placed at the top level (in a host function), and multiple levels of parallelism had to be expressed using multi-dimensional perfect for-loops. A new feature is that __kernel__ and __device__ functions can now also use the parallel_do (and serial_do) functions. The top-level host function may for example spawn 8 threads, each of which spawns another 64 threads (after some algorithm-specific initialization steps). There are several advantages to this approach:

  • More flexibility in expressing the algorithms
  • The nested kernel functions are (or will be) mapped onto CUDA dynamic parallelism on Kepler devices such as the GTX 780, GTX Titan. (Note: requires one of these cards to be effective).
  • When a parallel_do is placed in a __device__ function that is called directly from the host code (CPU computation device), the parallel_do will be accelerated using OpenMP.
  • The high-level matrix operations from the previous section are automatically taking advantage of the nested parallelism.

Notes:

  • There is no guarantee that the CPU/GPU will effectively perform the nested operations in parallel. However, future GPUs may be expected to become more efficient in handling parallelism on different levels.

Limitations:

  • Nested kernel functions may not use shared memory (they can access the shared memory through the calling function, however), and they may also not use thread synchronization.
  • Currently only one built-in parameter for the nested kernel functions is supported: pos (and not blkpos, blkidx or blkdim).
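
A minimal sketch of a nested launch (hypothetical kernels; in line with the limitations above, the nested kernel only uses the pos parameter):

function [] = __kernel__ inner (y : mat, m : int, pos : int)
    y[m,pos] = m + pos
endfunction

function [] = __kernel__ outer (y : mat, pos : int)
    % every outer thread spawns a nested parallel loop over one row
    parallel_do(size(y,1), y, pos, inner)
endfunction

y = zeros(8,64)
parallel_do(size(y,0), y, outer)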

Example

The following program showcases the nested parallelism, the improved type inference and the automatic usage of dynamic kernel memory:

function y = gamma_correction(x, gamma)
    y = uninit(size(x))

    % Note: #pragma force_parallel is required here, otherwise
    % the compiler will just use dynamic typing. 
    #pragma force_parallel
    for m=0..size(x,0)-1
        for n=0..size(x,1)-1
            for k=0..size(x,2)-1
                y[m,n,k] = 512 * (x[m,n,k]/512)^gamma
            end
        end
    end
end

function [] = main()
    x = imread("lena_big.tif")
    y = gamma_correction(x, 0.5)

    #pragma force_parallel
    for m=0..size(y,0)-1
        for c=0..size(y,2)-1
            row = y[m,:,c]
            y[m,:,c] = row[numel(row)-1..-1..0]
        end
    end

    imshow(y)
end

The pragmas are added here just to illustrate that the corresponding loops need to be parallelized; however, using them is optional. Note:

  • The lack of explicit typing in the complete program (even the types of the function parameters are left unspecified)
  • The return type of gamma_correction is also not specified; however, the compiler is able to deduce that type(y,"cube") holds.
  • The second for-loop (inside the main function) uses slicing operations (: and ..). The assignment row=y[m,:,c] leads to dynamic kernel memory allocation.
  • The vector operations inside the second for-loop automatically express nested parallelism and can be mapped onto CUDA dynamic parallelism.