## Construction of Cell matrices

In Quasar, it is possible to construct cell vectors/matrices, similar to MATLAB:

``A = `1,2,3j,zeros(4,4)´``

(MATLAB syntax uses braces, as in `{1,2,3j,zeros(4,4)}`).

Now, I have found that there are **two problems** with the cell construction syntax:

- The prime (´) cannot be represented in the ANSI character set and requires UTF-8 encoding. The Redshift code editor uses UTF-8 by default, but problems can occur when editing Quasar code with other editors.
- The prime (´) is not present on QWERTY keyboards (in contrast to the Belgian AZERTY keyboard). I suspect this is one of the reasons that the cell construction syntax is rarely used.
- Even the documentation system has problems with the prime symbol, which may be why you cannot see it right now. In other words, it is an *evil* symbol, and all evil should be extinct.

To solve these problems, the Quasar parser now also accepts an apostrophe `'` (ASCII character 39, hex 27h) as the closing character:

``A = `1,2,3j,zeros(4,4)'``

Because the apostrophe character exists in ANSI and on QWERTY keyboards, the use of cell matrix constructions is greatly simplified.

**Note**: the old-fashioned alternative was to construct cell matrices using the function `cell`, or `vec[vec]`, `vec[mat]`, `vec[cube]`, … For example:

```
A = cell(4)
A[0] = 1j
A[1] = 2j
A[2] = 3j
A[3] = 4j
```

Note that this notation is not very elegant compared to ``A = `1j,2j,3j,4j'``. It also does not allow the compiler to fully determine the type of `A` (the compiler will find `type(A) == "vec[??]"` rather than `type(A) == "cvec"`). In the following section, we discuss type inference in more detail.

## Type inference

Another new feature of the compiler is that it attempts to infer the type of cell matrices. In earlier versions, all cell matrices defined with the above syntax had type `vec[??]`. Now this has changed, as illustrated by the following example:

```
a = `[1, 2],[1, 2, 3]'
print type(a) % Prints vec[vec[int]]
b = `(x->2*x), (x->3*x), (x->4*x)'
print type(b) % Prints [ [??->??] ]
c = ` `[1, 2],[1,2]',`[1, 2, 3],[4, 5, 6]' '
print type(c) % Prints vec[vec[vec[int]]]
d = ` [ [2, 1], [1, 2] ], [ [4, 3], [3, 4] ]'
print type(d) % Prints vec[mat]
e = `(x:int->2*x), (x:int->3*x), (x:int->4*x)'
print type(e) % Prints vec[ [int->int] ]
```

This allows cell matrices that are constructed with the above syntax to be used from kernel functions. A simple example:

```
d = `eye(4), ones(4,4)'
parallel_do(size(d[0]), d,
    __kernel__ (d : vec[mat], pos : ivec2) -> d[0][pos] += d[1][pos])
print d[0]
```

The output is:

```
[ [2,1,1,1],
  [1,2,1,1],
  [1,1,2,1],
  [1,1,1,2] ]
```

## Dynamic Memory Allocation on CPU/GPU

In some algorithms, it is desirable to dynamically allocate memory inside `__kernel__` or `__device__` functions, for example:

- When the algorithm processes blocks of data for which the maximum size is not known in advance.
- When the amount of memory that an individual thread uses is too large to fit in the shared memory or in the registers. The shared memory of the GPU typically consists of 32K, which has to be shared between all threads in one block. For single-precision floating point vectors (`vec`) or matrices (`mat`) and 1024 threads per block, the maximum amount of shared memory per thread is 32K/(1024*4) = 8 elements. The size of the register memory is of the same order: 32K for CUDA compute architecture 2.0.

## Example: brute-force median filter

So, suppose that we want to calculate a *brute-force median filter* for an image (note that there exist much more efficient algorithms based on image histograms; see `immedfilt.q`). The filter could be implemented as follows:


- We extract a local window per pixel in the image (for example of size `13x13`).
- The local window is then passed to a generic median function, which sorts the intensities in the local window and returns the median.

The problem is that there may not be enough register memory for holding a local window of this size: 1024 threads x 13 x 13 x 4 bytes = 692 KB!

The solution is to use a new Quasar runtime & compiler feature: *dynamic kernel memory*. In practice, this is actually very simple: first, ensure that the compiler setting "kernel dynamic memory support" is enabled. Second, matrices can then be allocated through the regular matrix functions `zeros`, `complex(zeros(.))`, `uninit` and `ones`.

For the median filter, the implementation could be as follows:

```
% Function: median
% Computes the median of an array of numbers
function y = __device__ median(x : vec)
% to be completed
end
% Function: immedfilt_kernel
% Naive implementation of a median filter on images
function y = __kernel__ immedfilt_kernel(x : mat, y : mat, W : int, pos : ivec2)
% Allocate dynamic memory (note that the size depends on W,
% which is a parameter for this function)
r = zeros((W*2)^2)
for m=-W..W-1
for n=-W..W-1
r[(W+m)*(2*W)+W+n] = x[pos+[m,n]]
end
end
% Compute the median of the elements in the vector r
y[pos] = median(r)
end
```

For `W=4`, this algorithm is illustrated in the figure below:

**Figure 1**. Dynamic memory allocation inside a kernel function.

## Parallel memory allocation algorithm

To support dynamic memory, a special parallel memory allocator was designed. The allocator has the following properties:

- The allocation/disposal is *distributed in space* and does not use lists/free-lists of any sort.
- The allocation algorithm is designed for speed and correctness.

To accomplish such a design, a number of limitations were needed:

- The minimal memory block that can be allocated is 1024 bytes. If the requested block size is smaller, the size is rounded up to 1024 bytes.
- When you try to allocate a block whose size is not a power-of-two multiple of 1024 bytes (i.e. `1024*2^N` with `N` integer), the size is rounded up to the next power-of-two multiple of 1024 bytes.
- The maximal memory block that can be allocated is 32768 bytes (= 2^15 bytes). Taking into account that this can be done per pixel in an image, this is actually quite a lot!
- The total amount of memory that can be allocated from inside a kernel function is also limited (typically 16 MB). This restriction mainly serves to ensure program correctness and to keep memory free for other processes in the system.
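To make the rounding rules concrete, here is a small sketch (the requested sizes assume 4-byte scalars; the variable names are purely illustrative):

```
% Inside a kernel function, with dynamic kernel memory enabled:
a = zeros(100)     % requests  400 bytes -> rounded up to 1024 bytes
b = zeros(700)     % requests 2800 bytes -> rounded up to 4096 bytes (1024*2^2)
c = zeros(8192)    % requests 32768 bytes -> exactly the maximal block size
```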

It is possible to compute an upper bound for the amount of memory that will be allocated at a given point in time. Suppose that we have a kernel function that allocates a cube of size `M*N*K`, then:

`max_memory = NUM_MULTIPROC * MAX_RESIDENT_BLOCKS * prod(blkdim) * 4*M*N*K`

where `prod(blkdim)` is the number of elements in one block, `MAX_RESIDENT_BLOCKS` is the maximal number of resident blocks per multiprocessor and `NUM_MULTIPROC` is the number of multiprocessors.

So, suppose that we allocate a matrix of size `8x8` on a GeForce 660 Ti; then:

`max_memory = 5 * 16 * 512 * 4 * 8 * 8 = 10.4 MB`

This is still much smaller than what would be needed with pre-allocation (in which case the amount would depend on the image dimensions!).

## Comparison to CUDA malloc

CUDA has built-in `malloc(.)` and `free(.)` functions that can be called from device/kernel functions; however, after a few performance tests and after seeing warnings on the CUDA forums, I decided not to use them. This is the result of a comparison between the Quasar dynamic memory allocation algorithm and that of NVidia:

```
Granularity: 1024
Memory size: 33554432
Maximum block size: 32768
Start - test routine
Operation took 183.832443 msec [Quasar]
Operation took 1471.210693 msec [CUDA malloc]
Success
```

I obtained similar results for other tests. As you can see, the memory allocation is about **8 times** faster using the new approach than with the NVidia allocator.

## Why it is better to avoid dynamic memory

Even though the memory allocation is quite fast, to obtain the best performance, it is better to avoid dynamic memory:

- The main issue is that kernel functions using dynamic memory also require several read/write accesses to the global memory. Because dynamically allocated memory typically has a size of hundreds of KBs, the data will not fit into the cache (of size 16KB to 48KB). Correspondingly: *the cost of using dynamic memory lies in the associated global memory accesses!*
- Please note that the compiler-level handling of dynamic memory is currently in development. As long as the memory is "consumed" locally, as in the above example, i.e. not written to external data structures, this should not be a problem.

## Matrix Conditional Assignment

In MATLAB, it is fairly simple to assign to a subset of a matrix, for example, the values that satisfy a given condition. For example, saturation can be obtained as follows:

```
A[A < 0] = 0
A[A > 1] = 1
```

In Quasar, this can be achieved with:

`A = saturate(A)`

However, the situation can be more complex, and then there is no direct equivalent of the MATLAB syntax. For example,

`A[A > B] = C`

where `A`, `B` and `C` are all matrices. The trick is to define a reduction (now in `system.q`):

```
type matrix_type : [vec|mat|cube|cvec|cmat|ccube]
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a > b] = c) = (x += (c - x) .* (a > b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a < b] = c) = (x += (c - x) .* (a < b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a <= b] = c) = (x += (c - x) .* (a <= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a >= b] = c) = (x += (c - x) .* (a >= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a == b] = c) = (x += (c - x) .* (a == b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a != b] = c) = (x += (c - x) .* (a != b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a && b] = c) = (x += (c - x) .* (a && b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a || b] = c) = (x += (c - x) .* (a || b))
```

The first line defines a "general" matrix type, which is then used for the subsequent reductions. The reductions simply work on patterns of the form:

`x[a #op# b] = c`

and replace them by the appropriate Quasar expression. The last two reductions are a trick to get the conditional assignment working with boolean expressions as well, such as:

`A[A<-0.1 || A>0.1] = 5`

Note that, on the other hand, `B = A[A<-0.1]` will currently result in a runtime error (this syntax is not defined yet).
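To illustrate, the following sketch (assuming `system.q` is imported) shows how such a conditional assignment behaves:

```
import "system.q"

A = [[1.0, -0.5], [2.0, 0.0]]
B = zeros(2,2)
C = ones(2,2) * 9

% The reduction rewrites the next line into: A += (C - A) .* (A > B),
% so every element of A that exceeds the corresponding element of B
% is replaced by the corresponding element of C
A[A > B] = C
```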

## Debugging Quasar kernel functions

Recently, Quasar Redshift has been equipped with a parallel debugger for kernel functions. In the parallel debugging mode, the kernel function is emulated on a "virtual" CUDA-like device architecture, with device parameters taken from the GPU installed in your system. This permits device-independent debugging of kernel functions.

To enter the parallel debugging mode, it suffices to place a breakpoint inside a kernel function (or a device function) and to start the program. When the breakpoint is hit, Redshift jumps to the new *parallel debugger pane* (see below).

In the left pane, you see a visualization of the shared memory (through the variable `xr`), as well as the current position and block position variables (`pos` and `blkpos`, respectively). In the middle pane at the bottom, some information about the kernel function is given. The occupancy is an indicator of the performance of the kernel function (for example, a low occupancy may be caused by selecting block sizes that are too small). Occupancy alone, however, does not give a full picture: there exist kernel functions that are designed to achieve a low occupancy but offer very high throughput (see for example the Volkov GTC 2010 talk).

In the right pane, an overview of the scheduled blocks and active threads is given. For active threads, the barrier index is also given: barrier 0 corresponds to all statements before the first barrier, barrier 1 corresponds to statements between the first and second barrier, and so on. Using the context menu of the list box, it is possible to switch to another thread or to put a breakpoint at a given block.

The parallel debugger allows you to:

- Step through the kernel function (e.g. using F8 or shift-F8) [**code editor**]
- Place (conditional) breakpoints in kernel functions [**code editor**]
- Traverse different positions in the data array (or pixels in an image) [**parallel debugger pane**]
- Jump to the next thread/block [**parallel debugger pane**]
- See which threads are currently active/inactive [**parallel debugger pane**]
- Inspect the values of all variables that are used inside a kernel function [**variable watch**]
- Visualize the content of the shared memory [**variable watch**]
- Record videos of the kernel function in action [**variable watch**]

The statement evaluation order is not necessarily linear, especially when thread synchronization is used (through `syncthreads`). `syncthreads` places a barrier that all threads within a block must reach before execution continues past it.

Internally, kernel functions are interpreted on the CPU, in order to allow full inspection of all variables, placing breakpoints at arbitrary positions within a block, etc. Moreover, for clarity, the threads in between two barriers are executed *serially*. When a barrier (or the end of the kernel function) is met, the debugger switches to the next available thread.

The serial execution definitely makes it easy to follow what your program is doing, however in case of data races, it may also hide potential memory access conflicts (due to serialization). For this reason, there is also an option to “mimic parallel execution”, in which random thread context switches are made.

## Multi-level breaks in sequential loops

Sometimes, small language features can make a lot of difference (in terms of code readability, productivity etc.). In Quasar, multi-dimensional for-loops are quite common. Recently, I came across a missing feature for dealing with multi-dimensional loops.

Suppose we have a multi-dimensional for-loop, as in the following example:

```
for m=0..511
    for n=0..511
        im_out[m,n] = 255-im[m,n]
        if m==128
            break
        endif
        a = 4
    end
end
```

Suppose that we want to break out of both loops, as in the above code. This is useful for stopping the processing at a certain point. There is only one **caveat**: *the break statement only applies to the loop that immediately surrounds it*. In the above example, the processing of row 128 is simply stopped at column 0 (the loop over `n` is interrupted), but it is then resumed starting from row 129. Some programmers are not aware of this, and sometimes it can lead to less efficient code, as in the following example:

```
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break
        end
    end
end
```

Here we perform a sequential search to find the first matrix element for which `V[j,k] != 0`. When this matrix element is found, the search is stopped. However, because the `break` statement only stops the inner loop, the outer loop is still executed several times (potentially leading to a performance degradation).

### 1. Solution with extra variables

To make sure that we break outside the outer loop, we would have to introduce an extra variable:

```
break_outer = false
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break_outer = true
            break
        end
    end
    if break_outer
        break
    endif
end
```

It is clear that this approach is not very readable. The additional variable `break_outer` is also a bit problematic (in the worst case, if the compiler cannot filter it out, extra stack memory/registers will be required).

### 2. Encapsulation in a function

An obvious alternative is the use of a function:

```
function found = my_func()
    for j = 0..size(V,0)-1
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break
            end
        end
    end
end
found = my_func()
```

However, the use of a function is sometimes not desirable for this purpose. It also involves extra work, such as adding the input/output parameters and a function call.

### 3. New solution: labeling loops

To avoid the above problems, it is now *possible* to label the for-loops (as in e.g. Ada and Java):

```
outer_loop:
for j = 0..size(V,0)-1
    inner_loop:
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break outer_loop
        end
    end
end
```

Providing labels to for-loops is optional, i.e. you only have to do it when it is needed. The new syntax is also supported by the following finding in programming literature:

> In 1973 S. Rao Kosaraju refined the structured program theorem by proving that it's possible to avoid adding additional variables in structured programming, as long as arbitrary-depth, multi-level breaks from loops are allowed. [11]

Note that Quasar has no goto labels (and never will). The reasons are:

- Control flow blocks can always be used instead. Control flow blocks offer more visual cues which enhances the readability of the code.
- At the compiler-level, goto labels may make it more difficult to optimize certain operations (e.g. jumps to different scopes).

Remarks:

- This applies to the keyword `continue` as well.
- Labels can be applied to `for`, `repeat`…`until` and `while` loops.
- In the future, more compiler functionality may be added to make use of the loop labels. For example, it may be possible to indicate that multiple loops (with the specified names) must be merged.
- **It is not possible to break outside a parallel loop!** The reason is that the execution of the different threads is (usually) non-deterministic, hence using breaks in parallel loops would result in non-deterministic results.
- However, loop labels can be attached to both serial and parallel loops. A useful situation is an iterative algorithm with an inner/outer loop.
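For example, a labeled `continue` can skip the remainder of an outer iteration directly from within the inner loop (a sketch; `A` and `process` are hypothetical):

```
rows:
for i=0..size(A,0)-1
    for j=0..size(A,1)-1
        if A[i,j] < 0
            continue rows    % abandon the rest of row i, go to row i+1
        endif
        process(A[i,j])
    end
end
```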

## Functions in Quasar

Hurray! Today I found some of my old notes about Quasar, written about one year ago. Since I forget everything, I thought it could be useful to put it here.

This diagram is quite essential, if there are some elements you don’t fully understand, please have a look at the reference manual.

Summarized:

- Both `__kernel__` and `__device__` functions are low-level functions; they are natively compiled for CPU and/or GPU. This has the practical consequence that the functionality available to these functions is *restricted*. It is for example not possible to `print`, `load` or `save` information inside kernel or device functions.
- Host functions are high-level functions; typically they are interpreted (or, for Quasar EXEs, compiled using the just-in-time compiler).
- A kernel function is normally repeated for every element of a matrix. Kernel functions can only be called from host code (although with future support for CUDA 5.0 dynamic parallelism, this may change).
- A device function can be called from host code, in which case it is normally interpreted (if not *inlined*), or from other device/kernel functions, in which case it is natively compiled.

The distinction between these three types of functions is necessary to allow GPU programming. Furthermore, it provides a mechanism (to some extent) to balance the work between CPU and GPU. As a programmer, you know whether the code inside a function will run on the GPU or the CPU.
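The three function types can be sketched in a few lines (a minimal illustrative example, not taken from the original notes):

```
% device function: natively compiled when called from kernel code
function y = __device__ square(x : scalar)
    y = x * x
end

% kernel function: executed in parallel for every position pos
function [] = __kernel__ square_kernel(x : mat, y : mat, pos : ivec2)
    y[pos] = square(x[pos])
end

% host code: interpreted; launches the kernel over the whole matrix
x = ones(4,4)
y = zeros(4,4)
parallel_do(size(y), x, y, square_kernel)
```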

## Assertions in kernel code

From now on, it is possible to put assertions in a kernel function:

```
function [] = __kernel__ kernel(pos : ivec3)
    b = 2
    assert(b==3)
end
```

In this example, the assertion obviously fails. Quasar breaks with the following error message:

`(parallel_do) test_kernel - assertion failed: line 23`

Note that the assertion handling is implemented in CUDA using a C macro:

`#define assert(x) if(!(x)) __trap();`

Also see CUDA Error handling for more information about assertions and error handling.

## Matrix data types and type inference

Quasar is an array language; this means that the array types (`vec`, `mat` and `cube`) are primitive types with built-in support (in contrast with C/C++, where users have to define their own matrix classes).

The reason for the built-in support is of course that this enables easier mapping of Quasar programs to different parallel devices (GPU, …). Moreover, the user is forced to use one representation for its data (rather than using different class libraries, where it is necessary to wrap one matrix class into another matrix class).

On the other hand, by default Quasar abstracts numeric values into one data type: `scalar`. The type `scalar` just represents a scalar number; whether this is a floating point or fixed point number with 16/32/64-bit precision is implementation specific (currently, the Quasar runtime system only supports 32-bit and 64-bit floating point numbers).

## Type parameters

For efficiency reasons, there is also support for the integer data types `int`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`. (Please note that using 64-bit types can suffer from precision errors, because all the calculations are performed in `scalar` format.) To support matrices built of these types, the array types `vec`, `mat` and `cube` are parametric. For example:

- `vec[int8]` denotes a vector (1D array) of 8-bit signed integers.
- `cube[int]` denotes a cube (3D array) of signed integers (note: by default, `int` is 32-bit).
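For example, using the constructor syntax that also appears later in this section, the parametric types can be instantiated and checked directly:

```
A = vec[int8](16)            % vector of 16 signed 8-bit integers
B = cube[int](10)            % cube of 32-bit signed integers
assert(type(A, "vec[int8]"))
assert(type(B, "cube[int]"))
assert(type(B, "cube[??]"))  % element-type-agnostic check
```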

To simplify the types (and to reduce key strokes while programming), there are a number of **built-in** type aliases:

```
type vec : vec[scalar]      % real-valued vector
type cvec : vec[cscalar]    % complex-valued vector
type mat : mat[scalar]      % real-valued matrix
type cmat : mat[cscalar]    % complex-valued matrix
type cube : cube[scalar]    % real-valued cube
type ccube : cube[cscalar]  % complex-valued cube
```

Please note that these types are just aliases! For example, `cube` is just `cube[scalar]` and not `cube[something else]`:

```
a = cube[scalar](10)
assert(type(a, "cube")) % Successful
b = cube[int](10)
assert(type(b, "cube")) % Unsuccessful - compiler error
```

However, in case the intention is to check whether `a` or `b` is a 3D array regardless of the element type, the special `??` type can be used:

```
b = cube[int](10)
assert(type(b, "cube[??]")) % Successful
```

## Type inference

When the type is not specified (for example, for data that is read dynamically from a file using the `load("data.qd")` function), the default data type is `??`. This is a very generic type; every type comparison with `??` results in `TRUE`. For example:

```
assert(type(1i+1, '??'))
assert(type([1,2,3], '??'))
```

However, using variables of type `??` prevents the compiler from optimizing whole operations (for example, applying reductions or automatically generating kernel functions for for-loops). Therefore, it is generally a bad idea to have functions return variables of the unspecified type `??`, and correspondingly the compiler gives a warning message when this is the case.

Practically, the type inference starts from the matrix creation functions `zeros`, `ones`, `imread`, …, which have a built-in mechanism for deciding the type of the result (based on the parameters of the function).

For example:

- `A = zeros([1,1,4])` creates a vector of length 4 (`vec`).
- `B = zeros([2,3])` creates a matrix of dimensions `2 x 3` (`mat`).
- `C = imread("data.tif")` always creates a `cube`.

Note that the type inference also works when a variable is passed to the matrix creation functions:

`sz = [1,1,4]; A = zeros(sz)`

In this case, the compiler knows that `sz` is a constant vector; it keeps track of the value and uses it for determining the type of `zeros`.

However, the compiler cannot do this when the variable `sz` is passed as an argument to a function:

```
function A = create_data(sz)
    A = zeros(sz)
end
```

In this case, because the type of `sz` is unknown, the compiler cannot determine the type of `A` and will therefore use the default type `??`. For convenience, the compiler then also generates the warning message *"could not determine the type of output argument A"*. The solution is simply to specify the type of `sz`:

```
function A = create_data(sz : ivec2)
    A = zeros(sz)
end
```

This way, the compiler knows that `sz` is a vector of length 2, and can deduce the type of `A`, which is a matrix (`mat`).

## Summary

The type system can be summarized as follows. There are 6 categories of types:

- Primitive scalar types: `scalar`, `cscalar`, `int`, `int8`, …
- Matrix types: `vec`, `mat`, `cube`, with parametrized versions `vec[??]`, `mat[??]`, `cube[??]`.
- Classes: `type R : class` / `type T : mutable class`
- Function types: `[?? -> ??]`, `[(??,??)->(??,??)]`, … Device functions: `[__device__ ?? -> ??]`; kernel functions: `[__kernel__ ?? -> ??]`
- Individual types: `type`
- Type classes: `T : [scalar|mat|cube]`

Finally, different types can be combined to define new types.

Exercise:

- Figure out what the following type means: `type X : [vec[ [??->[int|mat|cube[??->??] ] | int -> ?? | __device__ mat->() ] | cscalar ]`

Just kidding ;-)

## Parallel reduction patterns

An often recurring programming idiom is the use of atomic operations for data aggregation (e.g. to calculate a sum). I noted this when inspecting the code from several Quasar users. In the most simple form, this idiom is as follows (called the *JDV variant*):

```
total = 0.0
#pragma force_parallel
for m=0..511
    for n=0..511
        total += im[m,n]
    end
end
```

However, it could also be more sophisticated as well (called the *HQL variant*):

```
A = zeros(2,2)
#pragma force_parallel
for i=0..255
    A[0,0] += x[i,0]*y[i,0]
    A[0,1] += x[i,0]*y[i,1]
    A[1,0] += x[i,1]*y[i,0]
    A[1,1] += x[i,1]*y[i,1]
end
```

Here, the accumulator variables are matrix elements, also multiple accumulators are used inside a for loop.

Even though this code is correct, the atomic add (`+=`) may result in **poor performance** on GPU devices, because all additions are serialized in the hardware (all threads need to write to the same location in memory, so there is a spin-lock that basically serializes all the memory write accesses). The performance is often much worse than performing all operations *serially*!

The obvious solution is the use of shared memory and thread synchronization in combination with parallel reduction patterns. I found that such algorithms are actually quite hard to write well, taking all side effects into consideration, such as register pressure and shared memory pressure. To spare Quasar users from writing these more sophisticated algorithms, the Quasar compiler now detects the above pattern, under the following conditions:

- All accumulator expressions (e.g. `total`, `A[0,0]`) should be 1) variables, 2) expressions with constant numeric indices, or 3) expressions with indices whose value does not change during the for-loop.
- The accumulator variables should be **scalar numbers**. Complex-valued numbers and fixed-length vectors are currently not (yet) supported.
- Only **full-dimensional** parallel reductions are currently supported. A sum along the rows or columns cannot be handled yet.
- There is an **upper limit** on the number of accumulators (due to the size limit of the shared memory). For 32-bit floating point, up to 32 accumulators are supported, and for 64-bit floating point, up to 16. When the upper limit is exceeded, the generated code will still work, but the block size will *silently* be reduced. This, together with the impact on the occupancy (due to the high number of registers being used), might lead to performance degradation.

For the first example, the loop transformer will generate the following code:

```
function total:scalar = __kernel__ opt__for_test1_kernel(im:mat,$datadims:ivec2,blkpos:ivec2,blkdim:ivec2)
    % NOTE: the for-loop on line 14 was optimized using the parallel reduction loop transform.
    $bins=shared(blkdim[0],blkdim[1],1)
    $accum0=0
    $m=blkpos[0]
    while ($m<$datadims[0])
        $n=blkpos[1]
        while ($n<$datadims[1])
            $accum0+=im[$m,$n]
            $n+=blkdim[1]
        end
        $m+=blkdim[0]
    end
    $bins[blkpos[0],blkpos[1],0]=$accum0
    syncthreads
    $bit=1
    while ($bit<blkdim[0])
        if (mod(blkpos[0],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                $bins[(blkpos[0]+$bit),blkpos[1],0])
        endif
        syncthreads
        $bit*=2
    end
    $bit=1
    while ($bit<blkdim[1])
        if (mod(blkpos[1],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                $bins[blkpos[0],(blkpos[1]+$bit),0])
        endif
        syncthreads
        $bit*=2
    end
    if (sum(blkpos)==0)
        total+=$bins[0,0,0]
    endif
end

$blksz=max_block_size(opt__for_test1_kernel,min([16,32],[512,512]))
total=parallel_do([$blksz,$blksz],im,[512,512],opt__for_test1_kernel)
```

Note that variables starting with `$` are only used internally by the compiler, so please do not use them yourself.

Some results (NVidia Geforce 435M), for 100 iterations:

```
#pragma force_parallel (atomic add): 609 ms
#pragma force_serial: 675 ms
#pragma force_parallel (reduction pattern): 137 ms (NEW)
```

So in this case, the parallel reduction pattern results in code that is about 4x-5x faster.

**Conclusion**: 5x less code and 5x faster computation time!

**What should I do**: actually nothing; the compiler applies this loop transform automatically.

## Quasar for Matlab/Octave users

This section gives an overview of various functions and operators in Matlab and Quasar. Please feel free to extend this page if you find something that is missing.

Quick hints (see reference manual for a complete description):

- Matrices are passed by reference rather than by value. If necessary, use the `copy(.)` function.
- Indices start at 0 rather than at 1.
- The `end` variable, as in `A(end,end)`, is currently not supported.
- All functions work transparently on the GPU (in MATLAB you need to use the special data types `GPUsingle`, `GPUdouble`, …).
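The pass-by-reference rule in particular tends to surprise MATLAB users; a small sketch:

```
A = zeros(2,2)
B = A            % B refers to the same matrix as A
B[0,0] = 1       % the change is also visible through A
C = copy(A)      % C is an independent copy
C[0,0] = 5       % does not affect A; A[0,0] is still 1
```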

## Arithmetic operators

| | MATLAB/Octave | Quasar |
|---|---|---|
| Assignment | a=1; b=2; | a=1; b=2 |
| Multiple assignment | [a,b]=[1,2]; | [a,b]=[1,2] |
| Addition | a+b | a+b |
| Subtraction | a-b | a-b |
| Multiplication | a*b | a*b |
| Division | a/b | a/b |
| Power | a^b | a^b |
| Remainder | rem(a,b) | periodize(a,b) |
| Addition (atomic) | a+=1 | a+=1 |
| Pointwise multiplication | a.*b | a.*b |
| Pointwise division | a./b | a./b |
| Pointwise power | a.^b | a.^b |

## Relational operators

| | MATLAB/Octave | Quasar |
|---|---|---|
| Equality | a==b | a==b |
| Less than | a < b | a < b |
| Greater than | a > b | a > b |
| Less than or equal | a <= b | a <= b |
| Greater than or equal | a >= b | a >= b |
| Inequality | a ~= b | a != b |

## Logical operators

| | MATLAB/Octave | Quasar |
|---|---|---|
| Logical AND | a && b | a && b |
| Logical OR | a \|\| b | a \|\| b |
| Logical NOT | ~a | !a |
| Logical XOR | xor(a,b) | xor(a,b) |
| Bitwise AND | a & b | and(a,b) |
| Bitwise OR | a \| b | or(a,b) |
| Bitwise NOT | ~a | not(a) |
| Bitwise XOR | xor(a,b) | xor(a,b) |

## Square root, logarithm

| | MATLAB/Octave | Quasar |
|---|---|---|
| Square root | sqrt(a) | sqrt(a) |
| Logarithm, base e | log(a) | log(a) |
| Logarithm, base 2 | log2(a) | log2(a) |
| Logarithm, base 10 | log10(a) | log10(a) |
| Exponential function | exp(a) | exp(a) |

## Rounding

| | MATLAB/Octave | Quasar |
|---|---|---|
| Round | round(a) | round(a) |
| Round up | ceil(a) | ceil(a) |
| Round down | floor(a) | floor(a) |
| Round towards zero | fix(a) | Not implemented yet |
| Fractional part | frac(a) | frac(a) |

## Mathematical constants

| | MATLAB/Octave | Quasar |
|---|---|---|
| pi | pi | pi |
| e | exp(1) | exp(1) |

## Missing values (IEEE-754 floating point status flags)

| | MATLAB/Octave | Quasar |
|---|---|---|
| Not a number | NaN | 0/0 |
| +Infinity | +inf | 1/0 |
| -Infinity | -inf | -1/0 |
| Is not a number | isnan(x) | isnan(x) |
| Is infinity | isinf(x) | isinf(x) |
| Is finite | isfinite(x) | isfinite(x) |

## Complex numbers

| | MATLAB/Octave | Quasar |
|---|---|---|
| Imaginary unit | i | 1i or 1j |
| Complex number 3+4i | z=3+4i | z=3+4i or z=complex(3,4) |
| Modulus | abs(z) | abs(z) |
| Argument | arg(z) | angle(z) |
| Real part | real(z) | real(z) |
| Imaginary part | imag(z) | imag(z) |
| Complex conjugate | conj(z) | conj(z) |

## Trigonometry

| | MATLAB/Octave | Quasar |
|---|---|---|
| Sine | sin(x) | sin(x) |
| Cosine | cos(x) | cos(x) |
| Tangent | tan(x) | tan(x) |
| Hyperbolic Sine | sinh(x) | sinh(x) |
| Hyperbolic Cosine | cosh(x) | cosh(x) |
| Hyperbolic Tangent | tanh(x) | tanh(x) |
| Arcus Sine | asin(x) | asin(x) |
| Arcus Cosine | acos(x) | acos(x) |
| Arcus Tangent | atan(x) | atan(x) |
| Arcus Hyperbolic Sine | asinh(x) | asinh(x) |
| Arcus Hyperbolic Cosine | acosh(x) | acosh(x) |
| Arcus Hyperbolic Tangent | atanh(x) | atanh(x) |

## Random numbers

| Operation | MATLAB/Octave | Quasar |
| --- | --- | --- |
| Uniform distribution [0,1[ | rand(1,10) | rand(1,10) |
| Gaussian distribution | randn(1,10) | randn(1,10) |
| Complex Gaussian distribution | randn(1,10)+1i*randn(1,10) | complex(randn(1,10),randn(1,10)) |

## Vectors

| Operation | MATLAB/Octave | Quasar | Comment |
| --- | --- | --- | --- |
| Row vector | a=[1 2 3 4]; | a=[1,2,3,4] | |
| Column vector | a=[1; 2; 3; 4]; | a=transpose([1,2,3,4]) | |
| Sequence | 1:10 | 1..10 | |
| Sequence with steps | 1:2:10 | 1..2..10 | |
| Decreasing sequence | 10:-1:1 | 10..-1..1 | |
| Linearly spaced vectors | linspace(1,10,5) | linspace(1,10,5) | |
| Reverse | reverse(a) | reverse(a) | import "system.q" |

## Concatenation

*Quasar note: please avoid concatenation where possible, as this programming approach is not very efficient. It is more efficient to create the resulting matrix using zeros(x), and then to fill in the different parts using slice accessors.*

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| Horizontal concatenation | [a b] | cat(a,b) | import "system.q" |
| Vertical concatenation | [a; b] | vertcat(a,b) | import "system.q" |
| Vertical concatenation with sequences | [0:5; 6:11] | vertcat(0..5,6..11) | import "system.q" |
| Repeating | repmat(a,[1 2]) | repmat(a,[1,2]) | |
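The note above can be illustrated with a short sketch (using only the functions listed in these tables; `a` and `b` are assumed to be vectors):

```
import "system.q"

a = 0..4
b = 5..9

% Inefficient: cat() allocates a new vector on every call
c = cat(a, b)

% More efficient: preallocate the result with zeros(),
% then fill in the parts using slice accessors
c = zeros(numel(a) + numel(b))
c[0..numel(a)-1] = a
c[numel(a)..numel(c)-1] = b
```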

## Matrices

| Operation | MATLAB/Octave | Quasar | Comment |
| --- | --- | --- | --- |
| The first 10×10 elements | a(1:10,1:10) | a[0..9,0..9] | |
| All but the first row and column | a(2:end,2:end) | a[1..size(a,0)-1,1..size(a,1)-1] | |
| The last element | a(end,end) | a[size(a,0..1)-1] | |
| 10th row | a(10,:) | a[9,:] | |
| 10th column | a(:,10) | a[:,9] | |
| Diagonal elements | diag(a) | diag(a) | import "system.q" |
| Matrix with diagonal 1,2,…,10 | diag(1:10) | diag(1..10) | import "system.q" |
| Reshaping | reshape(a,[1 2 3]) | reshape(a,[1,2,3]) | |
| Transposing | transpose(a) | transpose(a) | |
| Hermitian transpose | a' | herm_transpose(a) | import "system.q" |
| Copying matrices | a=b | a=copy(b) | Matrices are passed by reference for efficiency |
| Flip upside-down | flipud(a) | flipud(a) | |
| Flip left-right | fliplr(a) | fliplr(a) | |
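Because matrices are passed by reference, `a=b` makes `a` an alias of `b`; `copy` is needed to obtain an independent copy. A minimal sketch:

```
b = zeros(2, 2)
a = b          % a and b now refer to the same matrix
a[0, 0] = 1    % this also changes b[0, 0]

c = copy(b)    % c is an independent copy
c[0, 0] = 5    % b is unaffected
```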

## Matrix dimensions

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| Number of elements | numel(a) | numel(a) | |
| Number of dimensions | ndims(a) | ndims(a) | import "system.q" |
| Size | size(a) | size(a) | |
| Size along 2nd dimension | size(a,2) | size(a,1) | |
| Size along 2nd & 3rd dimension | [size(a,2),size(a,3)] | size(a,1..2) | |

## Array/matrix creation

| Operation | MATLAB/Octave | Quasar | Comment |
| --- | --- | --- | --- |
| Identity matrix | eye(3) | eye(3) | |
| Zero matrix | zeros(2,3) | zeros(2,3) | |
| Ones matrix | ones(2,3) | ones(2,3) | |
| Uninitialized matrix | – | uninit(2,3) | |
| Complex identity matrix | complex(eye(3)) | complex(eye(3)) | |
| Complex zero matrix | complex(zeros(2,3)) | czeros(2,3) | |
| Complex ones matrix | complex(ones(2,3)) | complex(ones(2,3)) | |
| Complex uninitialized matrix | – | cuninit(2,3) | |
| Vector | zeros(1,10) | zeros(10) or vec(10) | vec enables generic programming |
| Matrix | zeros(10,10) | zeros(10,10) or mat(10,10) | mat enables generic programming |
| Cube | zeros(10,10,10) | zeros(10,10,10) or cube(10,10,10) | cube enables generic programming |

## Maximum and minimum

| Operation | MATLAB/Octave | Quasar |
| --- | --- | --- |
| Maximum of two values | max(a,b) | max(a,b) |
| Minimum of two values | min(a,b) | min(a,b) |
| Maximum of a matrix | max(max(a)) | max(a) |
| Minimum of a matrix along dimension X=1,2 | min(a,[],X) | min(a,X-1) |
| Maximum of a matrix along dimension X=1,2 | max(a,[],X) | max(a,X-1) |

## Matrix assignment

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| Assigning a constant to a matrix slice | a(:,1)=44; | a[:,0]=44 | |
| Assigning a vector to a matrix row slice | a(:,1)=0:99 | a[:,0]=0..99 | |
| Assigning a vector to a matrix column slice | a(1,:)=(0:99)' | a[0,:]=transpose(0..99) | |
| Replace all positive elements by one | a(a>0)=1 | a[a>0]=1 | *New* |
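The logical-indexing assignment in the last row accepts an arbitrary boolean expression over the matrix; a minimal sketch:

```
a = [-2, -1, 0, 1, 2]
a[a > 0] = 1    % a is now [-2, -1, 0, 1, 1]
```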

## Boundary handling, clipping (coordinate transforms)

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| Periodic extension | rem(x,N) | periodize(x,N) | |
| Mirrored extension | – | mirror_ext(x,N) | |
| Clamping | – | clamp(x,N) | clamps x to the range [0,N[ |
| Saturation between 0 and 1 | – | sat(x) | saturates x to the range [0,1] |
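These functions map an arbitrary coordinate back into the valid index range, so they are typically used inside an index expression; a sketch (assuming indexing with an index vector is available):

```
x = 0..9                       % vector with N = 10 elements
y = x[periodize(-2..1, 10)]    % periodize maps -2,-1,0,1 to 8,9,0,1
```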

## Fourier transform

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| 1D FFT | fft(x) | fft1(x) | |
| 2D FFT | fft2(x) | fft2(x) | |
| 3D FFT | fftn(x) | fft3(x) | |
| 1D inverse FFT | ifft(x) | ifft1(x) | |
| 2D inverse FFT | ifft2(x) | ifft2(x) | |
| 3D inverse FFT | ifftn(x) | ifft3(x) | |
| 1D fftshift | fftshift(x) | fftshift1(x) | import "system.q" |
| 2D fftshift | fftshift(x) | fftshift2(x) | import "system.q"; works correctly for images of size MxNx3 |
| 3D fftshift | fftshift(x) | fftshift3(x) | import "system.q" |
| 1D inverse fftshift | ifftshift(x) | ifftshift1(x) | import "system.q" |
| 2D inverse fftshift | ifftshift(x) | ifftshift2(x) | import "system.q"; works correctly for images of size MxNx3 |
| 3D inverse fftshift | ifftshift(x) | ifftshift3(x) | import "system.q" |

## Conversion between integer and floating point

| Operation | MATLAB/Octave | Quasar | Comments |
| --- | --- | --- | --- |
| To single | single(x) | float(x) | 32-bit FP precision mode only |
| To double | double(x) | float(x) | 64-bit FP precision mode only |
| To 32-bit signed integer | int32(x) | int(x) or int32(x) | import "inttypes.q" |
| To 16-bit signed integer | int16(x) | int16(x) | import "inttypes.q" |
| To 8-bit signed integer | int8(x) | int8(x) | import "inttypes.q" |
| To 32-bit unsigned integer | uint32(x) | uint32(x) | import "inttypes.q" |
| To 16-bit unsigned integer | uint16(x) | uint16(x) | import "inttypes.q" |
| To 8-bit unsigned integer | uint8(x) | uint8(x) | import "inttypes.q" |
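After importing "inttypes.q", the integer conversions can be used just like their MATLAB counterparts; a minimal sketch:

```
import "inttypes.q"

x = [0.3, 1.7, 200.2]
y = uint8(x)     % convert each element to an 8-bit unsigned integer
```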