
Multi-level breaks in sequential loops

Sometimes, small language features can make a lot of difference in terms of code readability, productivity etc. In Quasar, multi-dimensional for-loops are quite common, and recently I came across a missing feature for dealing with them.

Suppose we have a multi-dimensional for-loop, as in the following example:

for m=0..511
    for n=0..511
        im_out[m,n] = 255-im[m,n]
        if m==128
            break
        endif
        a = 4
    end
end 

Suppose that we want to break out of the loop, as in the above code. This is useful for stopping the processing at a certain point. There is only one caveat: the break statement only applies to the loop that immediately surrounds it. In the above example, the processing of row 128 is simply stopped at column 0 (the loop over n is interrupted), but it is then resumed from row 129. Not all programmers are aware of this, and sometimes it can lead to less efficient code, as in the following example:

for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break
        end
    end
end

Here we perform a sequential search to find the first matrix element for which V[j,k] != 0. When this matrix element is found, the search is stopped. However, because the break statement only stops the inner loop, the outer loop still continues over the remaining rows (potentially leading to a performance degradation).

1. Solution with extra variables

To make sure that we also break out of the outer loop, we would have to introduce an extra variable:

break_outer = false
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break_outer = true
            break
        end
    end
    if break_outer
        break
    endif
end

It is clear that this approach is not very readable. The additional variable break_outer is also a bit problematic (in the worst case, if the compiler cannot optimize it away, extra stack memory/registers will be required).

2. Encapsulation in a function

An obvious alternative is the use of a function:

function found = my_func()
    for j = 0..size(V,0)-1
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                return   % exit the function immediately, so that both loops stop
            end
        end
    end
end
found = my_func()

However, the use of a function is sometimes not desirable in this case. It also involves extra work, such as adding the input/output parameters and adding a function call.

3. New solution: labeling loops

To avoid the above problems, it is now possible to label the for-loops (as in e.g. Ada and Java):

outer_loop:
    for j = 0..size(V,0)-1
    inner_loop:
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break outer_loop
            end
        end
    end

Providing labels for for-loops is optional, i.e. you only have to do it when it is needed. The new syntax is also supported by the following finding from the programming literature:

In 1973 S. Rao Kosaraju refined the structured program theorem by proving that it’s possible to avoid adding additional variables in structured programming, as long as arbitrary-depth, multi-level breaks from loops are allowed. [11]

Note that Quasar has no goto labels (and it never will). The reasons are:

  • Control flow blocks can always be used instead. Control flow blocks offer more visual cues, which enhances the readability of the code.
  • At the compiler-level, goto labels may make it more difficult to optimize certain operations (e.g. jumps to different scopes).

Remarks:

  • This applies to the keyword continue as well (see the sketch below).
  • Labels can be applied to for, repeat-until and while loops.
  • In the future, more compiler functionality may be added that makes use of the loop labels. For example, it may become possible to indicate that multiple loops (with the specified names) must be merged.
  • It is not possible to break out of a parallel loop! The reason is that the execution order of the different threads is (usually) non-deterministic, hence using breaks in parallel loops would lead to non-deterministic results.
  • However, loop labels can be attached to both serial and parallel loops. A typical use case is an iterative algorithm with an inner and an outer loop.
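
As an illustration of the first two remarks, here is a small sketch (following the labeled-break syntax above; the zero-test is just an arbitrary condition) of a labeled continue attached to the outer loop:

row_loop:
    for m=0..511
        for n=0..511
            if im[m,n]==0
                continue row_loop    % proceed immediately with the next row m
            endif
        end
    end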

Functions in Quasar

Hurray! Today I found some of my old notes about Quasar, written about one year ago. Since I tend to forget everything, I thought it would be useful to put them here.

[Diagram: overview of the relation between host, __kernel__ and __device__ functions]

This diagram is quite essential, if there are some elements you don’t fully understand, please have a look at the reference manual.

Summarized:

  • Both __kernel__ and __device__ functions are low-level functions; they are natively compiled for CPU and/or GPU. The practical consequence is that the functionality available inside these functions is restricted: for example, it is not possible to print, load or save data inside kernel or device functions.
  • Host functions are high-level functions; typically they are interpreted (or, for Quasar EXE's, compiled using the just-in-time compiler).
  • A kernel function is normally repeated for every element of a matrix. Kernel functions can only be called from host code (although with future support for CUDA 5.0 dynamic parallelism, this may change).
  • A device function can be called from host code, in which case it is normally interpreted (if not inlined), or from other device/kernel functions, in which case it is natively compiled.

The distinction between these three types of functions is necessary to enable GPU programming. Furthermore, it provides a mechanism (to some extent) to balance the work between CPU and GPU: as a programmer, you know whether the code inside a function will run on the GPU or the CPU.
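
As a minimal sketch of how these pieces fit together (the names invert_kernel, im and im_out are invented here, and a 512 x 512 image is assumed): a kernel function is defined with the __kernel__ modifier and launched from host code using parallel_do, which repeats it for every position:

function [] = __kernel__ invert_kernel (im : mat, im_out : mat, pos : ivec2)
    im_out[pos[0],pos[1]] = 255 - im[pos[0],pos[1]]
end

im_out = zeros([512,512])
parallel_do([512,512], im, im_out, invert_kernel)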

Assertions in kernel code

From now on, it is possible to put assertions in a kernel function:

function [] = __kernel__ test_kernel (pos : ivec3)
    b = 2
    assert(b==3)
end

In this example, the assertion obviously fails. Quasar breaks with the following error message:

(parallel_do) test_kernel - assertion failed: line 23
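
Note that the assertion is only evaluated when the kernel is actually launched from host code, for example (a sketch; the dimensions are arbitrary):

parallel_do([64,64,4], test_kernel)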

Note that the assertion handling is implemented in CUDA using a C macro:

#define assert(x)   if(!(x)) __trap();

Also see CUDA Error handling for more information about assertions and error handling.

Matrix data types and type inference

Quasar is an array language: array types (vec, mat and cube) are primitive types with built-in support (in contrast with C/C++, where the user has to define their own matrix classes).

The reason for the built-in support is of course that this enables easier mapping of Quasar programs to different parallel devices (GPU, …). Moreover, the user is forced to use one representation for the data (rather than using different class libraries, where it is often necessary to wrap one matrix class into another).

On the other hand, by default Quasar abstracts numeric values into one data type: scalar. The type scalar just represents a scalar number; whether this is a floating-point or a fixed-point number with 16/32/64-bit precision is implementation-specific (note: currently the Quasar runtime system only supports 32-bit and 64-bit floating-point numbers).

Type parameters

For efficiency reasons, there is also support for the integer data types int, int8, int16, int32, int64, uint8, uint16, uint32 and uint64. (Please note that the 64-bit integer types can suffer from precision errors, because all calculations are performed in the scalar format.) To support matrices built of these types, the array types vec, mat and cube are parametric, for example:

  • vec[int8] denotes a vector (1D array) of 8-bit signed integers
  • cube[int] denotes a cube (3D array) of signed integers (note: by default, int is 32-bit).
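
Such typed arrays can also be constructed directly, as in the following sketch (the sizes are arbitrary):

A = vec[int8](64)             % vector of 64 8-bit signed integers
B = cube[int](10)             % cube of 32-bit signed integers
assert(type(A, "vec[int8]"))  % successful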

To simplify the types (and to reduce key strokes while programming), there are a number of built-in type aliases:

type vec  : vec[scalar]      % real-valued vector
type cvec : vec[cscalar]     % complex-valued vector

type mat  : mat[scalar]      % real-valued matrix
type cmat : mat[cscalar]     % complex-valued matrix

type cube  : cube[scalar]    % real-valued cube
type ccube : cube[cscalar]   % complex-valued cube

Please note that these types are just aliases! For example, cube is just cube[scalar] and not cube[something else]:

a = cube[scalar](10)
assert(type(a, "cube"))    % Successful

b = cube[int](10)
assert(type(b, "cube"))    % Unsuccessful - compiler error

However, if the intention is to check whether a or b is a 3D array regardless of the element type, the special ?? type can be used:

b = cube[int](10)
assert(type(b, "cube[??]"))   % Successful

Type inference

When the type is not specified (for example, for data read dynamically from a file using the load("data.qd") function), the default data type is ??. This is a very generic type: every type comparison with ?? results in TRUE. For example:

assert(type(1i+1, '??'))
assert(type([1,2,3], '??'))

However, using variables of type ?? prevents the compiler from optimizing whole operations (for example, applying reductions or automatically generating kernel functions for for-loops). Therefore, it is generally a bad idea to have functions return variables of the unspecified type ??, and correspondingly the compiler gives a warning message when this is the case.

Practically, the type inference starts from the matrix creation functions zeros, ones, imread, …, which have a built-in mechanism for deciding the type of the result (based on the parameters of the function).

For example:

  • A = zeros([1,1,4]) creates a vector of length 4 (vec).
  • B = zeros([2,3]) creates a matrix of dimensions 2 x 3 (mat).
  • C = imread("data.tif") always creates a cube.

Note that the type inference also works when a variable is passed to the matrix creation functions:

sz = [1,1,4]; A = zeros(sz)

In this case, the compiler knows that sz is a constant vector; it keeps track of the value and uses it to determine the result type of zeros.

However, the compiler cannot do this when the variable sz is passed as an argument to a function:

function A = create_data(sz)
    A = zeros(sz)
end

In this case, because the type of sz is unknown, the compiler cannot determine the type of A and will therefore use the default type ??. For convenience, the compiler then also generates the warning message "could not determine the type of output argument A". The solution is simply to specify the type of sz:

function A = create_data(sz : ivec2)
    A = zeros(sz)
end

This way, the compiler knows that sz is a vector of length 2 and can deduce the type of A, which is a matrix (mat).
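
Calling the function then looks as follows (a sketch; the size is arbitrary):

im = create_data([480,640])   % the compiler deduces that im has type mat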

Summary

The type system can be summarized as follows. There are 6 categories of types:

  1. Primitive scalar types: scalar, cscalar, int, int8, …
  2. Matrix types: vec, mat, cube, with parametrized versions vec[??], mat[??], cube[??].
  3. Classes: type R : class / type T : mutable class
  4. Function types: [?? -> ??], [(??,??)->(??,??)], …
     Device functions: [__device__ ?? -> ??]
     Kernel functions: [__kernel__ ?? -> ??]
  5. Individual types: type
  6. Type classes: T : [scalar|mat|cube]

Finally, different types can be combined to define new types.
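
For example (purely illustrative; the name filter_bank is invented, and the nesting follows the same bracket syntax as the exercise below), a vector of device functions could be declared as:

type filter_bank : vec[[__device__ scalar -> scalar]]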

Exercise:

  • Figure out what the following type means:
    type X : [vec[ [??->[int|mat|cube[??->??] ] | int -> ?? | __device__ mat->() ] | cscalar ]

    Just kidding;-)

Parallel reduction patterns

An often recurring programming idiom is the use of atomic operations for data aggregation (e.g. to calculate a sum); I noticed this when inspecting the code of several Quasar users. In its most simple form, the idiom is as follows (called the JDV variant):

total = 0.0
#pragma force_parallel
for m=0..511
    for n=0..511
        total += im[m,n]
    end
end

However, it can also be more sophisticated (called the HQL variant):

A = zeros(2,2)
#pragma force_parallel
for i=0..255        
    A[0,0] += x[i,0]*y[i,0]
    A[0,1] += x[i,0]*y[i,1]
    A[1,0] += x[i,1]*y[i,0]
    A[1,1] += x[i,1]*y[i,1]
end    

Here, the accumulator variables are matrix elements, and multiple accumulators are used inside one for-loop.

Even though this code is correct, the atomic add (+=) may result in poor performance on GPU devices, because all the adds are serialized by the hardware (all threads need to write to the same memory location, so a spin-lock effectively serializes all the memory write accesses). The performance is often even worse than performing all the operations serially!

The obvious solution is the use of shared memory and thread synchronization, in combination with parallel reduction patterns. I found that such algorithms are actually quite hard to write well, taking all side effects into consideration, such as register pressure and shared memory pressure. To save Quasar users from having to write these more sophisticated algorithms, the Quasar compiler now detects the above pattern, under the following conditions:

  • All accumulator expressions (e.g. total, A[0,0]) should be 1) variables, 2) expressions with constant numeric indices, or 3) expressions with indices whose values do not change during the for-loop.
  • The accumulator variables should be scalar numbers. Complex-valued numbers and fixed-length vectors are not (yet) supported.
  • Only full-dimensional parallel reductions are currently supported; a sum along the rows or columns cannot be handled yet.
  • There is an upper limit on the number of accumulators (due to the size limit of the shared memory): up to 32 accumulators for 32-bit floating point, and up to 16 for 64-bit floating point. When the upper limit is exceeded, the generated code will still work, but the block size will silently be reduced. This, together with the impact on the occupancy (due to the high number of registers used), may lead to performance degradation.

For the first example, the loop transformer will generate the following code:

function total:scalar = __kernel__ opt__for_test1_kernel(im:mat,$datadims:ivec2,blkpos:ivec2,blkdim:ivec2)
    % NOTE: the for-loop on line 14 was optimized using the parallel reduction loop transform.        
    $bins=shared(blkdim[0],blkdim[1],1)
    $accum0=0
    $m=blkpos[0]

    while ($m<$datadims[0])
        $n=blkpos[1]
        while ($n<$datadims[1])
            $accum0+=im[$m,$n]
            $n+=blkdim[1]
        end
        $m+=blkdim[0]
    end

    $bins[blkpos[0],blkpos[1],0]=$accum0
    syncthreads
    $bit=1
    while ($bit<blkdim[0])
        if (mod(blkpos[0],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[(blkpos[0]+$bit),blkpos[1],0])
        endif
        syncthreads
        $bit*=2
    end

    $bit=1
    while ($bit<blkdim[1])
        if (mod(blkpos[1],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[blkpos[0],(blkpos[1]+$bit),0])
        endif
        syncthreads
        $bit*=2
    end
    if (sum(blkpos)==0)
        total+=$bins[0,0,0]
    endif
end

$blksz=max_block_size(opt__for_test1_kernel,min([16,32],[512,512]))
total=parallel_do([$blksz,$blksz],im,[512,512],opt__for_test1_kernel)

Note that variables starting with $ are only used internally by the compiler; please do not use them yourself.

Some results (NVIDIA GeForce 435M, 100 iterations):

#pragma force_parallel (atomic add):        609 ms
#pragma force_serial:                       675 ms
#pragma force_parallel (reduction pattern): 137 ms (NEW)

So in this case, the parallel reduction pattern results in code that is about 4x-5x faster.

Conclusion: 5x less code and a 4x-5x faster computation time!

What do you need to do to take advantage of this? Actually nothing.