Documentation

Multi-level breaks in sequential loops

Sometimes, small language features can make a lot of difference (in terms of code readability, productivity etc.). In Quasar, multi-dimensional for-loops are quite common. Recently, I came across a missing feature for dealing with multi-dimensional loops.

Suppose we have a multi-dimensional for-loop, as in the following example:

for m=0..511
    for n=0..511
        im_out[m,n] = 255-im[m,n]
        if m==128
            break
        endif
        a = 4
    end
end 

Suppose that we want to break outside the loop, as in the above code. This is useful for stopping the processing at a certain point. There is only one caveat: the break-statement only applies to the loop that surrounds it. In the above example, the processing of row 128 is simply stopped at column 0 (the loop over n is interrupted), but it is then resumed starting from row 129. Some programmers are not aware of this, sometimes this can lead to less efficient code, as in the following example:

for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break
        end
    end
end

Here we perform a sequential search, to find the first matrix element for which V[j,k] != 0. When this matrix element is found, the search is stopped. However, because the break statement stops the inner loop, the outer loop is still executed several times (potentially leading to a performance degradation).

1. Solution with extra variables

To make sure that we break outside the outer loop, we would have to introduce an extra variable:

break_outer = false
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break_outer = true
            break
        end
    end
    if break_outer
        break
    endif
end

It is clear that this approach is not very readible. The additional variable break_outer is also a bit problematic (in the worst case, if the compiler can not filter it out, extra stack memory/registers will be required).

2. Encapsulation in a function

An obvious alternative is the use of a function:

function found = my_func()
    for j = 0..size(V,0)-1
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break
            end
        end
    end
end
found = my_func()

However, the use of function is sometimes not desired for this case. It also involves extra work, such as adding the input/output parameters and adding a function call.

3. New solution: labeling loops

To avoid the above problems, it is now possible to label the for loops (as in e.g. ADA, java):

outer_loop:
    for j = 0..size(V,0)-1
    inner_loop:
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break outer_loop
            end
        end
    end

Providing labels to for-loops is optional, i.e. you only have to do it when it is needed. The new syntax is also supported by the following finding in programming literature:

In 1973 S. Rao Kosaraju refined the structured program theorem by proving that it’s possible to avoid adding additional variables in structured programming, as long as arbitrary-depth, multi-level breaks from loops are allowed. [11]

Note that Quasar has no goto labels (it will never have). The reasons are:

  • Control flow blocks can always be used instead. Control flow blocks offer more visual cues which enhances the readability of the code.
  • At the compiler-level, goto labels may make it more difficult to optimize certain operations (e.g. jumps to different scopes).

Remarks:

  • This applies to the keyword continue as well.
  • Labels can be applied to for, repeatuntil and while loops.
  • In the future, more compiler functionality may be added to make use of the loop labels. For example, it may be possible to indicate that multiple loops (with the specified names) must be merged.
  • It is not possible to break outside a parallel loop! The reason is that the execution of the different threads is (usually) non-deterministic, hence using breaks in parallel-loops would result in non-deterministic results.
  • However, loop labels can be attached to either serial/parallel loops. A useful situation is an iterative algorithm with an inner/outer loop.

Functions in Quasar

Hurray! Today I found some of my old notes about Quasar, written about one year ago. Since I forget everything, I thought it could be useful to put it here.

FunctionsDiagram

This diagram is quite essential, if there are some elements you don’t fully understand, please have a look at the reference manual.

Summarized:

  • Both __kernel__ and __device__ functions are low-level functions, they are natively compiled for CPU and/or GPU. This has the practical consequence that the functionality available for these functions is restricted. It is for example not possible to print, load or save information inside kernel or device functions.
  • Host functions are high-level functions, typically they are interpreted (or Quasar EXE’s, compiled using the just-in-time compiler).
  • A kernel function is normally repeated for every element of a matrix. Kernel functions can only be called from host code (although in future support for CUDA 5.0 dynamic parallelism, this may change).
  • A device function can be called from host code, in which case it is normally interpreted (if not inlined), or from other device/kernel functions, in which case it is natively compiled.

The distinction between these three types of functions is necessary to allow GPU programming. Furthermore, it provides a mechanism (to some extent) to balance the work between CPU/GPU. As programmer, you know whether the code inside the function will be run on GPU/CPU.

Assertions in kernel code

From now on, it is possible to put assertions in a kernel function:

function [] =  __kernel__ kernel (pos : ivec3) 
    b = 2
    assert(b==3)
end

In this example, the assertion obviously fails. Quasar breaks with the following error message:

(parallel_do) test_kernel - assertion failed: line 23

Note that the assertion handling is implemented in CUDA using a C macro:

#define assert(x)   if(!(x)) __trap;

Also see CUDA Error handling for more information about assertions and error handling.

Matrix data types and type inference

Quasar is an array language, this means that array types (vec, mat and cube) are primitive types and have built-in support (for example, this is in contrast with C/C++ where the user has to define it’s own matrix classes).

The reason for the built-in support is of course that this enables easier mapping of Quasar programs to different parallel devices (GPU, …). Moreover, the user is forced to use one representation for its data (rather than using different class libraries, where it is necessary to wrap one matrix class into another matrix class).

On the other hand, by default Quasar abstracts numeric values into one data type scalar. The type scalar just represents a scalar number, and whether this is a floating point number or a fix point number with 16/32/64-bit precision is actually implementation specific (note currently the Quasar runtime system only supports 32-bit and 64-bit floating point numbers).

Type parameters

For efficiency reasons, there is also support for integer data types int, int8, int16, int32, int64, uint8, uint16, uint32, uint64. (Please note that using 64-bit types can suffer from precision errors, because all the calculations are performed in scalar format). To support matrices built of these types, the array types vec, mat and cube are parametric, for example

  • vec[int8] denotes a vector (1D array) of 8-bit signed integers
  • cube[int] denotes a cube (3D array) of signed integers (note: by default, int is 32-bit).

To simplify the types (and to reduce key strokes while programming), there are a number of built-in type aliases:

type vec  : vec[scalar]      % real-valued vector
type cvec : vec[cscalar]    % complex-valued vector

type mat  : mat[scalar]      % real-valued vector
type cmat : mat[cscalar]    % complex-valued vector

type cube  : cube[scalar]    % real-valued vector
type ccube : cube[cscalar]  % complex-valued vector

Please note that these types are just aliases! For example, cube is just cube[scalar] and not cube[something else]:

a = cube[scalar](10)
assert(type(a, "cube"))    % Successful

b = cube[int](10)
assert(type(b, "cube"))    % Unsuccessful - compiler error

However, in case the intention is to check whether a or b is a 3D array regardless of the element type, the special ?? type can be used:

b = cube[int](10)
assert(type(b, "cube[??]"))   % Successful

Type inference

When the type is not specified (for example data that is read dynamically from a file, using the load("data.qd") function), the default data type is ‘??‘. This is a very generic type, every type comparison with ?? results in TRUE. For example:

assert(type(1i+1, '??'))
assert(type([1,2,3], '??'))

However, using variables of type ?? will prevent the compiler to optimize whole operations (for example, applying reductions or automatically generating kernel functions for for-loops). Therefore, it is generally a bad idea to have functions return variables of unspecified type ‘??‘ and correspondingly the compiler gives a warning message when this is the case.

Practically, the type inference starts from the matrix creation functions zeros, ones, imread, … that have a built-in mechanism for deciding the type of the result (based on the parameters of the function).

For example:

  • A = zeros([1,1,4]) creates a vector of length 4 (vec)
  • B = zeros([2,3]) creates a matrix of dimensions 2 x 3 (mat).
  • C = imread("data.tif") creates a cube at all times.

Note that the type inference also works when a variable is passed to the matrix creation functions:

sz = [1,1,4]; A = zeros(sz)

In this case, the compiler knows that sz is a constant vector, it keeps track of the value and uses it for determining the type of zeros.

However: the compiler cannot do this when the variable sz is passed as argument of a function:

function A = create_data(sz)
    A = zeros(sz)
end

In this case, because the type of sz is unknown, the compiler cannot determine the type of A and will therefore use the default type ??. For convenience, the compiler then also generates a warning message “could not determine the type of output argument A”. The solution is then simply to specify the type of sz:

function A = create_data(sz : ivec2)
    A = zeros(sz)
end

This way, the compiler knows that sz is a vector of length 2, and can deduce the type of A, which is a matrix (mat).

Summary

The type system can be summarized as follows. There are 6 categories of types:

  1. Primitive scalar types scalar, cscalar, int, int8, …
  2. Matrix types vec, mat, cube

    with parametrized versions vec[??], mat[??], cube[??].

  3. Classes: type R : class / type T : mutable class
  4. Function types [?? -> ??], [(??,??)->(??,??)], …

    Device functions: [__device__ ?? -> ??] Kernel functions: [__kernel__ ?? -> ??]

  5. Individual types type
  6. Type classes: T : [scalar|mat|cube]

Finally, different types can be combined to define new types.

Exercise:

  • Figure out what the following type means:
    type X : [vec[ [??->[int|mat|cube[??->??] ] | int -> ?? | __device__ mat->() ] | cscalar ]

    Just kidding;-)

Parallel reduction patterns

An often recurring programming idiom is the use of atomic operations for data aggregation (e.g. to calculate a sum). I noted this when inspecting the code from several Quasar users. In the most simple form, this idiom is as follows (called the JDV variant):

total = 0.0
#pragma force_parallel
for m=0..511
    for n=0..511
        total += im[m,n]
    end
end

However, it could also be more sophisticated as well (called the HQL variant):

A = zeros(2,2)
#pragma force_parallel
for i=0..255        
    A[0,0] += x[i,0]*y[i,0]
    A[0,1] += x[i,0]*y[i,1]
    A[1,0] += x[i,1]*y[i,0]
    A[1,1] += x[i,1]*y[i,1]
end    

Here, the accumulator variables are matrix elements, also multiple accumulators are used inside a for loop.

Even though this code is correct, the atomic add (+=) may result in a poor performance on GPU devices, due to all adds being serialized in the hardware (all threads need to write to the same location in memory, so there is a spin-lock that basically serializes all the memory write accesses). The performance is often much worse than performing all operations in serial!

The obvious solution is the use of shared memory, thread synchronization in combination with parallel reduction patterns. I found that such algorithms are actually quite hard to write well, taking all side-effects in consideration, such as register pressure, shared memory pressure. To avoid Quasar users from writing these more sophisticated algorithms, the Quasar compiler now detects the above pattern, under the following conditions:

  • All accumulator expressions (e.g. total, A[0,0]) should be 1) variables, 2) expressions with constant numeric indices or 3) expressions with indices whose value does not change during the for-loop.
  • The accumulator variables should be scalar numbers. Complex-valued numbers and fixed-length vectors are currently not (yet) supported.
  • Only full dimensional parallel reductions are currently supported. A sum along the rows or columns can not be handled yet.
  • There is an upper limit on the number of accumulators (due to the size limit of the shared memory). For 32-bit floating point, up to 32 accumulators and for 64-bit floating point, up to 16 accumulators are supported. When the upper limit is exceeded, the generated code will still work, but the block size will silently be reduced. This, together with the impact on the occupancy (due to high number of registers being used) might lead to a performance degradation.

For the first example, the loop transformer will generate the following code:

function total:scalar = __kernel__ opt__for_test1_kernel(im:mat,$datadims:ivec2,blkpos:ivec2,blkdim:ivec2)
    % NOTE: the for-loop on line 14 was optimized using the parallel reduction loop transform.        
    $bins=shared(blkdim[0],blkdim[1],1)
    $accum0=0
    $m=blkpos[0]

    while ($m<$datadims[0])
        $n=blkpos[1]
        while ($n<$datadims[1])
            $accum0+=im[$m,$n]
            $n+=blkdim[1]
        end
        $m+=blkdim[0]
    end

    $bins[blkpos[0],blkpos[1],0]=$accum0
    syncthreads
    $bit=1
    while ($bit<blkdim[0])
        if (mod(blkpos[0],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[(blkpos[0]+$bit),blkpos[1],0])
        endif
        syncthreads
        $bit*=2
    end

    $bit=1
    while ($bit<blkdim[1])
        if (mod(blkpos[1],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[blkpos[0],(blkpos[1]+$bit),0])
        endif
        syncthreads
        $bit*=2
    end
    if (sum(blkpos)==0)
        total+=$bins[0,0,0]
    endif
end

$blksz=max_block_size(opt__for_test1_kernel,min([16,32],[512,512]))
total=parallel_do([$blksz,$blksz],im,[512,512],opt__for_test1_kernel)

Note that variables starting with $ are only used internally by the compiler, so please do not use them yourself.

Some results (NVidia Geforce 435M), for 100 iterations:

#pragma force_parallel (atomic add):        609 ms
#pragma force_serial:                       675 ms
#pragma force_parallel (reduction pattern): 137 ms (NEW)

So in this case, the parallel reduction pattern results in code that is about 4x-5x faster.

Conclusion: 5x less code and 5x faster computation time!

What should I do: actually nothing.

Quasar for Matlab/Octave users

This section gives an overview of various functions and operators in Matlab and Quasar. Please feel free to extend this page if you find something that is missing.

Quick hints (see reference manual for a complete description):

  • Matrices are passed by reference rather than by value. If necessary, use the copy(.) function.
  • Indices start with 0 rather than with 1.
  • end variable, like in A(end,end) is currently not supported.
  • All functions work transparently on the GPU (in MATLAB you need to use special datatypes GPUsingle, GPUdouble, …)

Arithmetic operators

MATLAB/Octave Quasar
Assignment a=1; b=2; a=1; b=2
Multiple assignment [a,b]=[1,2]; [a,b]=[1,2]
Addition a+b a+b
Subtraction a-b a-b
Multiplication a*b a*b
Division a/b a/b
Power a^b a^b
Remainder rem(a,b) periodize(a,b)
Addition (atomic) a+=1 a+=1
Pointwise multiplication a.*b a.*b
Pointwise division a./b a./b
Pointwise power a.^b a.^b

Relational operators

MATLAB/Octave Quasar
Equality a==b a==b
Less than a < b a < b
Greater than a > b a > b
Less than or equal a <= b a <= b
Greater than or equal a >= b a >= b
Inequal a ~= b a != b

Logical operators

MATLAB/Octave Quasar
Logical AND a && b a && b
Logical OR a || b a || b
Logical NOT ~a !a
Logical XOR xor(a,b) xor(a,b)
Bitwise AND a & b and(a,b)
Bitwise OR a | b or(a,b)
Bitwise NOT ~a not(a)
Bitwise XOR xor(a,b) xor(a,b)

Square root, logarithm

MATLAB/Octave Quasar
Square root sqrt(a) sqrt(a)
Logarithm, base e log(a) log(a)
Logarithm, base 2 log2(a) log2(a)
Logarithm, base 10 log10(a) log10(a)
Exponential function exp(a) exp(a)

Rounding

MATLAB/Octave Quasar
Round round(a) round(a)
Round up ceil(a) ceil(a)
Round down floor(a) floor(a)
Round towards zero fix(a) Not implemented yet
Fractional part frac(a) frac(a)

Mathematical constants

MATLAB/Octave Quasar
pi pi pi
e exp(1) exp(1)

Missing values (IEEE-754 floating point status flags)

MATLAB/Octave Quasar
Not a number NaN 0/0
+Infinity +inf 1/0
-Infinity -inf -1/0
Is not a number isnan(x) isnan(x)
Is infinity isinf(x) isinf(x)
Is finite isfinite(x) isfinite(x)

Complex numbers

MATLAB/Octave Quasar
Imaginary unit i 1i or 1j
Complex number 3+4i z=3+4i z=3+4i or z=complex(3,4)
Modulus abs(z) abs(z)
Argument arg(z) angle(z)
Real part real(z) real(z)
Imaginary part imag(z) imag(z)
Complex conjugate conj(z) conj(z)

Trigonometry

MATLAB/Octave Quasar
Sine sin(x) sin(x)
Cosine cos(x) cos(x)
Tangent tan(x) tan(x)
Hyperbolic Sine sinh(x) sinh(x)
Hyperbolic Cosine cosh(x) cosh(x)
Hyperbolic Tangent tanh(x) tanh(x)
Arcus Sine asin(x) asin(x)
Arcus Cosine acos(x) acos(x)
Arcus Tangent atan(x) atan(x)
Arcus Hyperbolic Sine asinh(x) asinh(x)
Arcus Hyperbolic Cosine acosh(x) acosh(x)
Arcus Hyperbolic Tangent atanh(x) atanh(x)

Random numbers

MATLAB/Octave Quasar
Uniform distribution [0,1[ rand(1,10) rand(1,10)
Gaussian distribution randn(1,10) randn(1,10)
Complex Gaussian distribution randn(1,10)+1i*randn(1,10) complex(randn(1,10),randn(1,10))

Vectors

MATLAB/Octave Quasar Comment
Row vector a=[1 2 3 4]; a=[1,2,3,4]
Column vector a=[1; 2; 3; 4]; a=transpose([1,2,3,4])
Sequence 1:10 1..10
Sequence with steps 1:2:10 1..2..10
Decreasing sequence 10:-1:1 10..-1..1
Linearly spaced vectors linspace(1,10,5) linspace(1,10,5)
Reverse reverse(a) reverse(a) import “system.q”

Concatenation

Quasar notes: please avoid because this programming approach is not very efficient. More efficient is to create the resulting matrix using zeros(x), then use slice accessors to fill in the different parts.

MATLAB/Octave Quasar Comments
Horizontal concatenation [a b] cat(a,b) import “system.q”
Vertical concatenation [a; b] vertcat(a,b) import “system.q”
Vertical concatenation with sequences [0:5; 6:11] vertcat(0..5,6..11) import “system.q”
Repeating repmat(a,[1 2]) repmat(a,[1,2])

Matrices

MATLAB/Octave Quasar Comment
The first 10 element X+Y a(1:10,1:10); a[0..9,0..9]
Skip the first elements X+Y a(2:end,2:end); a[1..size(a,0)-1,1..size(a,1)-1]
The last elements a(end,end) a[size(a,0..1)-1]
10th row a(10,:) a[9,:]
10th column a(:,10) a[:,9]
Diagonal elements diag(a) diag(a) import “system.q”
Matrix with diagonal 1,2,…,10 diag(1..10) diag(0..9) import “system.q”
Reshaping reshape(a,[1 2 3]) reshape(a,[1,2,3])
Transposing transpose(a) transpose(a)
Hermitian Transpose a’ herm_transpose(a) import “system.q”
Copying matrices a=b a=copy(b) Matrices passed by reference for efficiency
Flip upsize-down flipud(a) flipud(a)
Flip left-right fliplr(a) fliplr(a)

Matrix dimensions

MATLAB/Octave Quasar Comments
Number of elements numel(a) numel(a)
Number of dimensions ndims(a) ndims(a) import “system.q”
Size size(a) size(a)
Size along 2nd dimension size(a,2) size(a,1)
Size along 2nd & 3rd dimension [size(a,2),size(a,3)] size(a,1..2)

Array/matrix creation

MATLAB/Octave Quasar Comment
Identity matrix eye(3) eye(3)
Zero matrix zeros(2,3) zeros(2,3)
Ones matrix ones(2,3) ones(2,3)
Uninitialized matrix uninit(2,3)
Complex identity matrix complex(eye(3)) complex(eye(3))
Complex zero matrix complex(zeros(2,3)) czeros(2,3)
Complex ones matrix complex(ones(2,3)) complex(ones(2,3))
Complex Uninitialized matrix cuninit(2,3)
Vector zeros(1,10) zeros(10) or vec(10) vec enables generic programming
Matrix zeros(10,10) zeros(10,10) or mat(10,10) mat enables generic programming
Cube zeros(10,10,10) zeros(10,10,10) or cube(10,10,10) cube enables generic programming

Maximum and minimum

MATLAB/Octave Quasar
Maximum of two values max(a,b) max(a,b)
Minimum of two values min(a,b) min(a,b)
Maximum of a matrix max(max(a)) max(a)
Minimum of a matrix along dimension X=1,2 min(a,[],X) min(a,X-1)
Maximum of a matrix along dimension X=1,2 max(a,[],X) max(a,X-1)

Matrix assignment

MATLAB/Octave Quasar Comments
Assigning a constant to a matrix slice a(:,1)=44; a[:,0]=44
Assigning a vector to a matrix row slice a(:,1)=0:99 a[:,0]=0..99
Assigning a vector to a matrix column slice a(1,:)=(0:99)’ a[0,:]=transpose(0..99)
Replace all positive elements by one a(a>0)=1 a[a>0]=1 *New*

Boundary handling, clipping (coordinate transforms)

MATLAB/Octave Quasar Comments
Periodic extension rem(x,N) periodize(x,N)
Mirrored extension mirror_ext(x,N)
Clamping clamp(x,N) returns 0 when x=N, otherwise x.
Saturation between 0 and 1 sat(x) returns 0 when x=1, otherwise x.

Fourier transform

MATLAB/Octave Quasar Comments
1D FFT fft(x) fft1(x)
2D FFT fft2(x) fft2(x)
3D FFT fftN(x) fft3(x)
1D inverse FFT ifft(x) ifft1(x)
2D inverse FFT ifft2(x) ifft2(x)
3D inverse FFT ifftN(x) ifft3(x)
1D fftshift fftshift(x) fftshift1(x) import “system.q”
2D fftshift fftshift(x) fftshift2(x) import “system.q”; Works correctly or images with size MxNx3
3D fftshift fftshift(x) fftshift3(x) import “system.q”
1D inverse fftshift ifftshift(x) ifftshift1(x) import “system.q”
2D inverse fftshift ifftshift(x) ifftshift2(x) import “system.q”; Works correctly or images with size MxNx3
3D inverse fftshift ifftshift(x) ifftshift3(x) import “system.q”

Conversion between integer and floating point

MATLAB/Octave Quasar
To single single(x) float(x) 32-bit FP precision mode only
To double double(x) float(x) 64-bit FP precision mode only
To 32-bit signed integer int32(x) int(x) or int32(x) import “inttypes.q”
To 16-bit signed integer int16(x) int16(x) import “inttypes.q”
To 8-bit signed integer int8(x) int8(x) import “inttypes.q”
To 32-bit unsigned integer uint32(x) uint32(x) import “inttypes.q”
To 16-bit unsigned integer uint16(x) uint16(x) import “inttypes.q”
To 8-bit unsigned integer uint8(x) uint8(x) import “inttypes.q”