Documentation

Multi-level breaks in sequential loops

April 20, 2016

Sometimes, small language features can make a lot of difference (in terms of code readability, productivity etc.). In Quasar, multi-dimensional for-loops are quite common. Recently, I came across a missing feature for dealing with multi-dimensional loops.

Suppose we have a multi-dimensional for-loop, as in the following example:

for m=0..511
    for n=0..511
        im_out[m,n] = 255-im[m,n]
        if m==128
            break
        endif
        a = 4
    end
end

Suppose that we want to break outside the loop, as in the above code. This is useful for stopping the processing at a certain point. There is only one caveat: the break-statement only applies to the loop that surrounds it. In the above example, the processing of row 128 is simply stopped at column 0 (the loop over n is interrupted), but it is then resumed starting from row 129. Some programmers are not aware of this, sometimes this can lead to less efficient code, as in the following example:

for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break
        end
    end
end

Here we perform a sequential search, to find the first matrix element for which V[j,k] != 0. When this matrix element is found, the search is stopped. However, because the break statement stops the inner loop, the outer loop is still executed several times (potentially leading to a performance degradation).

1. Solution with extra variables

To make sure that we break outside the outer loop, we would have to introduce an extra variable:

break_outer = false
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break_outer = true
            break
        end
    end
    if break_outer
        break
    endif
end

It is clear that this approach is not very readible. The additional variable break_outer is also a bit problematic (in the worst case, if the compiler can not filter it out, extra stack memory/registers will be required).

2. Encapsulation in a function

An obvious alternative is the use of a function:

function found = my_func()
    for j = 0..size(V,0)-1
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break
            end
        end
    end
end
found = my_func()

However, the use of function is sometimes not desired for this case. It also involves extra work, such as adding the input/output parameters and adding a function call.

3. New solution: labeling loops

To avoid the above problems, it is now possible to label the for loops (as in e.g. ADA, java):

outer_loop:
    for j = 0..size(V,0)-1
    inner_loop:
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break outer_loop
            end
        end
    end

Providing labels to for-loops is optional, i.e. you only have to do it when it is needed. The new syntax is also supported by the following finding in programming literature:

In 1973 S. Rao Kosaraju refined the structured program theorem by proving that it’s possible to avoid adding additional variables in structured programming, as long as arbitrary-depth, multi-level breaks from loops are allowed. [11]

Note that Quasar has no goto labels (it will never have). The reasons are:

Control flow blocks can always be used instead. Control flow blocks offer more visual cues which enhances the readability of the code.
At the compiler-level, goto labels may make it more difficult to optimize certain operations (e.g. jumps to different scopes).

Remarks:

This applies to the keyword continue as well.
Labels can be applied to for, repeat…until and while loops.
In the future, more compiler functionality may be added to make use of the loop labels. For example, it may be possible to indicate that multiple loops (with the specified names) must be merged.
It is not possible to break outside a parallel loop! The reason is that the execution of the different threads is (usually) non-deterministic, hence using breaks in parallel-loops would result in non-deterministic results.
However, loop labels can be attached to either serial/parallel loops. A useful situation is an iterative algorithm with an inner/outer loop.

Posted in Documentation and tagged Programming tips

Functions in Quasar

April 20, 2016

Hurray! Today I found some of my old notes about Quasar, written about one year ago. Since I forget everything, I thought it could be useful to put it here.

FunctionsDiagram

This diagram is quite essential, if there are some elements you don’t fully understand, please have a look at the reference manual.

Summarized:

Both __kernel__ and __device__ functions are low-level functions, they are natively compiled for CPU and/or GPU. This has the practical consequence that the functionality available for these functions is restricted. It is for example not possible to print, load or save information inside kernel or device functions.
Host functions are high-level functions, typically they are interpreted (or Quasar EXE’s, compiled using the just-in-time compiler).
A kernel function is normally repeated for every element of a matrix. Kernel functions can only be called from host code (although in future support for CUDA 5.0 dynamic parallelism, this may change).
A device function can be called from host code, in which case it is normally interpreted (if not inlined), or from other device/kernel functions, in which case it is natively compiled.

The distinction between these three types of functions is necessary to allow GPU programming. Furthermore, it provides a mechanism (to some extent) to balance the work between CPU/GPU. As programmer, you know whether the code inside the function will be run on GPU/CPU.

Posted in Documentation and tagged Programming tips

Assertions in kernel code

April 20, 2016

From now on, it is possible to put assertions in a kernel function:

function [] =  __kernel__ kernel (pos : ivec3) 
    b = 2
    assert(b==3)
end

In this example, the assertion obviously fails. Quasar breaks with the following error message:

(parallel_do) test_kernel - assertion failed: line 23

Note that the assertion handling is implemented in CUDA using a C macro:

#define assert(x)   if(!(x)) __trap;

Also see CUDA Error handling for more information about assertions and error handling.

Posted in Documentation and tagged Programming tips

Matrix data types and type inference

April 20, 2016

Quasar is an array language, this means that array types (vec, mat and cube) are primitive types and have built-in support (for example, this is in contrast with C/C++ where the user has to define it’s own matrix classes).

The reason for the built-in support is of course that this enables easier mapping of Quasar programs to different parallel devices (GPU, …). Moreover, the user is forced to use one representation for its data (rather than using different class libraries, where it is necessary to wrap one matrix class into another matrix class).

On the other hand, by default Quasar abstracts numeric values into one data type scalar. The type scalar just represents a scalar number, and whether this is a floating point number or a fix point number with 16/32/64-bit precision is actually implementation specific (note currently the Quasar runtime system only supports 32-bit and 64-bit floating point numbers).

Type parameters

For efficiency reasons, there is also support for integer data types int, int8, int16, int32, int64, uint8, uint16, uint32, uint64. (Please note that using 64-bit types can suffer from precision errors, because all the calculations are performed in scalar format). To support matrices built of these types, the array types vec, mat and cube are parametric, for example

vec[int8] denotes a vector (1D array) of 8-bit signed integers
cube[int] denotes a cube (3D array) of signed integers (note: by default, int is 32-bit).

To simplify the types (and to reduce key strokes while programming), there are a number of built-in type aliases:

type vec  : vec[scalar]      % real-valued vector
type cvec : vec[cscalar]    % complex-valued vector

type mat  : mat[scalar]      % real-valued vector
type cmat : mat[cscalar]    % complex-valued vector

type cube  : cube[scalar]    % real-valued vector
type ccube : cube[cscalar]  % complex-valued vector

Please note that these types are just aliases! For example, cube is just cube[scalar] and not cube[something else]:

a = cube[scalar](10)
assert(type(a, "cube"))    % Successful

b = cube[int](10)
assert(type(b, "cube"))    % Unsuccessful - compiler error

However, in case the intention is to check whether a or b is a 3D array regardless of the element type, the special ?? type can be used:

b = cube[int](10)
assert(type(b, "cube[??]"))   % Successful

Type inference

When the type is not specified (for example data that is read dynamically from a file, using the load("data.qd") function), the default data type is ‘??‘. This is a very generic type, every type comparison with ?? results in TRUE. For example:

assert(type(1i+1, '??'))
assert(type([1,2,3], '??'))

However, using variables of type ?? will prevent the compiler to optimize whole operations (for example, applying reductions or automatically generating kernel functions for for-loops). Therefore, it is generally a bad idea to have functions return variables of unspecified type ‘??‘ and correspondingly the compiler gives a warning message when this is the case.

Practically, the type inference starts from the matrix creation functions zeros, ones, imread, … that have a built-in mechanism for deciding the type of the result (based on the parameters of the function).

For example:

A = zeros([1,1,4]) creates a vector of length 4 (vec)
B = zeros([2,3]) creates a matrix of dimensions 2 x 3 (mat).
C = imread("data.tif") creates a cube at all times.

Note that the type inference also works when a variable is passed to the matrix creation functions:

sz = [1,1,4]; A = zeros(sz)

In this case, the compiler knows that sz is a constant vector, it keeps track of the value and uses it for determining the type of zeros.

However: the compiler cannot do this when the variable sz is passed as argument of a function:

function A = create_data(sz)
    A = zeros(sz)
end

In this case, because the type of sz is unknown, the compiler cannot determine the type of A and will therefore use the default type ??. For convenience, the compiler then also generates a warning message “could not determine the type of output argument A”. The solution is then simply to specify the type of sz:

function A = create_data(sz : ivec2)
    A = zeros(sz)
end

This way, the compiler knows that sz is a vector of length 2, and can deduce the type of A, which is a matrix (mat).

Summary

The type system can be summarized as follows. There are 6 categories of types:

Primitive scalar types scalar, cscalar, int, int8, …
Matrix types vec, mat, cube
with parametrized versions vec[??], mat[??], cube[??].
Classes: type R : class / type T : mutable class
Function types [?? -> ??], [(??,??)->(??,??)], …
Device functions: [__device__ ?? -> ??] Kernel functions: [__kernel__ ?? -> ??]
Individual types type
Type classes: T : [scalar|mat|cube]

Finally, different types can be combined to define new types.

Exercise:

Figure out what the following type means:

type X : [vec[ [??->[int|mat|cube[??->??] ] | int -> ?? | __device__ mat->() ] | cscalar ]

Just kidding;-)

Posted in Documentation and tagged data types, programming, Programming tips

Parallel reduction patterns

April 19, 2016 |

An often recurring programming idiom is the use of atomic operations for data aggregation (e.g. to calculate a sum). I noted this when inspecting the code from several Quasar users. In the most simple form, this idiom is as follows (called the JDV variant):

total = 0.0
#pragma force_parallel
for m=0..511
    for n=0..511
        total += im[m,n]
    end
end

However, it could also be more sophisticated as well (called the HQL variant):

A = zeros(2,2)
#pragma force_parallel
for i=0..255        
    A[0,0] += x[i,0]*y[i,0]
    A[0,1] += x[i,0]*y[i,1]
    A[1,0] += x[i,1]*y[i,0]
    A[1,1] += x[i,1]*y[i,1]
end

Here, the accumulator variables are matrix elements, also multiple accumulators are used inside a for loop.

Even though this code is correct, the atomic add (+=) may result in a poor performance on GPU devices, due to all adds being serialized in the hardware (all threads need to write to the same location in memory, so there is a spin-lock that basically serializes all the memory write accesses). The performance is often much worse than performing all operations in serial!

The obvious solution is the use of shared memory, thread synchronization in combination with parallel reduction patterns. I found that such algorithms are actually quite hard to write well, taking all side-effects in consideration, such as register pressure, shared memory pressure. To avoid Quasar users from writing these more sophisticated algorithms, the Quasar compiler now detects the above pattern, under the following conditions:

All accumulator expressions (e.g. total, A[0,0]) should be 1) variables, 2) expressions with constant numeric indices or 3) expressions with indices whose value does not change during the for-loop.
The accumulator variables should be scalar numbers. Complex-valued numbers and fixed-length vectors are currently not (yet) supported.
Only full dimensional parallel reductions are currently supported. A sum along the rows or columns can not be handled yet.
There is an upper limit on the number of accumulators (due to the size limit of the shared memory). For 32-bit floating point, up to 32 accumulators and for 64-bit floating point, up to 16 accumulators are supported. When the upper limit is exceeded, the generated code will still work, but the block size will silently be reduced. This, together with the impact on the occupancy (due to high number of registers being used) might lead to a performance degradation.

For the first example, the loop transformer will generate the following code:

function total:scalar = __kernel__ opt__for_test1_kernel(im:mat,$datadims:ivec2,blkpos:ivec2,blkdim:ivec2)
    % NOTE: the for-loop on line 14 was optimized using the parallel reduction loop transform.        
    $bins=shared(blkdim[0],blkdim[1],1)
    $accum0=0
    $m=blkpos[0]

    while ($m<$datadims[0])
        $n=blkpos[1]
        while ($n<$datadims[1])
            $accum0+=im[$m,$n]
            $n+=blkdim[1]
        end
        $m+=blkdim[0]
    end

    $bins[blkpos[0],blkpos[1],0]=$accum0
    syncthreads
    $bit=1
    while ($bit<blkdim[0])
        if (mod(blkpos[0],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[(blkpos[0]+$bit),blkpos[1],0])
        endif
        syncthreads
        $bit*=2
    end

    $bit=1
    while ($bit<blkdim[1])
        if (mod(blkpos[1],(2*$bit))==0)
            $bins[blkpos[0],blkpos[1],0]=($bins[blkpos[0],blkpos[1],0]+
                 $bins[blkpos[0],(blkpos[1]+$bit),0])
        endif
        syncthreads
        $bit*=2
    end
    if (sum(blkpos)==0)
        total+=$bins[0,0,0]
    endif
end

$blksz=max_block_size(opt__for_test1_kernel,min([16,32],[512,512]))
total=parallel_do([$blksz,$blksz],im,[512,512],opt__for_test1_kernel)

Note that variables starting with $ are only used internally by the compiler, so please do not use them yourself.

Some results (NVidia Geforce 435M), for 100 iterations:

#pragma force_parallel (atomic add):        609 ms
#pragma force_serial:                       675 ms
#pragma force_parallel (reduction pattern): 137 ms (NEW)

So in this case, the parallel reduction pattern results in code that is about 4x-5x faster.

Conclusion: 5x less code and 5x faster computation time!

What should I do: actually nothing.

Posted in Documentation and tagged Programming tips

1 Comment

Quasar for Matlab/Octave users

April 19, 2016 |

This section gives an overview of various functions and operators in Matlab and Quasar. Please feel free to extend this page if you find something that is missing.

Quick hints (see reference manual for a complete description):

Matrices are passed by reference rather than by value. If necessary, use the copy(.) function.
Indices start with 0 rather than with 1.
end variable, like in A(end,end) is currently not supported.
All functions work transparently on the GPU (in MATLAB you need to use special datatypes GPUsingle, GPUdouble, …)

Arithmetic operators

	MATLAB/Octave	Quasar
Assignment	a=1; b=2;	a=1; b=2
Multiple assignment	[a,b]=[1,2];	[a,b]=[1,2]
Addition	a+b	a+b
Subtraction	a-b	a-b
Multiplication	a*b	a*b
Division	a/b	a/b
Power	a^b	a^b
Remainder	rem(a,b)	periodize(a,b)
Addition (atomic)	a+=1	a+=1
Pointwise multiplication	a.*b	a.*b
Pointwise division	a./b	a./b
Pointwise power	a.^b	a.^b

Relational operators

	MATLAB/Octave	Quasar
Equality	a==b	a==b
Less than	a < b	a < b
Greater than	a > b	a > b
Less than or equal	a <= b	a <= b
Greater than or equal	a >= b	a >= b
Inequal	a ~= b	a != b

Logical operators

	MATLAB/Octave	Quasar
Logical AND	a && b	a && b
Logical OR	a \|\| b	a \|\| b
Logical NOT	~a	!a
Logical XOR	xor(a,b)	xor(a,b)
Bitwise AND	a & b	and(a,b)
Bitwise OR	a \| b	or(a,b)
Bitwise NOT	~a	not(a)
Bitwise XOR	xor(a,b)	xor(a,b)

Square root, logarithm

	MATLAB/Octave	Quasar
Square root	sqrt(a)	sqrt(a)
Logarithm, base e	log(a)	log(a)
Logarithm, base 2	log2(a)	log2(a)
Logarithm, base 10	log10(a)	log10(a)
Exponential function	exp(a)	exp(a)

Rounding

	MATLAB/Octave	Quasar
Round	round(a)	round(a)
Round up	ceil(a)	ceil(a)
Round down	floor(a)	floor(a)
Round towards zero	fix(a)	Not implemented yet
Fractional part	frac(a)	frac(a)

Mathematical constants

	MATLAB/Octave	Quasar
pi	pi	pi
e	exp(1)	exp(1)

Missing values (IEEE-754 floating point status flags)

	MATLAB/Octave	Quasar
Not a number	NaN	0/0
+Infinity	+inf	1/0
-Infinity	-inf	-1/0
Is not a number	isnan(x)	isnan(x)
Is infinity	isinf(x)	isinf(x)
Is finite	isfinite(x)	isfinite(x)

Complex numbers

	MATLAB/Octave	Quasar
Imaginary unit	i	1i or 1j
Complex number 3+4i	z=3+4i	z=3+4i or z=complex(3,4)
Modulus	abs(z)	abs(z)
Argument	arg(z)	angle(z)
Real part	real(z)	real(z)
Imaginary part	imag(z)	imag(z)
Complex conjugate	conj(z)	conj(z)

Trigonometry

	MATLAB/Octave	Quasar
Sine	sin(x)	sin(x)
Cosine	cos(x)	cos(x)
Tangent	tan(x)	tan(x)
Hyperbolic Sine	sinh(x)	sinh(x)
Hyperbolic Cosine	cosh(x)	cosh(x)
Hyperbolic Tangent	tanh(x)	tanh(x)
Arcus Sine	asin(x)	asin(x)
Arcus Cosine	acos(x)	acos(x)
Arcus Tangent	atan(x)	atan(x)
Arcus Hyperbolic Sine	asinh(x)	asinh(x)
Arcus Hyperbolic Cosine	acosh(x)	acosh(x)
Arcus Hyperbolic Tangent	atanh(x)	atanh(x)

Random numbers

	MATLAB/Octave	Quasar
Uniform distribution [0,1[	rand(1,10)	rand(1,10)
Gaussian distribution	randn(1,10)	randn(1,10)
Complex Gaussian distribution	randn(1,10)+1i*randn(1,10)	complex(randn(1,10),randn(1,10))

Vectors

	MATLAB/Octave	Quasar	Comment
Row vector	a=[1 2 3 4];	a=[1,2,3,4]
Column vector	a=[1; 2; 3; 4];	a=transpose([1,2,3,4])
Sequence	1:10	1..10
Sequence with steps	1:2:10	1..2..10
Decreasing sequence	10:-1:1	10..-1..1
Linearly spaced vectors	linspace(1,10,5)	linspace(1,10,5)
Reverse	reverse(a)	reverse(a)	import “system.q”

Concatenation

Quasar notes: please avoid because this programming approach is not very efficient. More efficient is to create the resulting matrix using zeros(x), then use slice accessors to fill in the different parts.

	MATLAB/Octave	Quasar	Comments
Horizontal concatenation	[a b]	cat(a,b)	import “system.q”
Vertical concatenation	[a; b]	vertcat(a,b)	import “system.q”
Vertical concatenation with sequences	[0:5; 6:11]	vertcat(0..5,6..11)	import “system.q”
Repeating	repmat(a,[1 2])	repmat(a,[1,2])

Matrices

	MATLAB/Octave	Quasar	Comment
The first 10 element X+Y	a(1:10,1:10);	a[0..9,0..9]
Skip the first elements X+Y	a(2:end,2:end);	a[1..size(a,0)-1,1..size(a,1)-1]
The last elements	a(end,end)	a[size(a,0..1)-1]
10th row	a(10,:)	a[9,:]
10th column	a(:,10)	a[:,9]
Diagonal elements	diag(a)	diag(a)	import “system.q”
Matrix with diagonal 1,2,…,10	diag(1..10)	diag(0..9)	import “system.q”
Reshaping	reshape(a,[1 2 3])	reshape(a,[1,2,3])
Transposing	transpose(a)	transpose(a)
Hermitian Transpose	a’	herm_transpose(a)	import “system.q”
Copying matrices	a=b	a=copy(b)	Matrices passed by reference for efficiency
Flip upsize-down	flipud(a)	flipud(a)
Flip left-right	fliplr(a)	fliplr(a)

Matrix dimensions

	MATLAB/Octave	Quasar	Comments
Number of elements	numel(a)	numel(a)
Number of dimensions	ndims(a)	ndims(a)	import “system.q”
Size	size(a)	size(a)
Size along 2nd dimension	size(a,2)	size(a,1)
Size along 2nd & 3rd dimension	[size(a,2),size(a,3)]	size(a,1..2)

Array/matrix creation

	MATLAB/Octave	Quasar	Comment
Identity matrix	eye(3)	eye(3)
Zero matrix	zeros(2,3)	zeros(2,3)
Ones matrix	ones(2,3)	ones(2,3)
Uninitialized matrix	–	uninit(2,3)
Complex identity matrix	complex(eye(3))	complex(eye(3))
Complex zero matrix	complex(zeros(2,3))	czeros(2,3)
Complex ones matrix	complex(ones(2,3))	complex(ones(2,3))
Complex Uninitialized matrix	–	cuninit(2,3)
Vector	zeros(1,10)	zeros(10) or vec(10)	vec enables generic programming
Matrix	zeros(10,10)	zeros(10,10) or mat(10,10)	mat enables generic programming
Cube	zeros(10,10,10)	zeros(10,10,10) or cube(10,10,10)	cube enables generic programming

Maximum and minimum

	MATLAB/Octave	Quasar
Maximum of two values	max(a,b)	max(a,b)
Minimum of two values	min(a,b)	min(a,b)
Maximum of a matrix	max(max(a))	max(a)
Minimum of a matrix along dimension X=1,2	min(a,[],X)	min(a,X-1)
Maximum of a matrix along dimension X=1,2	max(a,[],X)	max(a,X-1)

Matrix assignment

	MATLAB/Octave	Quasar	Comments
Assigning a constant to a matrix slice	a(:,1)=44;	a[:,0]=44
Assigning a vector to a matrix row slice	a(:,1)=0:99	a[:,0]=0..99
Assigning a vector to a matrix column slice	a(1,:)=(0:99)’	a[0,:]=transpose(0..99)
Replace all positive elements by one	a(a>0)=1	a[a>0]=1	New

Boundary handling, clipping (coordinate transforms)

	MATLAB/Octave	Quasar	Comments
Periodic extension	rem(x,N)	periodize(x,N)
Mirrored extension	–	mirror_ext(x,N)
Clamping	–	clamp(x,N)	returns 0 when x=N, otherwise x.
Saturation between 0 and 1	–	sat(x)	returns 0 when x=1, otherwise x.

Fourier transform

	MATLAB/Octave	Quasar	Comments
1D FFT	fft(x)	fft1(x)
2D FFT	fft2(x)	fft2(x)
3D FFT	fftN(x)	fft3(x)
1D inverse FFT	ifft(x)	ifft1(x)
2D inverse FFT	ifft2(x)	ifft2(x)
3D inverse FFT	ifftN(x)	ifft3(x)
1D fftshift	fftshift(x)	fftshift1(x)	import “system.q”
2D fftshift	fftshift(x)	fftshift2(x)	import “system.q”; Works correctly or images with size MxNx3
3D fftshift	fftshift(x)	fftshift3(x)	import “system.q”
1D inverse fftshift	ifftshift(x)	ifftshift1(x)	import “system.q”
2D inverse fftshift	ifftshift(x)	ifftshift2(x)	import “system.q”; Works correctly or images with size MxNx3
3D inverse fftshift	ifftshift(x)	ifftshift3(x)	import “system.q”

Conversion between integer and floating point

	MATLAB/Octave	Quasar
To single	single(x)	float(x)	32-bit FP precision mode only
To double	double(x)	float(x)	64-bit FP precision mode only
To 32-bit signed integer	int32(x)	int(x) or int32(x)	import “inttypes.q”
To 16-bit signed integer	int16(x)	int16(x)	import “inttypes.q”
To 8-bit signed integer	int8(x)	int8(x)	import “inttypes.q”
To 32-bit unsigned integer	uint32(x)	uint32(x)	import “inttypes.q”
To 16-bit unsigned integer	uint16(x)	uint16(x)	import “inttypes.q”
To 8-bit unsigned integer	uint8(x)	uint8(x)	import “inttypes.q”

Posted in Documentation, How to's and tagged Matlab, Must-read Resources, Octave, programming, quasar

1 Comment