OpenCL And Function Pointers

Introduction

Quasar has a nice mechanism to compose algorithms in a generic way, based on function types.

For example, we can define a function that reads an RGB color value from an image as follows:

    RGB_color_image = __device__ (pos : ivec2) -> img[pos[0], pos[1], 0..2]

Now suppose we are dealing with images in some imaginary Yab color space where

    Y = R+2*G+B        G = (Y +   a + b)/4
    a = G - R     or   R = (Y - 3*a + b)/4
    b = G - B          B = (Y +   a - 3*b)/4

we can define a similar read-out function that automatically converts Yab to RGB:

    RGB_Yab_image = __device__ (pos : ivec2) -> 
        [dotprod(img[pos[0],pos[1],0..2], [1/4,-3/4,1/4]),
         dotprod(img[pos[0],pos[1],0..2], [1/4,1/4,1/4]),
         dotprod(img[pos[0],pos[1],0..2], [1/4,1/4,-3/4])]

Consequently, both functions have the same type:

    RGB_color_image : [__device__ ivec2 -> vec3]
    RGB_Yab_image   : [__device__ ivec2 -> vec3]

We can then build an algorithm that works with both RGB and Yab images:

    function [] = __kernel__ brightness_contrast_enhancer(
        brightness : scalar, 
        contrast : scalar, 
        x : [__device__ ivec2 -> vec3], 
        y : cube,
        pos : ivec2)

        y[pos[0],pos[1],0..2] = x(pos)*contrast + brightness
    end

    match input_fmt with
    | "RGB" ->
        fn_ptr = RGB_color_image
    | "Yab" ->
        fn_ptr = RGB_Yab_image
    | _ ->
        error("Sorry, input format is currently not defined")
    end

    y = zeros(512,512,3)
    parallel_do(size(y),brightness,contrast,fn_ptr,y,brightness_contrast_enhancer)

Although this approach is very convenient and also allows different algorithms to be constructed easily (for example, for the YCbCr or Lab color spaces), there are a number of disadvantages:

  1. The C implementation typically requires the use of function pointers. However, OpenCL currently does not support function pointers, so these programs cannot be executed on OpenCL-capable hardware.
  2. Although CUDA supports function pointers, in some circumstances they trigger an internal compiler error (an NVCC bug). These cases are very hard to work around.
  3. In CUDA, kernel functions that use function pointers may be up to 2x slower than the same code without function pointers (e.g. with the function inlined).

Manual solution

The (manual) solution is to use function specialization:

    match input_fmt with
    | "RGB" ->
        kernel_fn = $specialize(brightness_contrast_enhancer, x==RGB_color_image)
    | "Yab" ->
        kernel_fn = $specialize(brightness_contrast_enhancer, x==RGB_Yab_image)
    | _ ->
        error("Sorry, input format is currently not defined")
    end

    y = zeros(512,512,3)
    parallel_do(size(y),brightness,contrast,y,kernel_fn)

Here, the function brightness_contrast_enhancer is specialized with RGB_color_image and RGB_Yab_image, respectively. These functions are simply substituted into the kernel function code, effectively eliminating the function pointers.

Automatic solution

The Quasar compiler now has an option UseFunctionPointers, which can have the following values:

  • Always: function pointers are always used (generates more compact code)
  • SmartlyAvoid: avoid function pointers whenever possible (less compact code)
  • Error: generate a compiler error when a function pointer cannot be avoided.

In the example of the manual solution, the function pointer cannot be avoided. However, when the code block is rewritten as follows:

    y = zeros(512,512,3)
    match input_fmt with
    | "RGB" ->
        parallel_do(size(y),brightness,contrast,RGB_color_image,y,brightness_contrast_enhancer)
    | "Yab" ->
        parallel_do(size(y),brightness,contrast,RGB_Yab_image,y,brightness_contrast_enhancer)
    | _ ->
        error("Sorry, input format is currently not defined")
    end

the compiler is able to automatically specialize the function brightness_contrast_enhancer for RGB_color_image and RGB_Yab_image (in the SmartlyAvoid and Error modes).

Generic Programming – Opening New Frontiers

To address a limitation of Quasar, in which __kernel__ functions sometimes had to be duplicated for different container types (e.g. vec[int8], vec[scalar], vec[cscalar]), there is now finally support for generic programming.

Consider the following program that extracts the diagonal elements of a matrix and that is supposed to deal with arguments of either type mat or type cmat:

    function y : vec = diag(x : mat)
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = zeros(N)
        parallel_do(size(y), __kernel__ 
            (x:mat, y:vec, pos:int) -> y[pos] = x[pos,pos])
    end

    function y : cvec = diag(x : cmat)
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = czeros(N)
        parallel_do(size(y), __kernel__ 
            (x:cmat, y:cvec, pos : int) -> y[pos] = x[pos,pos])
    end

Although function overloading solves part of the problem here (at least from the user’s perspective), the body of diag is still duplicated. In general, we would like to write functions that “work” irrespective of their underlying type.

The solution is to use Generic Programming. In Quasar, this is fairly easy to do:

    function y = diag[T](x : mat[T])
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = vec[T](N)
        parallel_do(size(y), __kernel__ 
            (pos) -> y[pos] = x[pos,pos])
    end

As you can see, the types in the function signature have simply been omitted. The same holds for the __kernel__ function.

In this example, the type parameter T is required because it is needed for the construction of the vector y (through the vec[T] constructor). If T==scalar, vec[T] reduces to zeros, while if T==cscalar, vec[T] reduces to czeros (complex-valued zeros). In case the type parameter is not required, it can be dropped, as in the following example:

    function [] = copy_mat(x, y)
        assert(size(x)==size(y))
        parallel_do(size(y), __kernel__
            (pos) -> y[pos] = x[pos])
    end

Remarkably, this is still a generic function in Quasar; no special syntax is needed here.

Note that in previous versions of Quasar, all kernel function parameters needed to be explicitly typed. This is no longer the case: the compiler deduces the parameter types from the calls to the function and by applying the internal type inference mechanism. The same holds for __device__ functions.

When calling diag with two different types of parameters (for example once with x:mat and a second time with x:cmat), the compiler will make two generic instantiations of diag. Internally, the compiler may either:

  1. Keep the generic definition (type erasure):
    function y = diag(x)
  2. Make two instances of diag (reification):
    function y : vec  = diag(x : mat)
    function y : cvec = diag(x : cmat)
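
For example, the following calls would trigger the two instantiations (a small sketch; complex(.) as used later in this article):

    x1 = ones(4,4)             % x1 : mat
    x2 = complex(ones(4,4))    % x2 : cmat
    d1 = diag(x1)              % instantiates diag for x : mat  (y : vec)
    d2 = diag(x2)              % instantiates diag for x : cmat (y : cvec)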

The compiler combines these two techniques in a transparent way, such that:

  1. For kernel functions, explicit code is generated for the specific data types;
  2. For less performance-critical host code, type erasure is used (to avoid code duplication).

The selection of the code to run is made at compile-time; correspondingly, the Spectroscope Debugger needs special support for this.

Of course, when the diag function is called with a variable whose type cannot be determined at compile-time, a compiler error is generated:

    The type of the arguments ('op') needs to be fully defined for this function call!

This is then similar to the original handling of kernel functions.

Extensions

There are several extensions possible to fine-tune the behavior of the generic code.

Type classes

Type classes allow the type range of the input parameters to be narrowed. For example:

    function y = diag(x : [mat|cmat])

This construction only allows variables of type mat or cmat to be passed to the function. This is useful when it is known in advance which types are relevant (in this case, a real-valued or complex-valued matrix).

Equivalently, type class aliases can be defined. The type:

    type AllInt : [int|int8|int16|int32|uint8|uint32|uint64]

groups integer types that exist in Quasar. Type classes are also useful for defining reductions:

    type RealNumber: [scalar|cube|AllInt|cube[AllInt]]
    type ComplexNumber: [cscalar|ccube]

    reduction (x : RealNumber) -> real(x) = x   

Without type classes, the reduction would need to be written four times, once for each member of the type class.
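
Spelled out, these four reductions would be:

    reduction (x : scalar) -> real(x) = x
    reduction (x : cube) -> real(x) = x
    reduction (x : AllInt) -> real(x) = x
    reduction (x : cube[AllInt]) -> real(x) = x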

Levels of genericity

There are three levels of genericity (for which generic instances can be constructed):

  1. Type constraints: a type constraint binds the type of an input argument of the function.
  2. Value constraints: a value constraint binds an input argument to an explicit value.
  3. Logic predicates that are neither type nor value constraints.

As an example, consider the following generic function:

    function y = __device__ soft_thresholding(x, T)
        if abs(x)>=T
            y = (abs(x) - T) * (x / abs(x))
        else
            y = 0
        endif
    end

    reduction (x : scalar) -> abs(x) = x where x >= 0

Now, we can make a specialization of this function to a specific type:

    soft_thresholding_real = $specialize(soft_thresholding,
        type(x,"scalar") && type(T, "scalar"))

But also for a fixed threshold:

    soft_thresholding_T = $specialize(soft_thresholding,T==10)

We can even go one step further and specify that x>0:

    soft_thresholding_P = $specialize(soft_thresholding,x>0)

Everything combined, we get:

    soft_thresholding_E = $specialize(soft_thresholding,
        type(x,"scalar") && type(T,"scalar") && T==10 && x>0)

Based on this knowledge (and the above reduction, which replaces abs(x) by x since x>0), the compiler will then generate the following function:

    function y = __device__ soft_thresholding_E(x : scalar, T : scalar)
        if x >= 10
            y = x - 10
        else
            y = 0
        endif
    end

There are two ways of performing this type of specialization:

  1. Using the $specialize function. Note that this approach is only recommended for testing.
  2. Alternatively, the specializations can be performed automatically, using the assert function inside the calling function:

    function [] = __kernel__ denoising(x : mat, y : mat, pos : ivec2)
        assert(x[pos]>0)
        y[pos] = soft_thresholding(x[pos], 10)
    end

Reductions with where-clauses

Recently, reduction where-clauses have been implemented. The where-clause is a condition that determines at runtime (or at compile-time) whether a given reduction may be applied. There are two main use cases for where-clauses:

  1. To avoid invalid results: in some circumstances, applying certain reductions may lead to invalid results (for example, a real-valued sqrt function applied to a complex-valued input, or the derivative of tan(x) at pi/2…)
  2. For optimization purposes.

For example:

reduction (x : scalar) -> abs(x) = x where x >= 0
reduction (x : scalar) -> abs(x) = -x where x < 0

In case the compiler has no information on the sign of x, the following mapping is applied:

abs(x) -> x >= 0 ? x : (x < 0 ? -x : abs(x))

The evaluation of the where-clauses of the reductions is then performed at runtime. However, when the compiler has information on x (e.g. through assert(x <= -1)), the mapping becomes much simpler:

abs(x) -> -x

Note that the abs(.) function is a trivial example; in practice, the reductions can be more complicated:

reduction (x : scalar) -> some_op(x, a) = superfast_op(x, a) where 0 <= a && a < 1
reduction (x : scalar) -> some_op(x, a) = accurate_op(x, a) where 1 <= a
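
By analogy with the abs(.) mapping above, when the compiler has no information on a, the generated mapping would look like this (our sketch):

some_op(x, a) -> (0 <= a && a < 1) ? superfast_op(x, a) : (1 <= a ? accurate_op(x, a) : some_op(x, a))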

Assert the Truth, Unassert the Untruth

Quasar has a logic system that is centered around the assert function; it can be useful for several reasons:

  • Assertions can be used for testing a specified condition, resulting in a runtime error (error) if the condition is not met:
    assert(positiveNumber>0,"positiveNumber became negative while it shouldn't")
  • Assertions can also help the optimization system. For example, the type of variables can be “asserted” using type assertions:
    assert(type(cubicle, "cube[cube]"))

    The compiler can then verify the type of the variable cubicle; if the type is not known at this stage, the knowledge is inserted into the compiler, resulting in the compilation message:

    assert.q - Line 4: [info] 'cubicle' is now assumed to be of type 'cube[cube]'.

    At runtime, the assert function behaves as usual, resulting in an error if the condition is not met.

  • Assertions are useful in combination with reduction where-clauses (a sketch follows after this list):
    reduction (x : scalar) -> abs(x) = x where x >= 0

    If we previously assert that x is a positive number, this assertion eliminates the runtime check for x >= 0.

  • Assertions can be used to cut branches:
    assert(x > 0 && x < 1)
    if x < 0
        ...
    endif   

    Here, the compiler will determine that the if-block can never be executed, so it removes the entire content of the if-block, resulting in the compilation message:

    assert.q - Line 10: [info] if-branch is cut due to the assertions `x > 0 && x < 1`.

    Similarly, pre-processor branches can be constructed with this approach.

  • Assertions can be combined with generic function specialization. More about this later.
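
As a small illustration of the reduction where-clause interplay mentioned above (a minimal sketch with a hypothetical function f):

function y = f(x : scalar)
    assert(x >= 0)   % knowledge inserted at compile-time
    y = abs(x)       % where-clause x >= 0 is satisfied: compiled as y = x
end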

It is not possible to fool the compiler: if the compiler can determine at compile-time that an assertion can never be met, an error is generated and it will not even be possible to run the program.
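
For example (a minimal sketch):

symbolic a
assert(a > 0)
assert(a < 0)   % compile-time error: contradictory with 'a > 0'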

Logic system

The Quasar compiler now has a propositional logic system that is able to “reason” about previous assertions. Different assertions can be combined using the logical operators AND (&&), OR (||) and NOT (!).

There are three meta functions that help with assertions:

  • $check(proposition) checks whether the proposition can be satisfied, given the previous set of assertions, and results in one of three values: "Valid", "Satisfiable" or "Unsatisfiable".
  • $assump(variable) lists all assertions that are currently known about a variable, including the implicit type predicates that are obtained through type inference. Note that the result of $assump is an expression, so for visualization it may be necessary to convert it to a textual representation using $str(.) (to prevent the expression from being evaluated).
  • $simplify(expr) simplifies logic expressions based on the knowledge that was inserted through assertions (see the sketch below).
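
For example, $simplify could be used as follows (a minimal sketch; the printed result is illustrative):

symbolic x, b
assert(x > 2)
% x > 1 is implied by x > 2, so it should drop out of the conjunction
print $str($simplify(x > 1 && b))   % expected to print: b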

Types of assertions

There are different types of assertions that can be combined in a transparent way.

Equalities

The simplest assertions are the equality assertions a==b. For example:

symbolic a, b
assert(a==4 && b==6)

assert($check(a==5)=="Unsatisfiable")
assert($check(a==4)=="Valid")
assert($check(a!=4)=="Unsatisfiable")
assert($check(b==6)=="Valid")
assert($check(b==3)=="Unsatisfiable")
assert($check(b!=6)=="Unsatisfiable")
assert($check(a==4 && b==6)=="Valid")
assert($check(a==4 && b==5)=="Unsatisfiable")
assert($check(a==4 && b!=6)=="Unsatisfiable")
assert($check(a==4 || b==6)=="Valid")
assert($check(a==4 || b==7)=="Valid")
assert($check(a==3 || b==6)=="Valid")
assert($check(a==3 || b==5)=="Unsatisfiable")
assert($check(a!=4 || b==6)=="Valid")

print $str($assump(a)),",",$str($assump(b)) % prints (a==4),(b==6)

Here, we use symbolic to declare symbolic variables (variables that are not “evaluated”, i.e. translated into their actual value, since they do not have a specific value). Next, the assert function is used to test whether the $check(.) function works correctly (self-checking).

Inequalities

The propositional logic system can also work with inequalities:

symbolic a
assert(a>2 && a<4)
assert($check(a>1)=="Valid")
assert($check(a>3)=="Satisfiable")
assert($check(a<3)=="Satisfiable")
assert($check(a<2)=="Unsatisfiable")
assert($check(a>4)=="Unsatisfiable")
assert($check(a<=2)=="Unsatisfiable")
assert($check(a>=2)=="Valid")
assert($check(a<=3)=="Satisfiable")
assert($check(!(a>3))=="Satisfiable")

Type assertions

As in the above example:

assert(type(cubicle, "cube[cube]"))

Please note that assertions should not be used as a means of variable type declaration. To declare the type of a variable, there is a more straightforward approach:

cubicle : cube[cube]

Type assertions can be used in functions that accept generic ?? arguments, for example to ensure that a cube[cube] is passed depending on another parameter.
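
A minimal sketch (hypothetical function and parameter names):

function [] = process(data : ??)
    % ensure that a cube[cube] was passed before using nested indexing
    assert(type(data, "cube[cube]"))
    print size(data[0])
end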

User-defined properties of variables

It is also possible to define “properties” of variables, using a symbolic declaration. For example:

symbolic is_a_hero, Jan_Aelterman

Then you can assert:

assert(is_a_hero(Jan_Aelterman))

Correspondingly, if you perform the test:

print $check(is_a_hero(Jan_Aelterman))   % Prints: Valid
print $check(!is_a_hero(Jan_Aelterman))  % Prints: Unsatisfiable

If you then try to assert the opposite:

assert(!is_a_hero(Jan_Aelterman))

The compiler will complain:

assert.q - Line 119: NO NO NO I don't believe this, can't be true! 
Assertion '!(is_a_hero(Jan_Aelterman))' is contradictory with 'is_a_hero(Jan_Aelterman)'

Unassert

In some cases, it is necessary to undo certain assertions that were previously made. For this task, the function unassert can be used:

unassert(propositions)

This function only has a meaning at compile-time; at run-time nothing needs to be done.

For example, if you wish to reconsider the assertion is_a_hero(Jan_Aelterman) you can write:

unassert(is_a_hero(Jan_Aelterman))

print $check(is_a_hero(Jan_Aelterman))   % Prints: Satisfiable
print $check(!is_a_hero(Jan_Aelterman))  % Prints: Satisfiable

Alternatively you could have written:

unassert(!is_a_hero(Jan_Aelterman))

print $check(is_a_hero(Jan_Aelterman))   % Prints: Valid
print $check(!is_a_hero(Jan_Aelterman))  % Prints: Unsatisfiable

Boundary access modes in Quasar

In earlier versions of Quasar, the boundary extension modes (such as 'mirror and 'circular) only affected __kernel__ and __device__ functions.

To improve transparency, this has recently changed. As a consequence, the following get access modes need to be supported by the runtime:

(no modifier)     % error (default), or zero extension inside kernel/device functions
safe              % zero extension
mirror            % mirroring near the boundaries
circular          % circular boundary extension
clamped           % keep boundary values
unchecked         % results undefined when reading outside
checked           % error when reading outside
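
For example, a small sketch of the read-out behavior:

A : vec'circular = [1, 2, 3]
print A[-1]   % circular extension: wraps around to A[2], prints 3
B : vec'clamped = [1, 2, 3]
print B[-1]   % clamped: keeps the boundary value B[0], prints 1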

Implementation details: there is a bit of work involved, because this needs to be done for all data types (int8, int16, int32, uint8, uint16, uint32, single, double, UDT, object, …), for different dimensions (vec, mat, cube), and for both matrix getters and slice accessors. This is perhaps the reason that you will not see this feature in other programming languages: 5 x 10 x 3 x 2 = 300 combinations (i.e. functions to be written). Luckily, the use of generics in C# alleviates the problem, reducing it (after a bit of research) to 6 combinations, where each algorithm has 3 generic parameters. The same holds for the CUDA computation engine.

Note that the modifier unchecked should be used with care: only when you are 100% sure that the function works properly and that there are no memory accesses outside the matrix. A good approach is to use checked first; once you find that no errors occur, you can switch to unchecked in order to gain a little speed-up (typically 20%-30% on memory accesses).

Now I would like to point out that the access modifiers are not part of the type of the object itself, as the following example illustrates:

A : mat'circular = ones(10, 10)
B = A    % boundary access mode: B : mat'circular
C : mat'unchecked = A

Here, both B and C hold a reference to the matrix A. However, B copies the access modifier from A (through type inference), while C overrides it. The access modifiers for A, B and C are therefore circular, circular and unchecked, respectively. Even though there is only one matrix involved, there are effectively two ways of accessing its data.

Now, to make things even more complicated, there are also put access modes. For implementation complexity reasons (and, more importantly, to avoid unforeseen data races in parallel algorithms), the number of options has been reduced to three:

safe (default)      % ignore writing outside the boundaries
unchecked           % writing outside the boundaries = crash!
checked             % error when writing outside the boundaries

This means that 'circular, 'mirror, and 'clamped are mapped to 'safe when writing values to the matrix. For example:

A : vec'circular = [1, 2, 3]
A[-1] = 0   % value ignored

The advantage is that you can then write code such as:

A : mat'circular = imread("lena_big.tif")[:,:,1]
B = zeros(size(A))
B[-127..128,-127..128] = A[-127..128,-127..128]

Construction of Cell matrices

In Quasar, it is possible to construct cell vectors/matrices, similar to MATLAB:

A = `1,2,3j,zeros(4,4)´

(MATLAB syntax uses braces, as in {1,2,3j,zeros(4,4)}.)

Now, I have found that there are a few problems with the cell construction syntax:

  • The prime (´) cannot be represented in the ANSI character set and requires UTF-8 encoding. The Redshift code editor uses UTF-8 by default, but when editing Quasar code with other editors, problems can occur.
  • The prime (´) is not present on QWERTY keyboards (in contrast to the Belgian AZERTY keyboard). I suspect this is one of the reasons that the cell construction syntax is rarely used.
  • Even the documentation system has problems with the prime symbol, which may be why you cannot see it right now. In other words, it is an evil symbol and all evil should be extinct.

To solve these problems, the Quasar parser now also accepts an apostrophe ' (ASCII character 39, or 27h in hex) as the closing character:

A = `1,2,3j,zeros(4,4)'

Because the apostrophe character exists in ANSI and on QWERTY keyboards, the use of cell matrix construction is greatly simplified.


Note: the old-fashioned alternative was to construct cell matrices using the function cell, or vec[vec], vec[mat], vec[cube], … For example:

A = cell(4)
A[0] = 1j
A[1] = 2j
A[2] = 3j
A[3] = 4j

Note that this notation is not very elegant compared to A = `1j,2j,3j,4j'. It also does not allow the compiler to fully determine the type of A: the compiler will find type(A) == "vec[??]" rather than type(A) == "cvec". In the following section, we discuss the type inference in more detail.


Type inference

Another new feature of the compiler is that it attempts to infer the type of cell matrices. In earlier versions, all cell matrices defined with the above syntax had type vec[??]. This has now changed, as illustrated by the following example:

    a = `[1, 2],[1, 2, 3]'
    print type(a)            % Prints vec[vec[int]]

    b = `(x->2*x), (x->3*x), (x->4*x)'
    print type(b)            % Prints [ [??->??] ]

    c = ` `[1, 2],[1,2]',`[1, 2, 3],[4, 5, 6]' '
    print type(c)            % Prints vec[vec[vec[int]]]

    d = ` [ [2, 1], [1, 2] ], [ [4, 3], [3, 4] ]'
    print type(d)            % Prints vec[mat]

    e = `(x:int->2*x), (x:int->3*x), (x:int->4*x)'
    print type(e)            % Prints vec[ [int->int] ]

This allows cell matrices constructed with the above syntax to be used from kernel functions. A simple example:

    d = `eye(4), ones(4,4)'

    parallel_do(size(d[0]), d, 
        __kernel__ (d : vec[mat], pos : ivec2) -> d[0][pos] += d[1][pos])
    print d[0]

The output is:

       [ [2,1,1,1],
         [1,2,1,1],
         [1,1,2,1],
         [1,1,1,2] ]

Dynamic Memory Allocation on CPU/GPU

In some algorithms, it is desirable to dynamically allocate memory inside __kernel__ or __device__ functions, for example:

  • When the algorithm processes blocks of data for which the maximum size is not known in advance.
  • When the amount of memory that an individual thread uses is too large to fit in the shared memory or in the registers. The shared memory of the GPU is typically 32K, which has to be shared between all threads in one block. For single-precision floating point vectors vec or matrices mat and 1024 threads per block, this amounts to a maximum of 32K/(1024*4) = 8 elements per thread. The register memory is of the same order: 32K for CUDA compute architecture 2.0.

Example: brute-force median filter

So, suppose that we want to compute a brute-force median filter for an image (note that there exist much more efficient algorithms based on image histograms, see immedfilt.q). The filter could be implemented as follows:

  1. we extract a local window per pixel in the image (for example of size 13x13).
  2. the local window is then passed to a generic median function, that sorts the intensities in the local window and returns the median.

The problem is that there may not be enough register memory for holding a local window of this size: 1024 threads x 13 x 13 x 4 bytes = 692K!

The solution is to use a new Quasar runtime & compiler feature: dynamic kernel memory. In practice, this is very simple: first, ensure that the compiler setting “kernel dynamic memory support” is enabled. Second, matrices can then be allocated through the regular matrix functions zeros, complex(zeros(.)), uninit and ones.

For the median filter, the implementation could be as follows:

% Function: median
% Computes the median of an array of numbers
function y = __device__ median(x : vec)
    % to be completed
end

% Function: immedfilt_kernel
% Naive implementation of a median filter on images
function [] = __kernel__ immedfilt_kernel(x : mat, y : mat, W : int, pos : ivec2)

    % Allocate dynamic memory (note that the size depends on W,
    % which is a parameter for this function)
    r = zeros((W*2)^2)

    for m=-W..W-1
        for n=-W..W-1
            r[(W+m)*(2*W)+W+n] = x[pos+[m,n]]
        end
    end     

    % Compute the median of the elements in the vector r
    y[pos] = median(r)
end
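
The median function itself was left open above; a minimal sketch based on insertion sort (adequate for small windows, and assuming numel(.) and while loops are available in device code) could be:

function y = __device__ median(x : vec)
    N = numel(x)
    % insertion sort of the local window (simple, fine for small N)
    for i=1..N-1
        v = x[i]
        j = i-1
        while j >= 0 && x[j] > v
            x[j+1] = x[j]
            j = j-1
        end
        x[j+1] = v
    end
    y = x[int(N/2)]   % middle element
end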

For W=4 this algorithm is illustrated in the figure below:

[Figure 1: dynamic memory allocation inside a kernel function.]

Parallel memory allocation algorithm

To support dynamic memory, a special parallel memory allocator was designed. The allocator has the following properties:

  1. The allocation/disposal is distributed in space and does not use lists/free-lists of any sort.
  2. The allocation algorithm is designed for speed and correctness.

To accomplish such a design, a number of limitations were needed:

  1. The minimal memory block that can be allocated is 1024 bytes. If the requested size is smaller, it is rounded up to 1024 bytes.
  2. When you try to allocate a block whose size is not a power-of-two multiple of 1024 bytes (i.e. 1024*2^N with N integer), the size is rounded up to the next power-of-two multiple of 1024 bytes.
  3. The maximal memory block that can be allocated is 32768 bytes (= 2^15 bytes). Taking into account that this can be done per pixel in an image, this is actually quite a lot!
  4. The total amount of memory that can be allocated from inside a kernel function is also limited (typically 16 MB). This restriction is mainly to ensure program correctness and to keep memory free for other processes in the system.

It is possible to compute an upper bound for the amount of memory that will be allocated at any given point in time. Suppose that we have a kernel function that allocates a cube of size M*N*K; then:

max_memory = NUM_MULTIPROC * MAX_RESIDENT_BLOCKS * prod(blkdim) * 4*M*N*K

where prod(blkdim) is the number of threads in one block, MAX_RESIDENT_BLOCKS is the maximal number of resident blocks per multiprocessor and NUM_MULTIPROC is the number of multiprocessors.

So, suppose that we allocate a matrix of size 8x8 on a GeForce 660 Ti; then:

max_memory = 5 * 16 * 512 * 4 * 8 * 8 ≈ 10.5 MB

This is still much smaller than what would be needed with pre-allocation (in that case, the amount would depend on the image dimensions!)

Comparison to CUDA malloc

CUDA has built-in malloc(.) and free(.) functions that can be called from device/kernel functions; however, after a few performance tests and after seeing warnings on the CUDA forums, I decided not to use them. The following is the result of a comparison between the Quasar dynamic memory allocation algorithm and that of NVidia:

    Granularity: 1024
    Memory size: 33554432
    Maximum block size: 32768
    Start - test routine

    Operation took 183.832443 msec [Quasar]     
    Operation took 1471.210693 msec [CUDA malloc]
    Success

I obtained similar results for other tests. As you can see, memory allocation is about 8 times faster with the new approach than with the NVidia allocator.

Why it is better to avoid dynamic memory

Even though the memory allocation is quite fast, it is better to avoid dynamic memory in order to obtain the best performance:

  • The main issue is that kernel functions using dynamic memory also require several read/write accesses to the global memory. Because dynamically allocated memory typically has a size of hundreds of KBs, the data will not fit in the cache (of size 16KB to 48KB). Correspondingly, the cost of using dynamic memory lies in the associated global memory accesses!
  • Please note that the compiler-level handling of dynamic memory is currently in development. As long as the memory is “consumed” locally, as in the above example (i.e. not written to external data structures), there should not be a problem.

Matrix Conditional Assignment

In MATLAB, it is fairly simple to assign to a subset of a matrix, for example the values that satisfy a given condition. For example, saturation can be implemented as follows:

A[A < 0] = 0
A[A > 1] = 1

In Quasar, this can be achieved with:

A = saturate(A)

However, the situation can be more complex, and then there is no direct equivalent in Quasar. For example,

A[A > B] = C

where A, B, C are all matrices. The trick is to define a reduction (now in system.q):

type matrix_type : [vec|mat|cube|cvec|cmat|ccube]
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a > b] = c) = (x += (c - x) .* (a > b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a < b] = c) = (x += (c - x) .* (a < b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a <= b] = c) = (x += (c - x) .* (a <= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a >= b] = c) = (x += (c - x) .* (a >= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a == b] = c) = (x += (c - x) .* (a == b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a != b] = c) = (x += (c - x) .* (a != b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a && b] = c) = (x += (c - x) .* (a && b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a || b] = c) = (x += (c - x) .* (a || b))

The first line defines a “general” matrix type, which is then used in the subsequent reductions. The reductions simply match patterns of the form:

x[a #op# b] = c

and replace them with the appropriate Quasar expression. The last two reductions are a trick to get the conditional assignment working with boolean expressions as well, such as:

A[A<-0.1 || A>0.1] = 5
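
With the last reduction above, this statement is rewritten to:

A += (5 - A) .* (A < -0.1 || A > 0.1)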

Note that, on the other hand:

B = A[A<-0.1]

will currently result in a runtime error (this syntax is not defined yet).

Debugging Quasar kernel functions

Quasar Redshift has recently been equipped with a parallel debugger for kernel functions. In the parallel debugging mode, the kernel function is emulated on a “virtual” CUDA-like device architecture, with device parameters taken from the GPU installed in your system. This permits device-independent debugging of kernel functions.

To enter the parallel debugging mode, it suffices to place a breakpoint inside a kernel function (or a device function) and to start the program. When the breakpoint is hit, Redshift jumps to the new parallel debugger pane (see below).

In the left pane, you see a visualization of the shared memory (through the variable xr), as well as the current position and block position (pos and blkpos, respectively). In the middle pane at the bottom, some information about the kernel function is given. The occupancy is an indicator of the performance of the kernel function (for example, a low occupancy may be caused by selecting block sizes that are too small). Occupancy alone, however, does not give the full picture: there exist kernel functions designed to achieve a low occupancy while offering very high throughput (see for example the Volkov GTC 2010 talk).

In the right pane, an overview of the scheduled blocks and active threads is given. For active threads, the barrier index is also shown: barrier 0 corresponds to all statements before the first barrier, barrier 1 to the statements between the first and second barrier, and so on. Using the context menu of the list box, it is possible to switch to another thread or to put a breakpoint at a given block.

The parallel debugger allows you to:

  • Step through the kernel function (e.g. using F8 or shift-F8) [code editor]
  • Place (conditional) breakpoints in kernel functions [code editor]
  • Traverse different positions in the data array (or pixels in an image) [parallel debugger pane]
  • Jump to the next thread/block [parallel debugger pane]
  • See which threads are currently active/inactive [parallel debugger pane]
  • Inspect the values of all variables that are used inside a kernel function [variable watch]
  • Visualize the content of the shared memory [variable watch]
  • Record videos of the kernel function in action [variable watch]

The statement evaluation order is not necessarily linear, especially when thread synchronization is used (through syncthreads). syncthreads places a barrier that all threads within a block must reach before execution continues past it.
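
To make the barrier sections concrete, here is a sketch of a kernel with a single barrier (a hypothetical kernel; shared(.) allocates the shared memory that the debugger visualizes through variables such as xr):

function [] = __kernel__ blk_sum(x : vec, y : vec, pos : int, blkpos : int, blkdim : int)
    xr = shared(blkdim)    % per-block shared memory, one element per thread
    xr[blkpos] = x[pos]    % barrier 0: statements before the first barrier
    syncthreads            % all threads of the block must arrive here
    if blkpos == 0         % barrier 1: statements after the first barrier
        s = 0.0
        for i=0..blkdim-1
            s += xr[i]
        end
        y[pos] = s
    endif
end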

Internally, kernel functions are interpreted on the CPU, in order to allow full inspection of all variables, placing breakpoints at arbitrary positions within a block, etc. Moreover, for clarity, the threads in between two barriers are executed serially. When a barrier (or the end of the kernel function) is reached, the debugger switches to the next available thread.

The serial execution definitely makes it easy to follow what your program is doing; however, in the case of data races, it may also hide potential memory access conflicts (due to the serialization). For this reason, there is also an option to “mimic parallel execution”, in which random thread context switches are made.

Multi-level breaks in sequential loops

Sometimes, small language features can make a lot of difference (in terms of code readability, productivity, etc.). In Quasar, multi-dimensional for-loops are quite common. Recently, I came across a missing feature for dealing with them.

Suppose we have a multi-dimensional for-loop, as in the following example:

for m=0..511
    for n=0..511
        im_out[m,n] = 255-im[m,n]
        if m==128
            break
        endif
        a = 4
    end
end 

Suppose that we want to break out of both loops, as in the above code. This is useful for stopping the processing at a certain point. There is one caveat: the break statement only applies to the innermost loop that surrounds it. In the above example, the processing of row 128 is stopped at column 0 (the loop over n is interrupted), but it is resumed starting from row 129. Some programmers are not aware of this, and sometimes it leads to less efficient code, as in the following example:

for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break
        end
    end
end

Here we perform a sequential search to find the first matrix element for which V[j,k] != 0. When this element is found, the search is stopped. However, because the break statement only stops the inner loop, the outer loop is still executed several times (potentially leading to a performance degradation).

1. Solution with extra variables

To make sure that we break out of the outer loop, we would have to introduce an extra variable:

break_outer = false
for j = 0..size(V,0)-1
    for k=0..size(V,1)-1
        if V[j,k]
            found=[j,k]
            break_outer = true
            break
        end
    end
    if break_outer
        break
    endif
end

It is clear that this approach is not very readable. The additional variable break_outer is also a bit problematic (in the worst case, if the compiler cannot optimize it away, extra stack memory/registers are required).

2. Encapsulation in a function

An obvious alternative is the use of a function:

function found = my_func()
    for j = 0..size(V,0)-1
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break
            end
        end
    end
end
found = my_func()

However, the use of a function is sometimes not desirable in this case. It also involves extra work, such as adding the input/output parameters and adding a function call.

3. New solution: labeling loops

To avoid the above problems, it is now possible to label for-loops (as in e.g. Ada or Java):

outer_loop:
    for j = 0..size(V,0)-1
    inner_loop:
        for k=0..size(V,1)-1
            if V[j,k]
                found=[j,k]
                break outer_loop
            end
        end
    end

Providing labels for for-loops is optional, i.e. you only have to do it when it is needed. The new syntax is also supported by the following finding in the programming literature:

In 1973 S. Rao Kosaraju refined the structured program theorem by proving that it’s possible to avoid adding additional variables in structured programming, as long as arbitrary-depth, multi-level breaks from loops are allowed. [11]

Note that Quasar has no goto labels (and never will). The reasons are:

  • Control flow blocks can always be used instead. Control flow blocks offer more visual cues which enhances the readability of the code.
  • At the compiler-level, goto labels may make it more difficult to optimize certain operations (e.g. jumps to different scopes).

Remarks:

  • This applies to the keyword continue as well (a sketch follows after this list).
  • Labels can be applied to for, repeat-until and while loops.
  • In the future, more compiler functionality may be added to make use of the loop labels. For example, it may become possible to indicate that multiple loops (with the specified labels) must be merged.
  • It is not possible to break out of a parallel loop! The reason is that the execution of the different threads is (usually) non-deterministic, hence using breaks in parallel loops would give non-deterministic results.
  • However, loop labels can be attached to both serial and parallel loops. A useful situation is an iterative algorithm with an inner and an outer loop.
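
For instance, a labeled continue skips to the next iteration of the outer loop (condition and function names are illustrative):

outer_loop:
    for i=0..M-1
        for j=0..N-1
            if skip_rest_of_row(i,j)
                continue outer_loop   % proceed immediately with the next i
            endif
            process(i,j)
        end
    end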