Posts Tagged ‘Matrix access’

Boundary access modes in Quasar

In earlier versions of Quasar, the boundary extension modes (such as 'mirror, 'circular) only affected the __kernel__ and __device__ functions.

To improve transparency, this is has recently changed. This has the consequence that the following get access modes needed to be supported by the runtime:

(no modifier)     % =error (default) or zero extension (kernel, device function)
safe              % zero extension
mirror            % mirroring near the boundaries
circular          % circular boundary extension
clamped           % keep boundary values
unchecked         % results undefined when reading outside
checked           % error when reading outside

Implementation details: there is a bit of work involved, because it needs to be done for all data types (int8, int16, int32, uint8, uint16, uint32, single, double, UDT, object, …), for different dimensions (vec, mat, cube), and for both matrix getters / slice accessors. perhaps the reason that you will not see this feature implemented in other programming languages: 5 x 10 x 3 x 2 = 300 combinations (=functions to be written). Luckily the use of generics in C# alleviates the problem, reducing it (after a bit of research) to 6 combinations (!) where each algorithm has 3 generic parameters. Idem for CUDA computation engine.

Note that the modifier unchecked should be used with care: only when you are 100% sure that the function is working properly and that there are no memory accesses outside the matrix. A good approach is to use checked first, and when you find out that there never occur any errors, you can switch to unchecked, in other to gain a little speed-up (typically 20%-30% on memory accesses).

Now I would like to point out that the access modifiers are not part of type of the object itself, as the following example illustrates:

A : mat'circular = ones(10, 10)
B = A    % boundary access mode: B : mat'circular
C : mat'unchecked = A

Here, both B and C will hold a reference to the matrix A. However, B will copy the access modifier from A (through type inference) and C will override the access modifier of A. The result is that the access modifiers for A, B and C are circular, circular and unchecked, respectively. Even though there is only one matrix involved, there are effectively two ways of accessing the data in this matrix.

Now, to make things even more complicated, there are also put access modes. But for implementation complexity reasons (and more importantly, to avoid unforeseen data races in parallel algorithms), the number of options have been reduced to three:

safe (default)      % ignore writing outside the boundaries
unchecked           % writing outside the boundaries = crash!
checked             % error when writing outside the boundaries

This means that 'circular, 'mirror, and 'clamped are mapped to 'safe when writing values to the matrix. For example:

A : vec'circular = [1, 2, 3]
A[-1] = 0   % Value neglected

The advantage will then be that you can write code such as:

A : mat'circular = imread("lena_big.tif")[:,:,1]
B = zeros(size(A))
B[-127..128,-127..128] = A[-127..128,-127..128]

Construction of Cell matrices

In Quasar, it is possible to construct cell vectors/matrices, similar to in MATLAB:

A = `1,2,3j,zeros(4,4)´

(Matlab-syntax uses braces as in {1,2,3j,zeros(4,4)}).

Now, I have found that there are two problems with the cell construction syntax:

  • The prime (´) cannot be represented with the ANSI character set and requires UTF-8 encoding. In the Redshift code editor, UTF-8 is used by default, however when editing Quasar code with other editors, problems can occur.
  • The prime (´) is not present on QUERTY keyboards (in contrast to Belgian AZERTY keyboard). I suspect this is one of the reasons that the cell construction syntax is used rarely.
  • Even the documentation system has problems with the prime symbol, that may be the cause that you cannot see it right now. In other words, it is an evil symbol and all evil should be extinct.

To solve these problems, the Quasar parser now also accepts an apostrophe ' (ASCII character 39, or in hex 27h) for closing:

A = `1,2,3j,zeros(4,4)'

Because the apostrophe character exists in ANSI and on QUERTY keyboards, the use of cell matrices constructions is greatly simplified.

Note: the old-fashioned alternative was to construct cell matrices using the function cell, or vec[vec], vec[mat], vec[cube], … For example:

A = cell(4)
A[0] = 1j
A[1] = 2j
A[2] = 3j
A[3] = 4j

Note that this notation is not very elegant, compared to A='1j,2j,3j,4j'. Also it does not allow the compiler to fully determine the type of A (the compiler will find type(A) == "vec[??]" rather than type(A) == "cvec"). In the following section, we will discuss the type inference in more detail.

Type inference

Another new feature of the compiler is that it attempts to infer the type from cell matrices. In earlier versions, all cell matrices defined with the above syntax, had type vec[??]. Now, this has changed, as illustrated by the following example:

    a = `[1, 2],[1, 2, 3]'
    print type(a)            % Prints vec[vec[int]]

    b = `(x->2*x), (x->3*x), (x->4*x)'
    print type(b)            % Prints [ [??->??] ]

    c = ` `[1, 2],[1,2]',`[1, 2, 3],[4, 5, 6]' '
    print type(c)            % Prints vec[vec[vec[int]]]

    d = ` [ [2, 1], [1, 2] ], [ [4, 3], [3, 4] ]'
    print type(d)            % Prints vec[mat]

    e = `(x:int->2*x), (x:int->3*x), (x:int->4*x)'
    print type(e)            % Prints vec[ [int->int] ]

This allows cell matrices that are constructed with the above syntax to be used from kernel functions. A simple example:

    d = `eye(4), ones(4,4)'

    parallel_do(size(d[0]), d, 
        __kernel__ (d : vec[mat], pos : ivec2) -> d[0][pos] += d[1][pos])
    print d[0]

The output is:

       [ [2,1,1,1],
         [1,1,1,2] ]

Dynamic Memory Allocation on CPU/GPU

In some algorithms, it is desirable to dynamically allocate memory inside __kernel__ or __device__ functions, for example:

  • When the algorithm processes blocks of data for which the maximum size is not known in advance.
  • When the amount of memory that an individual thread uses is too large to fit in the shared memory or in the registers. The shared memory of the GPU consists of typically 32K, that has to be shared between all threads in one block. For single-precision floating point vectors vec or matrices mat and for 1024 threads per block, the maximum amount of shared memory is 32K/(1024*4) = 8 elements. The size of the register memory is also of the same order: 32 K for CUDA compute architecture 2.0.

Example: brute-force median filter

So, suppose that we want to calculate a brute-force median filter for an image (note that there exist much more efficient algorithms based on image histograms, see immedfilt.q). The filter could be implemented as follows:

  1. we extract a local window per pixel in the image (for example of size 13x13).
  2. the local window is then passed to a generic median function, that sorts the intensities in the local window and returns the median.

The problem is that there may not be enough register memory for holding a local window of this size. 1024 threads x 13 x 13 x 4 = 692K!

The solution is then to use a new Quasar runtime & compiler feature: dynamic kernel memory. In practice, this is actually very simple: first ensure that the compiler setting “kernel dynamic memory support” is enabled. Second, matrices can then be allocated through the regular matrix functions zeros, complex(zeros(.)), uninit, ones.

For the median filter, the implementation could be as follows:

% Function: median
% Computes the median of an array of numbers
function y = __device__ median(x : vec)
    % to be completed

% Function: immedfilt_kernel
% Naive implementation of a median filter on images
function y = __kernel__ immedfilt_kernel(x : mat, y : mat, W : int, pos : ivec2)

    % Allocate dynamic memory (note that the size depends on W,
    % which is a parameter for this function)
    r = zeros((W*2)^2)

    for m=-W..W-1
        for n=-W..W-1
            r[(W+m)*(2*W)+W+n] = x[pos+[m,n]]

    % Compute the median of the elements in the vector r
    y[pos] = median(r)

For W=4 this algorithm is illustrated in the figure below:

Dynamic memory allocation Figure 1. dynamic memory allocation inside a kernel function.

Parallel memory allocation algorithm

To support dynamic memory, a special parallel memory allocator was designed. The allocator has the following properties:

  1. The allocation/disposal is distributed in space and does not use lists/free-lists of any sort.
  2. The allocation algorithm is designed for speed and correctness.

To accomplish such a design, a number of limitations were needed:

  1. The minimal memory block that can be allocated is 1024 bytes. If the block size is smaller then the size is rounded up to 1024 bytes.
  2. When you try to allocate a block with size that is not a pow-2 multiple of 1024 bytes (i.e. 1024*2^N with N integer), then the size is rounded up to a pow-2 multiple of 1024 bytes.
  3. The maximal memory block that can be allocated is 32768 bytes (=215 bytes). Taking into account that this can be done per pixel in an image, this is actually quite a lot!
  4. The total amount of memory that can be allocated from inside a kernel function is also limited (typically 16 MB). This restriction is mainly to ensure program correctness and to keep memory free for other processes in the system.

It is possible to compute an upper bound for the amount of memory that will be allocated at a given point in time. Suppose that we have a kernel function, that allocates a cube of size M*N*K, then:

max_memory = NUM_MULTIPROC * MAX_RESIDENT_BLOCKS * prod(blkdim) * 4*M*N*K

Where prod(blkdim) is the number of elements in one block, MAX_RESIDENT_BLOCKS is the maximal number of resident blocks per multi-processor and NUM_MULTIPROC is the number of multiprocessors.

So, suppose that we allocate a matrix of size 8x8 on a Geforce 660Ti then:

max_memory = 5 * 16 * 512 * 4 * 8 * 8 = 10.4 MB

This is still much smaller than what would be needed if one would consider pre-allocation (in this case this number would depend on the image dimensions!)

Comparison to CUDA malloc

CUDA has built-in malloc(.) and free(.) functions that can be called from device/kernel functions, however after a few performance tests and seeing warnings on CUDA forums, I decided not to use them. This is the result of a comparison between the Quasar dynamic memory allocation algorithm and that of NVidia:

    Granularity: 1024
    Memory size: 33554432
    Maximum block size: 32768
    Start - test routine

    Operation took 183.832443 msec [Quasar]     
    Operation took 1471.210693 msec [CUDA malloc]

I obtained similar results for other tests. As you can see, the memory allocation is about 8 times faster using the new apporach than with the NVidia allocator.

Why better avoiding dynamic memory

Even though the memory allocation is quite fast, to obtain the best performance, it is better to avoid dynamic memory:

  • The main issue is that kernel functions using dynamic memory also require several read/write accesses to the global memory. Because dynamically allocated memory has typically the size of hundreds of KBs, the data will not fit into the cache (of size 16KB to 48KB). Correspondingly: the cost of using dynamic memory is in the associated global memory accesses!
  • Please note that the compiler-level handling of dynamic memory is currently in development. As long as the memory is “consumed” locally as in the above example, i.e. not written to external data structures the should not be a problem.

Matrix Conditional Assignment

In MATLAB, it is fairly simple to assign to a subset of a matrix, for example, the values that satisfy a given condition. For example, saturation can be obtained as follows:

A[A < 0] = 0
A[A > 1] = 1

In Quasar, this can be achieved with:

A = saturate(A)

However, the situation can be more complex and then there is no direct equivalent to MATLAB. For example,

A[A > B] = C

where A, B, C are all matrices. The trick is to define a reduction (now in system.q):

type matrix_type : [vec|mat|cube|cvec|cmat|ccube]
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a > b] = c) = (x += (c - x) .* (a > b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a < b] = c) = (x += (c - x) .* (a < b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a <= b] = c) = (x += (c - x) .* (a <= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a >= b] = c) = (x += (c - x) .* (a >= b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a == b] = c) = (x += (c - x) .* (a == b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a != b] = c) = (x += (c - x) .* (a != b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a && b] = c) = (x += (c - x) .* (a && b))
reduction (x : matrix_type, a : matrix_type, b, c) -> (x[a || b] = c) = (x += (c - x) .* (a || b))

The first line defines a “general” matrix type, then is then used for the subsequent reductions. The reduction simply works on patterns of the form:

x[a #op# b] = c

and replaces them by the appropriate Quasar expression. The last two reductions are a trick, to get the conditional assignment also working with boolean expressions, such as:

A[A<-0.1 || A>0.1] = 5

Note that, on the other hand:

B = A[A<-0.1]

will currently result in a runtime error (this syntax is not defined yet).