## Realtime lens distortion corrections on UltraHD 4k video streams using Quasar

Quasar was used at imec Ghent to perform realtime lens distortion corrections on UltraHD 4k video streams.

Because the quality of digital image sensors often outperforms the quality of optical lens systems, the sensors capture various distortions (such as chromatic aberration, noise, blur and various other unwanted effects). Now, software comes in for an effective quality compensation at affordable cost. Before, lens correction algorithms were only available for offline processing (e.g., in software like Photoshop), but now, with use of GPUs and the Quasar software, the algorithms could easily be developed and tested in real-time. On a PC platform with GPU, UltraHD 4k video could be processed at 30 fps.

Read more in the imec Magazine article from April 2019.

## Real-time SDR to HDR video conversion using Quasar

The conversion from low-dynamic range video to high dynamic range content that can be displayed on, e.g., high dynamic range TVs, faces several obstacles: dealing with artifacts such as ‘ghost’ and ‘halo’ images, recognizing bright objects in dark environments, converting light and dark settings in a visually distinct way and carrying out all of these calculations in real-time on a GPU. Existing software solutions struggle with these problems and are unable to achieve the required quality in the end result.

By automatically handling the mapping from algorithm to GPU, Quasar allows application specialists to focus on the application at hand, focusing on the quality of the algorithms rather than hardware specific details. At the same time the algorithm can be tested on high resolution datasets and in realtime. This led to the development of a SDR->HDR conversion algorithm, from which the result can be inspected below.

*Left – visual artifacts occurring during traditional conversion to HDR. Right – processing result of software developed using Quasar*

Read more in the imec magazine.

See also our previous blog post from February, with a video example.

## New feature in Quasar: fixed size data types

In the past, small matrices such as `ones(3,3)`

required dynamic memory allocation (which also incurs additional overhead). The dynamic memory allocation could be avoided by reverting to the fixed-length vector types (`vec4`

etc.), but matrix code written using this technique is more difficult to read and often error prone (for example for a 4×4 matrix one would write `A[5]`

instead of `A[1,1]`

). The use of high-level expressions such as `zeros(N,N)`

is very natural in a high-level language such as Quasar, but `zeros(N,N)`

does not lead to the most efficient code generation for a hardware architecture such as a GPU. Therefore, solving this problem represents a major milestone in the conversion from high-level code to low-level code.

For this purpose, fixed-length vectors have now been generalized to fixed-sized matrices, cubes and hypercubes, with full compiler support (type inference and code generator). We will show that this successfully solves the code generation problem. With no or limited code changes, existing code can even benefit of the fixed-sized matrix optimizations. Additionally, it is useful to guide the compiler by annotating with the appropriate fixed-sized matrix types (e.g., to catch matrix dimension mismatches early on in the development). The following types have been defined:

Quasar type | Parametric alias | Meaning |
---|---|---|

`vec3` |
`vec(3)` |
A vector of length 3 |

`mat2x2` |
`mat(2,2)` |
A matrix of dimensions 2 x 2 |

`cube2x3x2` |
`cube(2,3,2)` |
A cube with dimensions 2 x 3 x 2 |

`cube2x2x2x2` |
`cube(2,2,2,2)` |
A hypercube with dimensions 2 x 2 x 2 x 2 |

In the above table, all types exist in fixed form and a parametric form. In code, the fixed form is often slightly more readable. The parametric form has the possibility for generic parametrization, which will be explained in one of the next sections.

Fixed sized data types allow the compiler to generate more efficient code, especially in cases when the dimensions are known at compile-time. In some cases, this allows out-of-bounds indices to be detected faster:

```
function y = transform(A : mat2x2'checked, b : vec2)
y = A[4,4] * b[0] % Index out-of-bounds
endfunction
```

Also, incompatibilities in matrix/vector operations can be detected at compile-time in some cases, for example:

`y = [1,2,3] * eye(4) % Error: Matrix dimensions (1x3 and 4x4) do not match for multiplication!`

The explicit matrix dimensions can be seen as a type constraint (e.g., a constraint that is applicable to every instance of the type and that can be handled by the truth-value system to enable inference).

For the generated CUDA/CPU code, the fixed sized data types are stored respectively in the registers (GPU) or stack memory (CPU). This entirely avoids dynamic kernel memory and offers significant performance benefits, often a tenfold. Because the dimensions are known at compile-time, also the indexing calculation is accelerated.

All fixed sized data types are by default passed *by reference* (note: this is the same for generic vectors/matrices and even fixed-length vectors). The fixed dimensions should be seen as a constraint that applies to every instance of the type. Internally, when safe, the compiler may generate code that passes the vector/matrix *by value* (for example, when the corresponding variable is constant). Except for `class`

/`mutable class`

objects, the fixed sized data types are stored in-place (i.e. without reference).

To take advantage of fixed sized data types, no changes are required to existing Quasar programs, other than occasional changes to the function parameter types (e.g., from `mat`

to `mat3x3`

for a 3×3 matrix). Whenever possible, the compiler attempts to derive the data dimensions by type inference. For example, the following defines a 2×3 matrix and its transpose:

```
A = [[1,2,3],[2,3,4]]
B = transpose(A)
```

The compiler will automatically determine the type of `A`

as `mat2x3`

and `B`

as `mat3x2`

. Moreover, when the following builtin functions are used within a device/kernel function, they will automatically take advantage of the fixed sized data types:

```
zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim
cumsum, cumprod, cummin, cummax
```

Idem for arithmetic operations (add, subtract, multiply, divide etc.), matrix multiplication as well as mathematical functions (`sin`

, `cos`

, `tan`

etc.).

*Some unobvious performance considerations*: passing a `mat(N,N)`

to a device function accepting `mat`

does not require use of dynamic kernel memory! Internally, the runtime will cast the fixed sized matrix to a generic matrix. Also, a device function returning `mat(N,N)`

does not use dynamic kernel memory, however, a device function returning `mat`

(e.g. `function [y:mat] = __device__ X()`

) *does*, even when the function returns a fixed sized matrix such as `mat(4,4)`

. The reason is that device functions need to respect the type signature so that they can be used as function variables. Therefore: for device functions returning `mat`

/`cube`

types, it is best to update the type signature, especially of the dimensions of the return value are known beforehand.

## Examples

Below are a few useful examples of using fixed sized data types.

### 1 Built-in function optimizations

```
function [y1,y2] = __device__ func (x : mat4x4'safe, y : vec4'circular)
y1 = sum(x,1) + x[4,0]
y2 = cumsum(x,1) + y[6]
endfunction
```

Because the matrix and vector dimensions are known at compile-time, the above function will only use stack memory/registers to calculate the results `y1`

and `y2`

, i.e., there is no dynamic kernel memory involved.

### 2 Matrix extraction and processing

Another use case that can now be optimized, is given below:

```
im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))
for m=0..size(im,0)-1
for n=0..size(im,1)-1
window = im[m+(-1..1),n+(-1..1)]
im_out[m,n] = max(window) - min(window)
endfor
endfor
```

Here, the compiler will determine the type of `window`

to be `mat3x3`

. In previous versions of Quasar, the same effect would be obtained by writing:

```
window = [im[m-1,n-1],im[m-1,n]+im[m-1,n+1]+
im[m ,n-1],im[m ,n]+im[m ,n+1]+
im[m+1,n-1],im[m+1,n]+im[m+1,n+1]]
```

The fixed-sized matrix approach is clearly more versatile and also allows two-dimensional addressing. One caveat is that the programmer needs to ensure that the dimensions of `window`

can be obtained at compile-time. If this is not the case, the compiler will generate a warning mentioning that the kernel function consumes dynamic memory. In case the dimensions are parametric, a way to get around it is by using *generical size parametrized types* (see Section 1.2 below) or by *using generic specialization*:

```
im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))
function [] = generic_nlfilter(K : int)
{!auto_specialize}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
window = im[m+(-K..K),n+(-K..K)]
im_out[m,n] = max(window) - min(window)
endfor
endfor
endfunction
generic_nlfilter(2)
```

### 3 Matrix-processing in the loop

The following code benefits well from the fixed-sized matrix optimizations, causing efficient code to be generated because 1) matrix dimensions are known inside the kernel function, which allows additional optimization techniques under the hood (such as loop unrolling) and 2) because no dynamic kernel memory is required:

```
% (code courtesy of Michiel Vlaminck)
{!parallel for}
for i=0..size(covs,2)-1
D = vec(3) % Type: vec3 or in parametric form vec(3)
V = zeros(3,3) % Type: mat3x3 or in parametric form mat(3,3)
unused = jacobi(covs[0..2,0..2,i], 3, D, V)
% Avoids matrices near singularities
mineig : scalar = 100 * D[2]
D[0] = D[0] < mineig ? mineig : D[0]
D[1] = D[1] < mineig ? mineig : D[1]
covs[0..2,0..2,i] = V * diagm(D) * transpose(V)
for j=0..2
D[j] = 1.0/D[j]
endfor
icovs[0..2,0..2,i] = transpose(V) * diagm(D) * V
endfor
```

In particular, the type inference engine can determine that `type(diagm(D))`

is `mat(3,3)`

, so that the matrix multiplications `transpose(V) * diagm(D) * V`

also can be calculated with fixed dimensions.

## Generic dimension/size-parametrized array types

Often, it is desired to specify vector and matrix dimensions, but it is not desirable to restrict the functions to a particular choice of dimensions. In this case, dimension/size-parametrized types come to the rescue.

The following example demonstrated a `circshift`

function that can be used in arbitrary dimension (1D, 4D, 8D etc.)

```
function [y] = circshift[N : int](x : cube{N}, shift : vec(N))
function [] = __kernel__ kernel (x : cube{N}'circular, y : cube{N}, shift : vec(N), pos : vec(N))
y[pos] = x[pos + shift]
endfunction
y = zeros(size(x,0..N-1))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

This is called a dimension-parametric function (the function works for any dimension `N>0`

). The specialization of this function is done at the caller-side. For example, if the function is called `circshift(randn(64,64,3),[10,10,0])`

the compiler will deduct `N=3`

and will generate a specialized version (as if the function was written specifically for `N=3`

). Alternatively, one may also call `circshift[3](randn(64,64,3),[10,10,0])`

to explicitly pass the parameter (e.g., in order to ensure type correctness).

Compare this to the non-generic counterpart (note that `cube{:}`

denotes a (hyper)cube of unspecified dimension):

```
function [y] = circshift(x : cube{:}, shift : vec)
function [] = __kernel__ kernel (x : cube{:}'circular, y : cube{:}, shift : vec, pos)
y[pos] = x[pos + shift]
endfunction
y = zeros(size(x))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

Sadly, the non-generic counterpart implementation does not work as expected. To work, the length of the `shift`

vector needs to match the maximal dimension of `cube{:}`

. This is because the compiler will adjust the length of `pos`

to match the maximal dimensionality (currently 16). In case the `shift`

vector has a different length, a runtime error will result. Moreover, it is desirable to write code that does not depend on built-in constants like the maximal dimensionality.

The parametric version entirely solves this problem but also allows the exact value of `N`

to be captured. It is possible to declare types like `mat(N,N)`

and even `vec(2*N)`

and `mat(2,N)`

. The type definitions currently allow simple linear arithmetic (for example `2*N-1`

).

Next, defining a matrix of type `mat(2,N)`

:

```
A = mat(2,N)
B = cube(N,2*N,3)
Y = A[0,:]
C = B[:,:,1]
```

The compiler can now infer that `Y`

has type `vec(N)`

and that `C`

has type `mat(N,2*N)`

. After specialization with `N=2`

, `C`

will have type `mat(2,4)`

and the resulting variables enjoy the benefits of fixed sized data types.

## Partially parametrized array types

Existing code (e.g., for image processing) often relies on hard-coded conventions with respect to the cube dimensions, for example, the number of color channels that is fixed to be 3. To make code more easily extendible and also to facilitate compiler analysis (for loop parallelization, type inference), partially parametrized array types can be used.

Example:

`img : cube(:,:,3) = imread("lena_big.tif")`

indicates that `A`

is a cube with three dimensions, for which the first two dimensions have unspecified length (`:`

) and the last dimension has length 3. When a for-loop is written over the third dimension, either explicitly or implicitly via matrix slices:

`img[m,n,:]`

the compiler can determine the length `numel(img[m,n,:])==3`

so that the vector type `vec3`

can be assigned. In previous Quasar versions, the compiler would infer `vec`

as type (not being able to make any assumption on the size), resulting possibly in the use of dynamic kernel memory.

For `imread`

, the type inference of `cube(:,:,3)`

is now automatic. For other functions, it may again be useful to parametrize generically on the number of color channels:

```
function [y] = circshift2D[N](x : cube(:,:,N), shift : vec(2))
function [] = __kernel__ kernel (x : cube(:,:,N)'circular, y : cube(:,:,N), shift : vec(2), pos)
y[...pos,:] = x[...(pos + shift),:]
endfunction
y = zeros(size(x))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

Note that here, we are combining the technique from previous section with partially parametrized array types (since we are using `cube(:,:,N)`

instead of `cube(:,:,3)`

). The body of the kernel function will then be implemented entirely with fixed-length vectors `vec(N)`

. The above function can be called with arbitrary number of color channels, although this number needs to be *known* at compile-time:

```
circshift2D(randn(100,100,3), [4,4])
circshift2D(randn(100,100,9), [-2,-2])
```

Another example is a color transform for an image with unspecified number of channels:

```
function img_out = colortransform[N](C : mat(N,N), img : cube(:,:,N))
img_out = zeros(size(img,0),size(img,1),N)
{!parallel for}
for m=0..size(img,0)-1
for n=0..size(img,1)-1
img_out[m,n,:] = C * squeeze(img[m,n,:])
endfor
endfor
endfunction
```

Here, calling the function with a `mat3x3`

color transform matrix will ensure that the parameter `N=3`

is substituted during compile-time, leading to efficient calculation of the vector-matrix product `C * squeeze(img[m,n,:])`

. The `squeeze`

function is required here, because `img[m,n,:]`

returns a vector of dimensions `1x1xN`

which is multiplied with matrix `C`

. The squeeze function removes the singleton dimensions and converts the vector to `Nx1x1`

(or simply `N`

, since singleton dimensions at the end do not matter in Quasar). For the assignment, this conversion is done automatically. In practice the `squeeze`

function will have zero-overhead; it is only required to keep the compiler happy.

Again, the type inference of the following builtin functions has been updated to work with these parametrized types:

```
zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim
cumsum, cumprod, cummin, cummax
```

such that `repmat(img, [1,1,3]) : cube(:,:,9))`

etc. Another use case is summation along a given dimension:

```
A : mat(3,3,:)
B = sum(A,2)
```

Here the result `B`

has the fixed sized matrix type `mat(3,3)`

.

# Conclusion

In Quasar, fixed sized vectors, matrices and cubes offer the same code generation advantages as fixed sized arrays in C, but because they are actually “concealed” regular matrices with special type constraints, they can be used with the same conventions (e.g., by reference parameter passing) as regular matrices and they are therefore much more flexible. The type inference has been extended to automatically find matrices/cubes with fixed dimensions and once found, the code generation is adjusted appropriately. The fixed sized constraint then not only allows catching size errors at compile-time, but also ensures that loops with fixed-sized matrices are converted to efficient implementations that no longer rely on dynamic kernel memory.

## New Feature in Quasar: Cooperative Thread Groups

The latest release of Quasar now offers support for CUDA 9 cooperative groups. Below, we describe how cooperative groups can be used from Quasar. Overall, cooperative threading brings some interesting optimization possibilities for Quasar kernel functions.

## 1 Synchronization granularity

The keyword `syncthreads`

now accepts a parameter that indicates which threads are being synchronized. This allows more fine grain control on the synchronization.

Keyword | Description |
---|---|

`syncthreads(warp)` |
performs synchronization across the current (possibly diverged) warp (32 threads) |

`syncthreads(block)` |
performs synchronization across the current block |

`syncthreads(grid)` |
performs synchronization across the entire grid |

`syncthreads(multi_grid)` |
performs synchronization across the entire multi-grid (multi-GPU) |

`syncthreads(host)` |
synchronizes all host (CPU and GPU threads) |

The first statement `syncthreads(warp)`

allows divergent threads to synchronize at any time (it is also useful in the context of Volta’s independent scheduling). `syncthreads(block)`

is equivalent to `syncthreads`

in previous versions of Quasar. The grid synchronization primitive `syncthreads(grid)`

is particularly interesting, it is a feature of CUDA 9 that was not available before. It allows to place barriers inside kernel function that synchronize all blocks. The following function:

```
function y = gaussian_filter_separable(x, fc, n)
function [] = __kernel__ gaussian_filter_hor(x : cube, y : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [0,i-n,0]] * fc[i]
endfor
y[pos] = sum
endfunction
function [] = __kernel__ gaussian_filter_ver(x : cube, y : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [i-n,0,0]] * fc[i]
end
y[pos] = sum
end
z = uninit(size(x))
y = uninit(size(x))
parallel_do (size(y), x, z, fc, n, gaussian_filter_hor)
parallel_do (size(y), z, y, fc, n, gaussian_filter_ver)
endfunction
```

Can now be simplified to:

```
function y = gaussian_filter_separable(x, fc, n)
function [] = __kernel__ gaussian_filter_separable(x : cube, y : cube, z : cube, fc : vec, n : int, pos : vec3)
sum = 0.
for i=0..numel(fc)-1
sum = sum + x[pos + [0,i-n,0]] * fc[i]
endfor
z[pos] = sum
syncthreads(grid)
sum = 0.
for i=0..numel(fc)-1
sum = sum + z[pos + [i-n,0,0]] * fc[i]
end
y[pos] = sum
endfunction
z = uninit(size(x))
y = uninit(size(x))
parallel_do (size(y), x, y, z, fc, n, gaussian_filter_separable)
endfunction
```

The advantage is not only in the improved readability of the code, but the number of kernel function calls can be reduced which further increases the performance. *Note: this feature requires at least a Pascal GPU.*

## 2 Interwarp communication

New special kernel function parameters (similar to `blkpos`

, `pos`

etc.) will be added to control the GPU threads.

Parameter | Type | Description |
---|---|---|

`coalesced_threads` |
`thread_block` |
a thread block of coalesced threads |

`this_thread_block` |
`thread_block` |
describes the current thread block |

`this_grid` |
`thread_block` |
describes the current grid |

`this_multi_grid` |
`thread_block` |
describes the current multi-GPU grid |

The new `thread_block`

class will have the following methods.

Method | Description |
---|---|

`sync()` |
Synchronizes all threads within the thread block |

`partition(size : int)` |
Allows partitioning a thread block into smaller blocks |

`shfl(var, src_thread_idx : int)` |
Direct copy from another thread |

`shfl_up(var, delta : int)` |
Direct copy from another thread, with index specified relatively |

`shfl_down(var, delta : int)` |
Direct copy from another thread, with index specified relatively |

`shfl_xor(var, mask : int)` |
Direct copy from another thread, with index specified by a XOR relative to the current thread index |

`all(predicate)` |
Returns true if the predicate for all threads within the thread block evaluates to non-zero |

`any(predicate)` |
Returns true if the predicate for any thread within the thread block evaluates to non-zero |

`ballot(predicate)` |
Evaluates the predicate for all threads within the thread block and returns a mask where every bit corresponds to one predicate from one thread |

`match_any(value)` |
Returns a mask of all threads that have the same value |

`match_all(value)` |
Returns a mask only if all threads that share the same value, otherwise returns 0. |

In principle, the above functions allow threads to communicate with each other. The shuffle operations allow taking values from other *active* threads (active means not disabled due to thread divergence). `all`

, `any`

, `ballot`

, `match_any`

and `match_all`

allow to determine whether the threads have reached a given state.

The warp shuffle operations require a Kepler GPU and allow the use of shared memory to be avoided (register access is faster than shared memory). This may bring again performance benefits for computationally intensive kernels such as convolutions and parallel reductions (`sum`

, `min`

, `max`

, `prod`

etc.).

Using this functionality will require the CUDA target to be specified explicitly (i.e., the functionality cannot be easily simulated by the CPU). This may be obtained by placing the following code attribute inside the kernel: `{!kernel target="nvidia_cuda"}`

. For CPU execution a separate kernel needs to be written. Luckily, several of the warp shuffling optimizations can be integrated in the compiler optimization stages, so that only one single kernel needs to be written.

## Quasar at the Design, Automation and Test in Europe (DATE’18) conference in Dresden

## The Three Function Types of Quasar from a Technical Perspective

To give programmers ultimate control on the acceleration of functions on compute devices (GPU, multi-core CPU), Quasar has three distinct function types:

*Host*Functions: host functions are the default functions in Quasar. Depending on the compilation target, they are either interpreted or compiled to binary CPU code. Host functions run on the CPU and form a glue for the code that runs on other devices. When host functions contain for-loops, these loops are (when possible) automatically parallelized by the Quasar compiler. The parallelized version then runs on multi-core CPU or GPU; what happens in the background is that a*kernel*function is generated. Within the Quasar Redshift IDE, host functions are interpreted, which also allows stepping through the code and putting breakpoints.*Kernel*Functions: given certain data dimensions, kernel functions specify what needs to be done to each element of the data. The task can be either a parallel task or a serial task. Correspondingly, kernel functions are launched either using the`parallel_do`

function or using the`serial_do`

function:`function [] = __kernel__ filter(image, pos) ... endfunction parallel_do(size(image),image,filter)`

In the above example, the kernel function

`filter`

is applied in parallel to each pixel component of the image`image`

. The first parameter of`parallel_do`

/`serial_do`

is always the data dimension (i.e., the size of the data). The last parameter is the kernel function to be called, and in the middle are the parameters to be passed to the kernel function.`pos`

is a special kernel function parameter that indicates the position within the data (several other parameters exist, such as`blkpos`

,`blkdim`

etc., see the documentation for a detailed list).Kernel functions can call other kernel functions, but only through

`serial_do`

and/or`parallel_do`

. The regular function calling syntax (direct call, i.e, without`serial_do`

/`parallel_do`

) on kernel functions is*not permitted*. This is because kernel functions have other infrastructure (e.g., shared memory, device-specific parameters) which would cause conflicts when kernel functions would call each other.*Device*functions: in contrast to host functions, device functions (indicated with`__device__`

) are intended to be natively compiled using the back-end compiler. They can be seen as auxiliary functions that can be called from either kernel functions, device functions or even host functions (note that a kernel function cannot call a host function). The compiler back-end may generate C++, CUDA or OpenCL code that is then further compiled to binary code to be executed on CPU and/or GPU. Currently, apart from the CPU there are no accelerators that can run a device function directly when called from a host function. This means, for a GPU, the only way to call a device function is through a kernel function.`function y = __device__ norm(vector : vec) y = sqrt(sum(vector.^2)) endfunction`

Functions in Quasar are first-class variables. Correspondingly, they have a data type. Below are some example variable declarations:

```
host : [(cube, cube) -> scalar]
filter : [__kernel__ (cube) -> ()]
norm : [__device__ (vec) -> scalar]
```

This allows passing functions to other functions, where the compiler can statically check the types. Additionally, to prevent functions from being accessed unintendedly, functions can be nested. For example, a gamma correction operation can be implemented using a host function containing a kernel function.

```
function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
function [] = __kernel__ my_kernel(gamma_fn : [__device__ scalar->scalar], x : cube, y : cube, pos : ivec2)
y[pos] = gamma_fn(x[pos])
endfunction
y = uninit(size(x)) % Allocation of uninitialized memory
parallel_do(size(x), gamma_fn, x, y, my_kernel)
endfunction
```

Because functions support closures (i.e., they capture the variable values of the surround contexts at the moment the function is defined), the above code can be simplified to:

```
function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
function [] = __kernel__ my_kernel(pos : ivec2)
y[pos] = gamma_fn(x[pos])
endfunction
y = uninit(size(x))
parallel_do(size(x), my_kernel)
endfunction
```

Or even in shorter form, using lambda expression syntax:

```
function y : mat = gamma_correction(x : mat, gamma_fn : [__device__ scalar->scalar])
y = uninit(size(x))
parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma(x[pos]))
end
```

In many cases, the type of the variables can safely be omitted (e.g., `pos`

in the above example) because the compiler can deduce it from the context.

Now, experienced GPU programmers may wonder about the performance cost of a function pointer call from a kernel function. The good news is that the function pointer call is avoided, by automatic function specialization (a bit similar to template instantiation in C++). With this mechanism, device functions passed via the function parameters or closure can automatically be inlined in a kernel function.

```
function y = gamma_correction(x, gamma_fn)
y = uninit(size(x))
parallel_do(size(x), __kernel__ (pos) -> y[pos] = gamma(x[pos]))
end
gamma = __device__ (x) -> 255*(x/255)^2.4
im_out = gamma_correction(im_in, gamma)
```

This will cause the following kernel function to be generated, behind the hood:

```
function [] = __kernel__ opt__gamma_correction(pos:int,y:vec,x:vec)
% Inlining of function 'gamma' with parameters 'x'=x[pos]
y[pos]=(255*((x[pos]/255)^2.4))
endfunction
```

Another optimization was performed in the background: *kernel flattening*. Because each element of the data is accessed separately, the data is processed as a vector (using raster scanning) rather than as a matrix (or cube) representation. This entirely eliminates costly index calculations.

Summarizing, three function types offer the flexibility of controlling on which target devices (CPU or GPU) certain code fragments *might* run, while offering a unified programming interface: the code written for GPU can be executed on CPU as well, thereby triggering other target-specific optimizations.

## Real-time High Dynamic Range Upconversion of Video

The following video demonstrates a real-time standard dynamic range (SDR) to high dynamic range (HDR) upconversion algorithm that was developed entirely in Quasar. The purpose of this algorithm is to display movies stored in SDR format on a HDR television. A user interface, also developed in Quasar, gives the user a few slides to choose how the conversion is done (e.g., which specular highlights are being boosted and to what extent).

"Raw upscaling" indicates direct scaling of the SDR to the HDR range, without additional processing of the video content. "Proposed method" refers to the SDR to HDR conversion algorithm.

(note: this video is played at a higher speed for viewing purposes)

## Quasar Autonomous Vehicles Demo

The following demo demonstrates the use of Quasar in autonomous driving applications. A Lidar is mounted on an electric bus, together with two RGB cameras. The incoming video signals are fused and processed in real-time using algorithms developed in Quasar.

## Autonomous vehicles making use of Quasar

## Simultaneous localization and mapping (SLAM)

Martin Dimitrievski and his colleagues propose a novel real-time method for SLAM in autonomous vehicles. The environment is mapped using a probabilistic occupancy map model and EGO motion is estimated within the same environment by using a feedback loop. Input data is provided via a rotating laser scanner as 3D measurements of the current environment which are projected on the ground plane. The local ground plane is estimated in real-time from the actual point cloud data using a robust plane fitting scheme. Then the computed occupancy map is registered against the previous map in order to estimate the translation and rotation of the vehicle. Experimental results demonstrate that the method produces high quality occupancy maps and the measured translation and rotation errors of the trajectories are lower compared to other 6 degrees of freedom methods. The entire SLAM system runs on a mid-range GPU and keeps up with the data from the sensor which enables more computational power for the other tasks of the autonomous vehicle.

“Many of the Autonomous Vehicles sub-systems are massively parallel and this is where Quasar can speed things up. From pre-processing of LIDAR point cloud data to odometry, object detection, tracking and route planning, Quasar made all of these components possible to run on a mid-range GPU in real-time. When you are done prototyping, you can consult the profiler to easily spot any areas for improved execution of the code.” – ir. Martin Dimitrievski

## Example: SLAM for autonomous vehicles

## Reference

“Robust matching of occupancy maps for odometry in autonomous vehicles“; Martin Dimitrievski , David Van Hamme , Peter Veelaert, Wilfried Philips in

proceedings of Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 2016 (VISIGRAPP).

## Acknowledgements

The work was financially supported the Flanders Make ICON project 140647 “Environmental Modelling for automated Driving and Active Safety (EMDAS)”.

## MRI reconstruction speedups up to x20 using GPUs

## Magnetic Resonance Imaging (MRI)

MRI is a very powerful and safe medical diagnostic tool, but it is prohibitively expensive to use frequently. Hence, a technique that speeds up MRI acquisition would not only be helpful for patients, as it requires them to lie still for shorter periods of time, but it would also be of great benefit for doctors, as it leads to a higher patient throughput and a reduced susceptibility of the images to motion artifacts. Using Quasar, Jan Aelterman and his colleagues developed a reconstruction algorithm that handles acquisition speedup correctly.

Speeding up MRI is done by reducing the amount of acquired data. However, signal processing theory states it is impossible to do this beyond the Nyquist limit, without losing information. However, it is possible to only lose superfluous image information, this is called compressive sensing (CS). The danger is that that a naive reconstruction technique inadvertently corrupts an image while filling in lost information.

## Proposed Technique

An MRI image is constructed from so called Fourier, or k-space coefficients. Due to acceleration techniques, the number of Fourier coefficients is less than the number of image pixels, resulting in an infinite number of possible images that would correspond with the acquired Fourier coefficients (an infinite number of ways to fill in the missing, superfluous information). Therefore, we impose an additional constraint: the reconstructed image is the one with the lowest number of shearlet coefficients possible. The shearlet transform can represent natural images with few coefficients, but not noise and corruptions. It is optimal in this respect. Hence, its use will force a noise – and corruption – free reconstruction. The optimization of the proposed problem requires iteratively applying the Non-uniform fast Fourier transform (NUFFT), which after profiling turns out to be a major bottleneck. By accelerating the NUFFT using the GPU we are able to gain significant speedups.

Implementation was done using the new programming language Quasar, which allows for fast and hardware agnostic development, while still using the computation power of the GPU. Without requiring long development cycles, we were able to achieve speed-ups up to a factor 20 on an NVIDIA Geforce GTX 770. The speed-up achieved by the GPU acceleration opens the path for new and innovative research for MRI reconstruction: e.g. auto calibration, 3D reconstruction, advanced regularization parameters, etc.

“It took experts using CUDA/C++ three months to implement our MRI reconstruction algorithm; a developer using Quasar for the very first time achieved the same numerical results at the same computational performance in less than a single development week.” – Dr. ir. Jan Aelterman

## Example

## Reference

“COMPASS: a joint framework for parallel imaging and compressive sensing in MRI“; Jan Aelterman, Quang Luong, Bart Goossens, Aleksandra Pizurica, Wilfried Philips; in proceedings of IEEE International Conference on Image Processing ICIP(2010).