# New feature in Quasar: fixed size data types

In the past, small matrices such as `ones(3,3)`

required dynamic memory allocation (which also incurs additional overhead). The dynamic memory allocation could be avoided by reverting to the fixed-length vector types (`vec4`

etc.), but matrix code written using this technique is more difficult to read and often error prone (for example for a 4×4 matrix one would write `A[5]`

instead of `A[1,1]`

). The use of high-level expressions such as `zeros(N,N)`

is very natural in a high-level language such as Quasar, but `zeros(N,N)`

does not lead to the most efficient code generation for a hardware architecture such as a GPU. Therefore, solving this problem represents a major milestone in the conversion from high-level code to low-level code.

For this purpose, fixed-length vectors have now been generalized to fixed-sized matrices, cubes and hypercubes, with full compiler support (type inference and code generator). We will show that this successfully solves the code generation problem. With no or limited code changes, existing code can even benefit of the fixed-sized matrix optimizations. Additionally, it is useful to guide the compiler by annotating with the appropriate fixed-sized matrix types (e.g., to catch matrix dimension mismatches early on in the development). The following types have been defined:

Quasar type | Parametric alias | Meaning |
---|---|---|

`vec3` |
`vec(3)` |
A vector of length 3 |

`mat2x2` |
`mat(2,2)` |
A matrix of dimensions 2 x 2 |

`cube2x3x2` |
`cube(2,3,2)` |
A cube with dimensions 2 x 3 x 2 |

`cube2x2x2x2` |
`cube(2,2,2,2)` |
A hypercube with dimensions 2 x 2 x 2 x 2 |

In the above table, all types exist in fixed form and a parametric form. In code, the fixed form is often slightly more readable. The parametric form has the possibility for generic parametrization, which will be explained in one of the next sections.

Fixed sized data types allow the compiler to generate more efficient code, especially in cases when the dimensions are known at compile-time. In some cases, this allows out-of-bounds indices to be detected faster:

```
function y = transform(A : mat2x2'checked, b : vec2)
y = A[4,4] * b[0] % Index out-of-bounds
endfunction
```

Also, incompatibilities in matrix/vector operations can be detected at compile-time in some cases, for example:

`y = [1,2,3] * eye(4) % Error: Matrix dimensions (1x3 and 4x4) do not match for multiplication!`

The explicit matrix dimensions can be seen as a type constraint (e.g., a constraint that is applicable to every instance of the type and that can be handled by the truth-value system to enable inference).

For the generated CUDA/CPU code, the fixed sized data types are stored respectively in the registers (GPU) or stack memory (CPU). This entirely avoids dynamic kernel memory and offers significant performance benefits, often a tenfold. Because the dimensions are known at compile-time, also the indexing calculation is accelerated.

All fixed sized data types are by default passed *by reference* (note: this is the same for generic vectors/matrices and even fixed-length vectors). The fixed dimensions should be seen as a constraint that applies to every instance of the type. Internally, when safe, the compiler may generate code that passes the vector/matrix *by value* (for example, when the corresponding variable is constant). Except for `class`

/`mutable class`

objects, the fixed sized data types are stored in-place (i.e. without reference).

To take advantage of fixed sized data types, no changes are required to existing Quasar programs, other than occasional changes to the function parameter types (e.g., from `mat`

to `mat3x3`

for a 3×3 matrix). Whenever possible, the compiler attempts to derive the data dimensions by type inference. For example, the following defines a 2×3 matrix and its transpose:

```
A = [[1,2,3],[2,3,4]]
B = transpose(A)
```

The compiler will automatically determine the type of `A`

as `mat2x3`

and `B`

as `mat3x2`

. Moreover, when the following builtin functions are used within a device/kernel function, they will automatically take advantage of the fixed sized data types:

```
zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim
cumsum, cumprod, cummin, cummax
```

Idem for arithmetic operations (add, subtract, multiply, divide etc.), matrix multiplication as well as mathematical functions (`sin`

, `cos`

, `tan`

etc.).

*Some unobvious performance considerations*: passing a `mat(N,N)`

to a device function accepting `mat`

does not require use of dynamic kernel memory! Internally, the runtime will cast the fixed sized matrix to a generic matrix. Also, a device function returning `mat(N,N)`

does not use dynamic kernel memory, however, a device function returning `mat`

(e.g. `function [y:mat] = __device__ X()`

) *does*, even when the function returns a fixed sized matrix such as `mat(4,4)`

. The reason is that device functions need to respect the type signature so that they can be used as function variables. Therefore: for device functions returning `mat`

/`cube`

types, it is best to update the type signature, especially of the dimensions of the return value are known beforehand.

## Examples

Below are a few useful examples of using fixed sized data types.

### 1 Built-in function optimizations

```
function [y1,y2] = __device__ func (x : mat4x4'safe, y : vec4'circular)
y1 = sum(x,1) + x[4,0]
y2 = cumsum(x,1) + y[6]
endfunction
```

Because the matrix and vector dimensions are known at compile-time, the above function will only use stack memory/registers to calculate the results `y1`

and `y2`

, i.e., there is no dynamic kernel memory involved.

### 2 Matrix extraction and processing

Another use case that can now be optimized, is given below:

```
im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))
for m=0..size(im,0)-1
for n=0..size(im,1)-1
window = im[m+(-1..1),n+(-1..1)]
im_out[m,n] = max(window) - min(window)
endfor
endfor
```

Here, the compiler will determine the type of `window`

to be `mat3x3`

. In previous versions of Quasar, the same effect would be obtained by writing:

```
window = [im[m-1,n-1],im[m-1,n]+im[m-1,n+1]+
im[m ,n-1],im[m ,n]+im[m ,n+1]+
im[m+1,n-1],im[m+1,n]+im[m+1,n+1]]
```

The fixed-sized matrix approach is clearly more versatile and also allows two-dimensional addressing. One caveat is that the programmer needs to ensure that the dimensions of `window`

can be obtained at compile-time. If this is not the case, the compiler will generate a warning mentioning that the kernel function consumes dynamic memory. In case the dimensions are parametric, a way to get around it is by using *generical size parametrized types* (see Section 1.2 below) or by *using generic specialization*:

```
im : cube'mirror = imread("lena_big.tif")[:,:,1]
im_out = zeros(size(im))
function [] = generic_nlfilter(K : int)
{!auto_specialize}
for m=0..size(im,0)-1
for n=0..size(im,1)-1
window = im[m+(-K..K),n+(-K..K)]
im_out[m,n] = max(window) - min(window)
endfor
endfor
endfunction
generic_nlfilter(2)
```

### 3 Matrix-processing in the loop

The following code benefits well from the fixed-sized matrix optimizations, causing efficient code to be generated because 1) matrix dimensions are known inside the kernel function, which allows additional optimization techniques under the hood (such as loop unrolling) and 2) because no dynamic kernel memory is required:

```
% (code courtesy of Michiel Vlaminck)
{!parallel for}
for i=0..size(covs,2)-1
D = vec(3) % Type: vec3 or in parametric form vec(3)
V = zeros(3,3) % Type: mat3x3 or in parametric form mat(3,3)
unused = jacobi(covs[0..2,0..2,i], 3, D, V)
% Avoids matrices near singularities
mineig : scalar = 100 * D[2]
D[0] = D[0] < mineig ? mineig : D[0]
D[1] = D[1] < mineig ? mineig : D[1]
covs[0..2,0..2,i] = V * diagm(D) * transpose(V)
for j=0..2
D[j] = 1.0/D[j]
endfor
icovs[0..2,0..2,i] = transpose(V) * diagm(D) * V
endfor
```

In particular, the type inference engine can determine that `type(diagm(D))`

is `mat(3,3)`

, so that the matrix multiplications `transpose(V) * diagm(D) * V`

also can be calculated with fixed dimensions.

## Generic dimension/size-parametrized array types

Often, it is desired to specify vector and matrix dimensions, but it is not desirable to restrict the functions to a particular choice of dimensions. In this case, dimension/size-parametrized types come to the rescue.

The following example demonstrated a `circshift`

function that can be used in arbitrary dimension (1D, 4D, 8D etc.)

```
function [y] = circshift[N : int](x : cube{N}, shift : vec(N))
function [] = __kernel__ kernel (x : cube{N}'circular, y : cube{N}, shift : vec(N), pos : vec(N))
y[pos] = x[pos + shift]
endfunction
y = zeros(size(x,0..N-1))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

This is called a dimension-parametric function (the function works for any dimension `N>0`

). The specialization of this function is done at the caller-side. For example, if the function is called `circshift(randn(64,64,3),[10,10,0])`

the compiler will deduct `N=3`

and will generate a specialized version (as if the function was written specifically for `N=3`

). Alternatively, one may also call `circshift[3](randn(64,64,3),[10,10,0])`

to explicitly pass the parameter (e.g., in order to ensure type correctness).

Compare this to the non-generic counterpart (note that `cube{:}`

denotes a (hyper)cube of unspecified dimension):

```
function [y] = circshift(x : cube{:}, shift : vec)
function [] = __kernel__ kernel (x : cube{:}'circular, y : cube{:}, shift : vec, pos)
y[pos] = x[pos + shift]
endfunction
y = zeros(size(x))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

Sadly, the non-generic counterpart implementation does not work as expected. To work, the length of the `shift`

vector needs to match the maximal dimension of `cube{:}`

. This is because the compiler will adjust the length of `pos`

to match the maximal dimensionality (currently 16). In case the `shift`

vector has a different length, a runtime error will result. Moreover, it is desirable to write code that does not depend on built-in constants like the maximal dimensionality.

The parametric version entirely solves this problem but also allows the exact value of `N`

to be captured. It is possible to declare types like `mat(N,N)`

and even `vec(2*N)`

and `mat(2,N)`

. The type definitions currently allow simple linear arithmetic (for example `2*N-1`

).

Next, defining a matrix of type `mat(2,N)`

:

```
A = mat(2,N)
B = cube(N,2*N,3)
Y = A[0,:]
C = B[:,:,1]
```

The compiler can now infer that `Y`

has type `vec(N)`

and that `C`

has type `mat(N,2*N)`

. After specialization with `N=2`

, `C`

will have type `mat(2,4)`

and the resulting variables enjoy the benefits of fixed sized data types.

## Partially parametrized array types

Existing code (e.g., for image processing) often relies on hard-coded conventions with respect to the cube dimensions, for example, the number of color channels that is fixed to be 3. To make code more easily extendible and also to facilitate compiler analysis (for loop parallelization, type inference), partially parametrized array types can be used.

Example:

`img : cube(:,:,3) = imread("lena_big.tif")`

indicates that `A`

is a cube with three dimensions, for which the first two dimensions have unspecified length (`:`

) and the last dimension has length 3. When a for-loop is written over the third dimension, either explicitly or implicitly via matrix slices:

`img[m,n,:]`

the compiler can determine the length `numel(img[m,n,:])==3`

so that the vector type `vec3`

can be assigned. In previous Quasar versions, the compiler would infer `vec`

as type (not being able to make any assumption on the size), resulting possibly in the use of dynamic kernel memory.

For `imread`

, the type inference of `cube(:,:,3)`

is now automatic. For other functions, it may again be useful to parametrize generically on the number of color channels:

```
function [y] = circshift2D[N](x : cube(:,:,N), shift : vec(2))
function [] = __kernel__ kernel (x : cube(:,:,N)'circular, y : cube(:,:,N), shift : vec(2), pos)
y[...pos,:] = x[...(pos + shift),:]
endfunction
y = zeros(size(x))
parallel_do(size(x), x, y, shift, kernel)
endfunction
```

Note that here, we are combining the technique from previous section with partially parametrized array types (since we are using `cube(:,:,N)`

instead of `cube(:,:,3)`

). The body of the kernel function will then be implemented entirely with fixed-length vectors `vec(N)`

. The above function can be called with arbitrary number of color channels, although this number needs to be *known* at compile-time:

```
circshift2D(randn(100,100,3), [4,4])
circshift2D(randn(100,100,9), [-2,-2])
```

Another example is a color transform for an image with unspecified number of channels:

```
function img_out = colortransform[N](C : mat(N,N), img : cube(:,:,N))
img_out = zeros(size(img,0),size(img,1),N)
{!parallel for}
for m=0..size(img,0)-1
for n=0..size(img,1)-1
img_out[m,n,:] = C * squeeze(img[m,n,:])
endfor
endfor
endfunction
```

Here, calling the function with a `mat3x3`

color transform matrix will ensure that the parameter `N=3`

is substituted during compile-time, leading to efficient calculation of the vector-matrix product `C * squeeze(img[m,n,:])`

. The `squeeze`

function is required here, because `img[m,n,:]`

returns a vector of dimensions `1x1xN`

which is multiplied with matrix `C`

. The squeeze function removes the singleton dimensions and converts the vector to `Nx1x1`

(or simply `N`

, since singleton dimensions at the end do not matter in Quasar). For the assignment, this conversion is done automatically. In practice the `squeeze`

function will have zero-overhead; it is only required to keep the compiler happy.

Again, the type inference of the following builtin functions has been updated to work with these parametrized types:

```
zeros, ones, uninit, eye,
repmat, reshape,
shuffledims, transpose,
sum, prod, min, max, mindim, maxdim
cumsum, cumprod, cummin, cummax
```

such that `repmat(img, [1,1,3]) : cube(:,:,9))`

etc. Another use case is summation along a given dimension:

```
A : mat(3,3,:)
B = sum(A,2)
```

Here the result `B`

has the fixed sized matrix type `mat(3,3)`

.

# Conclusion

In Quasar, fixed sized vectors, matrices and cubes offer the same code generation advantages as fixed sized arrays in C, but because they are actually “concealed” regular matrices with special type constraints, they can be used with the same conventions (e.g., by reference parameter passing) as regular matrices and they are therefore much more flexible. The type inference has been extended to automatically find matrices/cubes with fixed dimensions and once found, the code generation is adjusted appropriately. The fixed sized constraint then not only allows catching size errors at compile-time, but also ensures that loops with fixed-sized matrices are converted to efficient implementations that no longer rely on dynamic kernel memory.