Overview of some new features in Quasar
This document lists a number of new features that were introduced in Quasar in January 2014.
Object-oriented programming
The implementation of object-oriented programming in Quasar is far from complete; however, there are a number of new concepts:
- Unification of static and dynamic classes: Before, there existed static class types (type myclass : {mutable} class) and dynamic object types (myobj = object()). In many cases, the set of properties (and corresponding types) for object() is known in advance. To enjoy the advantages of the type inference, there are now also dynamic class types:

type Bird : dynamic class
    name : string
    color : vec3
end
The dynamic class types are similar to classes in Python. At run-time, it is possible to add fields or methods:
bird = Bird()
bird.position = [0, 0, 10]
bird.speed = [1, 1, 0]
bird.is_flying = false
bird.start_flying = () -> bird.is_flying = true
Alternatively, member functions can be implemented statically (similar to mutable or immutable classes):
function [] = start_flying(self : Bird)
    self.is_flying = true
end
Dynamic classes are also useful for interoperability with other languages, particularly when the program is run within the Quasar interpreter. The dynamic classes implement Mono/.NET dynamic typing, which means that imported libraries (e.g. through import "lib.dll") can now use and inspect the object properties more easily. Dynamic classes are also frequently used by the UI library (Quasar.UI.dll). Thanks to the static typing of the predefined members, efficient code can be generated. One limitation is that dynamic classes cannot be used from within __kernel__ or __device__ functions. As compensation, the dynamic classes are also a bit lighter (in terms of run-time overhead), because there is no multi-device (CPU/GPU/...) management overhead: it is known a priori that the dynamic objects will “live” in the CPU memory. Also see GitHub issue #88 for some earlier thoughts.
- Parametric types: In earlier versions of Quasar, generic types could be obtained by not specifying the types of the members of a class:
type stack : mutable class
    tab
    pointer
end
However, this limits the type inference, because the compiler cannot make any assumptions w.r.t. the types of tab or pointer. When objects of the type stack are used within a for-loop, the automatic loop parallelizer will complain that insufficient information is available on the types of tab and pointer. To solve this issue, types can now be parametric:
type stack[T] : mutable class
    tab : vec[T]
    pointer : int
end
An object of the type stack can then be constructed as follows:

obj = stack[int]
obj = stack[stack[cscalar]]
Parametric classes are similar to template classes in C++. For the Quasar back-ends, the implementation of parametric types is completely analogous to that in C++: for each instantiation of the parametric type, a struct is generated. It is also possible to define methods for parametric classes:
function [] = __device__ push[T](self : stack[T], item : T)
    cnt = (self.pointer += 1) % atomic add for thread safety
    self.tab[cnt - 1] = item
end
Methods for parametric classes can be __device__ functions as well, so that they can be used on both the CPU and the GPU. In the future, this will allow us to create thread-safe and lock-free implementations of common data types (sets, lists, stacks, dictionaries, etc.) within Quasar. The internal implementation of parametric types and methods in Quasar (i.e. in the runtime) uses a combination of erasure and reification.
- Inheritance: Inherited classes can be defined as follows:
type bird : class
    name : string
    color : vec3
end

type duck : bird
    ...
end
Inheritance is allowed on all three class types (mutable, immutable and dynamic).
Note: multiple inheritance is currently not supported (multiple inheritance has the problem that special “precedence rules” are required to determine which method is used when multiple base classes define a certain method; in a dynamic context, this would create substantial overhead).
- Constructors: Defining a constructor is based on the same pattern that we used to define methods. For the above stack class, we have:
% Default constructor
function y = stack[T]()
    y = stack[T](tab:=vec[T](100), pointer:=0)
end

% Constructor with int parameter
function y = stack[T](capacity : int)
    y = stack[T](tab:=vec[T](capacity), pointer:=0)
end

% Constructor with vec[T] parameter
function y = stack[T](items : vec[T])
    y = stack[T](tab:=copy(items), pointer:=0)
end
Note that the constructor itself creates an instance of the type, rather than this being done automatically. Consequently, it is possible to return a null value as well:

function y : ^stack[T] = stack[T](capacity : int)
    if capacity > 1024
        y = null % Capacity too large, no can do...
    else
        y = stack[T](tab:=vec[T](capacity), pointer:=0)
    endif
end
In C++/Java this is not possible: the constructor always returns the this-object, which is often seen as a disadvantage. A constructor that is intended to be used on the GPU (or on the CPU in native mode) can then simply be defined by adding the __device__ modifier:

function y = __device__ stack[T](items : vec[T])
    y = stack[T](tab:=copy(items), pointer:=0)
end
Note #1: instead of stack[T](), we could have used any other name, such as make_stack[T](). Using the type name to identify the constructor has two advantages:
- the compiler will know that this method is intended to be used to create objects of this class
- non-uniformity (new_stack[T](), make_stack[T](), create_stack(), …) is avoided
Note #2: there are no destructors (yet). Because of the automatic memory management, this is not a big issue right now.
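As a minimal usage sketch of the above (assuming the stack[T] class, its constructors and the push method defined earlier; the call push(s, 42) relies on the compiler inferring T = int):

s = stack[int](16) % constructor with int parameter: capacity 16
push(s, 42)        % atomically increments s.pointer and stores the item
assert(s.pointer == 1)
assert(s.tab[0] == 42)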
Type inference enhancements
- Looking ‘through’ functions (type reconstruction): In earlier releases, the compiler could not handle the determination of the return types of functions very well. This could lead to some problems with the automatic loop parallelizer:

function y = imfilter(x, kernel)
    ...
end

% Warning - type of y unknown
y = imfilter(imread("x.png")[:,:,1])
assert(type(y,"scalar")) % Gives compilation error!
Here, the compiler cannot determine the type of y, even though it is known that imread("x.png")[:,:,1] is a matrix. In the newest version, the compiler attempts to perform type inference for the imfilter function in order to determine the type of y. This does not allow determining the return type of imfilter in general, but it does for this specific case. Note that type reconstruction can create some additional burden for the compiler (especially when the function contains a lot of calls that require recursive type reconstruction). However, type reconstruction is only used when the type of at least one of the output parameters of a function could not be determined.
- Members of dynamic objects: The members of many dynamic objects (e.g. qform, qplot) are now statically typed. This also greatly improves the type inference in a number of places.
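When the reconstruction overhead matters, an explicit return-type annotation sidesteps it entirely. A hypothetical sketch (the annotation syntax follows the constructor examples earlier in this document; the parameter types are illustrative):

function y : mat = imfilter(x : mat, kernel : mat)
    ...
end

With the annotation, the compiler can type y at the call site without looking through the body of imfilter.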
High-level operations inside kernel functions
Automatic memory management on the computation device is a new feature that greatly improves the expressiveness of Quasar programs. Typically, the programmer intends to use (non-fixed length) vector or matrix expressions within a for-loop (or a kernel function). Up till now, this resulted in a compilation error “function cannot be used within the context of a kernel function” or “loop parallelization not possible because of function XX”. The transparent handling of vector or matrix expressions within kernel functions requires some special (and sophisticated) handling on the Quasar compiler and runtime sides. In particular, what is needed is dynamic kernel memory: memory that is allocated on the GPU (or CPU) during the operation of the kernel. The dynamic memory is disposed (freed) either when the kernel function terminates or at a later point.
There are a few use cases for dynamic kernel memory:
- When the algorithm requires processing several small-sized (e.g. 3x3) to medium-sized (e.g. 64x64) matrices, for example a kernel function that performs matrix operations for every pixel in the image. The size of the matrices may or may not be known in advance.
- Efficient handling of multivariate functions that are applied to (non-overlapping or overlapping) image blocks.
- When the algorithm works with dynamic data structures such as linked lists or trees, it is also often necessary to allocate “nodes” on the fly.
- To use some sort of “scratch” memory that does not fit into the GPU shared memory (note: the GPU shared memory is 32K, and this needs to be shared between all threads; for 1024 threads this is 32 bytes of private memory per thread). Dynamic memory does not have such a stringent limitation. Moreover, dynamic memory is not shared and is disposed either 1) immediately when the memory is not needed anymore or 2) when a GPU/CPU thread exits. Correspondingly, when 1024 threads each use 32K, this requires less than 32MB in total, because the threads are logically, but not physically, in parallel.
In all these cases, dynamic memory can be used, simply by calling the zeros, ones, eye or uninit functions. One may also use slicing operators (A[0..9, 2]) in order to extract a sub-matrix. The slicing operations then take the current boundary access mode (e.g. mirroring, circular) into account.
Examples
The following program transposes 16x16 blocks of an image, creating a cool tiling effect. Firstly, a kernel function version is given and secondly a loop version. Both versions are equivalent: in fact, the second version is internally converted to the first version.
Kernel version
function [] = __kernel__ kernel(x : mat, y : mat, B : int, pos : ivec2)
    r1 = pos[0]*B..pos[0]*B+B-1 % creates a dynamically allocated vector
    r2 = pos[1]*B..pos[1]*B+B-1 % creates a dynamically allocated vector
    y[r1, r2] = transpose(x[r1, r2]) % matrix transpose; creates a dynamically allocated matrix
end
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
parallel_do(size(x,0..1) / B, x, y, B, kernel)
Loop version
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
#pragma force_parallel
for m = 0..B..size(x,0)-1
    for n = 0..B..size(x,1)-1
        A = x[m..m+B-1,n..n+B-1] % creates a dynamically allocated matrix
        y[m..m+B-1,n..n+B-1] = transpose(A) % matrix transpose
    end
end
Memory models
To accommodate the widest range of algorithms, two memory models are currently provided (more may be added in the future).
- Concurrent memory model: In the concurrent memory model, the computation device (e.g. GPU) autonomously manages a separate memory heap that is reserved for dynamic objects. The size of the heap can be configured in Quasar and is typically 32MB. The concurrent memory model is extremely efficient when all threads (e.g. > 512) request dynamic memory at the same time. The memory allocation is done by a specialized parallel allocation algorithm that significantly differs from traditional sequential allocators.
For efficiency, there are some internal limitations on the size of the allocated blocks:
- The minimum size is 1024 bytes (everything smaller is rounded up to 1024 bytes)
- The maximum size is 32768 bytes
For larger allocations, please see the *cooperative memory model*. The minimum size also limits the number of objects that can be allocated.
- Cooperative memory model: In the cooperative memory model, the kernel function requests memory directly from the Quasar allocator. This way, there are no limitations on the size of the allocated memory. Also, the allocated memory is automatically garbage collected.
Because the GPU cannot launch callbacks to the CPU, this memory model requires the kernel function to be executed on the CPU.
Advantages:
- The maximum block size and the total amount of allocated memory only depend on the available system resources.
Limitations:
- The Quasar memory allocator uses locking (to a limited extent), so simultaneous memory allocations on all processor cores may be expensive.
- The memory is disposed only when the kernel function exits. This is done to reduce the number of callbacks from kernel function code to host code. Suppose that you have a 1024x1024 grayscale image whose kernel allocates 256 bytes per thread: this would then require 256MB of RAM! In this case, you should use the concurrent memory model (which does not have this problem, because it reclaims memory as threads exit).
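To make the block-size limits of the concurrent memory model concrete, here is a hypothetical kernel sketch (the function alloc_demo is illustrative, and 4 bytes per scalar element is an assumption):

function [] = __kernel__ alloc_demo(y : mat, pos : ivec2)
    a = zeros(16, 16) % 16*16*4 = 1024 bytes: exactly the minimum block size
    b = zeros(90, 90) % 90*90*4 = 32400 bytes: still below the 32768-byte maximum
    % a 128x128 matrix (65536 bytes) would exceed the maximum and needs
    % the cooperative memory model instead
    y[pos] = a[0,0] + b[0,0]
end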
Selection between the memory models
Features
- Device functions can also use dynamic memory. The functions may even return objects that are dynamically allocated.
- The following built-in functions are supported and can now be used from within kernel and device functions: zeros, czeros, ones, uninit, eye, copy, reshape, repmat, shuffledims, seq, linspace, real, imag, complex, the mathematical functions, matrix/matrix multiplication and matrix/vector multiplication.
Performance considerations
- Global memory access: code relying on dynamic memory may be slow (for linear filters on GPU: 4x-8x slower), not because of the allocation algorithms, but because of the global memory accesses. However, it all depends on what you want to do: for example, for non-overlapping block-based processing (e.g., blocks of a fixed size), the dynamic kernel memory is an excellent choice.
- Static vs. dynamic allocation: when the size of the matrices is known in advance, static allocation (e.g. outside the kernel function) may be used as well. The dynamic allocation approach relieves the programmer from writing code to pre-allocate memory and from calculating its size as a function of the data dimensions. The cost of calling the functions uninit, zeros is negligible compared to the global memory access times (one memory allocation is comparable to 4-8 memory accesses on average; 16-32 bytes is still small compared to the typical sizes of allocated memory blocks). Because dynamic memory is disposed whenever possible when a particular thread exits, the maximum amount of dynamic memory in use at any time is much smaller than the amount of memory required for pre-allocation.
- Use vecX types for vectors of length 2 to 16 whenever your algorithm allows it. This completely avoids using global memory, by using registers instead. Once a vector of length 17 or more is created, the vector is allocated as dynamic kernel memory.
- Avoid writing code that leads to thread divergence: in CUDA, instructions execute in warps of 32 threads. A group of 32 threads must execute every instruction together. Control flow instructions (if, match, repeat, while) can negatively affect the performance by causing threads of the same warp to diverge, that is, to follow different execution paths. The different execution paths must then be serialized, because all of the threads of a warp share a program counter. Consequently, the total number of instructions executed for this warp is increased. When all the different execution paths have completed, the threads converge back to the same execution path.
To obtain the best performance in cases where the control flow depends on the block position (blkpos), the controlling condition should be written so as to minimize the number of divergent warps.
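As an illustrative sketch of the difference (hypothetical kernel bodies; the availability of mod inside kernel functions is assumed):

% divergent: neighbouring threads of the same warp take different branches
if mod(pos[1], 2) == 0
    y[pos] = x[pos]
else
    y[pos] = -x[pos]
end

% warp-friendly: the branch outcome is uniform within each 32-thread span
if mod(pos[1], 64) < 32
    y[pos] = x[pos]
else
    y[pos] = -x[pos]
end

In the first variant both branches are serialized for every warp; in the second, each warp follows a single execution path.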
Nested parallelism
It is desirable to specify parallelism in all stages of the computation. For example, within a parallel loop, it must be possible to declare another parallel loop etc. Up till now, parallel loops could only be placed at the top level (in a host function), and multiple levels of parallelism had to be expressed using multi-dimensional perfect for-loops. A new feature is that __kernel__ and __device__ functions can now also use the parallel_do (and serial_do) functions. The top-level host function may for example spawn 8 threads, each of which again spawns 64 threads (after some algorithm-specific initialization steps). There are several advantages to this approach:
- More flexibility in expressing the algorithms
- The nested kernel functions are (or will be) mapped onto CUDA dynamic parallelism on Kepler devices such as the GTX 780, GTX Titan. (Note: requires one of these cards to be effective).
- When a parallel_do is placed in a __device__ function that is called directly from the host code (CPU computation device), the parallel_do will be accelerated using OpenMP.
- The high-level matrix operations from the previous section automatically take advantage of the nested parallelism.
Notes:
- There is no guarantee that the CPU/GPU will effectively perform the nested operations in parallel. However, future GPUs may be expected to become more efficient in handling parallelism on different levels.
Limitations:
- Nested kernel functions may not use shared memory (they can, however, access the shared memory through the calling function), and they may also not use thread synchronization.
- Currently only one built-in parameter for the nested kernel functions is supported: pos (and not blkpos, blkidx or blkdim).
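A minimal sketch of what such nesting might look like (the function names and sizes are hypothetical; the nested kernel only uses the pos parameter, in line with the limitation above):

function [] = __kernel__ scale_row(x : mat, y : mat, m : int, pos : int)
    y[m, pos] = 2 * x[m, pos]
end

function [] = __kernel__ process_rows(x : mat, y : mat, pos : int)
    % each outer thread handles one row and spawns a nested parallel loop
    parallel_do(size(x,1), x, y, pos, scale_row)
end

parallel_do(size(x,0), x, y, process_rows)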
Example
The following program showcases the nested parallelism, the improved type inference and the automatic usage of dynamic kernel memory:
function y = gamma_correction(x, gamma)
y = uninit(size(x))
% Note: #pragma force_parallel is required here, otherwise
% the compiler will just use dynamic typing.
#pragma force_parallel
for m=0..size(x,0)-1
for n=0..size(x,1)-1
for k=0..size(x,2)-1
y[m,n,k] = 512 * (x[m,n,k]/512)^gamma
end
end
end
end
function [] = main()
x = imread("lena_big.tif")
y = gamma_correction(x, 0.5)
#pragma force_parallel
for m=0..size(y,0)-1
for c=0..size(y,2)-1
row = y[m,:,c]
y[m,:,c] = row[numel(row)-1..-1..0]
end
end
imshow(y)
end
The pragmas are just added here to illustrate that the corresponding loops need to be parallelized; however, using these pragmas is optional. Note:
- The lack of explicit typing in the complete program (even type T is only an unspecified type parameter)
- The return type of gamma_correction is also not specified; however, the compiler is able to deduce that type(y,"cube") is true.
- The second for-loop (inside the main function) uses slicing operations (: and ..). The assignment row = y[m,:,c] leads to dynamic kernel memory allocation.
- The vector operations inside the second for-loop automatically express nested parallelism and can be mapped onto CUDA dynamic parallelism.