Autonomous vehicles making use of Quasar

Simultaneous localization and mapping (SLAM)

Martin Dimitrievski and his colleagues propose a novel real-time method for SLAM in autonomous vehicles. The environment is mapped using a probabilistic occupancy map model and ego-motion is estimated within the same environment by using a feedback loop. Input data is provided by a rotating laser scanner as 3D measurements of the current environment, which are projected onto the ground plane. The local ground plane is estimated in real-time from the actual point cloud data using a robust plane fitting scheme. The computed occupancy map is then registered against the previous map in order to estimate the translation and rotation of the vehicle. Experimental results demonstrate that the method produces high-quality occupancy maps and that the measured translation and rotation errors of the trajectories are lower than those of other 6-degrees-of-freedom methods. The entire SLAM system runs on a mid-range GPU and keeps up with the data rate of the sensor, which leaves more computational power for the other tasks of the autonomous vehicle.

“Many of the Autonomous Vehicles sub-systems are massively parallel and this is where Quasar can speed things up. From pre-processing of LIDAR point cloud data to odometry, object detection, tracking and route planning, Quasar made all of these components possible to run on a mid-range GPU in real-time. When you are done prototyping, you can consult the profiler to easily spot any areas for improved execution of the code.” – ir. Martin Dimitrievski

 

Example: SLAM for autonomous vehicles

Reference

"Robust matching of occupancy maps for odometry in autonomous vehicles"; Martin Dimitrievski, David Van Hamme, Peter Veelaert, Wilfried Philips; in proceedings of the Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), 2016.

Acknowledgements

The work was financially supported by the Flanders Make ICON project 140647 "Environmental Modelling for automated Driving and Active Safety (EMDAS)".

MRI reconstruction speedups up to x20 using GPUs

Magnetic Resonance Imaging (MRI)

MRI is a very powerful and safe medical diagnostic tool, but it is prohibitively expensive to use frequently. Hence, a technique that speeds up MRI acquisition would not only be helpful for patients, as it requires them to lie still for shorter periods of time, but it would also be of great benefit for doctors, as it leads to a higher patient throughput and a reduced susceptibility of the images to motion artifacts. Using Quasar, Jan Aelterman and his colleagues developed a reconstruction algorithm that handles acquisition speedup correctly.

Speeding up MRI is done by reducing the amount of acquired data. Signal processing theory states that it is impossible to do this beyond the Nyquist limit without losing information. However, it is possible to lose only superfluous image information; this is called compressive sensing (CS). The danger is that a naive reconstruction technique inadvertently corrupts the image while filling in the lost information.

Proposed Technique

An MRI image is constructed from so-called Fourier or k-space coefficients. Due to acceleration techniques, the number of Fourier coefficients is smaller than the number of image pixels, resulting in an infinite number of possible images that would correspond to the acquired Fourier coefficients (an infinite number of ways to fill in the missing, superfluous information). Therefore, we impose an additional constraint: the reconstructed image is the one with the lowest possible number of shearlet coefficients. The shearlet transform can represent natural images with few coefficients, but not noise and corruptions; it is optimal in this respect. Hence, its use will force a noise- and corruption-free reconstruction. The optimization of the proposed problem requires iteratively applying the non-uniform fast Fourier transform (NUFFT), which after profiling turns out to be the major bottleneck. By accelerating the NUFFT on the GPU we are able to gain significant speedups.
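
To make the role of the NUFFT in this optimization concrete, below is a minimal sketch of one possible iterative reconstruction loop (an illustration only, not the actual COMPASS implementation; nufft, inufft, shearlet, inv_shearlet and soft_threshold are hypothetical placeholder functions):

% Hypothetical sketch of a CS reconstruction loop; nufft/inufft, shearlet/inv_shearlet
% and soft_threshold are placeholders, not the actual COMPASS code.
function x = cs_reconstruct(k_data, num_iter : int, lambda : scalar)
    x = inufft(k_data)                 % zero-filled initial estimate
    for iter = 0..num_iter-1
        r = nufft(x) - k_data          % residual in k-space
        x = x - inufft(r)              % gradient step: the NUFFT dominates the runtime
        x = inv_shearlet(soft_threshold(shearlet(x), lambda))  % enforce shearlet sparsity
    end
end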

COMPASS reconstruction experiment using 4 acquisition coils and 10% of the Nyquist sampling rate on a k-space spiral.


Implementation was done using the new programming language Quasar, which allows for fast and hardware-agnostic development, while still using the computation power of the GPU. Without requiring long development cycles, we were able to achieve speed-ups up to a factor of 20 on an NVIDIA GeForce GTX 770. The speed-up achieved by the GPU acceleration opens the path for new and innovative research for MRI reconstruction: e.g. auto calibration, 3D reconstruction, advanced regularization parameters, etc.

“It took experts using CUDA/C++ three months to implement our MRI reconstruction algorithm; a developer using Quasar for the very first time achieved the same numerical results at the same computational performance in less than a single development week.” – Dr. ir. Jan Aelterman

Example

Reference

"COMPASS: a joint framework for parallel imaging and compressive sensing in MRI"; Jan Aelterman, Quang Luong, Bart Goossens, Aleksandra Pizurica, Wilfried Philips; in proceedings of the IEEE International Conference on Image Processing (ICIP), 2010.

Focal Black & White effect

Focal Black & White Effect

A well known Google Picasa effect is the Focal Black & White Effect. This effect preserves the color within a focal region and converts pixels outside this region to grayscale.

The algorithm is surprisingly simple: it consists of calculating a weighting factor (that depends on the focal radius), converting the pixel RGB values at each position to grayscale, and calculating a weighted average of this value with the original RGB value. Let us see how this can be achieved in Quasar:

function [] = __kernel__ focus_bw(x, y, focus_spot, falloff, radius, pos)

    % Calculation of the weight
    p = (pos - focus_spot) / max(size(x,0..1))
    weight = exp(-1.0 / (0.05 + radius * (2 * dotprod(p,p)) ^ falloff))

    % Conversion to grayscale & averaging
    rgbval = x[pos[0],pos[1],0..2]
    grayval = dotprod(rgbval,[0.3,0.59,0.11])*[1,1,1]
    y[pos[0],pos[1],0..2] = lerp(grayval, rgbval, weight)

end

Code explanation

First, the kernel function focus_bw is defined. __kernel__ is a special function qualifier that identifies a kernel function (similar to OpenCL’s __kernel or kernel qualifier). Kernel functions are natively compiled to any target architecture that you have in mind: from multi-core x86/64 CPUs (ELF binaries) and ARM CPUs with NEON instructions up to NVIDIA PTX code. Next, a parameter list follows; the types of the parameters are (in this case) not specified.

Note the special parameter pos. In Quasar, pos is a parameter name that is reserved for kernel functions to obtain the current position in the image.

The kernel function contains two major blocks:

  • In the weight calculation step, first the position relative to the focus spot coordinates is computed. This relative position is normalized by dividing by the maximum size of the first two dimensions (note that Quasar uses base-0 indexing), i.e. the maximum of the width and the height of the image. Next, the weight is obtained as a function of the distance to the focal spot. A special built-in function dotprod, which can also be found in high-level shader languages (HLSL/GLSL) and which calculates the dot product between two vectors, is used for this purpose.
  • For extracting the RGB value at the position pos, we use a matrix slice indexer: 0..2 constructs a vector of length 3 (actually [0,1,2]), which vector is used for indexing. In fact:
    x[pos[0],pos[1],0..2] = [x[pos[0],pos[1],0],x[pos[0],pos[1],1],x[pos[0],pos[1],2]]        
    

    Which form do you prefer: the left-hand side or the right-hand side? You can choose. Note that it is not possible to write x[pos[0..1],0..2], because this expression would construct a matrix of size 2 x 3.

  • The gray value is calculated by performing the dot product of [0.3,0.59,0.11] with the original RGB value. Finally, the gray value is mixed with the original RGB value using the lerp “linear interpolation” function. In fact, lerp is nothing more than the function:
    lerp = (x,y,a) -> (1-a) * x + a * y
    

    The resulting RGB value is written to the output image y. That’s it!

Finally, we still need to call the kernel function. For this, we use the parallel_do construct:

img_in = imread("quasar.jpg")
img_out = zeros(size(img_in))

parallel_do(size(img_out,0..1),img_in,img_out,[256,128],0.5,10,focus_bw)
imshow(img_out)

First, an input image img_in is loaded using the function imread “image read”. Then, an output image is allocated with the same size as the input image.

The parallel_do function is called with a number of parameters. The first parameter specifies the dimensions of the “work items” that can run in parallel. Here, each pixel of the image can be processed in parallel, hence the dimensions are the size (i.e., the height and width) of the output image. The following parameters are argument values that are passed to the kernel function and that are declared in the kernel function definition. Finally, the kernel function to be called is passed.

Note that, in contrast to scripting languages that are dynamically typed, the Quasar language is (mostly) statically typed and the Quasar compiler performs type inference in order to derive the data types of all the parameters. This is done based on the surrounding context. Here, Quasar will find out that img_in has the cube data type (a 3D array) and it will derive all other missing data types from that. Consequently, efficient parallel code can be generated in a manner that is independent of the underlying platform.
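
For illustration, the same kernel signature could also be written with explicit types (a hedged sketch, assuming the usual cube, vec2, scalar and ivec2 types); this is optional, since the type inference fills them in automatically:

function [] = __kernel__ focus_bw(x : cube, y : cube, focus_spot : vec2, falloff : scalar, radius : scalar, pos : ivec2)
    % ... same body as above ...
end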

Here is the complete code once more:

function [] = __kernel__ focus_bw(x, y, focus_spot, falloff, radius, pos)  
    p = (pos - focus_spot) / max(size(x,0..1))
    weight = exp(-1.0 / (0.05 + radius * (2 * dotprod(p,p)) ^ falloff))
    rgbval = x[pos[0],pos[1],0..2]
    grayval = dotprod(rgbval,[0.3,0.59,0.11])*[1,1,1]
    y[pos[0],pos[1],0..2] = lerp(grayval, rgbval, weight)                    
end

img_in = imread("flowers.jpg")
img_out = zeros(size(img_in))

parallel_do(size(img_out,0..1),img_in,img_out,[256,128],0.5,10,focus_bw)
imshow(img_out)

Example

With eleven lines of code, you have a beautifully shining Focal Black & White effect:

focal-bnw-effect

January 2014 – new features

This document lists a number of new features that were introduced in Quasar in Jan. 2014.

Object-oriented programming

The implementation of object-oriented programming in Quasar is far from complete, however there are a number of new concepts:

  1. Unification of static and dynamic classes:

    Before, there existed static class types (type myclass : {mutable} class) and dynamic object types (myobj = object()). In many cases the set of properties (and corresponding types) for object() is known in advance. To enjoy the advantages of type inference, there are now also dynamic class types:

    type Bird : dynamic class
        name : string
        color : vec3
    end 

    The dynamic class types are similar to classes in Python. At run-time, it is possible to add fields or methods:

    bird = Bird()
    bird.position = [0, 0, 10]
    bird.speed = [1, 1, 0]
    bird.is_flying = false
    bird.start_flying = () -> bird.is_flying = true

    Alternatively, member functions can be implemented statically (similar to mutable or immutable classes):

    function [] = start_flying(self : Bird)
        self.is_flying = true
    end

    Dynamic classes are also useful for interoperability with other languages, particularly when the program is run within the Quasar interpreter. The dynamic classes implement Mono/.NET dynamic typing, which means that imported libraries (e.g. through import "lib.dll") can now use and inspect the object properties more easily.

    Dynamic classes are also frequently used by the UI library (Quasar.UI.dll). Thanks to the static typing for the predefined members, efficient code can be generated.

    One limitation is that dynamic classes cannot be used from within __kernel__ or __device__ functions. In compensation, dynamic classes are also a bit lighter (in terms of run-time overhead), because there is no multi-device (CPU/GPU/…) management overhead: it is known a priori that the dynamic objects will “live” in CPU memory.

    Also see Github issue #88 for some earlier thoughts.

  2. Parametric types

    In earlier versions of Quasar, generic types could be obtained by not specifying the types of the members of a class:

    type stack : mutable class
        tab
        pointer
    end

    However, this limits the type inference, because the compiler cannot make any assumptions w.r.t. the type of tab or pointer. When objects of the type stack are used within a for-loop, the automatic loop parallelizer will complain that insufficient information is available on the types of tab and pointer.

    To solve this issue, types can now be parametric:

    type stack[T] : mutable class
        tab : vec[T]
        pointer : int
    end

    An object of the type stack can then be constructed as follows:

    obj = stack[int]()
    obj = stack[stack[cscalar]]()

    Parametric classes are similar to template classes in C++. For the Quasar back-ends, the implementation of parametric types is completely analogous to C++: for each instantiation of the parametric type, a struct is generated.

    It is also possible to define methods for parametric classes:

    function [] = __device__ push[T](self : stack[T], item : T)
        cnt = (self.pointer += 1) % atomic add for thread safety
        self.tab[cnt - 1] = item
    end

    Methods for parametric classes can be __device__ functions as well, so that they can be used on both the CPU and the GPU. In the future, this will allow us to create thread-safe and lock-free implementations of common data types, such as sets, lists, stacks, dictionaries etc. within Quasar.

    The internal implementation of parametric types and methods in Quasar (i.e. the runtime) uses a combination of erasure and reification.
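
    As a usage sketch (hypothetical, for illustration only), an integer stack can then be created and filled as follows, relying on the push method above and on construction with named field arguments (constructors are discussed further below):

    % Hypothetical usage sketch of the parametric stack type and its push method
    s = stack[int](tab:=vec[int](100), pointer:=0)
    for i = 0..9
        push(s, i)          % thread-safe push defined above
    end
    print s.pointer         % expected output: 10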

  3. Inheritance

    Inherited classes can be defined as follows:

    type bird : class
        name : string
        color : vec3
    end
    
    type duck : bird
        ...
    end 

    Inheritance is allowed on all three class types (mutable, immutable and dynamic).

    Note: multiple inheritance is currently not supported (multiple inheritance has the problem that special “precedence rules” are required to determine which method is used when multiple base classes define a certain method; in a dynamic context, this would create substantial overhead).

  4. Constructors

    Defining a constructor is based on the same pattern that we used to define methods. For the above stack class, we have:

    % Default constructor
    function y = stack[T]()
        y = stack[T](tab:=vec[T](100), pointer:=0)
    end
    
    % Constructor with int parameter
    function y = stack[T](capacity : int)
        y = stack[T](tab:=vec[T](capacity), pointer:=0)
    end
    
    % Constructor with vec[T] parameter
    function y = stack[T](items : vec[T])
        y = stack[T](tab:=copy(items), pointer:=0)
    end

    Note that the constructor itself creates the instance of the type, rather than this being done automatically. Consequently, it is also possible to return a null value:

    function y : ^stack[T] = stack[T](capacity : int)
        if capacity > 1024
            y = null % Capacity too large, no can do...
        else
            y = stack[T](tab:=vec[T](capacity), pointer:=0)
        endif
    end

    In C++ / Java this is not possible: the constructor always returns the this-object. This is often seen as a disadvantage.

    A constructor that is intended to be used on the GPU (or on the CPU in native mode) can then simply be defined by adding the __device__ modifier:

    function y = __device__ stack[T](items : vec[T])
        y = stack[T](tab:=copy(items), pointer:=0)
    end

    Note #1: instead of stack[T](), we could have used any other name, such as make_stack[T](). Using the type name to identify the constructor:

    • the compiler will know that this method is intended to be used to create objects of this class
    • non-uniformity (new_stack[T](), make_stack[T](), create_stack()…) is avoided.

    Note #2: there are no destructors (yet). Because of the automatic memory management, this is not a big issue right now.

Type inference enhancements

  1. Looking ‘through’ functions (type reconstruction)

    In earlier releases, the compiler could not handle the determination of the return types of functions very well. This could lead to some problems with the automatic loop parallelizer:

    function y = imfilter(x, kernel)
        ...
    end  % Warning - type of y unknown
    
    y = imfilter(imread("x.png")[:,:,1])
    assert(type(y,"scalar"))  % Gives compilation error!

    Here, the compiler cannot determine the type of y, even though it is known that imread("x.png")[:,:,1] is a matrix.

    In the newest version, the compiler attempts to perform type inference for the imfilter function itself, in order to determine the type of y. This does not allow the return type of imfilter to be determined in general, but it does for this specific case.

    Note that type reconstruction can create some additional burden for the compiler (especially when the function contains a lot of calls that require recursive type reconstruction). However, type reconstruction is only used when the type of at least one of the output parameters of a function could not be determined.

  2. Members of dynamic objects

    The members of many dynamic objects (e.g. qform, qplot) are now statically typed. This also greatly improves the type inference in a number of places.

High-level operations inside kernel functions

Automatic memory management on the computation device is a new feature that greatly improves the expressiveness of Quasar programs. Typically, the programmer intends to use (non-fixed-length) vector or matrix expressions within a for-loop (or a kernel function). Up till now, this resulted in a compilation error “function cannot be used within the context of a kernel function” or “loop parallelization not possible because of function XX”. The transparent handling of vector or matrix expressions within kernel functions requires some special (and sophisticated) handling on the Quasar compiler and runtime sides. In particular, what is needed is dynamic kernel memory: memory that is allocated on the GPU (or CPU) during the operation of the kernel. The dynamic memory is disposed (freed) either when the kernel function terminates or at a later point.

There are a few use cases for dynamic kernel memory:

  • When the algorithm requires processing several small-sized (3x3) to medium-sized (e.g. 64x64) matrices. For example, a kernel function that performs matrix operations for every pixel in the image. The size of the matrices may or may not be known in advance.
  • Efficient handling of multivariate functions that are applied to (non-overlapping or overlapping) image blocks.
  • When the algorithm works with dynamic data structures such as linked lists or trees, it is also often necessary to allocate “nodes” on the fly.
  • To use some sort of “scratch” memory that does not fit into the GPU shared memory (note: the GPU shared memory is 32K, but this needs to be shared between all threads – for 1024 threads this is 32 bytes of private memory per thread). Dynamic memory does not have such a stringent limitation. Moreover, dynamic memory is not shared between threads and is disposed either 1) immediately when the memory is no longer needed or 2) when a GPU/CPU thread exits. Correspondingly, when 1024 threads each use 32K, this will require less than 32MB, because the threads are logically in parallel, but not physically.

In all these cases, dynamic memory can be used, simply by calling the zeros, ones, eye or uninit functions. One may also use slicing operators (A[0..9, 2]) in order to extract a sub-matrix. The slicing operations then take the current boundary access mode (e.g. mirroring, circular) into account.

Examples

The following program transposes 16x16 blocks of an image, creating a cool tiling effect. Firstly, a kernel function version is given and secondly a loop version. Both versions are equivalent: in fact, the second version is internally converted to the first version.

Kernel version

function [] = __kernel__ kernel (x : mat, y : mat, B : int, pos : ivec2)
    r1 = pos[0]*B..pos[0]*B+B-1   % creates a dynamically allocated vector
    r2 = pos[1]*B..pos[1]*B+B-1   % creates a dynamically allocated vector

    y[r1, r2] = transpose(x[r1, r2]) % matrix transpose 
                                     % creates a dynamically allocated vector 
end

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size    
parallel_do(size(x,0..1) / B,x,y,B,kernel)

Loop version

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size

#pragma force_parallel
for m = 0..B..size(x,0)-1
    for n = 0..B..size(x,1)-1
        A = x[m..m+B-1,n..n+B-1]   % creates a dynamically allocated vector
        y[m..m+B-1,n..n+B-1] = transpose(A)   % matrix transpose  
    end
end

Memory models

To accommodate the widest range of algorithms, two memory models are currently provided (some more may be added in the future).

  1. Concurrent memory model

    In the concurrent memory model, the computation device (e.g. GPU) autonomously manages a separate memory heap that is reserved for dynamic objects. The size of the heap can be configured in Quasar and is typically 32MB.

    The concurrent memory model is extremely efficient when all threads (e.g. > 512) request dynamic memory at the same time. The memory allocation is done by a specialized parallel allocation algorithm that significantly differs from traditional sequential allocators.

    For efficiency, there are some internal limitations on the size of the allocated blocks:

    • The minimum size is 1024 bytes (everything smaller is rounded up to 1024 bytes)
    • The maximum size is 32768 bytes

    For larger allocations, please see the cooperative memory model. The minimum size also limits the number of objects that can be allocated.

  2. Cooperative memory model

    In the cooperative memory model, the kernel function requests memory directly from the Quasar allocator. This way, there are no limitations on the size of the allocated memory. Also, the allocated memory is automatically garbage collected.

    Because the GPU cannot launch callbacks to the CPU, this memory model requires the kernel function to be executed on the CPU.

    Advantages:

    • The maximum block size and the total amount of allocated memory only depend on the available system resources.

    Limitations:

    • The Quasar memory allocator uses locking (to a limited extent), so simultaneous memory allocations on all processor cores may be expensive.
    • The memory is disposed only when the kernel function exits. This is to limit the number of callbacks from kernel function code to host code. Suppose that a kernel processing a 1024x1024 grayscale image allocates 256 bytes per thread; this would already require 256 MB of RAM! In such a case, you should use the concurrent memory model (which does not have this problem).

Selection between the memory models.

Features

  • Device functions can also use dynamic memory. The functions may even return objects that are dynamically allocated.
  • The following built-in functions are supported and can now be used from within kernel and device functions (see the sketch below): zeros, czeros, ones, uninit, eye, copy, reshape, repmat, shuffledims, seq, linspace, real, imag, complex, the mathematical functions, and matrix/matrix and matrix/vector multiplication.
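
As an illustration of these features, here is a hedged sketch (not taken from the original examples) of a kernel that allocates a small matrix per pixel and applies it with a matrix/vector product:

% Sketch: per-pixel color transform using a dynamically allocated 3x3 matrix
function [] = __kernel__ color_transform(x : cube, y : cube, pos : ivec2)
    A = eye(3)                       % dynamic kernel memory
    A[0,0] = 0.5                     % e.g. attenuate the first (red) channel
    v = x[pos[0],pos[1],0..2]
    y[pos[0],pos[1],0..2] = A * v    % matrix/vector multiplication inside the kernel
end

x = imread("image.png")
y = zeros(size(x))
parallel_do(size(x,0..1),x,y,color_transform)
imshow(y)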

Performance considerations

  • Global memory access: code relying on dynamic memory may be slow (for linear filters on GPU: 4x-8x slower), not because of the allocation algorithms, but because of the global memory accesses. However, it all depends on what you want to do: for example, for non-overlapping block-based processing (e.g., blocks of a fixed size), the dynamic kernel memory is an excellent choice.
  • Static vs. dynamic allocation: when the size of the matrices is known in advance, static allocation (e.g. outside the kernel function) may be used as well. The dynamic allocation approach relieves the programmer from writing code to pre-allocate memory and from calculating its size as a function of the data dimensions. The cost of calling the functions uninit, zeros is negligible compared to the global memory access times (one memory allocation is comparable to 4-8 memory accesses on average – 16-32 bytes is still small compared to the typical sizes of allocated memory blocks). Because dynamic memory is disposed whenever possible when a particular thread exits, the maximum amount of dynamic memory that is in use at any time is much smaller than the amount of memory required for pre-allocation.
  • Use vecX types for vectors of length 2 to 16 whenever your algorithm allows it. This completely avoids using global memory, by using the registers instead. Once a vector of length 17 is created, the vector is allocated as dynamic kernel memory.
  • Avoid writing code that leads to thread divergence: in CUDA, instructions execute in warps of 32 threads. A group of 32 threads must execute (every instruction) together. Control flow instructions (if, match, repeat, while) can negatively affect the performance by causing threads of the same warp to diverge, that is, to follow different execution paths. The different execution paths must then be serialized, because all of the threads of a warp share a program counter. Consequently, the total number of instructions executed for this warp is increased. When all the different execution paths have completed, the threads converge back to the same execution path.

To obtain best performance in cases where the control flow depends on the block position (blkpos), the controlling condition should be written so as to minimize the number of divergent warps.

Nested parallelism

It is desirable to specify parallelism in all stages of the computation. For example, within a parallel loop, it must be possible to declare another parallel loop, etc. Up till now, parallel loops could only be placed at the top level (in a host function), and multiple levels of parallelism had to be expressed using multi-dimensional perfect for-loops. A new feature is that __kernel__ and __device__ functions can now also use the parallel_do (and serial_do) functions. The top-level host function may for example spawn 8 threads, each of which in turn spawns 64 threads (after some algorithm-specific initialization steps). There are several advantages of this approach:

  • More flexibility in expressing the algorithms
  • The nested kernel functions are (or will be) mapped onto CUDA dynamic parallelism on Kepler devices such as the GTX 780, GTX Titan. (Note: requires one of these cards to be effective).
  • When a parallel_do is placed in a __device__ function that is called directly from the host code (CPU computation device), the parallel_do will be accelerated using OpenMP.
  • The high-level matrix operations from the previous section are automatically taking advantage of the nested parallelism.
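
As a minimal sketch of the “8 threads, each spawning 64 threads” example mentioned above (names and the initialization step are purely illustrative):

% Sketch of nested parallelism: an outer kernel of 8 work items, each of which
% launches a nested parallel_do over 64 work items.
function [] = __kernel__ inner_kernel(y : mat, row : int, pos : int)
    y[row,pos] = row * 64 + pos
end

function [] = __kernel__ outer_kernel(y : mat, pos : int)
    % ... algorithm-specific initialization for this row would go here ...
    parallel_do(64,y,pos,inner_kernel)   % nested launch of 64 work items
end

y = zeros(8,64)
parallel_do(8,y,outer_kernel)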

Notes:

  • There is no guarantee that the CPU/GPU will effectively perform the nested operations in parallel. However, future GPUs may be expected to become more efficient in handling parallelism on different levels.

Limitations:

  • Nested kernel functions may not use shared memory (they can access the shared memory through the calling function however), and they may also not use thread synchronization.
  • Currently only one built-in parameter for the nested kernel functions is supported: pos (and not blkpos, blkidx or blkdim).

Example

The following program showcases the nested parallelism, the improved type inference and the automatic usage of dynamic kernel memory:

function y = gamma_correction(x, gamma)
    y = uninit(size(x))

    % Note: #pragma force_parallel is required here, otherwise
    % the compiler will just use dynamic typing. 
    #pragma force_parallel
    for m=0..size(x,0)-1
        for n=0..size(x,1)-1
            for k=0..size(x,2)-1
                y[m,n,k] = 512 * (x[m,n,k]/512)^gamma
            end
        end
    end
end

function [] = main()
    x = imread("lena_big.tif")
    y = gamma_correction(x, 0.5)

    #pragma force_parallel
    for m=0..size(y,0)-1
        for c=0..size(y,2)-1
            row = y[m,:,c]
            y[m,:,c] = row[numel(row)-1..-1..0]
        end
    end

    imshow(y)
end

The pragmas are just added here to illustrate that the corresponding loops need to be parallelized; using this pragma is optional. Note:

  • The lack of explicit typing in the complete program (even type T is only an unspecified type parameter)
  • The return type of gamma_correction is also not specified; however, the compiler is able to deduce that type(y,"cube") is true.
  • The second for-loop (inside the main function) uses slicing operations (: and ..). The assignment row=y[m,:,c] leads to dynamic kernel memory allocation.
  • The vector operations inside the second for-loop automatically express nested parallelism and can be mapped onto CUDA dynamic parallelism.

CUDA – Memory manager enhancements I (advanced)

There are some problems when operating on large images that do not fit into the GPU memory. The solution is to provide a FAULT-TOLERANT mode, in which the operations are completely performed on the CPU (we assume that the CPU has more memory than the GPU). Of course, running on the CPU comes with a performance hit. Therefore I will add some new configurable settings in this enhancement.

Please note that GPU memory problems can only occur when the total amount of memory used by one single kernel function exceeds (max GPU memory – reserved mem) × (1 – fragmented mem %). For a GPU with 1 GB, this might be around 600 MB. Quasar automatically transfers memory buffers back to the CPU memory when it is running out of GPU space. Nevertheless, this may not be sufficient, as some very large images can take up all the space of the GPU memory (for example 3D datasets).

Therefore, three configurable settings are added to the runtime system (Quasar.Redshift.config.xml):

1) RUNTIME_GPU_MEMORYMODEL with possible values:

  • SmallFootPrint – A small memory footprint – opts for conservative memory allocation leaving a lot of GPU memory available for other programs in the system
  • MediumFootprint (*) – A medium memory footprint – the default mode
  • LargeFootprint – chooses aggressive memory allocation, consuming a lot of available GPU memory quickly. This option is recommended for GPU memory intensive applications.

2) RUNTIME_GPU_SCHEDULINGMODE with possible values:

  • MaximizePerformance – Attempts to perform as many operations as possible on the GPU (potentially leading to a memory failure if there is not sufficient memory available). Recommended for systems with a lot of GPU memory.
  • MaximizeStability (*) – Performs operations on the CPU if there is not enough GPU memory available, for example when processing 512 MB images while the GPU only has 1 GB of memory. The resulting program may be slower. (FAULT-TOLERANT mode)

3) RUNTIME_GPU_RESERVEDMEM

  • The amount of GPU memory reserved for the system (in MB). The Quasar runtime system will not use the reserved memory (so that other desktop programs can still run correctly). Default value = 160 MB. This value can be decreased at the user’s risk to obtain more GPU memory for processing (desktop applications such as Firefox may complain…)

Please note that the “imshow” function also makes use of the reserved system GPU memory (the CUDA data is copied to an OpenGL texture).

(*) default

OpenCL And Function Pointers

Introduction

Quasar has a nice mechanism to compose algorithms in a generic way, based on function types.

For example, we can define a function that reads an RGB color value from an image as follows:

    RGB_color_image = __device__ (pos : ivec2) -> img[pos[0], pos[1], 0..2]

Now suppose we are dealing with images in some imaginary Yab color space where

    Y = R+2*G+B        G = (Y +   a + b)/4
    a = G - R     or   R = (Y - 3*a + b)/4
    b = G - B          B = (Y +   a - 3*b)/4

we can define a similar read-out function that automatically converts Yab to RGB:

    RGB_Yab_image = __device__ (pos : ivec2) -> 
        [dotprod(img[pos[0],pos[1],0..2], [1/4,-3/4,1/4]),
         dotprod(img[pos[0],pos[1],0..2], [1/4,1/4,1/4]),
         dotprod(img[pos[0],pos[1],0..2], [1/4,1/4,-3/4])]

Consequently, both functions have the same type:

    RGB_color_image : [__device__ ivec2 -> vec3]
    RGB_Yab_image   : [__device__ ivec2 -> vec3]

Then we can build an algorithm that works both with RGB and Yab images:

    function [] = __kernel__ brightness_contrast_enhancer(
        brightness :    scalar, 
        contrast : scalar, 
        x : [__device__ ivec2 -> vec3], 
        y : cube,
        pos : ivec2)        

        y[pos[0],pos[1],0..2] = x(pos)*contrast + brightness
    end

    match input_fmt with
    | "RGB" ->
        fn_ptr = RGB_color_image
    | "Yab" ->
        fn_ptr = RGB_Yab_image
    | _ ->
        error("Sorry, input format is currently not defined")
    end

    y = zeros(512,512,3)
    parallel_do(size(y),brightness,contrast,fn_ptr,y,brightness_contrast_enhancer)

Although this approach is very convenient, and also allows different algorithms to be constructed easily (for example for the YCbCr and Lab color spaces), there are a number of disadvantages:

  1. The C implementation typically requires making use of function pointers. However, OpenCL currently does not support function pointers, so these kinds of programs cannot be executed on OpenCL-capable hardware.
  2. Although CUDA supports function pointers, in some circumstances they result in an internal compiler error (NVCC bug). These cases are very hard to fix.
  3. In CUDA, kernel functions that use function pointers may be 2x slower than the same code without function pointers (e.g. by inlining the function).

Manual solution

The (manual) solution is to use function specialization:

    match input_fmt with
    | "RGB" ->
        kernel_fn = $specialize(brightness_contrast_enhancer, fn_ptr==RGB_color_image)
    | "Yab" ->
        kernel_fn = $specialize(brightness_contrast_enhancer, fn_ptr==RGB_Yab_image)
    | _ ->
        error("Sorry, input format is currently not defined")
    end

    y = zeros(512,512,3)
    parallel_do(size(y),brightness,contrast,y,kernel_fn)

Here, the function brightness_contrast_enhancer is specialized using both RGB_color_image and RGB_Yab_image respectively. These functions are then simply substituted into the kernel function code, effectively eliminating the function pointers.

Automatic solution

The Quasar compiler now has an option UseFunctionPointers, which can have the following values:

  • Always: function pointers are always used (causes more compact code to be generated)
  • SmartlyAvoid: avoid function pointers where possible (less compact code)
  • Error: generate an error if a function pointer cannot be avoided.

In the example of the manual solution, the function pointer cannot be avoided. However, when rewriting the code block as follows:

    y = zeros(512,512,3)
    match input_fmt with
    | "RGB" ->
        parallel_do(size(y),brightness,contrast,RGB_color_image,y,brightness_contrast_enhancer)
    | "Yab" ->
        parallel_do(size(y),brightness,contrast,RGB_Yab_image,y,brightness_contrast_enhancer)
    | _ ->
        error("Sorry, input format is currently not defined")
    end

The compiler is then able to automatically specialize the function brightness_contrast_enhancer for RGB_color_image and RGB_Yab_image (in the SmartlyAvoid and Error modes).

Generic implementations of Linear Filtering

Introduction

A linear filter computes a weighted average of a local neighborhood of pixel intensities, and the weights are determined by the so-called filter mask.

In essence, the 2D linear filtering formula can be implemented in Quasar using a short __kernel__ function:

function [] = __kernel__ filter(x : cube, y : cube, mask : mat, ctr : ivec3, pos : ivec3)
    sum =0.0
    for m=0..size(mask,0)-1
        for n=0..size(mask,1)-1
            sum += x[pos+[m,n,0]-ctr] * mask[m,n]
        end
    end
    y[pos] = sum
end

However, this may not be the fastest implementation, for two reasons:

  • The above kernel function performs several read accesses to x (e.g. for 3×3 masks it requires 9 read accesses per pixel!). As outlined in the Quick optimization guide, the implementation should use shared memory as much as possible.
  • In case the filter kernel is separable (i.e. mask = transpose(mask_y) * mask_x), a faster implementation can be obtained by performing the filtering in two passes: a horizontal pass and a vertical pass. However, a naive implementation of this approach may have bad data locality and, depending on the size of the filter mask, it may even do more harm than good.

The best approach is therefore to combine the above techniques (i.e. shared memory + separable filtering). For illustration purposes, we will consider only the mean filter (with mask=ones(3,3)/9) in the following.

  1. Non-separable implementation:
    x = imread("image.png")
    y = zeros(size(x))
    parallel_do(size(y),x,y,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
        y[pos] = (x[pos+[-1,-1,0]]+x[pos+[-1,0,0]]+x[pos+[-1,1,0]] + _
                  x[pos+[ 0,-1,0]]+x[pos         ]+x[pos+[0,1,0]] + _
                  x[pos+[ 1,-1,0]]+x[pos+[ 1,0,0]]+x[pos+[1,1,0]])*(1.0/9))
    imshow(y)

    For small filter kernels, a non-separable implementation is actually not a bad idea, because it exploits data locality well. Nevertheless, it is useful to investigate if the performance can be further improved.

  2. Separable implementation:
    x = imread("image.png")
    y = zeros(size(x))
    tmp = zeros(size(x))
    parallel_do(size(y),x,tmp,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
        y[pos] = x[pos+[-1,0,0]]+x[pos]+x[pos+[1,0,0]])
    parallel_do(size(x),tmp,y,__kernel__ (x:cube,y:cube,pos:ivec3) -> _
        y[pos] = (x[pos+[0,-1,0]]+x[pos]+x[pos+[0,1,0]])*(1.0/9))
    imshow(y)

    The separable implementation uses two distinct passes over the image. This requires using a temporary variable tmp and hence extra memory to be allocated.

  3. Separable implementation, using shared memory:
    function [] = __kernel__ filter3x3_kernel_separable(x:cube,y:cube,pos:ivec3,
        blkpos:ivec3,blkdim:ivec3)        
        vals = shared(blkdim+[2,0,0]) % shared memory
    
        sum = 0.
        for i=pos[1]-1..pos[1]+1   % step 1 - horizontal filter
            sum += x[pos[0],i,blkpos[2]]
        end
        vals[blkpos] = sum  % store the result        
        if blkpos[0]<2 % filter two extra rows (needed for vertical filtering)
            sum = 0.
            for i=pos[1]-1..pos[1]+1
                sum += x[pos[0]+blkdim[0],i,blkpos[2]]
            end
            vals[blkpos+[blkdim[0],0,0]] = sum 
        endif
        syncthreads       
        sum = 0.
        for i=blkpos[0]..blkpos[0]+2   % step 2 - vertical filter
            sum += vals[i,blkpos[1],blkpos[2]]
        end
        y[pos] = sum*(1.0/9)
    end
    x = imread("image.png")
    y = zeros(size(x))
    parallel_do(size(y),x,y,filter3x3_kernel_separable)
    imshow(y)

    Note that the above implementation is rather complicated; especially the block boundary handling code is excessive.

  4. Non-separable implementation using shared memory:
    function [] = __kernel__ filter3x3_kernel_nonseparable
    (x:cube,y:cube,pos:ivec3, blkpos:ivec3,blkdim:ivec3)        
        vals = shared(blkdim+[2,2,0]) % shared memory
    
        % Cache input in shared memory
        vals[blkpos] = x[pos-[1,1,0]]           
        if blkpos[0]<2
            vals[blkpos+[blkdim[0]-1,-1,0]] = x[pos+[blkdim[0]-1,-1,0]]
        endif
        if blkpos[1]<2
            vals[blkpos+[blkdim[1]-1,-1,0]] = x[pos+[blkdim[1]-1,-1,0]]
        endif
        syncthreads
    
        % Filtering
        sum = 0.0
        for m=0..2
            for n=0..2
                sum += vals[blkpos+[m,n,0]]
            end
        end         
        y[pos] = sum*(1.0/9)
    end
    x = imread("image.png")
    y = zeros(size(x))
    parallel_do(size(y),x,y,filter3x3_kernel_nonseparable)
    imshow(y)

    Although one would expect that version 4 is generally faster than version 1, this is not necessarily the case! The reason is that 1) recent GPU devices cache the global memory (diminishing the advantage of shared memory in this case) and 2) extra copy operations from global memory to shared memory are required, again leading to a performance penalty! Nevertheless, for some filter mask sizes, there might be a benefit.
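
    Whether the shared memory version actually pays off is therefore best verified by measurement; a simple timing sketch (assuming Quasar's tic/toc timer functions) could look like:

    tic()
    parallel_do(size(y),x,y,filter3x3_kernel_nonseparable)
    toc()   % elapsed time of the shared memory version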

Filtering abstraction

One of the reasons why Quasar was introduced was precisely to avoid having to worry too much about such implementation issues. It can become a severe pain when a shared memory algorithm needs to be written for every possible case.

Luckily, Quasar has two programming techniques that relieve this pain.

  1. Function variables and closure variables

    Suppose that we express a filtering operation in a general way:

    type f : [__device__ (cube, ivec2) -> vec3]

    This is a type declaration of a function that takes a cube and a 2D position as input, and computes a 3D color value.

    Then, a linear filter can be constructed simply as follows:

    mask = ones(3,3)/9  % any filter mask will do
    ctr = [1,1] % The center of the filter mask.
    function y : vec3 = linear_filter(x : cube, pos : ivec2)
        y = 0.0
        for m=0..size(mask,0)-1
            for n=0..size(mask,1)-1
                y += x[pos+[m,n,0]-ctr] * mask[m,n]
            end
        end
    end

    Note that the body of this function is essentially the body of the kernel function at the top of this page.

    Next, we can define a kernel function that performs filtering for any filtering operation of type f:

    function [] = __kernel__ generic_filter_kernel_nonseparable(x:cube,y:cube,
        masksz:ivec2,op:f,pos:ivec3,blkpos:ivec3,blkdim:ivec3) 
    
        vals = shared(blkdim+[masksz[0]-1,masksz[1]-1,0]) % shared memory
    
        % Cache input in shared memory
        vals[blkpos] = x[pos-[1,1,0]]           
        if blkpos[0]<masksz[0]-1
            vals[blkpos+[blkdim[0]-1,-1,0]] = x[pos+[blkdim[0]-1,-1,0]]
        endif
        if blkpos[1]<masksz[1]-1
            vals[blkpos+[blkdim[1]-1,-1,0]] = x[pos+[blkdim[1]-1,-1,0]]
        endif
        syncthreads
    
        % Filtering
        y[pos] = op(vals, blkpos)
    end
    x = imread("image.png")
    y = zeros(size(x))
    parallel_do(size(y),x,y,size(mask,0..1),linear_filter,generic_filter_kernel_nonseparable)
    imshow(y)

    Here, masksz = size(mask,0..1) (the size of the filter mask). Now we have written a generic kernel function that can take any filtering operation and compute the result in an efficient way. For example, the filtering operation can also be used for mathematical morphology or for computing local maxima:

    function y : vec3 = max_filter(x : cube, pos : ivec2)
        y = 0.0
        for m=0..size(mask,0)-1
            for n=0..size(mask,1)-1
                y = max(y, x[pos+[m,n,0]-ctr])
            end
        end
    end

    The magic here is the implicit use of closure variables: the functions linear_filter and max_filter hold references to non-local variables, i.e. variables that are declared outside the function. Here, these variables are mask and ctr. This way, the function signature is still [__device__ (cube, ivec2) -> vec3].

  2. Explicit specialization

    The previous point (1) is formally known as “generic programming”. Some people claim that, when programming in a generic way, you lose efficiency. One of their arguments is that efficiency is lost through the dynamic function call y[pos] = op(vals, blkpos), where op is actually a function pointer: the compiler is, for example, not able to inline op and has to emit very general code to deal with this case.

    In Quasar, this is not necessarily true – being a true domain-specific language, the compiler has a lot of information at its disposal. In fact, the optimization of the generic function generic_filter_kernel_nonseparable can be made explicit using the $specialize meta function:

    linear_filter_kernel = $specialize(generic_filter_kernel_nonseparable, op==linear_filter)
    max_filter_kernel    = $specialize(generic_filter_kernel_nonseparable, op==max_filter)
    
    x = imread("image.png")
    y = zeros(size(x))
    parallel_do(size(y),x,y,size(mask,0..1),linear_filter_kernel)
    imshow(y)

    The function $specialize is evaluated at compile-time and will substitute op with linear_filter and max_filter, respectively. Correspondingly, these two functions can be inlined and the resulting code is equivalent to a linear_filter_kernel function written completely by hand.

Conclusion

Functions with closure variables are building blocks for larger algorithms. Functions can have arguments that are functions themselves. Function specialization is a compiler operation that can be used to generate explicit code for fixed argument values. In the future, function specialization may be done automatically in some circumstances.

Generic Programming – Opening New Frontiers

To solve a limitation of Quasar, in which __kernel__ functions in some circumstances needed to be duplicated for different container types (e.g. vec[int8], vec[scalar], vec[cscalar]), there is now finally support for generic programming.

Consider the following program that extracts the diagonal elements of a matrix and that is supposed to deal with arguments of either type mat or type cmat:

    function y : vec = diag(x : mat)
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = zeros(N)
        parallel_do(size(y), __kernel__ 
            (x:mat, y:vec, pos:int) -> y[pos] = x[pos,pos])
    end

    function y : cvec = diag(x : cmat)
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = czeros(N)
        parallel_do(size(y), __kernel__ 
            (x:cmat, y:cvec, pos : int) -> y[pos] = x[pos,pos])
    end

Although function overloading here greatly solves part of the problem (at least from the user’s perspective), there is still duplication of the function diag. In general, we would like to specify functions that can “work” irrespective of their underlying type.

The solution is to use Generic Programming. In Quasar, this is fairly easy to do:

    function y = diag[T](x : mat[T])
        assert(size(x,0)==size(x,1))
        N = size(x,0)
        y = vec[T](N)
        parallel_do(size(y), __kernel__ 
            (pos) -> y[pos] = x[pos,pos])
    end

As you can see, the types in the function signature have simply been omitted. The same holds for the __kernel__ function.
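
As a usage sketch (illustrative only), calling the generic diag once with a real-valued and once with a complex-valued matrix triggers two instantiations:

    A = ones(4,4)
    d1 = diag(A)                   % instantiated with T == scalar
    B = complex(ones(4,4))
    d2 = diag(B)                   % instantiated with T == cscalar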

In this example, the type parameter T is required because it is needed for the construction of the vector y (through the vec[T] constructor). If T==scalar, vec[T] reduces to zeros, while if T==cscalar, vec[T] reduces to czeros (a complex-valued zero vector). In case the type parameter is not required, it can be dropped, as in the following example:

    function [] = copy_mat(x, y)
        assert(size(x)==size(y))
        parallel_do(size(y), __kernel__
            (pos) -> y[pos] = x[pos])
    end

Remarkably, this is still a generic function in Quasar; no special syntax is needed here.

Note that in previous versions of Quasar, all kernel function parameters needed to be explicitly typed. This is no longer the case: the compiler will deduce the parameter types from the calls to diag by applying the internal type inference mechanism. The same holds for __device__ functions.

When calling diag with two different types of parameters (for example once with x:mat and a second time with x:cmat), the compiler will make two generic instantiations of diag. Internally, the compiler may either:

  1. Keep the generic definition (type erasure):
    function y = diag(x)
  2. Make two instances of diag (reification):
    function y : vec  = diag(x : mat)
    function y : cvec = diag(x : cmat)

The compiler will combine these two techniques in a transparent way, such that:

  1. For kernel functions, explicit code is generated for the specific data types,
  2. For less performance-critical host code, type erasure is used (to avoid code duplication).

The selection of the code to run is made at compile-time, so correspondingly the Spectroscope Debugger needs special support for this.

Of course, when calling the diag function with a variable whose type cannot be determined at compile-time, a compiler error is generated:

The type of the arguments ('op') needs to be fully defined for this function call!

This is then similar to the original handling of kernel functions.

Extensions

There are several extensions possible to fine-tune the behavior of the generic code.

Type classes

Type classes allow the type range of the input parameters to be narrowed. For example:

    function y  = diag(x : [mat|cmat])

This construction only allows variables of the type mat and cmat to be passed to the function. This is useful when it is already known in advance which types are relevant (in this case a real-valued or complex-valued matrix).

Equivalently, type class aliases can be defined. The type:

    type AllInt : [int|int8|int16|int32|uint8|uint32|uint64]

groups all integer types that exist in Quasar. Type classes are also useful for defining reductions:

    type RealNumber: [scalar|cube|AllInt|cube[AllInt]]
    type ComplexNumber: [cscalar|ccube]

    reduction (x : RealNumber) -> real(x) = x   

Without type classes, the reduction would need to be written four times, once for each element of the type class.
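
Spelled out per element of the RealNumber class, the single reduction above is equivalent to four separate reductions (a sketch):

    reduction (x : scalar)       -> real(x) = x
    reduction (x : cube)         -> real(x) = x
    reduction (x : AllInt)       -> real(x) = x
    reduction (x : cube[AllInt]) -> real(x) = x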

Type parameters

Levels of genericity

There are three levels of genericity (for which generic instances can be constructed):

  1. Type constraints: a type constraint binds the type of an input argument of the function.
  2. Value constraints: a value constraint gives an explicit value to an input argument.
  3. Logic predicates: predicates that are not type or value constraints.

As an example, consider the following generic function:

    function y = __device__ soft_thresholding(x, T)
        if abs(x)>=T
            y = (abs(x) - T) * (x / abs(x))
        else
            y = 0
        endif
    end

    reduction x : scalar -> abs(x) = x where x >= 0

Now, we can make a specialization of this function to a specific type:

    soft_thresholding_real = $specialize(soft_thresholding,
        type(x,"scalar") && type(T, "scalar"))

But also for a fixed threshold:

    soft_thresholding_T = $specialize(soft_thresholding,T==10)

We can even go one step further and specify that x>0:

    soft_thresholding_P = $specialize(soft_thresholding,x>0)

Everything combined, we get:

    soft_thresholding_E = $specialize(soft_thresholding,
        type(x,"scalar") && type(T,"scalar") && T==10 && x>0)

Based on this knowledge (and the above reduction), the compiler will then generate the following function:

    function y = __device__ soft_thresholding_E(x : scalar, T : scalar)
        if x >= 10
            y = x - 10
        else
            y = 0
        endif
    end

There are two ways of performing this type of specialization:

  1. Using the $specialize function. Note that this approach is only recommended for testing.
  2. Alternatively, the specializations can be performed automatically, using the assert function from the calling function:
    function [] = __kernel__ denoising(x : mat, y : mat, pos : ivec2)
    
        assert(x[pos]>0)    
        y[pos] = soft_thresholding(x[pos], 10)
    
    end

Modules, Namespaces And Concealed Functions

When importing different modules (.q files) using the import keyword, there is a risk of name collision of functions and variables.

To avoid this, the concealed specifier can be used on functions, to prevent the function definition from being exposed to other modules.

There are a number of rules:

  • Functions declared without the concealed specifier, for example:
    function [] = foo()

    are public to (and only to!) the module that imports the function definition.

    Note: it may happen that the interpreter still raises an error when it encounters the same function twice; this will be fixed in future versions of Quasar.

  • Functions declared with the concealed specifier are only visible within the same module:
    function [] = foo() concealed

    Any attempt to use the function from another module will cause a compiler error (undefined function).

  • Variables do not have a concealed specifier, and are thus public when defined in the global scope.
  • Another trick to hide functions is to use inner functions. Inner functions can only be accessed from within the outer function
    function [] = foo_public()
        function [] = foo_private()
        end
    end

The motivation for this approach is that it stops the programmer from worrying about package/namespace names, as there is no need to refer to package names inside the .q module (other than in the import definition).

In the future it may become possible to do selective imports from .q modules (for example import from "system.q" [diag, herm_transpose]).

Limitations (as of May, 2013):

  • Concealed functions can currently not be used as closure variables of other functions; this will cause a compiler error.
  • Due to internal restrictions, generic functions can currently not be concealed; the concealed specifier is simply ignored by the compiler. In the future this may change.

Comparison to C++/C# namespaces

In Quasar, every .q module has its own namespace, which is named after the module itself. When a .q module imports another module, that module's namespace is also included.

For example:

    // module1.q translated to C++
    namespace module1 {
        // Definitions
    }

    // module2.q translated to C++
    // import "module1.q"
    #include "module1.q"

    namespace module2 {
        using namespace module1;
    }

Example:

    % ===== concealed_incl.q =================
    import "system.q"

    function [] = my_method() concealed
        print "Great success!"
    end

    function [] = call_my_method()
        my_method()
    end

    % ===== concealed.q =================
    import "concealed_incl.q"

    function [] = my_method() concealed
        print "Second great success!"
    end

    function [] = main()

        call_my_method()  % Prints "Great success!"
        my_method() % Prints "Second great success!"

        % Note: cannot access functions from system.q from here,
        % an explicit import is needed.

    end

Explanation:

  • call_my_method() can only see my_method() in concealed_incl.q.
  • main() in concealed.q only has access to my_method() declared in the same module.
  • Hence, my_method is defined twice, but private to each module.

Pre-processor branches

Sometimes there is a need to write code that is compiled conditionally, based on some input parameters. For example in C, for a conversion from floating point to integer:

#define TARGET_PLATFORM_X86

#if TARGET_PLATFORM_X86
    int __declspec(naked) ftoi(float a)
    { _asm {
        movss       xmm0,   [esp+4]
        cvtss2si    eax,    xmm0
        ret
    }}
#else
    int ftoi(float a)
    {
        return (int) a;
    }
#endif

Quasar can now achieve something similar to these pre-processor directives, although not in the ugly C style that allows pre-processor branches to be intertwined with control structures (which hampers code walkers and debuggers). It suffices to write:

ROUNDING_ACTIVE = 1

if ROUNDING_ACTIVE
    ftoi = a -> int(round(a))
else
    ftoi = a -> int(a)
endif

Isn't that convenient? Note that it is not necessary to tell the compiler that ROUNDING_ACTIVE is a constant (or pre-processor symbol), as the compiler can figure this out by itself. If you want to prevent yourself from later changing this constant at other locations in the code, you can express that the constant is not to be changed:

ROUNDING_ACTIVE:'const = 1

which is a short-hand (due to type inference) for

ROUNDING_ACTIVE:int'const = 1

Using the constant, the above branch will automatically be reduced to:

ftoi = a -> int(round(a))

Note that this only works with lambda expressions. For functions (whose names need to be unique), this should be done as follows:

function [retval : int] = ftoi(a)
    if ROUNDING_ACTIVE
        retval = int(round(a))
    else
        retval = int(a)
    endif
end

Again, there is no loss of performance (even when ftoi is a __device__ function).
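
A quick usage sketch (illustrative only): because the branch is resolved at compile time, calling ftoi behaves like a constant-folded function:

print ftoi(3.7)   % 4 when ROUNDING_ACTIVE = 1, and 3 otherwise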

And no, there are no inline assembler instructions, nor will there ever be. Quasar code is supposed to be portable across different platforms. If you wish to use inline assembly, revert to a good macro assembler instead.