New Feature in Quasar: Cooperative Thread Groups

The latest release of Quasar adds support for CUDA 9 cooperative groups. Below, we describe how cooperative groups can be used from Quasar. Overall, cooperative threading opens up some interesting optimization possibilities for Quasar kernel functions.

1 Synchronization granularity

The keyword syncthreads now accepts a parameter that indicates which threads are being synchronized. This allows finer-grained control over synchronization.

Keyword                   Description
syncthreads(warp)         performs synchronization across the current (possibly diverged) warp (32 threads)
syncthreads(block)        performs synchronization across the current block
syncthreads(grid)         performs synchronization across the entire grid
syncthreads(multi_grid)   performs synchronization across the entire multi-grid (multi-GPU)
syncthreads(host)         performs synchronization across all host and device threads (CPU and GPU)

The first variant, syncthreads(warp), allows divergent threads to synchronize at any point (it is also useful in the context of Volta’s independent thread scheduling). syncthreads(block) is equivalent to syncthreads in previous versions of Quasar. The grid synchronization primitive syncthreads(grid) is particularly interesting: it is a new feature of CUDA 9 that allows barriers to be placed inside a kernel function which synchronize all blocks of the grid. The following function:

function y = gaussian_filter_separable(x, fc, n)
    % horizontal filtering pass
    function [] = __kernel__ gaussian_filter_hor(x : cube, y : cube, fc : vec, n : int, pos : vec3)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [0,i-n,0]] * fc[i]
        endfor
        y[pos] = sum
    endfunction
    % vertical filtering pass
    function [] = __kernel__ gaussian_filter_ver(x : cube, y : cube, fc : vec, n : int, pos : vec3)
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [i-n,0,0]] * fc[i]
        endfor
        y[pos] = sum
    endfunction

    z = uninit(size(x))
    y = uninit(size(x))
    parallel_do (size(y), x, z, fc, n, gaussian_filter_hor)
    parallel_do (size(y), z, y, fc, n, gaussian_filter_ver)
endfunction

can now be simplified to:

function y = gaussian_filter_separable(x, fc, n)
    function [] = __kernel__ gaussian_filter_separable(x : cube, y : cube, z : cube, fc : vec, n : int, pos : vec3)
        % horizontal filtering pass
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + x[pos + [0,i-n,0]] * fc[i]
        endfor
        z[pos] = sum
        % grid-wide barrier: every block has finished writing z before any block reads it
        syncthreads(grid)
        % vertical filtering pass
        sum = 0.
        for i=0..numel(fc)-1
            sum = sum + z[pos + [i-n,0,0]] * fc[i]
        endfor
        y[pos] = sum
    endfunction
    z = uninit(size(x))
    y = uninit(size(x))
    parallel_do (size(y), x, y, z, fc, n, gaussian_filter_separable)
endfunction

The advantage is not only improved readability: the number of kernel function calls is reduced as well, which further increases performance. Note: this feature requires at least a Pascal GPU.
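For readers who want to see what happens underneath, grid-wide synchronization is exposed in CUDA 9 through the grid_group class and requires a cooperative kernel launch, which is also why a Pascal GPU (compute capability 6.0) is the minimum requirement. The following is a minimal CUDA C++ sketch of that mechanism (not Quasar code; the kernel, the buffer names and the toy two-phase computation are purely illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Toy two-phase kernel: phase 1 writes an intermediate buffer z, the grid-wide
// barrier guarantees that *all* blocks have finished phase 1, and only then does
// phase 2 read z (including elements written by other blocks).
__global__ void two_phase(const float* x, float* z, float* y, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) z[i] = 2.0f * x[i];               // phase 1

    grid.sync();                                  // barrier across the entire grid

    if (i < n) y[i] = z[i] + z[(i + 1) % n];      // phase 2 may read any element of z
}

// A kernel containing grid.sync() must be launched cooperatively, for example:
//   void* args[] = { &x, &z, &y, &n };
//   cudaLaunchCooperativeKernel((void*)two_phase, grid_dim, block_dim, args, 0, 0);
// and must be compiled with -rdc=true for an architecture of sm_60 or higher.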

2 Interwarp communication

New special kernel function parameters (similar to blkpos, pos, etc.) will be added to give access to and control over the GPU thread groups.

Parameter           Type           Description
coalesced_threads   thread_block   a thread block of coalesced threads
this_thread_block   thread_block   describes the current thread block
this_grid           thread_block   describes the current grid
this_multi_grid     thread_block   describes the current multi-GPU grid

The new thread_block class will have the following methods.

Method                            Description
sync()                            synchronizes all threads within the thread block
partition(size : int)             allows partitioning the thread block into smaller blocks of the given size
shfl(var, src_thread_idx : int)   direct copy of a variable from another thread, with the source thread index given absolutely
shfl_up(var, delta : int)         direct copy from another thread, with the source index specified relative to the current thread (delta positions lower)
shfl_down(var, delta : int)       direct copy from another thread, with the source index specified relative to the current thread (delta positions higher)
shfl_xor(var, mask : int)         direct copy from another thread, with the source index obtained by XOR-ing the current thread index with the given mask
all(predicate)                    returns true if the predicate evaluates to non-zero for all threads within the thread block
any(predicate)                    returns true if the predicate evaluates to non-zero for any thread within the thread block
ballot(predicate)                 evaluates the predicate for all threads within the thread block and returns a mask in which every bit corresponds to the predicate of one thread
match_any(value)                  returns a mask of all threads that have the same value
match_all(value)                  returns a mask only if all threads share the same value, and 0 otherwise

In principle, the above functions allow threads to communicate with each other. The shuffle operations take values directly from other active threads (active meaning not disabled due to thread divergence). all, any, ballot, match_any and match_all make it possible to determine whether the threads have reached a given state.
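To make this concrete, here is a small CUDA C++ sketch (again not Quasar code; the kernel and variable names are illustrative) that obtains the group of coalesced threads and applies shfl, ballot and all to it; the Quasar parameters and methods listed above presumably map onto these CUDA 9 primitives:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void vote_and_broadcast(const float* x, float* y, unsigned int* masks, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                           // threads returning here are inactive below

    // the group of threads that are currently active (not masked off by divergence)
    cg::coalesced_group active = cg::coalesced_threads();

    float v = x[i];

    // ballot(): one bit per active thread, set when its predicate is non-zero
    unsigned int mask = active.ballot(v > 0.0f);

    // shfl(): read a value directly from the registers of another active thread
    float first = active.shfl(v, 0);              // value held by thread 0 of the group

    // all(): collective vote over the whole group
    y[i] = active.all(v > 0.0f) ? first : 0.0f;
    masks[i] = mask;
}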

The warp shuffle operations require at least a Kepler GPU and make it possible to avoid shared memory altogether (register access is faster than shared memory). This may again bring performance benefits for computationally intensive kernels such as convolutions and parallel reductions (sum, min, max, prod, etc.).
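As an illustration of this use case, the sketch below (CUDA C++, with illustrative names) computes one partial sum per warp using only register shuffles, without touching shared memory:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each warp (a tile of 32 threads) reduces its 32 input values to a single
// partial sum using register shuffles only; no shared memory is involved.
__global__ void warp_partial_sums(const float* x, float* partial, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? x[i] : 0.0f;

    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // tree reduction: after log2(32) = 5 steps, lane 0 holds the sum of the warp
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

    if (warp.thread_rank() == 0 && i < n)
        partial[i / 32] = v;                      // one partial sum per warp
}

// The per-warp partial sums (one per 32 input elements) still have to be combined,
// for example in a second kernel or on the host.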

Using this functionality requires the CUDA target to be specified explicitly (i.e., the functionality cannot easily be simulated on the CPU). This is done by placing the following code attribute inside the kernel: {!kernel target="nvidia_cuda"}. For CPU execution, a separate kernel would then need to be written. Luckily, several of the warp shuffle optimizations can be integrated into the compiler optimization stages, so that only a single kernel needs to be written.