
Debugging Quasar kernel functions

Quasar Redshift has recently been equipped with a parallel debugger for kernel functions. In parallel debugging mode, the kernel function is emulated on a “virtual” CUDA-like device architecture, with device parameters taken from the GPU installed in your system. This permits device-independent debugging of kernel functions.

To enter the parallel debugging mode, it suffices to place a breakpoint inside a kernel function (or a device function) and to start the program. When the breakpoint is hit, Redshift jumps to the new parallel debugger pane (see below).
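A minimal sketch of such a kernel function and its launch (the function name, the data and the gain parameter are purely illustrative, and the exact signatures are written from memory):

    % Illustrative kernel function: scales every pixel by a constant factor.
    % Place a breakpoint on the assignment line and run the program to
    % enter the parallel debugger.
    function [] = __kernel__ brighten(x : mat, y : mat, gain : scalar, pos : ivec2)
        y[pos] = gain * x[pos]
    end

    x = rand(128, 128)     % some illustrative input data
    y = zeros(size(x))
    parallel_do(size(y), x, y, 1.5, brighten)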

In the left pane, you see a visualization of the shared memory (through the variable xr), as well as the current position and block position variables (pos and blkpos, respectively). In the middle pane at the bottom, some information about the kernel function is given. The occupancy is an indicator of the performance of the kernel function (for example, a low occupancy may be caused by selecting block sizes that are too small). Occupancy alone, however, does not give a full picture: there exist kernel functions that are designed to achieve a low occupancy but offer very high throughput (see, for example, the Volkov GTC 2010 talk).

In the right pane, an overview of the scheduled blocks and active threads is given. For active threads, the barrier index is also shown: barrier 0 corresponds to all statements before the first barrier, barrier 1 to the statements between the first and second barrier, and so on. Using the context menu of the list box, it is possible to switch to another thread or to place a breakpoint at a given block.
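The sketch below (again with illustrative names) ties these elements together: xr is a shared memory array of the kind visualized in the left pane, pos and blkpos are the position and block position parameters, and the comments mark the barrier segments that the thread overview refers to. The shared() allocation and parameter list follow my recollection of the Quasar kernel syntax and may need minor adjusting:

    function [] = __kernel__ block_mirror(x : mat, y : mat, pos : ivec2, blkpos : ivec2, blkdim : ivec2)
        % xr: shared memory tile, one element per thread of the block
        xr = shared(blkdim[0], blkdim[1])

        % barrier segment 0: statements before the first barrier;
        % each thread loads one element of its block's tile
        xr[blkpos] = x[pos]

        syncthreads

        % barrier segment 1: statements after the first barrier;
        % it is now safe to read elements loaded by other threads of
        % the same block (here: the diagonally opposite thread)
        y[pos] = xr[blkdim[0]-1-blkpos[0], blkdim[1]-1-blkpos[1]]
    end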

The parallel debugger allows you to:

  • Step through the kernel function (e.g. using F8 or shift-F8) [code editor]
  • Place (conditional) breakpoints in kernel functions [code editor]
  • Traverse different positions in the data array (or pixels in an image) [parallel debugger pane]
  • Jump to the next thread/block [parallel debugger pane]
  • See which threads are currently active/inactive [parallel debugger pane]
  • Inspect the values of all variables that are used inside a kernel function [variable watch]
  • Visualize the content of the shared memory [variable watch]
  • Record videos of the kernel function in action [variable watch]

The statement evaluation order is not necessarily linear, especially when thread synchronization is used (through syncthreads). syncthreads places a barrier that all threads within a block must have reached before any of them continues past it.

Internally, kernel functions are interpreted on the CPU, in order to allow full inspection of all variables, placing breakpoints at arbitrary positions within a block, etc. Moreover, for clarity, the threads between two barriers are executed serially. When a barrier (or the end of the kernel function) is reached, the debugger switches to the next available thread.

Serial execution certainly makes it easy to follow what your program is doing; however, in the case of data races, it may also hide potential memory access conflicts (precisely because of the serialization). For this reason, there is also an option to “mimic parallel execution”, in which random thread context switches are made.
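To illustrate the kind of race that serial execution can mask, consider a variant of the earlier sketch in which the syncthreads has accidentally been omitted (again an illustrative sketch, not production code):

    function [] = __kernel__ shift_racy(x : mat, y : mat, pos : ivec2, blkpos : ivec2, blkdim : ivec2)
        xr = shared(blkdim[0], blkdim[1])
        xr[blkpos] = x[pos]
        % A syncthreads is missing here. Under serial emulation, where the
        % threads of a block run one after another, the neighbouring element
        % read below will often already have been written, so the bug can
        % stay hidden. Under (mimicked) parallel execution, the neighbouring
        % thread may not have written its element yet.
        y[pos] = xr[blkpos[0], max(blkpos[1]-1, 0)]
    end

With “mimic parallel execution” enabled, the random context switches make it much more likely that the missing barrier shows up as an incorrect value in y.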