CUDADirective-1.5

Callbacks should not wait for other callbacks

Required inputs: IR

CUDA DIRECTIVE 1.5 [synchronize.callback] Callbacks should not wait for other callbacks

In callbacks which happen in stream order, e.g. those created via cudaStreamAddCallback, cuStreamAddCallback, cudaLaunchHostFunc, cuLaunchHostFunc, cudaGraphAddHostNode, or cuGraphAddHostNode, do not wait for a condition that will be satisfied by another callback.

Scope: Host.
Audience: CUDA C++, CUDA Libraries.
Hardware Applicability: All Compute Capabilities.
Rationale

Callbacks registered with CUDA which happen in stream order are executed in unspecified threads of execu- tion which are owned by the CUDA runtime and driver. There execution may be serialized, even if there is not a mandated order between them. Therefore, a callback blocking on another callback may never complete, because the callback it is waiting for may not start executing until the blocking callback completes.

Example 1 (Bad)
# include <cassert>
# include <atomic>
# include <memory>
# include <cuda_runtime.h>
auto make_stream() {
  auto deleter = [] (CUstream_st* ptr) {
    cudaError_t const error0 = cudaStreamDestroy(ptr);
    assert(cudaSuccess == error0);
  };
  CUstream_st* raw_stream = nullptr;
  cudaError_t const error1 = cudaStreamCreate(&raw_stream);
  assert(cudaSuccess == error1);
  return std::unique_ptr<CUstream_st, decltype(deleter)>(raw_stream, deleter);
}
std::atomic<bool> flag{false};

__host__ void poll(void*) {
  while (!flag.load(std::memory_order_acquire))
    ;
}

__host__ void signal(void*) {
  flag.store(true, std::memory_order_release);
}

int main() {
  auto stream0 = make_stream();
  // Callback waits for a global object to change status, which may never change
  // since the signal is in a different callback.
  cudaError_t const error0 = cudaLaunchHostFunc(stream0.get(), &poll, nullptr);
  assert(cudaSuccess == error0);
  auto stream1 = make_stream();
  // call back changes global object status.
  cudaError_t const error1 = cudaLaunchHostFunc(stream1.get(), &signal, nullptr);
  assert(cudaSuccess == error1);
  cudaError_t const error2 = cudaDeviceSynchronize();
  assert(cudaSuccess == error2);
}
Excerpt from NVIDIA CUDA C++ Guidelines for robust and safety-critical programming, Version 3.0.1, Copyright (C) 2018-2023 NVIDIA Corporation.

Possible Messages

Key

Text

Severity

Disabled

synchronization_primitive_used

{} synchronization primitive used in a CUDA callback

None

False

Options

functions

functions

Type: set[bauhaus.analysis.config.QualifiedName]

Default: {'cuGraphAddHostNode', 'cuLaunchHostFunc', 'cuStreamAddCallback', 'cudaGraphAddHostNode', 'cudaLaunchHostFunc', 'cudaStreamAddCallback'}

Functions that enqueue callbacks (on a stream, in a graph, …).
 

synchronization_types

synchronization_types

Type: set[bauhaus.analysis.config.QualifiedName]

Default: {'cnd_t', 'mtx_t', 'std::atomic', 'std::barrier', 'std::condition_variable', 'std::condition_variable_any', 'std::future', 'std::jthread', 'std::latch', 'std::mutex', 'std::recursive_mutex', 'std::recursive_timed_mutex', 'std::shared_future', 'std::thread', 'std::timed_mutex', 'thrd_t'}

Types that are typically used for blocking waits on a condition.