CUDA-1.14

A implicit thread block synchronization function should be called by all active threads

Required inputs: IR, StaticSemanticAnalysis

CUDA 1.14 [synchronize.active_threads] A implicit thread block synchronization function should be called by all active threads

An implicit thread block synchronization function is a synchronization function that has predefined behavior which cannot be changed by the programmer. The functions are defined to work on all active threads and must be called by all active threads or no threads; otherwise, the behavior is undefined. The following are implicit thread block synchronization functions:

  • __syncthreads, __syncthreads_count, __syncthreads_and, and __syncthreads_or.
  • All variations of this_thread_block().sync().
Scope: Device.
Audience: CUDA C++.
Category: Mandatory.
Hardware Applicability: All Compute Capabilities.
Rationale

The thread block synchronization functions ensure that all active threads have reached a given point in the program. All global and shared memory accesses made by these threads before the synchronization are guaranteed to be visible to all of the threads in the block after the call. Any active thread that skips the synchronization point violates the memory guarantees and may keep the call from returning.

Example 1 (Bad)
__global__ void kernel() {
  for(unsigned i = 0; i < blockDim.x; ++i) {
    if (i == threadIdx.x)
      __syncthreads(); // Non-compliant: synchronization function called with
// a CTA divergent condition.
  }
}
Example 2 (Bad)
__global__ void kernel(int flag) {
  if (flag)
    goto end; // Non-compliant: goto bypasses a synchronization function.
  __syncthreads();
end:
  return;
}
Example 3 (Bad)
__global__ void kernel(int flag) {
  if (flag)
    __syncthreads(); // Ok, condition is not CTA-divergent.
  if (blockIdx.x < flag)
    __syncthreads(); // Ok, condition is not CTA-divergent.
  if (threadIdx.x < 2)
    __syncthreads(); // Non-conformant: condition is CTA-divergent.
  __syncthreads();
  if (threadIdx.x < flag)
    __syncthreads(); // Ok, condition is potentially CTA-divergent but device-
// threads are either participating or terminating since
// this is the last statement of the kernel.
}
Example 4 (Bad)
# include <cooperative_groups.h>

using namespace cooperative_groups;

__device__ void kernel(int flag) {
  thread_block g = this_thread_block();
  if (flag)
    g.sync(); // Ok, condition is not CTA-divergent.
  if (blockIdx.x < flag)
    g.sync(); // Ok, condition is not CTA-divergent.
  if (g.thread_rank() < 2)
    g.sync(); // Nonconfromant: divergent code.
  g.sync();
  if (threadIdx.x < flag)
    g.sync(); // Ok, condition is potentially CTA-divergent but device-
// threads are either participating or terminating since
// this is the last statement of the kernel.
}
Excerpt from NVIDIA CUDA C++ Guidelines for robust and safety-critical programming, Version 3.0.1, Copyright (C) 2018-2023 NVIDIA Corporation.

Possible Messages

Key

Text

Severity

Disabled

thread_synchronization_in_divergent_region

Implicit thread block synchronization function is called in possibly divergent code.

None

False

Options

synchronization_functions

synchronization_functions

Type: set[bauhaus.analysis.config.QualifiedName]

Default: {'__syncthreads', '__syncthreads_and', '__syncthreads_count', '__syncthreads_or', 'cooperative_groups::__v1::thread_block::sync'}

Names of synchronization functions for the purposes of this rule. Qualified names can be used.