万物化作，萌区有状，盛衰之杀，变化之流也。
本文以helloworld为例，简述CUDA的线程结构

基本概念

即使是HelloWorld，Cuda程序也引入了一些不同的概念，确保了解《简略CUDA和GPU相关概念》中的内容在进行下去

从HelloWorld开始

下面是一个最简单的HelloWorld程序：

/**
 运行命令: cl ./hello_host_only.cpp
**/

#include <cstdio>
using namespace std;

int main() {
    printf("Host: Hello wolrd!");
    return 0;
}

在windows可在x64 Native Tools Command Prompt for VS 2022用cl来编译它

只有Host代码的CUDA版本HelloWolrd

/**
 运行命令: nvcc -arch=sm_62 hello_host_only.cu
 sm_62的来源:
 https://docs.nvidia.com/cuda/archive/12.9.0/cuda-compiler-driver-nvcc/index.html
 https://en.wikipedia.org/wiki/CUDA#GPUs_supported
*/

#include <cstdio>
using namespace std;

int main() {
    printf("Host: Hello wolrd!");
    return 0;
}

代码完全没有变化，只有文件后缀名发生了变化

但这里性质发生了变化，代码分为了Host代码部分和Device代码部分由nvcc分别处理

nvcc指定运算架构

注意到注释部分，nvcc可以指定虚拟和真实架构的编号，可以通过链接查看本地设备可以使用的标识

如果需要匹配多个架构，可以使用多个-gencode -arch 虚拟架构 -code 真实架构来编译产生不同的输出，最后真实架构不能落后与虚拟架构

使用CUDA线程

从这里正式开始添加Device代码：

/**
 运行命令: nvcc  XX.cu
*/
#include<cuda.h>
#include<cuda_runtime.h>
#include<cstdio>

__global__ void hello_from_gpu()
{
    printf("Device: Hello World.");
}
int main()
{
    const int grid_size=1;
    const int block_size=1;
    hello_from_gpu<<<grid_size,block_size>>>();

    cudaDeviceSynchronize();
    return 0;
}

注意四点：

include包含了两个cuda相关头文件，这不是必须要添加的，nvcc会自动引入，其中包含了cstdlib的内容，但没有包含cstdio的内容，最后device程序是不认iostream的
__global___就是device代码，即核函数的标志
cudaDeviceSynchronize的功能是同步host和device，也就是说gpu上的代码和cpu上的代码是异步运行的
<<<grid_size,block_size>>>指定了用多少线程运行该核函数

使用多线程的CUDA程序

再进一步改进程序

/**
 运行命令: nvcc  XX.cu
*/
#include<cuda.h>
#include<cuda_runtime.h>
#include<cstdio>

__global__ void hello_from_gpu()
{
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    printf("Device: Hello World from block %d and thread %d\n",bid,tid);
}
int main()
{
    const int grid_size=2;
    const int block_size=4;
    hello_from_gpu<<<grid_size,block_size>>>();

    cudaDeviceSynchronize();
    return 0;
}

CUDA将线程组织为网格，网格被划分为线程块，即网格由线程块组成，线程块由线程组成，三者之间并不是简单的包含关系（即一般论的包含关系，网格有N个线程块，每个线程块又有M个线程），但是目前示例中确实只是单纯的包含关系，可简单认为有网格两个线程块，每个线程块四个线程共八个线程

下图展示了多维线程的布局，这也是网格的真正分布：

可以认为网格是一个三维立方体，被划分出的小格就是线程块，而小格又可以按三维继续分割，分割出的就是线程。

像上面简单的以2和4指定的线程块个数和线程个数，三维上可以分别认为是(2,1,1)和(4,1,1)，只按x轴延伸的2个大方块，每个大方块又是由按x轴延展的4个小方块组成

多维线程的CUDA程序

既然有了三维建构，每个线程块和线程自然有其三维坐标，按照简单的数学逻辑也可以转换为1维坐标，这里直接把这些概念在代码中展示：

/**
 运行命令: nvcc  XX.cu
*/
#include<cuda.h>
#include<cuda_runtime.h>
#include<cstdio>

__global__ void hello_from_gpu()
{
    const int bid = blockIdx.x;
    const int tid_x = threadIdx.x;
    const int tid_y = threadIdx.y;
    const int tid_z = threadIdx.z;
    //1D坐标:
    const int bid_1D = blockIdx.x+blockIdx.y*gridDim.x+blockIdx.z*gridDim.x*gridDim.y;
    const int tid_1D = threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z;
    printf("Device(%d)(%d): Hello World from block %d and thread (%d-%d-%d)\n",bid_1D,tid_1D,bid,tid_x,tid_y,tid_z);
}
int main()
{
    const int grid_size=2;
    const dim3 block_size(3,3,3);
    hello_from_gpu<<<grid_size,block_size>>>();

    cudaDeviceSynchronize();
    return 0;
}

注意到：

dim3类似于vec3，表3维向量
gridDim和blockDim是dim3固有变量，内容是当初规定的各个轴的上限
blockIdx和threadIdx也是dim3固有变量，包含了当前线程的坐标

另外，线程块在3个轴上的个数和一个线程块的线程数是有限制的：

线程块在xyz三轴上分别最多：2^31-1,65535,65535
一个线程块线程总数最多1024，在三维上最多:1024,1024,64

下面是超过了限制的程序：

/**
 运行命令: nvcc  XX.cu
*/
#include<cuda.h>
#include<cuda_runtime.h>
#include<cstdio>

__global__ void hello_from_gpu()
{
    const int bid = blockIdx.x;
    const int tid_x = threadIdx.x;
    const int tid_y = threadIdx.y;
    const int tid_z = threadIdx.z;
    //1D坐标:
    const int bid_1D = blockIdx.x+blockIdx.y*gridDim.x+blockIdx.z*gridDim.x*gridDim.y;
    const int tid_1D = threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z;
    printf("Device(%d)(%d): Hello World from block %d and thread (%d-%d-%d)\n",bid_1D,tid_1D,bid,tid_x,tid_y,tid_z);
}
int main()
{
    const dim3 grid_size(1,65535,1); //大于规定的2^31-1,65535,65535范围虽然能通过编译,但没有输出
    const dim3 block_size(1024,1,1); //大于1024虽然能通过编译,但没有输出
    hello_from_gpu<<<grid_size,block_size>>>();

    cudaDeviceSynchronize();
    return 0;
}

线程束

还有一个没有在代码中体现出的概念：线程束，简单理解为同一线程块中相邻的N个线程，一般按架构为32个一组

简述CUDA的线程结构-以HelloWorld为例