How Are Threads Partitioned into Warps?

jielahou, about 2 min read

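When optimizing CUDA programs, it is important to analyze them from the perspective of warps. However, before I dug into this question, I only knew that a warp contains 32 threads. I did not know which 32 threads end up in the same warp when blockDim is two-dimensional, so I was never quite sure of myself when writing code. While recently reading the CUDA C Programming Guide, I came across the answer, which I record here.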

Start with this passage from the SIMT Architecture section (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture):

When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.

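In other words, the way a thread block is partitioned into warps is always the same. **Each warp contains threads with consecutive, increasing thread IDs (the first warp contains thread 0).** The Thread Hierarchy section describes how thread IDs relate to thread indices.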

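So once the thread hierarchy is clear, that is, the relationship between thread index and thread ID, we can tell which threads share a warp with the thread at a given thread index, and then optimize with warps in mind.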

What is the relationship between thread index and thread ID? (See the Thread Hierarchy section of the CUDA C++ Programming Guide (nvidia.com).)

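For a 1-dimensional thread index, the thread ID is the thread index itself, tx
For a 2-dimensional thread index (tx, ty), the thread ID is tx + ty * DX
For a 3-dimensional thread index (tx, ty, tz), the thread ID is tx + ty * DX + tz * DX * DY

Here tx, ty, tz are the components of threadIdx, and DX, DY are the block dimensions blockDim.x and blockDim.y.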
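
The three formulas collapse into one if unused indices are taken as 0 and unused dimensions as 1. The following is a minimal host-side sketch of that mapping, not code from the Programming Guide. The `Dim3` struct and the function name `linear_thread_id` are hypothetical stand-ins chosen for illustration, so that the arithmetic can be checked on the CPU without a GPU.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's uint3/dim3 so the sketch runs on the host.
struct Dim3 {
    unsigned x, y, z;
};

// Thread ID of the thread with index (tx, ty, tz) in a block of size (DX, DY, DZ):
//   tx + ty * DX + tz * DX * DY
// For 1D or 2D blocks, pass 0 for the unused indices and 1 for the unused dimensions.
inline unsigned linear_thread_id(Dim3 idx, Dim3 dim) {
    // The index must lie inside the block.
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

Inside a kernel the same value would be computed from the built-in variables as `threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y`. For instance, thread (3, 2) in a 16 x 32 block has thread ID 3 + 2 * 16 = 35.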

With that, back to the question of this post: how are threads partitioned into warps?

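For a 1-dimensional thread index, threads are split directly into groups of 32 (e.g. 0~31, 32~63, 64~95, ...)
For a 2-dimensional thread index, threads are ordered along x first, then along y (e.g. with a thread block of size [dx]16*[dy]32, threads (0,0),(1,0)...(14,0),(15,0),(0,1),(1,1)...(14,1),(15,1) belong to the same warp)
For a 3-dimensional thread index, threads are ordered along x first, then along y, and finally along z (example omitted)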
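
To make the grouping concrete, here is a small host-side sketch that lists the thread indices belonging to a given warp. It is only an illustration under stated assumptions: `Dim3`, `warp_of`, and `threads_in_warp` are hypothetical names, the warp size is hard-coded to 32, and the code models the partitioning rule on the CPU rather than querying a real device.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's uint3/dim3 so the sketch runs on the host.
struct Dim3 {
    unsigned x, y, z;
};

constexpr unsigned kWarpSize = 32;  // assumed warp size

// Warp index (within the block) of the thread at `idx` in a block of size `dim`.
inline unsigned warp_of(Dim3 idx, Dim3 dim) {
    unsigned tid = idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
    return tid / kWarpSize;
}

// Thread indices of all threads in warp `warp`, in increasing thread ID order.
inline std::vector<Dim3> threads_in_warp(unsigned warp, Dim3 dim) {
    // The requested warp must exist in a block of this size.
    assert(warp * kWarpSize < dim.x * dim.y * dim.z);
    std::vector<Dim3> out;
    // x is the innermost loop because x varies fastest in the thread ID.
    for (unsigned z = 0; z < dim.z; ++z)
        for (unsigned y = 0; y < dim.y; ++y)
            for (unsigned x = 0; x < dim.x; ++x)
                if (warp_of({x, y, z}, dim) == warp)
                    out.push_back({x, y, z});
    return out;
}
```

Calling `threads_in_warp(0, {16, 32, 1})` yields the 32 threads from (0,0) through (15,1), matching the 2-dimensional example. As a 3-dimensional illustration, in a 4 x 4 x 4 block thread (3,3,1) has thread ID 31 and falls in warp 0, while thread (0,0,2) has thread ID 32 and starts warp 1. If the block size is not a multiple of 32, the last warp simply contains fewer active threads.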