3. cuda programming

待处理

cuda mma/wgmma 矩阵乘的实现和优化

3.1. MMA (Matrix Multiply-Accumulate Instructions)

Tensor core 的操作的基础形状如下：

3.2. old

3.2.1. Occupancy

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.

\[\text{occupancy} = \frac{\text{Actual number of active warps}}{\text{Maximum number of active warps}}\]

3.2.2. floating point number

3.2.2.1. BFLOAT16

bfloat16 does not support subnormal number.

3.2.2.2. FP8

The 8-bit floating point (FP8) binary interchange format consisting of two encodings:

E4M3 : 4-bit exponent and 3-bit mantissa
E5M2 : 5-bit exponent and 2-bit mantissa

E5M2 follows IEEE 754 conventions for representatio of special values. But E4M3’s dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.

E5M2格式遵守IEEE-754规范，但E4M3格式不遵守IEEE-754规范。E4M3没有无穷大且除符号位外全为1表示NaN。

3.3. cuda environment

CUDA 11.0 adds support for the NVIDIA Ampere GPU microarchitecture (compute_80 and sm_80), and supports bf16 data type (__nv_bfloat16) and compute type TF32 (tf32).

CUDA 11.1 adds support for NVIDIA Ampere GPU architecture based GA10x GPUs GPUs (compute capability 8.6), including the GeForce RTX-30 series.

CUDA 11.5:

linking is supported with cubins larger than 2 GB.
include cub.

CUDA 11.8:

This release introduces support for both the Hopper and Ada Lovelace GPU families.
Added fp8 data types and type conversion functions support via the new header cuda_fp8.h.
Performance improvements in bfloat16 basic arithmetic header for sm_90 targets.

3.4. whitepaper

3.4.1. A100

1555 GB/sec of memory bandwidth