CUDA programming
*******************

.. todo::

    1. Implementation and optimization of matrix multiplication with CUDA mma/wgmma

MMA (Matrix Multiply-Accumulate Instructions)
=================================================

The basic shapes of Tensor Core operations are as follows:

.. figure:: /_static/images/tensor_core_basic_shape.png

old
=================

Occupancy
----------

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on that multiprocessor.

.. math::

    \text{occupancy} = \frac{\text{Actual number of active warps}}{\text{Maximum number of active warps}}

Floating-point numbers
----------------------

BFLOAT16
^^^^^^^^^^

Many bfloat16 implementations do not support subnormal numbers and flush them to zero.

FP8
^^^^^^^^

The 8-bit floating point (FP8) binary interchange format consists of two encodings:

- E4M3: 4-bit exponent and 3-bit mantissa
- E5M2: 5-bit exponent and 2-bit mantissa

E5M2 follows the IEEE 754 conventions for the representation of special values. E4M3 does not: its dynamic range is extended by not representing infinities and by reserving only one mantissa bit pattern for NaNs. Specifically, E4M3 has no infinity, and NaN is the pattern in which every bit other than the sign bit is 1.

.. figure:: /_static/images/fp8.png

CUDA environment
====================

CUDA **11.0** adds support for the NVIDIA Ampere GPU microarchitecture (compute_80 and sm_80), the bf16 data type (``__nv_bfloat16``), and the TF32 compute type (``tf32``).

CUDA **11.1** adds support for GA10x GPUs based on the NVIDIA Ampere GPU architecture (compute capability 8.6), including the GeForce RTX-30 series.

CUDA **11.5**:

* Linking is supported with cubins larger than 2 GB.
* Includes CUB.

CUDA **11.8**:

* Introduces support for both the Hopper and Ada Lovelace GPU families.
* Adds fp8 data types and type conversion functions via the new header ``cuda_fp8.h``.
* Improves the performance of the bfloat16 basic arithmetic header for sm_90 targets.

Whitepaper
===========

A100
-------

.. figure:: /_static/images/A100_flops.png
    :width: 600

The A100 provides 1555 GB/sec of memory bandwidth.
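
FP8 decoding sketch
====================

To make the difference between the E4M3 and E5M2 encodings described in the FP8 section concrete, the following is a minimal host-side sketch that decodes a single FP8 byte into a ``float``. It is not taken from ``cuda_fp8.h``; the function names ``decode_fp8``, ``decode_e4m3`` and ``decode_e5m2`` are hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) are assumed from the usual convention of ``2^(exponent bits - 1) - 1``.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode one FP8 byte laid out as [sign | exponent | mantissa].
// Assumption: bias = 2^(exp_bits - 1) - 1 (7 for E4M3, 15 for E5M2).
// ieee_specials = true  -> E5M2 style: all-ones exponent encodes Inf/NaN.
// ieee_specials = false -> E4M3 style: no Inf; only the pattern with
//                          exponent and mantissa all ones is NaN.
float decode_fp8(std::uint8_t bits, int exp_bits, int man_bits, bool ieee_specials) {
    assert(exp_bits + man_bits == 7);  // 1 sign bit + 7 remaining bits

    const int bias    = (1 << (exp_bits - 1)) - 1;
    const int exp_max = (1 << exp_bits) - 1;
    const int man_max = (1 << man_bits) - 1;

    const int sign = bits >> 7;
    const int exp  = (bits >> man_bits) & exp_max;
    const int man  = bits & man_max;
    const float s  = sign ? -1.0f : 1.0f;

    if (exp == exp_max) {
        if (ieee_specials) {
            if (man == 0) return s * std::numeric_limits<float>::infinity();
            return std::numeric_limits<float>::quiet_NaN();
        }
        if (man == man_max) return std::numeric_limits<float>::quiet_NaN();
        // Otherwise E4M3 treats this exponent as an ordinary normal number.
    }

    if (exp == 0) {
        // Zero or subnormal: 0.mantissa * 2^(1 - bias)
        return s * std::ldexp(static_cast<float>(man), 1 - bias - man_bits);
    }
    // Normal: 1.mantissa * 2^(exp - bias)
    return s * std::ldexp(static_cast<float>(man + (1 << man_bits)),
                          exp - bias - man_bits);
}

float decode_e4m3(std::uint8_t bits) { return decode_fp8(bits, 4, 3, false); }
float decode_e5m2(std::uint8_t bits) { return decode_fp8(bits, 5, 2, true); }
```

Under these assumptions, ``decode_e4m3(0x7E)`` yields 448 (the largest finite E4M3 value) while ``0x7F`` is NaN; for E5M2, ``0x7B`` yields 57344, ``0x7C`` is infinity, and ``0x7D`` is NaN. Giving up infinities is what lets E4M3 use the all-ones exponent for normal values.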