2. Floating-Point Numbers

To do

Scale formats, e.g. E8M0
Type conversion

2.1. Microscaling Formats

OCP 8-bit Floating Point Format
	E4M3	E5M2
Exponent bias	7	15
Infinities	N/A	S 11111 00₂
NaN	S 1111 111₂	S 11111 {01, 10, 11}₂
Zeros	S 0000 000₂	S 00000 00₂
Max normal	S 1111 110₂ = ± 2⁸ × 1.75 = ± 448	S 11110 11₂ = ± 2¹⁵ × 1.75 = ± 57,344
Min normal	S 0001 000₂ = ± 2⁻⁶	S 00001 00₂ = ± 2⁻¹⁴
Max subnormal	S 0000 111₂ = ± 2⁻⁶ × 0.875	S 00000 11₂ = ± 2⁻¹⁴ × 0.75
Min subnormal	S 0000 001₂ = ± 2⁻⁹	S 00000 01₂ = ± 2⁻¹⁶
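
As a cross-check of the table above, here is a minimal Python sketch that decodes a single FP8 byte. The function name `decode_fp8` and its `fmt` parameter are illustrative assumptions, not part of any library. It assumes the bit layout sign, exponent, mantissa from most to least significant bit, as in the table.

```python
import math


def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one FP8 byte (hypothetical helper; layout taken from the table above)."""
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1

    if fmt == "E4M3":
        # E4M3 has no infinities; only S 1111 111 encodes NaN.
        if exp == all_ones and man == (1 << man_bits) - 1:
            return math.nan
    elif exp == all_ones:
        # E5M2: mantissa 00 is infinity, mantissa 01/10/11 is NaN.
        return sign * math.inf if man == 0 else math.nan

    frac = man / (1 << man_bits)
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * frac
    # Normal number: implicit leading 1.
    return sign * 2.0 ** (exp - bias) * (1.0 + frac)


print(decode_fp8(0b0_1111_110, "E4M3"))   # 448.0
print(decode_fp8(0b0_11110_11, "E5M2"))   # 57344.0
```

Note that in E4M3 the exponent field 1111 with a mantissa other than 111 is an ordinary normal number, which is why its maximum is 448 rather than 240.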

FP6 Format
	E2M3	E3M2
Exponent bias	1	3
Infinities	N/A	N/A
NaN	N/A	N/A
Zeros	S 00 000₂	S 000 00₂
Max normal	S 11 111₂ = ± 2² × 1.875 = ± 7.5	S 111 11₂ = ± 2⁴ × 1.75 = ± 28
Min normal	S 01 000₂ = ± 2⁰ = ± 1	S 001 00₂ = ± 2⁻² = ± 0.25
Max subnormal	S 00 111₂ = ± 2⁰ × 0.875 = ± 0.875	S 000 11₂ = ± 2⁻² × 0.75 = ± 0.1875
Min subnormal	S 00 001₂ = ± 2⁻³ = ± 0.125	S 000 01₂ = ± 2⁻⁴ = ± 0.0625

FP4 Format
	E2M1
Exponent bias	1
Infinities	N/A
NaN	N/A
Zeros	S 00 0₂
Max normal	S 11 1₂ = ± 2² × 1.5 = ± 6
Min normal	S 01 0₂ = ± 2⁰ × 1.0 = ± 1
Subnormal	S 00 1₂ = ± 2⁰ × 0.5 = ± 0.5
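
The FP6 and FP4 formats above have no infinity or NaN encodings, so one small decoder covers all three. The sketch below is an illustration only. The helper name `decode_minifloat` is an assumption, and codes are assumed to be packed as sign, exponent, mantissa from the most significant bit down.

```python
def decode_minifloat(code: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a small float code with no Inf/NaN encodings (hypothetical helper)."""
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    frac = (code & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exp == 0:
        # Zero or subnormal: exponent fixed at 1 - bias, no implicit leading 1.
        return sign * 2.0 ** (1 - bias) * frac
    return sign * 2.0 ** (exp - bias) * (1.0 + frac)


# All non-negative FP4 (E2M1) values, in code order.
fp4_values = [decode_minifloat(c, exp_bits=2, man_bits=1, bias=1) for c in range(8)]
print(fp4_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Largest magnitudes of the two FP6 formats.
print(max(decode_minifloat(c, 2, 3, 1) for c in range(32)))  # 7.5
print(max(decode_minifloat(c, 3, 2, 3) for c in range(32)))  # 28.0
```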

2.2. IEEE Floating-Point Numbers

The current floating-point standard is IEEE 754-2019.

A binary floating-point format has the following layout:

S (sign)	E (biased exponent)	T (trailing significand field)
1 bit	w bits	t bits, t = p − 1

The fields satisfy the following relations:

\[\begin{aligned} e & = E - bias \\ e_{max} & = bias = 2^{w-1} - 1 \\ e_{min} & = 1 - e_{max} \end{aligned}\]

The biased exponent E is interpreted as follows:

E in [1, \(2^w - 2\)]: a normal number, with value \((-1)^S \times 2^{E-bias} \times (1 + 2^{1-p} \times T)\).
E = 0: \(\pm 0\) when T = 0; a subnormal number when T ≠ 0, with value \((-1)^S \times 2^{e_{min}} \times (0 + 2^{1-p} \times T)\).
E = \(2^w - 1\) (all exponent bits set): \(\pm \infty\) when T = 0; NaN when T ≠ 0.
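
The three cases above can be sketched as a small decoder parameterised by w and p. The function name `decode_ieee` is an assumption made for illustration. The result is a Python float (binary64), so the sketch is only exact for formats no wider than binary64. The standard `struct` module is used to cross-check against the interpreter's own binary16 and binary32 decoding.

```python
import math
import struct


def decode_ieee(bits: int, w: int, p: int) -> float:
    """Decode an IEEE 754 binary interchange pattern (hypothetical helper).

    w is the exponent field width and p the precision, so t = p - 1.
    """
    t = p - 1
    bias = (1 << (w - 1)) - 1
    emin = 1 - bias
    sign = -1.0 if (bits >> (w + t)) & 1 else 1.0
    E = (bits >> t) & ((1 << w) - 1)
    T = bits & ((1 << t) - 1)

    if E == (1 << w) - 1:
        # All-ones exponent: infinity when T == 0, otherwise NaN.
        return sign * math.inf if T == 0 else math.nan
    if E == 0:
        # Zero or subnormal: 2^emin * (0 + 2^(1-p) * T).
        return sign * math.ldexp(T, emin + 1 - p)
    # Normal: 2^(E-bias) * (1 + 2^(1-p) * T), rewritten with an integer significand.
    return sign * math.ldexp((1 << t) + T, E - bias + 1 - p)


# Cross-check one binary16 and one binary32 pattern against struct.
half = struct.unpack("<e", (0x3C00).to_bytes(2, "little"))[0]
single = struct.unpack("<f", (0x40490FDB).to_bytes(4, "little"))[0]
print(decode_ieee(0x3C00, w=5, p=11) == half)         # True
print(decode_ieee(0x40490FDB, w=8, p=24) == single)   # True
```

Using `math.ldexp` with an integer significand avoids any rounding in intermediate steps, since every quantity involved is exactly representable in binary64 for these formats.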

2.2.1. IEEE 16-, 32-, 64-, and 128-bit floating-point formats

Parameter	binary16	binary32	binary64	binary128
Exponent bits (w)	5	8	11	15
emax = bias	15	127	1023	16383
Trailing significand bits (t)	10	23	52	112
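
Every other parameter of these formats follows from w and t. The sketch below derives them using exact rational arithmetic, so that binary128 does not overflow a Python float. The names `FORMATS` and `derived_parameters` are assumptions made for illustration.

```python
from fractions import Fraction

# (w, t) pairs copied from the table above.
FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}


def derived_parameters(w: int, t: int) -> dict:
    """Derive format parameters from the field widths (hypothetical helper)."""
    emax = (1 << (w - 1)) - 1      # emax = bias = 2^(w-1) - 1
    emin = 1 - emax
    return {
        "total_bits": 1 + w + t,   # sign + exponent + trailing significand
        "p": t + 1,                # precision includes the implicit leading bit
        "emax": emax,
        "emin": emin,
        # Largest finite value: E = 2^w - 2 with T all ones.
        "max_finite": Fraction(2) ** emax * (2 - Fraction(1, 1 << t)),
        # Smallest normal value: E = 1 with T = 0.
        "min_normal": Fraction(2) ** emin,
        # Smallest subnormal value: E = 0 with T = 1.
        "min_subnormal": Fraction(2) ** (emin - t),
    }


for name, (w, t) in FORMATS.items():
    d = derived_parameters(w, t)
    print(name, d["total_bits"], d["p"], d["emax"], d["emin"])
```

For binary16 this gives a largest finite value of 65504, and the binary64 results can be compared with `sys.float_info.max` and `sys.float_info.min`.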