2. Floating-Point Numbers

To do

  1. Scale, e.g. E8M0

  2. Type conversion

2.1. Microscaling Formats

| OCP 8-bit Floating Point Format | E4M3 | E5M2 |
|---|---|---|
| Exponent bias | 7 | 15 |
| Infinities | N/A | S 11111 00₂ |
| NaN | S 1111 111₂ | S 11111 {01, 10, 11}₂ |
| Zeros | S 0000 000₂ | S 00000 00₂ |
| Max normal | S 1111 110₂ = \(\pm 2^{8} \times 1.75 = \pm 448\) | S 11110 11₂ = \(\pm 2^{15} \times 1.75 = \pm 57{,}344\) |
| Min normal | S 0001 000₂ = \(\pm 2^{-6}\) | S 00001 00₂ = \(\pm 2^{-14}\) |
| Max subnormal | S 0000 111₂ = \(\pm 2^{-6} \times 0.875\) | S 00000 11₂ = \(\pm 2^{-14} \times 0.75\) |
| Min subnormal | S 0000 001₂ = \(\pm 2^{-9}\) | S 00000 01₂ = \(\pm 2^{-16}\) |
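The FP8 table above can be checked with a small decoder. This is a minimal sketch, not a reference implementation: the function name `decode_fp8` and its string `fmt` argument are made up for illustration, and the special-value handling assumes exactly the encodings listed in the table.

```python
import math

def decode_fp8(byte: int, fmt: str) -> float:
    """Decode one FP8 byte ('E4M3' or 'E5M2') into a Python float.

    Hypothetical helper: field widths and biases are taken from the table.
    """
    exp_bits, man_bits, bias = {"E4M3": (4, 3, 7), "E5M2": (5, 2, 15)}[fmt]
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones_exp = (1 << exp_bits) - 1
    all_ones_man = (1 << man_bits) - 1

    if fmt == "E5M2" and exp == all_ones_exp:
        # Mantissa 00 is infinity; 01, 10, 11 are NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == all_ones_exp and man == all_ones_man:
        # E4M3 has no infinities; only S 1111 111 encodes NaN.
        return math.nan
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * (man / (1 << man_bits))
    # Normal number: implicit leading 1.
    return sign * 2.0 ** (exp - bias) * (1 + man / (1 << man_bits))

print(decode_fp8(0b0_1111_110, "E4M3"))   # max normal of E4M3
print(decode_fp8(0b0_11110_11, "E5M2"))   # max normal of E5M2
print(decode_fp8(0b0_0000_001, "E4M3"))   # min subnormal of E4M3
```

Note that in E4M3 the all-ones exponent is still used for normal numbers (except the single NaN pattern), which is why its maximum is 448 rather than 240.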

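| FP6 Format | E2M3 | E3M2 |
|---|---|---|
| Exponent bias | 1 | 3 |
| Infinities | N/A | N/A |
| NaN | N/A | N/A |
| Zeros | S 00 000₂ | S 000 00₂ |
| Max normal | S 11 111₂ = \(\pm 2^{2} \times 1.875 = \pm 7.5\) | S 111 11₂ = \(\pm 2^{4} \times 1.75 = \pm 28\) |
| Min normal | S 01 000₂ = \(\pm 2^{0} = \pm 1\) | S 001 00₂ = \(\pm 2^{-2} = \pm 0.25\) |
| Max subnormal | S 00 111₂ = \(\pm 2^{0} \times 0.875 = \pm 0.875\) | S 000 11₂ = \(\pm 2^{-2} \times 0.75 = \pm 0.1875\) |
| Min subnormal | S 00 001₂ = \(\pm 2^{-3} = \pm 0.125\) | S 000 01₂ = \(\pm 2^{-4} = \pm 0.0625\) |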

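| FP4 Format | E2M1 |
|---|---|
| Exponent bias | 1 |
| Infinities | N/A |
| NaN | N/A |
| Zeros | S 00 0₂ |
| Max normal | S 11 1₂ = \(\pm 2^{2} \times 1.5 = \pm 6\) |
| Min normal | S 01 0₂ = \(\pm 2^{0} \times 1.0 = \pm 1\) |
| Subnormal | S 00 1₂ = \(\pm 2^{0} \times 0.5 = \pm 0.5\) |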
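Because FP6 and FP4 have no Inf or NaN encodings, every bit pattern decodes to a finite value, so the formats are small enough to enumerate in full. The sketch below does that; `FORMATS`, `decode_no_special`, and `nonnegative_values` are illustrative names, with the field widths and biases taken from the tables above.

```python
# (exponent bits, mantissa bits, bias) for the formats without Inf/NaN.
FORMATS = {"E2M3": (2, 3, 1), "E3M2": (3, 2, 3), "E2M1": (2, 1, 1)}

def decode_no_special(code: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa code in which no pattern is reserved."""
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    frac = (code & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exp == 0:
        # Zero or subnormal: exponent fixed at 1 - bias, no implicit 1.
        return sign * 2.0 ** (1 - bias) * frac
    return sign * 2.0 ** (exp - bias) * (1.0 + frac)

def nonnegative_values(name: str) -> list:
    """All values with sign bit 0, in increasing code order."""
    exp_bits, man_bits, bias = FORMATS[name]
    count = 1 << (exp_bits + man_bits)
    return [decode_no_special(c, exp_bits, man_bits, bias) for c in range(count)]

print(nonnegative_values("E2M1"))
# → [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(max(nonnegative_values("E2M3")), max(nonnegative_values("E3M2")))
# → 7.5 28.0
```

The E2M1 list makes the trade-off visible: one subnormal (0.5), then a step of 0.5 up to 2, a step of 1 up to 4, and a step of 2 up to 6.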

2.2. IEEE Floating-Point Numbers

The latest floating-point standard is IEEE 754-2019.

A floating-point number is laid out as follows:

| S (sign) | E (biased exponent) | T (trailing significand field) |
|---|---|---|
| 1 bit | \(w\) bits | \(t\) bits, \(t = p - 1\) |

The fields satisfy the following relations:

\[\begin{aligned} e & = E - bias \\ e_{max} & = bias = 2^{w-1} - 1 \\ e_{min} & = 1 - e_{max} \end{aligned}\]

The biased exponent E is interpreted as follows:

  1. \(E \in [1, 2^w - 2]\): a normal number, with value \((-1)^s \times 2^{E-bias} \times (1+ 2^{1-p} \times T)\)

  2. \(E = 0\): if \(T = 0\), the value is \(\pm 0\); if \(T \neq 0\), it is a subnormal number, with value \((-1)^s \times 2^{e_{min}} \times (0+ 2^{1-p} \times T)\)

  3. \(E = 2^w - 1\) (all ones in binary): if \(T = 0\), the value is \(\pm \infty\); if \(T \neq 0\), it is NaN.
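The three cases translate directly into a decoder parameterized by \(w\) and \(p\). The following is a minimal sketch (the name `decode_ieee` is made up for illustration); it is cross-checked against Python's `struct` module, whose `e` and `f` format codes pack binary16 and binary32.

```python
import math
import struct

def decode_ieee(bits: int, w: int, p: int) -> float:
    """Decode an IEEE 754 binary bit pattern with w exponent bits and precision p."""
    t = p - 1                          # trailing significand width
    bias = (1 << (w - 1)) - 1          # bias = emax = 2^(w-1) - 1
    e_min = 1 - bias
    s = (bits >> (w + t)) & 1
    E = (bits >> t) & ((1 << w) - 1)
    T = bits & ((1 << t) - 1)
    sign = -1.0 if s else 1.0

    if E == (1 << w) - 1:              # case 3: all-ones exponent
        return sign * math.inf if T == 0 else math.nan
    if E == 0:                         # case 2: zero or subnormal
        return sign * 2.0 ** e_min * (T * 2.0 ** (1 - p))
    # case 1: normal number
    return sign * 2.0 ** (E - bias) * (1 + T * 2.0 ** (1 - p))

# Compare with struct for a few binary16 patterns (w = 5, p = 11).
for pattern in (0x0000, 0x0001, 0x03FF, 0x0400, 0x3C00, 0x7BFF, 0xC000):
    expected = struct.unpack("<e", struct.pack("<H", pattern))[0]
    assert decode_ieee(pattern, 5, 11) == expected

# And one binary32 pattern (w = 8, p = 24): 0x40490FDB is pi rounded to single.
expected32 = struct.unpack("<f", struct.pack("<I", 0x40490FDB))[0]
assert decode_ieee(0x40490FDB, 8, 24) == expected32
```

Python floats are binary64, so every binary16 and binary32 value is represented exactly and the comparisons use `==` rather than a tolerance.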

2.2.1. IEEE-Defined 16-, 32-, 64-, and 128-Bit Floating-Point Formats

| Parameter | binary16 | binary32 | binary64 | binary128 |
|---|---|---|---|---|
| Exponent bits | 5 | 8 | 11 | 15 |
| emax / bias | 15 | 127 | 1023 | 16383 |
| Fraction bits | 10 | 23 | 52 | 112 |
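Every row of this table follows from the total width \(k\) and the exponent width \(w\) through the relations \(bias = 2^{w-1} - 1\) and \(t = k - 1 - w\). A short sketch that recomputes it (the helper name `binary_params` is an assumption for illustration):

```python
def binary_params(k: int, w: int) -> dict:
    """Derive IEEE 754 binary format parameters from total width k and exponent width w."""
    t = k - 1 - w                  # fraction bits: total minus sign bit minus exponent bits
    bias = (1 << (w - 1)) - 1      # bias = emax = 2^(w-1) - 1
    return {
        "exponent_bits": w,
        "fraction_bits": t,
        "precision": t + 1,        # p = t + 1 (implicit leading bit)
        "bias": bias,
        "emax": bias,
        "emin": 1 - bias,
    }

for k, w in [(16, 5), (32, 8), (64, 11), (128, 15)]:
    print(f"binary{k}", binary_params(k, w))
```

For example, binary32 gives 23 fraction bits, a bias of 127, and \(e_{min} = -126\).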