*****************************
Floating-Point Numbers
*****************************

.. todo::

  1. Scale e.g. E8M0
  2. Type conversion

Microscaling Formats
=========================

.. list-table:: OCP 8-bit Floating Point Format
   :widths: 25 40 40
   :header-rows: 1

   * -
     - E4M3
     - E5M2
   * - Exponent bias
     - 7
     - 15
   * - Infinities
     - N/A
     - S 11111 00\ :sub:`2`
   * - NaN
     - S 1111 111\ :sub:`2`
     - S 11111 {01, 10, 11}\ :sub:`2`
   * - Zeros
     - S 0000 000\ :sub:`2`
     - S 00000 00\ :sub:`2`
   * - Max normal
     - S 1111 110\ :sub:`2` = ± 2\ :sup:`8` × 1.75 = ± 448
     - S 11110 11\ :sub:`2` = ± 2\ :sup:`15` × 1.75 = ± 57,344
   * - Min normal
     - S 0001 000\ :sub:`2` = ± 2\ :sup:`−6`
     - S 00001 00\ :sub:`2` = ± 2\ :sup:`−14`
   * - Max subnormal
     - S 0000 111\ :sub:`2` = ± 2\ :sup:`−6` × 0.875
     - S 00000 11\ :sub:`2` = ± 2\ :sup:`−14` × 0.75
   * - Min subnormal
     - S 0000 001\ :sub:`2` = ± 2\ :sup:`−9`
     - S 00000 01\ :sub:`2` = ± 2\ :sup:`−16`

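The OFP8 encodings in the table above can be checked with a small decoder. The following is a minimal Python sketch, not a reference implementation: ``decode_ofp8`` and ``OFP8_FORMATS`` are hypothetical names introduced here, and the special-value handling simply follows the table rows.

.. code-block:: python

   import math

   # (exponent bits, mantissa bits, bias), as listed in the table.
   OFP8_FORMATS = {"E4M3": (4, 3, 7), "E5M2": (5, 2, 15)}


   def decode_ofp8(byte: int, fmt: str) -> float:
       """Decode one OFP8 byte (0..255) into a Python float."""
       ebits, mbits, bias = OFP8_FORMATS[fmt]
       sign = -1.0 if (byte >> 7) & 1 else 1.0
       exp = (byte >> mbits) & ((1 << ebits) - 1)
       man = byte & ((1 << mbits) - 1)
       all_ones = (1 << ebits) - 1

       if fmt == "E5M2" and exp == all_ones:
           # IEEE-like: mantissa 00 is infinity, any other mantissa is NaN.
           return sign * math.inf if man == 0 else math.nan
       if fmt == "E4M3" and exp == all_ones and man == (1 << mbits) - 1:
           # E4M3 has no infinities; only S 1111 111 encodes NaN.
           return math.nan
       if exp == 0:
           # Zero or subnormal: no implicit leading 1, exponent fixed at 1 - bias.
           return sign * man * 2.0 ** (1 - bias - mbits)
       # Normal number: implicit leading 1.
       return sign * (1 + man / (1 << mbits)) * 2.0 ** (exp - bias)

With this sketch, ``decode_ofp8(0x7E, "E4M3")`` evaluates to ``448.0`` and ``decode_ofp8(0x7B, "E5M2")`` to ``57344.0``, matching the max normal row.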

.. list-table:: FP6 Format
   :widths: 25 40 40
   :header-rows: 1

   * -
     - E2M3
     - E3M2
   * - Exponent bias
     - 1
     - 3
   * - Infinities
     - N/A
     - N/A
   * - NaN
     - N/A
     - N/A
   * - Zeros
     - S 00 000\ :sub:`2`
     - S 000 00\ :sub:`2`
   * - Max normal
     - S 11 111\ :sub:`2` = ± 2\ :sup:`2` × 1.875 = ± 7.5
     - S 111 11\ :sub:`2` = ± 2\ :sup:`4` × 1.75 = ± 28
   * - Min normal
     - S 01 000\ :sub:`2` = ± 2\ :sup:`0` = ± 1
     - S 001 00\ :sub:`2` = ± 2\ :sup:`−2` = ± 0.25
   * - Max subnormal
     - S 00 111\ :sub:`2` = ± 2\ :sup:`0` × 0.875 = ± 0.875
     - S 000 11\ :sub:`2` = ± 2\ :sup:`−2` × 0.75 = ± 0.1875
   * - Min subnormal
     - S 00 001\ :sub:`2` = ± 2\ :sup:`−3` = ± 0.125
     - S 000 01\ :sub:`2` = ± 2\ :sup:`−4` = ± 0.0625


.. list-table:: FP4 Format
   :widths: 25 40
   :header-rows: 1

   * -
     - E2M1
   * - Exponent bias
     - 1
   * - Infinities
     - N/A
   * - NaN
     - N/A
   * - Zeros
     - S 00 0\ :sub:`2`
   * - Max normal
     - S 11 1\ :sub:`2` = ± 2\ :sup:`2` × 1.5 = ± 6
   * - Min normal
     - S 01 0\ :sub:`2` = ± 2\ :sup:`0` × 1.0 = ± 1
   * - Subnormal
     - S 00 1\ :sub:`2` = ± 2\ :sup:`0` × 0.5 = ± 0.5

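Because FP6 and FP4 have no infinity or NaN encodings, every bit pattern decodes to a finite value. The sketch below illustrates this under the parameters from the tables above; ``decode_mx`` is a hypothetical helper name, not part of any library.

.. code-block:: python

   def decode_mx(code: int, ebits: int, mbits: int, bias: int) -> float:
       """Decode an FP6/FP4 code, assuming no Inf or NaN encodings exist."""
       sign = -1.0 if (code >> (ebits + mbits)) & 1 else 1.0
       exp = (code >> mbits) & ((1 << ebits) - 1)
       man = code & ((1 << mbits) - 1)
       if exp == 0:
           # Zero or subnormal: exponent fixed at 1 - bias, no implicit 1.
           return sign * man * 2.0 ** (1 - bias - mbits)
       # Normal number: implicit leading 1.
       return sign * (1 + man / (1 << mbits)) * 2.0 ** (exp - bias)


   # All non-negative E2M1 values (sign bit clear): codes 0b0000 .. 0b0111.
   fp4_values = [decode_mx(c, ebits=2, mbits=1, bias=1) for c in range(8)]
   print(fp4_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

The same helper reproduces the FP6 maxima: ``decode_mx(0b011111, 2, 3, 1)`` gives ``7.5`` for E2M3 and ``decode_mx(0b011111, 3, 2, 3)`` gives ``28.0`` for E3M2.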

IEEE Floating-Point Numbers
==============================

The current floating-point standard is IEEE 754-2019.

A binary floating-point number is encoded in three fields:

+---------+---------------------+--------------------------------+
| S(sign) | E (biased exponent) | T (trailing significand field) |
+---------+---------------------+--------------------------------+
|  1 bit  |      w bits         |  t bits, t = p - 1             |
+---------+---------------------+--------------------------------+

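Here w is the width of the exponent field and p is the precision (the number of significand bits, including the implicit leading bit). The exponent parameters are related as follows: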

.. math::

  \begin{aligned}
       e & = E - bias \\
    e_{max} & = bias = 2^{w-1} - 1 \\
    e_{min} & = 1 - e_{max}
  \end{aligned}


The biased exponent E is interpreted as follows:

1. :math:`1 \le E \le 2^w - 2`: a normal number with value
   :math:`(-1)^S \times 2^{E-bias} \times (1 + 2^{1-p} \times T)`.
2. :math:`E = 0`: :math:`\pm 0` when :math:`T = 0`; a subnormal number when
   :math:`T \ne 0`, with value
   :math:`(-1)^S \times 2^{e_{min}} \times (0 + 2^{1-p} \times T)`.
3. :math:`E = 2^w - 1` (all ones in binary): :math:`\pm \infty` when
   :math:`T = 0`; NaN when :math:`T \ne 0`.

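The three cases above translate directly into a decoder. The sketch below is illustrative only: ``decode_binary`` is a hypothetical helper, and because it returns a Python ``float`` it is exact only for formats no wider than binary64. It is cross-checked against the ``e`` (binary16) format of the standard-library ``struct`` module.

.. code-block:: python

   import math
   import struct


   def decode_binary(bits: int, w: int, p: int) -> float:
       """Decode a binary interchange pattern with w exponent bits and precision p."""
       t = p - 1
       bias = (1 << (w - 1)) - 1
       S = (bits >> (w + t)) & 1
       E = (bits >> t) & ((1 << w) - 1)
       T = bits & ((1 << t) - 1)
       sign = -1.0 if S else 1.0

       if E == (1 << w) - 1:
           # All-ones exponent: infinity when T == 0, NaN otherwise.
           return sign * math.inf if T == 0 else math.nan
       if E == 0:
           # Zero or subnormal: 2^emin * (0 + 2^(1-p) * T).
           emin = 1 - bias
           return sign * math.ldexp(T, emin + 1 - p)
       # Normal: 2^(E-bias) * (1 + 2^(1-p) * T) = (2^t + T) * 2^(E-bias+1-p).
       return sign * math.ldexp((1 << t) + T, E - bias + 1 - p)


   # Compare a few binary16 patterns with the struct module's decoding.
   for bits in (0x3C00, 0x0001, 0x7BFF, 0xFC00):
       reference = struct.unpack("<e", struct.pack("<H", bits))[0]
       assert decode_binary(bits, w=5, p=11) == reference

Using ``math.ldexp`` keeps the scaling by a power of two exact, which avoids rounding in an intermediate ``2.0 ** n`` factor.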

IEEE 16-, 32-, 64-, and 128-bit Floating-Point Formats
--------------------------------------------------------


+-------------------+----------+----------+----------+-----------+
| Parameter         | binary16 | binary32 | binary64 | binary128 |
+===================+==========+==========+==========+===========+
| Exponent bits (w) |    5     |     8    |    11    |    15     |
+-------------------+----------+----------+----------+-----------+
| emax / bias       |   15     |    127   |   1023   |   16383   |
+-------------------+----------+----------+----------+-----------+
| Fraction bits (t) |   10     |    23    |    52    |    112    |
+-------------------+----------+----------+----------+-----------+
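
Every row of this table follows from the total width and the exponent width alone. As a quick check, the sketch below derives the remaining parameters; ``binary_params`` is a hypothetical helper, and it assumes the layout of one sign bit, w exponent bits, and t fraction bits described earlier.

.. code-block:: python

   def binary_params(k: int, w: int) -> dict:
       """Derive bias, emax, emin, t, and p for a k-bit format with w exponent bits."""
       bias = 2 ** (w - 1) - 1
       t = k - 1 - w  # 1 sign bit + w exponent bits + t fraction bits = k
       return {"w": w, "bias": bias, "emax": bias, "emin": 1 - bias, "t": t, "p": t + 1}


   for k, w in [(16, 5), (32, 8), (64, 11), (128, 15)]:
       print(f"binary{k}", binary_params(k, w))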