Data Format Fundamentals — Single Precision (FP32) vs Half Precision (FP16)
Now, let’s take a closer look at FP32 and FP16 formats. The FP32 and FP16 are IEEE formats that represent floating numbers using 32-bit binary storage and 16-bit binary storage. Both formats comprise three parts: a) a sign bit, b) exponent bits, and c) mantissa bits. The FP32 and FP16 differ in the number of bits allocated to exponent and mantissa, which result in different value ranges and precisions.
How do you convert FP16 and FP32 to real values? According to IEEE-754 standards, the decimal value for FP32 = (-1)^(sign) × 2^(decimal exponent —127 ) × (implicit leading 1 + decimal mantissa), where 127 is the biased exponent value. For FP16, the formula becomes (-1)^(sign) × 2^(decimal exponent — 15) × (implicit leading 1 + decimal mantissa), where 15 is the corresponding biased exponent value. See further details of the biased exponent value here.
In this sense, the value range for FP32 is approximately [-2¹²⁷, 2¹²⁷] ~[-1.7*1e38, 1.7*1e38], and the value range for FP16 is approximately [-2¹⁵, 2¹⁵]=[-32768, 32768]. Note that the decimal exponent for FP32 is between 0 and 255, and we’re excluding the largest value 0xFF as it represents NAN. That’s why the largest decimal exponent is 254–127 = 127. A similar rule applies to FP16.
For the precision, note that both the exponent and mantissa contributes to the precision limits (which is also called denormalization, see detailed discussion here), so FP32 can represent precision up to 2^(-23)*2^(-126)=2^(-149), and FP16 can represent precision up to 2^(10)*2^(-14)=2^(-24).
The difference between FP32 and FP16 representations brings the key concerns of mixed precision training, as different layers/operations of deep learning models are either insensitive or sensitive to value ranges and precision and need to be addressed separately.