16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev

Let’s go into bits and bytes

For 50 years, from the time of Kernighan, Ritchie, and their 1st edition of the C Language book, it was known that a single-precision “float” type has a 32-bit size and a double-precision type has 64 bits. There was also an 80-bit “long double” type with extended precision, and all these types covered almost all the needs for floating-point data processing. However, during the last few years, the advent of large neural network models required developers to move into another part of the spectrum and to shrink floating point types as much as possible.

Honestly, I was surprised when I discovered that the 4-bit floating-point format exists. How on Earth can it be possible? The best way to know is to test it on our own. In this article, we will discover the most popular floating point formats, make a simple neural network, and see how it works.

Let’s get started.

A “Standard” 32-bit Floating point

Before going into “extreme” formats, let’s recall a standard one. An IEEE 754 standard for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical number in a 32-float type looks like this:

Here, the first bit is a sign, the next 8 bits represent an exponent, and the last bits represent the mantissa. The final value is calculated using the formula:

This simple helper function allows us to print a floating point value in binary form:

import structdef print_float32(val: float):
""" Print Float32 in a binary form """
m = struct.unpack('I', struct.pack('f', val))[0]
return format(m, 'b').zfill(32)
print_float32(0.15625)
# > 00111110001000000000000000000000

Let’s also make another helper for backward conversion, which will be useful later:

def ieee_754_conversion(sign, exponent_raw, mantissa, exp_len=8, mant_len=23):
""" Convert binary data into the floating point value """
sign_mult = -1 if sign == 1 else 1
exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
mant_mult = 1
for b in range(mant_len - 1, -1, -1):
if mantissa & (2 **…

Source link

What's Hot

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

BLIP3-KALE: An Open-Source Dataset of 218 Million Image-Text Pairs Transforming Image Captioning with Knowledge-Augmented Dense Descriptions

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev | Sep, 2023

A Practical Framework for Data Analysis: 6 Essential Principles | by Pararawendy Indarjo | Nov, 2024

How I Created a Data Science Project Following CRISP-DM Lifecycle | by Gustavo Santos | Nov, 2024

Increase Trust in Your Regression Model The Easy Way | by Jonte Dancker | Nov, 2024

Leave A Reply Cancel Reply

How ML AI Can Help Businesses Reduce Overhead Costs

How the AI Surge May Help Current WFH Employees

The ultimate contact center automation guide

Top 5AI Development Companies To Transform Your Business | by Amyra Sheldon

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

BLIP3-KALE: An Open-Source Dataset of 218 Million Image-Text Pairs Transforming Image Captioning with Knowledge-Augmented Dense Descriptions

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

A Practical Framework for Data Analysis: 6 Essential Principles | by Pararawendy Indarjo | Nov, 2024

Our Picks

No Train, All Gain: Enhancing Deep Frozen Representations with Self-Supervised Gradients

BLIP3-KALE: An Open-Source Dataset of 218 Million Image-Text Pairs Transforming Image Captioning with Knowledge-Augmented Dense Descriptions

Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

What's Hot

16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev | Sep, 2023

Let’s go into bits and bytes

A “Standard” 32-bit Floating point

Related Posts

Leave A Reply Cancel Reply