How do computers represent floating-point numbers?
How do computers represent floating-point numbers? This is an important question when it comes to programming. As software developers, we need to understand such core concepts before solving complex mathematical problems, and doing so helps us become successful in our careers as well.
The most commonly used floating-point standard is the IEEE 754 standard. According to this standard, floating-point numbers are represented with 32 bits (single precision) or 64 bits (double precision). In this section, we will look only at the 32-bit numbers and see how the mathematical operations work on them. If you need further information, you can have a quick look at the article below.
According to the IEEE 754 standard, a floating-point number can be divided into three components (the small Java sketch after the list shows how to extract each of them from a float).
1. Sign bit
2. Exponent
3. Mantissa
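To make these three components concrete, here is a minimal Java sketch (Java, since the follow-up tutorial is about Java's BigDecimal) that extracts each field from a float using the standard Float.floatToIntBits method. The class name FloatFields is just an illustrative choice.

```java
public class FloatFields {
    public static void main(String[] args) {
        float value = 9.1f;                        // the example value used throughout this article
        int bits = Float.floatToIntBits(value);    // raw IEEE 754 single-precision bit pattern

        int sign     = (bits >>> 31) & 0x1;        // 1 bit: 0 = positive, 1 = negative
        int exponent = (bits >>> 23) & 0xFF;       // 8 bits, stored with a bias of 127
        int mantissa = bits & 0x7FFFFF;            // 23 bits, the fraction without the implicit leading 1

        System.out.println("sign     = " + sign);
        System.out.println("exponent = " + exponent + " (unbiased: " + (exponent - 127) + ")");
        // toBinaryString drops leading zeros; the full 23-bit field is 00100011001100110011010
        System.out.println("mantissa = " + Integer.toBinaryString(mantissa));
    }
}
```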
As an example, let's take 9.1 and convert it into the CPU-understandable IEEE 754 format. To do that, we follow these steps.
1. Convert the floating-point number into binary.
2. Convert that binary number into a scientific number.
3. Write that scientific number in the IEEE 754 format.
So let’s go step by step with the example of the 9.1 decimal value.
- Convert the floating-point number into binary.
Then we will convert this binary value into an IEEE 754 format value.
From the above example, we take 9.1 as a single-precision number. First, we convert the integer part 9 into binary, which is 1001. Then we convert the fractional part 0.1 into binary, which is a recurring value: 00011001100110011… So 9.1 in binary is 1001.00011001100110011…
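If you want to reproduce the fraction-to-binary conversion yourself, the usual trick is to repeatedly multiply the fraction by 2 and take the integer part of each product as the next bit. Here is a rough Java sketch of that idea; the class name FractionToBinary is only illustrative.

```java
public class FractionToBinary {
    public static void main(String[] args) {
        double fraction = 0.1;                      // fractional part of 9.1
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < 24; i++) {
            fraction *= 2;                          // shift one binary place to the left
            if (fraction >= 1) {                    // the integer part is the next bit
                bits.append('1');
                fraction -= 1;
            } else {
                bits.append('0');
            }
        }
        // Prints 0.000110011001100110011001 (the 0011 pattern repeats forever)
        System.out.println(bits);
    }
}
```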
Then we write the number in scientific notation as follows.
9.1 in scientific format -> 1.00100011001100110011…. x 2³
The 3 is the exponent, and we need to add the bias of 127 to get the exponent bits: 3 + 127 = 130, which is 10000010 in binary. (The stored 8-bit exponent field ranges from 0 to 255; for normal numbers this corresponds to actual exponents from -126 to +127.) Since 9.1 is a positive value, the sign bit is 0.
So if we write 9.1 in the IEEE 754, CPU-understandable way, keeping only the first 23 mantissa bits as they are, we get the following.
9.1 in IEEE 754 format (before rounding) -> 01000001000100011001100110011001
Now let's look at the mantissa part, 00100011001100110011001… Only 23 bits can be stored in the mantissa, and because the binary value is recurring we have to check the 24th bit. If the 24th bit is 1, we add 1 to the 23rd bit; if it is 0, we leave the mantissa as it is. (This is a simplified description of IEEE 754 rounding to nearest.) In our case the 24th bit is 1, so the last stored bit is rounded up.
So the final IEEE 754 format of the number 9.1 will be,
01000001000100011001100110011010
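You can check this bit pattern yourself with a couple of lines of Java. The sketch below prints the 32 stored bits of 9.1f; the String.format call is just one way to keep the leading zero that Integer.toBinaryString would otherwise drop.

```java
public class NinePointOneBits {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(9.1f);       // bit pattern after IEEE 754 rounding
        String pattern = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        System.out.println(pattern);                  // 01000001000100011001100110011010
    }
}
```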
So, as the above example shows, we get an unexpected value due to the floating-point rounding error in computers. This leads to calculation mismatches in mathematical operations on the CPU, and we will have a look at that next.
Binary to Decimal Conversion
So the IEEE 754 representation of the value 9.1 is 01000001000100011001100110011010. This is a positive number, because the first bit indicates whether the number is positive or negative. The exponent bits hold 130, and after subtracting the bias of 127 the value is 3. Then we convert the mantissa bits into a decimal value, which gives 0.13750005, and we need to add 1 because the leading 1 to the left of the point is not stored in the mantissa. So the value is 1.13750005. Now we calculate the final value.
Sign bit = 0 = Positive number
Exponent = 2³
Mantissa = 1.13750005
We need to multiply the mantissa by 2 raised to the exponent to get the result (the Java sketch after the calculation does the same thing directly from the bits).
8 x 1.13750005 = 9.10000038
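Here is a small Java sketch that performs this reverse conversion directly from the stored bits, so you can watch the slightly-off value appear. The class name BitsToDecimal is just an illustrative choice.

```java
public class BitsToDecimal {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(9.1f);

        int sign     = (bits >>> 31) & 0x1;
        int exponent = ((bits >>> 23) & 0xFF) - 127;   // remove the bias: 130 - 127 = 3
        int fraction = bits & 0x7FFFFF;                // the 23 stored mantissa bits

        // Restore the implicit leading 1 of the scientific notation: 1 + fraction / 2^23
        double mantissa = 1.0 + (double) fraction / (1 << 23);
        double value = (sign == 0 ? 1 : -1) * mantissa * Math.pow(2, exponent);

        System.out.println(mantissa);   // approximately 1.13750005
        System.out.println(value);      // approximately 9.10000038
    }
}
```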
As you can see, after converting the IEEE 754 representation of 9.1 back to a decimal value, we get a slightly different answer. This is also why, when a program repeatedly subtracts a floating-point number, the result may never become exactly zero and can even drift past zero. This is called the floating-point rounding error.
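A quick way to see this effect in Java is to subtract 0.1f from 1.0f ten times. Because 0.1 cannot be stored exactly, the result does not land on zero; it drifts slightly past it.

```java
public class RoundingDrift {
    public static void main(String[] args) {
        float x = 1.0f;
        for (int i = 0; i < 10; i++) {
            x -= 0.1f;                   // each subtraction is rounded to the nearest float
        }
        System.out.println(x);            // a tiny value just below zero, not 0.0
        System.out.println(x == 0.0f);    // false
    }
}
```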
So this is all about how computers represent floating-point numbers, and I hope you understood what I was trying to explain. There should be a way to prevent this from happening, and this is where BigDecimal numbers come in. We will explain more about BigDecimal numbers in Java in the next tutorial. Until then, bye.
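As a small preview of that tutorial, here is a quick sketch of the same repeated subtraction done with BigDecimal, which works with exact decimal digits instead of binary fractions and therefore lands on zero exactly.

```java
import java.math.BigDecimal;

public class ExactSubtraction {
    public static void main(String[] args) {
        BigDecimal x = new BigDecimal("1.0");
        BigDecimal step = new BigDecimal("0.1");
        for (int i = 0; i < 10; i++) {
            x = x.subtract(step);         // exact decimal arithmetic, no binary rounding
        }
        System.out.println(x);             // 0.0
    }
}
```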