You can find the Chinese version at從IEEE 754 標準來看為什麼浮點誤差是無法避免的

### Introduction

When people first learn to code, they often encounter floating-point errors. If you haven't experienced this yet, you're very lucky!

For example, in Python, 0.1 + 0.2 does not equal 0.3, and 8.7 / 10 does not equal 0.87 but 0.869999… It's really strange.

However, this is not a bug or a flaw in Python. It's the result of how floating-point numbers are calculated, and this happens in other languages like Node.js too.

**How Computers Store Integers**

Before discussing floating-point errors, let's look at how computers represent integers using binary (0s and 1s). For instance, `101`

in binary represents 2² + 2⁰ = 5, and `1010`

represents 2³ + 2¹ = 10.

A 32-bit unsigned integer can hold 32 bits, so the smallest value is `0000...0000`

(0), and the largest is `1111...1111`

, which is 2³¹ + 2³⁰ + … + 2¹ + 2⁰ = 4294967295.

Each bit can be either 0 or 1, giving 2³² possible values. This means any value between 0 and 2³²-1 can be represented without error.

**How about Floating-Point Numbers**

While there are many integers between 0 and 2³²-1, their number is limited to 2³². Floating-point numbers are different. In the range of 1 to 10, there are only ten integers but an infinite number of floating-point numbers like 5.1, 5.11, 5.111, etc.

Since a 32-bit space has only 2³² possibilities, fitting all floating-point numbers into this space requires a standard representation, which is where IEEE 754 comes in.

**IEEE 754 Standard**

IEEE 754 defines how to represent single (32-bit), double (64-bit), and special values (infinity, NaN) for floating-point numbers.

### Normalization

For example, to convert `8.5`

to IEEE 754 format, you normalize it by breaking it into `8 + 0.5`

, which is `2³ + 1/2¹`

. In binary, this is `1000.1`

, or `1.0001 x 2³`

, similar to scientific notation in base 10.

**Single Precision Floating-Point**

In IEEE 754, a 32-bit floating-point number has three parts: sign, exponent, and fraction, adding up to 32 bits.

**Sign**: The leftmost bit indicates the sign (0 for positive, 1 for negative).**Exponent**: The next 8 bits represent the exponent in a biased format (value + 127).**Fraction**: The last 23 bits represent the fractional part (after the decimal point).

For example, 8.5 in 32-bit format is represented as:

**When is it Inaccurate?**

The example of 8.5 can be precisely represented as 2³ + 1/2¹ because both 8 and 0.5 are powers of 2, without precision loss.

However, for 8.9, which cannot be exactly represented as a sum of powers of 2, it must be approximated, resulting in small errors. You can explore these errors using the IEEE-754 Floating Point Converter.

**Double Precision Floating-Point**

To reduce errors, IEEE 754 also defines a 64-bit representation, doubling the size of the fraction part from 23 to 52 bits, thereby increasing precision.

For instance, while 8.9 still can't be perfectly represented, the error is much smaller in 64-bit format compared to 32-bit.

In Python, 1.0 and 0.999...999 are considered equal, as are 123 and 122.999...999, because their difference is too small to be represented in the fraction part.

**Solutions**

Since floating-point errors are unavoidable, here are two common ways to handle them:

**Set a Maximum Allowable Error (epsilon)**

Some languages provide an epsilon value to determine if a result is within an acceptable error range. In Python, epsilon is about 2.2e-16.

You can rewrite `0.1 + 0.2 == 0.3`

as `abs(0.1 + 0.2 - 0.3) <= epsilon`

to avoid errors in calculations.

### Use Decimal Calculations

To avoid conversion errors between decimal and binary, use decimal calculations directly. Python's `decimal`

module can handle this, performing exact decimal arithmetic.

While using decimal arithmetic eliminates floating-point errors, it is slower because it simulates decimal operations on the CPU, which natively uses binary.

## Conclusion

Floating-point errors are unavoidable, but understanding IEEE 754 can help you manage them. Knowing the format's details can be interesting and useful, allowing you to explain why these errors occur confidently.