Floating-point numbers are widely used for numerical calculations, including digital signal processing. In fact, we recently announced a family of floating-point optimized Tensilica DSPs; see my post Tensilica FloatingPoint DSP Family. Prior to 1985, floating-point implementations on different computer systems might give different answers due to subtle differences in rounding, support for infinity, and other implementation details. In 1985, the IEEE issued standard 754, which defines all these things, including the precise format of floating-point numbers. The standard now supports both binary (most common) and decimal formats, but I'm only going to talk about binary formats. Everything is specified in enough detail that the basic operations give the same result on any conforming floating-point unit.

The advantage of floating-point numbers compared to integers is the much larger dynamic range, that is, the span between the smallest and largest magnitudes that can be represented. There is no free lunch: this range is paid for with reduced precision once a number gets too large for an exact representation in the mantissa. Compared to fixed-point numbers, the big advantage of floating-point representations is that fractions are represented directly and the programmer does not have to keep track of where the "binary point" is all the time. Some of these advantages are similar to the advantages of using "scientific notation", with numbers like 6.02×10^{23} (or 6.02e+23), which would be pretty tedious to write out in full (plus, I don't think we know Avogadro's Number to 23 digits of precision).
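To make the range-versus-precision trade concrete, here is a minimal sketch in Python, whose `float` is a 64-bit double on essentially every platform. The variable names are mine, chosen for illustration.

```python
import sys

# Largest finite double: far beyond anything a 64-bit integer can hold.
largest_double = sys.float_info.max
largest_int64 = 2**63 - 1

# A double has 53 bits of mantissa, so every integer up to 2**53 is exact...
exact_limit = 2**53
# ...but 2**53 + 1 needs 54 bits, so it is rounded to a neighbouring value.
rounded = float(exact_limit + 1)

print(largest_double > largest_int64)   # True: much larger range
print(rounded == float(exact_limit))    # True: the +1 was rounded away
```

The second line of output is the "no free lunch" part: the range is enormous, but above 2**53 not every integer has its own representation.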

### Representation

Floating-point numbers consist of a sign bit, an exponent, and a mantissa. The number of bits allocated to exponent and mantissa varies depending on the precision. There is also a little cheat: for normalized numbers the mantissa always starts with a one-bit, so that bit doesn't need to be stored. Since the sign is explicit, there are both positive and negative floating-point zeros. We don't get that with twos-complement integers, which have a different anomaly: the most negative number cannot be negated, because the result is too large to be represented as a positive number.
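As a rough illustration of the layout, the following Python sketch pulls a double apart into its three fields using the standard `struct` module. The helper name `decompose_double` is hypothetical, and the reconstruction formula only applies to normalized numbers (not zeros, subnormals, infinities, or NaNs).

```python
import math
import struct

def decompose_double(x: float):
    """Split a 64-bit double into (sign, biased exponent, stored fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF   # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)        # 52 stored bits; the leading 1 is implicit
    return sign, biased_exponent, fraction

# -6.5 is -1.625 * 2**2
sign, biased_exponent, fraction = decompose_double(-6.5)

# Rebuild the value, putting the implicit leading one-bit back.
rebuilt = (-1) ** sign * (1 + fraction / 2**52) * 2.0 ** (biased_exponent - 1023)
print(sign, biased_exponent - 1023, rebuilt)   # 1 2 -6.5

# The two zeros compare equal but differ in the sign bit.
print(0.0 == -0.0)                 # True
print(decompose_double(-0.0))      # (1, 0, 0)
print(math.copysign(1.0, -0.0))    # -1.0
```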

- 16-bit (half precision) has 11 bits of mantissa, counting the implicit leading bit, and 5 of exponent. This is also called FP16
- 32-bit (single precision) has 24 bits of mantissa, again counting the implicit bit, and 8 of exponent. GPUs pretty much work using FP32
- 64-bit (double precision) has 53 bits of mantissa, counted the same way, and 11 of exponent
- There are also 128-bit and 256-bit formats
- Recently Google defined bfloat16 (sometimes called "brain float" since it is designed for neural networks) with 8 bits of mantissa, counted the same way, and 8 of exponent. It is basically single precision with a truncated mantissa, since for neural networks the dynamic range is much more important than the number of bits of precision
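The trade-off between FP16 and bfloat16 can be sketched with Python's `struct` module, which supports half precision through the `"e"` format. The helper names are hypothetical, and the bfloat16 conversion here simply chops off the low 16 bits of a single-precision value; real hardware typically rounds rather than truncates, so treat this as an approximation.

```python
import struct

def to_float16(x: float) -> float:
    """Round x to half precision and back (raises OverflowError if too large)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16_truncated(x: float) -> float:
    """Keep only the top 16 bits of the float32 encoding of x (assumed truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def fits_in_float16(x: float) -> bool:
    """True if x is within the finite range of half precision."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

# Precision: FP16 keeps more mantissa bits, so it lands closer to 0.1.
err_fp16 = abs(to_float16(0.1) - 0.1)
err_bf16 = abs(to_bfloat16_truncated(0.1) - 0.1)
print(err_fp16 < err_bf16)            # True

# Range: bfloat16 shares float32's 8-bit exponent, so 1e30 is no problem.
print(fits_in_float16(1e30))          # False
print(to_bfloat16_truncated(1e30))    # finite, within about 1% of 1e30
```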

### Numerical Methods

One of the compulsory courses when I studied computer science was Numerical Methods, which covered things like finding roots with Newton-Raphson or manipulating matrices with techniques like successive over-relaxation. This was all done with floating-point arithmetic. The lecturer was actually Maurice Wilkes, the head of the Computer Laboratory and famous for leading the team that designed EDSAC. I also remember him being challenged that computers couldn't really get any faster due to speed of light considerations (this was an era when the memory might be in a box on the other side of the room from the mainframe CPU). He thought for a moment and said "I think computers are going to get a lot smaller". With microprocessors, of course, they have.

I'm amazed to discover that the textbook we used, Numerical Methods That Work (with "Usually" slipped into the title on the cover), first published in 1970, has been reissued and you can get it on Amazon, although I remember it having a bright yellow cover back in the day.

Floating-point numbers often exhibit unexpected behavior that is a trap for the unwary:

- Some numbers don't have an exact representation in binary floating-point. That includes not just irrational numbers like π but also some especially useful ones like 1/10. Plus, of course, if a number gets too large it cannot be represented exactly even if it is an integer.
- You can't really compare floating-point numbers for equality. If you add 0.1 to 0.2 in double precision you won't get 0.3, you'll get 0.30000000000000004. Instead, you need to pick a tolerance (an epsilon), ideally scaled to the magnitude of the values, and test whether the two numbers differ by less than that.
- Floating-point arithmetic is not associative, so the order in which operations are done matters: (a+b)+c might not equal a+(b+c). This is especially important when some of the numbers are very big and some are very small. As a general rule, due to the limited precision and the way the exponents work, if you add a very small number to a very large number, it might be the same as adding zero to the large number, or at least lose a lot of the small number's precision. And this is the same even if you add the small number to the large number a million times. Instead, you should add all the small numbers together first, and then add their sum to the large number.
- Taking the small difference between two large, nearly equal floating-point numbers may lose a lot of your precision, since the leading digits cancel and mostly rounding error is left. A lot of our Numerical Methods course was focused on algorithms that never fell into this trap.
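The traps in the list above are easy to reproduce. Here is a short Python sketch (doubles throughout); the particular values such as `1e16` and `1e-15` are just convenient choices of mine that make the effects visible, and `math.isclose` is one reasonable way to do a tolerance comparison, not the only one.

```python
import math

# Equality: 0.1 and 0.2 are not exact in binary, so neither is their sum.
total = 0.1 + 0.2
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True: compares within a relative tolerance

# Associativity: at 1e16 adjacent doubles are 2 apart, so adding 1.0 is lost.
big, small = 1e16, 1.0
left = (big + small) + small      # each 1.0 is absorbed in turn
right = big + (small + small)     # 2.0 is large enough to register
print(left == right)              # False

# Adding the small number a million times changes nothing...
acc = big
for _ in range(1_000_000):
    acc += small
print(acc == big)                 # True

# ...whereas summing the small numbers first keeps their contribution.
small_first = big + sum([small] * 1_000_000)
print(small_first - big)          # 1000000.0

# Cancellation: subtracting nearly equal numbers leaves mostly rounding error.
diff = (1.0 + 1e-15) - 1.0
relative_error = abs(diff - 1e-15) / 1e-15
print(relative_error > 0.1)       # True: more than 10% off
```

For long sums, Python's `math.fsum` tracks the lost low-order bits and avoids much of this accumulation error.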

And, of course, there's always an XKCD for everything:
