Addressing Misconceptions: On computers, countable and floating point numbers are neither associative nor distributive

Randall Maas 5/23/2010 10:59:54 PM

One of the subtlest ways to create bugs in embedded systems is with the math in C. Or, rather, to assume that the math is good enough without considering how the compiler and hardware actually do it.

Arithmetic operators in C are not associative or distributive the way they are on paper. You need to know - and validate with - more information, such as the actual possible range of values in the variables. To wit, the countable numbers (ints, shorts, signed and unsigned, etc.) preserve the least significant digits under arithmetic operations. Worse - and something fewer people understand - is that floating point values (floats and doubles) only preserve the most significant digits.

The oops of ints

In the integer family of types in C (and C-like languages), values can silently overflow, leaving you with a surprisingly small number (or even a very negative one when you expect otherwise). It helps if I give an example.

   int X = (A-B) + (D - C);

is not always the same as:

   int Y = (A + D)  - (B + C);

The sum of A and D could be large enough to overflow the integer (or whatever) type. The same goes for the sum of B and C. But - and the likely reason they were written as two subtractions before the addition - B might shrink A enough, and C might shrink D enough, that nothing overflows. I've seen this a lot in small microcontrollers that are doing control loops.
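
Here is a small sketch of that failure, with made-up values near the top of a 32-bit int's range (the names match the formulas above; everything else is illustrative):

   #include <stdio.h>

   int main(void)
   {
      /* Made-up readings near the top of a 32-bit int's range */
      int A = 2100000000, B = 2000000000;
      int C = 1900000000, D = 1950000000;

      int X = (A - B) + (D - C);   /* every intermediate fits: 150000000 */

      /* A+D and B+C are each about 4 billion, well past INT_MAX (~2.1 billion).
         Signed overflow is undefined behavior in C; on most two's-complement
         parts it silently wraps to a large negative number instead of trapping. */
      int sumAD = A + D;
      int sumBC = B + C;

      printf("X            = %d\n", X);
      printf("A + D as int = %d (true value %lld)\n", sumAD, (long long)A + D);
      printf("B + C as int = %d (true value %lld)\n", sumBC, (long long)B + C);
      return 0;
   }

Whether those wrapped intermediates happen to cancel later is nothing the language promises; signed overflow is undefined behavior, and optimizers or trap-on-overflow builds are free to do something else entirely.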

Actually, there is a subtle but frequent bug here. What usually happens is that A+D and B+C overflow at the same time, so the difference still comes out pretty close. But there are a few cases where they don't. I haven't found a test engineer who can design cases to catch this. Code review occasionally catches it. And nature usually triggers it only after the code deploys ('ships'), and the person likely to work on the bug has no idea which formula is correct for stability.

There isn't a simple solution. Often I just write down a derivation of the equations so that other people can check them. (I do make mistakes, after all.) And we try to use a wider type - more bits - than we think we'll need. And we double check. It helps.
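
A minimal sketch of the wider-type habit, assuming 32-bit inputs and a 64-bit intermediate (the function name and the clamping at the end are just illustrative choices):

   #include <stdint.h>

   /* Do the arithmetic in a wider type, then range-check before narrowing. */
   int32_t combine(int32_t A, int32_t B, int32_t C, int32_t D)
   {
      int64_t wide = ((int64_t)A + D) - ((int64_t)B + C);  /* cannot overflow 64 bits */

      /* Clamp - or flag an error, whatever the system calls for - before narrowing */
      if (wide > INT32_MAX) wide = INT32_MAX;
      if (wide < INT32_MIN) wide = INT32_MIN;
      return (int32_t)wide;
   }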

The oops of floating points

As I mentioned, floating point preserves the most significant digits, dropping the least. That is its major appeal - it prevents the problems you see with the integer family above. (Well, floating point can overflow too, but it's less of an issue.) Let's just assume that you have a DSP, microcontroller, or processor where floating point is practical. There is still a class of bugs waiting to happen. (Fortunately, it is rare if you're just replacing the equations you were using ints for earlier.)

   double X = (A+B) + (D + C);

is not always the same as:

   double Y = (A + D) + (B + C);

When, say, A and B are small numbers and C and D are big ones, here is what happens. A plus D is just D, because the digits of A are insignificant and get dropped. Similarly, the digits of B are insignificant next to C and get dropped. But A plus B does sum up the small digits, enough so that they register when added to D and C, giving a different result.
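
Here is that effect with concrete - admittedly contrived - values, assuming ordinary IEEE-754 doubles. The small values are chosen to sit just below the rounding threshold of the big ones:

   #include <stdio.h>

   int main(void)
   {
      double A = 0.75, B = 0.375;   /* small, exactly representable */
      double C = 5e15, D = 1e16;    /* big */

      double X = (A + B) + (D + C);  /* small parts combine first and survive */
      double Y = (A + D) + (B + C);  /* small parts are rounded away one by one */

      printf("X     = %.1f\n", X);      /* 15000000000000002.0 */
      printf("Y     = %.1f\n", Y);      /* 15000000000000000.0 */
      printf("X - Y = %.1f\n", X - Y);  /* 2.0 */
      return 0;
   }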

The answer, for such simple cases, is to arrange the arithmetic operations from the smallest number to the biggest.
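
For a handful of terms that just means writing the sum in that order. For an array of terms, one sketch (the helper names are made up) is to sort by magnitude and then accumulate:

   #include <stdlib.h>
   #include <math.h>

   /* qsort comparator: ascending by magnitude */
   static int by_magnitude(const void *pa, const void *pb)
   {
      double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
      return (a > b) - (a < b);
   }

   /* Add the smallest-magnitude terms first so they aren't rounded away.
      Note: this sorts the caller's array in place. */
   double sum_small_first(double *v, size_t n)
   {
      double total = 0.0;
      qsort(v, n, sizeof v[0], by_magnitude);
      for (size_t i = 0; i < n; i++)
         total += v[i];
      return total;
   }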

It all seems pretty trivial - until you get into linear algebra, which is very heavily used in signal processing and control systems. There, the pretty matrix operations we learn as sophomores are very unstable. They produce ridiculous numbers when computing, say, an eigenvector. (By ridiculous, I mean that not only do the computed results not work very well, they can contain not-a-number results - singularities and infinities and such.) One way to prevent this is to permute the matrix before performing the operation - much like the sort from smallest to largest above - and then rearrange back to the proper order when done.
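
To give a flavor of what 'permute first' means, here is a toy sketch of Gaussian elimination with partial pivoting - at each step the rows are swapped so that the largest available magnitude becomes the divisor. It is only a sketch of the idea, not a library-quality routine:

   #include <math.h>
   #include <stddef.h>

   /* Solve a*x = b for an n x n row-major matrix 'a' and right-hand side 'b'.
      The pivot step is the 'permute, then put back in order' idea in miniature.
      Returns 0 on success, -1 if the matrix looks singular. Destroys a and b. */
   int solve_pivoted(double *a, double *b, double *x, size_t n)
   {
      for (size_t k = 0; k < n; k++) {
         /* find the row with the largest magnitude in column k */
         size_t p = k;
         for (size_t i = k + 1; i < n; i++)
            if (fabs(a[i*n + k]) > fabs(a[p*n + k])) p = i;
         if (fabs(a[p*n + k]) == 0.0) return -1;

         /* swap that row into the pivot position */
         if (p != k) {
            for (size_t j = 0; j < n; j++) {
               double t = a[k*n + j]; a[k*n + j] = a[p*n + j]; a[p*n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
         }

         /* eliminate column k below the pivot */
         for (size_t i = k + 1; i < n; i++) {
            double m = a[i*n + k] / a[k*n + k];
            for (size_t j = k; j < n; j++) a[i*n + j] -= m * a[k*n + j];
            b[i] -= m * b[k];
         }
      }

      /* back-substitute from the last row up */
      for (size_t k = n; k-- > 0; ) {
         double s = b[k];
         for (size_t j = k + 1; j < n; j++) s -= a[k*n + j] * x[j];
         x[k] = s / a[k*n + k];
      }
      return 0;
   }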

And that is where I have to cop out. Numerical stability for these networks of multiplies and divides, sums and differences, is a specialty. How to permute, and all the other things you need to do, is material for good textbooks - and a reason you should use really, really good libraries.

The End

For a moment, back to microcontrollers where floating point is not a practical option, even if you aren't going to be doing linear algebra. What then? I use rational numbers - scaled integers, really. Basically it's multiplying by 100 to track the pennies in sales figures, and knowing that you are dealing with currency in terms of pennies, not dollars. (And you need to use a wide enough integer type.) This can get pretty ugly, depending on the equations and how many digits you need to track. Or, worse, if you need to increase the number of digits as the equation progresses.

But it is faster than floating point on some machines.
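
A minimal sketch of the scaled-integer idea, using the pennies example (the type name, tax rate, and rounding trick are all just illustrative):

   #include <stdint.h>
   #include <stdio.h>

   /* Keep currency as whole pennies in a wide integer; only convert to
      dollars-and-cents at the edges, such as printing. */
   typedef int64_t cents_t;

   #define CENTS_PER_DOLLAR 100

   int main(void)
   {
      cents_t price = 1999;                            /* $19.99 as 1999 pennies */
      cents_t tax   = (price * 825 + 5000) / 10000;    /* 8.25% tax, rounded */
      cents_t total = price + tax;

      printf("total: $%lld.%02lld\n",
             (long long)(total / CENTS_PER_DOLLAR),
             (long long)(total % CENTS_PER_DOLLAR));
      return 0;
   }

The '+ 5000' is the usual trick for rounding the division instead of truncating it - exactly the kind of bookkeeping that makes these equations get ugly as they grow.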