Addressing Misconceptions: Loop Termination

Randall Maas 5/13/2010 12:58:52 PM

One of the first things someone new to a language like C is that it integer variables wrap around. For example the following is an infinite loop:

	for(char I = 0; I < 256; I++)
	 do something;

(Worse, in C char might be signed, and you could replace 256 with 128 for the same effect in those cases). I was bitten by this, and I don't know of any C programmer that hasn't been. (There is even a Microsoft Video that purported to prove loop termination, but failed with the above loop). There are lots of similar things I've seen people get bitten by, and have been bitten by myself. That is what this entry is about.

This could be a wonderful launching point about why this problem exists in other languages like Java and C#, yet is more of a 'rite of passage' in C. We could discuss silent overflow and underflow, and whether it is possible, reasonable or desirable to have a language that supports making them explicit and trapping them. Another time perhaps.

When this applies to binary search

As I said, termination errors are a fairly common problem that people stumble across. In 2006 a google blog post gained some popularity, because it claimed to discover such a bug, (supposedly one that no one knew about it before). Then it proposed a fix.

The loop in this case is the text book example of binary search. I'll use an edit of the code he posted:

1:   int low = 0; // The starting area to search at 
2:   int high = ..; // The end of the area to search
3:
4:         while (low <= high) {
5:             int mid = (low + high) / 2;
6:             int midVal = a[mid];
7:
8:             if (midVal < key)
9:                 low = mid + 1;
10:             else if (midVal > key)
11:                 high = mid - 1;
12:             else
12:                 return mid; // key found
13:         }
14:         return -(low + 1);  // key not found.

He wanted this code to work with very large data sets, and the variables to be 'int's ' because Java uses those. (Again this problem is not new, croping up for generations now - for example, people commonly used compilers that had 16bit ints, and but the size of the array was 32 bits). He reported the bug as the method you find a mid point in array:

	mid = (low + high)/2;
	get entry at Array[mid]

Given his conditions, that is correct. He identified this as the sum of low plus high as overflowing. Then proposed the following fix, change the code to:

For C: mid = (unsigned)(low + high) >>1;
Java: mid = (low + high) >>> 1;

Second, the number of inputs is limited to 231-1 for 32 bit integers, 215-1 for 16 bit integers, etc.

I ignored the whole blog frenzy since his fix didn't work. The good news is that he has since found that he was wrong and had to fix his fix with:

For C: mid = ( (unsigned) low + (unsigned) high) >>1;

Why that fix doesn't always work

Taking into account the sign is a good thing. But the fix doesn't work if you want to work with the full range of unsigned numbers. That is low, high, and mid are all unsigned (say) 32 bit variables, and you want to sort 232-1 items. Let's look again at a simplified form of the search that takes the worst case traversal with this:

1:   unsigned  low = 0; // The starting area to search at 
2:   unsigned high = ~0u; // The end of the area to search
3:
4:         while (low <= high) {
5:             unsigned mid = ((unsigned) low + (unsigned) high)  >> 1;
9:             low = mid + 1;
13:         }

The loop will not terminate. Whenever 'low' is not 0, the calculation of mid will overflow. This means that low will get closer to high, then jump back toward zero, never converging on it.

What is the fix, then, that works with signed and unsigned integers, of most any bit length? First, the condition must change to:

         while (1) {

This removes the check to see if low is greater than high - there is no way to represent a value as greater than high when high is at the maximum value (e.g., 232-1). (The loop termination will be added later).

Second the computation of mid must change to:

	mid = (low>>1) + (high>>1) + (((low+high)>>1)&1);

That works by precondition the significant value of low and high so that their sum won't overflow, and then computing the minor value to keep the search stable.

Third, the computation of the next low value must change to:

       if (high == mid)
         break;
       low = mid + 1;

This adds the termination condition if the mid point was the high end.

Of course this formulation may still have bugs, or cases where the loop won't terminate.

All of this could lead to interesting discussions of why determining loop termination is very hard, why example code in text books is simplified, why adding error checks and stability makes code ugly, how to ensure that successive approximation converges or terminates, and so on.