C's Switch / Case construct, Loop Unrolling and Duff's Device

Randall Maas 5/17/2010 10:24:37 AM

A few years ago I was in a fun discussion with one of our junior firmware engineers about the different constructs in C and how these constructs mapped to underlying machine code, and performance characteristics. During the conversation came up that he didn't know the switch, case,break construct works in C, and how it makes Duff's device possible. I'll briefly describe it here.

The Switch, simplified

The basic C switch block looks something like:

switch (ValueExpression)
{
    default:
      do something;
      break;

   case 23: // Fall thru
   case 27:
      do something
      break;
 }

A naive C compiler splits this into two parts. The switch and case statements are turned into an if then else tree. The body - all the bits that do something ' are placed into a second code block after the tree. The two are linked by goto and labels. And, finally, the break's are turned into goto out of the block:

if (ValueExpression == 23)
   goto L_23;
else if (ValueExpression == 27)
   goto L_27;
else
  goto L_default;
goto L_done;


L_default:
   do something;
   goto L_done;
L_23: // Fall thru
L_27:
   do something;
   goto L_done;
L_done:

The if then else tree can be made a lot faster, with better compilers. If the values are densely grouped, a compiler could use the value as an index into a jump table. It could use a binary search of the values, either by careful arrangement of the tree or using a table and a special operation to scan the table. Java (atleast the last VM spec I read) preferred that approach. Or a hash based look up could be use. Of course, these techniques could (and probably should) be used in combination.

In the end, I don't know how many compilers do these. If a compiler optimizes at all, it is more likely to be able to optimize the switch case. Recognizing patterns in an if then else tree are possible, but difficult enough that it seems a lot of compilers put it off.

Other topics

At this point we could discuss idea that language provides similar constructs, but having different characteristics in performance. Is it better and reasonable that its user describe and have compiler identify the best (possibly in terms of memory, IO, and CPU) implementation? Or is better to provide more constructs, and have a designer choose based on their characteristics ' with the tradeoff of providing different ways of saying almost the same thing (degeneracy)?

As a historical side note, I heard that switch was added to provide computed goto behaviour, without computed goto.

Oh, and Duff's Device

For those that aren't familiar Duff's device is a technique that is used to performloop unrolling_. Loop unrolling is used when you have a loop with an inner block that is (usually) so small that the loop's increment, test and branch consume a lot (ie more than say 10%) of the time. The idea is just repeat the inner block a few times, and divide the loop counter. Of course you need to add a few extra cycles of this inner block to make sure you perform its actions the proper amount of time.

That is, it would take a loop like:

for (int I = 0; I < Max; I++)
{
   some action
}

Duff's device puts the loop _inside of the switch statement, using the switch to balance out the number of times the inner block is called:

switch(Max % 4)
{
   for (int I = 0; I < Max/4 - 1; I++)
   {
      case 0:
        some action
      case 3:
        some action
      case 2:
        some action
      case 1:
        some action
   }
}

This would in turn get translated into:

if ((Max%4) == 0)
   goto L_0;
else if ((Max%4) == 1)
   goto L_1;
else if ((Max%4) == 2)
   goto L_2;
else if ((Max%4) == 3)
   goto L_3;
else
   goto L_done;

   for (int I = 0; I < Max/4 - 1; I++)
   {
L_0:
        some action
L_3:
        some action
L_2:
        some action
L_1:
        some action
   }
L_done:

As with the discussion of the switch the compiler's optimizer could automatically insert a Duff's device for loop unrolling - if it knew that it should do this. Indeed many optimizers do. Just something to look at when reading the assembly output, when you're wondering why a small loop became a largish chunk of code.