Memory models

Randall Maas 7/19/2010 11:36:58 AM

I was recently curious about the issues of memory models. I saw a video lecture in which the instructor mentioned research on the Java memory model, eventually incorporated into Java 5, as the most complete analysis of the issues involved. It fixes a number of issues in earlier Java versions, whose model was not too different from that of C, C++, C#, and so on. As he pointed out, everyone seemed to be getting along fine before the analysis, so did we really miss that much? (Yes.) What were we missing? Did we know about these issues, but choose to discreetly ignore them and never put them all together?

J. Manson, W. Pugh, S. Adve, "The Java Memory Model," in Proceedings of the Symposium on Principles of Programming Languages (POPL), 2005.

The short version is that many of the mechanisms we've been sprinkling into C-like languages address memory-model concerns, but do not correct them. The problems would be there even if no one ever used preemptive threads (whether in the CPU or OS) or multiprogramming concepts, or used anything beyond a "single core" CPU. Interrupts, context switches, traps, signals, asynchronous IO callbacks, locks, exceptions, and setjmp()/longjmp() all have problems with memory-model semantics.

The rest of this note is structured as follows:

  1. The issues involved with memory models
  2. What issues are not (traditionally) covered by memory models
  3. The actors involved
  4. What this means for code
  5. How this relates to performance
  6. The language's definition of memory and storage operations
  7. The memory hierarchy

The Basics and Issues

First, what are the basic issues with a memory model?

No, really, what are the basic issues with a memory model? The consistency problems arise when any of the actors put performance over correctness. Almost all of them do, sometimes without recognizing it. The memory model is a communication of what must be done to correctly carry out what the software specifies, and to do so consistently, especially in subtle cases. The models typically divide themselves along the following lines:

What issues are not (traditionally) covered by memory models? The following variable and memory issues are not discussed in memory model specifications. (I think that is a pity, not least because they are a standard source of bugs.)

Who is involved? What are the actors? Roughly, they are the rules, the software, and the hardware.

What does this mean for source code? What are the issues?

These issues aren't exclusive of each other. In practice, I find that they often overlap.

How does this relate to performance?

The general trend has been for the actors to assume that they may rearrange memory operations for any reason, such as convenience or tuning performance. It is the programmer's role to specify or imply constraints on which rearrangements can't be done.

The general memory model does concern performance, but it is more concerned with correctness and with the options available to allow performance. Although the discussions implicitly assume performance concepts, they devote little space to them. I am going to ignore performance tradeoffs except as a motivation for why 'risky' constructs are (or were) used.

What memory storage is not. Memory is not a direct, zero-cost store, neither for named variables nor for addressable memory (well, byte-indexed memory, since a text name is just another way of addressing it). The language, compiler, and hardware treat memory access as an IO queue, akin to disk IO, with read/write IOs that can be cached, delayed, reordered, optimized, and so forth.

Who creates the inconsistency? Let's look where values are moved between memory subsystems.

First, the software developer must specify the right order of access (and conflict handling) for variables and structures. A design error will leave inconsistency. And if he, or someone later, makes an error in carrying out the design, the errors can lie in the code without being enough of a problem to be identified and fixed.

Second, the compiler may not generate code that works quite as the programmer expects. The compiler may store the results of operations and calls in registers, and write these out later. It may cache values in registers, not only writing them out late, but also missing subsequent changes. The compiler may choose to rearrange computations, for performance or as a fluke of its code generation. It may reload the same address several times, instead of saving a copy elsewhere, and get different results each time. For instance, the compiler may need to use a register for other computation and have to reload the value when it is needed again.
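To make this concrete, here is a minimal sketch (the names and scenario are mine, purely for illustration) of the register-caching case: a flag polled in a loop that the compiler is free to read once and keep in a register.

#include <stdio.h>

/* Hypothetical flag set elsewhere (by an interrupt handler or another
   thread).  Without 'volatile', the compiler may read Done once, keep
   the value in a register, and never see the later change - turning
   the loop below into an infinite loop. */
int Done = 0;            /* the intended fix: volatile int Done = 0; */

void WaitForCompletion(void)
{
   while (!Done)
      ;                  /* the load may be hoisted out of the loop */
   printf("done\n");
}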

Finally, the processor and memory subsystem issue several instructions simultaneously and rearrange IOs (unless there is a dependency they know about). They cache memory regions, populating the cache with lots of reads. They buffer up reads and writes, writing them through as memory transactions occur.

The Language's definition of Memory and Storage

If a language has a good set of rules (properly implemented by compilers), there would be far less to worry about with the compiler, the memory hierarchy, and the other actors. But the C language family has a very weak set of storage rules; these were only tightened up in newer generations of languages like Java and C#.

I'll describe some of the issues in the following sections:

  1. A caution on what undefined means
  2. Variables can have undefined values, after an exception, goto, or jump.
  3. Keywords (and helper procedures) are expert only
  4. Variable and pointer attributes may not always be the proper approach
  5. The root of the problem

A caution on what the term "undefined" means

In the following sections, I will be referring to situations where variables don't have a well-defined value. This means that there is a fundamental problem, and that the software must be designed differently.

No test can show that the code will work correctly; it is not acceptable to just compile the code and see if it works. The software could change behaviour if the code is modified slightly, if the compiler flags are changed, or if a different compiler version (or a different compiler altogether) is used.

The origin of this variation is rooted in how compilers work. Processors have too few registers, forcing the compiler to free registers (by storing intermediate values in memory) and reload them later. There is no single algorithm in use that governs this predictably from a programmer's point of view. The compiler may change its allocation and use of registers, and when values get re-read, based on seemingly arbitrary things, including minor changes in far-off code.

There is another reason that the code may appear to work, but fail under stress. Being "undefined" means that the compiler may not have generated proper memory handling instructions. The situation has an undefined meaning after all, so the compiler may not know what the proper thing to do is.

Of course, a test can show that a compiler or code is wrong. A test can't show that the code is correct, that it won't have memory issues, or even that the compiler warnings are wrong. (Sorry SQLite, you're dangerously wrong on this one.)

Variable values may not be defined with gotos, exceptions, or jumps

Let's start simple and look at cases where it is not defined what the value of variables will be, even though it may appear that they should have a defined value.

This is just a small area of the memory model. The language specification (see the C99 standard, especially section 7.13.2.1) defines these cases, putting the onus on the programmer not to create them. I believe that you could address the branch issue in one of three ways.
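To make the longjmp() case concrete, here is a minimal sketch of what C99 7.13.2.1 warns about; the names and flow are mine, not from the standard.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf Env;

void Worker(void)
{
   longjmp(Env, 1);        /* jump back to the setjmp() below */
}

int main(void)
{
   int Count = 0;          /* note: NOT volatile */

   if (0 == setjmp(Env))
   {
      Count = 1;           /* changed after setjmp() ...     */
      Worker();            /* ... and then we longjmp() back */
   }
   else
   {
      /* Per C99 7.13.2.1, Count is indeterminate here: it is automatic,
         not volatile-qualified, and was changed between the setjmp()
         and the longjmp().  It may print 0 or 1, depending on register
         allocation. */
      printf("%d\n", Count);
   }
   return 0;
}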

Qualifier keywords (and helper procedures) are expert only

Source code is a communication to the compiler as well as to other people. Let's look at the language constructs that signal to the compiler that it is not to rearrange the memory access.

There is a plethora of keywords to guide the compiler. These tell it how not to optimize, and how to prevent the hardware from optimizing. Certain key procedures in the library tell how to access memory, and are typically considered a part of the language. Some come from the broader language spec, some from vendor-specific extensions, and some from the coding style guides typically in play.

const: The variable (or pointer) to a buffer can't be written; no one is allowed to insert a write, but they are allowed to cache.
C volatile: The value must be re-read each time it is referenced by name (it isn't supposed to be cached), and written with little delay after the value is changed; writes can't be coalesced, and neither reads nor writes can be dropped or merged. This is intended for variables shared between multiple threads, interrupts, syscalls and other context switches, and asynchronous IO.
C atomic, Java volatile: The compiler is to preserve the ordering of field accesses. However, this may not be possible to achieve, or may be prohibitively slow, on many platforms. Nor is it clear when to apply this qualifier to an object or struct.
noalias, restrict: Used by the programmer to promise that variables (especially parameters) are not aliased. This affects when stores and loads are made. Without it, the compiler must assume that two pointers may overlap, and generate slow, conservative code to preserve subsequent accesses. (See the sketch after this list.)
synchronized, locked(), @synchronize(): Locks the object, or a specific lock object, during execution of the code.
Interlocked Increment/Decrement, Compare-And-Swap (CAS): Updates a whole chunk of memory; the hardware writes all or none, conditionally. This is often the basis for locks and other synchronization mechanisms.
read / write: The read/write model is standardized in the language, more so recently in Java and C#. (Traditionally it is mapped to an equivalent portion of the OS's storage model, with the language providing an interface to it.)
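To make one of these concrete, here is a minimal C99 sketch of restrict; the function and its names are mine, purely for illustration.

/* Without 'restrict' the compiler must assume Dst and Src may overlap,
   and conservatively reload Src[i] (and complete the store to Dst[i])
   on every iteration.  With 'restrict' the programmer promises they do
   not alias, so values may be kept in registers across the loop. */
void Scale(double* restrict Dst, const double* restrict Src,
           double Factor, int Count)
{
   int i;
   for (i = 0; i < Count; i++)
      Dst[i] = Src[i] * Factor;
}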

As we can see, these are tools for the programmer to tell the compiler how not to optimize. Presumably this is because the alternative, a system where the programmer must say where the compiler may optimize, would be worse. Let's look at an illustrative example.

There used to be a register keyword in C, now deprecated. The idea was that the programmer would analyze the code, either in his head or by profiling, and selectively annotate key variables with the register keyword. This would guide the compiler as to which variables should see less data motion between a working register and other memory. The problem was that there was inadequate profiler support, and no good way to know which variables to use it on and which not. The keyword was abused: why not use it with all variables? With overuse, compilers generated worse code, trying to put so many variables into so few registers that they thrashed the registers in the process. Moreover, compilers improved until they were good enough to decide how to manage the variable-to-register mapping themselves.

There are rules, for the programmer, on using the keywords effectively. Alas, it is not always clear how to use these keywords properly, and how not to use them. Worse, few programmers are likely to be aware of the rules and guidelines.

Variables and pointer attributes may not always be the proper approach

Let's take a moment for me to explain another reason why qualifying attributes are not ideal. As natural as these attributes are, they have a weakness.

Storage is accessed via variables and pointers. These are mapped onto addressable (potentially shared) memory. The user can specify the address or memory region a variable is stored in: sometimes through a memory or linker map, or with a compiler-specific extension on the variable's memory address. Pointers can get their address by aliasing another pointer, by taking a variable's address with the & operator, or by casting an integer.

It is too easy to lose the critical attributes when aliasing a variable or pointer: one pointer may have them and another not, even though they point to the same address. Since the language associates access properties with the variables (or pointers), not the memory address, this can lead to inconsistent access implementation. This includes volatile, const, etc. (See the previous subsection for some of the properties.)

This is especially undesirable when communicating with devices through shared memory. (Although this is primarily a problem in C/C++, the same problem exists when Java and C# try to use hardware.) Nevertheless, historically it has been an acceptable trade for language simplicity.
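A minimal sketch of the lost-qualifier problem; the device address and status bit below are made up for illustration.

/* Hypothetical memory-mapped device status register. */
#define DEVICE_STATUS_ADDR 0x40001000u

volatile unsigned* Status = (volatile unsigned*) DEVICE_STATUS_ADDR;

void Poll(void)
{
   /* The alias below silently drops the volatile qualifier.  Reads
      through Plain may be cached in a register or hoisted out of the
      loop, even though reads through Status would not be. */
   unsigned* Plain = (unsigned*) Status;

   while (0 == (*Plain & 0x1u))   /* may never see the device change */
      ;
}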

There are a few type rules to address aliasing of pointers. For instance, the C99 specification says that two pointers of two different types are not to be treated as aliased. However, this goes against a long-standing tradition in code: traditionally a program will use a generic structure to hold all sorts of data. The program will check a type code and cast the pointer to a more specific, but technically different, type. The C99 specification says these are to be treated as not aliased. (Let's assume that the qualifiers are correct here.)

Although the specifics vary, a typical example of this casting is shown below:

typedef struct
{
   int Type;
} A_t;

typedef struct
{
   int Type;
   double Value;
} B_t;

...
   A_t* Ptr1 = ...;
   if (1 == Ptr1->Type)
     {
        B_t* Ptr2 = (B_t*) Ptr1;
        /* ... work with Ptr2 ... */
        // ... Generic code may be used with Ptr1, but it may have the wrong memory semantics!
     }

In the example, there are two pointers to the same memory structure, but with different types. Changes using one pointer are not guaranteed to be properly (safely) seen in the other!

Memory Hierarchies

Memory hierarchies act like peripherals: special-purpose processors with a queue of stores and fetches. (Implicitly, memory hierarchies may be a network of storage peripherals and buses, with some contending with each other.) Multiple devices may issue the work: processors/cores, caches, and peripheral devices. The operations may be buffered up and reordered for any reason, including improving speed. They may even be optimized away by removing redundant IOs or using cached values.

And the memory peripherals take commands intended to control these helpful reorderings and optimizations.

An analogy to clocks and asynchronous digital circuits

Let me start with an analogy to the electrical design of logic. In digital logic, clocks are used to synchronize the logic modules. Different logic nodes process their inputs (to produce their output signals) at different rates; some are faster than others.

This becomes a problem when the inputs arrive at a logic node at even slightly different times. The node tries to process the change in input, then the input changes again. Sometimes it changes too fast for the node, and the node can get scrambled or just have its output flail around. These bogus results ripple through the logic network, spreading problems everywhere.

So a signal is used to trigger when the inputs to the logic have stabilized and can be used: the clock signal. It doesn't actually know directly that a given logic node is done or ready. Instead, it sends a 'ready' signal at a rate that is slower than the processing time for each of the logic nodes, ensuring that everything has stabilized.

Engineers obsessed with performance have tried to do away with clocks for over half a century. They use asynchronous logic that provides an output signal that says when the results are ready and stabilized.

Memory on microcontrollers and simpler (or older) computers is like the clock model. Memory on faster processors, like those in servers, desktops, and high-end embedded devices is more like asynchronous logic.

How the compiler must do asynchronous synchronization

On many architectures, independent memory operations are asynchronous, and many may be in flight at once. It is important to define the memory model precisely for these architectures; they are the ones that the issues crop up on. It is up to the compiler to insert special instructions that enforce their interdependency and expected behaviour.

For example:

lock
 op1 
 op2
unlock

The compiler isn't supposed to let op1, op2, or the unlock be performed until after the lock step has completed. Alas, while the compiler generates code with those operations after the lock in the instruction sequence, the processor may perform them early anyway, since they don't have any obvious dependency on the lock operations that are waiting to complete.

The compiler is supposed to insert special instructions (akin to those asynchronous 'ready' signals) into the memory system, called memory barriers, fences, and other names. They differ with each processor family in meaning and in what is available.
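As a sketch of what those barrier-carrying instructions look like from C, here is a minimal spinlock using GCC's legacy __sync builtins (an assumption on my part; other toolchains and processors spell these differently):

static volatile int Lock = 0;

void Acquire(void)
{
   /* Atomic test-and-set with acquire semantics: spin until we observe
      0 and write 1.  Later operations (op1, op2) can't be performed
      before the lock is observed to be held. */
   while (__sync_lock_test_and_set(&Lock, 1))
      ;
}

void Release(void)
{
   /* Write 0 with release semantics: earlier stores (op1, op2) become
      visible before the lock appears free to anyone else. */
   __sync_lock_release(&Lock);
}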

The software engineer coordinates

Even with "modern" programming environments, writing software still involves coordinating the hardware elements, especially where it is not practical for the compiler to figure out the right method.

First, the programmer is responsible for setting up the memory hierarchy, if necessary.

Second, the compiler lets the programmer insert assembly instructions to manipulate the memory hierarchy. This lets the programmer create special lock or synchronization structures. And it is useful for cases where the programmer is also responsible for flushing caches.

Many processors have separate instruction and data caches, and pipelines that may need to be cleared when memory is changed. (The memory hierarchy isn't linked well enough to know that a write should update these caches in all cases.) For instance, the instruction cache may need clearing if a dynamic library was loaded, code was dynamically linked, code was patched, or a signal-handling trampoline has been inserted onto the stack.
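For instance, a runtime code patcher might look roughly like this sketch; the GCC builtin used here is an assumption on my part, and other compilers and operating systems provide their own calls.

#include <stddef.h>

void PatchAndRun(unsigned char* CodeBuf, size_t Len)
{
   /* ... write the new machine code into CodeBuf (data cache) ... */

   /* Make the written bytes visible to the instruction fetch path
      before branching into them. */
   __builtin___clear_cache((char*) CodeBuf, (char*) CodeBuf + Len);

   /* now safe to jump into CodeBuf */
}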

A side note on "word tearing"

There is another minor hardware effect I wanted to mention. "Word tearing" happens when a part of memory is being written and read at the same time. Let's say an integer in memory has the value 0x12345678, and one thread is changing the integer to 0xabcdef00. If a second thread reads the memory at the same time, without proper synchronization, it can read a value that no one, anywhere, "wrote". It could read 0x1234ef00 for instance.

What is happening? The bytes of the new value are being written out (little-endian in this case), or distributed to the other core, while the second thread reads them. The reader then sees a mix of old bytes and part of the newly written value.
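A minimal sketch of how this arises in practice; the common real-world case is a 64-bit value on a 32-bit machine, where one store in the source becomes two stores in the hardware (the names are mine):

#include <stdint.h>

/* Shared, unsynchronized 64-bit value.  On a 32-bit machine the store
   in Writer() is two separate 32-bit stores; a concurrent Reader() can
   observe one new half and one old half - a value nobody ever wrote. */
uint64_t Shared = 0x0000000012345678ULL;

void Writer(void)
{
   Shared = 0x00000000abcdef00ULL;   /* may be torn into two writes */
}

uint64_t Reader(void)
{
   return Shared;                    /* may see a mix of old and new */
}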

Like a lot of other memory model bugs, this can drive a programmer insane: the bug makes no sense and is intermittent.

Hardware documentation and specifications

Hardware specifications are complex. They are often ambiguous enough that it isn't possible to know whether what the programmer wants to achieve can actually be done reliably and consistently. Often the hardware's interesting behaviour is documented after the fact. If we're lucky, there are attempts to improve and clearly specify what will be in future processor generations.

Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, Magnus O. Myreen, "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors," Proceedings of the 36th SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2009.

Conclusion

The root of these problems is twofold: the general trend of "optimizing" like crazy, and the market preference for apparent speed over safety. Cheap memory is so slow that special memory processing tweaks stores and fetches, dangerously violating what it means to be storage. The demand for performance is also providing computers with an increasing number of processor cores (especially since none of the "cores" is getting any faster).

Languages need to explicitly state what proper memory operations are, so that they can be properly implemented and tested. Traditionally this was done only in part, because the operations were thought to be understood, obvious, and matched to the hardware. And there was a school of thought, the Kernighan & Ritchie approach, of having the language specify only a few rules that have to be followed and letting the implementer make their own choices about the rest. (Other language approaches are specific, and say it is an error if a program tries to do something that is ambiguous or up to the compiler implementer.)

The result is that there are elements of code (often memory access sequences) that are likely never to be correctly implemented on modern architectures (desktop, server, handheld, high-end embedded). My opinion is that whole classes of IPC techniques will not be practical on those classes of CPUs. This has left us with attempts to improve the languages, to understand the limits of memory models, and to understand what we should not expect to be able to do.

Software transactional memory was introduced in a (much hyped) attempt to do away with all of the issues and problems with synchronization, while retaining the language and interoperability with existing code. It's pretty clear that this has either failed or is another decade away from meaningful utility.

The Consistency, Availability, Partition tolerance (CAP) Theorem presents another problem for the future. The short version is that you can't simultaneously have performance, scalability, and consistency: you must accept latency in memory operations completing, or that they may fail to complete correctly. This is the norm in distributed computing. It may become the norm in servers and desktops as the memory hierarchy becomes more convoluted.

In the end, I do not know that there is a solution so much as there will be accepted folk wisdom around what not to do.