I was recently in a conversation and the topic of long-lived, but hard-to-field-service embedded systems came up. The discussion ranged from checking for system degradation through error counters to checking for memory corruption. This time I thought I'd share some techniques (and ideas) on checking the memory in embedded systems.
For quality, critical, or hard-to-service devices, it may be important to identify when the hardware (or software) degrades or develops faults. When such a problem is found, the design may call for the device to stop operating, or to reduce performance and functionality until it is replaced. This might be a system intended to be deployed in the field for a long time (possibly in inaccessible or not readily serviceable areas). It might be a system with safety considerations. Or one where the customer is buying based on quality for their own reasons.
These ideas and techniques are long standing (some date back more than half a century). They are simple enough to be included in even basic microcontrollers. While I will be focusing on what software can contribute, there are numerous hardware techniques (e.g. ECC memory) that can be done as well.
The simplest integrity check is to continuously scan memory regions, to verify that the contents have not changed unexpectedly. This can be used to check that read-only (and infrequently changed) memory - such as program memory - has not changed. The basic process is to continuously perform incremental CRC checks on the memory. CRCs are compact and fast enough for most platforms (although a particular CRC may not be appropriate for all platforms). They are good for finding single-bit errors, error runs, and so forth. Of course, a CRC doesn't find malicious errors - that is to say, it is possible for someone to craft memory corruption that passes the CRC check.
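As a rough illustration of an incremental CRC check, here is a minimal sketch in C. It assumes a bitwise CRC-32 (reflected polynomial 0xEDB88320) and a reference CRC that was computed when the image was built; the names crc_scan_t, crc_scan_start and crc_scan_step are hypothetical and chosen only for this example. A table-driven or hardware CRC may suit a given platform better.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scan state for one memory region (names are illustrative). */
typedef struct {
    const uint8_t *base;     /* start of the region being checked          */
    size_t         length;   /* size of the region in bytes                */
    size_t         offset;   /* next byte to feed into the CRC             */
    uint32_t       crc;      /* running CRC, not yet finalized             */
    uint32_t       expected; /* reference CRC (assumed known at build time) */
} crc_scan_t;

typedef enum { SCAN_IN_PROGRESS, SCAN_PASS, SCAN_FAIL } scan_result_t;

/* Bitwise CRC-32 update; slow but small. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

static void crc_scan_start(crc_scan_t *s, const uint8_t *base,
                           size_t length, uint32_t expected)
{
    s->base = base;
    s->length = length;
    s->offset = 0;
    s->crc = 0xFFFFFFFFu;
    s->expected = expected;
}

/* Process at most `chunk` bytes, so each call has a bounded run time.
 * When the end of the region is reached, the result is reported and the
 * state is reset so the next call starts a fresh pass. */
static scan_result_t crc_scan_step(crc_scan_t *s, size_t chunk)
{
    size_t remaining = s->length - s->offset;
    if (chunk > remaining)
        chunk = remaining;

    s->crc = crc32_update(s->crc, s->base + s->offset, chunk);
    s->offset += chunk;
    if (s->offset < s->length)
        return SCAN_IN_PROGRESS;

    uint32_t final_crc = s->crc ^ 0xFFFFFFFFu;
    s->offset = 0;
    s->crc = 0xFFFFFFFFu;
    return (final_crc == s->expected) ? SCAN_PASS : SCAN_FAIL;
}
```

The step function would typically be called from an idle loop or a low-priority periodic task, with the chunk size picked to fit the time budget available.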
The basic process for integrity checking is:
I mentioned that this process should be applied to all of the read-only and infrequently changed storage in the system. Often there are several such memory regions. Such regions may include:
The checks can be done sequentially, in parallel, or in a combination of the two. Regions checked in parallel each need their own check state; a sequential check can reuse one state as it moves from one region on to the next. I recommend that this process be used to check each of the different "read only" regions of memory to ensure their integrity.
This should include regions of memory that are empty or unused - that is, regions that have no code or data, but are still a storage area. In that way, it can help capture errant writes.
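A sequential check over several regions could be sketched as below. This is only an illustration: the region_t table, the walker names, and the idea of giving the unused area its own table entry with its own reference CRC are assumptions made for the example. One shared state moves from region to region.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per region to verify (hypothetical layout). */
typedef struct {
    const uint8_t *base;
    size_t         length;
    uint32_t       expected_crc;  /* assumed to be known ahead of time */
} region_t;

/* A single check state, reused as the walk moves through the table. */
typedef struct {
    const region_t *regions;
    size_t          count;
    size_t          index;   /* region currently being scanned */
    size_t          offset;  /* next byte within that region   */
    uint32_t        crc;     /* running CRC for that region    */
} region_walker_t;

static uint32_t crc32_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

static void walker_init(region_walker_t *w, const region_t *regions, size_t count)
{
    w->regions = regions;
    w->count = count;
    w->index = 0;
    w->offset = 0;
    w->crc = 0xFFFFFFFFu;
}

/* Feed up to `chunk` bytes of the current region into the CRC.
 * Returns -1 while a region is still in progress or when it passed,
 * and the table index of the region when its CRC did not match.
 * After a region completes, the walk moves on to the next one and
 * wraps around at the end of the table. */
static int walker_step(region_walker_t *w, size_t chunk)
{
    const region_t *r = &w->regions[w->index];

    while (chunk-- && w->offset < r->length)
        w->crc = crc32_byte(w->crc, r->base[w->offset++]);
    if (w->offset < r->length)
        return -1;

    int failed = ((w->crc ^ 0xFFFFFFFFu) == r->expected_crc) ? -1 : (int)w->index;
    w->index = (w->index + 1u) % w->count;
    w->offset = 0;
    w->crc = 0xFFFFFFFFu;
    return failed;
}
```

To run regions in parallel instead, each region would get its own walker (or its own offset and CRC), at the cost of a little more RAM.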
Several basic hardware checks can be done:
The algorithm is documented in several places. Bus-based checks - to find stuck pins on a bus, as well as damaged cells.
Michael Barr has described some specific tests for bus-based memory in the following book and article. He does an excellent job of describing what problem is being tested for, and the mechanism he is using to test it. And he includes useful source code.
The latter is a destructive test; it destroys the contents of the memory. This can be worked around by saving a window of the memory, performing the tests, and restoring it when that portion of the test is done.
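One common form of a bus-based check is a "walking ones" test of the data lines. The sketch below is a generic illustration of that idea, not code taken from any particular reference; it assumes a 32-bit wide data bus and a RAM address that is safe to overwrite. The function name is hypothetical.

```c
#include <stdint.h>

/* Write a single 1 bit in each data-line position in turn and read it
 * back. A stuck or shorted data line shows up as a mismatch.
 * Returns 0 if every pattern read back correctly, otherwise the first
 * pattern that failed. The word at `address` is overwritten. */
static uint32_t test_data_bus(volatile uint32_t *address)
{
    for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
        *address = pattern;
        if (*address != pattern)
            return pattern;
    }
    return 0u;
}
```

On a real target, caches and write buffers may need to be bypassed or flushed so that the read actually reaches the memory device.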
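The save, test, restore approach might look something like the following sketch. The window size, the two test patterns, and the function names are assumptions for illustration. The saved copy has to live in memory that is not under test, and nothing else (interrupts, DMA, other tasks) may touch the window while it is being tested.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_WORDS 16u  /* assumed window size; tune to available RAM */

/* Destructive test: fill the words with a pattern, read it back, then
 * repeat with the inverse pattern so every bit holds both 0 and 1. */
static bool pattern_test(volatile uint32_t *base, size_t words)
{
    static const uint32_t patterns[2] = { 0xAAAAAAAAu, 0x55555555u };

    for (size_t p = 0; p < 2u; p++) {
        for (size_t i = 0; i < words; i++)
            base[i] = patterns[p];
        for (size_t i = 0; i < words; i++)
            if (base[i] != patterns[p])
                return false;
    }
    return true;
}

/* Test a region one window at a time, preserving its contents.
 * On a real target, interrupts would normally be disabled around each
 * window (not shown here). */
static bool test_region_preserving(volatile uint32_t *base, size_t words)
{
    uint32_t saved[WINDOW_WORDS];  /* must not overlap the region under test */
    bool ok = true;

    for (size_t start = 0; start < words; start += WINDOW_WORDS) {
        size_t n = words - start;
        if (n > WINDOW_WORDS)
            n = WINDOW_WORDS;

        for (size_t i = 0; i < n; i++)
            saved[i] = base[start + i];
        if (!pattern_test(base + start, n))
            ok = false;
        for (size_t i = 0; i < n; i++)
            base[start + i] = saved[i];
    }
    return ok;
}
```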
These memory tests are often developed for manufacturing tests anyway, so it can be cost effective to develop them for both the main application and tests at one time.
Now that we've found some bad memory, we need a plan on how to respond. This is roughly divided into one of two kinds of systems: those with paged memory and those without.
Paged memory systems offer the easiest way to track bad blocks of memory. With paged memory, the processor has a mechanism that maps the logical addresses to the physical ones. In our case, the key property is that the kernel software is responsible for ensuring that it doesn't allocate physical memory that is already in use. We can exploit this mechanism to cleanly track the bad pages: all we have to do is allocate the pages with bad storage cells, and never release them. This table of bad pages should be stored in non-volatile memory and used at system start up to prevent the bad pages from being reused.
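The bookkeeping for this can be quite small. Here is a minimal sketch of a page allocator that treats bad pages as permanently allocated; the page count, the bitmap layout, and all of the names are assumptions for the example, and the step that writes the bad-page bitmap to non-volatile memory is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_PAGES 256u  /* assumed number of physical pages */

typedef struct {
    uint8_t in_use[NUM_PAGES / 8u];  /* allocation bitmap                */
    uint8_t bad[NUM_PAGES / 8u];     /* copy of the persisted bad pages  */
} page_allocator_t;

static bool bit_get(const uint8_t *bits, size_t i)
{
    return ((bits[i / 8u] >> (i % 8u)) & 1u) != 0u;
}

static void bit_set(uint8_t *bits, size_t i)
{
    bits[i / 8u] |= (uint8_t)(1u << (i % 8u));
}

static void bit_clear(uint8_t *bits, size_t i)
{
    bits[i / 8u] &= (uint8_t)~(1u << (i % 8u));
}

/* At start up: load the persisted bad-page bitmap and mark every bad
 * page as already allocated. */
static void allocator_init(page_allocator_t *a, const uint8_t *persisted_bad)
{
    memcpy(a->bad, persisted_bad, sizeof a->bad);
    memcpy(a->in_use, persisted_bad, sizeof a->in_use);
}

/* Record a newly found bad page. The caller is expected to write
 * a->bad back to non-volatile storage afterwards. */
static void mark_bad(page_allocator_t *a, size_t page)
{
    bit_set(a->bad, page);
    bit_set(a->in_use, page);
}

/* Returns the index of a free page, or -1 if none is left. */
static int alloc_page(page_allocator_t *a)
{
    for (size_t page = 0; page < NUM_PAGES; page++) {
        if (!bit_get(a->in_use, page)) {
            bit_set(a->in_use, page);
            return (int)page;
        }
    }
    return -1;
}

/* Bad pages are never released, so they can never be handed out again. */
static void free_page(page_allocator_t *a, size_t page)
{
    if (!bit_get(a->bad, page))
        bit_clear(a->in_use, page);
}
```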
What if the bad memory page was being used? Then we can use a technique called block sparing: copy the page of memory to a working area of memory, remap the logical address to the new address, and add the old address to the set of bad pages. (If the page was copied before the test was performed, you can use that copy.) If the memory wasn't corrupted (that is, it passes some other check on the contents, even though the storage hardware is failing for that page) or can be recovered, the system can continue. Otherwise the hardware should then reinitialize and resume from a known state, or go into a restricted fault state.
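To make the block sparing steps concrete, here is a small host-side model. It is a sketch only: the "MMU" is just an array that maps logical pages to physical ones, and the sizes and names are invented for the example. On real hardware the remap step would update the page tables and flush any cached translations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES   64u  /* assumed page size for the model */
#define NUM_PHYS     8u   /* assumed number of physical pages */
#define NUM_LOGICAL  4u   /* assumed number of logical pages  */

/* Simplified model of paged memory (illustrative only). */
typedef struct {
    uint8_t phys[NUM_PHYS][PAGE_BYTES];  /* simulated physical pages   */
    uint8_t map[NUM_LOGICAL];            /* logical -> physical page   */
    bool    in_use[NUM_PHYS];
    bool    bad[NUM_PHYS];
} mmu_model_t;

static int find_free_page(const mmu_model_t *m)
{
    for (unsigned page = 0; page < NUM_PHYS; page++)
        if (!m->in_use[page])
            return (int)page;
    return -1;
}

/* Move `logical` off its failing physical page.
 * If `known_good_copy` is not NULL (for example, a copy saved before the
 * memory test ran), it is used as the source; otherwise the contents are
 * read from the failing page itself and should be validated by the caller.
 * Returns false when no spare page is available, in which case the
 * caller would fall back to a fault state. */
static bool spare_page(mmu_model_t *m, unsigned logical,
                       const uint8_t *known_good_copy)
{
    int spare = find_free_page(m);
    if (spare < 0)
        return false;

    unsigned old = m->map[logical];
    const uint8_t *src = known_good_copy ? known_good_copy : m->phys[old];

    memcpy(m->phys[spare], src, PAGE_BYTES);  /* copy the page            */
    m->in_use[spare] = true;
    m->map[logical] = (uint8_t)spare;         /* remap logical address    */
    m->bad[old] = true;                       /* old page stays allocated */
    return true;
}
```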
Tracking bad regions in non-paged memory is a little bit harder than in paged memory. Structures such as lists or bitmaps can be used to track the bad regions. Recovery is much harder than with paged memory. In most cases you're probably using the memory - i.e. the data stack or program variables with a fixed address are there - and can't relocate it. It is better that the device enter a fault state, set a failed bit, and get replaced.
More sophisticated checking techniques may include checks for each of many critical structures in memory. This is finer grained, identifying what records are corrupted as well as the addressable regions.
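One way to get this finer-grained checking is to store a small check value inside each critical record. The sketch below assumes a CRC-16 (polynomial 0x1021, initial value 0xFFFF) kept in the last field of the record; the record layout and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical critical record; the CRC covers every field before it. */
typedef struct {
    uint32_t setpoint;
    uint16_t mode;
    uint16_t crc;
} critical_record_t;

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Call after every legitimate update of the record. */
static void record_seal(critical_record_t *rec)
{
    rec->crc = crc16_ccitt((const uint8_t *)rec,
                           offsetof(critical_record_t, crc));
}

/* Call before using the record, or from a periodic checker. */
static bool record_is_valid(const critical_record_t *rec)
{
    return rec->crc == crc16_ccitt((const uint8_t *)rec,
                                   offsetof(critical_record_t, crc));
}
```

A failed check identifies exactly which record is suspect, so the response can be scoped to that record (reload a default, use a redundant copy) rather than to the whole region.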
Yes! The worst problem I've found was memory corruption and it was a matter of figuring out how the corruption was triggered in otherwise good hardware and software.
It turned out that the micro-controller had a bad clock signal and would occasionally over-clock the processor. The instruction pointer would randomly jump around in code and execute the instructions to write memory. The random jumps meant the processor could (and eventually would) bypass any checks on write enables that occur before writing sensitive areas, such as program memory, or other nonvolatile memory. There didn't have to be any calling path to this write-memory code, and it would still 'run' when the instruction pointer jumped to it.
(From here we could discuss memory structures that aid in recovery, structures that are resistant to errors.)