Software and Hardware Single Event Effect Mitigation
❗️Note: This was a class homework assignment for my Master's at the Federal University of Rio Grande do Sul and was not peer reviewed. Its intention was to write a review of a presentation from the SERESSA-2020 event.
Introduction #
With the continuing trend toward higher-density devices for faster
processing and lower charge requirements, a comparable amount of charge
can be generated in the semiconductor by the passage of cosmic rays or
alpha particles. These charges may, for example, temporarily change
memory contents or commands in a given instruction stream. Radiation in
space-borne electronic systems may penetrate sensitive nodes in these
devices and affect their functions and behavior.
The first satellite inconsistency was reported in 1975 by D. Binder et
al., on SEUs in flip-flops. In 1978, the first SEU on earth was
observed, caused by alpha particles emitted by packaging material in a
chip, eventually affecting the RAM. In 1979, the first report on SEUs
due to cosmic rays was published, and in 1992, the first destructive SEE
was observed in a memory on a space-operating resource satellite.
The phenomenon of SEE arises when a single energetic particle penetrates
these sensitive nodes, causing glitches in the electronic system or
catastrophic failures at the circuit level.
Faults that may affect the system during its lifetime can be classified
from several basic viewpoints. By phenomenological cause and
persistence, they may be: (i) natural faults, caused by natural
phenomena without human interaction; (ii) human-made faults, resulting
from human interaction, such as production defects; (iii) transient
faults, present within a bounded time frame; and (iv) permanent faults,
present within a continuous time frame.
During system operation, natural faults can be either internal, due to
the natural process of physical deterioration, or external, due to
natural processes that happen outside the system boundaries and may
cause hardware interference.
In a fault-tolerant architecture, a fault is a physical defect, such as
a broken transistor. A fault may manifest itself as an error, such as a
bit 0 in place of a bit 1, or it may not manifest at all. An error, in
turn, can be masked or can result in a user-visible failure.
A fault does not necessarily become an error, nor does an error
necessarily become a failure; both can be mitigated by masking at the
design level. The effect of an error at the logic level may not affect
the system and may not propagate to the architectural level either, as
it depends on which instruction the error impacts. Even errors that
propagate to the application level may have no effect, as the error may
land in a memory location unused by the application and thus never get
triggered.
A transient fault occurs once and does not persist in the system; the
errors it produces are often referred to as soft errors or SEUs.
Permanent faults are often called hard faults; they persist after they
occur and may manifest as repeated errors. An intermittent fault occurs
repeatedly, but not continuously, in the same place in the system.
Radiation hardening of devices and SEE fault tolerance approaches have
been developed to mitigate these issues when they arise.
Because many physical phenomena may lead to a fault, a variety of
techniques are available for mitigating these issues according to the
environment the devices operate in. Transient faults can be generated by
high-energy particles such as cosmic rays, by alpha particles, or even
by electromagnetic interference from outside sources.
The effect of such a fault may be to change the value of a cell or a
transistor's output. Because the disruption is a one-time event, the
error vanishes once the cell or the transistor's output is overwritten.
This work aims at characterizing the types of SEE and the state of the art in mitigating these issues at the circuit and software levels. The rest of the paper is organized as follows: Section 2 gives a brief background on the types of SEE and how they may affect the system, along with fault metrics and types of errors; Section 3 presents some techniques for mitigating single events at the circuit level; Section 4 covers software-based approaches for single event mitigation. Finally, Section 5 provides conclusions.
Background #
With shrinking transistors, wires, and chips, the tendency toward
transient and permanent faults is much higher, as the dimensions of the
chip directly impact its temperature. As Moore's law increases the
number of transistors per chip, more opportunities arise for faults both
in the field and during manufacturing. The growing complexity of
processor design increases the likelihood of design bugs slipping into
production, which may bring permanent faults to the processor at
execution time.
Types of Single Event Effect #
SEE depend on the interaction of a single particle penetrating the
device, for example the passage of a single heavy ion from a cosmic ray.
As cosmic rays are highly energetic in space, they may pass through the
device, and the charge they deposit may be collected at the device's
electrodes. The ion produces an electric pulse that may appear to the
device as a signal it should respond to, eventually causing a failure.
High-energy protons can also cause failures, as a proton may undergo a
nuclear reaction in the silicon device.
SEE has a variety of possible effects, each of which is important, as
they cause malfunctioning of devices in space ionizing radiation
environments:
| Term | Definition |
|---|---|
| Single event upset | A change of state or transient induced by an energetic particle |
| Single hard error | An event causing permanent changes to the operation of the device |
| Single event latch-up | Loss of device functionality induced by a high-current state |
| Single event burnout | A condition in which a high-current state in a power transistor destroys the device |
| Single event effect | Any measurable effect on a circuit due to an ion strike |
| Multiple bit upset | An event induced by a single energetic particle that causes multiple upsets or transients |
| Linear energy transfer | A measure of the energy deposited per unit length |
Fault tolerance metric #
A fault tolerance solution requires experiments to test a hypothesis or
to compare with previous work, and requires knowing which errors may
occur within the system.
Error detection #
Error detection provides a measure of safety and is a key aspect of
fault tolerance, since the processor cannot tolerate a problem it is not
aware of. Redundancy is fundamental to error detection, as it is what
lets the processor detect that a given error has occurred. There are
three classes of redundancy: spatial, temporal, and information
redundancy.
Spatial redundancy adds redundant hardware to the system. Dual modular
redundancy (DMR) is a simple form of spatial redundancy: it provides
error detection through a checker that receives the outputs of both
modules and compares them for any mismatch.
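As a minimal sketch (the function names are illustrative, not from the presentation), a DMR checker can be modeled as two redundant computations feeding a comparator:

```python
def dmr(module_a, module_b, inputs):
    """Run two redundant copies of the same logical module and
    compare their outputs. A mismatch means an error was detected;
    DMR cannot tell which copy is wrong, so it detects but does
    not correct."""
    out_a = module_a(inputs)
    out_b = module_b(inputs)
    if out_a != out_b:
        raise RuntimeError("DMR mismatch: error detected")
    return out_a
```

Because DMR only detects, a mismatch must be handled by a separate recovery mechanism.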
Temporal redundancy performs redundant operations by requiring a unit to
operate twice and then compare the results. Temporal redundancy doubles
the time for each operation; however, in comparison to spatial
redundancy, there is no extra hardware or power cost involved. To reduce
the performance cost, some schemes use pipelining to hide the latency of
the redundant operation.
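Sketched in software (a simplified analogy of the hardware mechanism; the function is my own):

```python
def temporally_redundant(op, *args):
    """Execute the same operation twice, separated in time, and
    compare the results. A transient fault disturbing only one of
    the two executions shows up as a mismatch; the price is
    roughly double the execution time, with no extra hardware."""
    first = op(*args)
    second = op(*args)  # the redundant, later execution
    if first != second:
        raise RuntimeError("transient fault detected")
    return first
```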
Finally, information redundancy detects when a datum has been affected
by adding bits to it. Error detecting codes (EDC) can be used for such
redundancy, for example by adding a parity bit to a data word,
converting it into a codeword. The parity scheme is popular due to its
simplicity and inexpensive implementation.
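An even-parity EDC can be sketched as follows (helper names are my own):

```python
def encode_parity(word: int) -> int:
    """Append one even-parity bit so the codeword always carries
    an even number of 1 bits."""
    parity = bin(word).count("1") & 1
    return (word << 1) | parity

def parity_ok(codeword: int) -> bool:
    """A single flipped bit makes the 1-count odd, which is how a
    single-bit error is detected (but not located)."""
    return bin(codeword).count("1") % 2 == 0
```

Parity detects any odd number of bit flips but cannot correct them, so it must be combined with a recovery mechanism.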
Error recovery #
Error detection is enough to provide safety to the system, but not to
recover from the error. Recovering from the error hides its effect from
the end user and allows the system to resume operation.
Forward error recovery (FER) corrects the error without having to revert
to a previous state. FER can be implemented through physical, temporal,
and information means of redundancy. In FER, if a certain amount of
redundancy is required to determine that an error has occurred, then
additional redundancy is required to correct it.
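Triple modular redundancy with a majority voter is a classic illustration of this principle (my example, not taken from the presentation): a third copy provides the extra redundancy needed to correct, not merely detect.

```python
def tmr_vote(a, b, c):
    """Majority vote over three redundant results: a single
    erroneous copy is outvoted, so the error is corrected in
    place, without rolling back to an earlier state."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: uncorrectable error")
```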
Backward error recovery (BER) restores the state of the system to a
previously known good state, called a recovery point on single-core
systems and a recovery line on multi-core systems. The system architect
should think through what state should be saved for recovery, where and
when to save and deallocate recovery points, which algorithm to use, and
what to do after the system has been restored.
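A minimal sketch of BER with an in-memory recovery point (class and method names are illustrative):

```python
import copy

class RecoverableSystem:
    """Keep a deep copy of the last known-good state; on a
    detected error, roll the live state back to it."""

    def __init__(self, state):
        self.state = state
        self._recovery_point = copy.deepcopy(state)

    def save_recovery_point(self):
        # The architect decides *what* to save and *when*.
        self._recovery_point = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the previously known good state.
        self.state = copy.deepcopy(self._recovery_point)
```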
Hardware Mitigation #
Soft errors #
Schmitt Triggers #
In high-noise applications, the Schmitt trigger (ST) works as a
replacement for the internal inverter of a circuit. The ST has a
stronger dependence on the source-gate voltage of its P1 and N1
transistors, resulting in enhanced robustness against voltage transfer
characteristic (VTC) deviations.
Decoupling Cells #
Transient effects can be mitigated by connecting capacitive elements to
the most exposed nodes. Decoupling cells increase the total capacitance
at the output node of, for example, a NAND2 gate, raising the critical
charge required to produce a single event transient pulse and thereby
mitigating signal degradation.
Sleep Transistors #
Circuit blocks that are not in use can be shut off using the
power-gating strategy, widely used in low-power designs to reduce a
chip's power consumption. Sleep transistors act as supply-voltage
regulators. When a sleep transistor is in active mode, it improves the
process variability of a typical logic gate's connection to the ground
rail by acting as a voltage regulator; while in standby, it disconnects
the virtual ground from the physical ground.
Transistor Reordering #
Optimizing transistor arrangements allows reducing current leakage or
dealing with bias temperature instability. This technique modifies the
transistor arrangement while keeping the same intended functionality.
Transistor reordering changes the electrical and physical
characteristics of the logic cells, reducing their susceptibility to
soft errors. The robustness of complex gates can be improved by up to 8%
with this approach, which makes it attractive for improving the single
event stability of circuits without incurring an area penalty in complex
gates.
Software Mitigation #
Software approaches can also be used against hardware errors. The
primary appeal of software redundancy is that it brings no hardware cost
and requires no hardware modification. The software approach may provide
good coverage of possible errors and can be tested easily compared to
hardware approaches. The cost of software redundancy may still be
significant, as performance may be lost depending on the core model and
software workload, since instruction duplication requires more
processing.
Selective Code Duplication #
In selective code duplication, only parts of the code are duplicated and their results compared, which reduces fault coverage but improves code size and execution time overhead. Multiple techniques use selective code duplication, such as SWIFT, VAR3+, CBD, and SEDSR.
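The idea can be sketched at the source level (a caricature of what these compiler techniques do at the instruction level; names are mine):

```python
def run_with_selective_duplication(critical_op, other_op, x):
    """Only the computation deemed critical is duplicated and
    checked; the rest executes once. Coverage is lower than with
    full duplication, but so are code size and execution time."""
    a = critical_op(x)
    b = critical_op(x)  # duplicated copy of the critical part
    if a != b:
        raise RuntimeError("mismatch in duplicated critical code")
    return other_op(a)  # non-critical part runs unprotected
```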
Error detection by duplicated instructions #
EDDI consists of inserting redundant instructions, together with
instructions that compare the results produced by the original and the
redundant instructions.
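EDDI operates at the instruction level during compilation; a source-level sketch of the transformation for a single addition (my illustration) looks like:

```python
def eddi_add(a, b):
    """EDDI-style protection of r = a + b: the shadow instruction
    recomputes the value into a separate 'register', and an
    inserted comparison checks both before the result is used."""
    r = a + b          # original instruction
    r_shadow = a + b   # redundant (shadow) instruction
    if r != r_shadow:  # inserted comparison instruction
        raise RuntimeError("EDDI: mismatch detected")
    return r
```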
Error detection by diverse data and duplicated instructions #
EDDDDI is a full code duplication technique in which all instructions in
a block are duplicated. Comparison instructions are placed after each
original and duplicated instruction pair in each block to compare their
results.
Overhead reduction #
In the VAR3 technique, all instructions in a block, except for branch
and store instructions, are duplicated. The comparison instructions are
placed before load, store, and branch instructions to compare the
results.
Critical block duplication #
In the CBD technique, critical blocks have to be identified in the
control flow graph. Any block with the highest number of fan-outs in the
control flow graph is considered a critical block. If any mismatch of
results is detected, an error is reported.
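Identifying critical blocks can be sketched over a control flow graph represented as an adjacency mapping (the representation and function name are my own):

```python
def critical_blocks(cfg):
    """Given a CFG as {block: [successor, ...]}, return the blocks
    with the highest fan-out (number of outgoing edges); CBD
    selects these blocks for duplication."""
    max_fanout = max(len(succs) for succs in cfg.values())
    return [blk for blk, succs in cfg.items()
            if len(succs) == max_fanout]

# Usage with a hypothetical 4-block graph: "b" has fan-out 3,
# so it is the critical block.
cfg = {"entry": ["a", "b"], "a": ["exit"],
       "b": ["a", "exit", "entry"], "exit": []}
```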
Soft error detection using software redundancy #
SEDSR is an extended version of CBD; however, the comparison
instructions are added after the original and duplicated instructions in
each identified block to compare results.
Conclusions #
With the continuing trend toward smaller chip sizes, the tendency toward transient and permanent faults is much higher. This paper sought to characterize the types of SEE, how they affect a system according to its environment, and which metrics matter when considering a fault-tolerant design. Understanding the difference between error detection and error recovery allows one to seek a solution that fits one's requirements. Multiple fields of mitigation have been reviewed, from circuit-level techniques to software-level approaches. Although software mitigation usually impacts performance, it is a cheaper alternative compared to hardware alternatives.
Download #
Download the PDF version of this file here.