Facelift: Hiding and Slowing Down Aging in Multicores
Abhishek Tiwari
and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Abstract
Processors progressively age during their service life due to
normal workload activity. Such aging results in gradually slower
circuits. Anticipating this fact, designers add timing guardbands
to processors, so that processors last for a number of years. As a
result, aging has important design and cost implications.
To address this problem, this paper shows how to hide the effects
of aging and how to slow it down. Our framework is called
Facelift. It hides aging through aging-driven application scheduling.
It slows down aging by applying voltage changes at key times;
it uses a non-linear optimization algorithm to carefully balance
the impact of voltage changes on the aging rate and on the
critical path delays. Moreover, Facelift can gainfully configure
the chip for a short service life. Simulation results indicate that
Facelift leads to more cost-effective multicores. We can take a
multicore designed for a 7-year service life and, by hiding and
slowing down aging, enable it to run, on average, at a 14-15%
higher frequency during its whole service life. Alternatively, we
can design the multicore for a 5 to 7-month service life and still
use it for 7 years.
1 Introduction
The challenges of ensuring the reliability of upcoming, deep
sub-micron hardware have spurred interest in the fact that proces-
sors age or wear-out while executing their normal workloads [7].
In particular, the maximum frequency that a processor can deliver
decreases slowly and gradually with time [4].
Two major mechanisms that induce progressive slowdown in
processors are Negative Bias Temperature Instability (NBTI) and
Hot-Carrier Injection (HCI) [4]. Roughly speaking, these effects
are due to stresses induced on transistors by the normal, contin-
uous movement of charges. At the macroscopic level, these ef-
fects manifest as gradually slower transistors and, hence, gradu-
ally slower critical paths.
Anticipating this fact, processor designers add timing guardbands
to their designs [3]. The goal of guardbands is to absorb
any increase in critical path delay during the processor's service
life, and avert any timing error. Anecdotal evidence suggests that
current processors include a guardband to last for 7-10 years.
Clearly, the aging process has important implications on the design
and cost of processors.
While aging or wearout has been extensively studied at the device
level, there is relatively little work at the architecture or system
level. Specifically, Srinivasan et al. [29, 30] focus on modeling
the Mean Time To Failure (MTTF) due to aging mechanisms.
The value of such MTTF is typically multiple times the expected
service life of a processor, since one should expect that only a
negligible fraction of processors will fail during their service life.
The authors propose voltage, frequency, and microarchitectural
adaptations to attain the required MTTF more cost-effectively.
This work was supported by the National Science Foundation under
grant CPA-0702501 and by SRC GRC under grant 2007-HJ-1592.
Abhishek Tiwari is now with Goldman Sachs, New York City.
Our goal is to understand how critical path delays increase due
to aging during the service life of the processor, and then use in-
expensive techniques to reduce the performance degradation. The
most relevant work that we are aware of is that of Ramakrishnan et
al. [22], Abella et al. [1], and Shin et al. [26]. Their idea is to dy-
namically set the transistors to a logic value that undoes some
of the aging process during periods when this does not dis-
rupt processor execution. While their techniques are effective and
transparent to software layers, if they are applied widely across
processor structures, they are likely to be intrusive to the proces-
sor design and may have performance implications. Ideally, we
would like approaches that do not affect processor internals.
In practice, aging depends exponentially on high-level parameters
that can be easily manipulated: supply voltage ($V_{dd}$),
temperature ($T$), and threshold voltage ($V_t$). Small changes to these
parameters at key times in the processor's service life can have
major effects without requiring intrusive designs.
Using this general approach, this paper contributes a
framework of techniques to (i) hide the effects of aging in a
multicore, (ii) slow down aging, and (iii) gainfully configure the chip
for a short service life. We call our framework Facelift. A second
contribution is to show how the shorter guardband enabled
by Facelift can be used to either (i) design a less refined version
of the processor or (ii) clock the processor at a higher frequency.
Facelift hides the effects of aging in a multicore by steering
high-T jobs to the fast cores and low-T ones to the slow cores.
Keeping the slow cores cooler enables the chip to appear to age
less.
Facelift slows down aging by making small, chip-wide
changes to $V_{dd}$ or $V_t$ at key times, using a non-linear
optimization algorithm to carefully balance the impact of the changes on
the aging rate and on the critical path delays. Finally, Facelift
configures a chip for a short service life by shifting performance
from the unused lifetime portion to the used one.
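The job-steering idea above (hot jobs on fast cores, cool jobs on slow cores) can be illustrated with a minimal sketch. It is not the paper's scheduler: it assumes exactly one job per core, and the function name `assign_jobs`, its inputs, and the sample numbers are all hypothetical.

```python
def assign_jobs(core_fmax_ghz, job_temp_c):
    """Map core index -> job index, giving hotter jobs to faster cores.

    core_fmax_ghz: current maximum frequency of each core (hypothetical input).
    job_temp_c: estimated steady-state temperature of each job (hypothetical input).
    """
    if len(core_fmax_ghz) != len(job_temp_c):
        raise ValueError("this sketch assumes exactly one job per core")
    # Core indices ordered from fastest to slowest.
    cores = sorted(range(len(core_fmax_ghz)),
                   key=lambda c: core_fmax_ghz[c], reverse=True)
    # Job indices ordered from hottest to coolest.
    jobs = sorted(range(len(job_temp_c)),
                  key=lambda j: job_temp_c[j], reverse=True)
    # Pair them rank by rank: hottest job on fastest core, and so on.
    return dict(zip(cores, jobs))


if __name__ == "__main__":
    fmax = [4.0, 3.6, 3.8, 3.5]       # made-up per-core frequencies (GHz)
    temps = [70.0, 95.0, 60.0, 85.0]  # made-up per-job temperatures (C)
    for core, job in sorted(assign_jobs(fmax, temps).items()):
        print(f"core {core} ({fmax[core]} GHz) <- job {job} ({temps[job]} C)")
```

Sorting both lists and pairing them by rank keeps the slowest core, which limits the chip frequency, as cool as the job mix allows.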
Simulation results indicate that the Facelift techniques lead to
more cost-effective multicore designs. We can take a multicore
designed for a 7-year service life and, by hiding and slowing down
aging, enable it to run, on average, at a 14-15% higher frequency
during its whole service life. Alternatively, we can design the
multicore for a 5 to 7-month service life and still use it for 7
years. Finally, the implementation of the Facelift techniques is
very simple.
This paper is organized as follows. Section 2 provides a back-
ground; Section 3 discusses the impact of aging on architecture;
Section 4 presents the techniques to hide and slow down aging,
and configure for a short service life; Sections 5 and 6 evaluate
Facelift; and Section 7 discusses related work.
2 Background
During a processor's normal, failure-free use, semiconductor-level
mechanisms gradually cause devices to become slower. At
the macroscopic level, this results in critical paths in the processor
gradually taking longer, an important part of a
process that is popularly known as wearout or aging [4]. To tackle
this effect, processor designers add timing guardbands to their
designs [3], so that any increased critical path delays during a
processor's expected service life can be absorbed by the guardband.
Informal observations indicate that current processors include a
guardband to last for 7-10 years, and possibly less for mobile
devices.
According to Bernstein et al. [4], the two key mechanisms that
increase the delay of transistors during their normal, failure-free
operation are Negative Bias Temperature Instability (NBTI) and
Hot-Carrier Injection (HCI). In particular, NBTI is a dominant
effect that has been the subject of much interest (e.g., [10, 15,
19, 21, 36]). An important insight is that both NBTI and HCI
cause a gradual elevation of the threshold voltage ($V_t$) of
transistors [4]: PMOS transistors in NBTI and NMOS transistors in
HCI. A higher $V_t$ in turn increases the transistor switching delay
($T_s$) through the alpha power law [24], where $\alpha \approx 1.3$:
$$T_s \propto \frac{V_{dd} \, L_{eff}}{(V_{dd} - V_t)^{\alpha}} \qquad (1)$$
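To get a feel for how a rise in threshold voltage translates into slower switching, the following sketch evaluates the alpha power law as a ratio, so the unknown proportionality constant and the effective channel length cancel. It assumes the form T_s ∝ V_dd · L_eff / (V_dd − V_t)^α with α = 1.3; the function names and the sample voltages are hypothetical and not taken from the paper.

```python
ALPHA = 1.3  # alpha-power-law exponent quoted in the text


def relative_delay(vdd, vt, alpha=ALPHA):
    """Switching delay up to a constant factor: Vdd / (Vdd - Vt)^alpha.

    L_eff and the proportionality constant are treated as fixed, so they
    drop out when two delays are divided by each other.
    """
    if vdd <= vt:
        raise ValueError("Vdd must exceed Vt for the transistor to switch")
    return vdd / (vdd - vt) ** alpha


def slowdown(vdd, vt_fresh, delta_vt, alpha=ALPHA):
    """Ratio of aged delay to fresh delay for a threshold shift delta_vt."""
    aged = relative_delay(vdd, vt_fresh + delta_vt, alpha)
    fresh = relative_delay(vdd, vt_fresh, alpha)
    return aged / fresh


if __name__ == "__main__":
    # Hypothetical operating point: Vdd = 1.0 V, fresh Vt = 0.2 V, 20 mV shift.
    ratio = slowdown(vdd=1.0, vt_fresh=0.2, delta_vt=0.02)
    print(f"aged delay / fresh delay = {ratio:.4f}")
```

Because the ratio reduces to ((V_dd − V_t) / (V_dd − V_t − ΔV_t))^α, the same threshold shift costs more delay when the gate overdrive is small.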
To propose architectural mechanisms to hide or slow down ag-
ing, we need to understand what factors directly impact the in-
crease in transistor delay due to NBTI and HCI. We do this in
Sections 2.1 and 2.2. The formulas in these sections correspond
to 32nm technology. Bernstein et al. [4] also indicate that elec-
tromigration in wires is another important mechanism observed
during aging. However, since we are unaware of any models in
the public domain that suggest how wire delays are affected by
electromigration during normal, failure-free operation, we neglect
wire delay changes. Finally, Section 2.3 discusses the related is-
sue of process variation.
2.1 NBTI
NBTI is explained by the Reaction-Diffusion model [20].
When logic input 0 is applied to the gate of a PMOS transistor
($V_{gs} = -V_{dd}$), the presence of holes in the channel causes Si-H
bonds to break at the interface between the gate oxide and the
channel. The resulting H diffuses away, leaving positive traps
(Si$^+$) in the interface, which increase $V_t$ [15]. This process is
called the Stress phase. The reaction rate mainly depends on the
temperature ($T$) and the supply voltage ($V_{dd}$). The increase in $V_t$
is [36]:
$$\Delta V_{t\_stress} = A_{NBTI} \times t_{ox} \times \sqrt{C_{ox} (V_{dd} - V_t)} \times e^{\frac{V_{dd} - V_t}{t_{ox} E_0} - \frac{E_a}{kT}} \times t_{stress}^{0.25} \qquad (2)$$
where $t_{stress}$ is the time under stress, $t_{ox}$ is the oxide thickness
(0.65 nm), and $C_{ox}$ is the gate capacitance per unit area
($4.6 \times 10^{-20}$ F/nm$^2$). $E_0$, $E_a$, and $k$ are constants equal to 0.2 V/nm,
0.13 eV, and $8.6174 \times 10^{-5}$ eV/K, respectively. $A_{NBTI}$ is a
constant that depends on the aging rate.
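The sensitivity of the stress-phase shift to temperature, supply voltage, and time can be explored with a short sketch. It assumes the stress expression has the form A_NBTI · t_ox · sqrt(C_ox (V_dd − V_t)) · exp((V_dd − V_t)/(t_ox E_0) − E_a/(kT)) · t_stress^0.25 and uses the constants listed above. The value of A_NBTI is not given here, so the sketch sets it to a placeholder of 1.0 and only compares ratios, in which it cancels; the function name and operating points are hypothetical.

```python
import math

T_OX_NM = 0.65     # oxide thickness (nm), from the text
C_OX = 4.6e-20     # gate capacitance per unit area (F/nm^2), from the text
E0 = 0.2           # V/nm, from the text
EA = 0.13          # eV, from the text
K_B = 8.6174e-5    # Boltzmann constant (eV/K), from the text


def delta_vt_stress(vdd, vt, temp_k, t_stress, a_nbti=1.0):
    """Stress-phase threshold shift under the assumed form of the model.

    a_nbti is a placeholder (the real constant is not given in this section),
    so only ratios of two results are meaningful.
    """
    overdrive = vdd - vt
    exponent = overdrive / (T_OX_NM * E0) - EA / (K_B * temp_k)
    return (a_nbti * T_OX_NM * math.sqrt(C_OX * overdrive)
            * math.exp(exponent) * t_stress ** 0.25)


if __name__ == "__main__":
    base = delta_vt_stress(vdd=1.0, vt=0.2, temp_k=333.15, t_stress=1.0)
    hot = delta_vt_stress(vdd=1.0, vt=0.2, temp_k=373.15, t_stress=1.0)
    boosted = delta_vt_stress(vdd=1.1, vt=0.2, temp_k=333.15, t_stress=1.0)
    longer = delta_vt_stress(vdd=1.0, vt=0.2, temp_k=333.15, t_stress=16.0)
    print(f"60 C -> 100 C:        x{hot / base:.2f}")
    print(f"Vdd 1.0 V -> 1.1 V:   x{boosted / base:.2f}")
    print(f"16x longer stress:    x{longer / base:.2f}")
```

The time dependence is the weakest of the three: since the shift grows as t_stress^0.25, stressing a transistor 16 times longer only doubles it, whereas temperature and voltage enter through the exponential.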
When logic input 1 is applied to the gate ($V_{gs} = 0$), the
transistor turns off, and H atoms diffuse back, eliminating some of
the traps. This process is called the Recovery phase. The final
increase of $V_t$ after considering both the stress and recovery phases
is [36]:
$$\Delta V_t = \Delta V_{t\_stress} \times \left(1 - \sqrt{\eta \, \frac{t_{rec}}{t_{stress} + t_{rec}}}\right) \qquad (3)$$
where $t_{rec}$ is the time under recovery and $\eta$ is a co