On Modeling the Lifetime Reliability of Homogeneous Manycore Systems
d circuits. In this work, we model
the lifetime reliability of homogeneous manycore systems using a
load-sharing nonrepairable k-out-of-n:G system with general failure
distributions for embedded cores. In manycore systems, an embed-
ded core can be in operational, cold standby, or warm standby state
depending on system redundancy schemes and their workloads. We
then use the proposed model to analyze the impact of different redundancy
schemes and configurations on the lifetime reliability of manycore
systems.
1. INTRODUCTION
While the relentless scaling of CMOS technology has brought with
it enhanced functionality and improved performance in every new
generation, the associated ever-increasing on-chip power and tem-
perature densities make the lifetime reliability of high-performance
integrated circuits (ICs) one of the major concerns for the industry
[3, 19].
The failure mechanisms that contribute to permanent IC failures
(e.g., time-dependent dielectric breakdown (TDDB) and electromigration)
have been extensively studied at the circuit level in the past,
and they are shown to be strongly related to the temperature and volt-
age applied to the circuit. Recently these failure mechanisms have
been revisited at the processor microarchitecture level due to their
increasing impact with technology scaling [18, 19].
The above models mainly target unicore processor chips. State-of-
the-art computing systems (e.g., multi-digital signal processor (DSP)
system [11], general-purpose processors), however, have started to
employ multiple cores on a single silicon die to improve performance
through parallel execution instead of frequency increase, which has
the benefits of power efficiency and short time-to-market [7]. A 128-core
GPU [15] and a 64-core general-purpose multiprocessor [22]
have already been released to the market. Various research teams
have projected that thousand-core processor chips will become commercially
available in the foreseeable future [1, 4]. For such large-scale
manycore systems fabricated with the latest technology, how to
model their lifetime reliability is an interesting and relevant problem.
Assuming the failure distribution of an embedded core is known
a priori, we analyze the lifetime reliability of manycore systems in
this work. We make the following observations during the modeling
process:
• Embedded cores age in operation. That is, we expect an
  increasing failure rate (IFR) as a core gets older.
• Manycore systems are k-out-of-n:G systems¹, in which n is the
  total number of processor cores fabricated on-chip and k is the
  number of cores required for the system to function correctly.
  Generally speaking, the value of n is larger than the value of k
  to provide fault tolerance [24].
• Manycore systems are load-sharing systems, i.e., each embedded
  core is designed to carry only part of the load assigned by
  the operating system (OS). In fact, a core's failure rate and the
  associated lifetime depend significantly on its workload, which
  determines the temperature and voltage applied to the circuit.
• Manycore systems are nonrepairable systems. That is, unlike
  traditional board-level multiprocessor systems, which can be easily
  repaired by replacing a defective processor chip, embedded
  cores in manycore systems are integrated on a silicon die, and it
  is extremely difficult, if not impossible, to repair or replace
  a faulty core.
Based on the above observations, we model the lifetime reliability
of manycore systems using a load-sharing nonrepairable k-out-of-
n:G system with general lifetime distributions for embedded cores.
To the best of our knowledge, this is the first comprehensive reliability
model for such complex systems.
Manycore systems can be configured in two ways to achieve reliability:
(i) gracefully degrading systems, which use all failure-free
cores to execute tasks. When a core failure is detected, these systems
attempt to reconfigure to a system with one fewer module; (ii)
standby redundant systems, which execute tasks on active cores. Upon
detection of the failure of an active core, these systems attempt to
replace the faulty unit with a spare unit. Depending on the above
configurations and the current workload, cores can be in normal functional
mode, warm standby, or cold standby state, which has direct
implications for the ageing effect of the manycore system. In this
paper, we use the proposed model to analyze the impact of different
configurations and redundancy schemes on the lifetime reliability of
manycore systems. This will help designers make architecture
decisions that achieve their design objectives.
The remainder of this paper is organized as follows. In Section 2,
we present preliminaries and motivation for this work. The proposed
lifetime reliability model for manycore systems is then discussed in
detail in Section 3. Experimental results for different manycore system
configurations are presented in Section 4. Finally, Section 5 concludes
this paper.
¹ An n-component system that works (or is good) if at least k of the n components
work is called a k-out-of-n:G system.
2. PRELIMINARIES
In this paper, we consider homogeneous manycore systems that
have n identical embedded cores fabricated on-chip. In order to function
correctly, at least k (k ≤ n) cores need to be good. These cores
share the workload designated by the operating system. Evidently,
this is a k-out-of-n:G load-sharing system. Before discussing
the technical details of the proposed lifetime reliability model, we
present some preliminaries in this section.
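
As a point of reference, the k-out-of-n:G definition can be illustrated
with the classical closed form for n independent, identical components
that do not share load: the system works if at least k components work.
This is only a simplified baseline under that independence assumption,
not the load-sharing model proposed in this paper. The function name and
the numeric values are illustrative assumptions.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent components work.

    Assumption (not from the paper): every component has the same
    reliability r at the time of interest, and failures are independent,
    i.e., there is no load sharing and there are no standby states.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("require 0 <= r <= 1")
    # Sum the binomial probabilities of exactly i working components.
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical 8-core chip with per-core reliability 0.9:
    # lowering k (more spare capacity) raises the system reliability.
    for k in (8, 7, 6):
        print(k, k_out_of_n_reliability(k, 8, 0.9))
```

Under load sharing, the per-core reliability itself changes as cores fail
and the surviving cores take on more work, which is why this closed form
does not apply directly to manycore systems.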
2.1 IC Lifetime Reliability
Integrated circuit errors can be broadly classified into two categories:
soft errors and hard errors. As soft errors caused by radiation
effects do not fundamentally damage the circuit, they are not viewed
as lifetime reliability threats. In this paper we mainly consider those
hard errors that are permanent once they manifest, such as TDDB in
the gate oxides, electromigration (EM) and stress migration (SM) in
the interconnects, and thermal cycling (TC).
The above failure mechanisms have an increasingly adverse effect
with technology scaling, and have therefore attracted renewed research
interest recently. Srinivasan et al. [19] described a so-called RAMP
model that is able to dynamically track lifetime reliability of a pro-
cessor according to changes in application behavior. Their model,
however, is inherently inaccurate because it assumes a uniform de-
vice density over the chip and an identical vulnerability of devices to
failure mechanisms. To address this problem, Shin et al. [18] intro-
duced a structure-aware model that takes the vulnerability of basic
structures of the microarchitecture (e.g., register files, latches and
logic) to different types of failure mechanisms into account. Coskun
et al. proposed a cycle-accurate lifetime reliability simulation method-
ology as well as a statistical one in [5] and used them to optimize the
processor power management policy. In [20], Srinivasan et al. stud-
ied the vulnerability of FPGAs to TDDB and EM effects.
2.2 Modeling Processor Core Behavior
We assume embedded cores execute tasks independently (an ap-
plication however may consist of a series of tasks [14]) and one core
can perform at most one task at a time. In addition, the tasks assigned
to a certain core are assumed to be stored in a first-in-first-out
(FIFO) buffer with infinite capacity when the core is busy. Once the
core becomes available, it starts to process the next task in the FIFO
promptly. As shown in Fig. 1, a core can be in active mode or spare
mode in the manycore system (depending on redundancy configurations).
For spare processor cores, the power supply can be reduced
significantly or turned off completely; we therefore treat them
as cold standby components with zero failure rate. For active cores,
depending on the current workload, they can be in two states: pro-
cessing or wait, which denote the state that the cores are performing
tasks or waiting for task allocation, respectively. Generally speaking,
cores operate at higher temperature in processing state and hence will
wear out more quickly than in wait state. We therefore regard cores
in wait state as warm standby components in this work, and we use
$R_p(t)$ and $R_w(t)$ to denote the reliability functions of cores in
processing state and wait state, respectively, where they have the same
shape but different scale parameters. For example,
$R_p(t) = e^{-(t/\theta_p)^{\beta}}$ and $R_w(t) = e^{-(t/\theta_w)^{\beta}}$,
wherein $\theta_p$ and $\theta_w$ are scale parameters and $\beta$ is the
common shape parameter.
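
One common lifetime family with a shape and a scale parameter is the
Weibull distribution. The sketch below assumes that form to show how two
reliability functions with the same shape but different scales compare,
and how a shape parameter greater than one gives an increasing failure
rate. The function names and all parameter values (shape 2.0, scales of
5 and 10 time units) are hypothetical and are not taken from the paper.

```python
import math


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """R(t) = exp(-(t / scale) ** shape) for t >= 0."""
    if t < 0 or scale <= 0 or shape <= 0:
        raise ValueError("require t >= 0, scale > 0, shape > 0")
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, scale: float, shape: float) -> float:
    """Failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    if t <= 0 or scale <= 0 or shape <= 0:
        raise ValueError("require t > 0, scale > 0, shape > 0")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    SHAPE = 2.0      # hypothetical common shape parameter (> 1 means IFR)
    SCALE_P = 5.0    # hypothetical scale for the processing state
    SCALE_W = 10.0   # hypothetical scale for the wait state (slower wear-out)
    for t in (1.0, 3.0, 5.0):
        r_p = weibull_reliability(t, SCALE_P, SHAPE)
        r_w = weibull_reliability(t, SCALE_W, SHAPE)
        h_p = weibull_hazard(t, SCALE_P, SHAPE)
        print(f"t={t}: R_p={r_p:.4f} R_w={r_w:.4f} h_p={h_p:.4f}")
```

With a larger scale parameter for the wait state, the wait-state
reliability stays above the processing-state reliability at every t > 0,
which matches the warm standby interpretation.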
According to the above discussion, if manycore systems are configured
as a gracefully degrading system, embedded cores cannot be
in spare mode and hence they are in either processing or wait state.
The number of cores in either state at a particular moment depends
on the current workload and hence is uncertain. If, however,
manycore systems are configured as a standby redundant system, an
embedded core can serve as a cold standby, a warm standby, or a processing
core. As k cores are active, we know exactly how many cores
are cold standbys, but the number of cores in
processing or wait state at a specific time is again uncertain.

Figure 1: The Embedded Core Behavior.
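
The single-core task model described in this section (a FIFO buffer, at
most one task at a time, and the processing and wait states) can be
sketched as a small timeline computation. Given task arrival times and
execution times for one active core, it returns how long the core spends
in each state over an observation window. The function name, the input
format, and the example numbers are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def core_state_times(
    arrivals: Sequence[float],
    durations: Sequence[float],
    horizon: float,
) -> Tuple[float, float]:
    """Return (processing_time, wait_time) of one active core on [0, horizon].

    Tasks are served in arrival order (FIFO), one at a time. A task that
    arrives while the core is busy waits in the buffer until the core is free.
    """
    if len(arrivals) != len(durations):
        raise ValueError("arrivals and durations must have the same length")
    busy_until = 0.0
    processing = 0.0
    # A stable sort on arrival time preserves FIFO order for simultaneous arrivals.
    for arrive, duration in sorted(zip(arrivals, durations), key=lambda x: x[0]):
        start = max(arrive, busy_until)
        if start >= horizon:
            break
        end = min(start + duration, horizon)  # clip at the end of the window
        processing += end - start
        busy_until = start + duration
    return processing, horizon - processing


if __name__ == "__main__":
    # Hypothetical workload: the second task arrives while the first is running.
    print(core_state_times([0, 1, 10], [3, 2, 4], horizon=12))
```

The split between the two states is what drives the ageing of an active
core, since the processing state is assumed to wear the core out faster
than the wait state.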
2.3 Related Work on Modeling k-out-of-n:G Systems
While there has been a large amount of research work on modeling
the lifetime reliability of multi-component systems, most of it has
focused on parallel systems that are designed to carry full load, as
shown in [10, 23].
In the literature on load-sharing k-out-of-n:G systems, for the
sake of simplicity, many studies (e.g., [16, 12]) assume an expo-
nential lifetime distribution for every c