Automating Post-Silicon Debugging and Repair

Kai-hui Chang, Igor L. Markov, Valeria Bertacco
EECS Department, University of Michigan, Ann Arbor, MI 48109-2121
{changkh, imarkov, valeria}@umich.edu
ABSTRACT
Modern IC designs have reached unparalleled levels of complexity,
resulting in more and more bugs discovered after design tape-out.
However, so far only very few EDA tools for post-silicon debugging
have been reported in the literature. In this work we develop a
methodology and new algorithms to automate this debugging process.
Key innovations in our technique include support for the physical
constraints specific to post-silicon debugging and the ability to
repair functional errors through subtle modifications of an existing
layout. In addition, our proposed post-silicon debugging methodology
(FogClear) can repair some electrical errors while preserving
functional correctness. Thus, by automating this traditionally manual
debugging process, our contributions promise to reduce engineers'
debugging effort. As our empirical results show, we can
automatically repair more than 70% of our benchmark designs.
1. INTRODUCTION
Due to the high complexity of modern designs and the increasing
pressure to reduce their time-to-market, errors are more likely
to escape verification and are only found after a chip has been
manufactured. Needless to say, such errors must be fixed before the
Integrated Circuits (ICs) can be shipped to customers, making
post-silicon debugging a crucial step in the design process. To this end,
a recent EE Times article quotes: "post-silicon debugging is a dirty
little secret that can cost $15 to $20 million and take six months
to complete" [15]. Indeed, post-silicon debugging has become one
of the most time-consuming parts, 35% on average, of the chip
design cycle [2]. Given that the market window for many modern
products is only a few years long, the delay caused by respins can
dramatically impact revenues. Therefore, it is surprising that only
a few EDA tools and algorithms address this problem [15].
Post-silicon debugging is becoming more important because silicon
ICs offer several advantages not available pre-silicon. One
is that manufacturing defects are becoming increasingly difficult
to simulate, including those caused by antenna, thermal and
inductive effects, as well as diffraction patterns. Non-deterministic
effects, such as manufacturing variability, pose even greater
challenges. As a result, comprehensive validation of a chip can only
be performed after tape-out. In addition, silicon dies allow
at-speed testing, which is orders of magnitude faster than logic
simulation and astronomically faster than electrically-accurate
simulation. If a sufficiently strong post-silicon debugging methodology is
available, more thorough post-silicon verification can be achieved,
enabling the distribution of more reliable designs. Unfortunately,
such a methodology is not yet available today.
Pre-silicon and post-silicon debugging differ in several significant
ways. First, conceptual bugs that require deep understanding
of the chip's functionality often appear in pre-silicon stages only,
and such bugs may not be fixable by automatic tools. On the other
hand, post-silicon functional bugs are often subtle errors that only
affect the output responses of a few input vectors, and their fixes
can usually be implemented with very few gates. However, finding
such fixes requires the analysis of detailed layout information,
making it a highly tedious and error-prone task. As we will show
later, our work can automate this process. Second, errors found
post-silicon typically include functional and electrical problems, as
well as those related to manufacturability and yield. However,
issues identified pre-silicon are predominantly related to functional
and timing errors (see footnote 1). Problems that manage to evade
pre-silicon validation are often difficult to simulate, analyze and
even duplicate.
Third, the observability of the internal signals in a silicon die is
extremely limited. Most internal signals cannot be directly observed,
even in designs with built-in scan chains [5], which enable access
to sequential elements. Fourth, verifying the correctness of a fix is
challenging because it is difficult to physically implement a fix in
a chip that has already been manufactured. Although techniques
such as Focused Ion Beam (FIB) exist [19], they typically can only
change metal layers of the chip and cannot create any new
transistors (this process is often called metal fix; see footnote 2).
Finally, it is especially
important to minimize the layout area affected by each change in
post-silicon debugging because smaller changes are easier to
implement with good FIB techniques, and there is a smaller risk of
unexpected side effects. Due to these unusual circumstances and
constraints, most debugging techniques prevalent in early design stages
cannot be applied post-silicon. In particular, conventional physical
synthesis and Engineering Change Order (ECO) techniques affect
too many cells or wire segments to be useful in post-silicon
debugging. As illustrated in Figure 1(b), a small modification in the
layout that sizes up a gate requires changes in all transistor masks and
refabrication of the chip. To this end, our SafeResynth technique
[10] only selects netlist modifications that require minimal
physical changes. This philosophy is adopted in our work to handle the
unusual constraints of post-silicon debugging.
Figure 1: Post-silicon error-repair example. (a) The original
buggy layout with a weak driver (INV). (b) A traditional
resynthesis technique finds a simple fix that sizes up the
driving gate, but it requires expensive remanufacturing of the
silicon die to change the transistors. (c) Our physically-aware
techniques find a more complex fix using symmetry-based
rewiring, and the fix can be implemented simply with a metal
fix and has smaller physical impact.
Footnote 1: Post-silicon timing violations are often caused by
electrical problems and are only symptoms of such errors.
Footnote 2: Despite the impressive success of the FIB technique at
recent fabrication technology nodes, the use of FIB is projected to
become more problematic at future nodes due to increasingly difficult
access to lower metal layers, limiting how extensive changes can be
and further complicating post-silicon debugging.
Existing techniques for post-silicon debugging strive to provide
more visibility and controllability of the silicon die [2]. Although
such techniques are great aids to engineers, they do not automate
the debugging process itself. To address this problem, we propose
new algorithms and a methodology that facilitate the automation
of post-silicon debugging. These techniques can benefit from
existing Design-For-Debugging (DFD) constructs but can also work
well without them. Key innovations in our techniques include the
support for the unusual physical constraints of post-silicon
debugging and the ability to repair errors by subtle modifications of an
existing layout. As illustrated in Figure 1(c), our techniques are
aware of the physical constraints of the design and can repair
errors with minimal layout and routing changes. To achieve these
goals, we develop algorithms to identify as many candidate fixes as
practically possible, in terms of netlist and layout transformations.
This is important in post-silicon debugging because often only a
few transformations can satisfy all the physical constraints. On the
other hand, we also utilize these constraints in our algorithms
because they can prune the search space effectively due to their highly
restrictive nature. The main contributions of our work include: (1)
a post-silicon debugging methodology, called FogClear, that
automates the debugging process; (2) the PARSyn resynthesis algorithm
that searches for netlist transformations which can be implemented
with limited physical resources; (3) the PAFER framework that
automatically diagnoses and repairs logic errors with minimal
perturbation to the layout; and (4) the adaptation of symmetry-based
rewiring [8, 12] and SafeResynth [10] for post-silicon debugging
to find layout transformations that can repair electrical errors.
Empirical results show that our techniques are effective in repairing
design errors and can greatly reduce engineers' debugging efforts.
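The idea of generating many candidate fixes and then pruning them with physical constraints can be illustrated with a minimal Python sketch. It is not the PARSyn or PAFER algorithm; the CandidateFix fields, the segment budget and the example numbers are all hypothetical, chosen only to mirror the situation in Figure 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateFix:
    """A candidate repair. All fields are illustrative assumptions."""
    name: str
    changes_transistors: bool    # True if transistor masks would have to change
    wire_segments_changed: int   # metal-layer wire segments the fix touches


def is_metal_fix(fix: CandidateFix, max_segments: int) -> bool:
    """Accept a fix only if it leaves transistors untouched and stays
    within a (hypothetical) budget of rewired segments."""
    return (not fix.changes_transistors) and fix.wire_segments_changed <= max_segments


def select_fixes(candidates: List[CandidateFix], max_segments: int = 4) -> List[CandidateFix]:
    """Prune infeasible candidates, then prefer the smallest physical change."""
    feasible = [c for c in candidates if is_metal_fix(c, max_segments)]
    return sorted(feasible, key=lambda c: c.wire_segments_changed)


# Hypothetical candidates loosely modeled on Figure 1.
candidates = [
    CandidateFix("size up driving gate", changes_transistors=True, wire_segments_changed=0),
    CandidateFix("symmetry-based rewiring", changes_transistors=False, wire_segments_changed=2),
    CandidateFix("long reroute", changes_transistors=False, wire_segments_changed=9),
]
for fix in select_fixes(candidates):
    print(fix.name)
```

With these made-up inputs only the rewiring candidate survives: the gate-sizing fix is rejected because it alters transistors, and the long reroute exceeds the segment budget. This shows how restrictive constraints shrink the search space.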
In addition to post-silicon debugging, FogClear can also be
applied to reduce the cost of respins. As the data in [4] suggest, masks
for active device layers contribute about 68% of the total mask cost
at the 100 nm technology node. With mask costs approaching 10
million dollars per set at the 45 nm node (see Figure 2) [26], being
able to reuse transistor masks greatly reduces the cost of a respin.
This can be achieved using FogClear because the layout
transformations it produces only involve changes in the metal layers and
allow the reuse of the transistor masks. In addition, FogClear can
accelerate the post-silicon debugging process and reduce the loss
in revenue caused by delayed market entry.
Figure 2: Estimated mask set costs at different technology
nodes [25]. The transformations produced by FogClear allow
the reuse of transistor masks and thus reduce respin costs.
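To make the respin-cost argument concrete, the following Python sketch splits a mask-set cost into a reusable transistor-mask part and a part that must be remade. Note the assumption: the text reports the 68% share for the 100 nm node and the 10 million dollar figure for the 45 nm node, so combining them is only a rough illustration, not a result from the paper. The function name is hypothetical.

```python
def respin_mask_costs(mask_set_cost: float, active_layer_share: float):
    """Split a mask-set cost into the reusable (transistor) part and the
    part that must be remade for a metal-only respin."""
    if not 0.0 <= active_layer_share <= 1.0:
        raise ValueError("active_layer_share must be between 0 and 1")
    reusable = mask_set_cost * active_layer_share
    return reusable, mask_set_cost - reusable


# Assumption: the 68% share (100 nm data) is applied to a $10M mask set (45 nm).
reusable, remade = respin_mask_costs(10_000_000, 0.68)
print(f"reusable transistor masks: ${reusable:,.0f}")
print(f"metal masks to remake:     ${remade:,.0f}")
```

Under this assumption, roughly 6.8 million dollars of mask cost would be reusable in a metal-only respin, leaving about 3.2 million dollars of masks to remake.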
The rest of the paper is organized as follows. In Section 2 we
describe the current post-silicon debugging methodology and
review some DFD techniques. The debugging process using our
automated FogClear methodology is discussed in Section 3. The
components of FogClear, that is, the functional and electrical error
repair techniques, are explained in detail in Section 4 and Section
5, respectively. Experimental results are shown in Section 6, while
Section 7 concludes this paper.
2. CURRENT POST-SILICON DEBUGGING
METHODOL