ACTUAL-DELAY CIRCUITS ON FPGA: TRADING-OFF LUTS FOR SPEED

Evangelia Kassapaki, Pavlos M. Mattheakis and Christos P. Sotiriou
Institute of Computer Science, FORTH, Crete, Greece.
email: kassapak@ics.forth.gr, pmat@ics.forth.gr, sotiriou@ics.forth.gr
ABSTRACT
FPGA devices exhibit manufacturing variability. Device ratings and
timing margins are typically used to cope with inter-device and
intra-device variability, respectively. Actual-delay circuits operate
according to the actual, physical delays of the FPGA device components
and not according to STA predictions. They exhibit data-dependent delay
and latency and provide output completion detection, and can thus
detect when their outputs are ready to be latched. In this paper we
demonstrate an FPGA flow, based around existing FPGA tools, capable of
implementing actual-delay circuits, and through worst-case, upper-bound
analysis show that such circuits exhibit reduced delay and higher
performance than their conventional counterparts. The FPGA flow
incorporates a logic synthesis transformation step for converting a
conventional single-rail circuit to a monotonic, dual-rail one, and a
LUT mapper for mapping the latter to LUTs while preserving
monotonicity. In addition, through our implementation of actual-delay
circuits we are able to measure intra-FPGA and inter-FPGA timing
margins, and we present results on the STA margin and timing deviation
over four devices.
1. INTRODUCTION
The two components of timing variability, i.e. static variability,
originating from manufacturing variations, and dynamic variability,
originating from operating conditions at the interior and exterior of a
circuit, have been identified as the most important limits to
transistor and technology scaling. In the sub-90nm era, device
parameters vary significantly both within a device and across different
devices.
Static Timing Analysis (STA), the cornerstone of ASIC EDA and FPGA
timing sign-off, has been recognized to be overly pessimistic and
imprecise and, as a consequence, responsible for significant
performance loss in designs aiming at performance. Statistical static
timing analysis (SSTA) techniques [1] have been proposed in an effort
to apply statistical timing distribution models so as to better predict
the actual delay distribution for a number of devices. However, dynamic
behaviour is difficult to model using a mathematical formula and a
large number of parameters cannot be defined.
The authors would like to thank Xilinx Inc. for providing the FPGA
hardware and the ISE software used in this research.
The authors are also affiliated with the University of Crete, Greece.
An alternative approach to variation tolerance is to employ design
methodologies which allow the actual delay of the circuit to be
measured or employed during its operation, instead of attempting to
improve the accuracy of the delay prediction. Delay fault testing,
which has been proposed in the past, has been shown to be costly, as it
requires the designer to start from a very expensive initial two-level
representation and then use only a subset of logic optimizations [2].
Moreover, delay test patterns are more numerous than functional fault
test patterns, and hence require more time on expensive testing
machines.
It is possible to implement circuits for which conventional STA is used
as a sign-off method solely to provide an upper bound on their delay,
while during actual circuit operation their delay depends on the data
and on manufacturing and operating conditions. Such a methodology has
been presented in [3], i.e. monotonic, dual-rail circuits with
completion detection. The difference between the latter approach and
delay-fault testing is that it provides data-dependent latency and the
ability to adapt to variations, instead of solely providing support for
delay testing critical paths through an external stimulus.
The aims of this work are twofold: firstly, to demonstrate an FPGA flow
capable of implementing actual-delay circuits through a dual-rail,
monotonic circuit implementation, and secondly, to demonstrate that
such circuits exhibit reduced delay and higher performance than their
conventional counterparts when implemented on any conventional FPGA.
The paper is organized as follows. Section 2 provides the motivation
behind this work and the previous research in the area. Section 3
describes in detail the FPGA design flow for actual-delay circuits.
Section 4 describes the experimental framework for measuring the delay
of the monotonic, actual-delay circuits and the error margins. Section
5 presents a detailed comparison between conventional single-rail
implementations for 25 IWLS benchmark circuits and their actual-delay,
monotonic, dual-rail counterparts in terms of the actual-delay
circuits' worst-case timing, i.e. the upper bound of their delay, and
of their LUT occupancy. Section 5 also includes additional results on
the STA prediction accuracy and on delay variation across FPGA devices.
2. PRIOR WORK
The key advantage of circuits with completion detection is their
ability to operate using actual, or true, delays, instead of the
worst-case delays employed in traditional clocked design. Circuits with
completion detection are capable of sustaining average-case
performance, due to their data-dependent nature. Such circuits are very
common in asynchronous systems, as they remove the need for timing
assumptions.
The most common method for implementing circuits with completion
detection combines dual-rail signal encoding [4] with a two-phase
set/reset mode of operation. Circuit inputs and outputs strictly
alternate between a spacer word, i.e. the reset phase, which carries no
data information and is used for synchronization, and a dual-rail
codeword, i.e. the computation or set phase. The two-phase discipline
ensures monotonic behavior at circuit outputs, avoiding hazards [5],
and provides a clean separation between consecutive data words. Several
approaches to implementing circuits using dual-rail encodings have been
proposed in the literature [6, 7, 3].
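The spacer/codeword alternation can be illustrated with a minimal Python sketch. The concrete encoding used here (an all-zero spacer, with (1, 0) for logic 1 and (0, 1) for logic 0) and all function names are assumptions made for illustration; they are not taken from the paper.

```python
SPACER = (0, 0)  # assumption: all-zero spacer on both rails


def encode_bit(bit):
    """Encode one single-rail bit as a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)


def encode_word(bits):
    """Encode a list of single-rail bits as a dual-rail codeword."""
    return [encode_bit(b) for b in bits]


def word_state(word):
    """Classify a dual-rail word as 'spacer', 'codeword' or 'in transit'."""
    if any(pair == (1, 1) for pair in word):
        raise ValueError("both rails asserted: not a legal dual-rail value")
    valid = [t ^ f for t, f in word]  # 1 where exactly one rail is high
    if all(valid):
        return "codeword"   # set phase complete: data may be latched
    if not any(valid):
        return "spacer"     # reset phase complete
    return "in transit"     # some bits have arrived, others have not
```

For example, `word_state(encode_word([1, 0, 1]))` returns `"codeword"`, while a word in which only some pairs have left the spacer is reported as `"in transit"`. A completion detector plays the role of `word_state` in hardware.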
The approach of [6], i.e. DIMS, is based on implementing a logic
function as sums of minterms, with every minterm realized using a
C-element, i.e. a sequential gate with function
$f = (a_1 a_2 \cdots a_n) + f\,(a_1 + a_2 + \cdots + a_n)$
for $n$ inputs, $a_1$ to $a_n$. DIMS, despite its advantage of being
strongly-indicating [8], i.e. a change at an output implies the
completion of all its transitive fan-in nodes, is expensive area-wise.
The reported area increase for DIMS lies somewhere between 3 and 4
times that of the original single-rail circuit, for an ASIC circuit, as
none of the conventional logic optimizations are possible, due to its
structure.
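The C-element function above can be modelled in a few lines of Python. The `DimsAnd2` class is a hypothetical illustration of a DIMS-style 2-input AND (one C-element per input minterm, with the minterms ORed onto the output rails). It assumes an all-zero spacer and (true_rail, false_rail) pairs, and it is a sketch, not the construction of [6] verbatim.

```python
def c_element(inputs, prev):
    """Next output of an n-input C-element: rises when all inputs are 1,
    falls when all inputs are 0, and otherwise holds its previous value."""
    return int(all(inputs) or (prev and any(inputs)))


class DimsAnd2:
    """Hypothetical DIMS-style dual-rail AND of two inputs."""

    def __init__(self):
        # One state bit per minterm C-element, keyed by (rail of a, rail of b).
        self.state = {(ra, rb): 0 for ra in "tf" for rb in "tf"}

    def step(self, a, b):
        """Apply dual-rail inputs a, b and return the dual-rail output."""
        rails_a = {"t": a[0], "f": a[1]}
        rails_b = {"t": b[0], "f": b[1]}
        for ra, rb in list(self.state):
            prev = self.state[(ra, rb)]
            self.state[(ra, rb)] = c_element([rails_a[ra], rails_b[rb]], prev)
        z_true = self.state[("t", "t")]            # a AND b is 1
        z_false = (self.state[("t", "f")]          # any other minterm gives 0
                   | self.state[("f", "t")]
                   | self.state[("f", "f")])
        return (z_true, z_false)
```

In this model the output leaves the spacer only after both inputs carry a codeword, and returns to the spacer only after both inputs have been reset, which is the strongly-indicating behaviour described above.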
In [7], i.e. NCLX, a dual-rail network is created by using De Morgan's
law to create logic duals for every gate and eliminating inverters by
using complementary rails. Completion detection is carried out
separately from data evaluation. The key advantage is a drastic
reduction in area compared to DIMS, as it enables logic synthesis
algorithms to be used for logic minimization, albeit with only positive
gates. The use of a positive spacer throughout all the logic levels of
the circuit implies that this approach allows only for positive logic
gates, and therefore the latency of a dual-rail circuit implemented
with this approach will always be greater than that of single-rail
logic implemented using a mixture of positive and negative gates.
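The dual-rail expansion described above can be sketched in Python as follows. The helper names, the all-zero spacer and the example function are assumptions chosen for illustration; this is not the NCLX tool's actual algorithm, only the De Morgan construction it is described as using.

```python
from itertools import product


def enc(bit):
    """Dual-rail encode one bit as (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)


def dr_and(a, b):
    # True rail: AND of true rails. False rail: De Morgan dual (OR).
    return (a[0] & b[0], a[1] | b[1])


def dr_or(a, b):
    # True rail: OR of true rails. False rail: De Morgan dual (AND).
    return (a[0] | b[0], a[1] & b[1])


def dr_not(a):
    # An inverter costs no gate: the two rails are simply swapped.
    return (a[1], a[0])


def circuit(a, b, c):
    """Illustrative function z = NOT(a AND b) OR c, positive gates only."""
    return dr_or(dr_not(dr_and(a, b)), c)


def done(outputs):
    """Separate completion detection: every output pair has one rail high."""
    return all(t | f for t, f in outputs)


def is_monotonic():
    """Raising any input rail must never lower an output rail."""
    rails = list(product((0, 1), repeat=6))
    for x, y in product(rails, repeat=2):
        if all(xi <= yi for xi, yi in zip(x, y)):
            zx = circuit(x[0:2], x[2:4], x[4:6])
            zy = circuit(y[0:2], y[2:4], y[4:6])
            if zx[0] > zy[0] or zx[1] > zy[1]:
                return False
    return True
```

Because only AND and OR gates appear on the rails, an all-zero spacer at the inputs yields an all-zero spacer at the output, and `is_monotonic()` holds by construction.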
In [3], an approach for dual-rail, monotonic circuits with a double
spacer is presented, which solves the latency problem of the NCLX
approach by allowing negative gates to be used. In addition, a fast
reset mechanism is presented, which removes the need for two almost
equally long phases. Of course, area overhead is a problem with all of
these approaches, as it cannot be reduced below a 2x penalty, due to
the dual-rail encoding. With this latter approach, a 30% increase in
delay is observed when performing STA on the dual-rail circuit, along
with a 100% area increase. However, bearing in mind that for ASIC flows
the inherent margin between worst-case and typical delays is between 60
and 100% [3], along with the fact that STA does not take into account
the data-dependent nature of dual-rail circuits, it can be concluded
that circuits with completion detection can operate at 24 to 65% higher
frequencies than conventional ones.
In this work we developed a flow for implementing actual-delay circuits
on FPGAs, based on the latter approach.
3. DUAL-RAIL FPGA DESIGN FLOW
The FPGA design flow for implementing actual-delay circuits is shown in
Figure 1. It consists of 5 steps, from the circuit specification in the
BLIF or Verilog format to the Placed and Routed (P&R) circuit on the
FPGA device and the derivation of its critical path vectors. The flow
and experimental setup presented focus on combinational logic circuits;
however, it is straightforward to implement actual-delay sequential
circuits by embedding a sequential element control scheme, e.g.
De-synchronization [9], which synchronizes sequential elements based on
the completion signals of the
combinational logic. The o