Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency

@cs.wisc.edu
Abstract
Untolerated load instruction latencies often have a significant
impact on overall program performance. As one means of miti-
gating this effect, we present an aggressive hardware-based mech-
anism that provides effective support for reducing the latency of
load instructions.
Through the judicious use of instruction predecode, base regis-
ter caching, and fast address calculation, it becomes possible to
complete load instructions up to two cycles earlier than traditional
pipeline designs. For a pipeline with one cycle data cache access,
this results in what we term a zero-cycle load. A zero-cycle load
produces a result prior to reaching the execute stage of the pipeline,
allowing subsequent dependent instructions to issue unfettered by
load dependencies. Programs executing on processors with support
for zero-cycle loads experience significantly fewer pipeline
stalls due to load instructions and increased overall performance.
We present two pipeline designs supporting zero-cycle loads:
one for pipelines with a single stage of instruction decode, and
another for pipelines with multiple decode stages. We evaluate
these designs in a number of contexts: with and without software
support, in-order vs. out-of-order issue, and on architectures with
many and few registers. We find that our approach is quite effective
at reducing the impact of load latency, even more so on
architectures with in-order issue and few registers.
1 Introduction
High-performance computing requires high sustained instruc-
tion issue rates, a goal which can only be achieved if pipeline
hazards are minimized. Data hazards, an impediment to perfor-
mance caused by instructions stalling for results from executing
instructions, can be mitigated by tolerating or reducing instruction
execution latencies.
For many programs, the dominating source of data hazards can
be attributed to load instructions. These operations occur fre-
quently and have longer latency than most other non-numeric in-
structions because they combine address calculation, data cache
access, and occasional accesses to lower levels of the data memory
hierarchy in a single instruction.
A significant body of work is dedicated to reducing the impact
of load latency. The techniques can be broadly divided into two
approaches: latency tolerating techniques and latency reducing
techniques. Tolerating techniques require that independent opera-
tions be moved into unused pipeline delay slots. This reallocation
of processor resources can be performed either at compile time
using instruction scheduling or at run time using techniques such
as out-of-order issue, non-blocking loads, or multi-threading. Re-
ducing techniques decrease or eliminate some component of load
instruction latency; for example, register allocation eliminates the
entire load operation, or (loop) blocking eliminates many cache
miss latencies.
In this paper, we present a microarchitecture design capable of
reducing the latency of load instructions. Through the application
of instruction predecode, base register caching, and fast address
calculation, it becomes possible to complete load instructions up to
two cycles earlier than traditional pipeline designs. For a pipeline
with one cycle data cache access, this results in what we term
a zero-cycle load. A zero-cycle load produces a result prior to
reaching the execute stage of the pipeline, allowing subsequent
dependent instructions to issue unencumbered by load instruction
hazards. Programs executing on processors with support for zero-cycle
loads experience significantly fewer pipeline stalls due to
load instructions and increased overall performance.
We present two pipeline designs supporting zero-cycle loads:
an aggressive design for pipelines with a single stage of instruction
decode, and a less aggressive design for pipelines with multiple
decode stages. We evaluate these designs in a number of contexts:
with and without software support, in-order vs. out-of-order issue,
and on architectures with many and few registers. We find that our
approach is quite effective at reducing the impact of load latency,
even more so on architectures with in-order issue and few registers.
The remainder of this paper is organized as follows: Section 2
details the microarchitecture support required to implement zero-
cycle loads. Section 3 presents a detailed example of zero-cycle
loads in action. In Section 4, we present results of simulation-
based performance studies and Section 5 describes related work.
Finally, Section 6 presents a summary and conclusions.
2 Zero-Cycle Loads
A load, while being a single instruction, is composed of several
smaller component operations, many of which must occur in a
specific order prior to completion of the load and delivery of a
value from memory. Figure 1a illustrates the major component
operations of a load and their required order (see footnote 1).
As shown in Figure 1b, a traditional pipeline fetches loads in
the IF stage of the pipeline. Identifying and aligning loads and
reading the register file occur in the ID stage of the pipeline. In
designs with very fast clocks and wide issue, these operations are
often split across multiple decode stages. Effective address
generation occurs in the EX stage of the pipeline, and data cache
access and address translation in the MEM stage of the pipeline.
This organization creates a two cycle latency for load instructions.

Footnote 1: Many variations exist upon this basic template; for
example, some pipelines require address translation to complete
before accessing the data cache, while other designs require that
register file access occur after a load has been aligned into a pipe
that services load instructions. This basic template is, however,
representative of many modern pipeline designs.

Appears in: Proceedings of the 28th Annual International Symposium on Microarchitecture

[Figure 1: Zero-Cycle Loads. a) Component operations of a load:
fetch load; identify load; align load; read base/index registers;
arbitrate adder/cache; compute effective address; access data cache;
translate effective address. b) Traditional pipeline (two cycle
latency). c) Pipeline with support for zero-cycle loads (zero cycle
latency); its stage labels include IF (or early ID), ID, EX, and MEM.]

With support for zero-cycle loads (Figure 1c), loads can com-
plete up to two cycles earlier than the traditional pipeline. We
accomplish this optimization with two basic mechanisms. First,
we use instruction predecode and base register caching to reduce
the time required to decode and issue a load instruction. Instruction
predecode reduces the latency to identify and align the loads in a
group of fetched instructions. Base register caching provides the
necessary high-bandwidth access to base and index register values
early in the pipeline. Second, we employ fast address calculation
[APS95] to reduce the latency of data cache access. Fast address
calculation is a stateless set index predictor that allows address
calculation and data cache access to proceed in parallel.
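
The idea of a stateless set index predictor can be illustrated with a
small sketch. This is not the mechanism of [APS95] as specified there;
it is a minimal model under stated assumptions: a hypothetical cache
with 32-byte blocks and 256 sets, 32-bit addresses, and a carry-free
guess formed by OR-ing the set index fields of the base and offset.
The names predict_set_index and speculative_load are ours. The guess
starts the cache access early, while the full addition completes in
parallel and verifies it.

```python
# Hypothetical cache geometry (an assumption, not taken from the paper).
BLOCK_BITS = 5                      # 32-byte cache blocks
SET_BITS = 8                        # 256 sets
ADDR_MASK = 0xFFFFFFFF              # 32-bit addresses
SET_MASK = (1 << SET_BITS) - 1


def set_index(addr):
    """Extract the set index field from an address."""
    return (addr >> BLOCK_BITS) & SET_MASK


def predict_set_index(base, offset):
    """Carry-free guess at the set index (assumed scheme: bitwise OR).

    No history or table is consulted, so the predictor is stateless.
    """
    return set_index(base) | set_index(offset)


def speculative_load(base, offset):
    """Return (predicted index, actual index, prediction correct?)."""
    predicted = predict_set_index(base, offset)   # available early
    effective = (base + offset) & ADDR_MASK       # full add, in parallel
    actual = set_index(effective)
    return predicted, actual, predicted == actual


if __name__ == "__main__":
    # Small offset with no carry into the index field: prediction holds.
    print(speculative_load(0x10000000, 8))
    # Addition carries through the index field: prediction fails and the
    # access would have to be replayed with the correct address.
    print(speculative_load(0x10000FE0, 0x40))
```

In this model the first access is predicted correctly and the second
is not, which is the case that requires a recovery mechanism.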
The combination of early issue and faster data cache access
results in a pipeline design capable of producing a load result two
cycles earlier than traditional organizations. For pipelines with
a single cycle data cache access, it becomes possible to forward
a load result into the execute stage of the pipeline. If latency is
defined as the number of cycles from the beginning of the execute
stage for an operation to produce a result, successfully speculated
loads will appear to have zero latency, hence the moniker zero-cycle
loads.
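
The latency definition above can be worked through with a short
sketch. It assumes a five stage pipeline with one cycle per stage; the
stage name WB and the helper names load_latency and stall_cycles are
ours. The stall estimate further assumes in-order issue, full
forwarding, and a dependent instruction that follows its producer by
one cycle.

```python
# Assumed stage order, one cycle per stage (WB is our name for the last stage).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def load_latency(result_stage):
    """Cycles from the beginning of EX until the result is available.

    result_stage is the stage at whose end the value is produced.
    """
    ready_cycle = STAGES.index(result_stage) + 1   # end of that stage
    ex_start = STAGES.index("EX")                  # EX begins when ID ends
    return ready_cycle - ex_start


def stall_cycles(latency):
    """Stalls seen by a dependent instruction issued one cycle later."""
    return max(0, latency - 1)


if __name__ == "__main__":
    traditional = load_latency("MEM")   # value arrives at the end of MEM
    zero_cycle = load_latency("ID")     # value arrives before EX begins
    print(traditional, stall_cycles(traditional))
    print(zero_cycle, stall_cycles(zero_cycle))
```

Under these assumptions the traditional load has a latency of two
cycles and stalls an immediately dependent instruction for one cycle,
while a successfully speculated load has a latency of zero and causes
no stall.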
Not all loads can execute with zero latency. Early issue in-
troduces new register and memory interlocks, and fast address
calculation will occasionally mispredict the effective address, ne-
cessitating a recovery mechanism. In the following subsections,
we more fully explore the implementation of zero-cycle loads,
examining both the organizational and pipeline control impacts of their
use. Two designs are presented in detail: an aggressive design for
pipelines with only a single stage of instruction decode, and a less
aggressive design for pipelines with multiple decode stages.
2.1 Implementation with One Decode Stage
Achieving a zero-cycle load in a five stage pipeline is a very
challenging task: the load instruction must complete in only two
pipeline stages. Assuming data cache access takes one cycle, all
preceding component operations must complete in only a single
cycle. Figure 2 shows one approach to implementing zero-cycle
loads in a pipeline with a single decode stage.
2.1.1 Organization
Fetch Stage. In the fetch stage of the processor, the instruction
cache and base register and index cache (or BRIC) are accessed in
parallel with the address of the current PC.
The instruction cache returns both instructions and predecode
information. The predecode information is generated at instruc-
tion cache misses and describes the loads contained in the fetched
instructions. The predecode data is supplied directly to the pipes
which execute loads, permitting the tasks of fetching, identifying,
and aligning loads to complete by the end of the fetch stage.
The predecode data for each load consists of three fields:
the addressing mode, base register type, and offset. The
addressing mode field specifies either register+constant
or register+register addressing (if supported in the ISA).
The base register type is one of the following: SP load, GP load,
or other load. SP load and GP load specify a load using the stack
or global pointer [CCH+87] as