Continuous Optimization
Continuous Optimization
Continuous Optimization
Brian Fahs
Todd Rafacz
Sanjay J. Patel
Steven S. Lumetta
Center for Reliable and High Performance Computing
University of Illinois at Urbana-Champaign
{bfahs,trafacz,sjp,steve}@crhc.uiuc.edu
Abstract
This paper presents a hardware-based dynamic opti-
mizer that continuously optimizes an applications instruc-
tion stream. In continuous optimization, dataow optimiza-
tions are performed using simple, table-based hardware
placed in the rename stage of the processor pipeline. The
continuous optimizer reduces dataow height by perform-
ing constant propagation, reassociation, redundant load
elimination, store forwarding, and silent store removal. To
enhance the impact of the optimizations, the optimizer in-
tegrates values generated by the execution units back into
the optimization process. Continuous optimization allows
instructions with input values known at optimization time
to be executed in the optimizer, leaving less work for the
out-of-order portion of the pipeline. Continuous optimiza-
tion can detect branch mispredictions earlier and thus re-
duce the misprediction penalty. In this paper, we present a
detailed description of a hardware optimizer and evaluate it
in the context of a contemporary microarchitecture running
current workloads. Our analysis of SPECint, SPECfp, and
mediabench workloads reveals that a hardware optimizer
can directly execute 33% of instructions, resolve 29% of
mispredicted branches, and generate addresses for 76% of
memory operations. These positive effects combine to pro-
vide speed ups in the range 0.99 to 1.27.
1. Introduction
Dynamic optimization offers opportunities beyond static
compiler optimization because of its ability to dynamically
identify hot execution paths and adapt to changes in pro-
gram behavior. Many dynamic optimizers [1, 2, 5, 7, 8, 9,
11, 20] share a common overall structure: they (1) select
regions (functions, traces, hyperblocks, etc.) through some
form of dynamic proling, (2) optimize the selected re-
gions, (3) cache the optimized versions, and (4) exchange
future occurrences of the original regions with the opti-
mized versions. We propose a dynamic optimization sys-
tem, which we call continuous optimization, that does not
require proling of the instruction stream or caching of the
optimized instructions. Instead, dataow optimizations are
applied to each fetched instruction using a table-based hard-
ware optimizer located directly in the processor pipeline.
Retire
Fetch Decode Rename Schedule
Execute
Reg. Read
Optimizer
RAT
Figure 1. High-level view of optimizer inte-
grated into a dynamically-scheduled proces-
sor pipeline.
As depicted in Figure 1, continuous optimization is inte-
grated with the register alias table (RAT) in the rename stage
of a dynamically-scheduled processor. The continuous op-
timizer uses the hardware tables described in Sections 2
and 3 to perform constant propagation, reassociation, redun-
dant load elimination, store forwarding, and silent store re-
moval. Execution results are fed back to, and exploited by,
the optimizer to enhance performance. This feedback com-
bined with optimization allows 33% of retired instructions
to be non-speculatively executed early in the pipeline by
the optimizer. The optimizers objectives are (1) to reduce
dataow height, and (2) to execute simple instructions dur-
ing optimization. These objectives are symbiotic, and com-
bine to provide an average speed up of 1.11 over a standard
pipeline.
Some previously proposed hardware-based dynamic op-
timizers [1, 9] perform classical compiler optimizations of-
ine using an abstract optimizer operating on discrete in-
struction traces. Although continuous optimization can be
adapted for optimizing instruction traces, it is not limited
to discrete regions, i.e., it can impact a greater span of in-
structions. There have been numerous in-pipeline optimiza-
tion techniques [4, 6, 21, 23, 24, 27]. Continuous optimiza-
tion subsumes and extends many of them by aggressively
optimizing dataow through registers and memory to re-
duce dataow height and to increase instruction-level par-
allelism (ILP).
This paper makes several contributions. First, the no-
tion of continuous optimization is presented along with a
detailed description of a hardware implementation of such
an optimizer. Second, continuous optimizations impact and
sensitivity to design choices are characterized. Third, con-
tinuous optimizations contributing performance factors are
analyzed. Finally, a quantitative comparison is made with
other in-pipeline optimization techniques, and continuous
optimization is shown to provide signicantly higher per-
formance.
This paper is organized as follows. Section 2 provides
motivation and high-level details for continuous optimiza-
tion. Hardware requirements are discussed in Section 3.
Section 4 presents our experimental infrastructure. Sec-
tion 5 provides the performance characterization, and Sec-
tion 6 presents sensitivity studies. Section 7 presents related
work along with comparisons to previously proposed in-
pipeline optimization techniques. Section 8 provides a sum-
mary of the ndings.
2. Continuous optimization
.
from
execute
.
.
.
opt.
instrs
instrs
.
.
.
.
.
feedback from RLE/SF
feedback
value
SSR
RLE/SF/
CP/RA
&
RAT
RAT
Optimizer
Figure 2. Architectural view of optimizer.
Figure 2 provides more details of the optimizer. Con-
stant propagation (CP) and reassociation (RA) are imple-
mented with additional logic integrated into the register
alias table (RAT). Redundant load elimination (RLE), store
forwarding (SF), and silent store removal (SSR) are per-
formed with a separate, cache-like structure accessed af-
ter the RAT. Following their decoding, all instructions have
their inputs and outputs renamed while being simultane-
ously transformed to a more parallel form. The crux of the
optimizer is a symbolic representation of each architectural
register value, which is maintained in the optimization ta-
bles. The symbolic representation is leveraged to increase
dataow parallelism and to reduce memory accesses. As
an enhancement, execution results are fed back to the op-
timizer to increase effectiveness. We call this process value
feedback, and call the values fed back known values to dif-
ferentiate them from symbolic information available before
instructions execute. Instruction optimization does not stall
waiting for value feedback because symbolic values sufce
for correctness. Bekerman et al [4] was rst to propose value
feedback for early load address resolution.
We now discuss the operation of the optimizer, explain
and motivate its design, and discuss positive and negative
implications of continuous optimization.
2.1. Symbolic register values
All optimizations operate on symbolic register values
of the form
(reg<<scale)
offset
, where
reg
is a
physical register,
scale
is a two-bit left shift quantity
1
, and
offset
is a 64-bit immediate eld. The symbolic expres-
sion was chosen because it enables a variety of optimiza-
tions, as described in the next section, and approximately
55% of instructions executed match its form
2
.
2.2. Optimizations
The optimizations performed are described below:
Constant propagation (CP)
propagates known val-
ues from producers to consumers. Simple instructions
with known constant inputs can be executed in the opti-
mizer. If, for example,
add r3, 4 -> r4
is being op-
timized, and
r3
is known to be
3
, the add is performed
and
7
is moved into
r4
. This optimization subsumes con-
stant folding.
Reassociation (RA)
transitively attens symbolic ex-
pressions, reducing dataow height and increasing ILP
by transferring dependences to earlier producers. A con-
sumers input is copied from its producers input, and
its
offset
and
scale
are recalculated. This optimiza-
tion subsumes copy propagation.
Redundant load elimination (RLE)
combines load op-
erations accessing identical memory locations; the loads af-
ter the rst are converted to move operations, which are then
optimized away in a manner similar to [11, 15].
Store forwarding (SF)
converts loads referencing re-
cently stored values into move operations, which are then
optimized away as in redundant load elimination.
Silent store removal (SSR)
removes store operations
that write a value identical to the currently stored value.
RLE/SF/SSR follow CP/RA because constant propa-
gation and reassociation simplify
reg
offset
address
specications, enabling effective redundant load elimina-
tion, store forwarding, and silent store removal. Multiple
optimization passes are not performed, but the benets of
doing so are achieved by feeding redundant and store for-
warded load information back to the CP/RA stage; CP/RA
then copy propagates the information through subsequent
instructions.
Our memory optimizations are not speculative. There-
fore, removed load and store instructions are safely elimi-
nated with respect to a single thread of execution. We as-
sume that memory-mapped I/O is identiable. In a real sys-
tem, where another thread of execution may disturb a mem-
ory location unbeknownst to the thread being optimized,
1
Left shifts of 0, 2, and 3 corresponding to addx, s4addx, and s8addx
of the Alpha ISA are the only allowed shift values.
2
Removing
scale
decreases coverage to 54%. Expressions can be 32-
bit or 64-bit; without 32-bit support, coverage decreases to 50%.
such optimizations may not be possible. We evaluate the im-
pact of p