Effective Dynamic Voltage Scaling through CPU-Boundedness Detection
or=blue>the Internet Archive.
Yahoo! is not affiliated with the authors of this page or responsible for its content.
Eective Dynamic Voltage Scaling through CPU-Boundedness Detection
Eective Dynamic Voltage Scaling
through CPU-Boundedness Detection
Chung-Hsing Hsu and Wu-chun Feng
{chunghsu,feng}@lanl.gov
Advanced Computing Laboratory
Los Alamos National Laboratory
Los Alamos, NM 87545
Keywords: Power-aware computing, dynamic voltage
scaling, interval-based voltage scheduling, performance
modeling, power-performance tradeo.
Abstract
Dynamic voltage scaling (DVS) allows a program to ex-
ecute at a non-peak CPU frequency in order to reduce
CPU power, and hence, energy consumption; however,
it is done at the cost of performance degradation. For
a program whose execution time is bounded by periph-
erals performance rather than the CPU speed, apply-
ing DVS to the program will result in negligible per-
formance penalty. Unfortunately, existing DVS-based
power-management algorithms are conservative in the
sense that they overly exaggerate the impact that the
CPU speed has on the execution time, e.g., they assume
that the execution time will double if the CPU speed is
halved. Based on a new single-coecient performance
model, we propose a DVS algorithm that detects the
CPU-boundedness of a program on the y (via a re-
gression method on the past MIPS rate) and then ad-
justs the CPU frequency accordingly. To illustrate its
eectiveness, we compare our algorithm with other DVS
algorithms on real systems via physical measurements.
1
Introduction
Dynamic voltage and frequency scaling (DVS) is a
mechanism whereby software can dynamically adjust
CPU voltage and frequency. This mechanism allows
systems to address the problem of ever-increasing CPU
power dissipation and energy consumption, as they are
both quadratically proportional to the CPU voltage.
However, reducing the CPU voltage may also require
the CPU frequency to be reduced and results in de-
This work was supported by the DOE ASC Program through
Los Alamos National Laboratory contract W-7405-ENG-36.
graded CPU performance with respect to execution
time. In other words, DVS trades o performance for
power and energy reduction.
The performance loss due to running at a lower CPU
frequency raises several issues. First, a user who pays to
upgrade his/her computer system does not want to see
performance degradation. Second, running programs
at a low CPU frequency may end up increasing total
system energy usage [1, 2, 3, 4]. In order to control
(or constrain) the performance loss eectively, a model
that relates performance to the CPU frequency is essen-
tial for any DVS-based power-management algorithm
(shortened as DVS algorithm hereafter).
A typical model used by many DVS algorithms pre-
dicts that the execution time will double if the CPU
speed is cut in half. Unfortunately, this model overly
exaggerates the impact that the CPU speed has on the
execution time. It is only in the worst case that the ex-
ecution time doubles when the CPU speed is halved; in
general, the actual execution time is less than double.
For example, in programs with a high cache miss ra-
tio, performance can be limited by memory bandwidth
rather than CPU speed. Since memory performance
is not aected by a change in CPU speed, increasing
or decreasing the CPU frequency will have little eect
on the performance of these programs. We call this
phenomenon sublinear performance slowdown. Con-
sequently, researchers have been trying to exploit this
program behavior in order to achieve better power and
energy reduction [5, 6, 7, 8]. One common technique de-
composes program workload into regions based on their
CPU-boundedness.
The decomposition can be done
statically using proling information [5, 6] or dynam-
ically through an auxiliary circuit [9, 10, 11] or through
a built-in performance monitoring unit (PMU) [7, 8].
In this paper, we propose a new PMU-assisted, on-line
DVS algorithm called beta that provides ne-grained,
tight control over performance loss as well as takes the
advantage of the sublinear performance slowdown. The
new beta algorithm is based on an extension of the the-
1
oretical work developed by Yao et al. [12] and by Ishi-
hara and Yasuura [13]. Via physical measurements, we
will demonstrate the eectiveness of the beta algorithm
when compared to several existing DVS algorithms for
a number of applications.
The rest of the paper is organized as follows: Section 2
characterizes how current DVS algorithms relate perfor-
mance to CPU frequency. With this characterization as
a backdrop, we present a new DVS algorithm (Section 3)
along with its theoretical foundation (Section 4). Then,
Section 5 describes the experimental set-up, the imple-
mented DVS algorithms, and the experimental results.
Finally, Section 6 concludes and presents some future
directions.
2
Related Work
CPU utilization is often used to relate performance to
the CPU frequency. While it is generally dened as the
fraction of time that the CPU spends non-idle, CPU
utilization may also be interpreted as the normalized
workload (e.g., [14, 15, 16]). This particular interpreta-
tion has a nice property that there is a one-to-one corre-
spondence between the desired normalized CPU speed
and CPU idle time. Thus, if CPU utilization is 0.5
on a 2-GHz machine, then setting the CPU frequency
to 1 GHz is predicted to eliminate all CPU idle time.
Clearly, CPU utilization (or the normalized workload)
follows the assumption that the execution time doubles
when the CPU speed is halved. This type of model is
popular because the metric is easy to derive at run time
and it does not require application-specic information.
However, CPU utilization by itself does not provide
enough information about system timing requirements,
and DVS algorithms based on such information can only
provide loose control over performance loss [17, 18, 19].
Thus, DVS algorithms with application-specic infor-
mation have been proposed in order to provide tighter
control over performance loss. For example, an applica-
tion (or task or thread) can be associated with a dead-
line, in terms of seconds, as well as a CPU work require-
ment, in terms of CPU cycles. In this setting, perfor-
mance is usually formulated as a linear function of the
CPU speed. This type of performance model predicts
that the execution time doubles when the CPU speed
is halved. Other approaches use a target IPC (instruc-
tions per cycle) rate as the system timing requirements
[20, 21]. Their performance model falls into the same
category too.
There have been some attempts to exploit the sublin-
ear performance slowdown to achieve more power and
energy reduction. For example, Marculescu [9] proposed
to set the CPU to a low speed whenever an L2 cache
miss occurs. Li et al. [11] improved the algorithm by
taking into account the transition overhead and scaling
the CPU frequency and voltage according to the level of
parallelism between the CPU and the memory subsys-
tem. Stanley-Marbell et al. [10] designed an auxiliary
hardware unit to detect loop-based, memory-bound ex-
ecution phases.
The PMU-assisted process cruise control developed
by Weissel and Bellosa [5] relies on a pre-computed ta-
ble of optimal CPU speeds to direct the CPU speed
change. The table is indexed by the run-time instruc-
tion counts per cycle and memory requests per cycle.
Although the algorithm requires neither source code nor
compiler support, it is inexible in the sense that the ta-
ble is obtained through extensive experiments of micro-
benchmarks for a given performance loss (e.g., 10% in
[5]). In other words, the algorithm does not allow for
dynamic, application-specic control over performance
loss.
Hsu and Kremer [6] use o-line proling to identify
memory-bound program regions, coupled with compiler
transformations, to facilitate the setting of the CPU
frequency. However, the need for source code and com-
piler support makes this approach more dicult to im-
plement in practice. In general, compiler-directed DVS
algorithms have the benet of only requiring the host
processor to export a DVS interface and does not re-
quire support from the OS scheduler. They also allow
DVS scheduling decisions to be made in a global man-
ner and to be in combination with performance-oriented
optimization. On the other hand, savings are limited as
speed-set instructions are inserted statically, and thus,
apply to all execution of a specic memory reference,
both cache misses as well as hits [22]. Moreover, in-
put data sets may change program behavior that makes
prole-based DVS algorithms less attractive.
Our work is closest to Choi et al.s recent work
[7, 8]. Both use a regression method and PMU sup-
port to perform on-line DVS scheduling through CPU-
boundedness detection. However, the two works dier
in their denition of CPU-boundedness, and thus, the
detection mechanism. Choi et al.s work is based on the
ratio of the on-chip computation time to the o-chip
access time. In contrast, our algorithm denes CPU-
boundedness as the fraction of program workload that
is CPU-bound. Because of the di