vEC: Virtual Energy Counters

krishnan, M.J. Irwin, and A. Sivasubramaniam
Microsystems Design Lab
Pennsylvania State University
University Park, PA 16802
1 814 865 9505
mdl@cse.psu.edu


ABSTRACT
Energy has become a critical issue in processor design, especially
in embedded environments. Thus, there is a need for tools that
provide accurate and fast energy estimates. In this paper,
we present the design and use of a tool, Virtual Energy Counters
(vEC), for estimating the energy consumption of user programs.
vEC is built on top of the Perfmon user library for the
UltraSPARC platform and provides a user interface that can
be used within user programs to estimate energy consumption.
Energy estimates are provided for the data, instruction, and
extended caches, main memory, the address bus, the data bus,
the address pads, and the data pads.
Keywords
Hardware Performance Counters, System Energy Consumption,
Embedded Systems, Optimizations, Signal Processing.
1. INTRODUCTION
Energy has become a critical issue in processor design, especially
in embedded environments. When designing for embedded
systems, designers have to take into account both high
performance and low energy consumption. With this in mind,
power/energy estimation tools that provide fast and accurate
estimates are becoming increasingly important during code
development. Simulators such as SimplePower [2] provide
accurate energy estimates, but they generally take a
significant amount of time to produce them. Software-based
techniques (e.g., [8]), on the other hand, are not robust
enough to be relied on.
In this paper, we present the design and use of a tool, Virtual
Energy Counters (vEC), for estimating (measuring) the energy
consumption of user programs. The tool provides a fast estimation
of the energy consumption of the main components of modern
processors such as cache, main memory, and buses. It is built on
top of the Perfmon user library [5] for the UltraSPARC platform,
and may be easily extended to other platforms such as the MIPS
R10K and Intel's Pentium processors. Perfmon is a tool that
user-level code can use to access the hardware performance
counters in the Pentium and UltraSPARC series microprocessors.
Hardware performance counters are special hardware registers of
modern microprocessors that monitor the occurrence of hardware
events in a microprocessor without affecting the performance of
the program [5, 7]. Some of the typical events that may be
monitored are cycle count, instruction count, cache references and
hits/misses, main memory writebacks and references, and branch
mispredictions. By monitoring these events, designers can improve
program performance. For example, using the branch misprediction
count, a designer can optimize a program to reduce
the number of mispredicted branches and wasted cycles, thereby
improving the program execution time. Similarly, by monitoring
the data cache misses, a designer can modify the data layout
(dynamically at run-time) to improve program performance. For
the purposes of this work, the events monitored are used to
estimate memory system energy consumption for a running
program.
vEC provides a means to profile user programs and measure the
memory system energy consumption. The power analysis
performed applies the analytical energy formulas from [1], which
are primarily based on the number of cache references, hits,
misses, and capacitance values. The energy estimations provided
by vEC are absolute values (in Joules) and can also be applied to
energy consumption comparisons for different code
implementations such as those presented in [3]. However, since
the values are based on event occurrences and analytical formulas,
they are estimates and may differ from the actual values.
The rest of this paper consists of six sections. Section 2 presents
related work on energy estimation and optimization. Section 3
discusses the UltraSPARC hardware performance counters. We
present the analytical formulas used to compute our energy
estimations as well as the events monitored in Section 4. Section 5
presents the vEC user interface, its usage, and the experiments
performed to evaluate our tool. Sections 6 and 7 provide a
discussion and conclusion, respectively, of the work presented in
this paper.
2. RELATED WORK
Numerous compiler transformations (optimizations) have been
proposed to automatically make user programs run faster.
Some of these transformations are applied at the source level, where
program access patterns imposed by loop and other control
structures are visible. Loop nest transformations constitute an
important part of source-level transformations and specifically
target loop nests, where most of the execution time is spent
(especially in multidimensional signal processing and video
applications). Techniques such as loop permutation, loop tiling,
and loop fusion have proven very useful in optimizing the
performance of loop nests, e.g., by enhancing cache performance
and/or improving parallelism [4].

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
PASTE 01, June 18-19, 2001, Snowbird, Utah, USA.
Copyright 2001 ACM 1-58113-413-4/01/0006$5.00.

Switching activities of system components determine the dynamic
power consumption of CMOS VLSI circuits [9]. The switching
activity depends on the execution patterns of applications. Source-
level compiler transformations determine the execution patterns of
these applications. Such transformations were originally designed
to optimize the performance of code and did not focus on
optimizing the power consumption, an essential design parameter
when developing power-conscious systems such as embedded
systems. There has been little effort to analyze their energy
impact. Embedded systems therefore need energy estimation tools
that measure the impact of these compiler transformations on
energy consumption. An earlier evaluation of the effect of
compilation techniques on energy consumption can be found
in [8].
Energy measuring tools can be either transition-sensitive or based
on analytical formulas. Since transition-sensitive simulators
estimate the energy consumption based on bit-switching activities,
they take a significant amount of time to generate energy
estimates. For example, SimplePower is a transition-sensitive
energy simulator that estimates datapath energy
consumption [2].
3. UltraSPARC HARDWARE PERFORMANCE COUNTERS
The UltraSPARC CPU [6] has two 64-bit registers that can be
used to monitor and collect statistics about different hardware
events: the Performance Control Register (PCR) and Performance
Instrumentation Counter (PIC). These registers collect statistics
on the major events that occur on a per-processor basis, at the user
and system levels. The PCR controls which events will be
monitored and the level of monitoring (i.e., system or user). The
PIC accumulates the number of occurrences of at most two events.
The first 32 bits of the PIC are used for one event, and the second
32 bits are used for the other. Each half of the PIC can monitor
one of 16 events at a time, and only 2 events are common to both
halves. Thus, a total of 16 + 16 - 2 = 30 different events can be
monitored. Accessing the performance counters requires privileged
instructions. Perfmon provides a user library component that
allows user programs to access the performance counters using C
function calls.
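As a minimal sketch of the register layout described above, the following Python splits a 64-bit PIC value into its two 32-bit event counts and computes the number of events between two reads. It assumes that the low 32 bits hold the first event and the high 32 bits the second, which the text does not specify. The helper names split_pic and counter_delta are hypothetical and are not part of Perfmon.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF  # each half of the PIC is a 32-bit counter


def split_pic(pic_value: int) -> Tuple[int, int]:
    """Split a 64-bit PIC value into two 32-bit event counts.

    Assumption: the low half is the first event and the high half
    is the second; the paper does not state the ordering.
    """
    if not 0 <= pic_value < 2 ** 64:
        raise ValueError("PIC value must fit in 64 bits")
    first = pic_value & MASK32
    second = (pic_value >> 32) & MASK32
    return first, second


def counter_delta(start: int, end: int) -> int:
    """Events counted between two reads of one 32-bit half.

    Masking the difference tolerates a single counter wraparound.
    """
    return (end - start) & MASK32


if __name__ == "__main__":
    # Hypothetical raw value: 7 in the high half, 5 in the low half.
    print(split_pic((7 << 32) | 5))
```

Because each half is only 32 bits wide, a long-running program can wrap a counter, which is why the sketch takes differences modulo 2^32 rather than subtracting directly.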
4. COMPUTING ENERGY VALUES
4.1 Energy Formulations
We use the analytical memory energy model from [1] to estimate
the energy. This model has been found to be accurate to within
2.4% when compared to circuit-level simulation.
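\begin{align*}
\mathit{Energy} &= E_{bus} + E_{cell} + E_{pad} + E_{main} \\
E_{bus} &= E_{add\_bus} + E_{data\_bus} \\
E_{cell} &= \beta \cdot \mathit{Word\_line\_size} \cdot (\mathit{Bit\_line\_size} + 4.8) \cdot (N_{hit} + 2 N_{miss}) \\
E_{pad} &= E_{add\_pad} + E_{data\_pad} \\
E_{main} &= E_m \cdot 8L \cdot N_{miss} \cdot (1 + \mathit{dirty\_r}) \\
E_{add\_bus} &= 0.5 \times 10^{-12} \cdot Pr_1 \cdot V^2 \cdot (N_{hit} + N_{miss}) \cdot W_{add} \\
E_{data\_bus} &= 0.5 \times 10^{-12} \cdot Pr_2 \cdot V^2 \cdot (N_{hit} + N_{miss}) \cdot 32 \\
E_{add\_pad} &= 20 \times 10^{-12} \cdot Pr_3 \cdot V^2 \cdot N_{miss} \cdot W_{add} \\
E_{data\_pad} &= 20 \times 10^{-12} \cdot Pr_4 \cdot V^2 \cdot (1 + \mathit{dirty\_r}) \cdot N_{miss} \cdot 64 \\
\mathit{Word\_line\_size} &= m \cdot (8L + T + S_t) \\
\mathit{Bit\_line\_size} &= C / (m \cdot L) \\
\beta &= 1.44 \times 10^{-14} \quad \text{(technology parameter)} \\
E_m &= 4.95 \times 10^{-9} \quad \text{(per-access off-chip energy cost)}
\end{align*}
where $C$ = cache size, $L$ = cache line size,
$m$ = set associativity, $T$ = tag size in bits,
$S_t$ = number of status bits per block,
$N_{hit}$ = number of hits,
$N_{miss}$ = number of misses,
$W_{add}$ = width of the address bus,
$\mathit{dirty\_r}$ = fraction of blocks written back into memory on
replacement.
The $Pr_1$ to $Pr_4$ values are the switch rates for the add_bus,
data_bus, add_pad, and data_pad, respectively. All four are
assumed to be 0.25 for the purposes of this study.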
In this formulation, $E_{bus}$ represents the data and address bus
energy between the processor and the cache, $E_{cell}$ represents the
cache energy, $E_{pad}$ represents the data and address pad energy
between the cache and main memory, and $E_{main}$ represents the
main memory energy.
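To make the formulation above concrete, the following Python sketch evaluates each energy term from a hit count and a miss count. It is an illustration under stated assumptions, not the vEC implementation: the names CacheConfig and memory_energy are hypothetical, the cache size and line size are assumed to be in bytes (so that 8L is a bit count), V is assumed to be the supply voltage in volts, and the example cache parameters are placeholders rather than UltraSPARC values.

```python
from dataclasses import dataclass

BETA = 1.44e-14  # technology parameter (value from the text)
E_M = 4.95e-9    # per-access off-chip energy cost (value from the text)


@dataclass
class CacheConfig:
    C: int   # cache size (assumed to be in bytes)
    L: int   # cache line size (assumed to be in bytes)
    m: int   # set associativity
    T: int   # tag size in bits
    St: int  # number of status bits per block


def memory_energy(cfg, n_hit, n_miss, dirty_r, v, w_add,
                  pr=(0.25, 0.25, 0.25, 0.25)):
    """Return each energy term of the analytical model, plus the total.

    pr holds the switch rates Pr1..Pr4 (0.25 each in the paper).
    """
    pr1, pr2, pr3, pr4 = pr
    word_line_size = cfg.m * (8 * cfg.L + cfg.T + cfg.St)
    bit_line_size = cfg.C / (cfg.m * cfg.L)
    refs = n_hit + n_miss
    parts = {
        "add_bus": 0.5e-12 * pr1 * v ** 2 * refs * w_add,
        "data_bus": 0.5e-12 * pr2 * v ** 2 * refs * 32,
        "cell": BETA * word_line_size * (bit_line_size + 4.8)
                * (n_hit + 2 * n_miss),
        "add_pad": 20e-12 * pr3 * v ** 2 * n_miss * w_add,
        "data_pad": 20e-12 * pr4 * v ** 2 * (1 + dirty_r) * n_miss * 64,
        "main": E_M * 8 * cfg.L * n_miss * (1 + dirty_r),
    }
    # Energy = E_bus + E_cell + E_pad + E_main
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    # Placeholder configuration and counts, for illustration only.
    cfg = CacheConfig(C=16 * 1024, L=32, m=1, T=18, St=2)
    result = memory_energy(cfg, n_hit=900_000, n_miss=100_000,
                           dirty_r=0.3, v=3.3, w_add=32)
    for name, joules in result.items():
        print(f"{name:9s} {joules:.3e} J")
```

Returning the individual terms alongside the total makes it easy to see which component dominates when two code versions are compared, since a miss contributes to the cell, pad, and main memory terms while a hit contributes only to the bus and cell terms.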
4.2 Events of Interest
We determined which events would be monitored in order to
calculate the memory energy consumption (using