Performance Evaluation of the SX-6 Vector Architecture for
Scientific Computations
Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, David Skinner
CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Stephane Ethier
Princeton Plasma Physics Laboratory, Princeton University, Princeton, NJ 08453
Rupak Biswas, Jahed Djomehri, and Rob Van der Wijngaart
NAS Division, NASA Ames Research Center, Moffett Field, CA 94035
Abstract
The growing gap between sustained and peak performance for scientific applications is a well-known
problem in high performance computing. The recent development of parallel vector systems offers the
potential to reduce this gap for many computational science codes and deliver a substantial increase in
computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor
and compares it against the cache-based IBM Power3 and Power4 superscalar architectures across a
number of key scientific computing areas. First, we present the performance of a microbenchmark suite
that examines many low-level machine characteristics. Next, we study the behavior of the NAS Parallel
Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall, the
results demonstrate that the SX-6 achieves high performance on a large fraction of our application suite
and often significantly outperforms the cache-based architectures. However, certain classes of applications
are not easily amenable to vectorization and would require extensive algorithm and implementation
reengineering to utilize the SX-6 effectively.
1 Introduction
The rapidly increasing peak performance and generality of superscalar cache-based microprocessors long
led researchers to believe that vector architectures hold little promise for future large-scale computing
systems [15]. Due to their cost effectiveness, an ever-growing fraction of today's supercomputers employ
commodity superscalar processors, arranged as systems of interconnected SMP nodes. However, the growing
gap between sustained and peak performance for scientific applications on such platforms has become
well known in high performance computing.
The recent development of parallel vector systems offers the potential to reduce this performance gap for
a significant number of scientific codes, and to increase computational power substantially [14]. This was
highlighted dramatically when the Japanese Earth Simulator [2] results were published [20, 21, 24]. The
Earth Simulator, based on NEC SX-6¹ vector technology, achieves five times the LINPACK performance
with almost half the number of processors of the IBM SP-based ASCI White, one of the world's most
powerful supercomputers [8], built using superscalar technology. In order to quantify what this new capability
entails for scientific communities that rely on modeling and simulation, it is critical to evaluate these two
microarchitectural approaches in the context of demanding computational algorithms.
Employee of Computer Sciences Corporation.
¹ Also referred to as the Cray SX-6 due to Cray's agreement to market NEC's SX line.
In this paper, we compare the performance of the NEC SX-6 vector processor against the cache-based
IBM Power3 and Power4 architectures for several key scientific computing areas. We begin by evaluating
memory bandwidth and MPI communication speeds, using a set of microbenchmarks. Next, we evaluate five
of the well-known NAS Parallel Benchmarks (NPB) [4, 11], using problem size Class B. Finally, we present
performance results for a number of numerical codes from scientific computing domains, including plasma
fusion, astrophysics, fluid dynamics, materials science, magnetic fusion, and molecular dynamics. Since
most modern scientific codes are already tuned for cache-based systems, we examine the effort required to
port these applications to the vector architecture. We focus on serial and intranode parallel performance of
our application suite, while isolating processor and memory behavior. Future work will explore the behavior
of multi-node vector configurations.
2 Architectural Specifications
We briefly describe the salient features of the three parallel architectures examined. Table 1 presents a
summary of their intranode performance characteristics. Notice that the NEC SX-6 has significantly higher
peak performance, with a memory subsystem that features a three to six times larger bytes/flop ratio than
the IBM Power systems.
Node      CPU/    Clock    Peak          Memory BW    Peak          Memory Latency
Type      Node    (MHz)    (Gflops/s)    (GB/s)       Bytes/Flop    (µsec)
Power3    16      375      1.5           1.0          0.67          8.6
Power4    32      1300     5.2           6.4          1.2           3.0
SX-6      8       500      8.0           32           4.0           2.1

Table 1: Architectural specifications of the Power3, Power4, and SX-6 nodes.
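The peak and bytes/flop columns of Table 1 follow from simple arithmetic. The Python sketch below recomputes them from the clock rates and memory bandwidths in the table. It assumes that each multiply-add counts as two floating-point operations, and it takes the number of multiply-add results per cycle (two for the Power3 and Power4, eight for the SX-6) from the processor descriptions in Sections 2.1 to 2.3. The names NODES, peak_gflops, and bytes_per_flop are illustrative only.

```python
# Per-CPU figures from Table 1. "madds_per_cycle" is taken from the
# processor descriptions (2 FPUs on Power3/Power4, 8 vector pipes on SX-6).
NODES = {
    "Power3": {"clock_mhz": 375, "madds_per_cycle": 2, "mem_bw_gbs": 1.0},
    "Power4": {"clock_mhz": 1300, "madds_per_cycle": 2, "mem_bw_gbs": 6.4},
    "SX-6": {"clock_mhz": 500, "madds_per_cycle": 8, "mem_bw_gbs": 32.0},
}

FLOPS_PER_MADD = 2  # assumption: one multiply plus one add


def peak_gflops(node):
    """Peak rate in Gflops/s: clock (MHz) x MADDs per cycle x 2 flops."""
    return node["clock_mhz"] * node["madds_per_cycle"] * FLOPS_PER_MADD / 1000.0


def bytes_per_flop(node):
    """Memory bandwidth (GB/s) divided by peak rate (Gflops/s)."""
    return node["mem_bw_gbs"] / peak_gflops(node)


for name, node in NODES.items():
    print(f"{name}: {peak_gflops(node):.1f} Gflops/s, "
          f"{bytes_per_flop(node):.2f} bytes/flop")
```

The SX-6 ratio of 4.0 is about six times that of the Power3 and a little over three times that of the Power4, which matches the three-to-six-times range quoted above.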
2.1 Power3
The IBM Power3 was first introduced in 1998 as part of the RS/6000 series. Each 375 MHz processor
contains two floating-point units (FPUs) that can issue a multiply-add (MADD) per cycle for a peak
performance of 1.5 Gflops/s. The Power3 has a short pipeline of only three cycles, resulting in relatively low
penalty for mispredicted branches. The out-of-order architecture uses prefetching to reduce pipeline stalls
due to cache misses. The CPU has a 32KB instruction cache and a 128KB 128-way set associative L1 data
cache, as well as an 8MB four-way set associative L2 cache with its own private bus. Each SMP node
consists of 16 processors connected to main memory via a crossbar. Multi-node configurations are networked
via the IBM Colony switch using an omega-type topology.
The Power3 experiments reported in this paper were conducted on a single Nighthawk II node of the 208-
node IBM pSeries system (named Seaborg) running AIX 5.1, Parallel Environment 3.2, C 6.0, Fortran 8.1,
and located at Lawrence Berkeley National Laboratory.
2.2 Power4
The pSeries 690 is the latest generation of IBM's RS/6000 series. Each 32-way SMP consists of 16 Power4
chips (organized as four MCMs), where a chip contains two 1.3 GHz processor cores. Each core has two
FPUs capable of a fused MADD per cycle, for a peak performance of 5.2 Gflops/s. Two load-store units,
each capable of independent address generation, feed the two double precision MADDers. The superscalar
out-of-order architecture can exploit instruction level parallelism through its eight execution units. Up to
eight instructions can be issued each cycle into a pipeline structure capable of simultaneously supporting
more than 200 instructions. Advanced branch prediction hardware minimizes the effects of the relatively
long pipeline (six cycles) necessitated by the high frequency design.
Each processor contains its own private L1 cache (64KB instruction and 32KB data) with prefetch
hardware; however, both cores share a 1.5MB unified L2 cache. Certain data access patterns may therefore
cause L2 cache conflicts between the two processing units. The directory for the L3 cache is located on-chip,
but the memory itself resides off-chip. The L3 is designed as a stand-alone 32MB cache, or to be combined
with other L3s on the same MCM to create a larger interleaved cache of up to 128MB. Multi-node Power4
configurations are currently available employing IBM's Colony interconnect, but future large-scale systems
will use the lower latency Federation switch.
The Power4 experiments reported here were performed on a single node of the 27-node IBM pSeries
690 system (named Cheetah) running AIX 5.1, Parallel Environment 3.2, C 6.0, Fortran 7.1, and operated
by Oak Ridge National Laboratory.
2.3 SX-6
The NEC SX-6 vector processor uses a dramatically different architectural approach than conventional
cache-based systems. Vectorization exploits regularities in the computational structure to expedite uniform
operations on independent data sets. Vector arithmetic instructions involve identical operations on the
elements of vector operands located in the vector register. Many scientific codes allow vectorization, since they
are characterized by predictable fine-grain data-parallelism that can be exploited with properly structured
program semantics and sophisticated compilers. The 500 MHz SX-6 processor contains an 8-way replicated
vector pipe capable of issuing a MADD each cycle, for a peak performance of 8 Gflops/s per CPU. The
processors contain 72 vector registers, each holding 256 64-bit words.
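To make the register size concrete: a loop longer than 256 elements has to be processed in register-sized pieces, a technique usually called strip mining. The Python sketch below only illustrates that idea and is not the SX-6 compiler's actual code generation. The function name strip_mined_madd is hypothetical, and the cycle estimate assumes that the eight pipes each retire one element per cycle with no startup or memory delays.

```python
import math

VECTOR_LENGTH = 256  # 64-bit words per vector register (from the text)
PIPES = 8            # replicated vector pipes (from the text)


def strip_mined_madd(a, x, y):
    """Compute a * x[i] + y[i] in strips of at most VECTOR_LENGTH elements.

    Returns the result list, the number of strips, and an idealized cycle
    count (assumption: PIPES elements per cycle, no startup or memory cost).
    """
    n = len(x)
    result = [0.0] * n
    strips = 0
    ideal_cycles = 0
    for start in range(0, n, VECTOR_LENGTH):
        stop = min(start + VECTOR_LENGTH, n)
        # Each strip stands in for one vector multiply-add instruction.
        for i in range(start, stop):
            result[i] = a * x[i] + y[i]
        strips += 1
        ideal_cycles += math.ceil((stop - start) / PIPES)
    return result, strips, ideal_cycles


x = [float(i) for i in range(1000)]
y = [1.0] * 1000
result, strips, ideal_cycles = strip_mined_madd(2.0, x, y)
print(strips, ideal_cycles)
```

Under this idealized model a full 256-element strip takes 32 cycles, which is one reason long, regular loops suit the vector unit.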
For non-vectorizable instructions, the SX-6 contains a 500 MHz scalar processor with a 64KB instruction
cache, a 64KB data cache, and 128 general-purpose registers. The 4-way superscalar unit has a peak
of 1 Gflops/s and supports branch prediction, data prefetching, and out-of-order execution. Since the vector
unit of the SX-6 is significantly more powerful than its scalar processor, it is critical to achieve high vector
operation ratios, either via compiler discovery or explicitly through code (re-)organization.
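The sensitivity to the vector operation ratio can be illustrated with an Amdahl-style estimate. The Python sketch below is not drawn from the measurements in this paper. It assumes that vector and scalar work never overlap and that each unit runs at its peak rate (8 Gflops/s and 1 Gflops/s); the function name sustained_gflops is hypothetical.

```python
def sustained_gflops(vector_ratio, vector_peak=8.0, scalar_peak=1.0):
    """Estimate the sustained rate for a given fraction of vectorized operations.

    Assumption: time is the sum of vector time and scalar time, with each
    unit running at its peak rate and no overlap between the two.
    """
    if not 0.0 <= vector_ratio <= 1.0:
        raise ValueError("vector_ratio must lie between 0 and 1")
    time_per_gflop = vector_ratio / vector_peak + (1.0 - vector_ratio) / scalar_peak
    return 1.0 / time_per_gflop


for ratio in (0.5, 0.9, 0.99, 1.0):
    print(f"vector operation ratio {ratio:.2f}: "
          f"{sustained_gflops(ratio):.2f} Gflops/s")
```

Under these assumptions, a code with 90% of its operations vectorized reaches only about 4.7 of the 8 Gflops/s peak, which shows why even a small scalar remainder matters.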
Unlike conventional architectures, the SX-6 vector unit lacks data caches. Instead of relying on data lo-
cality to reduce memory overhead, memory latencies are masked by overlapping pipelined vector operations
with memory fetches. The SX-6 uses