A High-Performance, Portable Implementation of
the MPI Message Passing Interface Standard
William Gropp
Ewing Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
Nathan Doss
Anthony Skjellum
Department of Computer Science &
NSF Engineering Research Center for CFS
Mississippi State University
Abstract
MPI (Message Passing Interface) is a specification for a standard library for message
passing that was defined by the MPI Forum, a broadly based group of parallel computer
vendors, library writers, and applications specialists. Multiple implementations
of MPI have been developed. In this paper, we describe MPICH, unique among existing
implementations in its design goal of combining portability with high performance. We
document its portability and performance and describe the architecture by which these
features are simultaneously achieved. We also discuss the set of tools that accompany
the free distribution of MPICH, which constitute the beginnings of a portable parallel
programming environment. A project of this scope inevitably imparts lessons about
parallel computing, the specification being followed, the current hardware and software
environment for parallel computing, and project management; we describe those we
have learned. Finally, we discuss future developments for MPICH, including those nec-
essary to accommodate extensions to the MPI Standard now being contemplated by
the MPI Forum.
1
Introduction
The message-passing model of parallel computation has emerged as an expressive, efficient,
and well-understood paradigm for parallel programming. Until recently, the syntax
and precise semantics of each message-passing library implementation were different from
the others, although many of the general semantics were similar. The proliferation of
message-passing library designs from both vendors and users was appropriate for a while,
but eventually it was seen that enough consensus on requirements and general semantics
for message passing had been reached that an attempt at standardization might usefully be
undertaken.
This work was supported by the Mathematical, Information, and Computational Sciences Division
subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under
Contract W-31-109-Eng-38.
This work was supported in part by the NSF Engineering Research Center for Computational Field
Simulation, Mississippi State University. Additional support came from the NSF Career Program under
grant ASC-95-01917, and from ARPA under Order D350.
The process of creating a standard to enable portability of message-passing applica-
tions codes began at a workshop on Message Passing Standardization in April 1992, and
the Message Passing Interface (MPI) Forum organized itself at the Supercomputing '92
Conference. During the next eighteen months the MPI Forum met regularly, and Version
1.0 of the MPI Standard was completed in May 1994 [16, 36]. Some clarifications and
refinements were made in the spring of 1995, and Version 1.1 of the MPI Standard is now
available [17]. For a detailed presentation of the Standard itself, see [42]; for a tutorial
approach to MPI, see [29]. In this paper we assume that the reader is relatively familiar
with the MPI specification, but we provide a brief overview in Section 2.2.
The project to provide a portable implementation of MPI began at the same time as
the MPI definition process itself. The idea was to provide early feedback on decisions being
made by the MPI Forum and provide an early implementation to allow users to experiment
with the definitions even as they were being developed. Targets for the implementation were
to include all systems capable of supporting the message-passing model. MPICH is a freely
available, complete implementation of the MPI specification, designed to be both portable
and efficient. The "CH" in MPICH stands for "Chameleon," symbol of adaptability to
one's environment and thus of portability. Chameleons are fast, and from the beginning a
secondary goal was to give up as little efficiency as possible for the portability.
MPICH is thus both a research project and a software development project. As a
research project, its goal is to explore methods for narrowing the gap between the program-
mer of a parallel computer and the performance deliverable by its hardware. In MPICH,
we adopt the constraint that the programming interface will be MPI, reject constraints on
the architecture of the target machine, and retain high performance (measured in terms of
bandwidth and latency for message-passing operations) as a goal. As a software project,
MPICH's goal is to promote the adoption of the MPI Standard by providing users with
a free, high-performance implementation on a diversity of platforms, while aiding vendors
in providing their own customized implementations. The extent to which these goals have
been achieved is the main thrust of this paper.
The rest of this paper is organized as follows. Section 2 gives a short overview of MPI
and briefly describes the precursor systems that influenced MPICH and enabled it to come
into existence so quickly. In Section 3 we document the extent of MPICH's portability and
present results of a number of performance measurements. In Section 4 we describe in some
detail the software architecture of MPICH, which comprises the results of our research in
combining portability and performance. In Section 5 we present several specific aspects
of the implementation that merit more detailed analysis. Section 6 describes a family of
supporting programs that surround the core MPI implementation and turn MPICH into a
portable environment for developing parallel applications. In Section 7 we describe how we
as a small, distributed group have combined a number of freely available tools in the Unix
environment to enable us to develop, distribute, and maintain MPICH with a minimum
of resources. In the course of developing MPICH, we have learned a number of lessons
from the challenges posed (both accidentally and deliberately) for MPI implementors by
the MPI specification; these lessons are discussed in Section 8. Finally, Section 9 describes
the current status of MPICH (Version 1.0.12 as of February 1996) and outlines our plans
for future development.
2
Background
In this section we give an overview of MPI itself, describe briefly the systems on which
the first versions of MPICH were built, and review the history of the development of the
project.
2.1
Precursor Systems
MPICH came into being quickly because it could build on stable code from existing systems.
These systems prefigured in various ways the portability, performance, and some of the other
features of MPICH. Although most of that original code has been extensively reworked,
MPICH still owes some of its design to those earlier systems, which we briefly describe
here.
P4 [8] is a third-generation parallel programming library, including both message-passing
and shared-memory components, portable to a great many parallel computing environments,
including heterogeneous networks. Although p4 contributed much of the code for TCP/IP
networks and shared-memory multiprocessors for the early versions of MPICH, most of that
has been rewritten. P4 remains one of the devices on which MPICH can be built (see
Section 4), but in most cases more customized alternatives are available.
Chameleon [31] is a high-performance portability package for message passing on par-
allel supercomputers. It is implemented as a thin layer (mostly C macros) over vendor
message-passing systems (Intel's NX, TMC's CMMD, IBM's MPL) for performance and
over publicly available systems (p4 and PVM) for portability. A substantial amount of
Chameleon technology is incorporated into MPICH (as detailed in Section 4).
Zipcode [41] is a portable system for writing scalable libraries. It contributed several
concepts to the design of the MPI Standard, in particular contexts, groups, and mailers (the
equivalent of MPI communicators). Zipcode also contains extensive collective operations
with group scope as well as virtual topologies, and the first version of MPICH borrowed
heavily from this code.
2.2
Brief Overview of MPI
MPI is a message-passing application programmer interface, together with protocol and
semantic specifications for how its features must behave in any implementation (such as
a message buffering and message delivery progress requirement). MPI includes point-to-point
message passing and collective (global) operations, all scoped to a user-specified group
of processes. Furthermore, MPI provides abstractions for processes at two levels. First,
processes are named according to their rank in the group in which the communication is being
performed. Second, virtual topologies allow for graph or Cartesian naming of processes
that helps relate the application semantics to the message-passing semantics in a convenient,
efficient way. Communicators, which house groups and communication context (scoping)
information, provide an important measure of safety that is necessary and useful for building
up library-oriented parallel code.
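The Cartesian naming of processes described above can be illustrated without an MPI
installation. The sketch below is not MPICH code. It is a minimal pure-Python model that
assumes the row-major rank ordering MPI uses for Cartesian topologies. The function names
cart_coords and cart_rank are hypothetical, chosen only to echo MPI_Cart_coords and
MPI_Cart_rank, and periodic (wraparound) dimensions are deliberately ignored.

```python
def cart_coords(rank, dims):
    """Return the Cartesian coordinates of `rank` in a grid of shape `dims`.

    Assumption: row-major ordering, so the last dimension varies fastest.
    """
    size = 1
    for d in dims:
        size *= d
    if not 0 <= rank < size:
        raise ValueError("rank out of range for this grid")
    coords = []
    # Peel off coordinates starting with the fastest-varying dimension.
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return tuple(reversed(coords))


def cart_rank(coords, dims):
    """Inverse of cart_coords: map grid coordinates back to a rank."""
    if len(coords) != len(dims):
        raise ValueError("coords and dims must have the same length")
    rank = 0
    for c, d in zip(coords, dims):
        if not 0 <= c < d:
            raise ValueError("coordinate out of range")
        rank = rank * d + c
    return rank


# A 2 x 3 process grid: ranks 0..5 are laid out row by row.
dims = (2, 3)
grid = [cart_coords(r, dims) for r in range(6)]
```

With such a mapping an application can address "the process at row 1, column 2" rather
than "rank 5", which is the kind of convenience the virtual topology abstraction is meant
to provide.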
MPI also provides three additional classes of services: environmental inquiry, basic
timing information for application performance measurement, and a profiling interface for
external performance monitoring. MPI makes heterogeneous data conversion a transparent
part of its services by requiring datatype specification for all communication operations.
Both built-in and user-defined datatypes are provided.
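The profiling interface rests on name shifting: in the MPI Standard each routine is also
reachable under a PMPI_ prefix, so a monitoring tool can supply its own MPI_ routine that
records information and then calls the PMPI_ version. The following is only a toy Python
model of that idea, not MPI or MPICH code. The names pmpi_send, mpi_send, delivered, and
stats are hypothetical stand-ins introduced for this illustration.

```python
import time

# Stands in for the network in this toy model (hypothetical).
delivered = []


def pmpi_send(buf, dest, tag):
    """Hypothetical 'real' send, playing the role of the name-shifted PMPI_Send."""
    delivered.append((dest, tag, list(buf)))


# Bookkeeping kept by the monitoring tool (hypothetical).
stats = {"calls": 0, "items": 0, "seconds": 0.0}


def mpi_send(buf, dest, tag):
    """Profiling wrapper: same signature as the real send, plus bookkeeping."""
    start = time.perf_counter()
    pmpi_send(buf, dest, tag)  # delegate to the underlying routine
    stats["seconds"] += time.perf_counter() - start
    stats["calls"] += 1
    stats["items"] += len(buf)


# Application code calls the usual name and is unaware of the wrapper.
mpi_send([1.0, 2.0, 3.0], dest=1, tag=7)
mpi_send([4.0], dest=2, tag=7)
```

The point of the design is that the application source does not change: the tool is
inserted purely by providing the wrapper under the ordinary name.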
MPI accomplishes its functionality with opaque objects, with well-defined constructors
and destructors,