RAS design for the IBM eServer z900
by L. C. Alves
M. L. Fair
P. J. Meaney
C. L. Chen
W. J. Clarke
G. C. Wellwood
N. E. Weber
I. N. Modi
B. K. Tolan
F. Freier
The IBM eServer zSeries™ Model 900, or z900,
has been designed with major enhancements
for hardware reliability, availability, and
serviceability (RAS) in support of the zSeries
RAS strategy, the eServer self-management
technologies, and the z900 design objective
of continuous reliable operation. The eServer
self-management technologies enable the
server to protect itself, to detect and recover
from errors, to change and configure itself,
and to optimize itself, in the presence of
problems and changes, for maximum
performance with minimum outside intervention.
From the RAS perspective, the longstanding
RAS strategy for the IBM S/390® and now the
zSeries has provided an excellent foundation
for self management. This paper describes
the z900 RAS enhancements and how they
strengthen the RAS strategy building blocks
and provide a basis for autonomic computing.
Introduction
The purpose of the zSeries* RAS strategy is to enable
delivery to our customers of servers which are capable of
continuous reliable operation (CRO). The RAS strategy
building blocks and the concept of CRO were described
in Volume 43, Number 5/6 of the IBM Journal of Research
and Development [1], which was devoted to the design of
the S/390* G5/G6 servers. The two elements of CRO,
continuous and reliable, require the server to run the
customer's operation without interruption caused by
errors, maintenance, or change in server hardware or
Licensed Internal Code (LIC), while ensuring error-free
execution and data integrity. The seven building blocks of
this strategy, which are intended to support the drive to
CRO, are error prevention, error detection, recovery,
problem determination, service structure, change
management, and measurement and analysis [1]. These
RAS building blocks are intended to be independent of
the operating system (OS), so that no particular OS is
required to reconfigure the logical partition (LPAR), or to
enable processor unit (PU), memory, or input/output (I/O)
port sparing. No particular OS is required to enable
Capacity Upgrade on Demand (CUoD) or Capacity
Backup (CBU). No particular OS is required to provide
service or remote support. All of these RAS functions are
built into the structure of the hardware. Therefore, when
a new OS (e.g., Linux**) is introduced, zSeries RAS is
already at work. The RAS functions are operational
whether Linux is the only OS running or Linux is sharing
the server with a traditional OS (e.g., z/OS*).
Major enhancements in RAS design, concurrent
upgrade, and concurrent repair for the z900 have been
made in the processor, storage, I/O, power/cooling,
service, support, and LIC subsystems, as well as for the
Parallel Sysplex*. These enhancements strengthen the
RAS building blocks and support the self-protecting,
self-healing, self-configuring, and self-optimizing
capabilities of the z900.
By definition, CRO implies the capability of a server to
prevent or tolerate errors, eliminate outage, ensure error-free
operation, and enable the maximum possible
operating capacity in the presence of errors, service, or
change. This is also the essence of self management. The
interrelationship between the RAS strategy building blocks
and eServer self management is shown in Figure 1.
Copyright 2002 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
0018-8646/02/$5.00 © 2002 IBM
IBM J. RES. & DEV. VOL. 46 NO. 4/5 JULY/SEPTEMBER 2002
Processor subsystem
The z900 is based on the proven G5/G6 dual cluster
system structure, described in [2], consisting of PU,
storage controller control (SCC), storage controller data
(SCD), memory bus adapter (MBA), memory system
controller (MSC), external timer reference (ETR), clock
(CLK), cryptographic coprocessor element (CCE),
oscillator (OSC), and memory cards (Figure 2). The z900
utilizes two multiple-chip module (MCM) designs: one
contains 35 chips (20 PU + 2 SCC + 8 SCD + 4 MBA + 1 CLK),
and the other contains 23 chips (12 PU + 2 SCC + 4 SCD +
4 MBA + 1 CLK). Both MCMs are
cooled by modular cooling units (MCUs) and operate at
a nominal junction temperature of 0°C, improving server
performance and semiconductor reliability.
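The chip counts quoted for the two MCM designs can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the dictionary layout and the names `MCM_DESIGNS` and `total_chips` are assumptions made for this example, not part of the z900 design.

```python
# Chip counts per MCM design, taken from the text above.
# The data layout and names are illustrative assumptions.
MCM_DESIGNS = {
    "35-chip MCM": {"PU": 20, "SCC": 2, "SCD": 8, "MBA": 4, "CLK": 1},
    "23-chip MCM": {"PU": 12, "SCC": 2, "SCD": 4, "MBA": 4, "CLK": 1},
}


def total_chips(design):
    """Sum the per-type chip counts of one MCM design."""
    return sum(design.values())


for name, design in MCM_DESIGNS.items():
    print(name, "->", total_chips(design), "chips")
```

Both totals agree with the 35-chip and 23-chip figures given in the text.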
The z900 processor subsystem continues the robust RAS
design of the G5/G6. The PUs are not shipped with a
fixed function assignment, but are assigned during first
power-on reset (FPOR) following server power-on. The
first assignments are for the system assist processors
(SAPs), followed by the central processors (CPs), the
Integrated Coupling Facilities (ICFs), and the Integrated
Facilities for Linux (IFLs). Making these assignments
dynamically allows the power-on reset (POR) code to
respond to changes in the configuration due to upgrades
or failures. The PUs are assigned alternately between the
two clusters of the Level 2 (L2) cache (SCC and SCD).
This allows all of the PUs to survive an SCD failure and
half to survive an SCC failure.
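The assignment order and the alternation between clusters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual POR code: the function name `assign_pus`, the PU numbering, the per-role counts, and the fallback to the other cluster when one cluster has no working PUs left are all hypothetical.

```python
from collections import deque

# Order in which roles are filled, as stated in the text.
ROLE_ORDER = ["SAP", "CP", "ICF", "IFL"]


def assign_pus(pus_by_cluster, wanted):
    """Assign roles to working PUs, alternating between the two clusters.

    pus_by_cluster: {0: [pu ids], 1: [pu ids]} listing only working PUs
                    (hypothetical input format).
    wanted:         {"SAP": n, "CP": n, ...} requested count per role.
    Returns (assignment, spares): a dict pu_id -> role and the sorted
    list of PUs left unassigned.
    """
    queues = {c: deque(ids) for c, ids in pus_by_cluster.items()}
    assignment = {}
    cluster = 0
    for role in ROLE_ORDER:
        for _ in range(wanted.get(role, 0)):
            # Assumption: if the preferred cluster is exhausted,
            # take the next PU from the other cluster instead.
            if not queues[cluster]:
                cluster = 1 - cluster
            if not queues[cluster]:
                raise ValueError("not enough working PUs")
            assignment[queues[cluster].popleft()] = role
            cluster = 1 - cluster  # alternate clusters
    spares = sorted(pu for q in queues.values() for pu in q)
    return assignment, spares


if __name__ == "__main__":
    clusters = {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
    roles, spares = assign_pus(clusters, {"SAP": 2, "CP": 3, "ICF": 1, "IFL": 1})
    print(roles, spares)
```

Because the roles are computed from the list of working PUs at each power-on reset, a PU that was lost to a failure or added by an upgrade simply changes the input, which mirrors the reason the text gives for making the assignments dynamically.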
The increase in the maximum number of PUs to twenty
allows for more nontraditional processors, such as ICFs
and IFLs, while keeping some PUs available as spares.
The PUs are tightly coupled through a binodal L2 cache.
The PUs can survive intermittent failures on the PU-L2
interface, or the z900 can completely fence (logically
remove from the active configuration) a PU. Fencing a
failing PU shifts the execution to a spare PU. If the failing
PU is a master SAP and a spare PU is not available, an
active CP is reassigned as the master SAP. This dynamic
PU sparing process is transparent to the OS. Each PU
protects itself with dual instruction/execution engines. The
two engines execute each instruction in lockstep. The
output of the engines is compared at checkpoints, and
any mismatch causes the PU to retry from the previous
checkpoint. Continued failure results in PU clockstop and
dynamic PU sparing, as described in [1]. The on-chip
Level 1 (L1) cache can survive an array failure by purging
and deleting the cache line or compartment. The L1 cache
fuse relocation technology allows the defective cache
line to be relocated (L1 cache-line sparing) at the next
FPOR. All PUs, SCCs, and SCDs participate in the logic
built-in self-test (LBIST) and the array built-in self-test
(ABIST).
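The compare-and-retry behavior of the dual engines and the sparing decision that follows a persistent failure can be modeled in software as below. This is a toy sketch, not the hardware mechanism: the names `run_with_lockstep` and `spare_pu`, the retry limit, and the role labels are assumptions introduced for illustration. The text states only that a mismatch triggers a retry from the previous checkpoint and that continued failure leads to clockstop and sparing.

```python
def run_with_lockstep(engine_a, engine_b, operand, max_retries=1):
    """Run the same operation on two engines and compare the results.

    Returns (result, True) when both engines agree, or (None, False)
    when they still disagree after max_retries retries (the caller
    would then stop the PU and invoke sparing). The retry limit is a
    hypothetical parameter.
    """
    for _ in range(max_retries + 1):
        a, b = engine_a(operand), engine_b(operand)
        if a == b:
            return a, True
    return None, False


def spare_pu(roles, failing, is_master_sap=False):
    """Fence a failing PU and pick the PU that takes over its role.

    roles: dict pu_id -> role label ("CP", "SAP", "SPARE", ...),
           modified in place. Returns the id of the replacement PU,
           or None if no replacement is possible.
    """
    role = roles[failing]
    roles[failing] = "FENCED"
    for pu in sorted(roles):
        if roles[pu] == "SPARE":
            roles[pu] = role
            return pu
    # No spare left: per the text, a failing master SAP is replaced
    # by reassigning an active CP.
    if is_master_sap:
        for pu in sorted(roles):
            if roles[pu] == "CP":
                roles[pu] = "SAP"
                return pu
    return None


def make_flaky(fail_times):
    """Build a demo engine that returns a wrong result fail_times times."""
    state = {"left": fail_times}

    def engine(x):
        if state["left"] > 0:
            state["left"] -= 1
            return x + 2  # corrupted result
        return x + 1

    return engine


def good_engine(x):
    return x + 1


if __name__ == "__main__":
    print(run_with_lockstep(good_engine, make_flaky(1), 41))  # transient fault
    print(run_with_lockstep(good_engine, make_flaky(5), 41))  # persistent fault
```

A transient fault disappears on retry and execution continues, while a persistent fault exhausts the retries and hands control to the sparing logic, which prefers a spare PU and falls back to an active CP only for a master SAP.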
The z900 provides RAS enhancements in MCM
reliability and in CCE serviceability, and introduces a
redundant design for the system oscillator.
MCM reliability
To enhance error prevention, the z900 introduces a
number of MCM stress test enhancements, increasing the
already phenomenal MCM reliability even further. The
improvements are as follows:
1. A 20°C temperature test to screen failures at low
temperature. This test reduces or eliminates failures in
manufacturing (shipped product quality level, SPQL)
and in the field.
2. A port test to reduce defective single-cell array failures.
Under this test, Vdd (drain voltage) is reduced to 0.7 V
for 80 ms. Under these conditions, the defective single-cell
failures are screened out.
3. A 1.2-V Vdd low-voltage test to screen cell failures
which occur under low-voltage conditions.
4. Increases in ac and dc test coverage to
98.77% and 99.86%, respectively, to improve SPQL fallout in
manufacturing and field reliability. Analysis techniques
and TestBench¹ procedures are also improved.
5. Reduction of the nominal junction temperature of z900
MCM chips to 0°C from 15°C for G6 to improve server
cycle time. A side benefit of this change is a significant
improvement in chip reliability due to lower junction
temperature.
6. Addition of three hours of on-product clock generator
post-burn-in stress testing for the PU, SCC, and SCD
chips. This helps reduce the number of ac defects that
escape to system manufacturing from the supplier.
[Figure 1. zSeries RAS strategy building blocks and eServer self management. Labels: continuous reliable operation (CRO); zSeries RAS strategy (error prevention, error detection, error recovery, problem determination, service/support, change management, measurement); eServer self management (self-protecting, self-healing, self-configuring, self-optimizing).]
Cryptographic coprocessor element
The CCE chips inherit their design from the G5/G6
servers, thus continuing the strong cryptographic RAS
characteristics. The CCE chips are each twin-tailed, with
one active interface and one hot standby interface to the
processor subsystem (Figure 2). CCE0 has interfaces to