Proactive Steps to Minimize Power Outages
Do you still need backups if you have a SAN?
Common Sense
Let Us Help You Succeed!
Preface
You are sitting at your desk, working intently on a critical project at your
computer, and the power goes out. There is no worse feeling than that, except
of course if you are responsible for your company's enterprise computer systems
and network and are not completely sure how well protected you are
from a power failure. This white paper offers some insight into how
to avoid power failure anxiety.
Introduction
At the time of this writing, many areas of California are
experiencing rolling blackouts. A rolling blackout is an unexpected period of
30 minutes or longer when electricity is cut off to targeted areas in order
to reduce the total load on the state's power grid. So far there has been little or
no advance warning regarding which areas will be affected next, and therefore it
is impossible to adequately plan or prepare for these outages on short notice.
Avoiding Disaster: An Ounce of Prevention
The purposes of this white paper are to document proactive steps that can be
taken to minimize the impact of unplanned service outages, and to discuss some
elements of general systems availability. These subjects overlap to some
degree, so the reader may find that some sections
of the paper are useful in contexts other than that of a power outage.
While a power outage is not by definition a true disaster, it raises many of the
same issues that a Disaster Recovery Plan addresses. If an outage is not planned
for, there can be a loss of services and data, and possible equipment damage,
resulting in a longer unplanned service outage. Many companies incur serious loss of
business, profits, and customer satisfaction with even short periods of unplanned
service interruption.
Rolling blackouts were the inspiration for this document, but it should be noted
that issues such as the loss of power are not limited to California. The impact is
widespread and potentially devastating. It is good business practice to prepare
for problems such as this in order to avert situations that could potentially lead
to the execution of your Disaster Recovery Plan. As the saying goes, an ounce
of prevention is worth a pound of cure.
Availability: How Much Is Enough?
The growth of the Internet has forced many companies to change the way they
do business. In fact, most companies today are affected in some way by the
Internet's rise in popularity as a medium for communications, sales, online
transactions, and customer service. As a result, companies are finding that their
old standards for systems availability are inadequate to meet their current needs.
As most readers are aware, the "nines" scale of availability has long been a
standard for measuring not only the amount of time in which a system must be
available (uptime), but also the amount of time in which it may be unavailable
(downtime) for maintenance or unplanned outages. Many companies pay a
premium for their systems and attendant maintenance in order to be guaranteed
high availability. The more nines in the rating, the higher the level of
availability and, conversely, the lower the allowable downtime. The actual
numbers may surprise you. For a non-leap year (365 days):
90.0% uptime = 36 days and 12 hours of downtime per year.
99.0% uptime = 3 days, 15 hours and 36 minutes of downtime per year.
99.9% uptime = 8 hours and 46 minutes of downtime per year.
99.99% uptime = 52 minutes and 33 seconds of downtime per year.
99.999% uptime = 5 minutes and 15 seconds of downtime per year.
99.9999% uptime = 31.5 seconds of downtime per year.
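The figures above follow from simple arithmetic: allowable downtime is the
length of the period multiplied by the fraction of time the system may be
unavailable. The short Python sketch below reproduces that calculation. The
function name allowed_downtime and its parameters are our own illustrative
choices, not part of any standard.

```python
from datetime import timedelta

# A non-leap year, matching the assumption used in the list above.
NON_LEAP_YEAR = timedelta(days=365)


def allowed_downtime(uptime_percent, period=NON_LEAP_YEAR):
    """Return the downtime permitted over `period` at the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_seconds = period.total_seconds() * (100.0 - uptime_percent) / 100.0
    # Round to the millisecond to hide floating-point noise.
    return timedelta(seconds=round(downtime_seconds, 3))


if __name__ == "__main__":
    for level in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
        print(f"{level}% uptime -> {allowed_downtime(level)} of downtime per year")
```

Passing a different period (for example, timedelta(days=30)) gives the
allowance for a month instead of a year.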
It is important to differentiate between scheduled and unscheduled downtime, as
well as service times. Often a periodic, scheduled window of
downtime is deemed acceptable and therefore does not factor into the
nines equation. Also, many businesses are not 7 x 24 operations, so there may
be non-critical times identified during off-hours. Each business is different, and
it is important to determine what matters to your specific business, as opposed to
jumping on the "five nines" bandwagon.
Obviously few systems will achieve 99.9999% uptime, although five nines, or
99.999%, is often achieved in telephone and banking systems. While your
company may not require such a high level of availability, it probably
cannot survive with 90% or 99% uptime either. Even 99.9% will be problematic
for companies that rely heavily on their computer systems in the performance of
their day-to-day business functions. It should also be noted that the costs
associated with a high-nines system can grow almost exponentially as you
extend beyond the 99% range.
Note that while availability refers to the uptime of
the system, reliability means something quite different. Reliability refers
to the likelihood of failure. A reliable system may help ensure high
availability, but when failures do occur, availability is also affected by
serviceability, which is a measure of how easily failures can be corrected.
Serviceability can be more problematic with older and/or less popular equipment.
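One common way to put numbers on this relationship, which this paper does not
otherwise rely on, is the steady-state availability formula: mean time between
failures (MTBF, a reliability measure) divided by the sum of MTBF and mean time
to repair (MTTR, a serviceability measure). The Python sketch below assumes
that formula; the function name and the sample figures are hypothetical.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability as a fraction, using MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: the same reliability with two different repair times.
    for mttr in (1.0, 10.0):
        availability = steady_state_availability(10_000.0, mttr)
        print(f"MTBF 10000 h, MTTR {mttr} h -> {availability:.4%} available")
```

With the failure rate held constant, the tenfold increase in repair time costs
roughly one "nine" of availability, which is why serviceability matters as much
as reliability.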
An Ounce of Prevention is worth a Pound of Cure
You can use many of the same techniques used by companies that require uptime
levels of 99.9% or higher. Doing so helps to ensure that your business does not
suffer needless loss of service or hardware failure in the event of power outages.
Let's explore some of the areas where systems and practices can be reinforced to
prevent unnecessary downtime:
The UPS
Preparation and redundancy are key to availability. Redundant servers are
a commonly used preventive measure to ensure that processing is not
completely halted by the failure of one machine, and RAID arrays help to eliminate
the problems caused by a single disk failure. However, a whole cluster of servers
and their attached RAID storage will be useless if their electrical supply is
disrupted for any reason.
Most companies use some form of uninterruptible power supply, commonly
referred to as a UPS, to support their computer systems. These range from small
battery backup units that will power a single desktop computer long enough to
allow work in progress to be saved, to large industrial units that support
complex server farms and allow for an orderly shutdown of systems to prevent
data loss and long restart periods. These devices may also provide another
critical function, which is to condition the incoming mains current to prevent
dangerous surges and spikes from reaching sensitive equipment. A UPS of a
type that does not do this may also require a few milliseconds to begin
powering the equipment connected to it. In that case it is imperative that the
power supplies in the dependent machines can hold up their output for at least
that many milliseconds after mains current is removed from them. It
would also be highly advisable to provide some form of power conditioning to
protect delicate equipment if the UPS does not provide it.
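The timing requirement above can be checked mechanically once the relevant
figures have been collected from the manufacturers' data sheets. The following
Python sketch lists the devices whose power supplies would not ride through the
switchover. The function name, the optional safety margin, and all of the
sample values are hypothetical.

```python
def at_risk_during_transfer(holdup_ms_by_device, ups_transfer_ms, margin_ms=0.0):
    """Return the sorted names of devices whose power-supply hold-up time
    is shorter than the UPS transfer time plus an optional safety margin."""
    required_ms = ups_transfer_ms + margin_ms
    return sorted(
        name
        for name, holdup_ms in holdup_ms_by_device.items()
        if holdup_ms < required_ms
    )


if __name__ == "__main__":
    # Hypothetical data-sheet values, in milliseconds.
    holdup = {"file-server": 16.0, "old-router": 4.0, "tape-library": 10.0}
    print(at_risk_during_transfer(holdup, ups_transfer_ms=8.0, margin_ms=2.0))
```

Any device the check reports needs either a power supply with a longer
hold-up time or a UPS that does not interrupt its output during the switchover.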
An often-overlooked point of failure is the UPS itself. If only one is in use and
it fails, the situation will be the same as having no UPS, or possibly worse given
the false sense of security. Redundant UPSs are critical to systems that require
high availability. Having redundant UPSs is important, but so is having
cross-wired power supplies. The best configuration that we have seen has
two UPSs, each running at less than 50% of capacity, providing power to each
and every piece of equipment. If one UPS fails, the other has both the
capacity and the connections to sustain business operations (albeit for a reduced
period of time). This is described in more detail below.
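The 50% guideline amounts to a simple capacity check: after any single UPS
fails, the remaining units must still be able to carry the entire load. The
Python sketch below expresses that check. It assumes every piece of equipment
is wired to every UPS, so that the load can shift freely to the surviving
units. The function name and the volt-ampere figures are hypothetical.

```python
def survives_single_ups_failure(capacities_va, total_load_va):
    """Return True if the load still fits on the remaining UPSs
    after any one unit fails. A single UPS never qualifies."""
    if len(capacities_va) < 2:
        return False
    total_capacity = sum(capacities_va)
    # Remove each unit in turn and compare what is left against the load.
    return all(total_capacity - capacity >= total_load_va for capacity in capacities_va)


if __name__ == "__main__":
    # Two hypothetical 1000 VA units sharing a 900 VA load (45% each).
    print(survives_single_ups_failure([1000.0, 1000.0], 900.0))
    # The same units with an 1100 VA load (55% each).
    print(survives_single_ups_failure([1000.0, 1000.0], 1100.0))
```

For two identical units sharing the load evenly, this condition is the same as
each unit running at no more than 50% of its capacity.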
Even redundant UPSs can fail, so it is good practice to test every UPS on a
regular basis. The most common failures are caused by the increasing inability
of the UPS's batteries to hold a charge as they age. Frequent testing will help to
uncover this problem. In addition, many battery systems benefit from the
occasional discharge and recharge cycles achieved during the testing process.
Temperature can also affect battery life and capacity.
Size matters. A common error is to under-size the capacity of the UPS.
Calculating the capacity of the UPS required by your installation is outside the
scope of this paper, but manufacturers of these devices usually have clearly
defined recommendations based on the power requirements of the equipment to
be connected to them. It is far better to have surplus capacity than to have too
little. Undersizing can result in damage to the dependent equipment or failure
of the UPS to function at all, and the cost of replacing inadequate equipment is
usually higher than the cost of purchasing oversized equipment to begin with.
Also, be sure that the UPS you select automatically shuts off when it can no
longer maintain adequate voltage and current levels; otherwise sensitive
equipment may be severely damaged by decreased voltage and/or current as the
UPS's batteries weaken.
Depen