DATA CENTERS, PART-4
Real-world Downtime for each Tier
This photo was taken inside a rented cage at a colocation data center.
This is Part-4. Click here to read Part-1. A new tab or window will open.
“Uptime” refers to the end-user's uninterrupted access to his or her data. “Downtime” refers to any disruption of that access. Both uptime and downtime are measured from the end-user's perspective: downtime runs from the start of the disruption until access is restored, which makes it longer than the disruption itself because of the aftereffects that ripple downstream from the point of failure. Say a Tier-2 center loses power for 30 seconds. Database servers in the middle of multiple transactions would suddenly die. The database would probably not be corrupted, thanks to the built-in safeguards of the database application, but it still takes time for the database administrators to confirm this. Only then, assuming the best, is end-user access restored.
These empirical statistics came from a control group of 16 data centers studied by The Uptime Institute, the creators of the Tier standard.
Tier-1 centers typically experience two separate 12-hour periods of downtime a year because of preventive maintenance. These sites also experience 1.2 component or path failures a year. Tier-1 centers average 28.8 hours of downtime a year (equivalent to 99.67% uptime).
Tier-2 sites typically experience three scheduled maintenance periods every two years and one unexpected outage each year. Tier-2 centers average 22 hours of downtime a year (equivalent to 99.75% uptime).
Tier-3 centers typically experience four hours of downtime every two and a half years—or 1.6 hours a year (equivalent to 99.98% uptime).
Tier-4 sites typically experience four hours of downtime every five years, or about 48 minutes a year (equivalent to 99.99% uptime).
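As a quick sanity check, the uptime percentages quoted above follow directly from the annual downtime figures. The short sketch below converts annual downtime into an uptime percentage; the hour figures are the ones from this post, and the only assumption is an 8,760-hour year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def uptime_percent(downtime_hours_per_year: float) -> float:
    """Uptime percentage implied by a given amount of annual downtime."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)

# Annual downtime figures quoted above
annual_downtime_hours = {
    "Tier-1": 28.8,
    "Tier-2": 22.0,
    "Tier-3": 4 / 2.5,   # four hours every two and a half years
    "Tier-4": 4 / 5,     # four hours every five years (about 48 minutes)
}

for tier, hours in annual_downtime_hours.items():
    print(f"{tier}: {hours:.1f} h/yr of downtime -> {uptime_percent(hours):.2f}% uptime")
```

Running this reproduces the figures above: 99.67%, 99.75%, 99.98%, and 99.99% respectively.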
So far, we know these factors will cause unexpected disruptions:
- Human activities
- Infrastructure and equipment failures
- Acts of God
The human factor
I encountered another factor that will definitely cause a center to shut down: local authorities. Local fire and electrical safety codes may force sites, regardless of tier, to shut down for inspections and tests. Fortunately, these can be planned events.
How long does it typically take to restore access after a momentary disruption? Four hours. Tier level aside, a disruption will require human intervention, and that alone takes time. Would you agree that four hours seems quick for Tier-1 and -2 but, at the same time, too long for Tier-3 and -4? It's about expectations, isn't it?
The higher tiers, -3 and -4, should be built and, more importantly, operated with the capability to withstand subsequent failures triggered by the first failure event. "Failure" should be interpreted broadly, as you will see from these customer examples.
The first involved a Tier-3 center normally staffed by two operators. One of them was on extended leave. One morning, the remaining person called in sick. How did they deal with it? The manager spent the day there. She wasn't trained but fortunately nothing untoward happened.
The second occurred in a Tier-2 room. A cooling pipe beneath the raised floor had sprung a leak. It went undetected for a week until a floor tile was lifted for another reason, revealing a rather wide puddle in the sub-floor. The site had no operators per se; the analysts, programmers, and managers had to deal with it. The leak was discovered mid-morning and was not resolved until close to midnight. Nobody was really responsible for the physical infrastructure and, consequently, nobody was trained.
Facility failures often reveal previously unknown architectural, hardware, or software issues. As you have read, however, more than anything else, disruptions expose deficiencies in human activity. You have to train, practice, and fill the roles properly; otherwise, the human factor will get you.
DATA CENTERS, PART-2
The general attributes of each Data Center Tier are presented below.
This is Part-2. Click here to read Part-1. Click here to read Part-3. A new tab or window will open for each post.
Data Centers are classified into four tiers. Tier-1 refers to a basic facility and Tier-4, to the most reliable and sophisticated type. This post goes into further detail about each tier.
Tier-4
- takes 15 to 20 months to plan and implement
- is the most expensive type and most costly to operate
- is housed in a stand-alone building
- is staffed "24 x 7 x forever"
- intentionally uses only 90% or less of its total load capacity.
- has at least two active distribution paths for connectivity, power, and cooling
- All paths are physically separated and always active. The failure of any single active path will not impact uptime.
- All components are physically separated. The failure of any single subsystem will not impact uptime. All IT equipment is dual-powered and installed so as to be compatible with the site's topology. Any non-compliant end-user equipment is equipped with point-of-use switches.
- Preventive maintenance can be safely done without disrupting operations. Maintenance on any and every system or component can be performed using backup components and distribution paths. The failure of key nexus points will not impact uptime.
- A Tier-4 site has a fault-tolerant infrastructure. The site location is not susceptible to any single major disruption. This extends the capability of the lower tier through the addition of measures that will prevent disruption even when crucial components unexpectedly fail. Tier-3 only allows the preventive maintenance of crucial components and has no safety provision for the unexpected failure of crucial components.
- Dual-power technology requires two completely independent systems that feed power via two paths. Research has determined that 98% of all failures occur between the UPS and the computer load. (A rough availability sketch for redundant paths follows this list.)
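To make the redundant-path idea concrete, here is a minimal sketch of the arithmetic, assuming the two paths fail independently and using a purely illustrative per-path availability figure (0.999 is an assumed value, not one from this post or the Uptime Institute). Two always-active paths lose service only when both are down at the same time.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def dual_path_availability(single_path_availability: float) -> float:
    """Availability of two independent, always-active paths:
    service is lost only if both paths are down at the same time."""
    return 1 - (1 - single_path_availability) ** 2

# Purely illustrative per-path availability (an assumed value).
single = 0.999

for label, avail in [("single path", single),
                     ("dual active paths", dual_path_availability(single))]:
    downtime_hours = (1 - avail) * HOURS_PER_YEAR
    print(f"{label}: {avail:.6f} available, ~{downtime_hours:.2f} h of downtime/yr")
```

With these assumed numbers, a single path implies roughly 8.8 hours of downtime a year, while two independent, always-active paths bring that down to well under a minute.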
Tier-3
- is housed in a stand-alone building
- takes 15 to 20 months to plan and implement
- is typically staffed for two shifts or more
- intentionally uses only 90% or less of its total load capacity
- has at least two paths for connectivity, power, and cooling distribution.
- All paths are physically separated. However, only one path is active at any time. The unexpected failure of an active path will impact uptime.
- All components are physically separated. The failure of any single subsystem will not impact uptime. All IT equipment is dual-powered and installed so as to be compatible with the site's topology. Any non-compliant end-user equipment is equipped with point-of-use switches.
- Preventive maintenance can be safely done without disrupting operations. Maintenance on any and every subsystem or component can be performed using backup components and distribution paths.
- A Tier-3 site has a concurrently maintainable infrastructure. The site location is not susceptible to unexpected minor disruptions. This extends the capability of the lower tier through the creation of a second distribution path for connectivity, power, and cooling.

Tier-2
- may be housed in a wing or floor of an existing building
- takes three to six months to plan and implement
- is typically staffed for one shift
- has only one path for power and cooling; may have a second path for connectivity
- has backups only for critical power and cooling components, e.g., extra UPS batteries, cooling units, chillers, pumps, and engine generators
- The unexpected failure of any component or path will impact uptime.
- Operational errors will likely cause a disruption.
- The site location is susceptible to all kinds of disruptions. The infrastructure must be shut down to safely perform preventive maintenance.
Tier-1
- may be housed in a room or wing of an existing building
- typically takes less than three months to plan and implement
- is not staffed
- has only one path for connectivity, power, and cooling
- The unexpected failure of any component or path will impact uptime.
- Operational errors will cause a disruption.
- The site location is susceptible to all kinds of disruptions. The facility must be shut down to safely perform preventive maintenance.
Despite its basic infrastructure, a Tier-1 center still provides a better IT environment because:
- It offers dedicated space.
- Its online UPS system does a better job than a standby UPS at filtering power spikes, compensating for sags, and covering momentary outages.
- It has nonstop, dedicated cooling equipment.
- It has an engine generator to withstand extended power outages.