Real-world Downtime for each Tier
Photo taken inside a rented cage at a colocation data center.
This is Part-4.
“Uptime” refers to the end-user's uninterrupted access to his or her data. “Downtime” refers to any disruption to this access.
Both uptime and downtime are measured from the end-user perspective. Downtime, for example, is measured from the moment the end-user loses access until that access is restored. It is therefore always longer than the underlying disruption itself, because of the aftereffects that occur downstream of the point of disruption. Let's say a Tier-2 center experienced a power outage for 30 seconds. Database servers in the middle of multiple transactions would suddenly die. The database would probably not get corrupted, thanks to the built-in safeguards of the database application. Still, it'll take time for the database administrators to confirm this. Only then, assuming the best, will end-user access be restored.
These empirical statistics came from a control group of 16 data centers studied by The Uptime Institute, the creators of the Tier standard.
Tier-1 centers typically experience two separate 12-hour periods of downtime a year because of preventive maintenance. These sites also experience 1.2 failures a year of their components or paths. Tier-1 centers average 28.8 hours of downtime a year (equivalent to 99.67% uptime).
Tier-2 sites typically experience three scheduled maintenance periods every two years and one unexpected outage each year. Tier-2 centers average 22 hours of downtime a year (equivalent to 99.75% uptime).
Tier-3 centers typically experience four hours of downtime every two and a half years, or 1.6 hours a year (equivalent to 99.98% uptime).
Tier-4 sites typically experience four hours of downtime every five years, or about 48 minutes a year (equivalent to 99.99% uptime).
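If you want to check the uptime percentages yourself, here is a minimal Python sketch. It assumes a 365-day year (8,760 hours) and uses the annual downtime figures quoted above. The function names are mine, for illustration only, and are not part of the Tier standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def annualized_hours(outage_hours: float, every_n_years: float) -> float:
    """Spread an outage that recurs every N years into hours per year."""
    return outage_hours / every_n_years


def uptime_percent(downtime_hours_per_year: float) -> float:
    """Convert annual downtime in hours to an uptime percentage."""
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


# Annual downtime per tier, taken from the figures in the text.
tier_downtime_hours = {
    "Tier-1": 28.8,
    "Tier-2": 22.0,
    "Tier-3": annualized_hours(4, 2.5),  # four hours every 2.5 years
    "Tier-4": annualized_hours(4, 5),    # four hours every 5 years
}

for tier, hours in tier_downtime_hours.items():
    print(f"{tier}: {hours:.1f} h/year -> {uptime_percent(hours):.2f}% uptime")
```

Rounded to two decimals, this gives 99.67%, 99.75%, 99.98%, and 99.99% for Tier-1 through Tier-4.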
So far, we know these factors will cause unexpected disruptions:
- Human activities
- Infrastructure and equipment failures
- Acts of God
The human factor
I encountered another factor that will definitely cause a center to shut down. Local authorities. Local fire and electrical safety codes may force sites, regardless of tier, to shut down for inspections and tests. Fortunately, these can be planned events.
How long does it typically take to restore access from momentary disruptions? Four hours. Tier level aside, a disruption will require human intervention. That alone takes time. Would you agree that four hours seems quick for Tier-1 and Tier-2 but, at the same time, seems too long for Tier-3 and Tier-4? It's about expectations, isn't it?
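The four-hour figure also appears to explain the annual averages quoted earlier. The sketch below adds planned maintenance to unplanned outages at four hours each. The Tier-1 inputs come straight from the text. For Tier-2, the 12-hour maintenance window is my assumption (the text states it only for Tier-1), and the function name is hypothetical.

```python
RESTORE_HOURS = 4.0  # typical time to restore access after a disruption (from the text)


def annual_downtime_hours(maint_windows_per_year: float,
                          hours_per_window: float,
                          outages_per_year: float,
                          restore_hours: float = RESTORE_HOURS) -> float:
    """Planned maintenance hours plus unplanned outage hours, per year."""
    planned = maint_windows_per_year * hours_per_window
    unplanned = outages_per_year * restore_hours
    return planned + unplanned


# Tier-1: two 12-hour maintenance periods and 1.2 failures a year.
tier1 = annual_downtime_hours(2, 12, 1.2)

# Tier-2: three maintenance periods every two years and one outage a year.
# The 12-hour window length is ASSUMED here; the text gives it only for Tier-1.
tier2 = annual_downtime_hours(3 / 2, 12, 1)

print(f"Tier-1: {tier1:.1f} hours/year")
print(f"Tier-2: {tier2:.1f} hours/year")
```

Under those assumptions the totals come to 28.8 and 22.0 hours a year, matching the averages above.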
The higher tiers, Tier-3 and Tier-4, should be built and, more importantly, operated with the capability to withstand subsequent failures triggered by the first failure event. "Failure" should be interpreted broadly, as you will see from these customer examples.
The first involved a Tier-3 center normally staffed by two operators. One of them was on extended leave. One morning, the remaining person called in sick. How did they deal with it? The manager spent the day there. She wasn't trained but fortunately nothing untoward happened.
The second occurred in a Tier-2 room. A cooling pipe beneath the raised floor had sprung a leak. It went undetected for a week until a floor tile was picked up for another reason. A rather wide puddle had formed in the sub-floor. The site had no operators per se. The analysts, programmers, and managers had to deal with it. It was discovered mid-morning and was not resolved until close to midnight. Nobody was really responsible for the physical infrastructure and, consequently, nobody was trained.
Facility failures often reveal previously unknown architectural, hardware, or software issues. As you have read, however, disruptions expose human activity-related deficiencies more than anything else. You have to train, practice, and fill the roles properly; otherwise the human factor will get you.