Philosophy to Astronomy: SRDF

Sunday, August 12, 2007

BUSINESS CONTINUITY

One step beyond Disaster Recovery

I recently advised a medium-sized commercial bank in the Philippines about a stalled project to create a business continuity solution.

Financial institutions in the Philippines do not face the same data integrity and safety requirements that their counterparts face here in the U.S. Still, management knew that they had to improve their IT capabilities. Their primary data center is located in their head office, and its vulnerability surfaced at every coup attempt.

They learned about me from another client.

The bank was trying to install an EMC Asynchronous SRDF solution.

I briefly worked for EMC U.S.A. as a systems engineer. I’m familiar with the product line and the subject of disaster recovery & business continuity in general.

Disaster Recovery (DR) aptly describes the process of recovering from a disaster.

DR can be illustrated with the knowledge that all hard drives crash. It’s not a question of “if,” but a question of “when.” When the drives of a “production box” crash, business grinds to a halt unless and until the data can be restored and the server restarted. The process of restoring the data and restarting the server is disaster recovery.

A “production box” is tech-speak for a computer server that’s serving a live network.

Operations can grind to a halt for any number of reasons. Fire, a software crash, human error, network failure, and a power blackout are common culprits.

DR planning begins by defining the maximum acceptable values of two factors. The first is called the Recovery Time Objective (RTO) and the second is the Recovery Point Objective (RPO).

RTO is the maximum amount of time you can afford to spend recovering your lost or damaged data before you must be operational again. Can your business tolerate being down for several days or only several hours? Whether it’s days or hours, this figure is your RTO.

RPO, on the other hand, is the amount of data accumulated over time that you can tolerate losing. Can your business afford to lose a day’s worth of data? If so, then your data must be backed up on a daily basis. A retail operation, like a supermarket, that logs hundreds or thousands of transactions a day may require several backups made during the course of the day.
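To make the RPO idea concrete, here is a minimal sketch of the arithmetic. It assumes simple periodic backups, where a failure just before the next backup loses one full interval of data. The function names are my own, for illustration only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: with periodic backups, the worst case is a failure
    just before the next backup, which loses one full interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss stays within the tolerated RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def backups_per_day_needed(rpo: timedelta) -> int:
    """Smallest number of evenly spaced daily backups that satisfies the RPO."""
    day = timedelta(days=1)
    # Negating before and after floor division yields a ceiling division.
    return -(-day // rpo)
```

Under these assumptions, a business that can tolerate losing a day of data is covered by a daily backup, while a supermarket that can only tolerate losing four hours of transactions needs at least six backups a day.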

“Business Continuity” (BC) extends the scope of preparation, plans, and resources past DR. Those two factors, RTO and RPO, figure into this as well.

BC’s goal is to ensure the business will be able to continue operating through crises and disasters. Accomplishing that requires going beyond the processes and equipment for restoring data and replacing equipment. Indeed, BC refers to making plans and preparing resources that, among other things, will prevent the loss of data. It refers to advance preparation in order to cope with the unexpected.

A good BC plan has:

identified the most likely disaster scenarios and their impact on the business;
determined the “mission-critical,” important, and less-important processes, systems, and services of the company;
established its priorities for supporting the mission-critical components;
developed and implemented the most redundant and fault-tolerant system possible within its budget;
several alternate strategies;
been taught to its people and is regularly practiced with them; and
the continuing support of senior management.

“Mission-critical” is tech-speak for the most important processes, systems, and services that a business must have in order to fulfill its mission. What is a mission? For a hospital, it could be the 24/7 availability of patient information.

“Redundant” is tech-speak for a backup that can temporarily take the place of a failed primary system.

“Fault-tolerant” is tech-speak for the characteristic of being able to withstand glitches.

Certain industries and companies require uninterrupted IT services. For them, BC is mandatory. The airline industry and financial institutions are examples. The financial sector, in fact, has to follow stringent guidelines for protecting and maintaining the security of its data. These companies must have minimal downtime. How minimal?

A calendar year has 8,760 hours. To give you an idea of the pressure to perform, consider that a 99.9% uptime is “only” equivalent to about 8,751 hours, which leaves roughly nine hours of downtime.

Imagine the trouble a bank would face if its nine non-operational hours occurred on the 15th. Employees would not receive their pay.

It turns out that a 99.99% uptime is required to stay operational 8,759 hours of the year! That’s still almost one hour short of the goal!
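The uptime figures above are simple arithmetic, and a short sketch makes it easy to try other availability levels. It assumes a non-leap year of 8,760 hours. The function names are my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def uptime_hours(availability_percent: float) -> float:
    """Hours per year the system is up at the given availability."""
    return HOURS_PER_YEAR * availability_percent / 100


def downtime_hours(availability_percent: float) -> float:
    """Hours per year the system is allowed to be down."""
    return HOURS_PER_YEAR - uptime_hours(availability_percent)


for pct in (99.9, 99.99):
    print(
        f"{pct}% uptime: {uptime_hours(pct):,.2f} hours up, "
        f"{downtime_hours(pct) * 60:,.0f} minutes down"
    )
```

At 99.9% the allowed downtime works out to about 8.76 hours a year. At 99.99% it shrinks to roughly 53 minutes.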

When the availability or integrity of data is compromised for any reason, businesses risk losing revenue and market share, experiencing decreased productivity, damaging their reputation, eroding their customers’ loyalty, and, in certain industries, being penalized for failing to comply with mandated regulations.

I enjoy BC planning because it's an activity that can incorporate numerous improvements for little or no additional cost. It's a rare opportunity to deliver a lot of added value beyond the client's initial expectations.

There are several ways to go with DR and BC. You can create it in-house or outsource some or all of its aspects.

I'll cover both but the next entry will focus on the offerings of two established players in the field of storage, DR, and BC. These are the two I’m familiar with, EMC and NetApp.


Tuesday, July 17, 2007

BUSINESS CONTINUITY PLANNING

Planning takes four steps

It took a while, but business continuity planning (BCP) has finally become visible on the radar screen of managers and owners of smaller businesses (under $100 million in sales). It’s about time, too. The world today is far more volatile than it was a mere eight years ago. 9/11 did change everything.

Every organization should plan for its continued existence in the event of a major disruption. How will it continue to operate if its operation—and existence—is disrupted by any number of natural or man-made disasters?

The practice of Business Continuity Planning (BCP) has evolved into a recognized field. Job titles that carry or imply this area now exist. Practitioners can join any number of reputable associations that promote this field. Several recognized certifications can now be earned as well.

I had the good fortune of working as a Sales Systems Engineer for the world’s largest enterprise storage vendor just before the dot-com crash. I’m referring to EMC, the 800-pound gorilla of the enterprise storage space. At that time, the basic rationale behind EMC’s fabulously expensive SRDF (Symmetrix Remote Data Facility) was real-time replication for disaster recovery (DR). Under the proper guidance, it can be a short leap from DR to BCP. And that is where SRDF is now positioned: as the lynchpin of the data side of business continuity planning.

The mission of a Systems Engineer who works in Sales is to support his sales reps by designing the storage and DR solutions for customers and prospects alike. To him falls the technical aspect of any proposal or project. This frequently involves making technical presentations to prospects and serving as the single point of contact for existing customers that are contemplating system upgrades.

Disaster recovery (DR) is a subset of the BC solution. Many fine definitions of the term abound so rather than reinvent the wheel, I will quote some of the better ones. Disaster recovery is:

the process, policies and procedures of restoring operations that are critical to the resumption of business [Wikipedia].
the ability of an organization to respond to a disaster or an interruption in services by implementing a disaster recovery plan to stabilize and restore the organization’s critical functions. [Disaster Recovery Journal].

Wikipedia goes on to say that…

a disaster recovery plan (DRP) should include plans for coping with the unexpected or sudden loss of communications and/or key personnel, although these are not covered in this article, the focus of which is data protection. Disaster recovery planning is part of a larger process known as business continuity planning (BCP).

Disaster Recovery Journal continues as well…

The management approved document that defines the resources, actions, tasks and data required to manage the technology recovery effort. Usually refers to the technology recovery effort. This is a component of the Business Continuity Management Program.

The two definitions share a common thread: both refer to business continuity planning and place disaster recovery within its larger scope.

I will continue this in a subsequent post. For now, let me break down the steps that BCP entails. The process follows these four steps in a logical sequence.

Identification

Identify the risks and hazards that confront your business. These can be natural hazards, e.g., flooding and earthquakes, or man-made risks, e.g., a power outage, theft, fire, or an attack against your computer network. Obviously you have to draw the line at some point, since it is impractical to anticipate some risks regardless of their severity. For example, two key members of an SAP implementation project I participated in died in an accident. That incident delayed a major portion of the entire project until replacement personnel were hired.

Assessment

It is possible to quantitatively and qualitatively determine the likelihood, magnitude, and duration of the identified risks. Assessing risks this way allows you to prioritize them. When risks are categorized this way, you can budget your resources more rationally.
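One way to picture this step is a small risk register that scores and ranks each risk. The sketch below is only an illustration: the scoring formula (likelihood times magnitude times duration), the scales, and the sample figures are all my own assumptions, not a standard method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float     # assumed chance of occurring in a year, 0 to 1
    magnitude: int        # assumed impact scale, 1 (minor) to 5 (severe)
    duration_days: float  # assumed length of the disruption

    @property
    def score(self) -> float:
        # Hypothetical scoring rule: a simple product of the three factors.
        return self.likelihood * self.magnitude * self.duration_days


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Made-up sample figures, for illustration only.
risks = [
    Risk("Power outage", likelihood=0.9, magnitude=2, duration_days=0.5),
    Risk("Flooding", likelihood=0.2, magnitude=4, duration_days=5),
    Risk("Network attack", likelihood=0.3, magnitude=5, duration_days=2),
]

ranked = prioritize(risks)
for risk in ranked:
    print(f"{risk.name}: {risk.score:.1f}")
```

With these sample figures, flooding ranks first, the network attack second, and the frequent but brief power outage last. A real assessment would weigh the factors to suit the business rather than multiply them blindly.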

Plan Development

You now have the information to create the plans and procedures for preparing your organization to respond to and recover from interruptions. This is a high-level step and, as the saying goes, the devil is in the details. This is where senior management, which should have initiated this project to begin with, should return and visibly support the BCP team. The team will need the time to extensively discuss the risks and possible solutions with functional heads. Without that support, the team will find it difficult to get the attention of the functional heads, much less their wholehearted cooperation.

Exercise

In this final step you must exercise the plan. This is the only way to learn what works and what does not. Needless to say, this is another step that senior management must support. Exercising the plan is a continuing activity. In fact, this entire process is performed iteratively. Exercising the BC plans will refine those plans and, more importantly, teach the employees how to respond if and when the real event happens.



Alex Pronove
