Sunday, July 29, 2007

DISASTER RECOVERY FROM A COUP D'ETAT

Business continuity in action!

Four years ago this month, a disaster recovery solution we created proved its worth. It saved our client, an international property consulting firm, from tanking after an attempted coup d’etat in 2003.

The Philippines has been wracked by four or five coup attempts in the last ten years. That works out to one every two to two and a half years!

The renegades, 300 heavily armed soldiers and their leaders, barricaded themselves in several buildings in Makati’s business district. The standoff lasted for 19 hours before they surrendered. During the crisis, authorities turned off the power grid that served the contested area. My client’s office was within that grid. After it was over, her staff returned to a ransacked office.

Other companies in her building were not so fortunate. In fact, my client was the only one who was able to restore her data and resume operations as if nothing had happened.

About half of her neighbors (branch offices of large companies as well as individual businesses) did not back up at all. As for the other half, their IT managers kept backups offsite by taking the media home. I learned that many of them discovered their backups were either too old or could not be restored.

The latter didn't surprise me. In smaller shops, many IT administrators diligently back up their data but neglect to verify it by regularly doing test restores. Back when I was a network engineer, I was installing Citrix on a 25-desktop network. The client had two Windows NT servers and had always used Windows NT's built-in backup utility. Knowing that utility's notorious reputation, I challenged them to restore their data before I continued my work. Their office manager, who doubled as the IT administrator, pulled out seven tape cartridges. One by one, she tried to restore the contents of each tape. And one by one, she discovered they were empty. In fact, if I recall correctly, the MS-DOS directory listing revealed a single empty folder on each tape. That's how we sold a lot of ARCserve software back then!

As for my client, we had added more storage months earlier, and I persuaded them to configure it to do double duty as a disaster recovery system. The hardware was kept, quite literally, in a cabinet closet down the hall. The system consisted of a NetApp NAS (Network Attached Storage) appliance and Symantec's Backup Exec (System Recovery version).

I was back in the U.S. when this happened, but I was able to talk them through the procedure. They were up and running by the end of the day!
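The test restores mentioned above are easy to neglect and easy to automate. Here is a minimal sketch in Python, assuming a plain folder backed up to a compressed tar file rather than to tape; it does not reflect how the NT backup utility, ARCserve, or Backup Exec actually work. It restores the archive into a scratch directory and compares SHA-256 checksums against the live files, so an empty archive (the "blank tape" case) fails the check. The function names `backup`, `file_digests`, and `verify_by_restore` are hypothetical.

```python
from __future__ import annotations

import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write every file under source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(source).as_posix())


def verify_by_restore(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it file by file.

    Returns False for an empty archive or for any missing, extra, or altered file.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python provides it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(scratch, filter="data")
            else:
                tar.extractall(scratch)
        restored = file_digests(Path(scratch))
    return bool(restored) and restored == file_digests(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        data = Path(work) / "data"
        data.mkdir()
        (data / "clients.txt").write_text("client records")
        archive = Path(work) / "nightly.tar.gz"
        backup(data, archive)
        print("restore test passed:", verify_by_restore(data, archive))
```

If the live files have legitimately changed since the backup was taken, the comparison also fails. That is deliberate: it flags a backup that is older than the data it is supposed to protect, which is exactly the "too old" problem her neighbors ran into.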



Saturday, July 21, 2007

BUSINESS CONTINUITY

How to turn it into a competitive advantage.


Business continuity. Disaster recovery. Colocation. Backup service providers. All of these have at least one thing in common: storage.

SAAS. SOA. High availability. Regulatory compliance. Risk management. Virtualization. These topics were steps in a thread that followed this sequence:

We plan to deploy our proprietary applications over the web to serve our user communities (Software As A Service, or SAAS).

In fact, we should start re-orienting our entire information system to become more agile and flexible (Service-Oriented Architecture, or SOA).

If we do that, we’ve got to ensure our user communities have 24x7x365 accessibility (high availability).

At the same time, the law requires us to safeguard our clients’ information from unauthorized access (regulatory compliance).

Now, what are the risks that stand in our way? How can we mitigate most or all of those (risk management)?

A primary way to do that is to create shared pools of computing assets (server and storage virtualization).

That was the actual train of thought of a client's CIO. It eventually became the company's blueprint for the nature and sequence of its IT investments. Senior management had given him the mission of ensuring that IT would support the company's business goals. At that time, the company was a four-year-old start-up that specialized in providing back-office services to hospitals.

I became their consultant on the strength of my experience at the second-largest U.S. provider of the same service. We provided that service over WANs. That was in the late 90s. The CIO above planned to provide it through web-based applications. It's remarkable how much can change in eight years.

WANs are Wide-Area Networks. These are networks that connect far-flung branches to the mother ship and, occasionally, directly to each other. WANs are ubiquitous. Walgreens, Wal-Mart, and Home Depot are examples of companies that use WANs extensively. You'll know you're dealing with a WAN when you can buy an item at one store and return it at another. Those WANs do not necessarily work in real time, but here's one example that does: bank ATMs (Automated Teller Machines). “Real time” means as it happens. Monday Night Football, for instance, is aired in real time.

Here's an overview of these IT objectives. I've been on both sides of the fence, as a CIO and as a vendor, so I can present both perspectives.

The IT Goal exists to accomplish the Business Objective.


The IT Deliverable will indicate the attainment of the IT Goal.

The IT Goal can be classified according to its nature. It may be an “O,” which stands for Offensive, or a “D,” which stands for Defensive. An O goal confers a competitive advantage on the organization. A D goal is meant to protect the organization's assets. As it turns out, the goals discussed here are all Ds. Later on, I'll explain how these Ds can become Os.

I'll present only the first four: business continuity, disaster recovery, colocation, and backup service providers. They all revolve around storage.



If the IT GOAL is Business Continuity, then the BUSINESS OBJECTIVE must have been:
to keep the business operating through crises and disasters. While it may not be possible to operate every aspect of the business, all mission-critical aspects must continue to operate. Interruptions should be kept to a minimum. The severity of the impact should be limited. Minimize financial losses!

The IT DELIVERABLE, in that case, would have been:
an infrastructure, equipped with the appropriate technology and staffed by trained people, that is always prepared to provide IT services during crises and disasters. Its components include the IT goals described below, namely disaster recovery, colocation, and backup service providers. Staff should be trained and the plan rehearsed periodically.



If the IT GOAL is Disaster Recovery, then the BUSINESS OBJECTIVE must have been:
to get back up and running as quickly as possible in the event of a major disruption.

The IT DELIVERABLE, in that case, would have been:
a combination of elements that will allow operations to resume in the event of a disaster. Some of these elements are the colocation services and backup service providers discussed below.



If the IT GOAL is Colocated Services, then the BUSINESS OBJECTIVE must have been:
to provide standby redundancy and data protection from a geographically distant location.

The IT DELIVERABLE, in that case, would have been:
a fully functional IT configuration housed in a facility some distance from the primary data center. "Colocate" was coined to describe the service of renting space in a "hardened" facility to house backup equipment.

If live production data is continuously sent to the backup site, it's referred to as a "hot" configuration. If not, it's a "cold" one. Either way, the purpose of a colocated service is to take over in the event of an emergency. A hot site should be able to take over within 24 hours of the start of the incident. A cold one will take much longer; its transition time is measured in days.



If the IT GOAL is a Remote Backup Service, then the BUSINESS OBJECTIVE must have been:
to safely store data that stays accessible (through the Internet) in the event of a major disruption.

The IT DELIVERABLE, in that case, would have been:
an ongoing subscription to the services of a third-party company. Your data is stored on their equipment and administered by their people. You should pay for the highest level of support with this option. Even so, it will take days to restore your data and resume operations.

Think about it. First, you might need to replace your equipment. Second, you would need to prepare it (format the drives, install the software, etc.). Third, you might need to restore the previous IT environment (user accounts, etc.). Fourth, you need to retrieve your data (probably on CDs). Finally, you can resume operations. Even that is an oversimplification. I've gone through this with different clients, and it has always caused them a huge headache and given us a small windfall.

In one incident involving a metro-area trucking company, a lightning storm fried most of their desktops and everything in their server room. It took us five calendar days (we worked through a weekend) to receive the correct hardware and software. We bought replacement servers, desktops, and laptops. We had to reorder because some items were incorrect or incomplete. We also had to buy an upgraded version of their proprietary software. We dealt with five different vendors, three of which required immediate payment in full. This, in turn, put additional pressure on the owners, who were already scrambling to deal with their insurer, their bank, the landlord, and their staff. They had also had to hire extra help, since they were running the business on paper.

It took another day and a half to prepare everything. Together, those two steps took nearly seven days and required two of us. It took another two days to configure their IT environment. We learned that their proprietary software required agents installed on all of its clients. In other words, their trucking industry software required each and every desktop that used it to be specifically prepared. We were constantly on the phone with them. Our hassles didn't stop there. They had been using an older version of this software, so when we installed the current version, our client's analyst had to adapt the stored data to it. So finally, nine days after the incident, we began restoring the data.

I skimmed over many details. For instance, I didn't mention the hotel room we rented, or the Internet connection that had to be set up (this was the late 90s). We moved back into the office on the 14th calendar day, and they felt comfortable enough to let us go after the 16th day.

This experience taught us many lessons. One of them is the importance of selecting the appropriate contingency solution.



Everything begins with a formal process of contingency planning. Senior management should support this. It’s highly advisable for the Chief Operating Officer (COO) and the Chief Information Officer (CIO) to become active members of the planning team. In broad strokes, the team should do a comprehensive assessment of the risks the company faces, their likely impact, and the company’s various plans of action.
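The assessment of risks and their likely impact is often recorded as a simple risk register. One common convention, not necessarily the one your planning team will adopt, scores each risk as likelihood times impact on a 1-to-5 scale and reviews the highest scores first. The Python sketch below assumes that convention; the `Risk` class, the `prioritize` function, and every entry, score, and plan in the register are hypothetical examples loosely drawn from the incidents in these posts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int      # hypothetical scale: 1 (negligible) to 5 (severe)
    plan: str        # the planned response if the risk materializes

    @property
    def score(self) -> int:
        # Assumed convention: exposure = likelihood x impact.
        return self.likelihood * self.impact


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return the risks ordered from highest to lowest exposure score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries for illustration only.
register = [
    Risk("Power cut off during civil unrest", likelihood=3, impact=5,
         plan="Restore from the disk-based recovery system"),
    Risk("Lightning strike destroys server room", likelihood=2, impact=5,
         plan="Restore to replacement hardware"),
    Risk("Backup media cannot be restored", likelihood=4, impact=4,
         plan="Schedule regular test restores"),
]

if __name__ == "__main__":
    for risk in prioritize(register):
        print(f"{risk.score:>2}  {risk.name}: {risk.plan}")
```

With these made-up numbers, untested backup media ranks first, ahead of the power cut and the lightning strike. The point of the exercise is less the arithmetic than forcing the team to write down a plan of action next to every risk.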

Contingency planning is important for another reason. The discussion can take a proactive turn if the technical side can show the business side the potential of a carefully configured business continuity solution.

I’ve seen it happen several times. The light bulb goes on and suddenly the business side gets “it.”

“It” is the proposition that a carefully configured solution will enable new processes. These new processes, in turn, can become competitive advantages. In so doing, the Ds suddenly turn into Os.

I'll explain that in a forthcoming article.

