Data Center Vulnerabilities: Why are Modern Data Centers Failing?

The year 2008 began with a dire prediction from Subodh Bapat, a vice president in the eco-computing team at Sun Microsystems, when he declared, "You'll see a massive failure in a year." He went on to say, "We are going to see a data center failure of that scale," referring to the worm that took down 5% of worldwide UNIX boxes in 1988.¹

This time he isn't citing security lapses as the root cause but rather failure caused by the massive computing power required to run today's applications.

Though certainly an extreme position, the past year has seen a rash of data center failures that brings into question how reliable single data centers are for the delivery of mission-critical applications.

Vulnerabilities ranging from the most common, like natural disasters and infrastructure failure (data center power outage, burst pipes, construction work damaging fibre lines), to hardware failure, storage or database failure and common software problems have been causing regular disruptions to businesses and come with a high price tag.

Recent events in the news support the fact that even with good planning, resourcing and design, some of the most sophisticated facilities can still experience catastrophic failure.

Last summer, the state-of-the-art 365 Data Center in San Francisco – built at a cost of more than $125 million – was offline for hours due to a Pacific Gas & Electric power grid outage that put a significant portion of San Francisco in the dark. Subsequently, the backup generators at the facility also experienced failure and had to be manually started.²

"When researching data centers, new facilities often boast N+2 levels of redundancy," says Roger Smith, V.P. of Operations at Ceryx Inc. "However, as these same facilities fill up and age, that often becomes N+1, or in some areas no redundancy at all."

According to Sun Microsystems executives, the typical life span of a data center is only about 10 to 12 years, and many data centers – built at the beginning of the dot-com era – now need to be rebuilt.

"As the person who is accountable for uptime I have to balance which applications are considered critical by upper-management and clearly communicate the cost and investment required to provide high-availability," says Roger Smith. "When you present the facts, it becomes clear to everyone that an in-house data center couldn't possibly provide the levels of redundancy required and even on a co-location level we would need redundancy."

In many cases, no contingency plan could avoid the issues that plague individual data centers. On July 14th of this year, the Peer 1 data center in downtown Vancouver – one of the largest facilities in Canada – was offline for almost an entire day. An underground fire caused massive power outages throughout downtown Vancouver. While backup generators at Peer 1 started without issue, the water-based cooling system failed as firefighters – in their attempt to douse the fire – depleted the water pressure required to keep the cooling systems operational. This caused the backup generators to overheat, and any failover to UPS was limited to a short battery life.³

In a similar event this summer, The Planet, a prominent hosting provider in Houston, experienced a major explosion in their data center, taking more than 9000 customer servers offline for several days. Backup generators worked perfectly, but again the fire department would not allow the facility to resume power until it was deemed safe. In some cases servers were physically migrated to a new facility.

In the aftermath of this disaster The Planet was applauded for their response to the crisis: allocating every resource they could to address the problem, proactively communicating status reports and issuing SLA credits.

Google, whose Enterprise App customers experienced multiple outages on August 6th, 11th and 15th of this year, took a more reactive stance, promising to build a communication dashboard and issuing a blanket credit for all customers, regardless of whether they were impacted by the outage.

The real question remains: what is the cost of data center failure and the resulting downtime for organizations? Is it covered by SLA credits? Most SLA credits reflect the cost of the services rendered and almost never provide for business losses.

At the Continuity Insights Management Conference in 2006, Agility Recovery Solutions stated that 78% of businesses that suffer a catastrophe without a contingency plan are out of business within 2 years, and that 90% of companies unable to resume business operations within 5 days of a disaster are out of business within 1 year.

Clearly some applications are considered more critical and have more visibility than others. Large companies feel the impact immediately when their ERP, CRM ([...] is still plagued by a prime-time outage more than two years ago caused by a failure with an Oracle Database Cluster), Business Intelligence or e-mail systems become unavailable.

However, with the proliferation of mobile devices and 'everywhere access', e-mail clearly stands out as the premier mission-critical application of today. Systems like Lotus Notes® and Microsoft® Exchange maintain a living record of a company's existence, storing every activity, process and thought an organization and its employees have. It's no surprise public companies are now required to maintain a record of e-mail activity for compliance purposes.

While the vast majority of businesses rely on e-mail every day to send contracts, proposals, quotes and the majority of correspondence, most e-mail systems have not yet reached the point of reliability that phone service provides (99.999%, or roughly 5.3 minutes of downtime per year).

According to Osterman Research, most North American businesses experience more than one e-mail outage every month, and many indicate that they could lose more than $100,000 as the result of a single major e-mail outage.¹

Osterman also found that the average business experiences nearly seven hours of e-mail downtime every year and that outages can bring many workers, who on average are 25% less productive during e-mail downtime, to a virtual standstill.

"Forget the fact my billing rate gets impacted if I can't access my email system," says a partner at a major North American law firm who prefers to remain anonymous. "My company image gets tarnished immeasurably when I am working on a multi-million dollar, highly-confidential deal and I have to send out a set of documents using my Hotmail account because my email system is down. Somebody gets fired for that."

Michael Osterman goes on to say, "Organizations are not meeting their targets for messaging system availability," and adds that the average e-mail system experiences about 70 minutes of downtime during a typical month, which translates to 99.84% uptime. To this he poses the question, "Is this good enough?" ¹¹

Ceryx Inc., a Hosted Microsoft Exchange provider with data center facilities in Canada and the United States, doesn't think so. They were the first in the industry to offer a real 100% SLA based on their multi-data center architecture and software design. Customers' data is replicated in real-time and resides in both data centers – more than 500 miles apart – so that even in the event of catastrophic failure, the primary system would fail over with almost no impact to the end-user.

"We operate on the premise that even the best data center can and will experience failure due to circumstances beyond anyone's control," says Dr. David Penny, CIO at Ceryx. "We focus our R&D on keeping the application highly available and rely on our replication technology to mitigate the vulnerabilities that exist on the data center level. And then we make the operating and capital investments necessary to execute daily."

For the past 4 years Penny and his team have worked with Enterprise Messaging systems, like Lotus Notes and Microsoft Exchange, developing technology to deliver high availability. Since 2004 they have been providing a geo-replicated Microsoft® Exchange 2003 service to medium and large-sized companies who see the cost and performance benefits of the Ceryx solution.

Most recently Dr. Penny and his team have been working with Geographic Clustering in Server 2008 and native Microsoft Exchange 2007 CCR (Cluster Continuous Replication) technology. What this allows for is clustering over a wide area network. Traditional clusters, which rely on the same RAID system in order to continue to function properly, are susceptible to logical corruption and certain physical corruptions that can propagate across an entire RAID array, causing complete failure. Geo-Clustering eliminates the reliance of redundant servers on the same set of disks, thereby eliminating a very common single point of failure.

"Even with WAN replication we need to ensure that the corruption itself isn't replicated," says Dr. Penny. For this they are utilizing log-shipping with delayed application rather than block-level replication, thereby avoiding the replication of corruptions caused by application defects. By monitoring performance on the primary system closely they can stop bad changes from being committed to the secondary system.

Beyond the physical vulnerabilities of a single data center, Ceryx is protected against a number of other vulnerabilities anyone using a single data center is exposed to. "When negotiating our contract, our provider knows how easy it is for us to move facilities," says Roger Smith. "The data is already replicated and we don't need to physically migrate servers. Migration to a new facility can occur without any impact to our customers. We can't be held hostage to a bad contract or radical increases in pricing or continued poor performance."

Ceryx also has a lot of flexibility where routing is concerned. Should a backbone be down or congested, Ceryx, with front-end servers operating at both facilities, can route traffic through a separate facility and bypass potential network congestion that can plague operators running out of a single data center.

While there are a number of solutions in the market that provide continuity through an interim e-mail system in the event of downtime, the Ceryx system is different in that it doesn't require the user to even change settings when the e-mail system fails over to the secondary facility. Moreover, things like e-mail history, sent items and calendar entries all remain intact. In this respect the Ceryx solution is not a continuity solution but rather a high-availability solution that provides layers of redundancy, from the software level up to the facility level.

Hosted archiving solutions – a good plan for any company facing regulatory and legal compliance – also provide a layer of assurance and access to e-mail records, should the primary facility suffer complete failure. However, these solutions will not provide business continuity or availability.

Moreover, if the primary e-mail provider experiences failure due to data corruption, the data being archived may be corrupt as well. Large data stores, even at the mailbox level, lead to corruption, and the current trend of Hosted Exchange vendors selling e-mail accounts with massive storage allowances is introducing a higher probability of data corruption and subsequent failure. A good archiving strategy can be used to keep mailbox sizes manageable and subsequently reduce the likelihood of corruption.

So while extremely valuable in today's world of mission-critical e-mail, archiving to an external hosted facility should not be mistaken for a multi-data center strategy. Archiving is a good backup plan, but it will not provide the protection businesses today need against the inevitable vulnerabilities that exist with a single-data center strategy.

These vulnerabilities are typically covered in the fine print of a facility's SLAs, under the term "Force Majeure", a phrase often translated as an "Act of God" or, in the literal French translation, "Superior Force". It is included as a clause to excuse interruptions in services caused by extraordinary circumstances beyond the control of the provider, circumstances that – as demonstrated over the past year – are becoming more and more common.

Michael Osterman concludes, in his presentation on the Importance of E-mail Continuity, that the only solution to the inevitable problems that plague mission-critical service delivery is a geo-replicated, multi-data center solution, like the one being offered by Ceryx.

Footnotes:
¹ CNET News:
² Data Center Knowledge:
³ Data Center Knowledge:
⁴ Data Center Knowledge:
⁵ Center Networks:
⁶ CIO WebBlog:
⁷ London Chamber of Commerce Study, 2006
⁸ The Importance of Messaging in the Enterprise: A survey of email application continuity, Applicationcontinuity.
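
A note on the availability arithmetic: the article quotes about 70 minutes of e-mail downtime in a typical month as 99.84% uptime, and cites 99.999% as the reliability of phone service. The short Python sketch below shows how such figures can be converted in either direction. The function names, the 30-day month and the 365-day year are assumptions made for illustration; they are not taken from the cited research.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, period_days):
    """Uptime as a percentage, given total downtime within a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


def allowed_downtime_minutes(uptime_pct, period_days):
    """Minutes of downtime permitted by an uptime target over a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - uptime_pct / 100.0)


if __name__ == "__main__":
    # 70 minutes of downtime in an assumed 30-day month: about 99.84%
    print(f"{uptime_percent(70, 30):.2f}% uptime")
    # "Five nines" over an assumed 365-day year: about 5.26 minutes
    print(f"{allowed_downtime_minutes(99.999, 365):.2f} minutes per year")
```

Run the other way, the same functions show how much downtime any SLA percentage actually permits, which can be easier to weigh against business impact than the percentage alone.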