Datacenter failures of note
“A man may fall many times, but he won't be a failure until he says that someone pushed him.”
- Elmer G. Letterman
OnePartner's ATAC datacenter was the first commercial datacenter in North America to receive a Tier III certification from The Uptime Institute. No commercial datacenter in North America holds a higher certification today. OnePartner has maintained uninterrupted service since 2008.
Happy Father’s Day from United Airlines!
United Airlines experienced a five-hour service disruption due to “a network connectivity issue”. The disruption affected flight departures, airport processing and reservations, as well as the company’s website. 36 flights were canceled and 100 flights were delayed.
"A hundred yards of kiosks, and every one of them closed," he said, adding there were no flights listed on monitors. "Workers were trying to answer questions. They have no ability to do anything manually. They can't check baggage. You can't get baggage. You are really stuck."
- Bloomberg Businessweek
This weekend's outage was caused by "a connectivity issue and United's back-up system didn't implement properly," said United spokeswoman Mary Clark. She said it was unrelated to the ongoing merger of United with Continental Airlines.
- USA Today
State of Colorado's Datacenter
A router issue within the State of Colorado’s datacenter caused a 2½-hour disconnection from the Internet on June 15, 2011. The outage began at 9:15 AM, according to Dara Hessee, Chief of Staff for the Governor’s Office of Information Technology.
The outage caused access failures for www.colorado.gov, the unemployment office, Secretary of State’s Office, the U.S. District Court and the Department of Revenue, according to the Denver ABC affiliate, 7News.
According to the report, State Troopers could not access crime databases from patrol cars.
“7NEWS has been reporting frustrations with repeated state computer outages for years, from the unemployment office to the DMV. The problems have been with both server and hardware failures.
Hessee said this issue was different than previous issues.
"It's not related to antiquated computer system. This was a separate issue related to networking equipment," she said.
She said the state's IT office is doing an analysis to figure out what caused the crash to keep it from happening again.”
7News web site
Another report indicates that some systems remained offline longer…
“A statewide computer system failure is still keeping the El Paso County Public Health Department from issuing birth and death certificates. Spokesperson Susan Wheelan says the Vital Records Office hopes to be back online sometime Thursday morning.”
5:36 PM June 15, 2011, reported by NewsFirst5.com
Northrop Grumman has agreed to pay $4.7M for the VITA (Virginia Information Technologies Agency) outage, which occurred on August 25, 2010. The outage disrupted 26 of Virginia’s 89 executive agencies, including the DMV.
The initial failure resulted from a memory card failure in the EMC SAN. The failure was compounded when a backup SAN also failed.
ComputerWorld, initial story: http://www.tinyurl.com/onepartner-18
InformationWeek, follow-up story: http://www.tinyurl.com/onepartner-19
University of Pennsylvania
During equipment replacement, two glycol pumps were accidentally switched from manual to automatic, causing hardware to overheat and shut down.
The outage disrupted financial, research and student services.
Data Center Knowledge: http://tinyurl.com/onepartner-15
At 4:45 PM (Pacific), severe storms knocked out utility power from PG&E. One of NaviSite’s hosted clients indicated that backup power initially functioned on batteries but failed during the switch from battery to generators. The client’s Twitter feed indicates that the resulting outage required 2-3 hours to recover.
“Our root cause analysis is continuing. All indication is that the electrical transfer switch was damaged as the result of a power surge thus causing the failure of the generators to pick up the load. We will be on manual override until the situation has been fully addressed and resolved.”
- From NaviSite’s blog
Data Center Knowledge: http://tinyurl.com/onepartner-16
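The NaviSite failure mode can be sketched as a simple power-chain model: utility, then UPS battery, then generator, with the transfer switch gating the generator step. This is an illustrative sketch only, not NaviSite's actual configuration; the function and its parameters are hypothetical.

```python
def power_source(utility_ok, battery_ok, transfer_switch_ok, generator_ok):
    """Return which source carries the load in a simplified
    utility -> UPS battery -> generator failover chain."""
    if utility_ok:
        return "utility"
    if battery_ok:
        return "battery"  # UPS batteries carry the load only briefly
    if transfer_switch_ok and generator_ok:
        return "generator"
    return "outage"

# The reported failure: batteries picked up the load at first, but the
# damaged transfer switch kept healthy generators from taking over once
# the batteries were exhausted.
```

The point of the sketch is that a single damaged transfer switch defeats the generators entirely, even though the generators themselves are healthy.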
Rackspace suffered an outage at 9:19 AM in their LON03 data center (London). A module on one of the Uninterruptible Power Supplies (UPS) failed. Rackspace restored power to most clients of this data center by 11:30 AM but had particular issues with several “Whiteboxes” that required longer to restore.
“In numerous instances they had to replace power supplies in servers, replace firewalls, reconfigure switches and to log on to servers to get them to boot properly. We can only apologize for this incident. We take this type of incident extremely seriously and will work to fix the root cause.”
- Rackspace reported on Data Center Knowledge
Data Center Knowledge: http://tinyurl.com/onepartner-17
Rackspace experienced yet another outage in their Dallas/Fort Worth data center. This outage follows previous outages this summer which prompted Rackspace CEO Lanham Napier to issue an apology to customers and provide an online video outlining the cause of the failure.
The problems began at 12:30 AM (CST) while “testing phase rotation on a Power Distribution Unit (PDU),” according to Rackspace’s blog. “A short occurred and caused us to lose the PDUs behind this cluster.” Rackspace reports that the PDU outage lasted about five minutes; however, many customers’ sites were unavailable for a longer period.
“Most sites returned to service by 2 a.m., while several cloud servers continued to experience problems until after 5 a.m. according to a timeline on the Cloud Servers status blog” – Data Center Knowledge
Data Center Knowledge: http://tinyurl.com/onepartner-14
6 hours (9:30 AM to 3:30 PM)
A generator failed during planned maintenance at IBM’s commercial data center in Newton (outside Auckland), New Zealand on Sunday, October 11 at 9:30 AM. Local media reported that a failed oil pressure sensor on a backup generator was the likely cause.
The outage severely impacted Air New Zealand, which had outsourced its mainframe and mid-range systems to IBM. The shutdown affected airport check-in systems, online bookings and call center systems. Overall, the outage impacted over 10,000 passengers and threw airports into disarray, according to a local media report.
Air New Zealand representatives publicly railed at IBM.
“My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers." - TVNZ
Data Center Knowledge: http://tinyurl.com/onepartner-5
"We experienced another power interruption on July 7, 2009. Again, we moved customers to generator power. During this outage we also suffered a loss of network connectivity due to the power disruption. The part of the power infrastructure that failed (a “bus duct”) prevented proper operation of our UPS for that section, so some customers lost power to their servers for about 20 minutes before we could get them onto generator power. We have since replaced the failed bus duct, and that section of the data center is back to normal and running on utility power."
- Rackspace CEO Lanham Napier (via company blog).
Rackspace describes corrective measures to address the continuing issues with the facility, but does not volunteer to have the facility certified by the Uptime Institute.
Data Center Knowledge: http://tinyurl.com/onepartner-1
Rackspace blog posting: http://tinyurl.com/onepartner-2
Rackspace reported a power failure in their Grapevine, Texas data center at 4:30 PM (EST). Power was restored approximately 45 minutes later. Sequentially repowering clients' servers, however, lasted "well into the evening", according to Rich Miller of Data Center Knowledge.
Although Rackspace manages a number of facilities, this one is the company's largest. The facility also experienced an outage in 2007 due to power issues.
“Although this outage only affected a portion of our customers in one of our nine global data centers, we consider any outage to be unacceptable,” the company said in its update. “We sincerely apologize to our customers and those who were affected by this downtime. … Now that we have the near-term situation stabilized in Dallas, we have some work to do to improve our reliability. We will follow up with more information as we work through our root-cause analysis.”
This type of failure illustrates the importance of "concurrent maintainability" espoused by The Uptime Institute. In a Tier III facility, such as OnePartner's ATAC data center, each power system can be taken offline for routine maintenance without impacting the production status of the data center. This ensures equipment can be properly maintained and remains capable of optimal performance.
Data Center Knowledge: http://tinyurl.com/onepartner-3
Methodist Hospital in Indianapolis turned away patients arriving in ambulances June 2, according to an article in the Indianapolis Star. A power surge reportedly knocked out the hospital’s computer system early in the morning, making patients’ medical records in the Electronic Medical Record (EMR) system unavailable. The hospital staff continued to see patients without records until “a backlog of paperwork” led them to stop accepting ambulance patients.
If the hospital had had a business continuity architecture in place, physicians could have continued with minor, or possibly no, disruption of service.
The Planet suffered a power outage in its D5 facility in Dallas, Texas. The outage occurred when utility power failed and the automatic cut-over to UPS systems did not occur, according to Rich Miller at Data Center Knowledge. “While the transfer switch did not allow us to connect to generators, we are back on utility power,” the company reported via Twitter.
From the information reported, the company was simply fortunate: utility power was restored within one hour, and its restoration is what ultimately brought the facility back online.
"The Houston-based company suffered a lengthy outage in June 2008". - Rich Miller, Data Center Knowledge
Data Center Knowledge: http://tinyurl.com/onepartner-10
According to posts made by The Planet representatives, electrical gear shorted in their H1 data center at 4:55 PM on May 31, 2008, causing an explosion and fire. The explosion knocked down three walls of the electrical room. The fire department prevented the company from starting backup generators. The Planet's communications to clients during the outage were widely praised. Full operation was restored at 11:25 PM on June 3.
A great deal of information is available on this outage, including the forum updates at The Planet, which provide a running narrative as events actually unfolded.
The Planet’s forum coverage: http://tinyurl.com/onepartner-9
According to Rich Miller writing for Data Center Knowledge, Rackspace experienced two failures in as many days. The first outage, caused by "an unspecified mechanical failure", resulted in about three hours of downtime; the second occurred when power was cut, restored, then cut again, causing the chillers to overheat. Rackspace pledges a 100% uptime SLA, under which it repaid 5% of clients' monthly fees for every 30 minutes of downtime.
Data Center Knowledge: http://tinyurl.com/onepartner-4
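The credit terms in an SLA like Rackspace's amount to simple arithmetic. A minimal sketch, assuming credit accrues per completed 30-minute block and is capped at the full monthly fee (both are illustrative assumptions, not Rackspace's published terms):

```python
def sla_credit(monthly_fee, downtime_minutes,
               credit_rate=0.05, block_minutes=30):
    """Estimate an SLA credit: a fixed percentage of the monthly fee
    for each full block of downtime, capped at 100% of the fee."""
    blocks = downtime_minutes // block_minutes
    credit = monthly_fee * credit_rate * blocks
    return min(credit, monthly_fee)

# Three hours of downtime at 5% per 30 minutes:
# 6 blocks x 5% = 30% of the monthly fee.
```

Under these assumptions, a three-hour outage costs the provider 30% of each affected client's monthly fee, and anything beyond ten hours hits the 100% cap.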
A CNET article (see reference below) summarizes the outage. Transformer breakers at Pacific Gas & Electric in San Francisco suddenly opened, causing a power surge and subsequent power failure. Three of the data center's 10 generators failed to start, which cut power to approximately 40% of the data center's customers.
In a fully redundant architecture with no single points of failure, such as OnePartner's ATAC, half of the available resources can fail without disruption of service. To achieve a Tier III rating, 365 Main would need to maintain twice the number of generators required to power the facility at capacity. The Tier rating would also require a review by The Uptime Institute, which might uncover other vulnerabilities. 365 Main hosted Gamespot and Craigslist, among many other companies, and boasted "24-hour-a-day, 7-days-a-week, 365-days-a-year power".
We've heard of other large data centers that fell prey to generators that didn't start up during a power outage. This also seems to be a fairly common failure mode. That's one reason we have redundant generators powering the ATAC. If one fails, the other starts automatically. - OnePartner
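The "half the resources can fail" sizing rule above can be checked with a capacity calculation. A minimal sketch; the unit counts and kW figures below are hypothetical, not 365 Main's or OnePartner's actual numbers:

```python
def survives_failure(total_units, unit_capacity_kw, load_kw, failed_units):
    """Check whether the remaining generators can still carry the
    full facility load after some number of units fail."""
    remaining_kw = (total_units - failed_units) * unit_capacity_kw
    return remaining_kw >= load_kw

# A fully redundant (2N) design: 10 generators sized so that any 5
# can carry the load. Losing 3 of 10 (as at 365 Main) would be fine;
# losing 6 would not.
```

The design question is simply whether the surviving units' combined capacity still meets the load; in a 2N design that holds even when half the units fail.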
Official statement by 365 Main’s President, August 1, 2007: http://tinyurl.com/onepartner-11
July 24, 2007: http://tinyurl.com/onepartner-12
August 1, 2007: http://tinyurl.com/onepartner-13