United Airlines experienced a 5-hour disruption of service due to "a network connectivity issue". The disruption affected flight departures, airport processing and reservations, as well as the company's website. 36 flights were canceled and 100 flights were delayed.
A router issue within Colorado's State datacenter caused a 2½-hour disconnection from the Internet on June 15, 2011. The outage occurred at 9:15 AM, according to Dara Hessee, Chief of Staff for the Governor's Office of Information Technology.
The outage caused access failures for www.colorado.gov, the unemployment office, the Secretary of State's Office, the U.S. District Court and the Department of Revenue, according to the Denver ABC affiliate, 7News.
According to the report, State Troopers could not access crime databases from patrol cars.
Hessee said this issue was different than previous issues.
"It's not related to antiquated computer system. This was a separate issue related to networking equipment," she said.
She said the state's IT office is doing an analysis to figure out what caused the crash to keep it from happening again.
Another report indicates that some systems remained offline longer...
Northrop Grumman has agreed to pay $4.7M for the VITA (Virginia Information Technologies Agency) outage that occurred on August 25, 2010. The outage disrupted 26 of Virginia's 89 executive agencies, including the DMV.
The outage began with a memory card failure in the EMC SAN. The failure was compounded when a backup SAN also failed.
An accidental switch of two glycol pumps from manual to automatic during equipment replacement caused hardware overheating. The overheating hardware shut down.
The outage disrupted financial, research and student services.
At 4:45 PM (Pacific), severe storms knocked out power at PG&E. One of NaviSite's hosted clients indicated that backup power systems initially functioned on battery power but failed during the switch from battery to generators. This client's Twitter feed indicates that the resulting outage required 2-3 hours to recover.
Rackspace suffered an outage at 9:19 AM in their LON03 data center (London). A module on one of the Uninterruptible Power Supplies (UPS) failed. Rackspace restored power to most clients of this data center by 11:30 AM but had particular issues with several "Whiteboxes" that required longer to restore.
Rackspace experienced yet another outage in their Dallas/Fort Worth data center. This outage follows previous outages this summer, which prompted Rackspace CEO Lanham Napier to issue an apology to customers and provide an online video outlining the cause of the failure.
The problems began at 12:30 AM (CST) while "testing phase rotation on a Power Distribution Unit (PDU)," according to Rackspace's blog.
Most sites returned to service by 2 AM, while several cloud servers continued to experience problems until after 5 AM, according to a timeline on the Cloud Servers status blog.
A generator failed during planned maintenance at IBM's commercial data center in Newton (outside Auckland), New Zealand on Sunday, October 11 at 9:30 AM. Local media reported that a failed oil pressure sensor on a backup generator was the likely cause.
The outage severely affected Air New Zealand, which had outsourced its mainframe and mid-range systems to IBM. The shutdown took down airport check-in systems, online bookings and call center systems. Overall, the outage affected over 10,000 passengers and threw airports into disarray, according to a local media report.
Air New Zealand representatives publicly railed at IBM.
Rackspace experienced an outage in their Grapevine, TX data center.
Rackspace describes corrective measures to address the continuing issues with the facility, but does not volunteer to have the facility certified by the Uptime Institute.
Rackspace reported a power failure in their Grapevine, Texas data center at 4:30 PM (EST). Power was restored approximately 45 minutes later. Sequentially repowering clients' servers, however, lasted until "well into the evening", according to Rich Miller of Data Center Knowledge.
Although Rackspace manages a number of facilities, this one is the company's largest. The facility also experienced an outage in 2007 due to power issues.
Methodist Hospital in Indianapolis turned away patients arriving in ambulances on June 2, according to an article in the Indianapolis Star. A power surge reportedly knocked out the hospital's computer system early in the morning, making patients' medical records in the Electronic Medical Record (EMR) system unavailable. The hospital staff continued to see patients without records until "a backlog of paperwork" led them to stop accepting ambulance patients.
Had the hospital had a business continuity architecture in place, physicians could have continued with minor, or possibly no, disruption of service.
There was a power outage in The Planet's D5 facility in Dallas, Texas. The outage occurred when utility power failed and automatic cut-over to UPS systems did not occur, according to Rich Miller at Data Center Knowledge. "While the transfer switch did not allow us to connect to generators, we are back on utility power," reports the company via Twitter.
From the information reported, the company was simply fortunate that the utility power was restored within one hour, since restoration of utility power ultimately brought the facility back online.
According to posts made by The Planet representatives, electrical gear shorted in their H1 data center at 4:55 PM on May 31, 2008, causing an explosion and fire. The explosion knocked down three walls of the electrical room. The fire department prevented the company from starting backup generators. The Planet's communications to clients during the outage were widely praised. Full operation was restored at 11:25 PM on June 3.
According to Rich Miller, writing for Data Center Knowledge, Rackspace experienced two failures in as many days. The first outage was caused by "an unspecified mechanical failure" and resulted in about three hours of downtime; the second occurred when the power was cut, restored, then cut again, causing the chillers to overheat. Rackspace pledges a 100% uptime SLA, under which they repaid 5% of clients' monthly fees for every 30 minutes of downtime.
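To make the SLA arithmetic above concrete, here is a minimal sketch of how a credit of 5% per 30 minutes of downtime might be computed. The function name, the choice to count only completed 30-minute blocks, and the 100% cap are assumptions made for illustration; the source does not say how Rackspace rounds partial blocks or whether credits are capped.

```python
def sla_credit_percent(downtime_minutes: float,
                       rate_per_block: float = 5.0,
                       block_minutes: int = 30,
                       cap: float = 100.0) -> float:
    """Return the credit as a percentage of the monthly fee.

    Assumptions (not from the source): only completed blocks earn
    credit, and the total credit never exceeds `cap` percent.
    """
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes must be non-negative")
    # Count whole 30-minute blocks of downtime.
    full_blocks = int(downtime_minutes // block_minutes)
    return min(cap, full_blocks * rate_per_block)


# About three hours of downtime, as reported for the first outage.
credit = sla_credit_percent(180)
print(f"Credit: {credit}% of the monthly fee")
```

Under these assumptions, the roughly three hours of downtime reported for the first outage would correspond to six blocks, or 30% of a client's monthly fee.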
A CNET article (see reference below) summarizes the outage. Transformer breakers at San Francisco Pacific Gas & Electric suddenly opened, causing a power surge and subsequent power failure. Three of the data center's 10 generators failed to start, which cut power to approximately 40% of the data center's customers.
In a fully redundant architecture with no single points of failure, such as the OnePartner ATAC, half of the available resources can fail without disruption of service. To achieve a Tier III rating, 365 Main would need to maintain twice the number of generators required to power the facility at capacity. The Tier rating would also require a review by The Uptime Institute, which might uncover other vulnerabilities. 365 Main hosted Gamespot and Craigslist, among many other companies, and boasted "24-hour-a-day, 7-days-a-week, 365-days-a-year power".
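The redundancy argument above can be sketched as a simple capacity check. The function names and the `required` values below are hypothetical: the source reports 10 installed generators and 3 failures, but it does not say how many generators the facility actually needed to carry its load.

```python
def load_survives(installed: int, required: int, failed: int) -> bool:
    """True if the generators still running can carry the full load."""
    if not 0 <= failed <= installed:
        raise ValueError("failed must be between 0 and installed")
    return installed - failed >= required


def max_tolerable_failures(installed: int, required: int) -> int:
    """Number of generators that can fail before load is shed."""
    return max(0, installed - required)


# Hypothetical scenario A: twice the required generators are installed
# (required=5 is an assumption), so half of them can fail.
print(load_survives(installed=10, required=5, failed=3))
print(max_tolerable_failures(installed=10, required=5))

# Hypothetical scenario B: less spare capacity (required=8 is an
# assumption), so three failures leave the facility short.
print(load_survives(installed=10, required=8, failed=3))
```

With twice the required capacity installed, the three failed generators in this incident would have been absorbed; with less headroom, the same three failures shed load.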