Thursday, September 2, 2010

VA cloud outage

--Virginia Gov't Agencies Suffer Massive Outage
(August 27 & 30, 2010)
A storage area network (SAN) memory card failure at the Virginia
Information Technologies Agency (VITA) left at least two dozen agencies
without the ability to conduct business. Among the affected agencies
are the Department of Motor Vehicles, which was unable to issue driver's
licenses, and the Department of Social Services, which was unable to
distribute benefits. The data center where the failure occurred is run
by Northrop Grumman.

[Editor's Note (Northcutt): The state of Virginia was an early adopter
of blades and virtualization. The advantages and economics are obvious.
These outages may prove to be a cautionary tale. With virtualization,
you end up with a lot of eggs concentrated in a fairly small basket so
that if your continuity of operations plans fail, you go down pretty

(Schultz): This is a perfect example of what can go wrong when cloud
services fail. People in general neither recognize the real risk nor
plan for loss of availability in cloud services.]

Wow, they were not running dual HBAs into the SAN? Can't be.

Outage report from VA is here:

I am not sure the SANS editors' comments are warranted. This may come down to an architectural error in the deployment of the EMC DMX-3 and its backup.

The DMX is an SMP-based HA system with a petabyte of capacity. The comment about too many eggs in one basket is accurate with respect to the State of Virginia's use of a monster SAN, but not so much with respect to its use of virtualization.
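
To put rough numbers on the eggs-in-one-basket point, here is a minimal sketch of the usual series/parallel availability arithmetic. It assumes failures are independent, and the availability figures are made up for illustration; none of them come from the VITA report or from EMC. The names `parallel`, `series`, `hba_path` and `shared_san` are hypothetical.

```python
from math import prod


def parallel(a, n):
    """Availability of n redundant, independent copies, each with availability a."""
    return 1 - (1 - a) ** n


def series(avails):
    """Availability of a chain in which every component must be up."""
    return prod(avails)


# Hypothetical figures, for illustration only.
hba_path = 0.99     # one HBA path from a host into the SAN
shared_san = 0.999  # the single array that every agency depends on

# Dual HBAs: the host loses storage only if both paths are down.
dual_paths = parallel(hba_path, 2)

# The shared SAN sits in series behind those paths.
per_agency = series([dual_paths, shared_san])

print(f"dual HBA paths:    {dual_paths:.6f}")
print(f"agency end to end: {per_agency:.6f}")
```

Under these assumptions, doubling up the paths helps the path layer, but the end-to-end figure can never exceed that of the one shared array, and every agency behind it goes down together.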

The real question here is whether they ever tested their COOP capability. Then we have to ask when they last ran a DR test, because their time to recover seems a little long as well.

My failure analysis: overreliance on a vendor's claim that their hardware never fails.
