When the Australian Taxation Office (ATO) suffered a significant HPE SAN outage in December 2016, users were unable to access key services for several days. A second outage in February 2017 caused similar problems and plenty of embarrassment for the Australian government.
Finally, after months of investigation by HPE, the ATO's vendor, an interim report has identified the cause of both failures.
HPE SAN misconfiguration
At the heart of the problem was a serious configuration issue: the SAN had been configured to prioritise performance over resilience. When a small number of drives failed, the effect on the rest of the system was catastrophic.
Typically, recovering from these kinds of disk failures is relatively straightforward. However, several key recovery tools were stored on the same SAN that failed, rendering them inaccessible.
A completely avoidable outage
The full results of the ATO investigation are yet to be published, and government confidentiality means that we are unlikely to learn much more. But there are lessons to be learned from what little we do know.
The underlying cause appears to have been the ATO and HPE’s close familiarity with the SAN. Dealing with it day in, day out meant that basic problems were overlooked. The SAN had been configured with the wrong profile, for instance, and key recovery tools had to be restored before work could begin on rebuilding the SAN.
Much of the lost time, money and embarrassment could have been avoided with a third-party storage audit. Having “a fresh pair of eyes” check configurations and ask some difficult questions could have uncovered the problems before the first outage.
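As a rough illustration of the kind of questions an audit might ask, the two problems described above can be expressed as simple checks against a storage inventory. This is only a sketch: the `Volume` fields, the profile labels and the array names are all hypothetical, and a real audit would draw on the array's own management tools and cover far more ground.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """One entry in a hypothetical storage inventory."""
    name: str
    array: str                       # storage array the volume lives on
    profile: str                     # assumed labels: "performance" or "resilience"
    holds_recovery_tools: bool = False
    recovers_array: str = ""         # array the stored tools are meant to recover


def audit(volumes):
    """Return a list of plain-language findings for the given inventory."""
    findings = []
    for v in volumes:
        # Check 1: is the volume tuned for speed at the expense of resilience?
        if v.profile == "performance":
            findings.append(f"{v.name}: profile favours performance over resilience")
        # Check 2: do the recovery tools live on the array they are meant to recover?
        if v.holds_recovery_tools and v.recovers_array == v.array:
            findings.append(
                f"{v.name}: recovery tools stored on the array they recover ({v.array})"
            )
    return findings


# Made-up inventory for illustration only.
inventory = [
    Volume("prod-data", "san-01", "performance"),
    Volume("recovery-tools", "san-01", "resilience", True, "san-01"),
    Volume("archive", "san-02", "resilience"),
]

for finding in audit(inventory):
    print(finding)
```

Neither check is sophisticated, which is rather the point: these are questions that are easy to ask and easy to overlook when you work with the same system every day.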
For the time-poor CIO, arranging regular third-party audits could actually become a massive time (and job) saver.
To learn more about storage auditing and maintenance services for your SAN, please get in touch.