Steve worked as an account manager at IBM in 2006. We landed a services contract with a multinational corporation to support its business-to-business ordering transactions for 4,000 companies in the US and 5,000 in Europe.
We deployed 65 powerful servers and a sophisticated front-end network to support this platform. We were under heavy scheduling pressure to build, test, and transition the workload in less than two months. Several of us worked 60+ hour weeks and pulled out all the stops to meet the deadline. The platform ran great for the first month, and the customer was pleased.
To celebrate, the customer held a posh dinner in Paris, France, with executives from multiple companies. Immediately after they were seated at the restaurant, they received reports that the platform was down. This had a major impact on the 9,000 companies that relied on the platform. Instead of enjoying the dinner, the executives had to start making unpleasant phone calls.
At the same time, I received a call from our tech support staff that the disk storage for our platform appeared to have a major issue. I was 5 miles from the data center and raced there in my Mazda Miata. I found the enterprise storage system powered off. Not only did that affect our platform's 65 servers, it also took down other workloads running on another 300 servers.
The storage system was powered up, and a team of 20 people worked crash recovery on filesystems and logical drives so that applications could be restarted. I recall having a tremendous headache and stomach ache around 3 a.m. the next morning. Such excitement is rare in an IT career, but the potential is always there.
The outage lasted 7 hours, most of it during prime business hours in the US. When something like this happens, it is rapidly followed by root cause analysis. The first question is, “Who broke it?” No one confessed to powering off the storage system or working on it, so it was a mystery.
The outage was caused by a firmware bug
The enterprise storage unit was designed with full redundancy: Fibre Channel attachment to the storage area network, internal cache controllers, redundant disk arrays, power supplies, and internal battery backup for the cache.
After several weeks of working with the storage unit’s vendor, we discovered the root cause.
The unit is operated internally by firmware that monitors its subsystems and makes adjustments while the system is running. The root cause was a logic error in the code that evaluated the internal battery voltages: when the “A” side battery voltage dropped, the firmware should have checked the “B” side and continued running. Instead, it panicked and initiated a shutdown of the whole system.
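To make the nature of the bug concrete, here is a minimal sketch in C of the kind of check that failed. The function names, voltage readings, and threshold are all hypothetical; the real storage unit's firmware is vendor proprietary and we never saw its source. The sketch simply contrasts logic that panics on a single low reading with logic that consults the redundant side first.

```c
/* Illustrative sketch only; names, readings, and threshold are hypothetical. */
#include <stdio.h>
#include <stdbool.h>

#define MIN_BATTERY_MILLIVOLTS 11000   /* hypothetical low-voltage threshold */

/* Stand-ins for hardware voltage reads on each battery side. */
static int read_battery_a_mv(void) { return 9500;  }   /* "A" side sagging  */
static int read_battery_b_mv(void) { return 12600; }   /* "B" side healthy  */

static void initiate_shutdown(void)          { puts("PANIC: shutting down entire unit"); }
static void flag_battery_for_service(char s) { printf("Battery %c flagged for service\n", s); }

/* The behavior described above: a low reading on the "A" side triggers a
 * full shutdown without ever consulting the redundant "B" side. */
static void check_batteries_buggy(void)
{
    if (read_battery_a_mv() < MIN_BATTERY_MILLIVOLTS) {
        initiate_shutdown();
    }
}

/* The intended behavior: fail over to the redundant side and keep running,
 * shutting down only if both batteries are below threshold. */
static void check_batteries_fixed(void)
{
    bool a_low = read_battery_a_mv() < MIN_BATTERY_MILLIVOLTS;
    bool b_low = read_battery_b_mv() < MIN_BATTERY_MILLIVOLTS;

    if (a_low && b_low) {
        initiate_shutdown();            /* no healthy battery left for the cache */
    } else if (a_low) {
        flag_battery_for_service('A');  /* run on "B", report "A" for service  */
    } else if (b_low) {
        flag_battery_for_service('B');
    }
}

int main(void)
{
    check_batteries_buggy();   /* prints the panic message */
    check_batteries_fixed();   /* flags "A" for service and continues */
    return 0;
}
```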
A firmware fix was applied, and the problem did not recur.
I gained even more respect for the decency and kindness of our customer, whose staff maintained courtesy and professionalism in the weeks and months that followed. The customer's confidence was shaken, but following this incident we had some good, solid months of uptime.