Pulling the Plug: Why We Chose to Black Out an Entire Campus, and How It Went
In a data center, nothing is more important to business continuity than redundancy.
At Vantage, testing redundancy is non-negotiable. In fact, we’re never not doing it. We have a computerized maintenance management system (CMMS), and every one of the components that make up our data centers is in that system. Each component has a schedule for when, and how, it needs to be maintained—and that includes testing.
“Certain things have a monthly schedule; other things have a quarterly or annual,” says Chris Yetman, our Chief Operating Officer. “And at some point, you’re turning the entire component off.”
Whenever we turn off a component, we note which alarms are generated and, sometimes, which are not. If we don’t get the expected result, we create follow-up tickets to resolve the alarm problem.
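A minimal sketch of that alarm check, for illustration only. The component name, the alarm names and the idea of returning tickets as plain strings are all assumptions made for the example; none of this reflects a real CMMS API.

```python
def alarm_discrepancies(expected, observed):
    """Compare the alarms a shutdown should raise with those actually seen."""
    expected, observed = set(expected), set(observed)
    return {
        "missing": sorted(expected - observed),      # should have fired, did not
        "unexpected": sorted(observed - expected),   # fired, but was not planned
    }


def follow_up_tickets(component, expected, observed):
    """Build one hypothetical follow-up ticket (a string) per discrepancy."""
    diff = alarm_discrepancies(expected, observed)
    tickets = [
        f"{component}: expected alarm '{name}' did not fire"
        for name in diff["missing"]
    ]
    tickets += [
        f"{component}: unexpected alarm '{name}' fired"
        for name in diff["unexpected"]
    ]
    return tickets


if __name__ == "__main__":
    # Hypothetical component and alarm names.
    for ticket in follow_up_tickets(
        "UPS-1",
        expected=["on battery", "utility lost"],
        observed=["on battery", "fan fault"],
    ):
        print(ticket)
```

An empty list means the component alarmed exactly as planned; anything else becomes a follow-up item.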
Introducing the campus-wide blackout.
Of course, we don’t stop at the component level. In the past, we’ve run blackouts on entire buildings to make sure that our redundancy actually works. And this year, we decided to run a blackout on our entire Santa Clara campus.
We didn’t do it lightly. “We had very stringent methods of procedure,” Chris says. “And we had steps to take in the event of a failure, so that there were no interruptions in service.”
For the test, which was completed on April 26, 2017, we brought in additional staff, and we created stations with small teams of people wherever we had an electrical switch or a key mechanical component. There was communication across the board among all of our teams, using radios and other methods, to make sure that we were staying on track.
When we pulled the plug, each station knew exactly how its equipment was supposed to react. As long as things went according to script, nobody would have to touch a thing. And if something didn’t work, we had plans in place for every station, down to the second. As soon as a set number of seconds passed without the required result, each team had a process for manually forcing the switch or backing out to ensure that there was no disruption to our customers’ critical loads.
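The per-station rule described above can be sketched as a simple deadline check. This is an illustration under stated assumptions, not the actual method of procedure: the function name, the three outcome labels and the 10-second deadline in the example are all hypothetical.

```python
def station_action(transferred: bool, elapsed_s: float, deadline_s: float) -> str:
    """Decide what a station team should do at a given moment.

    transferred: whether the equipment has shown the required result.
    elapsed_s:   seconds since the mains were dropped.
    deadline_s:  hypothetical per-station limit from the written procedure.
    """
    if transferred:
        return "hands_off"   # went according to script; touch nothing
    if elapsed_s < deadline_s:
        return "wait"        # still inside the allowed window
    return "intervene"       # force the switch manually or back out


if __name__ == "__main__":
    # Hypothetical 10-second deadline for one station.
    print(station_action(transferred=False, elapsed_s=4.0, deadline_s=10.0))
    print(station_action(transferred=False, elapsed_s=10.0, deadline_s=10.0))
    print(station_action(transferred=True, elapsed_s=6.0, deadline_s=10.0))
```

Keeping the rule a pure function of observed state and elapsed time means every station can rehearse it on paper before the test, with no judgment calls left for the moment itself.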
How did it feel to black out an entire campus? “It may have been a little scary,” Chris says. “But we knew it was the right thing to do.”
Reporting back from the blackout.
At 11:00 a.m. on April 26, 2017, we dropped the mains coming off our substation and watched nearly 30 generators light up at once. We watched the power transfer over successfully. And we watched the campus continue to perform, with no interruption in service.
The blackout confirmed some of our hopes and expectations around improvements we’d made. For example, over the previous six months, we’d spent time analyzing the code that tells the parallel switchgear when to switch over. As a result, we’d re-engineered our timing to improve the speed, so we’d spend less time on battery and recover mechanical power faster, and the blackout confirmed that the new timing works. In fact, we cut the amount of time it takes to transfer the buildings to the generators nearly in half. Of course, we’re never dropping critical load, but the lights came back on faster and the mechanical equipment recovered faster, so there was less of a temperature deviation. The only way to prove that our buildings now respond better under these conditions than they did before was to test.
We had a few minor blips, too. One of our breakers failed. The power was simply shunted off in another direction, so it wasn’t a problem—and it’s an easy go-back-and-fix. And if we hadn’t caught it during the intentional campus-wide blackout, it might not have been there when we actually needed it. We also had an issue with a single generator, but we have more than enough generator redundancy, so that wasn’t a problem, either. Again, we were able to fix the generator problem and avoid issues in the future.
Some of our customers were very interested in the blackout. Our largest customer came over for a visit just before we started. “They were a bit nervous,” Chris says, “but they completely understood why we needed to do it.”
After the cut, the customer stopped by again and reported that everything was good on their end—in fact, the temperature had hardly moved. Later, when we sent out a notice that we were going to move back to grid power, this customer wanted to stay and watch. “We invited them into the room,” Chris laughs, “on the condition that they touch nothing and remain quiet.”
Most of our customers didn’t even notice that anything had happened. In fact, on the morning of April 27, the day after the blackout, we got an email from a customer asking us when we’d be doing it. “They were disbelieving,” Chris says. “And it made me chuckle. But that’s the point. Nothing happened, and nothing should happen.”
As COO, Chris has over 18 years of operations, engineering and IT experience in the Internet infrastructure industry. Chris is responsible for leading operations, security, network and IT for Vantage. He most recently served as SVP, Process and Technology at Integra. Previously, Chris was VP of AWS Infrastructure Operations at Amazon, where he had worldwide responsibility for operations and network for Amazon’s data centers. Chris also served as SVP of Operations at Level 3 Communications, SVP of Operations at Elevation Data Centers and VP of Operations Architecture at Genuity.
Chris graduated from Northeastern University with a Bachelor of Science in Computer Engineering.