A power room in a data center facility in Northern Virginia. (Photo: Rich Miller)
Uptime is always a priority for data center operators, but it remains a challenge even for the largest tech platforms, as shown by Apple’s lengthy outage on Monday. Reliability was already in the spotlight after major outages in 2021 for Amazon Web Services and Facebook. In today’s edition of the DCF Data Center Executive Roundtable, our panel of industry thought leaders examines uptime in the cloud computing era.
Our panelists include Sean Farney of Kohler Power Systems, Michael Goh from Iron Mountain Data Centers, DartPoints’ Brad Alexander, Amber Caramella of Netrality Data Centers and Infrastructure Masons, and Peter Panfil of Vertiv. The conversation is moderated by Rich Miller, the founder and editor of Data Center Frontier. Here’s today’s discussion:
Data Center Frontier: Last year’s major service outages at Facebook and Amazon Web Services sharpened the focus on data center reliability. As companies embrace the benefits of cloud and hybrid IT architectures, what are the key strategies for ensuring uptime?
Sean Farney, Kohler Power Systems: I recall a fascinating dinner conversation with Ray Ozzie years ago while at Microsoft. In it, he shared that the original design tenet for Azure was to move redundancy and resilience way up the stack from the physical layer to the application layer so that software would obviate outages. So if utility power failed in one facility, the data center manager would simply roll bits over to a different facility, removing the cost and complexity of redundant power systems, for example. It’s a noble idea and is working to a great extent with many CDN services in production today.
However, with the increasing complexity of interdependent applications and the rapidity of data set growth, there are stateful information services that must be homed in a single location. For this reason, the best way to ensure uptime in the cloud – which is just another data center – is to design, build and properly maintain multiple levels of redundancy across all key points of failure in a system. N + 1, 2N, Concurrent Maintainability, etc. are table stakes for operators beholden to Service Level Agreements.
Equipment like the venerable but ever-reliable diesel generator will continue to be in high demand for many years to come because we can trust it to provide backup power, flawlessly. And amid a data center building boom, Kohler is seeing just this – unparalleled demand for proven and reliable power products.
Michael Goh, Iron Mountain Data Centers: From a colocation service provider perspective, the fundamental requirement is data center uptime. While the data center industry is facing widespread growth, it’s also adapting to a more complex playing field, with evolving efficiency and sustainability requirements alongside a challenging supply chain.
Hiring and retaining qualified staff, monitoring, and increasing the level of automation in the data center to reduce the chance of human error are key strategies for ensuring uptime. Having comprehensive operations procedures and disaster recovery plans in place is also key.
From an end user perspective, it’s important not to put all your eggs in one basket. We see customers adopting a hybrid cloud strategy, in which they mix workloads across colocation and cloud providers, as well as embracing multi-cloud platforms. This inevitably drives up the complexity of the customer’s infrastructure and has indirectly contributed to rising demand in the managed cloud services segment.
Amber Caramella, Infrastructure Masons and Netrality: The key to uptime begins with a fault-tolerant design that mitigates single points of failure, as the foundation of infrastructure. Companies that work closely with their vendors to collaborate and play a role in the planning, design, and build phase have greater resiliency in their network.
A preventive maintenance strategy – including regularly scheduled maintenance check-ins executed by your data center operations team – ensures power and cooling systems are running at optimal levels and evaluates when systems need to be replaced or upgraded. Monitoring and reporting will notify operators if a system is down. Real-time reporting is essential to address issues before they escalate and cause system outages.
Brad Alexander, DartPoints: Horizontal scalability not only reduces the risk of lost data, but it also ensures that there is no single point of failure – this is a huge safety net. This same principle is also what makes a multi-cloud, multi-provider solution so attractive to companies that are focused on reliability. A multi-cloud platform adds an extra layer of protection. If one provider experiences an infrastructure breakdown or falls victim to a cyberattack, companies with more than one provider can quickly switch to the other provider or back everything up to a private cloud to secure important data.
Geographic and safety awareness are also important contributing factors to uptime and data security. Locations at lower risk for natural disasters such as hurricanes and earthquakes minimize geographic risk, which makes them attractive for tenant colocation. Data center locations should be carefully evaluated based on climate as well as environmental conditions and the probability of a natural disaster.
Network visibility and control help avoid issues before they occur, which is why application intelligence is a key component of reliability for service providers. It gives them the power to collect reliable, actionable application data for more effective monitoring and security. Intelligent applications also understand the proper flow of data and can detect traffic that might indicate a threat. This protects confidential information from application security attacks.
When it comes to uptime and business continuity, the number one and number two threats are human error and flaws in security and procedures. The end user is typically the most overlooked threat to a business, and is a common entry point for ransomware, malware, and phishing, as well as the source of data leaks from social engineering attacks. Companies should not assume that network hardware and security software will protect against end user mistakes. My best advice is to train, reinforce, test, review, and train some more.
Peter Panfil, Vertiv: To meet growing demand for services, cloud operators have to balance speed-of-deployment, cost, reliability and sustainability. In some cases, infrastructure redundancy has been sacrificed to achieve lower build costs, which can backfire if downtime causes the market to lose confidence in the reliability of cloud services.
Two trends have emerged that enable operators to achieve their speed and cost goals without compromising reliability. One is value engineering of high-utilization critical power architectures that maintain redundancy while eliminating stranded capacity and maximizing efficiency. The other is the availability of modular prefabricated data centers, which can be deployed in less time than is possible using traditional construction methods while delivering extremely low PUEs and high availability.
NEXT: Our roundtable panel discusses the state of the data center supply chain.