What Was Missing During the AWS S3 Outage? (Besides a Big Chunk of the Internet)

Much of the technology world went into shock last week when Amazon Web Services (AWS) left many clients feeling not so AWSome thanks to a massive, multi-hour internet outage. The event took down services, sites and applications throughout the web. In the days that followed, a number of cloud and hosting vendors sensed blood in the water, each vying to position itself as the alternative to the “behemoth” of the business. That framing misses the true lessons of the AWS outage, the foremost of which is that what matters most is how a service provider reacts to an outage and communicates with its users.

Service providers range from “great” to, let’s say, “not so great” when it comes to cloud infrastructure and fault tolerance. Given that we’re talking about technology, and that no two providers are the same, it is not a matter of “if” but “when” an outage will happen. Many people don’t even realize when they’re affected by an outage, and every situation is different. Technology aside, one of the key differentiators is how a provider communicates with customers, and transparency was largely missing from AWS’s response last week.

Poor Response

Arriving at a root cause can be a time-consuming and often inaccurate task in the heat of an outage. Even so, AWS once again struggled with transparency in explaining what was happening with Simple Storage Service (S3) during the downtime, as it has in prior outages. As a giant in the industry, AWS has a responsibility to be more open during an outage, even if that means releasing information with a “preliminary” or “subject to change” tag. Other companies host on AWS, meaning they have to explain to their own clients why their AWS-based applications and sites are down. Customers deserve to know what is happening, why and where, so they can plan, give their own users an ETA for restored functionality and keep business services running. Millions of sites were affected, and even slightly better communication could have saved countless relationships, transactions and opportunities.

Nearly a week after the event, AWS put out a statement:

“While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses…We will do everything we can to learn from this event and use it to improve our availability even further.”

Human Error

Ultimately, the root cause turned out to be a matter of human error. According to AWS’s post-event summary, a command entered during debugging of the S3 billing system took more server capacity offline than intended, and we wound up with what we saw play out. Errors happen, especially when there’s a human element in the chain. The report also indicated that some of the affected subsystems had not been fully restarted in years. Needless to say, there are ways of minimizing issues like this to reduce risk. AWS will surely address the matter in a way that prevents this particular sort of event from happening again. The point is that they should have been more open in the way they communicated with their end users.
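The risk reduction alluded to above can start with something as simple as validating destructive commands before they run. Below is a minimal, hypothetical sketch (the function and parameter names are ours, not AWS’s) of a capacity-removal tool that refuses to take too many servers offline at once, or to drop a fleet below a safety floor:

```python
def plan_capacity_removal(active_servers, requested, min_capacity, max_batch=2):
    """Validate a capacity-removal request before executing it.

    Illustrative safeguard only: cap how many servers one command may
    remove, and refuse any removal that would leave the fleet below a
    minimum capacity floor. Returns the resulting fleet size.
    """
    if requested > max_batch:
        raise ValueError(
            f"refusing to remove {requested} servers at once (max {max_batch})")
    remaining = active_servers - requested
    if remaining < min_capacity:
        raise ValueError(
            f"removal would leave {remaining} servers, below the floor of {min_capacity}")
    return remaining
```

A fat-fingered request for a large batch fails fast instead of cascading, which is exactly the class of guardrail a post-incident review tends to add.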

Too Much AWS – a Single Point of Failure

In the current IT paradigm, where cloud-based architecture decisions are standard, one big question the industry is wrestling with is whether there is an overreliance on AWS. Look, the cloud isn’t broken, and the point should be made that cloud servers are far more reliable, far more secure and more cost-effective than on-premises architectures. But when so much of the web and its applications sit on AWS S3, an S3 outage inevitably brings down other AWS services, too.

Most importantly, the outage exposed AWS S3 as a single point of failure. That’s a big no-no. Of course, AWS will probably make moves to address this fault in time, but the lessons remain: the web at large is far too dependent on a single provider (or even on a select group of providers that includes Azure and Google). The lesson for the industry, and for customers relying on one of the major public clouds, is that diversification is your friend. Hybrid cloud, as we have been saying for a long time, is the answer.
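What diversification looks like in practice is keeping copies of critical objects with more than one provider and failing over on reads. Here is a minimal sketch, with made-up names and no real vendor API, of the pattern an application could use so that a single provider’s outage doesn’t take it down:

```python
def fetch_with_failover(key, providers, retries_per_provider=1):
    """Read an object from the first healthy storage provider.

    `providers` is an ordered list of (name, fetch_fn) pairs, where each
    fetch_fn takes a key and either returns the object's bytes or raises
    on failure. Names and signatures here are illustrative only.
    """
    errors = []
    for name, fetch in providers:
        for attempt in range(retries_per_provider):
            try:
                return fetch(key)  # success: stop at the first provider that answers
            except Exception as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In real deployments the fetch functions would wrap different clouds’ SDK calls (or an on-premises store in a hybrid setup), and writes would replicate to all providers so the fallback copy exists when it is needed.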


Hybrid, Communications and Leadership

Competitors can’t be blamed for jumping on Amazon in all of this, but they all appeared to be doing so with limited vision. It’s not simply about another cloud being better, or about the best price. This is an area where best practices, a proven approach, customer service and industry leadership are the real differentiators.

Leading with hybrid cloud products, we have a different viewpoint. Hybrid cloud layers its architecture on a foundation of efficiency, resiliency, tuning and custom components that can mitigate single-point-of-failure exposures like the one S3 revealed. A hybrid cloud can even leverage S3, and other public clouds for that matter. That is something that scaled-up and scaled-out (aka hyperscale) public cloud providers simply can’t offer. In the aftermath of this event, a solid provider with solid hybrid technology, thorough transparency and great customer support is what many companies will be looking for, and we commit to our customers that we will continue to lead in that space.

Check out our white paper for a more in-depth look at the hybrid cloud >>>