Cloud Disaster Recovery and Business Continuity

by Graham Thompson | August 4, 2017 | Cloud Security

Planning Disaster Recovery for a Black Swan Event in the cloud

Is it even possible to fully protect your firm in the event of a mass outage?

 

“Black Swan event” is a term used mostly in the financial industry to describe a random, unforeseen event that carries an extreme impact.  In a cloud world?  This would be the equivalent of all of AWS going down.  Not just an availability zone in US-East or even the entire region, not just S3, but every AWS service, every server instance and every stored file suddenly being unavailable for a prolonged time.  That would be a true Black Swan event for the cloud industry.

I’m writing this article because a CTO at a financial company with trillions under management (yes, 4 comma club for you Silicon Valley fans) asked me this very question.  At the time, the company was just starting to consume cloud services and had chosen AWS to begin its cloud journey.  They were also limiting deployment to low-risk services in both the Americas and Australia.  As a result, the scenario was deemed so unlikely that slowing down initial adoption over it was not worthwhile, and the discussion was put to rest.  Still, that question has been sitting in the back of my head for a while now.

Risk of a Black Swan Event

What could cause such a global event to occur in the first place?  A physical attack comes to mind, probably on the network cables.  An attack on the HVAC control system in a datacenter would do the trick as well: overheat the servers and there goes your datacenter.  But a physical attack would need military-like precision to execute successfully, especially when you’re talking about dozens of geographically dispersed data centers.  No, this would require a logical attack of some form, probably against the infrastructure, not the individual servers themselves.  Who could pull off such an attack?  State-sponsored agents, I would say.

But cloud services as a target for rogue nations?  Yes, Amazon does have over 1 million clients, ranging from mom and pop websites to the aforementioned trillion dollar firms running some services in the cloud.  That said, are cloud services really a prime target, considering our power grids, water control systems and other critical infrastructure all use SCADA networks that are vulnerable?  Even as a cloud guy, I have a hard time believing AWS, Azure or Google Cloud is a primary target for despots around the world looking to cripple “our way of life”.  Should the major cloud players be deemed Critical Infrastructure to be protected from nefarious enemies around the world?  This is a question that I’m not going to try to address.  I will say though, in my mind government installations (nuclear power, water level controls, etc.) need to be addressed first.

What about an insider?  I’m not going to try to pretend I know all the internal controls implemented by Amazon.  I do know this though: AWS has to meet so many standards and certifications (ISO, SOC 2, etc.) that the notion of a pissed off admin taking down a global infrastructure by him/herself is very unlikely.  Even in past cases where “human error” led to outages, AWS demonstrated that regions are self-contained.  I don’t think there’s ever been an outage that has cascaded beyond a single region.  Sure, there was a fairly recent incident where an S3 outage in US-East caused headaches for systems everywhere, but that was still limited to files being stored only in US-East.

What can your firm do to protect itself from a Black Swan event?  As discussed, the entire AWS infrastructure going down at once is extremely unlikely.  Still, the discussion is about an unlikely event occurring, and the impact could be very severe, so what’s a company to do?  There are a couple of approaches to mitigate some of the risk of running workloads in the cloud.

Mitigation Strategies

Go West, Young Man?

Most AWS outages have occurred in the US-East region (Northern Virginia), which is also the oldest region.  Has Amazon hit the threshold of manageability there?  Doubtful, but still, a move west to the newer Ohio region may be worth considering.  The same question could be asked of the other providers you work with: find out which of their data centers is the oldest and most populated, and steer clear of it.

Multi-Region Replication

Good practice would state that multi-region replication needs to be in place for high availability and disaster recovery purposes.  If this isn’t in place, you need to address it right away if high availability is of any concern.  All those global regions and data centers your provider has are useless to you if you don’t use them.  To use them, you’re going to need to address everything the systems use, not just the applications themselves or the server instances they run on.
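To make "address everything the systems use" a bit more concrete, here is a minimal Python sketch of an inventory check.  The component names, region names and the two-region minimum are all hypothetical assumptions for illustration; a real inventory would come from your own tooling.

```python
from typing import Dict, List, Set


def single_region_components(
    inventory: Dict[str, Set[str]], min_regions: int = 2
) -> List[str]:
    """Return the components deployed in fewer than min_regions regions."""
    return sorted(
        name for name, regions in inventory.items() if len(regions) < min_regions
    )


# Hypothetical inventory: each dependency mapped to the regions it lives in.
inventory = {
    "web-servers": {"us-east-1", "us-west-2"},
    "database": {"us-east-1", "us-west-2"},
    "object-storage": {"us-east-1"},
    "dns-records": {"us-east-1", "us-west-2"},
    "secrets-and-keys": {"us-east-1"},
}

# Anything listed here would be lost if its only region went down.
gaps = single_region_components(inventory)
print(gaps)
```

The point of the sketch is that the servers and database can be fully replicated while something less visible, such as stored files or encryption keys, still pins the whole system to one region.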

Be warned though.  If you are replicating out of one country to another as part of your disaster recovery plans, you need legal advice on the jurisdictional rules that you’re opening yourself up to.  Data protection and privacy laws are very different in Europe than they are in the United States, for example, and things are only getting “worse” with the new GDPR rules coming into effect in 2018.

Zero Downtime with Multi-Cloud Replication

Now the good news: this approach addresses a complete outage of a cloud provider with zero downtime.  Bad news?  I don’t think it’s ever been successfully pulled off by anyone.  It’s doable in theory: two clouds are spun up, load balancers are established, and data is replicated across the internet from a system in ProviderA to a duplicate system running in ProviderB.  When the load balancer sees that the ProviderA systems are non-responsive, traffic is sent to ProviderB.  Simple, right?  But as with everything else, the devil is in the details.  Costs are going to skyrocket, as you now have two distinct cloud environments to spin up and, more importantly, manage.  Infrastructure as code, DevOps, CI/CD toolchains and containers all come immediately to mind as items you’ll likely need to invest in to have a chance of this working.  Then there are the differences in security and service offerings from one environment to the other.  Moving the data is honestly the easiest part of this scenario.  Even then, there’s increased latency to address (ProviderA data securely sent back to your own data center, then securely sent on to ProviderB), and a fail-over mechanism is required.  I believe this pattern is in its infancy*.  Honestly, having some downtime in the event of a complete provider outage might be a risk you just have to accept.

*If you know otherwise, I would honestly love to talk with you and gladly tell the world how wrong I am. (Fine print:  I write this in August 2017.  No coming back to me in 2020 and saying there’s tons of solutions out there).
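The fail-over decision described above (load balancer sees ProviderA is non-responsive, traffic goes to ProviderB) can be sketched in a few lines of Python.  This is only an illustration of the routing logic, not a real load balancer: the `FailoverRouter` class, the threshold of three failed probes and the probe results fed to it are all assumptions made for the example.

```python
from typing import Dict, List, Optional


class FailoverRouter:
    """Picks the first provider, in priority order, still considered healthy."""

    def __init__(self, providers: List[str], failure_threshold: int = 3) -> None:
        self.providers = providers
        self.failure_threshold = failure_threshold
        # Consecutive failed health probes per provider.
        self.failures: Dict[str, int] = {p: 0 for p in providers}

    def record_probe(self, provider: str, ok: bool) -> None:
        """A successful probe resets the counter; a failed one increments it."""
        self.failures[provider] = 0 if ok else self.failures[provider] + 1

    def is_healthy(self, provider: str) -> bool:
        return self.failures[provider] < self.failure_threshold

    def active_provider(self) -> Optional[str]:
        """Return the provider that should receive traffic, or None if all are down."""
        for provider in self.providers:
            if self.is_healthy(provider):
                return provider
        return None


router = FailoverRouter(["ProviderA", "ProviderB"])
before = router.active_provider()          # ProviderA is preferred while healthy

for _ in range(3):                         # three failed probes in a row
    router.record_probe("ProviderA", ok=False)
after = router.active_provider()           # traffic now goes to ProviderB

router.record_probe("ProviderA", ok=True)  # ProviderA answers again
recovered = router.active_provider()       # traffic fails back to ProviderA
```

Requiring several consecutive failures before switching is a design choice to avoid flapping on a single dropped probe.  Note what the sketch leaves out: keeping the data in ProviderB current, which is where the real difficulty lies.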

Cloud-Cloud Disaster Recovery

To me, this is the only real option to protect against a Black Swan event like the one we’ve discussed for the past 1,000 words.  It all comes down to cost and time, of course.  The requirements are fairly standard: you need to know your acceptable Recovery Time Objective (RTO) and Recovery Point Objective (RPO).  Depending on these, you’ll determine whether the second cloud provider will act as a hot, warm or cold site.  You’ll also need the ability to redirect traffic to the other environment using DNS.  Infrastructure as code will likely be a requirement, unless you have an RTO measured in months.  Data sets are replicated on a regular basis to a core set of services in the remote provider’s environment.  When disaster strikes, the infrastructure as code can be invoked to spin up a “real” data center to handle the incoming load.  If done right, your downtime can be measured in hours.  Be prepared though: this will require significant investment in architecture and training!  You really have to determine whether your workloads truly cannot tolerate the outage from any foreseeable, semi-realistic Black Swan event, which would be measured in days.
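As a rough illustration of how RTO and RPO drive these choices, the Python sketch below maps an RTO to a site tier and checks a replication schedule against an RPO.  The one-hour and 24-hour cut-offs are assumptions picked for the example, not industry standards; your own thresholds will depend on cost and on how fast your infrastructure as code can actually build the environment.

```python
from datetime import timedelta


def site_tier(rto: timedelta) -> str:
    """Suggest a DR site tier for a given RTO (thresholds are illustrative)."""
    if rto <= timedelta(hours=1):
        return "hot"    # environment already running in the second provider
    if rto <= timedelta(hours=24):
        return "warm"   # core services running, the rest built on demand
    return "cold"       # only replicated data; everything else built on demand


def meets_rpo(replication_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is roughly one replication interval."""
    return replication_interval <= rpo


# Hypothetical workload: back online within 8 hours, lose at most 4 hours of data.
tier = site_tier(timedelta(hours=8))
hourly_ok = meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4))
nightly_ok = meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4))
print(tier, hourly_ok, nightly_ok)
```

In this example the workload lands in the warm tier, hourly replication satisfies the RPO, and a nightly copy does not.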

Conclusion

In conclusion, when bracing yourself for a Black Swan event in the cloud, you have to pull everything back to the basics.  What are the datasets in the cloud worth?  What are the RTO and RPO for those datasets?  Balance those against the cost of reducing downtime to that acceptable level.  All in all, I would say that a multi-cloud DR strategy is the most viable of the scenarios, and even then only if your systems are so important that they cannot possibly be offline for days, no matter the cost.  If you don’t have multi-region replication in place, I would be putting that on the roadmap ASAP.  A single region going down is a much more realistic, and some would say probable, event.  Much more so than a complete vendor outage (well, for the big IaaS providers, that is).
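The balancing act can be roughed out as simple arithmetic.  The Python sketch below compares the expected annual loss avoided by a multi-cloud DR setup with what that setup costs per year.  Every number in it (a 1% yearly chance of a full provider outage, 72 hours of downtime without DR, 4 hours with it, the hourly cost of downtime and the DR budget) is a made-up assumption for illustration only.

```python
def expected_annual_loss(
    annual_probability: float, outage_hours: float, cost_per_hour: float
) -> float:
    """Expected yearly loss: chance of the event times the cost if it happens."""
    return annual_probability * outage_hours * cost_per_hour


def dr_pays_for_itself(
    annual_probability: float,
    hours_without_dr: float,
    hours_with_dr: float,
    cost_per_hour: float,
    annual_dr_cost: float,
) -> bool:
    """True when the expected loss avoided exceeds the yearly cost of DR."""
    avoided = expected_annual_loss(
        annual_probability, hours_without_dr, cost_per_hour
    ) - expected_annual_loss(annual_probability, hours_with_dr, cost_per_hour)
    return avoided > annual_dr_cost


# Hypothetical firm losing $500,000 per hour of downtime.
typical_firm = dr_pays_for_itself(0.01, 72, 4, 500_000, annual_dr_cost=1_000_000)

# Hypothetical firm losing $5,000,000 per hour of downtime.
critical_firm = dr_pays_for_itself(0.01, 72, 4, 5_000_000, annual_dr_cost=1_000_000)

print(typical_firm, critical_firm)
```

With these assumed figures the first firm is better off accepting the risk, while the second is not.  A plain expected-value comparison also ignores risk appetite and regulatory obligations, so treat it as a starting point for the conversation rather than an answer.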

One last note: the above only applies to Infrastructure as a Service (IaaS).  Platform as a Service (PaaS) could be seen as arguably easier to implement multi-vendor solutions with.  Software as a Service (SaaS)?  You’re frankly out of luck if your provider goes down.  You had better back up your data on a regular basis in case the vendor goes bankrupt.

More Information

Hopefully you found this discussion of interest.  This topic and many more cloud security issues are covered in both our CCSK and CCSP training.  We would love to see you on an upcoming session.  Check out the courses and schedule your date by checking out our CCSK page and our CCSP pages.

 
