Cloud disaster recovery rule 1: Downtime is an option


When dealing with cloud disaster recovery for workloads running in IaaS, a multi-cloud strategy is an alluring way to address availability. On the surface, it seems logical, right? After all, it’s just data crossing from one server to another, and that data can be secured. Before you start working on this approach, we need to discuss the virtual infrastructure itself and why accepting downtime is most likely preferable to attempting multi-cloud failover.
The Cloud Security Alliance guidance (v4) introduces a logical model of a cloud environment. Its layers (page 19 of the guidance) break down as follows:
Infostructure: This consists of data and information
Applistructure: Applications deployed in the cloud and the application services behind them
Metastructure: The protocols and mechanisms that provide the interface between the infrastructure and the higher layers. This includes the management plane (your interface for managing your environment)
Infrastructure: The foundation that everything is built on
Here’s the thing: managing one cloud environment alone is challenging enough. Now you want to manage two of them? And not only manage two environments, but keep them in sync so that the virtual infrastructure (metastructure) in the backup IaaS environment is identical, from a security perspective, to the production environment. Take a moment to see how this architecture drives costs up. Need a virtual firewall in production? Well, now you need two (four, really, because there shouldn’t be a single point of failure in either location). Oh, the backup provider doesn’t offer that particular vendor’s appliance? OK, no problem, we’ll use a different one. Great, now you need to train your staff on two different virtual appliances and maintain two contracts, two license agreements, two bills, etc. Do both providers support the same approach to letting clients create and modify virtual infrastructure? If not, you now have two sets of infrastructure-as-code tooling, and both need to be properly maintained.
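To make the "keep them in sync" problem concrete, here is a minimal sketch of what checking firewall parity across two providers might involve. Everything in it is an assumption made for illustration: the two raw rule shapes, the field names and the functions `from_provider_a`, `from_provider_b` and `drift` are hypothetical and do not correspond to any real provider's API.

```python
from __future__ import annotations

from typing import NamedTuple


class Rule(NamedTuple):
    """Provider-neutral firewall rule (hypothetical common format)."""
    direction: str  # "ingress" or "egress"
    protocol: str   # e.g. "tcp"
    port: int
    source: str     # CIDR block


def from_provider_a(raw: dict) -> Rule:
    # Assumed shape: {"dir": "in", "proto": "TCP", "port": 443, "cidr": "0.0.0.0/0"}
    direction = "ingress" if raw["dir"] == "in" else "egress"
    return Rule(direction, raw["proto"].lower(), int(raw["port"]), raw["cidr"])


def from_provider_b(raw: dict) -> Rule:
    # Assumed shape: {"direction": "INGRESS", "allowed": "tcp:443", "source_range": "0.0.0.0/0"}
    protocol, port = raw["allowed"].split(":")
    return Rule(raw["direction"].lower(), protocol.lower(), int(port), raw["source_range"])


def drift(production: set[Rule], backup: set[Rule]) -> dict[str, set[Rule]]:
    """Report rules present on one side but not the other."""
    return {
        "missing_in_backup": production - backup,
        "extra_in_backup": backup - production,
    }


# Illustrative data only: the SSH rule is wider open in the backup environment.
prod_raw = [
    {"dir": "in", "proto": "TCP", "port": 443, "cidr": "0.0.0.0/0"},
    {"dir": "in", "proto": "TCP", "port": 22, "cidr": "10.0.0.0/8"},
]
backup_raw = [
    {"direction": "INGRESS", "allowed": "tcp:443", "source_range": "0.0.0.0/0"},
    {"direction": "INGRESS", "allowed": "tcp:22", "source_range": "0.0.0.0/0"},
]

prod_rules = {from_provider_a(r) for r in prod_raw}
backup_rules = {from_provider_b(r) for r in backup_raw}
report = drift(prod_rules, backup_rules)

for label, rules in report.items():
    for rule in sorted(rules):
        print(label, rule)
```

Even in this toy version, each provider needs its own translation function before the two environments can be compared at all, and that is for one control type. Repeat it for identity policies, logging, network routing and key management and the maintenance burden becomes clear.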
Now consider more advanced architectures where we start leveraging provider services such as serverless computing (e.g. AWS Lambda, Google Cloud Functions, Azure Functions), event-driven security, container services and so on. Serverless would be a complete nightmare to replicate across providers because supported languages, invocation models and so on differ from one provider to the next. Such a requirement would likely push you away from adopting provider services, which can be extremely useful and cost-effective from both security and functionality perspectives.
So where does that leave us? Leveraging the provider’s regions may be a pretty good place to start. By selecting another region to act as a DR point, we get geographical separation between our production and DR sites. We could use availability zones instead, but this approach may not give us the geographical separation we require (note: this may be the only alternative if we are bound to operations in a single jurisdiction and our provider only supports one region in our country, as was the case with AWS in Canada at the time of writing). Additionally, there has been the (very) rare occasion when all availability zones in a region were impacted; I honestly cannot recall any IaaS provider having a multi-region outage.
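The decision described above can be sketched as a small selection routine: prefer a second region with the same provider, and fall back to another availability zone only when a jurisdiction constraint leaves no other region. The `Region` type, the `pick_dr_target` function and the region and zone names are all hypothetical; no real provider catalog is being queried.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    country: str
    zones: tuple[str, ...]


def pick_dr_target(
    primary: Region,
    primary_zone: str,
    catalog: list[Region],
    same_country_only: bool,
) -> tuple[str, str]:
    """Return ("region", name) if another region qualifies, else ("zone", name)."""
    candidates = [
        r for r in catalog
        if r.name != primary.name
        and (not same_country_only or r.country == primary.country)
    ]
    if candidates:
        # Geographical separation is available: use a different region.
        return ("region", candidates[0].name)

    # No acceptable region: settle for a different zone in the same region.
    spare_zones = [z for z in primary.zones if z != primary_zone]
    if spare_zones:
        return ("zone", spare_zones[0])
    raise ValueError("no DR target available under these constraints")


# Made-up catalog: one region in country "CA", two in country "US".
catalog = [
    Region("ca-one", "CA", ("ca-one-a", "ca-one-b")),
    Region("us-one", "US", ("us-one-a", "us-one-b")),
    Region("us-two", "US", ("us-two-a",)),
]
primary = catalog[0]

print(pick_dr_target(primary, "ca-one-a", catalog, same_country_only=True))
print(pick_dr_target(primary, "ca-one-a", catalog, same_country_only=False))
```

With the jurisdiction constraint switched on, the single-region country falls back to a zone; without it, the routine picks another region. Either way, the metastructure stays with one provider.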
Whenever you think about an integrated multi-cloud environment, always start by focusing on the metastructure layer. When we take this as the primary decision point, we can see how a cloud disaster recovery strategy that leverages multiple cloud service providers is extremely difficult, time-consuming and expensive. All told, downtime is likely an option, especially considering how improbable it is that every region a cloud service provider offers goes down at once.
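As a rough back-of-the-envelope illustration of that last point, the sketch below multiplies per-region outage probabilities together. The figures are invented for the example and are not measurements from any provider, and the calculation assumes regional outages are independent, which is optimistic: a shared control plane or a global misconfiguration would correlate failures and raise the real number.

```python
import math


def prob_all_regions_down(per_region_outage_prob: list[float]) -> float:
    """Joint probability that every region is down, ASSUMING independence."""
    return math.prod(per_region_outage_prob)


# Hypothetical figures: each of three regions is unavailable 0.1% of the time.
assumed = [0.001, 0.001, 0.001]

single = assumed[0]
combined = prob_all_regions_down(assumed)

print(f"one region down:  {single:.3%}")
print(f"all regions down: {combined:.7%}")
```

Under these assumptions the combined figure is several orders of magnitude smaller than the single-region one, which is the intuition behind treating a full-provider outage as a risk to accept rather than to engineer around.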