Lessons From Azure's Very Bad Day


Microsoft Azure Security and How Companies Can Protect Themselves.
This past month has not been kind to Azure clients, and Microsoft has some explaining to do. First, I’ll break down the recently disclosed vulnerabilities and list some reference links for deeper learning. Then I’ll get into what grinds my gears about them. Finally, I’ll discuss what you can do as a customer to limit the impact of future events like these on your organization, regardless of which cloud provider you use.
I’ll start with the first one in chronological order: the Cosmos DB vulnerability (code-named ChaosDB), reported on August 12th, 2021. This vulnerability in Microsoft’s Azure Cosmos DB NoSQL database service left more than 3,300 Azure customers open to complete, unrestricted access by attackers. It stemmed from the addition of a feature called Jupyter Notebook, an open-source web application that is directly integrated with your Azure portal and Cosmos DB accounts. You can read more about the story here: https://www.zdnet.com/article/azure-cosmos-db-alert-critical-vulnerability-puts-users-at-risk/ And you can read even more from the fine folks at Wiz.io, who discovered the vulnerability, here: https://www.wiz.io/blog/chaosdb-how-we-hacked-thousands-of-azure-customers-databases
So here’s my take on this. This vulnerability is completely the responsibility of Microsoft. As cloud customers, we know that the further up the IaaS-PaaS-SaaS stack you go, the more of the security responsibility shifts to the provider. By using PaaS, you are offloading many of the security aspects of the facilities, servers, and the platform itself.
Who’s impacted? Everyone running the Jupyter Notebook feature that is part of the Cosmos DB service. It’s actually kind of funny: the Jupyter Notebook feature was initially released in 2019, but then, on February 10th, 2021, Microsoft enabled it by default for every new Cosmos DB account (YOU GET A JUPYTER NOTEBOOK! AND YOU GET A JUPYTER NOTEBOOK! EVERYONE GETS A JUPYTER NOTEBOOK!), only to have this gift (trojan?) kinda blow up in their faces. Customers didn’t ask for it and had no control over it being enabled by default, yet here we are with a massive potential exposure that you now have to clean up.
What’s interesting here is that customers need to take action and rotate their Cosmos DB primary keys themselves. There’s an article on how this is done here.
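For reference, that key regeneration can also be scripted. Below is a minimal sketch using the azure-identity and azure-mgmt-cosmosdb Python packages; method names vary slightly between SDK versions, and the subscription ID, resource group, and account name are placeholders:

```python
# Sketch: regenerate a Cosmos DB account key with the Azure Python SDK.
# Assumes the azure-identity and azure-mgmt-cosmosdb packages; the IDs and
# names below are placeholders for illustration only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "my-resource-group"                       # placeholder
ACCOUNT_NAME = "my-cosmos-account"                         # placeholder

client = CosmosDBManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Regenerate the primary read-write key. In practice you would rotate the
# secondary key first, repoint applications at it, then rotate the primary,
# so nothing is cut off mid-rotation.
poller = client.database_accounts.begin_regenerate_key(
    RESOURCE_GROUP, ACCOUNT_NAME, {"key_kind": "primary"}
)
poller.result()  # block until the regeneration completes
print(f"Regenerated primary key for {ACCOUNT_NAME}")
```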
Microsoft has reached out to all impacted customers. Who in your company is tasked with getting these kinds of messages from a provider? Now might be a great time to ensure your listed contacts are up-to-date.
Now, the BIG one…
With that vulnerability having been addressed by your engineers, and perhaps feeling that a bullet was dodged, we fast-forward to September 14th, 2021 (yes, less than a month later). On this day, Wiz announced another Azure vulnerability. This one, coined OMIGOD by Wiz, is due to a vulnerable Open Management Infrastructure (OMI) service. Microsoft uses OMI extensively behind the scenes as a common component of many of its VM management services, including the following agents that need to be enabled for management tools to work.
Azure Log Analytics
Microsoft describes their log analytics agent as follows: “The Azure Log Analytics agent collects telemetry from Windows and Linux virtual machines in any cloud, on-premises machines, and those monitored by System Center Operations Manager and sends the data it collects to your Log Analytics workspace in Azure Monitor.”
Azure Diagnostics
Microsoft describes their diagnostics agent as follows: “Azure Diagnostics extension is an agent in Azure Monitor that collects monitoring data from the guest operating system of Azure compute resources including virtual machines”.
Although there are particular use cases for each agent, both Azure Log Analytics and Diagnostics agents do pretty much the same thing, at least at a conceptual level. They both collect system information that can be used by several management solutions. A couple of the major monitoring systems in Azure are:
Azure Security Center
Microsoft describes their Security Center service as follows: “Azure Security Center collects events from Azure or log analytics agents and correlates them in a security analytics engine, to provide you with tailored recommendations (hardening tasks)”. I would assume a good number of Azure customers would use Azure Security Center.
Azure Sentinel
Microsoft describes Sentinel as follows: “Microsoft Azure Sentinel is a scalable, cloud-native, security information event management (SIEM) and security orchestration automated response (SOAR) solution. Azure Sentinel delivers intelligent security analytics and threat intelligence across the enterprise, providing a single solution for alert detection, threat visibility, proactive hunting, and threat response.”
Sentinel can only collect log data from servers when the Log Analytics agent is installed. As a result, OMIGOD applies to Sentinel. The same goes for other management solutions. I’m going to let you look at the other impacted management solutions yourself; they are listed at the MSRC site. The point is that all of these management services require agents that use Open Management Infrastructure.
OMI Applicability
The flaw in OMI is limited to Linux systems in Azure that need to collect data and send it to the aforementioned centralized management solutions. The agent is automatically installed when you enable these services for centralized management. So what’s the big deal, right? It’s Microsoft Azure, so the virtual machines must be Windows-based, you say? Not so fast. A full 50% of virtual machines in Azure are Linux-based today.
Bottom line is this: if you have Linux-based systems with centralized management implemented, you have to deal with a vulnerability that is being actively exploited in the wild.
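If you want a quick first pass at your exposure, here is a rough sketch (an illustration only, not a substitute for Microsoft’s patching guidance) that checks whether the ports commonly associated with OMI remote management (1270, 5985, 5986) are reachable on a list of hosts. The host names are placeholders:

```python
# Rough sketch: flag hosts with OMI remote-management ports reachable.
# Network exposure is only part of the picture; the authoritative fix is
# patching the OMI package per Microsoft's guidance.
import socket

HOSTS = ["linux-vm-1.example.com", "linux-vm-2.example.com"]  # placeholders
OMI_PORTS = [1270, 5985, 5986]

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in HOSTS:
    exposed = [p for p in OMI_PORTS if port_open(host, p)]
    if exposed:
        print(f"{host}: OMI management ports reachable: {exposed}")
    else:
        print(f"{host}: no OMI management ports reachable")
```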
Microsoft’s response is a bit of a mishmash of automatic and manual updates. You can find their update rollout plans here. Microsoft has also identified Azure Marketplace vendor solutions that are impacted. Honestly, the whole thing is a mess.
An interesting aspect of Microsoft’s response concerns automatic agent updates. These were enabled by default, and I believe Microsoft has now removed customers’ ability to turn them off. So, going forward, you won’t be able to control what Microsoft patches on your systems, or when, at least as far as installed agent functionality goes.
But it’s ok…you can use Azure Sentinel (which, ironically, imposes the vulnerability on your systems) to look for the vulnerability Microsoft imposed on you! That comes at the low, low cost of hundreds, likely thousands, of dollars a month, plus the knowledge and capability required to manage this SIEM.
Cloud service providers need to make security job number one if they are to attract and retain customers. These two security issues within a single month may very well impact Azure in a significant way. It will be interesting to see the impact these vulnerabilities (and exploits) have on Azure’s operating revenue. I have heard of some companies lately questioning whether to move workloads to Azure because of these vulnerabilities. Perhaps it’s because many leaders have been through the dark days of Microsoft security in the past. Microsoft had seemed to be doing a great job addressing security from Bill Gates’ Trustworthy Computing initiative up until now. Here’s hoping we don’t move back to Microsoft security as it was in 1999. Given the hybrid and integrated nature of Microsoft Azure compared to AWS, I won’t be surprised to see more of these pop up in the future.
Teachable Moment – It will happen again, and it won’t be limited to Microsoft.
Although Microsoft and their Azure customers are taking it on the chin right now, this vulnerability exposes something that has concerned me for quite some time and that impacts many IaaS customers. There’s a logical model taught in the Cloud Security Alliance CCSK training we offer at Intrinsec that breaks cloud services into four tiers. These tiers are:
Infostructure: Data and information
Applistructure: Operating system and applications
Metastructure: The glue that ties the virtual world to the underlying physical infrastructure
Infrastructure: The physical environment managed by the cloud service provider
Let me quickly explain this Metastructure layer. This is the area that a cloud provider exposes to customers. It’s where you configure everything in the “virtual world” that is your cloud environment. This is the demarcation point between the Cloud Service Provider (where they expose controls) and the Cloud customer (where you configure these controls).
The Applistructure *should* be exclusively under your control. This is where the blurring of responsibility is beginning to occur. CSPs offer services to manage your servers in a cloud environment, and in order to do this, agents need to be installed in the Applistructure. Exposing agents for customers to install is not exclusive to Azure, by the way. AWS does it with services such as Inspector (for vulnerability assessments) and the AWS Systems Manager Agent (SSM Agent). Here’s the thing, though: AWS actually pre-installs the SSM Agent on many “quick start” Amazon Machine Images (AMIs). On a related note, two things:
1) Agents often open network connectivity (such as opening ports on your operating system). This adds to the “attack footprint” of the server.
2) You can’t protect that which you don’t know exists. In this case, the agent is silently installed in Azure and pre-installed on many AWS servers. I really do wonder how many cloud engineers know their actions are responsible for installing an agent on the machine (a quick inventory sketch follows this list).
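As a starting point for that inventory, here is a hedged sketch that lists the extensions (agents) installed on each VM in an Azure subscription using the azure-identity and azure-mgmt-compute Python packages; the subscription ID is a placeholder:

```python
# Sketch: report the extensions (agents) installed on each Azure VM in a
# subscription, so "silently installed" agents at least show up in a report.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for vm in compute.virtual_machines.list_all():
    # vm.id looks like /subscriptions/<sub>/resourceGroups/<rg>/providers/...
    resource_group = vm.id.split("/")[4]
    extensions = compute.virtual_machine_extensions.list(resource_group, vm.name)
    names = [ext.name for ext in (extensions.value or [])]
    print(f"{vm.name}: {names or 'no extensions installed'}")
```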
This blurring of the lines between the applistructure and the metastructure concerns me for multiple reasons. How are roles delineated between the cloud engineers and the systems engineers in your environment? The people who manage the cloud infrastructure also gain the ability to run arbitrary code on systems once these agents are installed. Compromise of the management plane (e.g., aws.amazon.com, portal.azure.com) can lead not just to termination of server instances, but can potentially also allow the bad actor to run anything they want on your instances with god-level privileges, or take any data they want thanks to those same privileges.
I’m certain few people are aware of this, and I’m also certain that many have never even contemplated the management console as an attack vector against the systems themselves. Take the AWS Systems Manager Run Command, for example. Here’s the AWS description of what Run Command can do:
“Administrators use Run Command to perform the following types of tasks on their managed instances: install or bootstrap applications, build a deployment pipeline, capture log files when an instance is removed from an Auto Scaling group, and join instances to a Windows domain.”
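To make that concrete, here is a minimal boto3 sketch of what anyone holding the ssm:SendCommand permission can do against a managed instance; the instance ID is a placeholder, and the commands are deliberately harmless:

```python
# Sketch: anyone allowed to call ssm:SendCommand can run arbitrary shell
# commands on a managed instance via the stock AWS-RunShellScript document.
# The instance ID below is a placeholder.
import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],          # placeholder instance ID
    DocumentName="AWS-RunShellScript",             # AWS-managed document
    Parameters={"commands": ["id", "uname -a"]},   # harmless demo commands
)

command_id = response["Command"]["CommandId"]
print(f"Dispatched command {command_id}")
```

Swap those harmless commands for anything else and they still run with root (Linux) or SYSTEM (Windows) privileges, which is exactly the point.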
Add to this the following statements in the Systems Manager user guide:
“AWS Systems Manager Agent (SSM Agent) runs on Amazon Elastic Compute Cloud (Amazon EC2) instances using root permissions (Linux) or SYSTEM permissions (Windows Server). Because these are the highest level of system access permissions, any trusted entity that has been granted permission to send commands to SSM Agent has root or SYSTEM permissions.”
And…
“This level of access is required for a principal to send authorized Systems Manager commands to SSM Agent, but also makes it possible for a principal to run malicious code by exploiting any potential vulnerabilities in SSM Agent.”
But wait! There’s more! As per the Systems Manager User Guide, the SSM agent is preinstalled on the following AWS Amazon Machine Images:
Amazon Linux
Amazon Linux 2
Amazon Linux 2 ECS-Optimized Base AMIs
Ubuntu Server 16.04, 18.04, and 20.04 (!)
macOS 10.14.x (Mojave)
macOS 10.15.x (Catalina)
macOS 11.x (Big Sur)
Windows Server 2008-2012 R2 AMIs published in November 2016 or later
Windows Server 2016 and 2019
Remind me again how this isn’t going to be a problem in the future? Perhaps not a vulnerability per se, but I can confidently predict that very few organizations realize their server instances actually have an SSM agent pre-installed.
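If you want to test that prediction in your own account, a quick boto3 sketch like the one below lists every instance in the current region that already has a registered, reporting SSM agent:

```python
# Sketch: list instances in the current region with a registered SSM agent,
# a quick way to discover pre-installed agents you may not have known about.
import boto3

ssm = boto3.client("ssm")

paginator = ssm.get_paginator("describe_instance_information")
for page in paginator.paginate():
    for info in page["InstanceInformationList"]:
        print(
            f"{info['InstanceId']}: agent {info.get('AgentVersion', 'unknown')}, "
            f"ping status {info.get('PingStatus', 'unknown')}"
        )
```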
How do you restrict access to this ability? One of two ways:
1) Use AWS IAM to restrict access to the SendCommand and StartSession actions in Systems Manager (a sketch of such a deny policy follows this list). Here’s a link to the Systems Manager API reference (https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_Operations.html). As a side note, AWS also says this on page 121 of their 1578-page(!) SSM User Guide (available here).
2) Don’t install the SSM agent just because you think you may need the functionality in the future. Always remember that you don’t need to protect that which doesn’t exist. Implement for the present, not for what may be needed in the future.
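As a sketch of option 1, the snippet below creates an explicit-deny IAM policy for the two actions in question using boto3; the policy name is a placeholder, and you would scope and attach it according to your own role strategy:

```python
# Sketch: an explicit-deny IAM policy for principals that should never use
# Run Command or Session Manager. The policy name is a placeholder.
import json
import boto3

iam = boto3.client("iam")

deny_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySsmCommandExecution",
            "Effect": "Deny",
            "Action": ["ssm:SendCommand", "ssm:StartSession"],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="deny-ssm-command-execution",      # placeholder name
    PolicyDocument=json.dumps(deny_policy),
    Description="Explicitly deny SSM Run Command and Session Manager use",
)
```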
The final take-away is that this blurring of the lines between the metastructure and the applistructure can lead to terrible results from a security perspective if left unchecked. Multiple groups (cloud admins vs. systems admins) will be able to manage the same thing, and that often leads either to collisions caused by confusion, or to a lack of action due to assumptions about who is responsible for different items.
If you are going to have utilities that span the metastructure and applistructure layers, make sure you understand what is being installed on your servers, how to properly secure it, and practice least privilege for management.