South Central US - Preliminary RCA
The following material is intended to be an overview and act as a preliminary root cause analysis (RCA) for the South Central US incident beginning 4 September 2018.
The Azure engineering teams are continuing with a detailed investigation over the coming weeks to identify resiliency gaps and areas of improvement for Azure services. A detailed analysis will be made available in the weeks ahead.
Summary of impact: In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region. Multiple Azure datacenters in the region saw voltage sags and swells across the utility feeds. At 08:42 UTC, lightning caused electrical activity on the utility supply, which caused significant voltage swells. These swells triggered a portion of one Azure datacenter to transfer from utility power to generator power. Additionally, these power swells shutdown the datacenter’s mechanical cooling systems despite having surge suppressors in place. Initially, the datacenter was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the datacenter temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated. This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the datacenter that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units.
While storms were still active in the area, onsite teams took a series of actions to prevent further damage – including transferring the rest of the datacenter to generators thereby stabilizing the power supply. To initiate the recovery of infrastructure, the first step was to recover the Azure Software Load Balancers (SLBs) for storage scale units. SLB services are critical in the Azure networking stack, managing the routing of both customer and platform service traffic. The second step was to recover the storage servers and the data on these servers. This involved replacing failed infrastructure components, migrating customer data from the damaged servers to healthy servers, and validating that none of the recovered data was corrupted. This process took time due to the number of servers damaged, and the need to work carefully to maintain customer data integrity above all else. The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication.
Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter. Unfortunately, this particular set of issues also caused a cascading impact to services outside of the region, as described below.
 Impact to resources in South Central US
Customer impact began at approximately 09:29 UTC when storage servers in the datacenter began shutting down as a result of unsafe temperatures. This impacted multiple Azure services that depended on these storage servers, specifically: Storage, Virtual Machines, Application Insights, Cognitive Services & Custom Vision API, Backup, App Service (and App Services for Linux and Web App for Containers), Azure Database for MySQL, SQL Database, Azure Automation, Site Recovery, Redis Cache, Cosmos DB, Stream Analytics, Media Services, Azure Resource Manager, Azure VPN gateways, PostgreSQL, Application Insights, Azure Machine Learning Studio, Azure Search, Data Factory, HDInsight, IoT Hub, Analysis Services, Key Vault, Log Analytics, Azure Monitor, Azure Scheduler, Logic Apps, Databricks, ExpressRoute, Container Registry, Application Gateway, Service Bus, Event Hub, Azure Portal IaaS Experiences- Bot Service, Azure Batch, Service Fabric and Visual Studio Team Services (VSTS). Although the vast majority of these services were mitigated by 11:00 UTC on 5 September, full mitigation was not until approximately 08:40 UTC on 7 September.
 Impact to Azure Service Manager (ASM)
Insufficient resiliency for Azure Service Manager (ASM) led to the widest impact for customers outside of South Central US. ASM performs management operations for all ‘classic’ resource types. This is often used by customers who have not yet adopted Azure Resource Manager (ARM) APIs which have been made available in the past several years. ARM provides reliable, global resiliency by storing data in every Azure region. ASM stores metadata about these resources in multiple locations, but the primary site is South Central US. Although ASM is a global service, it does not support automatic failover. As a result of the impact in South Central US, ASM requests experienced higher call latencies, timeouts or failures when performing service management operations. To alleviate failures, engineers routed traffic to a secondary site. This helped to provide temporary relief, but the impact was fully mitigated once the associated South Central US storage servers were brought back online at 01:10 UTC on 5 September. Update 12 September 2018: We have received feedback seeking clarification related to impact to ARM during the incident, we wanted to clarify this impact to customers. The ARM service instances in regions outside of South Central US were impacted due to the dependencies called out in  below, which includes subscription and RBAC role assignment operations. In addition, because ARM is used as a routing and orchestration layer, customers may have experienced latency, failures or timeouts when using PowerShell, CLI or the portal to manage resources through ARM that had underlying dependencies on ASM or other impacted services mentioned in this RCA.
 Impact to Azure Active Directory (AAD)
Customer impact began at 4 September at 11:00 UTC. One of the AAD sites for North America is based in the South Central US region, specifically in the affected datacenter. The design for AAD includes globally distributed sites for high availability so, when infrastructure began to shutdown in the impacted datacenter, authentication traffic began automatically routing to other sites. Shortly thereafter, there was a significantly increased rate in authentication requests. Our automatic throttling mechanisms engaged, so some customers continued to experience high latencies and timeouts. Adjustments to traffic routing, limited IP address blocking, and increased capacity in alternate sites improved overall system responsiveness – restoring service levels. Customer impact was fully mitigated at 14:40 UTC on 4 September
 Impact to Visual Studio Team Services (VSTS)
VSTS organizations hosted in South Central US were down. Some VSTS services in this region provide capabilities used by services in other regions, which led to broader impact – slowdowns, errors in the VSTS Dashboard functionality, and inability to access user profiles stored in South Central US to name a few. Customers with organizations hosted in the US were unable to use Release Management and Package Management services. Build and release pipelines using the Hosted macOS queue failed. To avoid data loss, VSTS services did not failover and waited for the recovery of the Storage services. After VSTS services had recovered, additional issues for customers occurred in Git, Release Management, and some Package Management feeds due to Azure Storage accounts that had an extended recovery. This VSTS impact was fully mitigated by 00:05 UTC on 6 September. More information can be found at https://aka.ms/VSTS-CPD1PLG.
 Impact to Azure Application Insights
Application Insights resources across multiple regions experienced impact. This was caused by a dependency on Azure Active Directory and platform services that provide data routing. This impacted the ability to query data, to update/manage some types of resources such as Availability Tests, and significantly delayed ingestion. Engineers scaled out the services to more quickly process the backlog of data and recover the service. During recovery, customers experienced gaps in data, as seen in the Azure portal; Log Search alerts firing, based on latent ingested data; latency in reporting billing data to Azure commerce; and delays in results of Availability Tests. Although 90% of customers were mitigated by 16:16 UTC on 6 September, full mitigation was not until 10:12 UTC on 7 September.
 Impact to the Azure status page
The Azure status page (status.azure.com) is a web app that uses multiple Azure services in multiple Azure regions. During the incident, the status page received a significant increase in traffic, which should have caused a scale out of resources as needed. The combination of the increased traffic and incorrect auto-scale configuration settings prevented the web app from scaling properly to handle the increased load. This resulted in intermittent 500 errors for a subset of page visitors, starting at approximately 12:30 UTC. Once these configuration issues were identified, our engineers adjusted the auto-scale settings, so the web app could expand organically to support the traffic. The impact to the status page was fully mitigated by 23:05 UTC on 4 September.
 Impact to Azure subscription management
From 09:30 UTC on 4 September, Azure subscription management experienced five separate issues related to the South Central US datacenter impact. First, provisioning of new subscriptions experienced long delays as requests were queued for processing. Second, the properties of existing subscriptions could not be updated. Third, subscription metadata could not be accessed, and as a result billing portal requests failed. Fourth, customers may have received errors in the Azure management and/or billing portals when attempting to create support tickets or contacting support. Fifth, a subset of customers were incorrectly shown a banner asking that they contact support to discuss a billing system issue. Impact to subscription management was fully mitigated by 08:30 UTC on 5 September.
Next steps: We sincerely apologize for the impact to affected customers. Investigations have started to assess the best ways to improve architectural resiliency and mitigate the issues that contributed to this incident and its widespread impact. While this is our preliminary analysis and more details will be made available in the weeks ahead, there are several workstreams that are immediately underway. These include (but are not limited to) the following themes which we understand to be the biggest contributing factors to this incident:
- A detailed forensic analysis of the impacted datacenter hardware and systems, in addition to a thorough review of the datacenter recovery procedures.
- A review with every internal service to identify dependencies on the Azure Service Manager (ASM) API. We are exploring migration options to move these services from ASM to the newer ARM architecture.
- An evaluation of the future hardware design of storage scale units to increase resiliency to environmental factors. In addition, for scenarios in which impact is unavoidable, we are determining software changes to automate and accelerate recovery.
- Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.
Provide feedback: Please help us improve the Azure customer communications experience by taking our survey https://aka.ms/CPD-1PLG