Disruption in Azure environment due to CrowdStrike update

Customer Situation

On Friday 19/07/2024 at 06:00 CEST, CrowdStrike released a sensor configuration update for Windows systems. Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform. This particular update triggered a logic error that resulted in a system crash and blue screen of death (BSOD) on impacted systems. By 08:00 CEST, the first impact report was received from one of our customers, whose Azure environment we manage. Immediate action was required to mitigate the issues caused by the update.

Partner Solution

The initial plan of action was to implement a fix for the issue. A workaround had already been posted online indicating that a file needed to be deleted from the Windows System32 directory, but applying it required booting into Safe Mode. The problem was that these were Azure VMs, and Safe Mode wasn't an option. The Data Center Services Lead suggested attaching the affected VMs' OS disks to another VM and renaming the files there, and implementation of this fix began. However, the OS disk of an Azure VM cannot simply be detached and reattached like a data disk; returning it to the original VM required a "Swap OS disk" operation instead.
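For illustration only, a minimal sketch of the file cleanup step is shown below, assuming the affected OS disk has already been attached to a jump server and is visible under a hypothetical drive letter F:. The file pattern follows CrowdStrike's publicly circulated remediation guidance for this incident; the exact files the team removed or renamed are not detailed in this write-up.

```python
# Hypothetical cleanup run on the jump server after attaching the recovered disk.
# The drive letter F: and the C-00000291*.sys pattern are assumptions based on
# public remediation guidance, not a record of the exact steps taken here.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"F:\Windows\System32\drivers\CrowdStrike")

for channel_file in CROWDSTRIKE_DIR.glob("C-00000291*.sys"):
    print(f"Deleting {channel_file}")
    channel_file.unlink()
```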

Given the team's familiarity with Azure and the customer's environment, creating a workflow for the necessary steps was straightforward. The first step was to create a snapshot of the affected OS disk, which took less than a minute and underlined the value of good backups. The next step was to create a managed disk from that snapshot for use in the "Swap OS disk" process, which took less than five minutes. Because managed disks incur costs even when not attached to a VM, a sensible naming convention was crucial for keeping track of the temporary resources and cleaning them up afterwards: VMname_Snapshot_date for snapshots and VMname_recovered_date for recovered disks.
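As an illustration, the sketch below shows how the snapshot and recovery-disk steps could be scripted with the Azure SDK for Python (azure-mgmt-compute). The subscription ID, resource group, and VM name are placeholders; this is a sketch of the approach, not a reproduction of the team's actual tooling.

```python
from datetime import date

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "<resource-group>"    # placeholder
VM_NAME = "<affected-vm>"              # placeholder

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Locate the affected VM and its OS disk.
vm = compute.virtual_machines.get(RESOURCE_GROUP, VM_NAME)
os_disk = compute.disks.get(RESOURCE_GROUP, vm.storage_profile.os_disk.name)

# Step 1: snapshot the OS disk (VMname_Snapshot_date).
snapshot_name = f"{VM_NAME}_Snapshot_{date.today():%Y%m%d}"
snapshot = compute.snapshots.begin_create_or_update(
    RESOURCE_GROUP,
    snapshot_name,
    {
        "location": os_disk.location,
        "creation_data": {"create_option": "Copy", "source_resource_id": os_disk.id},
    },
).result()

# Step 2: create a managed disk from the snapshot (VMname_recovered_date).
recovered_name = f"{VM_NAME}_recovered_{date.today():%Y%m%d}"
recovered_disk = compute.disks.begin_create_or_update(
    RESOURCE_GROUP,
    recovered_name,
    {
        "location": os_disk.location,
        "sku": {"name": os_disk.sku.name},
        "creation_data": {"create_option": "Copy", "source_resource_id": snapshot.id},
    },
).result()
```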

This process involved creating the snapshot, converting it to a managed disk, attaching that disk to another server (in this case, a jump server), deleting the necessary files, detaching the disk, and finally swapping it back in as the OS disk of the affected VM and powering the VM on.
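Continuing the earlier sketch (same placeholders and assumptions), the attach, detach, and "Swap OS disk" steps could look roughly like this; the jump-server name and LUN are assumptions for illustration.

```python
from azure.mgmt.compute.models import DataDisk, ManagedDiskParameters

JUMP_SERVER = "<jump-server-vm>"  # placeholder

# Attach the recovered disk to the jump server as a data disk so the
# problematic files can be removed from within a healthy OS.
jump_vm = compute.virtual_machines.get(RESOURCE_GROUP, JUMP_SERVER)
jump_vm.storage_profile.data_disks.append(
    DataDisk(
        lun=10,  # assumed free LUN on the jump server
        name=recovered_disk.name,
        create_option="Attach",
        managed_disk=ManagedDiskParameters(id=recovered_disk.id),
    )
)
compute.virtual_machines.begin_create_or_update(RESOURCE_GROUP, JUMP_SERVER, jump_vm).result()

# ... delete the problematic files on the mounted disk (see the earlier sketch) ...

# Detach the cleaned disk from the jump server.
jump_vm.storage_profile.data_disks = [
    d for d in jump_vm.storage_profile.data_disks if d.name != recovered_disk.name
]
compute.virtual_machines.begin_create_or_update(RESOURCE_GROUP, JUMP_SERVER, jump_vm).result()

# "Swap OS disk": stop the affected VM, point it at the cleaned disk, and start it.
compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, VM_NAME).result()
vm.storage_profile.os_disk.managed_disk.id = recovered_disk.id
vm.storage_profile.os_disk.name = recovered_disk.name
compute.virtual_machines.begin_create_or_update(RESOURCE_GROUP, VM_NAME, vm).result()
compute.virtual_machines.begin_start(RESOURCE_GROUP, VM_NAME).result()
```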

Once the solution had been tailored and validated on one machine, the other team members were informed of the issue and the fix through various communication channels. This enabled effective planning and coordination to remediate the issue across the Azure environment. The Data Center Services Lead also provided lists of servers, prioritizing the most critical ones to bring up first.

Key Drivers & Business Objectives

  • Time: 08:00 CEST
  • Customer: Global manufacturer
  • Reported issue: Disruption in Azure environment due to CrowdStrike update
  • Initial response: Immediate investigation initiated upon the report from the customer's Data Center Services Lead.

Problem Analysis

  • Regular exchanges in Microsoft Teams with the customer’s internal team.
  • Initial confusion due to the absence of tickets indicating service issues.
  • Direct discussion with the customer’s Data Center Services Lead.
  • Confirmation that the issue was indeed causing significant disruptions.

Value Provided & Business Outcomes

All affected VMs in the customer's Azure environment were fully operational within 24 hours. Effective teamwork and swift implementation of the tailored solution minimized downtime and operational impact.

Win Insights

The swift and coordinated response to the CrowdStrike update issue demonstrates the effectiveness of our incident management processes. By leveraging strong communication, detailed knowledge of the customer’s environment, and innovative problem-solving, we successfully mitigated a potentially severe disruption within a remarkably short timeframe. This case underscores the importance of readiness, teamwork, and adaptability in IT service management.

Lessons Learned

  • Quick access to reliable backups is crucial for effective disaster recovery.
  • Maintaining open communication channels enables quick information dissemination and strategy adjustments.
  • The ability to quickly tailor standard solutions to fit specific environments and constraints is invaluable.
  • Clear and consistent naming conventions are vital for managing and tracking resources during a crisis.