CrowdStrike has published a post incident review (PIR) of the buggy update it published that took down 8.5 million Windows machines last week. The detailed post blames a bug in test software for not properly validating the content update that was pushed out to millions of machines on Friday. CrowdStrike is promising to more thoroughly test its content updates, improve its error handling, and implement a staggered deployment to avoid a repeat of this disaster.
CrowdStrike’s Falcon software is used by businesses around the world to help manage against malware and security breaches on millions of Windows machines. On Friday, CrowdStrike issued a content configuration update for its software that was supposed to “gather telemetry on possible novel threat techniques.” These updates are delivered regularly, but this particular configuration update caused Windows to crash.
CrowdStrike typically issues configuration updates in two different ways. There’s what’s called Sensor Content that directly updates CrowdStrike’s own Falcon sensor that runs at the kernel level in Windows, and separately there is Rapid Response Content that updates how that sensor behaves to detect malware. A tiny 40KB Rapid Response Content file caused Friday’s issue.
Updates to the actual sensor don’t come from the cloud, and typically include AI and machine learning models that will allow CrowdStrike to improve its detection capabilities over the long term. Some of these capabilities include something called Template Types, which is code that enables new detection and is configured by the type of separate Rapid Response Content that was delivered on Friday.
On the cloud side CrowdStrike manages its own system that performs validation checks on content before it’s released to prevent an incident like Friday from happening. CrowdStrike released two Rapid Response Content updates last week, or what it also calls Template Instances. “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data,” says CrowdStrike.
While CrowdStrike preforms both automated and manual testing on Sensor Content and Template Types, it doesn’t appear to do as much thorough testing on the Rapid Response Content that was delivered on Friday. A March deployment of new Template Types provided “trust in the checks performed in the Content Validator,” so CrowdStrike appears to have assumed the Rapid Response Content rollout wouldn’t cause issues.
This assumption led to the sensor loading the problematic Rapid Response Content into its Content Interpreter and triggering an out-of-bounds memory exception. “This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD),” explains CrowdStrike.
To prevent this from happening again, CrowdStrike is promising to improve its Rapid Response Content testing by using local developer testing, content update and rollback testing, alongside stress testing, fuzzing, and fault injection. CrowdStrike will also perform stability testing and content interface testing on Rapid Response Content.
CrowdStrike is also updating its cloud-based Content Validator to better check over Rapid Response Content releases. “A new check is in process to guard against this type of problematic content from being deployed in the future,” says CrowdStrike.
On the driver side, CrowdStrike will “enhance existing error handling in the Content Interpreter,” which is part of the Falcon sensor. CrowdStrike will also implement a staggered deployment of Rapid Response Content, ensuring that updates are gradually deployed to larger portions of its install base instead of an immediate push to all systems. Both the driver improvements and staggered deployments have been recommended by security experts in recent days.