Microsoft outage highlights need for operational resilience, say experts

Global IT outage has impacted telecoms firms, banks and airlines.

Cybersecurity experts have said that today’s Microsoft outage is a timely reminder for companies to put operational resiliency at the heart of their strategies.

The mass global IT outage has caused chaos, impacting a host of businesses including telecoms companies, major banks, media outlets and airlines. Many flights have had to be grounded, resulting in large queues and delays at airports, while shops and communications have also been hit.

The outage stemmed from a defect in a content update carried out by cybersecurity firm CrowdStrike, which has affected Microsoft Windows devices. CrowdStrike’s CEO, George Kurtz, said that the problem has been “identified, isolated and a fix has been deployed”.

Evolve’s CEO Alan Stephenson-Brown said that the outage served as a “timely reminder” that companies needed to put operational resilience at the forefront of their strategies.

"Demonstrating that even large corporations aren't immune to IT troubles, this outage highlights the importance of having distributed data centres and rerouting connectivity that ensures businesses can continue functioning when cloud infrastructure is disrupted," he said. "By prioritising both contingency planning and preventative measures, IT systems can be protected. I urge business leaders to seriously appraise the systems they have in place to identify potential vulnerabilities before they find themselves the subject of the next IT outages headline."

Al Lakhani, CEO of IDEE, said that the incident underscored the importance of businesses thoroughly researching and vetting their cybersecurity solutions before implementing them.

"Microsoft clearly fell short in this regard, and we are witnessing a cascade of operational failures around the world as a result," he said. "CrowdStrike’s platform approach, which relies on a single agent focused on detection, might seem good at first glance, but as we can see, it can create significant issues.

“For instance, agents require installation and maintenance of software on multiple different OSes, adding layers of complexity and potential points of failure. Moreover, agents can become a single point of failure, as a bad update can compromise the entire network, as seen with the SolarWinds attack.”
 
Lakhani added that investing in cybersecurity isn’t just about acquiring the latest or most popular tools, but about ensuring they are reliable and resilient. That is why businesses must prioritise agentless solutions such as MFA 2.0, which reduce the risk of widespread failures and ensure more resilient defences, he said.

Neatsun Ziv, CEO at OX Security, added, "The lesson which can be taken from an event such as this is the importance of choosing a vendor who can protect your server as a distinct and valuable portion of the network, separate from endpoints. Endpoint devices may need resetting in this kind of scenario, but if the server also needs resetting it becomes a much more complex fix. Taking the example of an ATM connected to an affected server, this may require a manual reset by an engineer, which for the large financial organisations currently affected could mean hours or days of downtime for key services.

"Moving forward, a system of agentless updates as opposed to automatically updating agents on the endpoint servers could help alleviate issues like this; the associated convenience of automating these updates creates more potential for outages and security incidents, and this kind of event could happen to any vendor that uses agent technology."

Ecliptic Dynamics co-founder, Tom Kidwell, said that the outage highlighted the vulnerability in using a single supplier on a large scale.

“The outage impacting Windows devices and servers at airports, hospitals and stores around the world appears to have been caused by a driver update by CrowdStrike, bricking older Windows devices and servers, which will be worst hit,” he said. “Unfortunately for CrowdStrike, if that is the case it could be nauseating to fix.

"Due to the nature of the update, an individual from every organisation will need to boot into safe mode, remove the issue file/driver, and then either roll back or update to a new version, something CrowdStrike will need to release very quickly.

“Incidents like this highlight the vulnerability in using a single supplier on such a vast scale, and why it’s critical that organisations have a backup plan. Best practice for vendors is to pressure test any updates before rollout, however this can be difficult when you serve 60 to 90 per cent of the world.”

Omer Grossman, CIO at cybersecurity firm CyberArk, said the damage caused by the outage will be “dramatic”.

“The glitch is due to a software update of CrowdStrike’s EDR product,” he said. “This is a product that runs with high privileges that protects endpoints. A malfunction in this can, as we are seeing in the current incident, cause the operating system to crash.”
