Global IT outages inevitable without effective backup strategies

Experts urge more contingency planning for networks and organisations to avert a repeat of Friday's global outage


REUTERS July 20, 2024
Scoot passengers wait to be checked in manually at Changi Airport Terminal 1 in Singapore after a global IT system outage. Photo: Reuters

Elements of Friday’s global IT outage, which grounded planes and hit services from banking to healthcare, have occurred before, and until more contingencies are built into networks and organisations put better back-up plans in place, it will happen again.

Friday’s outage was caused by an update that US cybersecurity firm CrowdStrike pushed to its clients early on Friday morning which conflicted with Microsoft’s Windows operating system, rendering devices around the world inoperable.

CrowdStrike has one of the largest shares of the highly competitive cybersecurity market that provides such tools, leading some industry analysts to question whether control over such operationally critical software should remain in the hands of just a handful of companies.

But the outage has also raised concerns among experts that many organisations are not well-prepared to implement contingency plans when a single point of failure such as an IT system, or a piece of software within it, goes down.

At the same time there are also more solvable digital disasters looming on the horizon, with perhaps the biggest global IT challenge since the Millennium Bug, the “2038 Problem”, just under 14 years away - and, this time, the world is far more dependent on computers.

“It’s easy to jump at the idea that this is disastrous and therefore suggest there must be a more diverse market and, in an ideal world, that’s what we’d have,” said Ciaran Martin, former head of Britain’s National Cyber Security Centre (NCSC), part of the country's GCHQ intelligence agency.

“We're actually good at managing the safety aspects of tech when it comes to cars, trains, planes, and machines. What we're bad at is then providing services,” he added.

“Look at what happened to the London health system a few weeks ago - they were hacked, and that led to loads of cancelled operations, which is physically dangerous,” he said, referring to a recent ransomware incident which affected Britain’s National Health Service (NHS).

Organisations need to look around their IT systems, Martin said, and ensure there are enough failsafes and redundancies in those systems to stay operational in the event of an outage.

Friday’s outage happened amid a perfect storm, with both Microsoft and CrowdStrike owning huge shares of a market which relies on both of their products.

“I'm sure the regulators globally are looking at this. There is limited competition globally for operating systems, for example, and also for the large scale cybersecurity products like the ones CrowdStrike provides,” said Nigel Phair, a cybersecurity professor at Australia’s Monash University.

Friday's outage hit airlines particularly hard, as many scrambled to check in and board passengers who relied upon digital tickets to fly. Some travellers posted photos on social media of hand-written boarding cards provided by airline staff. Others were only able to fly if they had printed out their ticket.

“I think it's very important for organisations of all shapes and sizes to really look at their risk management and look at an all-hazards approach,” Phair said.

Epochalypse now

Friday’s outage will not be the last time the world is reminded of its dependency on computers and IT products for basic services to function. In about 14 years' time, the world will be faced with a time-based computer issue similar to the Millennium Bug called the “2038 Problem”.

The Millennium Bug, or “Y2K”, happened because early computers saved expensive memory space by storing only the last two digits of the year, meaning many systems were unable to distinguish between the years 1900 and 2000, leading to critical errors.
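The ambiguity can be illustrated with a minimal Python sketch. The function names, the account example and the pivot value of 70 are hypothetical, chosen only for illustration; "windowing" two-digit years around a pivot was one of several remediation approaches, not the only one.

```python
def years_elapsed(start_yy: int, end_yy: int) -> int:
    """Naive date arithmetic on two-digit years, as many legacy systems did."""
    return end_yy - start_yy


def expand_with_pivot(yy: int, pivot: int = 70) -> int:
    """Expand a two-digit year to four digits using a pivot (hypothetical value).

    Values at or above the pivot are read as 19xx, values below it as 20xx.
    """
    return 1900 + yy if yy >= pivot else 2000 + yy


# An account opened in 1985 ("85") and checked in 2000 ("00"):
naive_age = years_elapsed(85, 0)                            # negative: 2000 looks earlier than 1985
fixed_age = expand_with_pivot(0) - expand_with_pivot(85)    # correct once both years are expanded

print(naive_age, fixed_age)
```

With only two digits stored, the year 2000 appears to come before 1985, so the naive result is negative; expanding both values to four digits first restores the correct difference.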

The cost to mitigate the problem in the years before 2000 ran up a global bill of hundreds of billions of dollars.

The 2038 problem, or "Epochalypse", which begins at 0314 GMT on Jan. 19, 2038, is, in essence, the same problem.

Many computers count the passage of time by measuring the number of seconds since midnight on Jan. 1, 1970, also known as the “Epoch”.

Those seconds are stored as a finite sequence of zeroes and ones, or “bits”, but for many computers, the largest number those bits can represent will be reached in 2038.
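The arithmetic behind that date can be checked with a short Python sketch. It assumes the counter is a signed 32-bit integer, which is the commonly cited case for affected systems; the helper names are hypothetical, and Python's own integers do not overflow, so the wraparound is simulated.

```python
from datetime import datetime, timedelta, timezone

# The "Epoch": midnight UTC on Jan. 1, 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value a signed 32-bit counter can hold (assumed counter width).
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Simulate signed 32-bit two's-complement wraparound."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert a count of seconds since the Epoch to a UTC timestamp."""
    return EPOCH + timedelta(seconds=seconds)


last_moment = to_utc(INT32_MAX)                      # final second the counter can represent
after_overflow = to_utc(wrap_int32(INT32_MAX + 1))   # one second later, after wrapping

print(last_moment.isoformat())     # 2038-01-19T03:14:07+00:00
print(after_overflow.isoformat())  # 1901-12-13T20:45:52+00:00
```

Under that assumption, one second after 03:14:07 GMT on Jan. 19, 2038, the counter wraps to its most negative value and the computed date jumps back to December 1901. Systems that store the count in 64 bits do not hit this limit.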

“We currently have a situation where there's huge global disruption, because we cannot cope administratively,” said Ciaran Martin, the former NCSC head.
