We tend to take electronic communications for granted. Like power and water, it's always there, but although it's a part of the critical national infrastructure (CNI), this doesn't mean it can't, or doesn't, fail from time to time, says David Sutton.

There are times when we are unable to hear a dial tone or see a signal from the cell site, when the email server returns an error code, or when the web browser tells us that it's unable to connect. These problems, which beset elements of the so-called critical information infrastructure (CII), are not uncommon and can bring our organisations to a standstill.

Why do these things happen, and what can we do?

There are several causes of failure of the CII, so let's take a look at some real-world examples. Loss of power can lead to the total or partial failure of electronic communications. The energy industry is another part of the CNI but, despite its general reliability, serious failures can still occur.

In November 2006, E.ON, an energy provider in Germany, responded to a request to disconnect a high voltage power line crossing the River Ems to allow a Norwegian ship to pass. The request was not unusual, and the work had been pre-planned to ensure that there would be no impact on downstream power systems.

On the evening in question, the ship arrived three hours early and requested an earlier disconnection of the power line. The conditions were checked and verified as being close to the required margins, and the disconnection went ahead.

The cascade that followed was blamed on a calculation based on a false assumption about the ratings of two interconnected networks. Power failed across large parts of central and southern Europe, and an estimated 15 million households and businesses were disconnected.

Severe weather-related disruptions such as snow and ice storms, extended periods of low or high temperature, heavy rain, hurricanes and flooding all conspire against the communications networks. More severe geological hazards, including earthquakes, tsunamis, volcanic eruptions, landslides and subsidence, frequently cause dramatic interruptions to services.

One specific example of this type of disruption was Hurricane Katrina in 2005, in which the resulting flooding caused the almost total failure of the local communications infrastructure. This was, in part, linked to the complete failure of power supplies in the area.

The technical impacts included 180 fixed network telephone exchanges and three million fixed telephone lines put out of action, 38 emergency call centres out of service, the failure of one critical 911 switch, 2,000 mobile base stations out of service and serious disruption to public safety radio systems. Full recovery took months.

Deliberate attacks on the CII itself or on its controlling environment include terrorism, civil unrest, arson, vandalism, sabotage, theft (especially the theft of critical components and copper cable), and cyber and malware attacks, including distributed denial of service (DDoS) attacks.

There are numerous examples of these types of threat, one of which involved a very common case of mistaken identity. In June 2012, thieves attempted to steal a 1,000-metre underwater cable owned by BT, running beneath Loch Carron in the western Highlands of Scotland.

They had been under the impression that this was a copper cable, and therefore of considerable scrap value. It transpired that it was a fibre-optic cable, and the thieves left empty-handed, having caused serious damage in their attempt to remove it. The result was the loss of fixed line services, including access to the 112/999 emergency numbers, along with some mobile phone networks and broadband services, for around 9,000 residents in remote Highland villages.

Accidental threats: although not deliberate, we also experience threats and hazards of human origin, including road, rail, air and shipping accidents; industrial accidents and consequential chemical releases; and accidents that damage trackside and under-sea cables. In the early hours of 29 March 2004, a fire broke out in a tunnel approximately 30 metres beneath the city of Manchester. The tunnel, owned and managed by BT, dates from the Cold War period and carries a large number of telecommunication cables.

In all, around 130,000 fixed telephone lines were affected, and the burnt-out cables carried traffic not only from BT itself, but also from other carriers who leased capacity from BT. In particular, access to the emergency service number 112/999 was disrupted, as were circuits carrying private voice services for the emergency services themselves, with Greater Manchester’s Ambulance Service losing most of its radio capability.

The fire also affected bank cash machines, and the ability of traders in the area to conduct credit and debit card transactions was severely disrupted. Many banks and businesses closed until service was restored, and it was estimated that the cost to the business community was in excess of £22M over a five-day period.

Infrastructure failures are quite common, including network hardware and software failures. Network disruptions, especially those in which failures can propagate or cascade into other interconnected networks, are less common but far more destructive.

In January 1990, a switch in AT&T's long-distance network in New York sensed an overload and, having blocked all further incoming call attempts, carried out a routine reset, which should have taken no more than a few seconds. In doing this, it sent a message to each of its directly connected switches advising them that this was happening.

Once the reset had been completed, it sent a second 'all clear' message. An error in the software that handled these messages meant that the first message had not been fully processed by the receiving switches before the second arrived, and these switches obligingly carried out the same routine, causing all of their directly connected switches to do the same. In total, all 114 of the long-distance switches were affected, resulting in more than 50 million calls being blocked.
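
To make the failure mode concrete, the following toy simulation (a purely illustrative sketch; the class names, timings and message handling are assumptions, not AT&T's actual switch software) shows how a race between a 'resetting' message and an 'all clear' message can propagate resets across every interconnected switch from a single routine event.

```python
# Illustrative sketch only: a toy model of how a message-handling race can
# cascade resets across interconnected switches. Names, timings and logic
# are assumptions for illustration, not AT&T's actual switch software.

class Switch:
    def __init__(self, name):
        self.name = name
        self.neighbours = []
        self.last_msg_time = None
        self.has_reset = False

    def receive(self, msg, arrival_time, processing_time=0.01):
        # The modelled bug: if a second message arrives before the first has
        # been fully processed, the switch itself resets and notifies its
        # neighbours, propagating the fault.
        if self.last_msg_time is not None and arrival_time - self.last_msg_time < processing_time:
            self.reset(arrival_time)
        self.last_msg_time = arrival_time

    def reset(self, now):
        if self.has_reset:          # reset only once in this toy model
            return
        self.has_reset = True
        print(f"{self.name} resetting at t={now:.3f}s")
        # Announce the reset, then the all-clear, in quick succession.
        for n in self.neighbours:
            n.receive("resetting", now + 0.001)
            n.receive("all clear", now + 0.002)

# A fully meshed toy network of five long-distance switches.
switches = [Switch(f"switch-{i}") for i in range(5)]
for s in switches:
    s.neighbours = [t for t in switches if t is not s]

# One switch resets for a routine reason; the race condition does the rest.
switches[0].reset(0.0)
print(sum(s.has_reset for s in switches), "of", len(switches), "switches reset")
```

The point of the sketch is that the fault only appears when two messages arrive within the processing window, which is why such a defect can lie dormant until a routine operation triggers it network-wide.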

So what’s the answer?

As any business continuity consultant will explain, all of these threats lie outside the control of any one individual or organisation, so there is nothing you can do to prevent them from happening; what you can do is take action beforehand and build resilience into your own networks to mitigate the risks. Loss of power can be protected against by providing uninterruptible power supplies (UPS), backed up by standby generation, while not forgetting to safeguard the supply of diesel fuel for the generators.

The remaining threats and hazards can be addressed by ensuring that there is no single point of failure in the network. This generally requires the provision of redundant circuits into and between locations, dual suppliers for essential communications services and even dual locations for data centres and call centres, with software that either shares the load between the two or allows failover from one to the other.
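
At the application level, that failover logic can be as simple as trying a secondary site when the primary cannot be reached. The sketch below is a minimal illustration of the idea; the hostnames, port and timeout are hypothetical, and a production system would add health checks, retries and monitoring.

```python
# Illustrative sketch only: client-side failover between two hypothetical
# service endpoints, so that no single data centre is a single point of
# failure. Hostnames, port and timeout are assumptions for illustration.
import socket

ENDPOINTS = [
    ("primary.datacentre.example", 443),    # preferred site
    ("secondary.datacentre.example", 443),  # standby site
]

def connect_with_failover(endpoints, timeout=3.0):
    """Try each endpoint in order and return the first successful connection."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err
            print(f"Could not reach {host}:{port} ({err}); trying next endpoint")
    raise ConnectionError("No endpoint reachable") from last_error

if __name__ == "__main__":
    connection = connect_with_failover(ENDPOINTS)
    print("Connected to", connection.getpeername())
    connection.close()
```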

Many of these mitigations are quite costly to implement and will require a sound business case in order to fund them, but the important thing is to balance this against the potential cost of failure: not only the losses incurred through the organisation's inability to deliver its products or services, but also the costs of recovery, the total of which may well exceed the cost of the proactive work.
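
As a back-of-the-envelope illustration of that balance (all figures below are hypothetical), the expected annual loss is simply the estimated likelihood of a serious outage multiplied by its cost, and the business case stands when the annual cost of mitigation comes in below that figure.

```python
# Illustrative sketch only: weighing the cost of resilience measures against
# the expected cost of failure. All figures are hypothetical.
outage_probability_per_year = 0.05   # estimated chance of a serious outage in any year
loss_per_outage = 2_000_000          # lost revenue plus recovery costs (GBP)
annual_mitigation_cost = 60_000      # UPS maintenance, dual circuits, standby site

expected_annual_loss = outage_probability_per_year * loss_per_outage
print(f"Expected annual loss without mitigation: £{expected_annual_loss:,.0f}")
print(f"Annual cost of mitigation:               £{annual_mitigation_cost:,.0f}")

if annual_mitigation_cost < expected_annual_loss:
    print("The business case favours investing in resilience.")
else:
    print("The mitigation costs more than the risk it removes.")
```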

As the saying goes, failing to plan is planning to fail.