Air Traffic Control Power Fault – When routine test fails

29 May 15
pblumo

“More than 200 flights in and out of Belgium were cancelled or diverted on Wednesday 27th May after a power surge disabled the operations of Belgocontrol, the domestic air traffic controller” (Source : Reuters)

In this post we will try to go behind the scenes of what may have happened at Belgocontrol earlier this week. We will discuss Air Traffic Control, data centres, generators, UPS, Disaster Recovery, RTO and so on.

Reading time : ~10 mins

Update :

2nd June 2015 – The generator was not grounded (Source)

8th June 2015 – The generator was not grounded since its installation in 2005 (Source)

18th November 2015 – I’ve received an important update from an anonymous source.

While I usually try to source all the information these analyses are based on, I found this one credible and interesting enough to quote here (with the source’s agreement).

“- First of all, there was never a power loss in either of the four technical rooms (separated in different buildings btw).
– The power loss was only at the controller working position.
– The problem was that the power spike killed the static switches in the working position. Those switches can switch between the main and backup power supply.
– There was never a problem in the control tower and regional airports because the technical rooms were fully operational the whole time. “

In the conclusion of the initial analysis earlier this year, I raised some questions about why the whole data center went down due to this power spike.
The information above clarifies this point : the data center itself never went down. Only the air traffic controllers’ workstations were impacted.

This is also a good reminder that end-user workstations can be as critical as any other IT system.

Thanks to the anonymous contributor for the clarification.

What’s Belgocontrol ?

Belgocontrol is “an autonomous public company in charge of the safety of air traffic in the civil airspace for which the Belgian State is responsible”.
The controlled civil airspace consists of the local airspace (CTR – Control Zone) and the terminal areas (TMA – Terminal Area) of the airports of Antwerp, Brussels Airport, Charleroi, Liège and Ostend. Apart from these areas, the airspace is organised as a network of airways and specific controlled areas (CTA) (Source : Belgocontrol.be)

The scope of Belgocontrol ATC (Air Traffic Control) is limited to part of the Belgian airspace. Basically, Belgocontrol manages planes landing at and taking off from Antwerp, Brussels, Charleroi, Liège and Ostend airports, as well as local traffic flying below FL245 (Flight Level 245 = 24,500 feet = ~7,500 metres altitude).
Planes transiting above Belgium are managed by Eurocontrol, a European ATC organisation (whose goal is to manage the Single European Sky, which you may have heard of during ATC strikes in some countries…).
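For readers who want to check the FL245 figure, here is a minimal sketch of the conversion. It only assumes the usual convention that one flight level unit is 100 feet, and the exact definition of the foot (0.3048 m). The function names are mine, purely for illustration. Keep in mind that a flight level is a pressure altitude, so the real height above the ground varies with the weather.

```python
FEET_PER_FLIGHT_LEVEL = 100  # one flight level unit = 100 ft of pressure altitude
METRES_PER_FOOT = 0.3048     # exact definition of the international foot


def flight_level_to_feet(fl: int) -> int:
    """Convert a flight level (e.g. 245) to feet."""
    return fl * FEET_PER_FLIGHT_LEVEL


def flight_level_to_metres(fl: int) -> float:
    """Convert a flight level to metres."""
    return flight_level_to_feet(fl) * METRES_PER_FOOT


if __name__ == "__main__":
    fl = 245
    feet = flight_level_to_feet(fl)
    metres = flight_level_to_metres(fl)
    print(f"FL{fl} = {feet} ft = {metres:.0f} m")  # FL245 = 24500 ft = 7468 m
```

So the upper limit of the Belgocontrol airspace sits at roughly 7,500 metres.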

Contrary to popular belief, airspace is an extremely controlled area, with different organisations in charge of different geographical zones and flight levels.
You can see below an example of the various “layers” you will have to cross to land in the Brussels/Charleroi zone (Source : Mobilit.belgium.be)

West ACC = Eurocontrol West Area Control Center
EBBR APP = Brussels Airport Approach
EBCI TMA = Brussels Charleroi Terminal Manoeuvring Area
EBCI CTR = Brussels Charleroi Control Zone

Then you will be handed to the famous “control tower”, which will manage only the final parts of the flight (people managing your flight en-route are not based in this tower).

To (over)simplify, for planes that need to take off or land in Belgium, Belgocontrol is in charge.

Belgocontrol is located at Brussels airport, near the runways.

Belgocontrol and CANAC2

CANAC 2

CANAC 2 stands for “Computer Assisted National Air Traffic Control Centre”. It was inaugurated in 2010 and uses a Thales ATM (Air Traffic Management) system called Eurocat, now renamed TopSky (Sources : Thales , Thales).

Eurocat / TopSky runs on a modified version of Red Hat Enterprise Linux called Thalix.

The CANAC 2 control room looks like this.

Canac 2 room

You can see a bit more of the CANAC 2 room in this short video.

CANAC 2 Datacenter

From Belgocontrol : “In CANAC 2, there are several levels of redundancy of the systems crucial to air traffic safety: the Nominal, the Fallback and the Ultimate mode. The first is the normal operating mode. Fallback and Ultimate, respectively, are the first and second backup levels. They use independent systems which feed the work positions with radar and flight plan data and maintain vocal communications.
In addition to these backup modes, these systems are physically duplicated and installed in two separate computer rooms, far apart from each other and powered by different electrical networks.”

Those computer rooms seem to be hosted within the CANAC premises. Unfortunately, I was not able to find a picture of the datacenter(s) hosting the CANAC 2 Eurocat system.
However, some details are available on CANAC 1 (the former Belgian Air Traffic Control Centre), from Schneider.

CANAC DC

Schneider commissioned the electrical part of the former 2,000 sqm datacenter, split into 3 rooms, in addition to a 400 sqm backup room.
I don’t know if those rooms are still in use, or if Belgocontrol built new computer rooms for CANAC 2. If someone knows more, feel free to leave a comment below.

Update 8th June 2015 : according to a recent update, the generator grounding had been faulty since its installation in 2005, so it is likely that the datacenter above is the one that went down.

UPS and Generators

A UPS is an Uninterruptible Power Supply. It’s an electrical appliance used to provide power in case of power failure. There are many different types and capacities of UPS.
To (over)simplify again, a UPS is dedicated hardware connected to a set of batteries, able to provide power to computers.

UPS and batteries

The power needed to run a datacenter is very significant. In case of an outage, UPS batteries will only last a few minutes, sometimes up to 30 minutes, so a UPS alone cannot carry the load through a prolonged outage.
In modern datacenters, the UPS just provides temporary backup power, long enough for the generator to kick in.
But a UPS also has other important functions : it can correct power fluctuation problems, like voltage spikes, frequency problems and “line noise” (we will get back to this in the Generators section). Stable power (in frequency, voltage, etc.) is called “clean power”.
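To give an order of magnitude for the battery runtime mentioned above, here is a rough back-of-the-envelope sketch. It assumes a simple linear model (usable energy divided by load), and the 40 kWh / 200 kW figures and the 90 % inverter efficiency are hypothetical values chosen for illustration, not Belgocontrol data. Real batteries are not linear (ageing, temperature and discharge rate all reduce the runtime), so treat the result as an optimistic upper bound.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        inverter_efficiency: float = 0.9) -> float:
    """Estimate UPS runtime in minutes with a simple linear model.

    battery_wh: nominal energy stored in the batteries (watt-hours)
    load_w: power drawn by the IT load (watts)
    inverter_efficiency: fraction of battery energy that reaches the load
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    if not 0 < inverter_efficiency <= 1:
        raise ValueError("inverter_efficiency must be in (0, 1]")
    usable_wh = battery_wh * inverter_efficiency
    return usable_wh / load_w * 60


if __name__ == "__main__":
    # Hypothetical example: 40 kWh of batteries feeding a 200 kW load.
    minutes = ups_runtime_minutes(battery_wh=40_000, load_w=200_000)
    print(f"Estimated runtime: {minutes:.1f} min")
```

With these made-up numbers the batteries last about ten minutes, which is consistent with the idea that the UPS only buys time for the generator.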

To do all this (backup power + fluctuation correction), a UPS needs to be installed “in-line” : between the equipment to power and the power source. When the normal power source is running and stable, the UPS conditions the power and delivers it to the equipment. As soon as a problem is detected, the UPS takes action : it corrects a wrong voltage or frequency, or switches to batteries if the main power source fails.
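The decision logic described above can be sketched as a tiny function. This is only an illustration of the principle : the 230 V / 50 Hz nominal values are the usual European mains figures, but the tolerance thresholds, the mode names and the `select_mode` function are assumptions of mine and are not taken from any real UPS product.

```python
from dataclasses import dataclass
from enum import Enum


class UpsMode(Enum):
    NORMAL = "normal"          # input is clean, power is simply delivered
    CORRECTING = "correcting"  # input is present but out of tolerance
    ON_BATTERY = "on_battery"  # input is lost or unusable, run on batteries


@dataclass(frozen=True)
class InputReading:
    voltage_v: float
    frequency_hz: float


# Hypothetical thresholds, for illustration only.
NOMINAL_V = 230.0
NOMINAL_HZ = 50.0
V_TOLERANCE = 0.10   # up to +/-10 % deviation is considered clean
HZ_TOLERANCE = 0.02  # up to +/-2 % deviation is considered clean
V_DROPOUT = 0.30     # beyond +/-30 % the input is treated as failed
HZ_DROPOUT = 0.10    # beyond +/-10 % the input is treated as failed


def select_mode(reading: InputReading) -> UpsMode:
    """Pick the UPS operating mode from one measurement of the input."""
    v_dev = abs(reading.voltage_v - NOMINAL_V) / NOMINAL_V
    hz_dev = abs(reading.frequency_hz - NOMINAL_HZ) / NOMINAL_HZ
    if v_dev > V_DROPOUT or hz_dev > HZ_DROPOUT:
        return UpsMode.ON_BATTERY
    if v_dev > V_TOLERANCE or hz_dev > HZ_TOLERANCE:
        return UpsMode.CORRECTING
    return UpsMode.NORMAL


if __name__ == "__main__":
    for reading in (InputReading(230, 50), InputReading(260, 50), InputReading(0, 0)):
        print(reading, "->", select_mode(reading).value)
```

A real UPS obviously works on continuous waveforms rather than single readings, but the three outcomes are the same : deliver, correct, or switch to batteries.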

A generator is (usually) a diesel or gas-powered engine producing power, installed in or near the datacenter it protects.

Generator

Update 08th June 2015 : the media were invited to visit the datacenter; video footage shows 2 generators and the electrical panels.

BelgoControl generators

A generator, due to its physical characteristics, will provide “dirty power”, with fluctuations in frequency, electrical noise and so on.

Dirty Power

While it’s not an issue when you are using a small generator to power a light bulb or two during a camping trip, it can be catastrophic for IT equipment.
Fortunately, in a datacenter, the UPS, connected in front of the IT equipment, will correct this and protect the IT load by providing “clean power” to the equipment.

Clean Power

What can go wrong ?

Like any equipment, UPS and generators need to be regularly maintained, monitored and tested.

It is common (and a best practice) to regularly start the generator to ensure it works properly. If you have ever seen a strange plume of black smoke in a busy business district (like the La Défense area in Paris), there is a good chance it was a genset (generator set) test.

Different genset tests can be performed. You can decide whether or not to fail over your regular power source to the generator.
Obviously, failing over to the generator is riskier, but it’s also a more complete test. I don’t know which type of test was scheduled at Belgocontrol.

To ensure complete power redundancy in a datacenter, you can have two (or more) generators, powering two fully independent power lines to two separate sets of UPS, each one powering one side of the IT rack and the (dual-powered) servers.
Of course, you have to ensure that your servers are dual-powered and that each power supply is connected to a different power line (trust me, mistakes happen).
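This kind of cabling mistake is easy to catch with a simple audit if you keep an inventory of which power line each power supply is plugged into. Below is a minimal sketch of such a check. The inventory format (server name mapped to a list of feed labels), the server names and the function name are all hypothetical and only serve the example.

```python
from typing import Dict, List


def find_power_wiring_mistakes(servers: Dict[str, List[str]]) -> Dict[str, str]:
    """Return the servers that are not fed by two distinct power lines.

    servers maps a server name to the feed label (e.g. "A" or "B")
    of each of its power supplies.
    """
    problems: Dict[str, str] = {}
    for name, feeds in servers.items():
        if len(feeds) < 2:
            problems[name] = "single power supply"
        elif len(set(feeds)) < 2:
            problems[name] = f"all power supplies on feed {feeds[0]}"
    return problems


if __name__ == "__main__":
    # Hypothetical inventory for illustration.
    inventory = {
        "srv-01": ["A", "B"],  # correct: one supply on each line
        "srv-02": ["A", "A"],  # dual-powered, but both on the same line
        "srv-03": ["B"],       # only one power supply
    }
    for server, issue in find_power_wiring_mistakes(inventory).items():
        print(f"{server}: {issue}")
```

In this example `srv-02` looks redundant in the rack but would go down with line A, which is exactly the kind of surprise you only discover during an outage.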

Anyway, duplicating everything is costly, and organisations which are not datacenter companies (like Equinix and co) will usually have only one generator, but two power lines to the racks.

It’s not clear which kind of setup Belgocontrol has. However, the media have reported the following statement:

“Dominique Dehaene of the Belgocontrol agency said that a sudden power surge had taken out the main air traffic control system and also blew the switches to the emergency generators. “We were twice unlucky,” he said” (Source : AP)

Update 08th June 2015 : there are two generators at BelgoControl, as they were recently shown on TV.

Personally, I think the term power surge should probably be replaced by power spike. A power spike is a very short high-voltage burst (a few ms), while a power surge is a (usually lower) voltage increase over a slightly longer period (seconds).

A spike is common when starting equipment (electrical or not); this start-up phase is called the transient state.

Let’s draw up some hypotheses.

Note : there are two generators at BelgoControl, as they were recently shown on TV, but this doesn’t change the overall principle.

Hypothesis 1

For this one, we will assume :

– There is only one generator at Belgocontrol CANAC 2 Production Datacenter
– The generator test was not an end-to-end test (i.e. the power to the racks was NOT failed over to the generator).
– We have separate power lines and UPS to the racks

schema 1

The generator start-up may have produced a power spike violent enough to blow out control panel #1.
UPS #1 would have done two things : protected the IT racks from the spike (if the control panel had not already done so by blowing like a fuse !), and provided electrical power to Circuit A in the racks.
Even with UPS #1 blown or faulty, Circuit B (still up as per our assumptions) plus UPS #2 should have been enough to maintain power to the IT equipment (provided the racks and servers were dual-powered).

Hypothesis 2

For this one, we will assume:

– There is only one generator at Belgocontrol CANAC 2 Production Datacenter
– The generator test was not an end-to-end test (i.e. the power to the racks was NOT failed over to the generator).
– For some reason, we don’t have separate end-to-end power lines to the racks.

schema 2

In that case, if Control Panel #1 didn’t stop the spike, UPS #1 and UPS #2 may both have been damaged…

Or maybe there was only one UPS for both lines, and this UPS died during the spike ? Or maybe the racks were not really dual-powered…?
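The difference between the two hypotheses can be made concrete by modelling the power distribution as a small graph and checking whether the rack still receives power once some components are destroyed. The sketch below is only my simplified reading of the two setups : the component names, the two topologies and the helper functions are assumptions for illustration, and the generator is left out since in both hypotheses the racks were not failed over to it.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

Topology = Dict[str, List[str]]  # component -> components it feeds


def is_powered(topology: Topology, sources: Iterable[str], target: str,
               failed: Set[str]) -> bool:
    """True if power can still flow from any source to the target."""
    queue = deque(s for s in sources if s not in failed)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in topology.get(node, []):
            if nxt not in failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(topology: Topology, sources: List[str],
                             target: str) -> List[str]:
    """Components (other than sources and target) whose loss alone cuts power."""
    components = set(topology) | {n for outs in topology.values() for n in outs}
    candidates = components - set(sources) - {target}
    return sorted(c for c in candidates
                  if not is_powered(topology, sources, target, {c}))


# Hypothesis 1 (assumed): two fully separate lines to the rack.
SEPARATE_LINES: Topology = {
    "grid": ["panel_1", "panel_2"],
    "panel_1": ["ups_1"],
    "panel_2": ["ups_2"],
    "ups_1": ["rack"],
    "ups_2": ["rack"],
}

# Hypothesis 2 (assumed): both UPS sit behind the same control panel.
SHARED_PANEL: Topology = {
    "grid": ["panel_1"],
    "panel_1": ["ups_1", "ups_2"],
    "ups_1": ["rack"],
    "ups_2": ["rack"],
}

if __name__ == "__main__":
    print(single_points_of_failure(SEPARATE_LINES, ["grid"], "rack"))  # []
    print(single_points_of_failure(SHARED_PANEL, ["grid"], "rack"))    # ['panel_1']
```

With separate lines, losing control panel #1 and UPS #1 together still leaves the rack powered through the second line. With a shared panel, that single panel is enough to take everything down.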

Update #1 : 2nd June 2015, the generator was not properly grounded (Source).

This lack of proper grounding prevented the power spike / surge from being safely discharged, blowing up one or more static switches (referred to as control panels in the diagrams above).

Update #2 : 08th June 2015, the generator was not properly grounded – since its installation in 2005 (Source)

An electrical fault occurred in one of the generator cooling sub-systems (Source) during the monthly test. Due to the lack of grounding on the generator, the current had no other path but through the electrical panels, destroying some of them.

It is still unclear why the whole datacenter went down as, from the latest reports, only one generator / power line was impacted…

Update #3 : 18th November 2015 – I’ve received an important update from an anonymous source. While I usually try to source all the information these analyses are based on, I found this one credible and interesting enough to quote here (with the source’s agreement).

“- First of all, there was never a power loss in either of the four technical rooms (separated in different buildings btw).
– The power loss was only at the controller working position.
– The problem was that the power spike killed the static switches in the working position. Those switches can switch between the main and backup power supply.
– There was never a problem in the control tower and regional airports because the technical rooms were fully operational the whole time. “

In the conclusion of the initial analysis, I raised some questions about why the whole data center went down due to this power spike.
The information above clarifies this point : the data center itself never went down. Only the air traffic controllers’ workstations were impacted.

This is also a good reminder that end-user workstations can be as critical as any other IT system.

Thanks to the anonymous contributor for the clarification.

Anyway, disasters happen, whatever level of redundancy you may have. Which leads us to another important question :

What about Disaster Recovery ?

Above, we have seen that a separate 400 sqm room existed as a Disaster Recovery room for the former CANAC (1) systems.
Is this still in use ? Is there a new one ? Is it located in the same building, powered by the same generator or UPS ?

Update #3 : Based on the source quoted above, the IT rooms themselves were not impacted, and are redundant between buildings.

Even with a separate IT room, even in a separate building and with a dedicated power line : is there a user disaster recovery area where Air Traffic Controllers can sit and connect to the Disaster Recovery systems ?

From the media statements, most of the systems were restarted after 5 hours.
A 4-hour RTO (Recovery Time Objective) is commonly used in corporations (including banks) for a major disaster (and a general power failure is a major disaster).
But the general public (like corporate users) is becoming more and more impatient… More and more systems are active-active, but active-active comes with its own challenges, and doesn’t mean it will be foolproof either.
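To put the two figures above side by side, here is a trivial sketch comparing an outage duration with an RTO. The 5 hours come from the media statements, and the 4 hours are the commonly used corporate value mentioned above, not Belgocontrol’s actual objective (which I don’t know). The function name is mine.

```python
from datetime import timedelta


def rto_overrun(outage: timedelta, rto: timedelta) -> timedelta:
    """Return by how much the outage exceeded the RTO (zero if within it)."""
    return max(outage - rto, timedelta(0))


if __name__ == "__main__":
    outage = timedelta(hours=5)  # restart time reported by the media
    rto = timedelta(hours=4)     # assumed RTO, typical corporate value
    print(f"Overrun: {rto_overrun(outage, rto)}")  # Overrun: 1:00:00
```

Measured against a typical 4-hour objective, a 5-hour recovery would be a one-hour overrun, which for an air traffic control centre already means a day of cancelled flights.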

Some vendors claim to provide full redundancy for their systems, but what happens to the interconnection of redundant systems ? I’ve seen buildings with two separate network trunks, built at opposite corners of the construction, but landing in the same MDF (Main Distribution Frame) in the basement.

As a conclusion :
– Critical systems can fail, and they will.
– Don’t assume you have end-to-end redundancy.
– “Expect the best, plan for the worst, and prepare to be surprised.”

I don’t know the agreed RTO for an ATC like Belgocontrol, but as it is a regulated organisation, there is a chance a detailed report will be made public (unlike for the TV5 Monde incident).

I will update the post if more details are published.

Update 02/06/2015 : generator not grounded

Update 08/06/2015 : generator not grounded since 2005 + electrical failure in one generator cooling subsystem

Update 18/11/2015 : only the Air Traffic Controllers’ workstations were impacted by the surge (static switches were damaged). The data center(s) themselves were not impacted.

Pierre-Olivier Blu-Mocaer
FixSing Consulting

po@fixsing.com
https://twitter.com/pblumo