Would You Fly a Plane with One Engine? Or Run Your Airline with One Data Center(re)?

For those of you who may have been in the US or elsewhere outside of Europe this past weekend, you may not have heard about the major British Airways IT outage that took down their entire operation for most of Saturday and into Sunday. Rumors, which were later confirmed, were that a switch from primary to backup power at their primary data centre (they’re a UK company, so I’ll spell it in the Queen’s English) led to a complete operations failure. I have a bit of inside information, since my darling wife was stuck inside Terminal 5 at Heathrow.


There’s a requirement for planes that travel across oceans called ETOPS, which stands for Extended Range Operation with Two-Engine Airplanes but is known in parlance as Engines Turn or Passengers Swim. These rules ensure that if a plane has a problem over a body of water, it can make it back to shore for a safe landing. As someone who flies across oceans a decent amount, I am very happy the regulatory bodies have these rules in place.

However, there are no such rules for data centers that run airline operations. In fact, in January, Delta Air Lines had a major failure which took down most of its operations for a couple of days. Most IT experts have surmised that Delta was running a single data center for its operations. Based on the evidence from Saturday’s incident with BA, I have to assume that they are as well. One key bit of evidence was that BA employees were unable to access email. They are an Office 365 customer, so theoretically, even if on-premises systems were down, e-mail should work. However, if they were using Active Directory Federation Services (ADFS), so that all of their passwords were stored on-prem, then the data center being down would mean they couldn’t authenticate, and therefore would not have email.

My biggest clue that BA was running with a single data center was that email didn’t work. While some systems, particularly some of the mainframe systems that may handle flight operations, tend not to handle failover across sites well, Active Directory is one of the best distributed systems there is, and is extremely resilient to failures. In fact, given BA’s global business, I’m really surprised they didn’t have ADFS servers in locations around the world.

Enter the Cloud

Denny and I sat talking yesterday, running some numbers on what we thought a second data center would cost a company like BA. Our rough estimate (and this is very rough) was around $30-40 million USD. While that is a ton of money, it is estimated that this weekend’s mess may cost BA up to £150 million (~$192 million USD). However, companies no longer have to build multiple data centers in order to have redundancy, as Microsoft (and Amazon, and Google) have data centers throughout the world. The cloud gives you the flexibility to protect critical systems, and at a much lower cost. I’ve designed DR strategies for small firms that cost under $100/month, and I’ve had real-time failover that supported 99.99% uptime. With the resources of a firm like BA, this should be a no-brainer given the risk profile.
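
To put those figures in perspective, here is a quick back-of-the-envelope sketch in Python. The function name and constants are my own for illustration, and the dollar amounts are just the rough estimates quoted above, not audited numbers. It shows how little downtime a 99.99% target actually leaves you per year, and how the estimated outage cost stacks up against our guess at the price of a second data center.

```python
# Back-of-the-envelope numbers only; names here are illustrative, not from any tool.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def cost_ratio(outage_cost: float, mitigation_cost: float) -> float:
    """How many times over the outage cost would have paid for the mitigation."""
    return outage_cost / mitigation_cost


# Rough figures from the post, in millions of USD (assumptions, not audited data).
OUTAGE_COST_MUSD = 192
SECOND_SITE_LOW_MUSD = 30
SECOND_SITE_HIGH_MUSD = 40

if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime leaves about {budget:.1f} minutes of downtime per year")

    low = cost_ratio(OUTAGE_COST_MUSD, SECOND_SITE_HIGH_MUSD)
    high = cost_ratio(OUTAGE_COST_MUSD, SECOND_SITE_LOW_MUSD)
    print(f"Estimated outage cost is roughly {low:.1f}x to {high:.1f}x the second-site estimate")
```

A 99.99% target works out to under an hour of downtime per year, which is why a weekend-long outage is so far outside what a properly designed failover setup should allow.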

What About Outsourcing?

Much has been made of the fact that BA has outsourced many of its IT functions to TCS and various other providers. Some have even tried to place blame on the providers for this outage. Frankly, I don’t have enough detail to blame anyone, and it seems more like the data center operator’s issue. However, I do think it speaks to the lack of attention and resources paid to technology at a company that clearly depends on it heavily. Computers and data are more important to business now than ever, and if your firm doesn’t value that, you are going to have problems down the road.

Conclusions

In the cloud era, I’m convinced no business, no matter how big or small, should run with a single data center. It is way too cheap and easy to ship your backups to multiple sites, and be online in a matter of hours with a cloud provider. Given the importance and consolidation of airlines in our world economy, it probably wouldn’t be a terrible idea if their regulators created regulations requiring failover and failover testing. Don’t let this happen to your stock price.


Monitoring Availability Groups: New Tools from SolarWinds

As I mentioned in my post a couple of weeks ago, monitoring the plan cache on a readable secondary replica can be a challenge. My customer was seeing dramatically different performance, depending on whether a node was primary or secondary. As amazing as the Query Store in SQL Server 2016 is, it does not allow you to view statistics from the readable secondary. So that leaves you writing XQuery to mine the plan cache DMVs for the query information you are trying to identify.

My friends at SolarWinds (Lawyers: see disclaimer at bottom of post) introduced version 11.0 of Database Performance Analyzer (DPA, a product you may remember as Ignite), which has full support for Availability Group monitoring. As you can see in the screenshot below, DPA gives a nice overview of the status of your AG, and also lets you dig into the performance on each node.

[Screenshot: DPA overview of Availability Group status]

There are a host of other features in their new releases, including some new hybrid features in their flagship product, Orion, which you can check out. Amongst these features, a couple jumped out at me: there is now support for Amazon RDS and Azure SQL Database in DPA, and there is some really cool correlation data that will let you compare performance across your infrastructure. So, when you, the DBA, are arguing with the SAN, network, and VM teams about the root cause of a performance problem, this tool can quickly isolate it. With less fighting. These are great products; give them a look.

Disclaimer: I was not paid for this post, but I do paid work for SolarWinds on a regular basis.

Why Are You Still Running Your Own Email Server?

One of the things I tell customers when doing any sort of architectural consulting is to identify their most important business systems. Invariably, something that gets left off of that list is email. Your email is your most critical system. ERP may run your profit centers, but email keeps the business moving.

With that in mind, and given all the security risks that exist in the world (see: Russian hacking scandal, other email leaks of the week) it doesn’t make a lot of sense for most organizations to run their own Exchange environments when Microsoft is really good at it.

I had a discussion with an attorney at a company in a heavily regulated industry recently. The attorney mentioned that after investigating, she determined that the company didn’t have journaling turned on for their Exchange servers. (For you DBAs, journaling is effectively full recovery mode for Exchange. It’s more complicated than that, but that is a nice analogy.) Given that we are Office 365 customers, I wanted to check the difficulty of enabling this in our environment. I found out that full e-discovery capabilities that integrate with e-discovery systems are as easy as one click of a mouse (and a credit card to make sure you are on the right service level).

Another great security feature that used to be really painful to integrate with email login is multi-factor authentication. Once again, this requires a mouse click or two, and your credit card. You can even quickly do things like whitelisting your office’s IP address so that your users don’t have to use MFA when in the office.

These features are great, but they don’t even cover all the threat protection that Microsoft has built into Office 365 and Azure. You can read about that here, but Microsoft can even protect you from threats like spear phishing. (Hi Vlad!) Just like encryption. Don’t be a news story; just be secure.
