Would You Fly a Plane with One Engine? Or Run Your Airline with One Data Center(re)?

For those of you who may have been in the US or outside of Europe this past weekend, you may not have heard about the major British Airways IT outage that took down their entire operation for most of Saturday and into Sunday. Rumors, which were later confirmed, were that a switch from primary to backup power at their primary data centre (they’re a UK company, so I’ll spell it in the Queen’s English) led to a complete operations failure. I have a bit of inside information, since my darling wife was stuck inside Terminal 5 at Heathrow.


There’s a requirement for planes that travel across oceans called ETOPS, which stands for Extended Range Operation with Two-Engine Airplanes, but which in common parlance is known as Engines Turn or Passengers Swim. These rules ensure that if a plane has a problem over a body of water, it can make it back to shore for a safe landing. As someone who flies across oceans a decent amount, I am very happy the regulatory bodies have these rules in place.

However, there are no such rules for data centers that run airline operations. In fact, in January, Delta Air Lines had a major failure which took down most of its operations for a couple of days. Most IT experts have surmised that Delta was running a single data center for its operations. Based on the evidence from Saturday’s incident with BA, I have to assume that they are as well. One key bit of evidence was that BA employees were unable to access email. BA is an Office 365 customer, so theoretically, even if on-premises systems were down, email should still work. However, if they were using Active Directory Federation Services, so that all of their passwords were stored on-premises, then the data center being down would mean they couldn’t authenticate, and therefore would not have email.

My biggest clue that BA was running with a single data center was that email didn’t work. While some systems, particularly some of the mainframe systems that may handle flight operations, have a tendency to not fail over well across sites, Active Directory is one of the best distributed systems there is, and it is extremely resilient to failures. In fact, given BA’s global business, I’m really surprised they didn’t have ADFS servers in locations around the world.

Enter the Cloud

Denny and I sat talking yesterday, running some numbers on what we thought a second data center would cost a company like BA. Our rough estimate (and this is very rough) was around $30-40 million USD. While that is a ton of money, it is estimated that the weekend’s mess may cost BA up to £150 million (~$192MM USD). However, companies no longer have to build multiple data centers in order to have redundancy, as Microsoft (and Amazon, and Google) have data centers throughout the world. The cloud gives you the flexibility to protect critical systems at a much lower cost. I’ve designed DR strategies for small firms that cost under $100/month, and I’ve built real-time failover that supported 99.99% uptime. With the resources of a firm like BA, this should be a no-brainer given the risk profile.
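To give a flavor of how low the entry point is, here’s a minimal sketch of shipping a SQL Server backup straight to a geo-redundant Azure storage container with the SqlServer PowerShell module. The instance name, database name, storage URL, and credential name are all hypothetical placeholders, and you’d want restore testing and scheduling wrapped around it, but the core of a cheap off-site DR strategy really is about this small:

```powershell
# Sketch: off-site backup to a geo-redundant Azure storage container.
# Assumes the SqlServer module and an existing SQL Server credential
# ("AzureBackupCred") holding the storage account name and access key.
Import-Module SqlServer

Backup-SqlDatabase -ServerInstance "SQLPROD01" `
    -Database "FlightOps" `
    -BackupContainer "https://examplestorage.blob.core.windows.net/sqlbackups" `
    -SqlCredential "AzureBackupCred" `
    -CompressionOption On
```

Restore the most recent backup onto a cloud VM (or keep one warm) and you have a second “data centre” for pennies compared to building one.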

What About Outsourcing?

Much has been made of the fact that BA has outsourced many of its IT functions to TCS and various other providers. Some have even tried to place blame on those providers for this outage. Frankly, I don’t have enough detail to blame anyone, and it seems more like an issue for the data center operator. However, I do think it speaks to the lack of attention and resources paid to technology at a company that clearly depends on it heavily. Computers and data are more important to business now than ever, and if your firm doesn’t value that, you are going to have problems down the road.

Conclusions

In the cloud era, I’m convinced that no business, no matter how big or small, should run with a single data center. It is far too cheap and easy to ship your backups to multiple sites and be back online in a matter of hours with a cloud provider. Given the importance and consolidation of airlines in our world economy, it probably wouldn’t be a terrible idea for their regulators to require failover and failover testing. Don’t let this happen to your stock price.


Is My Static IP Address in Windows Azure Really Set?

I’ve been working with Windows Azure VMs since they became available last year, and I’ve built out some pretty complex scenarios with them (hybrid clusters using AlwaysOn Availability Groups, for one). One of the early limitations was that all of the VMs had dynamic (DHCP) IP addresses—there were some workarounds, but for database servers and domain controllers this wasn’t the best option. Starting early in 2014, a new PowerShell command called “Set-AzureStaticVNetIP” appeared on GitHub and slowly made its way into the public domain. There’s a great article on how to configure this using Windows Azure PowerShell (which you will need to install after creating your VMs) at Windows IT Pro.

Note: This can only be done on Azure VMs in a virtual network
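For reference, the whole operation boils down to a short pipeline with the classic Windows Azure PowerShell module; the cloud service name, VM name, and address below are placeholders for my environment, so treat this as a sketch of the pattern rather than a copy/paste recipe:

```powershell
# Sketch: reserve a static VNet IP for an existing Azure (classic) VM.
# The VM must already live inside a virtual network (see note above).
Get-AzureVM -ServiceName "MyCloudService" -Name "DC01" |
    Set-AzureStaticVNetIP -IPAddress "10.0.1.4" |
    Update-AzureVM
```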

I’m in the middle of creating some VMs for some work that I am doing, and I went through the above process to assign a static IP to my domain controller in Azure. Pro tip—don’t set the IP address from within an RDP session to that machine. You will get kicked out (hangs head in shame). Also, make note that your machine may reboot—it’s not mentioned in the article, and I’m not 100% sure if it was related to my being in the RDP session, but be forewarned.

As I was promoting it to a domain controller, I noticed that Windows still thought it had a dynamic IP address, which I thought was odd.

Figure 1 Server Manager Showing Dynamic IP

From there I checked the IPv4 properties (note: this server has been promoted to a domain controller, which is why it is using localhost (127.0.0.1) and the other domain controller for DNS).

Figure 2 IPv4 Properties of AzureVM with Static IP

Of course, the proof was in the pudding—I had rebooted this VM several times and it was still keeping the same IP address (10.0.1.4). In the traditional dynamic IP address model, each DHCP call would increment the address by 1—so by my third reboot I would expect to see 10.0.1.7. So I went to PowerShell to check, using the command “Get-AzureStaticVNetIP”:
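The check itself is just the companion cmdlet piped off the VM object (again, the service and VM names below are my lab placeholders):

```powershell
# Sketch: confirm the static IP reservation on the Azure side.
Get-AzureVM -ServiceName "MyCloudService" -Name "DC01" |
    Get-AzureStaticVNetIP
```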

 

Figure 3 Static IP Address Assigned to VM.

So even though in most places it doesn’t look like your VM has a static IP address, it has been reserved on the Azure side. I think this is likely being done at the hypervisor level somehow, and hasn’t been exposed to Windows yet, but that’s just speculation on my part.

 

 

The SQL Virtualization Tax?

I’ve been working in virtual server environments for a long time, and I’m a big proponent of virtualization. It’s a great way to reduce hardware costs and power consumption, and frankly, for smaller shops it’s also an easy foray into high availability. The main drivers of that high availability are technologies like VMware’s vMotion and Microsoft’s Hyper-V Live Migration—if a physical server in a virtualization farm fails, the underlying virtual servers get moved to other hardware without any downtime. This is awesome, and one of the best features of a virtual environment. What I don’t like is when software vendors feel they are getting the raw end of the deal with virtualization, so they develop asinine licensing policies around it.

Oracle is my favorite whipping boy in this discussion—Oracle is most typically licensed by the CPU core. In my opinion, a CPU core should mean a core that the operating system can address. Oracle agrees with me, but only in the case of hard partitions (mostly old, expensive UNIX hardware that they happen to sell). Basically, if I have a cluster of 64 physical nodes and I have one virtual machine with one virtual CPU, Oracle expects me to license EVERY CORE in that cluster. One way around this is to physically lock down your virtual machine to a given hardware pool and then license all of those cores (a smaller number, of course). The other option is to dedicate a bunch of hardware to Oracle and virtualize on that—while this works, it definitely takes away a lot of the flexibility of virtualization, and it is a non-starter for many larger IT organizations.

Microsoft, on the other hand, has generally been pretty fair in its virtualization licensing policies. An Enterprise license for Windows Server bought you four VM licenses, and SQL Server (before 2008 R2) had some very favorable VM licensing. However, starting with SQL Server 2012 things got a bit murkier—for Enterprise Edition, we have to buy a minimum of 4 core licenses, even if we are only running 1 or 2 virtual CPUs. However, we don’t have to license every core in the VM farm. One thing that caught my eye with the SQL Server 2012 licensing is that if you license all of the physical cores in a VM farm, you can run an unlimited number of VMs running SQL Server, but only if you purchase Software Assurance. Software Assurance costs 29% of license costs, and it is a recurring annual cost. In the past, Software Assurance was generally only related to the right to upgrade the version of your software (e.g. if you had SA, you could upgrade from SQL 2008 R2 to SQL 2012). This rule bothered me, but it didn’t really affect me, so I ignored it.

I was talking to Tim Radney (b|t) yesterday, and he mentioned that in order to do vMotion/Live Migration (key features of virtualization), Software Assurance was required. I hadn’t heard this before, but sure enough, it is mentioned in this document from Microsoft:

So, in a nutshell, if you want to run SQL Server in a virtual environment and take advantage of the features that you paid for, you have to pay Microsoft an additional 29% per license of SQL Server. I think this stinks—please share your thoughts in the comments.
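To put a very rough number on that 29%, here’s a back-of-the-napkin calculation; the core count and the per-core Enterprise Edition list price are my assumptions, not a quote:

```powershell
# Back-of-the-napkin Software Assurance cost for one hypothetical host.
$coresPerHost = 16        # assumed host size
$pricePerCore = 7000      # assumed SQL Server Enterprise Edition list price (USD)
$saRate       = 0.29      # Software Assurance rate cited above

$licenseCost = $coresPerHost * $pricePerCore      # $112,000 one-time
$saPerYear   = $licenseCost * $saRate             # ~$32,500, recurring every year

"License: {0:N0}  SA per year: {1:N0}" -f $licenseCost, $saPerYear
```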

Vendors, Again—8 Things To Do When Delivering a Technical Sales Presentation

In the last two days, I’ve sat through some of the most horrific sales presentations I’ve ever seen—worse than the time share in Florida. If you happen to be a vendor and are reading this (especially if you are a database vendor—don’t worry, it wasn’t you), I hope this helps you craft better sales messages. In one of these presentations, the vendor had a really compelling product that I still have interest in, but I was really put off by bad sales form.

I’ll be honest, I’ve never been in sales. I’ve thought about it a couple of times, and I would still consider it if the right opportunity came along. But I present a lot, and most of these things apply to technical presentations as well as sales presentations. So here goes.

The top 8 things to do when delivering a sales presentation:

  1. Arrive Early—ask the meeting host to book your room a half hour early and let you in. This way you can get your connectivity going and everything set up before the meeting actually starts, rather than wasting the attendees’ valuable time and, more importantly, cutting into your time to deliver your sales message. Starting on time also allows you to respect your attendees’ schedules on the back end of the presentation.
  2. Bring Your Own Connectivity—if you need to connect to the internet (and if you have remote attendees, you do), bring your own connectivity. Mobile hotspots are widely available, and if you are in sales you are out of the office most of the time anyway, so consider it a good investment.
  3. Understand Your Presentation Technology—please understand how to start a WebEx and share your presentation. If you have a Mac, bring any adapters you need to connect to video. If you want to use PowerPoint presenter mode (a great feature, by the way), make sure the audience sees your slides and not the presenter view. Not being able to do this is completely inexcusable.
  4. Understand Who Your Audience Is—if you are presenting to very senior infrastructure architects at a large firm, you probably don’t need to explain why solid state drives are faster than spinning disks. Craft your message to your intended audience, especially if it has the potential to be a big account. Also, if you know you are going to have remote attendees, don’t plan on whiteboarding anything unless you have an electronic means to do so; otherwise you are alienating half of your audience.
  5. Don’t Tell Me Who Your Customers Are—I really don’t care that 10 Wall St banks use your software/hardware/widget. I think vendors all get that same slide from somewhere. Here’s a dirty little secret—large companies have so many divisions/partners/filing cabinets that we probably do own 90% of all available software products. It could be in one branch office that some manager paid for, but yeah technically we own it.
  6. I Don’t Care Who You Worked For—While I know it may have been a big decision to leave MegaCoolTechCorp for SmallCrappyStorageVendor, Inc., I don’t really care that you worked for MegaCoolTechCorp. If you mention it once, I can deal with it, but if you keep dropping the name it starts to get annoying and distracting.
  7. Get on Message Quickly—don’t waste a bunch of time on marketing material, especially when you go back to point #4—knowing your audience. If you are presenting to a bunch of engineers, they want to know about the guts of your product, not what your company’s earnings were. Like I mentioned above, one of the vendors I’ve seen recently has a really cool product, which I’m still interested in, but they didn’t start telling me about their product differentiation until 48 minutes into a 60-minute presentation.
  8. Complex Technical Concepts Need Pictures—this is a big thing with me. I do a lot of high availability and disaster recovery presentations—I take real pride in crafting nice PowerPoint graphics that take a complex concept like clustering and simplify it so I can show how it works to anyone. Today’s vendor was explaining their technology, and I was pretty familiar with the technology stack, yet I got really lost because there were no diagrams to follow. Good pictures make complex technical concepts easy to understand.

I hope some vendors read this and learn something. A lot of vendors have pretty compelling products but fail to deliver the sales message, and that is costing them money. I don’t mind listening to a sales presentation, even from a vendor I may not buy anything from, but I really do hate sitting through a lousy presentation that distracts me from the product.

Cluster Aware Updating and AlwaysOn Availability Groups

One of the features I was most looking forward to in Windows Server 2012 was Cluster Aware Updating. My company has a lot of Windows servers, and therefore a lot of clusters. When a big vulnerability happens and they all need to be rebooted, we use System Center Configuration Manager to handle the reboots automatically. Unfortunately, clusters must maintain quorum to stay running, so rebooting them has generally been a manual process.

However, with Windows Server 2012 we have a new feature called Cluster Aware Updating that is smart enough to handle this for us. It allows us to define a cluster for patching, so we can tell our automated tools to update and reboot the cluster, or we can even just update and reboot manually. This seems like a big win—it was hard to test in the earlier releases of Windows Server 2012, as updates weren’t yet available. So my question was how it would work with SQL Server. My first test (I’ll follow up with testing a SQL Server Failover Cluster Instance) was with my demo AlwaysOn Availability Groups environment.

The environment was as follows:

  • One Domain Controller (controlling the updates as well)
  • Two SQL Server 2012 SP1 nodes
  • No Shared Storage
  • File Share and Node Majority Quorum Model (File Share was on DC)
  • Updates downloaded from Windows Update Internet service

I ran into some early issues when I ran out of C: drive space on one of my SQL VMs; it was less than intuitive that the lack of storage was the problem, but I was able to figure it out and work through it. So I started attempt #2. The process for how Cluster Aware Updating works is as follows:

  • Scans both nodes looking for required updates
  • Chooses node to begin updates on (in my case it was the node that wasn’t the primary for my AG—not sure if that’s intentional)
  • Puts node into maintenance mode, pausing the node in the cluster
  • Applies Updates
  • Reboots
  • Verifies that no additional updates are required
  • Takes node out of maintenance mode.
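Kicking that whole sequence off on demand is a single cmdlet from the ClusterAwareUpdating module; the cluster name below is from my lab, and the retry/failure limits are just reasonable-looking values rather than a recommendation:

```powershell
# Sketch: run Cluster Aware Updating on demand against a lab cluster.
Import-Module ClusterAwareUpdating

Invoke-CauRun -ClusterName "SQLCLUSTER" `
    -CauPluginName "Microsoft.WindowsUpdatePlugin" `
    -MaxFailedNodes 1 `
    -MaxRetriesPerNode 3 `
    -RequireAllNodesOnline `
    -Force
```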

All was well when my node SQLCluster2 went through this process. But when SQLCluster1 went into maintenance mode, this happened:

When I logged into SQL Server on SQLCluster2 to check the Availability Group dashboard, I found this.

The Availability Group was in a resolving state. Mind you, the cluster still had quorum and was running. I couldn’t connect to the databases that are members of the AG, and while I could connect to the listener, the databases were again inaccessible. The only option to bring these DBs online immediately is to perform a manual forced failover to the other node, which may involve data loss. After the updating is completed, the services do resolve themselves.

I was hoping Cluster Aware Updating would work a little more seamlessly than that. As far as I can tell, to avoid an outage I will need either manual intervention or some intelligent scripting to fail my AGs over ahead of time. Hopefully this will get resolved in forthcoming SPs and/or CUs.

**Update–Kendal Van Dyke (b|t) messaged me and proposed that changing the failover and failback settings for the cluster (the number of failures that are allowed in a given time period) could resolve the issue. Unfortunately, I saw the same behavior described above.
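If you do want to script the failover yourself before patching, a planned (non-forced) failover with the SQL Server 2012 PowerShell provider looks roughly like this; the instance path and AG name are placeholders from my lab, and the path must point at the replica you want to become the new primary:

```powershell
# Sketch: manually fail the AG over to SQLCluster2 before Cluster Aware
# Updating pauses the node currently hosting the primary (SQLCluster1).
Import-Module SQLPS -DisableNameChecking

Switch-SqlAvailabilityGroup `
    -Path "SQLSERVER:\SQL\SQLCluster2\DEFAULT\AvailabilityGroups\DemoAG"

# Adding -AllowDataLoss would make this a forced failover instead, which is
# the data-loss scenario described above.
```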

SAN Basics for DBAs–Central Pennsylvania Users Group

I will be presenting tonight at the Central Pennsylvania Users Group on SAN Basics for DBAs (and other data pros!). I’ve given this presentation many times, but it’s been updated to reflect new information about automated storage tiering and what it means for the database.

The slides are available here, and I will update this post with any additional resources.

Here is a link to SQLIO. And a best practices document.
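If you haven’t used SQLIO before, a typical run looks something like the line below (from PowerShell or a plain command prompt); the test file path is a placeholder, and in practice you’d sweep different block sizes and read/write mixes across multiple runs:

```powershell
# Sketch: 2 threads, 8 KB random writes, 8 outstanding I/Os, 120 seconds,
# hardware buffering only, with latency reporting enabled.
& .\sqlio.exe -kW -t2 -s120 -o8 -frandom -b8 -BH -LS C:\SQLIOTest\testfile.dat
```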

Lastly, here is an EMC white paper on the automated storage tiering I mentioned last night.

South Jersey User’s Group — Virtualization for DBAs

Tonight (7/23) I will be talking at the South Jersey SQL Server user group about virtualization and what it means for your databases. We will walk through the terminology associated with VMs, and some best practices for how to configure and troubleshoot your databases on VMs.

The meeting is at the Haddon Heights library–we are in the basement when you walk in. You can register for it here.

My slides for tonight’s presentation will be available on Slideshare.
The link to the SSRS reports for vCenter are on Tom Fox’s blog here.
Denny Cherry’s blog on tuning the vCenter database is here.

Server Virtualization–the bottom end of the spectrum

Brent Ozar posted an excellent blog yesterday on the upper limits of server virtualization, discussing where VMware tops out for database servers. I mentioned to Brent on Twitter that I had just been talking about the other side of this–what’s too small to virtualize.

I was in a meeting yesterday, discussing a recent acquisition for our company and one of our remote manufacturing sites, and the costs involved in converting them to a virtual infrastructure. Each of these sites currently has around 10 physical servers and no shared storage platform. The leading management argument was that it’s cheaper to replace the servers on a regular cycle than to make the investment in a virtual infrastructure.

The hardware costs for the project are as low as about $50-60k–using HP 360s and MSA iSCSI storage. That’s 3 servers and 3-4 TB of storage. The real killer is the VMware licensing–we’re looking at close to $40k a host, which brings the total cost of the project to well over $100k. We’re in an odd spot: we’re a large company, but we’re supporting smaller sites that need some enterprise management features. A smaller shop could get away with VMware Essentials Plus, which is a much more affordable $3k a server (all prices are non-discounted).

However, that brings the total cost of the project to about $70k–which would pay for replacing all of the standalone servers on site at least once.
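The arithmetic behind those two totals, using the rough non-discounted figures above and assuming three hosts, is simple enough:

```powershell
# Rough cost comparison for a small three-host site (figures from this post).
$hardware          = 60000   # HP 360s + MSA iSCSI storage (upper end of estimate)
$hosts             = 3
$fullVmwarePerHost = 40000   # full VMware licensing, per host
$essentialsPerHost = 3000    # VMware Essentials Plus, per host

"Full VMware licensing : {0:N0}" -f ($hardware + $fullVmwarePerHost * $hosts)  # well over $100k
"Essentials Plus       : {0:N0}" -f ($hardware + $essentialsPerHost * $hosts)  # about $70k
```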

This obviously doesn’t account for reduced management and backup costs, nor does it account for the higher availability of the VMware environment. High availability can still be a hard sell in small shops–believe it or not. But that’s where the value in virtualizing their hardware is–outstanding uptime and ease of management, at a somewhat higher cost.

I’m a big fan of virtualization, but sometimes it can be a hard sell to the pointy-haired boss.
