Cluster Aware Updating and AlwaysOn Availability Groups
November 16, 2012 2 Comments
One of the features I was most looking forward to in Windows Server 2012, was Cluster Aware Updating. My company has a lot of Windows servers, and therefore a lot of clusters. So when a big vulnerability happens and they all need to be rebooted, we use System Center Configuration Manager to handle the reboots automatically. Unfortunately, clusters must maintain quorum to stay running, so rebooting them has generally been a manual process.
However with Windows 2012, we have a new featured called Cluster Aware Updating that is smart enough to handle this for us. It allows us to define a cluster for patching, so we can tell our automated tools to update and reboot the cluster, or we can even just update and reboot manually. This seems like a big win—it was hard to test in earlier releases of Windows 2012, as updates weren’t available. So my question was how it would work with SQL Server. My first test (I’ll follow up with testing a SQL Server Failover Cluster Instance) was with my demo AlwaysOn Availability Groups environment.
The environment was as follows:
- One Domain Controller (controlling the updates as well)
- Two SQL Server 2012 SP1 nodes
- No Shared Storage
- File Share and Node Majority Quorum Model (File Share was on DC)
- Updates downloaded from Windows Update Internet service
I ran into some early issues when I ran out of C: drive space on one of my SQL VMs, it was less than intuitive that the lack of storage was the issue, but I was able to figure it out and work through it. So I started onto attempt #2. The process for how cluster aware updating works as follows:
- Scans both nodes looking for required updates
- Chooses node to begin updates on (in my case it was the node that wasn’t the primary for my AG—not sure if that’s intentional)
- Puts node into maintenance mode, pausing the node in the cluster
- Applies Updates
- Verifies that no additional updates are required
- Takes node out of maintenance mode.
All was well when my node SQLCluster2 went through this process. When SQLCluster1 went into maintenance mode this happened:
When I logged into SQL Server on SQLCluster2 to check the Availability Group dashboard, I found this.
The Availability Group was in resolving status. Mind you the cluster still had quorum, and was running. I couldn’t connect to the databases that are members of the AG, and I could connect to the listener, but the again databases were inaccessible. The only option to bring these DBs online is to perform a manual forced failover to the other node, which may involve data loss. After the updating is completed the services do resolve themselves.
I was hoping Cluster Aware Updating would work a little more seamlessly than that. As far as I can tell, to avoid an outage, I will need to either have manual intervention, or build in some intelligent scripting to fail my AGs over ahead of time. Hopefully this will get resolved in forthcoming SPs and/or CUs.
**Update–Kendal Van Dyke (b|t) messaged me and proposed that changing the failover and failback settings for the cluster (the number of failures that are allowed in a given time period) could resolve the issue. Unfortunately, I saw the same behavior that I saw above.