Recovering From Cluster Quorum Loss with SQL Server AlwaysOn Availability Groups
November 8, 2013 1 Comment
Quorum in a cluster is a very important concept—generally speaking, a majority of the cluster nodes, or more likely a majority of the cluster nodes must be up in order to start your cluster. This is done by design, in order to prevent what is known as “split-brain” syndrome, which is where the each half of the broken cluster tries to start and run processes. That is a bad scenario and in Microsoft Windows Failover clusters, is prevented by the cluster service stopping when cluster is no longer reached. One last note—this blog post is for versions of Windows below 2012 R2—there’s a cool new feature there that makes this a lot easier.
The origin of this post is that we had the following cluster scenario, involving SQL Server 2012’s AlwaysOn Availability Group feature. Here was our architecture.
Our concern was that if for any reason we lost the east coast data center, we wouldn’t be able to reach cluster quorum. So we built a lab environment to test out the scenario—just like the above cluster, and then I took SQL Server 1 and Domain Controller 1 down. At first it looked like all was well.
Then suddenly it wasn’t.
And then the Availability Group went down.
So how do we repair this? Theoretically, I could connect to the cluster using the Failover Cluster Manager GUI, and issue a force start from there, but that didn’t work so well.
So let’s hit PowerShell.
What I’m doing there is start-clusternode, this is forcing the start of the cluster service on my Node AG2. The –fixquorum flag forces quorum onto this node, and it also means the copy of the cluster configuration will be treated as the authoritative copy and will be replicated to all of the other nodes. However, quorum should be used as an absolute last resort—don’t do this unless you are in a cluster down scenario.
So now our cluster is back up and running, and since this Availability Group was going across the United States, it was configured in asynchronous availability mode, which requires manual failover mode. It seems counterintuitive, but we need to fail the Availability Group over to the living node in order to restart the AG.
Since we were in asynchronous, Microsoft requires us to check off that we are willing to lose data (this would have been any in-flight transactions, which had not yet reached the transaction log on our secondary replica).
Then, we fail over, and all is happy.
Everything (except the powered down node AG01) is back up and we can connect to the databases in our availability group now.
I did go back to the cluster, and added a quorum file share on Domain Controller 2. This is not necessary, but I did it in the case that we were down for a long time, and potentially adding a new node at our second data center, it might be something you want to do.
Good luck with your clusters!