Don’t Let Your Infrastructure Team Design Your Data Protection Strategy

In the last two days, I’ve been part of two discussions, one of which was about the need to run CHECKDB on modern storage (yes, the answer is always yes, and twice on Sundays), and then another about problems with third party backup utilities. Both of these discussions were born out of (at the end of the day) infrastructure teams wanting to treat database servers like web, application, and file servers, and have one tool to manage them all. Sadly, the world isn’t that simple, and database servers (I’m writing this generically, because I’ve seen this issue crop up with both Oracle and SQL Server). Here’s how it happens, invariably your infrastructure team has moved to a new storage product, or bought an add on for a virtualization platform, that will meet all of their needs. In one place, with one tool.

image

So what’s the problem with this? As a DBA you lose visibility into the solution. Your backups are no longer .bak and .trn files, instead, they are off in some mystical repository. You have to learn a new tool, to do the most critical part of your job (recovering data), and maybe you don’t have the control over your backups that you might otherwise have. Want to stripe backups across multiple files for speed? Can’t do that. Want to do more granular recovery? There’s no option in the tool for that. Or my favorite one—want to do page level recovery? Good luck getting that from a infrastructure backup tool. I’m not saying all 3rd party backup tools are bad—if you buy one from a database specific vendor like RedGate, Idera, or Quest, you can get value-added features in conjunction with your native ones.

BTW, just an FYI, if you are using a 3rd party backup tool, and something goes terribly sideways, don’t bother calling Microsoft CSS, as they will direct you to the vendor of that software, since they don’t have the wherewithal to support solutions they didn’t write.

Most of the bad tools I’m referring to, operate at the storage layer by taking VSS snapshots of the database after quickly freezing the I/O. In the best cases, this is non-consequential, Microsoft let’s you do it in Azure (in fact it’s a way to get instant file initialization on a transaction log file, I’ll write about that next week). However, in some cases these tools can have faults, or take too long to complete a snapshot, and that can do things like cause an availability group to failover, or in the worst case, corrupt the underlying database, while taking a backup, which is pretty ironic.

While snapshot backups can be a good thing for very large databases, in most cases with a good I/O subsystem, and backup tuning (using multiple files, increasing transfer size) you can backup very large databases in a normal window. I manage a system that backs up 20 TB every day with Ola Hallengren’s scripts, and not even storage magic. Not all of these storage based solutions are bad, but as your move to larger vendors who are further and further removed from what SQL Server or Oracle are, you will likely run into problems. So ask a lot of questions, and ask for plenty of testing time.

So if you don’t like the answers you get, or the results of your testing, what do you do? The place to make the arguments are to the business team for the applications you support. Don’t do this without merit to your argument—you don’t want to unnecessarily burn a bridge with the infrastructure folks, but at the end of the day your backups, and more importantly your recovery IS YOUR JOB AS A DBA, and you need a way to get the best outcome for your business. So make the argument to your business unit, that “Insert 3rd Party Snapshot Magic” here isn’t a good data protection solution and have them raise to the infrastructure management.

Always On Availability Groups transport has detected a missing log block…

If you are running SQL Server 2016 (especially before CU3) you have received this error:

DATE/TIME:    8/21/2017 11:24:53 AM

DESCRIPTION:    Always On Availability Groups transport has detected a missing log block for availability database "Database". LSN of last applied log block is (99939:95847:0). Log scan will be restarted to fix the issue. This is an informational message only. No user action is required.

COMMENT:    (None)

JOB RUN:    (None)

Image result for missing block

I’ve talked with the product team about it, and its just something that happens, and is more of an informational message. Based on some discussions I’ve have on #sqlhelp on Twitter, it may be related to not enough network bandwidth between nodes. But if you see these sporadically, relax, it’s nothing major.

Monitoring Availability Groups—New Tools from Solarwinds

As I mentioned in my post a couple of weeks ago, monitoring the plan cache on a readable secondary replica can be a challenge. My customer was seeing dramatically different performance, depending on whether a node was primary or secondary. As amazing as the Query Store in SQL Server 2016 is, it does not allow you to view statistics from the readable secondary. So that leaves you writing xQuery to mine the plan cache DMVs for the query information you are trying to identify.

My friends at Solarwinds (Lawyers: see disclaimer at bottom of post) introduced version 11.0 of Database Performance Analyzer (DPA, a product you may remember as Ignite) which has full support for Availability Group monitoring. As you can see in the screenshot below, DPA gives a nice overview of the status of your AG, and also lets you dig into the performance on each node.

image

There are a host of other features in their new releases, which you can check out some of their new hybrid features in their flagship product Orion. Amongst these features, a couple jumped out at me—there is now support for Amazon RDS and Azure SQL Database in DPA, and there is some really cool correlation data that will let your compare performance across your infrastructure. So, when you the DBA is arguing with the SAN, network, and VM teams about where the root cause of the performance problem, this tool can quickly isolate the root cause of the issue. With less fighting. These are great products, give them a look.

Disclaimer: I was not paid for this post, but I do paid work for SolarWinds on a regular basis.

%d bloggers like this: