PREEMPTIVE_OS_FILEOPS Waits and Filestream Restores

We had a case over the weekend where our automated restore process at a client got hung up on this wait type, for a single database. What was unique about this moderately sized (200-300 GB) database? It had a lot of filestream data–the file count didn’t seem that high, but my guess is the filestream data made up the majority of the data in that database. When the job hung, the restore had been waiting on PREEMPTIVE_OS_FILEOPS for over a day and still showed a null value for percentage complete.

One interesting thing that happened was that after I attempted to kill the restore process, it remained in place. My restore task was running in SQLCMD, so I went a step further and killed the SQLCMD process on the server. The SPID stayed alive in SQL Server, and since it was a non-production environment on a weekend, I restarted the SQL Server service.

Per SQLSkills, “This wait type is a generic wait for when a thread is calling one of several Windows functions related to the file system.” More commonly you see it at the end of a backup, when SQL Server is growing a log file, which does not benefit from instant file initialization (IFI). In our case the server did not have IFI enabled, and I suspect this was one of the contributors to the problem. After we enabled IFI, the restore completed in just under three hours.
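
If you want to confirm whether IFI is enabled on an instance, newer builds of SQL Server (2016 SP1 and later) expose it through sys.dm_server_services; here’s a minimal check, assuming you have VIEW SERVER STATE permission:

-- Check whether instant file initialization is enabled for the database engine service
SELECT servicename,
       instant_file_initialization_enabled
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server (%';

If it comes back N, grant the service account the “Perform Volume Maintenance Tasks” privilege (or check the box during setup) and restart the service.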

EXEC sp_configure filestream_access_level, 2
RECONFIGURE
GO

-- Create a database with a filestream filegroup
CREATE DATABASE Filestream
ON
PRIMARY ( NAME = Arch1,
    FILENAME = 'c:\data\archdat1.mdf'),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM ( NAME = FSData,
    FILENAME = 'c:\filestream')
LOG ON ( NAME = Archlog1,
    FILENAME = 'c:\data\archlog1.ldf')
GO

USE Filestream
GO

-- Filestream tables require a ROWGUIDCOL column with a unique constraint
CREATE TABLE fs_table
(id INT IDENTITY (1,1),
UI UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
FS_Data VARBINARY(MAX) FILESTREAM NULL)
GO

Those are the database objects–I then used PowerShell to create some files, and generate an insert script.

# Create 10,000 small text files
$i=1
while ($i -le 10000)
{new-item -ItemType file -Path "C:\fstemp\" -Name Fs$i.txt -Value "Text file $i"; $i++}

# Build an insert script that loads each file into the filestream table
$i=1
while ($i -le 10000)
{add-content C:\temp\fsinsert.sql "`nINSERT INTO [dbo].[FS_Table] (UI, FS_Data) VALUES (NEWID(),(SELECT * FROM OPENROWSET(BULK N'C:\fstemp\FS$i.txt', SINGLE_BLOB) AS Image$i))"; $i++}

Those are just a couple of loops that create 10,000 files and then generate the script that inserts them into the database. The files are very small (13-14 bytes), but it’s enough to make a representative test. I kicked off a restore and ran Process Monitor to see what SQL Server was doing. SQL Server first queried the filestream directory for each file.

SQL Server is doing four operations for each file:

  • Create File
  • Query Standard Information
  • Write File
  • Close File

Since my files are very small, each of these happens very quickly, but it has to happen thousands of times, which gives me enough time to observe the behavior. You should note that I was able to control for IFI being enabled–with IFI on there was a performance improvement that I think is related to the number of operations. Instead of the four operations per file, SQL Server appears to only create and close each file. Performance on my test instance was inconsistent, but I was working in a constrained VM (my CPU fan has been running all morning).

I suspect this part of the restore process would be slow with either a large number of filestream files or a large volume of filestream data. This can be confusing, because even though the restore is making progress, that progress isn’t reflected in percent complete for the restore.
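
If you want to watch this happen while a restore is running, a minimal check against sys.dm_exec_requests shows the wait type and the (unhelpfully stuck) percent complete for the session:

-- Watch an in-flight restore: wait type, wait time, and percent complete
SELECT session_id,
       command,
       wait_type,
       wait_time,
       percent_complete
FROM sys.dm_exec_requests
WHERE command LIKE 'RESTORE%';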

I hope you learned something in this post–I know I did. Also, don’t #$%^ing store files in your database, unless you like hitting yourself with a hammer.

Why You Shouldn’t Use Amazon RDS for your SQL Server Databases

Disclaimer: I’m a Microsoft MVP and a Microsoft shareholder, but neither of these things affected my opinions in this post.

Cloud vendors have built a rich array of platform as a service (PaaS) solutions on their platforms. They market these heavily, because they have a higher degree of stickiness than IaaS offerings (and in many cases they likely have higher profit margins), but they also have key benefits for the users of the platform. Because a PaaS solution is fully managed by the cloud provider, these services tend to have a set of common features:

  • Easy to deploy–you are never running setup.exe
  • Built-in high availability–no need to configure a cluster
  • Easy disaster recovery/geo-replication–usually in most services it’s just a few clicks away
  • Automated backups
  • Automated and possibly zero downtime patching

While some services include other really useful features (for example, the query data collected by the Azure SQL Database and Managed Instance platforms), I wanted to focus on the common value adds of PaaS systems across providers. I made the last two of these bold, because I feel they are the most important, especially in scenarios where the vendor doesn’t own the source code to the product. Like Amazon RDS for SQL Server.

Amazon RDS Backups

I’m writing this post mainly because of what I just learned this week about backups for SQL Server on RDS. I was on some client calls this week when I learned that the default backup approach is volume snapshots (which I knew), but what I didn’t know was that you can’t restore an individual database from these default backups.

I feel like their docs are deliberately vague about this–it’s not immediately obvious that this is the case, but a few Stack Overflow threads and discussions with fellow MVPs confirmed what I was told on the call. Amazon does support taking your own backups on RDS, and you can then restore individual databases to individual points in time, but where’s the fun (and more importantly, the value proposition) in that? To me, this really eliminates one of the biggest benefits of using a PaaS service. As you read the docs, note that AWS refers to normal backup/restore as “native backup/restore”.
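
For reference, native backup/restore on RDS is driven by stored procedures Amazon ships in msdb, and it writes to S3 rather than a local path; a rough sketch looks like this (the database name and S3 ARN are placeholders, and the native backup/restore option has to be enabled on the instance via an option group first):

-- Take a native backup of a single database to S3 (RDS-provided procedure)
EXEC msdb.dbo.rds_backup_database
    @source_db_name = 'MyDatabase',
    @s3_arn_to_backup_to = 'arn:aws:s3:::my-backup-bucket/MyDatabase.bak';

-- Check on the status of the backup/restore task
EXEC msdb.dbo.rds_task_status @db_name = 'MyDatabase';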

Patching

Azure SQL Database and Managed Instance both seamlessly patch your databases without your knowledge. For the most part those services utilize hot patching, which means in many cases there isn’t any downtime to install the Azure equivalent of a CU. (Owning the source code has its benefits.) Amazon RDS can automatically install selected CUs, but you should be aware that this is an option you choose when you deploy your instance.


"ValidUpgradeTarget": [
    {
        "Engine": "sqlserver-se",
        "EngineVersion": "14.00.3192.2.v1",
        "Description": "SQL Server 2017 14.00.3192.2.v1",
        "AutoUpgrade": true,
        "IsMajorVersionUpgrade": false
    }

The other thing you should note is that you may not always have the most current CU available to you. Currently, AWS supports CU12, which is three CUs behind current. However, our customer was only on CU8–so patching doesn’t seem to be as automatic or easy as it is on the Azure side.
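
If you want to see exactly which build and update level an instance is actually running (whether it’s RDS, an EC2 VM, or on-premises), SERVERPROPERTY will tell you:

-- Report the exact build and update level of the running instance
SELECT SERVERPROPERTY('ProductVersion')     AS BuildNumber,
       SERVERPROPERTY('ProductUpdateLevel') AS UpdateLevel,  -- e.g. CU8
       SERVERPROPERTY('Edition')            AS Edition;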

Licensing

This one really isn’t AWS’ fault (it’s Microsoft’s fault), but there are a couple of issues with licensing RDS. The first is that you don’t have the option of running Developer Edition for non-production workloads, which, especially if you are running Enterprise Edition, represents an expensive choice–you either need to license your dev/test environment for Enterprise, or run Standard Edition in dev to save some money and not have adequate features in place (and though the code base is mostly the same, performance characteristics can be dramatically different). Additionally, you cannot bring your own SQL Server licenses to RDS; you have to lease them through AWS. Neither of these problems is the fault of AWS, but they still suck.

AWS is an excellent cloud platform, even though I actively hate their console. For the most part a lot of components are very similar to, or even better than, Azure’s. However, when it comes to a service where Microsoft owns the source code and AWS doesn’t, you can see the clear superiority of Azure. So what is an AWS shop that runs SQL Server to do? IMO, the backup/restore thing is a deal breaker–I would just recommend running in an EC2 VM.

Fixing SQL Server Database Corruption (when you get lucky)

First things first–if you are reading this, and not regularly running consistency checks on your SQL Server databases, you should drop everything you are doing and go do that. What do I mean by regularly? In my opinion, based on years of experience, you should run DBCC CHECKDB at least as frequently as you take a full backup. Unless you have a clean consistency check, you don’t know if that last backup you took is valid. SQL Server will happily back up a corrupted database.
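
If you don’t already have a maintenance solution in place, even a bare-bones scheduled check is better than nothing; a minimal example (the database name is a placeholder) looks like this:

-- Full consistency check; NO_INFOMSGS keeps the output limited to actual errors
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;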

Screenshot of a failed checkdb, followed by a successful backup command

I cheated a little bit here and used an undocumented command called DBCC WRITEPAGE to corrupt a single 8 KB page within a nonclustered index on a table I created. You should basically never use this command, unless you are trying to corrupt something for a demo like this, but as you can see, after we’ve corrupted the page, CHECKDB fails, and SQL Server then happily takes a backup of our now corrupted database.

What Causes Database Corruption?

Other than doing something terrible like editing a page, database corruption is mostly caused by storage failures. An example of this could be your local SAN, where the SAN’s operating system acknowledges to the host operating system that a write operation is complete, but the write doesn’t actually complete. SQL Server received the write acknowledgment and thinks that the data was correctly written to the page, but for whatever reason it wasn’t. I had this happen a couple of times in a past job when the SAN in our “data center” (it wasn’t) crashed hard when the building lost power (yeah, we didn’t have a generator, hence the quotes. Be careful who you buy your medical devices from). What was actually happening is that the SAN was acknowledging a write as soon as the data hit memory on the SAN, which is a performance enhancement that assumes you have a proper power infrastructure that will prevent the SAN from “crashing hard”. You know what happens when you assume, right?

Anyway, this is far less common than it used to be for a number of reasons, one of which is the use of cloud-based storage, which is very robust in terms of data protection. Also, modern enterprise-class SANs are more efficient and less likely to have failures like this. However, it’s still very possible–I had a minor corruption event in an Azure VM a couple of years ago, and we had a customer who filled up their very non-enterprise-class SAN, with terrible results (it was all of the corruption). So the moral of the story is: wherever you are running SQL Server, you need to run CHECKDB. (Except Azure SQL DB, and possibly Managed Instance.)

Fixing Corruption

There are a lot of tools that people on the internet will try to sell you to fix your database corruption. Almost all of them are crap–if you have corruption in a table or clustered index, or worse, in one of the system pages that determines allocation, you are screwed and need to restore your last good backup. (See why backup retention matters?)

However, in some cases you can get lucky. If your corruption is limited to a nonclustered index, you don’t need to restore the database, and you can just rebuild your index.

However, in my case a straight rebuild just threw the dreaded SQL Server 824 error. I suspect this had something to do with how I corrupted the page, but that investigation is not complete. I was able to disable the index and then rebuild it, and we had a successful CHECKDB.
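
The disable-then-rebuild pattern looks like this (the index and table names are placeholders for my demo objects); disabling the nonclustered index first means the rebuild recreates it entirely from the base table data rather than reading the corrupt pages:

-- Disable the corrupt nonclustered index, then rebuild it from the table data
ALTER INDEX ix_fs_table_demo ON dbo.fs_table DISABLE;
ALTER INDEX ix_fs_table_demo ON dbo.fs_table REBUILD;

-- Re-run the consistency check to confirm the corruption is gone
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS;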

Getting Started with Log Replay Service for Azure SQL Managed Instance

Recently, I’ve started on a project where we are migrating a customer to Azure SQL Managed Instance, which now supports a few different migration paths. You can simply restore from a database backup, but you can’t apply a log or differential backup to that database afterward. You can also use the Database Migration Service, but that requires a lot of infrastructure and Azure configuration. The log replay service, or LRS, is the functional equivalent of log shipping to your new managed instance database. Log shipping is a well-known methodology for both database migrations and disaster recovery; however, the LRS implementation is a little different–let’s talk about how it works.


First, You Must Back Up to URL

In order to use the log replay service, you need to take a backup of your database to Azure Blob Storage. Alternatively, you can push on-premises backup files up using AzCopy, or upload them via the portal or Storage Explorer, but let’s focus on backup to URL. The first thing you need to do is create a credential in SQL Server. While there are other methods of creating a credential, I have had absolutely the best luck using this PowerShell code, which produces a CREATE CREDENTIAL statement to execute on your SQL instance.

CREATE CREDENTIAL [https://storageaccount.blob.core.windows.net/sqlbackups] WITH IDENTITY='Shared Access Signature', SECRET='sv=2020-08-04&si=sql1&sr=c&sig=Dw0uE13l2347%2FCNJgQbm5sHhM9ZtDts9F8SMi5Re1ASUw%3D'

The SAS token is created at the container level–you will also need to create a stored access policy (that code from Microsoft will create everything for you, including the storage account). Next, execute the backup.

EXECUTE dbo.DatabaseBackup
    @Databases = 'DB1',
    @URL = 'https://storageaccount.blob.core.windows.net/sqlbackups',
    @BackupType = 'LOG',
    @Compress = 'Y',
    @Checksum = 'Y',
    @Verify = 'Y',
    @DirectoryStructure = '{DatabaseName}',
    @NumberOfFiles = 8,
    @MinBackupSizeForMultipleFiles = 10240,
    @LogToTable = 'Y'

I’m using Ola Hallengren’s code to do my backups, but you can also do this manually.

BACKUP LOG [db1] TO URL = N'https://storageaccount.blob.core.windows.net/sqlbackups/db1/687593983fe8_db1_LOG_20211216_162916.trn' WITH CHECKSUM, COMPRESSION

I highly recommend using Ola’s code to perform these backups (and all of your backups). A couple of things you should note: you have to use the checksum option, and with Ola’s backups I’m using a directory structure of {DatabaseName}. The reason for this is that LRS does not support nested directories, like Ola’s code defaults to, nor does it support backups stored in the root of the container. (You should note the backup example on the LRS doc page does exactly that.)

Docs say you need a full, differential, and log backup to do a complete migration; however, in my testing I’ve been successful with just a log backup in addition to a full. To complete your restore process, you will need the name of your last log backup. You can use this code to generate the PowerShell that initiates the migration process.

DECLARE @DBName SYSNAME;
DECLARE @lastlogbackup NVARCHAR(512);
DECLARE @RGName VARCHAR(256) = 'ResourceGroupName';
DECLARE @MIName VARCHAR(256) = 'ManagedInstanceName';

-- One row per user database to migrate
DECLARE database_cursor CURSOR FOR
SELECT name FROM master.sys.databases WHERE name NOT IN ('master','msdb','model','tempdb');

CREATE TABLE #migration (command NVARCHAR(MAX));

OPEN database_cursor;
FETCH NEXT FROM database_cursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Grab the file name of the most recent log backup for this database
    SELECT TOP (1) @lastlogbackup = REVERSE(SUBSTRING(REVERSE(physical_device_name), 1, CHARINDEX('/', REVERSE(physical_device_name)) - 1))
    FROM msdb.dbo.backupset bs
    INNER JOIN msdb.dbo.backupmediafamily bmf ON bs.media_set_id = bmf.media_set_id
    WHERE bs.database_name = @DBName AND bs.type = 'L'
    ORDER BY bs.backup_start_date DESC;

    -- Build the Start-AzSqlInstanceDatabaseLogReplay command for this database
    INSERT INTO #migration
    SELECT 'Start-AzSqlInstanceDatabaseLogReplay -ResourceGroupName "' + @RGName + '" -InstanceName "' + @MIName + '" -Name "' + @DBName + '" -LastBackupName "' + @lastlogbackup + '" -Collation "SQL_Latin1_General_CP1_CI_AS" -StorageContainerUri "https://storageaccount.blob.core.windows.net/sqlbackups/' + @DBName + '/" -AutoCompleteRestore -StorageContainerSasToken "sp=rl&st=2021-12-14T21:25:07Z&se=2021-12-25T05:25:07Z&spr=https&sv=2020-08-04&sr=c&sig=h8%2FimcTh%2BnlCQ3WodgTShWERT7yA38HfBQFOLwvEoiqw%3D"';

    FETCH NEXT FROM database_cursor INTO @DBName;
END

CLOSE database_cursor;
DEALLOCATE database_cursor;

SELECT * FROM #migration;

For a small database, this only takes a few minutes. If your PowerShell execution goes out to lunch and your database is stuck in restoring, there is probably something wrong with the way you formatted your backup string. This also manifests itself as an HTTP 204 error in the error log of your managed instance.

Date      12/9/2021 8:09:42 AM
Log       SQL Server (Current – 12/9/2021 4:27:00 AM)
Source    spid101
Message   Sending the notification action: NotifyRestoreProgress
Message   Http code after sending the notification for action NotifyRestoreProgress: 204

What’s annoying is that the PowerShell cmdlet doesn’t attempt to do any validation of these requirements. Hopefully this will change in the future, because this was a challenging problem to troubleshoot.


A Post-mortem for PASS, and What That NDA* Issue Was

Editor’s Note: I wrote this last December and went back and forth on publishing it. As I see some of the poor leadership patterns from PASS creeping into new offshoots of the SQL community, I decided to publish it today.

I am not writing this to dance on the grave of PASS or throw pitchforks at anyone. Just like the rest of the community, I am disappointed at the end of PASS, as the community worked really hard to make the organization great. However, a lot of mistakes were made over the course of many years, and I think it’s important to talk about those mistakes, and how we as a community can avoid them in future endeavours. The biggest failure was not protecting community assets like SQL Saturday and the Virtual Chapters, which were built by the sweat of the community. But first, I’d like to talk about the “NDA” violation that came up while I was running for the board.

The reason I’m including this story in my post-mortem is that the board NDA was a major organizational problem and was abused by senior leadership at C&C in order, in my opinion, to hide a lot of bad practices that took place within the organization. Nearly every interaction with PASS required an NDA–for a community organization this notion is ridiculous. There is a small set of activities that actually require an NDA–typically harassment cases and personnel matters like salaries. Other than that, all activities and budgets should be fully transparent. One of the biggest issues in PASS’s failure was that the members at large on the board of directors did not have a clear picture of PASS’s finances until it was much too late. This stonewalling was not for a lack of effort–even to members of the board, PASS did not operate transparently. This behavior continued until the very end of the organization.

About that NDA Violation

I’d also like to clarify the difference between whistleblowing activity and NDA violations. Whistleblowers report things when they see their leadership acting with malfeasance, and whistleblower activity nearly always violates NDAs. Whistleblowers have exposed government and corporate negligence, and have saved lives and money.

Many people reached out to me after the posts I wrote last year. There was rightful and genuine concern for PASS–despite frustrations there was a lot of emotion around PASS, and more importantly, our community has been largely structured around a centralized organization. As people reached out to me, I learned fairly early about the plan to ask Microsoft for a very large bailout (on the order of millions of dollars) in an effort to save PASS. Part of this plan involved selling Microsoft a large number of PASS Pro memberships.

Given that PASS had no plan for a revenue stream in 2021, this funding project seemed absurd, and at PASS’ current burn rate they would have required another bailout the next year. I had numerous sources for this information, and while I won’t reveal the sources, they included community members and C&C employees. I did share this information with selected individuals at Microsoft as well, but I never revealed it publicly. I also shared the information in confidence with a few selected individuals. One of those people decided that was a violation and reported it to the PASS exec. I have a couple of things to say about this–the first of which is that in the application that prospective board members sign, there is no reference to an NDA, and I was not under contract with PASS. I’m not a lawyer, but I took a few semesters of business law, and if there is no NDA in place, there is no expectation of privacy. In the end, none of that mattered, as PASS is now dead.

A Board Operating in the Dark

In my discussions with former board members, one topic that constantly came up was the complete lack of transparency C&C showed, particularly when budgets and spending were discussed. Multiple board and even exec members reported to me that they gave explicit directives to C&C, only to have those directives ignored.

Much of the blame for the end of PASS will be laid on the Covid-19 pandemic taking away the primary revenue source, the in-person Summit. While this is the direct cause of PASS’ failure, the truth is that PASS had been on a downward slide for several years. In preparing to write this post, I spoke to board members across several generations of the board, and to deeply involved community members, to gain insight into what happened. There are a number of things that have always concerned me about PASS–the lack of transparency in the organization has always, always been the core problem.

C&C’s Conflicts of Interest

Why do I think there was such a culture of secrecy in the organization? I think it stems from the management organization and its leadership. C&C always acted in the best interests of C&C, and not the best interests of PASS or the Microsoft data community, right down to the final days of the organization. C&C was paid by PASS as a percentage of PASS’ spending–while this is somewhat common in event management circles, in organizational management circles it can lead to perverse outcomes, and it is likely what was ultimately responsible for PASS’ bankruptcy.

C&C had so many conflicts of interest I don’t even know where to begin talking about them. The CEO’s sister was responsible for the finances of both PASS and C&C, and C&C held a non-voting seat on the executive board. While many members of the board would say this was helpful as PASS and C&C operated in partnership, it is probably the most explicit and visible conflict of interest. The fact that the organization providing all of your services knows what your bank balance and income are is problematic at best and can lead to poor outcomes. It also means your partner has no real interest in saving the organization money. C&C had no clients other than PASS, which is likely a violation of IRS and Revenue Canada employee/contractor regulations.

I’ll share some specific examples, but what this practice led to was a singular focus on PASS Summit, because that’s where the spending (and revenue) was for the organization, and therefore for C&C. This came at the expense of community events like user groups and SQL Saturdays. It also led to C&C heavily influencing decisions like spending a ton of money on a virtual Summit that was unlikely to be successful, or spending a lot of money on PASS Pro, which effectively had no hope of ever being profitable. While PASS’ noble mission was “Connect, Share, and Learn”, in practice it became “whatever it takes to drive C&C’s revenue growth”.

Lost Opportunities

One of the challenges PASS faced well before Covid was a lack of growth (and even a downturn) in the number of people attending PASS Summit. Throughout my time with PASS, the board acknowledged this single source of revenue as a risk and made a few attempts at growing the community or diversifying revenue. One of these was the Business Analytics conference, which was initially driven by Microsoft, but could have evolved into something much bigger. As opposed to a technology-only conference, business analytics can be marketed to a wide swath of data professionals, not just technologists. The marketing for this conference was handled poorly, as C&C tried to market it just as though it were PASS Summit.

This also led to PASS turning down the opportunity to run Power BI user groups, which in turn led to the successful Microsoft Business Analytics (MBAS) conference–an opportunity that could have both increased revenue and, more importantly, grown our data community.

Another story shared with me was that Microsoft wanted to make PASS Summit a free event in order to dramatically grow the size of the conference. While this was a bit of one-upmanship between software vendors as to who has the largest conference, it had the potential to exponentially grow the PASS community. Microsoft was willing to fund the revenue difference, but this was shot down by C&C, presumably because the initiative would have cost them money. This goes against PASS’ stated mission of Connect, Share, Learn, and is just one of a myriad of bad acts perpetrated by C&C through the years that ultimately doomed PASS.

What Happened in 2020

There are countless examples of C&C’s actions that were detrimental to the broader community in trying to drive Summit attendance and revenue, but I’d really like to focus on the actions that took place this year, which led to the death of PASS. When the pandemic hit in February and March, it became very obvious to everyone that having large conferences was an unlikely proposition until a vaccine was in place, and even then, probably for a year after that.

Assuming people were mostly vaccinated by the middle of 2021, the best hope for an organization whose revenue was dependent on conference revenue would have been to move to an austerity budget that limited spending to completely essential services, until a time, probably in 2022, when a large-scale conference was a good bet. This was bad news for PASS, because its major revenue source was compromised until at least 2022. (Editor’s note: I wrote this last year, and it has mostly held true.)

Even without an austerity budget, what PASS could have done in April 2020 was move to protect the assets that the community built, like SQL Saturday and the Virtual Chapters, which were now at risk due to multiple organizational failures. The executive board and C&C executed none of these options, and kept the board members at large, as well as the general membership, in the dark as to how perilous the financial situation was.

There were additional poor decisions made in 2020. The exec and C&C decided to go ahead with PASS Pro, a paid membership offering built on a learning management system. While a noble concept, it was competing against mature services like Pluralsight and LinkedIn Learning. There was some Microsoft funding for this project (my understanding is around $300k USD), and several community members were paid (around $3-5K USD) to record videos for the service. While a lot of money was spent to build and launch the service, there was never any hope of it being a significant revenue source for PASS.

Despite what C&C has always said about PASS having 300,000 members, based on better estimates PASS likely had fewer than 5,000 “active” members and at most 20-30,000 “passive” members who unwittingly signed up by attending a SQL Saturday or PASS Summit. As of November, PASS Pro had only sold 300 memberships (note that some of these may have been corporate memberships, which included up to six subscriptions). Even if we assume all 300 of those were corporate memberships (which I know not to be the case), that provides PASS $180,000 of revenue. While that is not nothing, for an organization with PASS’ expenses it is a completely insignificant amount of revenue, and since PASS/C&C had no real plans for a content pipeline, it was unlikely to grow over time, and more likely to fade away like many other PASS initiatives.

Which brings us to Virtual Summit. I don’t have a final accounting of what this event cost, but after PASS’s bankruptcy we know it was responsible for the death of the organization. PASS was very slow to cancel the in-person Summit, and the virtual event was not announced until June. My understanding is that part of this was because it was a challenge to get out of the contract with the Houston convention center; however, it put PASS in a place where lots of other (sponsor-funded) conferences had already gone virtual and, more importantly to this discussion, FREE.

I’m of the opinion that C&C leadership took this opportunity to apply the death blow to PASS. Instead of moving to an austerity budget and taking assistance for revenue loss from the Canadian government, PASS decided to go “all in” on the virtual conference, using a very expensive virtual conference platform (ultimately derided by both speakers and attendees) that promised a “unique, interactive” conference and ended up being just a really fancy wrapper for a Zoom meeting. This was defended in the December board meeting minutes: in order to protect PASS’s cash flow, executing a virtual conference was a necessity.

The pricing of the virtual conference was lower than in-person Summit, though my understanding is that C&C was pushing for significantly more expensive pricing than the final number. C&C sold the board on the notion that since people didn’t have to travel, they would have a potentially larger audience than at an in-person event. That assumption was flawed from the start, as it didn’t account for the pandemic recession, virtual meeting fatigue, and the fact that most of the other conferences PASS usually competes against were FREE.

PASS weakly attempted to highlight how the conference was going to be different and more interactive than other conferences, but they never held a free trial event to build hype, and C&C was ineffective at marketing the virtual conference. During the entire summer and into early fall there were few emails or tweets about the conference, other than a short video describing the platform.

I lay the responsibility for the death of PASS largely on C&C, but secondarily on PASS’ executive leadership. In good times, some of the decisions made vis-a-vis C&C were poor, but they didn’t matter given the revenue streams. A microcosm of this is that PASS sent speaker shirts to 2020 Virtual Summit speakers. Anyone who’s ever run an event knows that the first two things you cut when your budget is not looking good are speaker shirts and the speakers’ dinner.

The decisions made by PASS in 2020, from launching PASS Pro, to betting the organization on a virtual conference, to finally asking Microsoft for a major bailout with no significant plan forward, show a complete lack of leadership from anyone on the exec, who instead deferred all decisions to the deeply conflicted C&C organization, which had shown repeatedly that it only cared about its own interests and not those of the broader data platform community. If you are waiting on a refund from PASS Summit, you will have to stand in line behind C&C, because they are getting paid first. Like always, Judy got her money before the community did. And that just fucking sucks.

Like much of the community, I am lamenting the loss of PASS, but I feel the need to call out executive leadership, not just for their leadership failings (which are many), but for their passive aggressiveness and general rudeness toward community members who cared about the organization. These actions did not only target community members; the abusive behavior extended to other members at large of the board of directors. The exec did everything they could to keep the rest of the board in the dark on the financial situation, all the way to the bitter end of the organization. Board members were chastised for asking hard questions and kept in the dark before votes. There were never open and honest communications between the board members at large and the exec and C&C. While in normal times this meant C&C got resources that should have gone to the community, in this pandemic year it meant the complete death of the organization without protecting any of its assets. This is not a recipe for a strong organization. That secretive behavior did nothing to help, and it actively alienated many active and senior members of the community who otherwise could have been strong allies to the PASS cause.

I am thankful for the existence of PASS, and the community won’t go away. I don’t necessarily want to focus on what’s next–yet. I think the best thing community members can do is keep running their user groups, and hopefully in the second half of 2021 we can start having data events on Saturdays, whatever they are called. I think the best approach to a larger organization is a federation of larger events, so they can coordinate on things like scheduling and have a unified code of conduct and diversity and inclusion policy. Having a centralized organization comes with a lot of expense, so starting small and building is the right approach. Andy Mallon has a good vision for this in his post here.

Passing the AZ-500 Exam–How I Prepared

A few months ago, I was asked to do a webinar on security for Microsoft Teams. In preparing for that webcast, I noticed a number of security features that required either Azure Active Directory P2 or Office 365 M5. In order to get either of those features for free, DCAC needed to earn the silver security competency from Microsoft. To do this, one of us had to pass either the MS-500 (the Office/Microsoft 365 version) or the AZ-500 security exam. Since I was the one who wanted the features, I volunteered to take the exam.

The first time I took the exam, I didn’t do much preparation. I was pretty busy, I had recently taught some Azure architecture training, and I was hoping that would be enough to get me through the exam. That almost worked–I came pretty close to passing, and likely didn’t pass because I got frustrated with some questions and kind of blew through them. Pro tip: don’t schedule an exam for Monday at 4 PM.

The second time I took the exam, I prepared a little more. Here were the key topics I focused on:

  • Azure Active Directory
  • Azure Sentinel
  • Azure Role Based Access Control (RBAC)
  • Azure Networking
  • Azure Key Vault

You can see the full exam outline here. I spend a lot of time working with most of these technologies in my daily job, so this was more about refreshing my memory on the specifics of how each of them works. I also paid for the practice exam, which was helpful but probably not necessary; it did get me more into the exam mindset.

The second time I took the exam, I didn’t find the questions as friendly to my knowledge and generally thought the exam was harder. However, this time I passed the exam fairly easily. Yay–a week later we had AAD P2, so I can now do cooler demos. My general recommendation to folks who want to take cert exams is to just dive in and take them, as it’s really the best way to study and to learn the areas where you are weak.

A Lesson in DR, Azure Site Recovery, and Troubleshooting

I need to blog more. Stupid being busy. Anyway, last week we were doing a small-scale test for a customer, and it didn’t work the way we were expecting, for one of the dumbest reasons I’ve ever seen. If you aren’t familiar with Azure Site Recovery (ASR), it provides disk-level replication for VMs and allows you to bring on-premises VMs online in Azure, or in another Azure region if your VMs are in Azure already. It’s not an ideal solution for busy SQL Server VMs with extremely low recovery point objectives; however, if you need a simple DR solution for a group of VMs and can sustain around 30 minutes of data loss, it is cheap and easy. The other benefit that ASR provides, similar to VMware’s Site Recovery Manager, is the ability to do a test recovery in a bubble environment.

Our test environment was as shown in the image below:

In our case, due to configuration differences between subscriptions, we had to have our test domain controller in a different virtual network, peered to our test network. The DC was also in a different Azure region, which wasn’t a big deal because you can still peer across regions. I drew the additional box around the test environment because it is not connected to the rest of the network.

You should note that for a real disaster recovery scenario, you are likely better off having a live domain controller in each region where you want to operate. However, when you are testing you cannot use a live domain controller, as objects and passwords could get out of sync. For this test we added a DC to our ASR config as noted.

The other caveat to doing testing is that you need a way to login to your test environment. Because your test network(s) are not connected to your other networks, unless you create a new VPN gateway, you likely have no connection to the test bubble. I don’t recommend creating a new VPN, which leaves you a couple of options:

  1. Create a jump host with a public IP address and create an allow list for specific IPs to connect to it
  2. Use Azure Bastion, which provides an endpoint that allows you to RDP/SSH into your VMs from a browser in a secure fashion.

We decided to go with option 2–which led to weird ramifications.

If you are testing in this scenario, you want to bring your domain controller online before failing over your SQL Server and app server(s). It’s not completely necessary, but it will remove some headaches and reboots down the road. After you fail over your domain controller, you need to seize the FSMO roles and make the domain controller a global catalog, if it is not one already. However, you need to log in with domain credentials to do all of that first.

There is no guidance in the docs about how to log in using Bastion. However, after a bunch of stupid troubleshooting yesterday, I discovered that if you attempt to log in using the standard domain credentials (e.g. contoso\joey), you will see the following error message.

Instead, you have to log in with the user principal name (UPN), or in my case joey@contoso.com. We troubleshot this in a test environment yesterday and were confused, because DNS worked and it seemed like most other AD functionality was in place. It took a couple more attempts at failover to confirm the fix (one of the benefits of ASR is that it’s really easy to test failover multiple times).

Sadly, this behavior wasn’t documented anywhere that I could find, except in a forum post (which was on docs.microsoft.com, so I guess that counts), but it would be really nice if the portal just showed the correct format for logging in with a domain account. Beyond that, ASR is pretty cool, and it allows for easy DR testing.

Senior SQL Server DBA Job Interview Questions–2021

In my role as a consultant, it’s rare that I go on an interview anymore, though I occasionally get interviewed by a client, or interview potential DBAs for clients. There are a number of these lists on the internet, but many of them are old and focus on trivia questions (there won’t be any questions here on what a clustered versus a nonclustered index is, but if you are interviewing for a senior role, you should know that anyway). I also like to focus on open-ended questions, to gauge the depth of knowledge of the person I’m interviewing.


Here are some sample questions, and explanations:

What does TempDB do in SQL Server? This is probably my all-time favorite SQL Server question, because almost anyone can answer it correctly, but at the same time I can gauge someone’s knowledge of SQL Server from a 100 level to “I worked on the product team and wrote part of the tempdb code”, depending on the answer. The basic right answer is “stores temp tables”; the senior DBA right answer involves spills, isolation levels, and version stores.
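
If you want to see that senior-level answer in action, this query breaks down how much tempdb space is going to user objects, internal objects like sort and hash spills, and the version store:

-- Break down tempdb usage by consumer (8 KB pages converted to MB)
SELECT SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,  -- spills, sorts, hashes
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
       SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_space_mb
FROM tempdb.sys.dm_db_file_space_usage;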

What would you change on a default (e.g. next, next, next, finish) installation of SQL Server? The answer to this question has changed in recent years, as the product team has implemented most of the best practices around TempDB, MaxDOP, and max memory in setup (bonus points if the candidate knows which versions those landed in: TempDB (2016), MaxDOP (2019), and max memory (2019)). Other correct answers include enabling IFI, optimize for ad hoc workloads, changing file locations, and changing default growth rates. (The really good senior DBA answer is “I use dbatools or a T-SQL script to automate my setup based on the machine size.”)
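
As a rough sketch of what those post-install changes look like in T-SQL (the numbers here are illustrative placeholders you would size to the machine, not recommendations):

-- Example post-install configuration; values are illustrative only
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 57344;       -- leave headroom for the OS
EXEC sp_configure 'max degree of parallelism', 8;        -- typically <= cores per NUMA node
EXEC sp_configure 'cost threshold for parallelism', 50;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;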

I didn’t want to make this a full list of questions and answers, but I will include a number of questions. You can go find the answers, or ask for help in the comments.

“Microsoft has a number of Platform as a Service database options in Azure. Can you tell me what those options are and a good example of a use case for each of those options?”

“Tell me about the various options for encrypting data in SQL Server, and how would you use each of them?”

“What are some system waits you have observed, and what tuning steps did you take based on those wait statistics?”

“Tell me some ways you might improve backup and restore performance?”

“What types of workloads could benefit from a columnstore index and what are the tradeoffs?”

“How would you configure storage for a SQL Server VM running in Azure?”

“Can you walk through the high availability and disaster recovery options, and the benefits and tradeoffs of each of them?”

“What is your methodology for troubleshooting a query that suddenly decreased in performance?”

“How would you diagnose a storage performance problem versus a poorly performing query?”

“What are the possible causes of database corruption?”

“You need to deploy a change to 3,000 SQL Servers–tell me about your methodology.”

“How do you manage performance metadata across a large number of servers to provide baseline performance information?”

This is just a start–feel free to add questions in the comments or ask for help.

Validating IP Addresses in PowerShell

I had a client ask me to write a PowerShell script (for an Azure Automation runbook) to automate the creation of firewall rules for an Azure SQL Database. As part of this process, I was planning on validating the IP addresses (a valid IP address is of the format x.x.x.x, where x is an integer between 0 and 255) by writing code to split each IP address string into individual octets and figuring out whether each value was a valid number. However, in doing research (searching Stack Overflow), I discovered PowerShell has a built-in IP address data type, which made my life way easier.

Since PowerShell has IPAddress as a data type, we can attempt to cast a string value to that type.

If I type an invalid IP address I get the following result:

You’ll note I don’t get an error; I simply don’t get a result. To do something useful with this, I have to write a little more code to handle the IP address in my script.

try {
    $startip = '0.0.0.0'
    $endip   = '1.1.1.1'

    # Cast each string to [ipaddress]; a failed cast yields $null, which casts to $false
    $s = $startip -as [ipaddress] -as [bool]
    $e = $endip -as [ipaddress] -as [bool]

    if ($s -eq $false -or $e -eq $false) { Write-Error "Invalid IP Address" }
}
catch {
    # Surface anything unexpected from the validation
    Write-Error $_
}

After casting my parameters to IPAddress, I can then cast them to a boolean. A boolean is a yes/no value that PowerShell represents as $True/$False. If the IP address submitted is valid, the result will be $True. In my if statement, I’m saying that if either the start or end IP address is invalid, write an error saying so.

How Do You Shrink Your Azure SQL Database?

In the early days of Azure SQL Database (née SQL Azure) we had size limits that were minuscule–I vaguely remember 5 and 10 GB, but it could have been smaller. Back then, space was at a premium. In 2021, we have Hyperscale, which allows an Azure SQL Database of up to 100 TB (or more) in size. However, data size in the cloud means cost, so space management can matter, right?


Let’s Talk About Your Log File

While many DBAs will never ever want to shrink a data file, sometimes your transaction log file can grow for external reasons. Typically these reasons are as follows:

  • Your log backups failed for a while and the log file grew
  • You couldn’t write to an availability group secondary for a while and your log file grew
  • You couldn’t write to a replication target for a while and your log file grew

You will note that I didn’t mention “your log file grew because of a large index rebuild”. That’s because that is probably, roughly (this is a really rough rule of thumb), how big your transaction log needs to be. But anyway, we’re talking about Azure SQL Database, so you don’t need to worry about your transaction log file. Microsoft takes care of that for you: “Unlike data files, Azure SQL Database automatically shrinks log files since that operation does not impact database performance.”
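
If you’re curious how full the log actually is before you worry about its size, a quick check against sys.dm_db_log_space_usage (run in the database in question) will tell you:

-- How big is the log, and how much of it is actually in use?
SELECT total_log_size_in_bytes / 1048576.0 AS log_size_mb,
       used_log_space_in_bytes / 1048576.0 AS log_used_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;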

What About My Data Files?

In an on-premises world, shrinking data files should come down to two scenarios:

  • You were creating a database file and you accidentally added an extra zero to the size field.
  • You weren’t using compression, and you compressed all of your larger tables and indexes and got a bunch of space back

The first one is easy–since SQL Server never wrote data to those pages, DBCC SHRINKFILE will quickly release them. The second one is much harder and depends on your data types and how busy your server is. In that case the shrink may never complete–especially if you used LOB data types. Besides that, shrinking will fragment your data massively. So you should basically never run it.
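
For that first scenario (the fat-fingered file size), something like this is the least painful fix; TRUNCATEONLY just releases unused space from the end of the file without moving any pages around (the file name and target size are placeholders):

-- Release unused space at the end of the data file without relocating pages
DBCC SHRINKFILE (N'MyDatabase_Data', TRUNCATEONLY);

-- Or shrink to an explicit target size in MB if you know what it should have been
DBCC SHRINKFILE (N'MyDatabase_Data', 10240);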

Disk space is mostly cheap, so unless you are extremely storage constrained, you should just live with the empty space. In Azure SQL this is even more complex, because the ghost cleanup process, which removes deleted records from pages, can take a long time to run. So it can be days or weeks before you can reclaim your space, if ever.

So Really, How Do I Shrink My Azure SQL Database?

You don’t. Space, even in the cloud, is really cheap. If you really want to make the database smaller, you can extract the data into a new database and then drop the original database. Congratulations, you just took downtime and spent several hours of your life to save $30/month. That doc from Microsoft recommends (with a warning) enabling auto-shrink on your databases. Don’t do that. Ever. It’s really, really bad for your performance. Relax and enjoy the fact that you don’t have to bug the SAN admin for space in the cloud.