Tech Field Day 26: ZPE Systems

I recently attended Tech Field Day 26 in Santa Clara. While we spent most of our time at the Open Compute Project Summit discussing CXL, we also got to meet with ZPE Systems, a hardware and software networking vendor. ZPE has a number of solutions in the edge and data center spaces that allow you to perform secure network management and, just as importantly, automate large-scale data center deployments.

In my opinion, there are two main concerns with building out network automation: security and device accessibility. ZPE aims to mitigate both of these risks. On the security side, they build a secure ring around your infrastructure to limit direct exposure. This ring security model can apply to the OS, but in highly secured networks, a secured network boundary layer can provide similar functionality; ZPE recommends a three-ring model, with their hardware solution as the outer boundary. They provide support for multiple network providers as well as the latest continuous integration and continuous deployment (CI/CD) patterns.

The real secret sauce of ZPE’s solution is that it integrates this automation with a central repository in their cloud service to support:

  1. Device (servers, network gear, storage) upgrades, setup, and patching
  2. Out-of-band access
  3. Access control
  4. Logging
  5. Monitoring

These configurations are all pushed from the centralized cloud data store. Conceptually, this is like a jump host, but much smarter, with a large set of connectors and support for automation processes. ZPE showcased their Network Automation Blueprint, seen in the slide above, which launches from their out-of-band devices.

Understanding CXL for Servers–Tech Field Day #tfd26 at OCP Summit

Last week, I was fortunate enough to attend Tech Field Day 26 in San Jose. While we met with several companies, this was a bit of a special edition of Tech Field Day: we attended the CXL Forum at the Open Compute Project conference. In case you don’t know, the Open Compute Project is a project from a consortium of large-scale compute providers, including AWS, Microsoft, Meta, and Google, among others. They aim to optimize hyperscale data centers in terms of power and cooling, deployments, and automation. So how does CXL fit into that equation?

CXL stands for Compute Express Link, a standard originally developed by Intel that now includes a large number of both cloud providers and hardware manufacturers. The CXL standard defines three separate protocols (definitions sourced from Wikipedia):

  • CXL.io – based on PCIe 5.0 with a few enhancements, it provides configuration, link initialization and management, device discovery and enumeration, interrupts, DMA, and register I/O access using non-coherent loads/stores
  • CXL.cache – allows peripheral devices to coherently access and cache host CPU memory with a low latency request/response interface
  • CXL.mem – allows the host CPU to coherently access cached device memory with load/store commands for both volatile (RAM) and persistent non-volatile (flash memory) storage

The main area of focus for cloud vendors like Microsoft and Amazon is CXL.mem, which would allow them to add additional memory to cloud VM hosts. Why is this such a big deal? Memory represents the largest expense for cloud providers, and memory requirements keep increasing.

Beyond that, supporting a mix of workloads means memory can become “stranded”. If you are a database administrator, you can think of this like index fragmentation, which leads to wasted space. Ideally, cloud vendors would like to completely disaggregate memory and CPU, which is one of the goals of CXL (memory being tied to a rack and not a specific host), but that will likely not occur for another 3-5 years.

However, CXL is real, and on-board CXL memory sockets are coming soon. The best explanation of CXL’s use cases I saw last week was from Ryan Baxter, Senior Director at Micron (Micron has some interesting solutions in this space). You can see a version of that talk here. Effectively, you can have additional memory on a server on a CXL bus (which uses PCIe as its transport mechanism); this memory will be slightly slower than main memory, but still much faster than any other persistent storage.

Another interesting talk was from Meta, who described their performance testing with CXL. Since the memory is remote, there is a performance cost, which was around 15% with no optimizations to their software. However, Meta wrote an application to perform memory management (on Linux), which reduced the overhead to less than 2%.
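Meta’s approach maps to how Linux exposes CXL-attached memory today: it typically shows up as a CPU-less NUMA node. As a rough sketch (the node numbers and workload names here are assumptions and will vary by platform), you can inspect the topology and steer a process’s allocations with numactl:

```shell
# Show the NUMA topology; a CXL memory expander usually appears as a
# memory-only node (memory but no CPUs), e.g. node 1 here.
numactl --hardware

# Prefer local DRAM (node 0), spilling to the CXL node only under pressure.
numactl --preferred=0 ./my_workload

# Or explicitly pin a cold, memory-hungry batch job onto the CXL node.
numactl --membind=1 ./my_batch_job
```

This is essentially the manual version of what Meta automated: deciding which pages are hot enough to deserve DRAM and which can live on the slower tier.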

You might imagine that a database engine would be aware of this remote memory configuration, and might age pages it did not expect to be reused out of main memory and into remote memory.

I learned a lot last week—hardware is still a very robust business, even though most of the focus is still on the needs of the cloud providers. CXL promises some foundational changes to the way servers get built, and I think it will be exciting. Stay tuned for more posts from Tech Field Day 26.

Using the Dedicated Administrator Connection with SQL Server in a Docker Container

I use a Mac (Intel; on Apple silicon you can only run Azure SQL Edge) as my primary workstation. While I have a Windows VM locally, and several Azure VMs running Windows, I can do most of my SQL Server demo, testing, and development work locally using Azure Data Studio, sqlcmd, and SQL Server on Docker. Docker allows me to quickly run any version or edition of SQL Server from 2017-2022 natively on my Mac, and it has been nearly 100% compatible with anything I’ve needed Windows for in terms of core database functionality. And then this #sqlhelp query came up this morning.

The one reference I found to “ForkID” on the internet was in this DBATools issue. Given that, and the fact that the tweet also referenced backup and restore, my first thought was to query sys.columns in msdb. So I did, and there were a couple of tables:

Because, as shown in the image above, the table in question is a system table, you need to use the dedicated administrator connection (DAC) in SQL Server to query it directly. The DAC is a piece of SQL Server that dedicates a CPU scheduler and some memory to a single admin session. It isn’t designed for ordinary use–you should only use it when your server is hosed and you are trying to kill a process, or when you need to query a system table to answer a Twitter post. The DAC is on by default, with a caveat: by default it can only be accessed locally on the server. That means being connected to a server console or RDP session on Windows, or, in the case of a container, shelling into the container itself. However, Microsoft gives you the ability to turn it on for remote access (and you should; DCAC recommends this as a best practice) by using the following T-SQL:

exec sp_configure 'remote admin connections', 1 
GO
RECONFIGURE
GO

This change does not require a restart. However, when I tried this on my Mac, I got the following error:

Basically, that’s a network error. In my container definition, I had only defined port 1433 as being open, and the DAC uses port 1434. If I were using Kubernetes for this container, I could open another port on a running container; in Docker, however, I can only do this by killing and redeploying the container.

docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=P@ssw0rd!' -e 'MSSQL_PID=Developer' -p 1433:1433 -p 1434:1434 -v /Users/joey/mssql:/mssql -d mcr.microsoft.com/mssql/server:2022-latest

I simply exposed port 1434 (via the second -p switch in the deployment script) and now I can connect using the DAC. Sadly, there was nothing interesting in sysbrickfiles.
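For reference, connecting over the DAC from sqlcmd uses the admin: prefix on the server name. This is a sketch using the sa password from my demo container; substitute your own credentials:

```shell
# The admin: prefix routes the connection to the dedicated admin endpoint.
# The SQL Browser service normally resolves the DAC port; in a container it
# isn't running, so specify port 1434 explicitly.
sqlcmd -S admin:localhost,1434 -U sa -P 'P@ssw0rd!' \
    -Q "SELECT name FROM msdb.sys.objects WHERE type = 'S';"
```

Remember that only one DAC session can exist at a time, so disconnect when you’re done poking at system tables.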

How to Remove a Data Disk from an Azure VM (How not to blow your leg off)

I was working with a client recently, where we had to reconfigure storage within a VM (which is always a messy proposition). In doing so, we were adding and removing disks from the VM. This all happened mostly during a downtime window, so it wasn’t a big deal to down a VM, which is one way to remove a disk from a VM via the portal. However, upon further research, I learned that through the portal you can remove a disk from a running VM.

For the purposes of this demo, I’ve built a SQL Server VM with two data disks and a single disk for transaction log files. The SQL VM images use Storage Spaces in Windows, which is a recommended best practice, but even if you are not using Storage Spaces, most of this will apply.

How To Identify Your Disk

This is the really important part of this post: how to identify which disk is which in the portal and within your VM. When you define a data disk in the portal, either you or the system will assign a LUN number to the individual disk. You can see it in the portal in the screenshot below.

This number is mostly meaningless, except that within Windows, it lets you identify the disk. If you open up Server Manager and navigate to Storage Pools > Physical Disk, you can see where this LUN number shows up.

That number maps back to the values you see in the Azure portal, and it is the only reliable way to tell the disks apart unless you size each of your disks differently (which you shouldn’t do, for performance reasons). If you aren’t using Storage Spaces, you can also see the LUN number in Disk Management in Windows, as shown below.

You can also get this information in PowerShell using the Get-PhysicalDisk cmdlet.
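A minimal sketch of that query (property availability varies by Windows version; on an Azure VM the LUN typically appears in the PhysicalLocation string):

```powershell
# List physical disks with the location string that includes the LUN number
Get-PhysicalDisk |
    Select-Object DeviceId, FriendlyName,
        @{ Name = 'SizeGB'; Expression = { [math]::Round($_.Size / 1GB) } },
        PhysicalLocation |
    Sort-Object DeviceId |
    Format-Table -AutoSize
```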

It is very important to ensure that you have identified the correct disk before you remove it.

Removing the Disk

Azure will let you remove a disk from a running VM, even if that disk is mounted in the VM and has data on it. Yes, I just did this on my test VM.

If you click one of those highlighted Xs and then click Save, the disk will be removed from your VM. There’s also a series of PowerShell commands you can use to do this. It is also important to note that at this point your disk is still an Azure resource. Even though you have removed it from the VM, the disk still exists and has all the data it had at the moment you detached it from the VM.
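The PowerShell path looks roughly like this (the resource group, VM, and disk names are placeholders for this sketch):

```powershell
# Fetch the VM object, drop the data disk from its model, then push the update.
# The managed disk itself is NOT deleted; it simply becomes unattached.
$vm = Get-AzVM -ResourceGroupName 'demo-rg' -Name 'sqlvm01'
Remove-AzVMDataDisk -VM $vm -DataDiskNames 'sqlvm01-data2'
Update-AzVM -ResourceGroupName 'demo-rg' -VM $vm
```

As in the portal, double-check the disk name (and its LUN) against the VM before running Update-AzVM, because the detach happens live.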

If you chose the correct disk to remove from your VM, and you have confirmed that your VM is healthy, you can navigate into the resource group for your VM where you will see your disks.

The important thing to note is that the state of the disk is “Unattached”, which means it’s not connected to a VM, so it can be deleted from Azure. I don’t recommend doing so until you have validated that your VMs are running as expected.

You may ask how you can prevent disks from being removed from running VMs. I’ll write a post about that next week, but while you are waiting, read up on resource locks in Azure.
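As a preview, this is the general shape of a resource lock in PowerShell (the lock and resource group names are placeholders). Note the distinction between the two lock levels: CanNotDelete blocks deleting the resources, while a ReadOnly lock is what blocks modifications such as detaching a disk from a VM:

```powershell
# Prevent deletion of everything in the resource group,
# including unattached managed disks, until the lock is removed.
New-AzResourceLock -LockLevel CanNotDelete `
    -LockName 'no-delete-sqlvm' `
    -ResourceGroupName 'demo-rg'
```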

Azure SQL Managed Instance versus Amazon RDS for SQL Server—Which Should You Choose? (Or why Managed Instance is faster)

Microsoft, in conjunction with Principled Technologies, recently produced a benchmark comparing the performance of Azure SQL Managed Instance and Amazon RDS for SQL Server. I normally really dislike these benchmarks: it can be really hard to build proper comparisons, and the services frequently don’t have perfectly equivalent service tiers, making their performance really hard to compare. In fact, when I was reading this benchmark, I saw something in the comparison of the two services that made my eyes light up. And then I realized it was a limitation of RDS.

I immediately saw that Azure MI had 320,000 IOPs while AWS only had 64,000. Obviously Azure is going to crush any database benchmark with that difference. And then I did a bit more research and visited the AWS docs.

You’ll note that while Oracle RDS does get up to 256,000 IOPs (I guess those customers have more money), SQL Server on RDS has a maximum of 64,000 IOPs. Needless to say, in this benchmark comparing the price/performance ratio, Managed Instance crushes RDS before you even add in the hybrid licensing benefits that Microsoft supports for Azure SQL services.

But Wait There’s More

While Managed Instance is by no means a perfect service, there are a number of reasons why I strongly recommend against running your database on RDS. Here are the main ones:

  • You can’t migrate a TDE encrypted database using backup and restore—you have to extract a BACPAC and import into a database in the service
  • The native backup solution doesn’t support restoring a database to a point in time.
  • You can’t deploy cross-region, meaning there is no near real-time option for disaster recovery
  • There is no instant file initialization which can make some restore and file growth operations extra painful

These are the major concerns, and an additional licensing concern (not being able to use Developer Edition for your workloads) means your overall costs to run an environment are going to be a lot higher if you use RDS.

When Not to Choose PaaS?

While RDS has a lot of costs associated with it, and performance is limited, there is a price/performance/data-volume curve that I feel applies to both the Azure and AWS platforms. If you need high-end storage performance on Managed Instance (which means using the Business Critical service tier) and your data volume is more than a terabyte, you have to scale your Managed Instance to 24 cores. If your volume is more than 2 terabytes, you need to scale to 32 cores, and if you need more than 8-16 TB, you will need to scale to 80 cores, which will cost close to $30,000/month. I perfectly understand why this is the cost model: the storage is local to the VM itself rather than remote, so Microsoft can’t put other VMs on that piece of physical hardware.

What Should You Do for SQL Server on AWS?

If you need to run SQL Server on AWS, what should you do? The answer is to use EC2 VMs. Sure, you lose the minor benefits of having your servers patched and the limited benefits of the backup feature, but you have more granular control over your I/O performance and overall configuration.

TL;DR: Azure SQL Managed Instance delivers a lot more storage throughput than Amazon RDS for SQL Server, so your workloads will run a lot faster on Azure.

Taking Your Azure Active Directory Security to the Next Level

What if I told you just using multi-factor authentication (MFA) wasn’t enough anymore? The Lapsus$ hacking group, which was at least partially made up of a group of teenagers in the UK, took a very targeted approach. They used credential stuffing to try to breach the password credentials of power users within the organizations they were targeting.

Once they identified the passwords (through a variety of tactics, but mostly password reuse; use a goddamned password manager), they sent hundreds of MFA requests to multiple users. While a single user may have the discipline to ignore a series of MFA approvals they didn’t prompt themselves, the odds are that someone will eventually approve one if you send requests to several people many times. Once that happens, and the attackers have an admin token, they can move laterally, secure command and control, and do all manner of other bad things around credentials.

If this sounds scary (and it does to me, who is by far not an expert in all things security, but knows a little bit), you may ask: what are some alternative solutions? The answer to that question is FIDO2, a different protocol for MFA and authentication. Remember all of that stuff Microsoft talks about with passwordless login? That’s all based around FIDO2. I configured this for DCAC’s Azure Active Directory yesterday, and I want to walk you through the steps.

Step 0 is to acquire a FIDO2 key for you and/or your team. I have a YubiKey 5C, but there are others you can consider.

After that, the first step was to go to the Azure Portal and navigate to Azure Active Directory authentication methods.

Click on FIDO2 security key; even though it shows as enabled here, it is not enabled by default. When you click on the text, you will see the next screen.

In this case, I enabled it for All Users. This doesn’t mean they have to authenticate using this method, just that they have the option to. You also have some advanced options that go above my pay grade and are not happening in DCAC’s AAD.

Following this, I configured my MacBook to use my YubiKey as an authentication method. I followed the guidance on their site here. After doing that configuration, I was ready to make the change to my account. Navigate to myaccount.microsoft.com and select Update Info under Security info.

Once there, you can add a method, which is called “security key” here. I think this can be done globally in your org, but for this basic trial, I just enabled it for myself.

So that’s all of the prework you need to do. Now, let’s walk through logging into the Azure portal. You have to change the options in the portal as shown below:

Once you have selected Sign in with security key, you will be prompted to choose the key. I was also prompted to touch the key (not captured in this screenshot) and then to enter the PIN I created when configuring the key.

Once you have entered your PIN, you are authenticated to the Azure portal without using a password. What I would love to see, but haven’t been able to configure, is the ability to set a conditional access policy in AAD where logins from untrusted locations are required to use a stronger level of authentication, like a security key.

Passing AZ-104–Azure Administrator

This Monday, I took and passed the Azure Administrator (AZ-104) exam. It was a little unusual for me to take this exam, as I’m already an Azure Solutions Architect, but as part of the new Microsoft partner requirements, I had to take it, even though it’s a subset of what’s on the architect exams. Full disclosure: I didn’t study at all for this exam. I’m not saying that to brag, but if you are very experienced with Azure, particularly IaaS and Azure Active Directory, you can probably pass this exam cold.


Obviously due to NDA, I can’t disclose any questions on the exam, but I can review some of the high level topics you need to know. Some of the topics covered on this exam included:

  • Azure Virtual Machines
  • Azure Storage
  • Azure Networking (understand the various load balancer services)
  • Azure Active Directory user security
  • Azure Monitoring
  • Azure Policy

I didn’t feel like there was significant depth or advanced questions on any of these topics. Networking comes up a lot in all of these exams (and in my day-to-day work with Azure, it’s exceedingly important). This is a good exam to take if you are just learning Azure and want to validate your skills. If you are more advanced, I would focus on the architecture exams, unless you have to take this one for Microsoft partner reasons.

PREEMPTIVE_OS_FILEOPS Waits and Filestream Restores

We had a case over the weekend where our automated restore process at a client got hung up on this wait type for a single database. What was unique about this relatively medium-sized (200-300 GB) database? It had a lot of filestream data. The file count didn’t seem that high, but my guess is the filestream data made up the majority of the data in the database. When the job hung, the restore had been waiting on PREEMPTIVE_OS_FILEOPS for over a day and still had a NULL value for percent complete.

One interesting thing that happened was that after I attempted to kill the restore process, it remained in place. My restore task was running in SQLCMD, so I went a step further and killed the SQLCMD process on the server. The SPID in the database stayed alive, and since it was a non-production environment, and a weekend, I restarted the SQL Server service.

Per SQLSkills, this wait type is “a generic wait for when a thread is calling one of several Windows functions related to the file system”. More commonly, you see it at the end of a backup, when SQL Server is growing a log file, which does not benefit from instant file initialization (IFI). In our case the server did not have IFI enabled, and I suspect this was one of the contributors to the problem. After we enabled IFI, the restore completed in just under three hours.
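As a quick check, on SQL Server 2016 SP1 and later you can see whether IFI is enabled for the engine service directly from T-SQL:

```sql
-- instant_file_initialization_enabled is Y/N per service.
-- IFI requires the engine service account to hold the
-- "Perform Volume Maintenance Tasks" privilege in Windows.
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;
```

To dig into the filestream behavior, I built a small repro database: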

EXEC sp_configure filestream_access_level, 2
RECONFIGURE
GO

CREATE DATABASE Filestream
ON
PRIMARY ( NAME = Arch1,
FILENAME = 'c:\data\archdat1.mdf'),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM( NAME = FSData,
FILENAME = 'c:\filestream')

LOG ON ( NAME = Archlog1,
FILENAME = 'c:\data\archlog1.ldf')
GO

create table fs_table
(id INT IDENTITY (1,1),
UI UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
FS_Data varbinary(max) filestream NULL)

Those are the database objects–I then used PowerShell to create some files, and generate an insert script.

$i=1
while ($i -le 10000)
{new-item -ItemType file -Path "C:\fstemp\" -Name Fs$i.txt -Value "Text file $i"; $i++}

$i=1
while ($i -le 10000)
{add-content C:\temp\fsinsert.sql "`nINSERT INTO [dbo].[FS_Table] (UI, FS_Data) VALUES (NEWID(),(SELECT * FROM OPENROWSET(BULK N'C:\fstemp\FS$i.txt', SINGLE_BLOB) AS Image$i))"; $i++}

Those are just a couple of loops that create 10,000 files and then generate insert statements for them. The files are very small (13-14 bytes), but it makes a representative test. I kicked off a restore and ran Process Monitor to see what SQL Server was doing. SQL Server first queried the filestream directory for each file.

SQL Server is doing four operations for each file:

  • Create File
  • Query Standard Information
  • Write File
  • Close File

Since my files are very small, this happens very quickly, but it has to happen thousands of times, which gives me enough time to observe this behavior. You should note that I was able to control for IFI being enabled: there was a performance improvement that I think is related to the number of operations. Instead of doing the four operations per file, SQL Server appears to only create and close each file. Performance on my test instance was inconsistent, but I was working in a constrained VM (my CPU fan has been running all morning).

I suspect this restore process would be impacted by either a large number of files or a large volume of filestream data. This can be confusing, because even though the restore process is running, that work isn’t reflected in the percent complete for the restore.

I hope you learned something in this post; I know I did. Also, don’t #$%^ing store files in your database, unless you like hitting yourself with a hammer.

Why You Shouldn’t Use Amazon RDS for your SQL Server Databases

Disclaimer: I’m a Microsoft MVP and a Microsoft shareholder, but neither of those things affected my opinions in this post.

Cloud vendors have built a rich array of platform as a service (PaaS) solutions on their platforms. They market these heavily, because they have higher degrees of stickiness compared to IaaS offerings (and in many cases they likely have higher profit margins), but they also have key benefits for the users of the platform. Because a PaaS solution is fully managed by the cloud provider, these services tend to have a set of common features:

  • Easy to deploy–you are never running setup.exe
  • Built-in high availability–no need to configure a cluster
  • Easy disaster recovery/geo-replication–usually in most services it’s just a few clicks away
  • Automated backups
  • Automated and possibly zero downtime patching

While some services include other really useful features (for example, the query performance data collected by the Azure SQL Database and Managed Instance platforms), I wanted to focus on the common value adds of PaaS systems across providers. I made the last two of these bold because I feel they are the most important, especially in scenarios where the vendor doesn’t own the source to the application. Like Amazon RDS for SQL Server.

Amazon RDS Backups

I’m writing this post mainly because of what I learned this week about backups for SQL Server on RDS. I was on some client calls this week when I learned that the default backup approach is volume snapshots (which I knew), but what I didn’t know was that you can’t restore an individual database from these default backups.

I feel like their docs are deliberately vague about this: it isn’t clearly obvious that this is the case, but a few Stack Overflow threads and discussions with fellow MVPs confirmed what I was told on the call. Amazon does support taking your own backups on RDS, and you can then restore individual databases to individual points in time, but where’s the fun (and more importantly, the value proposition) in that? To me, this really eliminates one of the biggest benefits of using a PaaS service. AWS refers to normal backup/restore as “native backup/restore” in its docs.
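Those native backups are driven by stored procedures in msdb that write to, and restore from, an S3 bucket you wire up through an option group. A rough sketch (the database name and bucket ARN are placeholders):

```sql
-- Back up a single database to S3
-- (requires the SQLSERVER_BACKUP_RESTORE option group on the instance)
EXEC msdb.dbo.rds_backup_database
    @source_db_name = 'MyDatabase',
    @s3_arn_to_backup_to = 'arn:aws:s3:::my-bucket/MyDatabase.bak';

-- Check the progress of the backup task
EXEC msdb.dbo.rds_task_status @db_name = 'MyDatabase';
```

In other words, you end up scheduling and monitoring your own backups, which is exactly the work a managed service is supposed to take off your plate.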

Patching

Azure SQL Database and Managed Instance both seamlessly patch your databases without your knowledge. For the most part, those services utilize hot patching, which means in many cases there isn’t any downtime to install the Azure equivalent of a CU (owning the source code has its benefits). Amazon RDS can automatically install selected CUs, but you should be aware of that option when you deploy your instance.


"ValidUpgradeTarget": [
    {
        "Engine": "sqlserver-se",
        "EngineVersion": "14.00.3192.2.v1",
        "Description": "SQL Server 2017 14.00.3192.2.v1",
        "AutoUpgrade": true,
        "IsMajorVersionUpgrade": false
    }
]

The other thing you should note is that you may not always have the most current CU available to you. Currently, AWS supports CU12, which is three CUs behind current. However, our customer was only on CU8, so patching doesn’t seem to be as automatic or easy as it is on the Azure side.
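You can pull the available versions and upgrade targets (like the JSON fragment above) yourself with the AWS CLI; this is a sketch, and the engine value matches the Standard Edition example shown earlier:

```shell
# List SQL Server Standard Edition engine versions on RDS and, for each,
# the versions it can be upgraded to
aws rds describe-db-engine-versions \
    --engine sqlserver-se \
    --query 'DBEngineVersions[].{Version: EngineVersion, Targets: ValidUpgradeTarget[].EngineVersion}'
```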

Licensing

This one really isn’t AWS’s fault (it’s Microsoft’s), but there are a couple of issues with licensing RDS. The first is that you don’t have the option of running Developer Edition for non-production workloads. Especially if you are running Enterprise Edition, that represents an expensive choice: you either need to license your dev/test environment for Enterprise, or run Standard Edition in dev to save some money but not have adequate features in place (though the code base is mostly the same, performance characteristics can be dramatically different). Additionally, you cannot bring your own SQL Server licenses to RDS; you have to lease them through AWS. Neither of these problems is entirely the fault of AWS, but they still suck.

AWS is an excellent cloud platform, even though I actively hate their console. For the most part, a lot of its components are very similar or even better than Azure’s. However, when it comes to a service where Microsoft owns the source code and AWS doesn’t, you can see the clear superiority of Azure. So what is an AWS shop that runs SQL Server to do? IMO, the backup/restore thing is a deal breaker; I would just recommend running in an EC2 VM.

Fixing SQL Server Database Corruption (when you get lucky)

First things first–if you are reading this, and not regularly running consistency checks on your SQL Server databases, you should drop everything you are doing and go do that. What do I mean by regularly? In my opinion, based on years of experience, you should run DBCC CHECKDB at least as frequently as you take a full backup. Unless you have a clean consistency check, you don’t know if that last backup you took is valid. SQL Server will happily back up a corrupted database.
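If you just want the minimal habit, run something like this on at least the same schedule as your full backups (the database name is a placeholder):

```sql
-- Fail loudly: report every error, skip the informational chatter
DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

A clean run returns nothing; any output at all means you should be reaching for your backup chain, not ignoring the job history.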

Screenshot of a failed checkdb, followed by a successful backup command

I cheated a little bit here and used an undocumented command called DBCC WRITEPAGE to corrupt a single 8 KB page within a non-clustered index on a table I created. You should basically never use this command, unless you are trying to corrupt something for a demo like this, but as you can see, after we’ve corrupted the page, CHECKDB fails, and SQL Server then happily takes a backup of our now-corrupted database.

What Causes Database Corruption?

Other than doing something terrible, like editing a page, database corruption is mostly caused by storage failures. An example of this could be your local SAN, where the SAN’s operating system acknowledges to the host operating system that a write operation is complete, but the write doesn’t actually complete. SQL Server received the write acknowledgment and thinks the data was correctly written to the page, but for whatever reason it wasn’t. I had this happen a couple of times in a past job when the SAN in our “data center” (it wasn’t one) crashed hard after the building lost power (yeah, we didn’t have a generator, hence the quotes; be careful who you buy your medical devices from). What was actually happening is that the SAN was acknowledging a write as soon as the data hit the SAN’s memory, which is a performance enhancement that assumes you have a proper power infrastructure to prevent the SAN from crashing hard. You know what happens when you assume, right?

Anyway, this is far less common than it used to be, for a number of reasons, one of which is the use of cloud-based storage, which is very robust in terms of data protection. Also, modern enterprise-class SANs are more efficient and less likely to have failures like this. However, it’s still very possible. I had a minor corruption event in an Azure VM a couple of years ago, and we had a customer who filled up their very non-enterprise-class SAN, with terrible results (all of the corruption). So the moral of the story is: wherever you are running SQL Server, you need to run CHECKDB (except Azure SQL DB, and possibly Managed Instance).

Fixing Corruption

There are a lot of tools that people on the internet will try to sell you to fix your database corruption. Almost all of them are crap. If you have corruption in a table, a clustered index, or, worse, one of the system pages that determines allocation, you are screwed and need to restore your last good backup (see why backup retention matters?).

However, in some cases you can get lucky. If your corruption is limited to a nonclustered index, you don’t need to restore the database; you can just rebuild the index.

However, in my case that just threw the dreaded SQL Server 824 error. I suspect this had something to do with how I corrupted the page, but that investigation is not complete. I was able to disable the index and then rebuild it, and we had a successful CHECKDB.
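The disable-then-rebuild pattern looks like this (the table and index names are placeholders from my demo):

```sql
-- Disabling a nonclustered index deallocates its pages,
-- including the corrupt one
ALTER INDEX ix_demo ON dbo.fs_table DISABLE;

-- Rebuilding recreates the index from the clean base table data
ALTER INDEX ix_demo ON dbo.fs_table REBUILD;
```

Follow up with another full CHECKDB to confirm the database is actually clean before you trust any new backups.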