I’ve posted about ransomware a few times before. At DCAC, we’ve worked with a few customers who’ve fortunately been able to help them recover from their attacks. But ransomware attacks are trending upward, as seen below–there’s a ransomware attacker that has pulled in $10 million dollars this year (this year being so far in 2020. 7 days in). This had been a number of announcements about ransomware attacks in recent days– a small company that went bankrupt and money exchange Travelex who was recently attacked.
Travelex's reason for not disclosing that this is a ransomware attack is because 'the police advised against it as it may harm investigation'. Yet the Met were happy to confirm it. Questions about how transparent the company is being will be asked.
The ransomware that struck New Orleans government prevented cops from accessing and sharing data, and according to this report, enabled a serial flasher to continue to assault women for weeks before his eventual arrest https://t.co/Ulb6HStY1cpic.twitter.com/qV4ritvzcG
Ransomware attacks are quite different from traditional hacking. Traditionally hackers are in pursuit of high value data, whether it be from a government agency or a large enterprise business, or someone with a vast array of customer data like Equifax. Ransomware is targeting easy targets, that the attackers think will pay, or hope they will pay. What that means is that if you work for a smaller firm, you are far more likely to be attacked with ransomware than you would be in a traditional hacking attack. Theoretically, large enterprises have more network segmentation, and better security controls, that make them less vulnerable to these attacks. As the the last year has shown, this is definitely not always the case.
It’s All About the Network
The way ransomware typically works, is to find a vulnerability via user accounts, in order to run exploit software on your network. If the only thing that can be reached on your network is user PCs, that’s the only data that’s going to be encrypted. Where organizations get exploited by this, is that their user network is either directly connected to their business network, or there is just one flat network structure. This means if a user is attacked (typical attack vectors are emailing infected office documents or PDFs, but others do include more advanced attack vectors like the aforementioned Pulse VPN vulnerability) the ransomware can go after file servers and domain controllers and start encrypting all of your files.
One of the most important approaches to this, is that in order to build your network structures appropriately, you need to assume that your user network is going to be breached. It’s what Microsoft does with Azure, and it’s what you should do on your network as well. This means you need to do a few things:
Separate accounts, and maybe even a separate domain for your all of your servers
Disconnected backups, you can encrypt a safe full of tapes
Frequent restore testing
Basically, if you can connect to a file or database server from your corporate laptop without multi-factor auth, changing credentials, and/or network, you’re at the mercy of your dumbest user to not launch an infected Word file
All of these controls make our jobs harder, and it sucks, but it’s what we have to do. The one benefit of modern technology is that the cloud makes all these things approachable to even the smallest of businesses. I can build a segmented network in about 20 min in Azure, whereas it would take me a few hours to find the the right equipment at the right price point from a networking vendor if you were trying to do it on-premises.
I’m Just the DBA, What Can I Do?
I get it, and I’ve been there–you’re the DBA at a company where IT isn’t exactly the priority. There’s a few approaches you can take–the first is the most self-serving for me, is to engage a consulting firm like DCAC. While you know what you are talking about, your management sees you as a worker bee, and may not listen to your complaints about having an open WiFi network that your regulatory agency can logon to and see the public file servers (yes, this actually happened to me). However, when they are paying a consultant $ALot/hr they tend to be more receptive of taking advice, even if it’s the same thing the DBA has been talking about for years.
Another approach is to hire a penetration testing firm–many large organizations are required to do this by regulation, and they do a good job of identifying vulnerabilities in a firm.
Finally, and the easiest thing to execute is to PATCH YOUR SHIT. This is where you can assist the most–you can patch Windows and SQL Server (or Linux if that’s your bag), and you can work with the sysadmin teams in your organization to ensure that all OS, application, and mouse software is patched regularly. It’s not perfect, and it doesn’t fix all of your problem, but it’s a start. Also, test your god damned restores, it’s not enough to have backups if you don’t know that they work.
We have a customer who is moving platforms, and as part of this, I’ve been tasked with testing a lot of various storage configurations. This means lots of utilization of DiskSpd, which is a disk performance benchmarking tool from Microsoft. We could argue about the value of synthetic disk benchmarks, but they are good for testing a variety of disk configurations with a standardized tool for comparing results. It also has the benefit of add the runtime configuration into the results file. So as long as you have your results file, you can know what parameters you ran the test with. (You do have to document your disk configuration–we are using the name of our output file for this).
Anyway, I have a bunch of these files, and I needed to get the data into Excel. Since I was too lazy to figure out how to parse a text file in C#, my first thought was to use some combination of sed, awk, and grep in a bash shell. I reached out to my friend Anthony Nocentino (b|t) about his thoughts on the best way to do this, and he immediately said PowerShell.
When I asked about how to do things I wanted to do with specific bash commands, he mentioned the fact that I could use bash statements that supported standard input and output in PowerShell. The linked blog shows how to do this in Windows, however I wrote all of this code in PowerShell natively on my MacBook Pro.
As you can see, I’m passing output in my $Results variable to a grep to give me the first match of the word “total” and then using sed to do a couple of find and replace commands to make parsing the file a little bit easier. After I’ve done all that, I split the array into a comma delimited set of results, and output it to a CSV file. This allows you to grab the results, with headers and open then in your favorite spreadsheet software. For posterity, the code is available at in our GitHub repo here.
I was at Microsoft Ignite last week, and I spent the better of part of 3 days working in the “SQL Server” booth. I put SQL Server in quotes, because in 2019 that term is stretched. Our booth covered SQL Server 2019, Arc (which you should read about because it’s really cool, but is the subject of another post), SQL on Edge, SQL Server Big Data Clusters, Azure SQL Database, and Azure SQL Database Managed Instance, and a few other things that I’m too tired to type. And that doesn’t even include Microsoft Synapse (which is the artist formerly known as Azure SQL Data Warehouse)
My big takeaway from working the booth, was how most of the customers I spoke to, didn’t understanding the differences between the various platform options. In fact, I think I took the following question over 100 times: “We’re currently on-premises, but moving to the cloud, can you explain the difference between Azure SQL Database, Managed Instance, and SQL Server in a VM?” The fact that we had to answer that so much, isn’t a failing on the part of the customers, as it is Microsoft and us MVPs not communicating the message effectively. So, in order to do my part, I’m going to write a post (I also wrote a Redmond column last month, but this post will go into some more detail). I’m not going to talk about pricing in detail in this post, as it’s effectively the same as the two services.
First Things First—Do you Know What PaaS Means?
Azure SQL Database and Managed Instance are both known as Platform as a Service (PaaS) offerings. This means you the DBA will manage a lot less than you typically would on either an on-premises SQL Server, or even a VM running SQL Server in Azure on the Infrastructure as a Service (IaaS) offering. If you run one of the PaaS offerings, Microsoft is responsible for the following tasks:
Backup of your database and transaction logs
High Availability within your Azure region, with a financial guarantee (the number of 9s depends on config, but it’s pretty solid)
Installation, configuration, and patching of SQL Server software
Installation, configuration, and patching of Windows server
Consistency checks of your database
Performance monitoring of your database (This is configured for you, you still have to react to the numbers)
What tasks does that leave you to do?
Configuring disaster recovery across Azure regions—while this is just a few clicks, you get to convince your boss to pay for it, and it’s not turned on by default
Performance tuning—getting utilization down by query optimization can save your company money. It’s a quantifiable part of your job now
Network architecture—you need to be able to make sure your virtual networks in Azure can all talk to your database services, and in a secure fashion
Helping your business derive value from the data you are storing in the database
Ok, Tell Me About Cloud Offerings
I’m going to start out by talking about running SQL Server in an Azure VM. Why would you choose to do this, when the aforementioned PaaS services do all of this magic for you? The main reason I can think of is vendor support. Your business runs some high quality ISV software that requires you to remain on SQL Server 2012 CU1 (just kidding, you know it’s actually SQL 2005 RTM, and Windows 2003), and you can only run this mission critical conference room projection software on that version of SQL Server.
The other reasons why you might choose to run SQL in a VM besides vendor support (which is common) is that the size of your database doesn’t work well with the current limits in Azure PaaS (which is 8 TB in Managed Instance, and effectively unlimited in Azure SQL Database Hyperscale), or that you are willing to trade off some availability for reduced cost of running in a standalone VM.
What do I mean by costs? While PaaS offerings include high availability, that’s something Microsoft has to include in your pricing (otherwise, their shareholders, including me, would murder them), so compared to a VM that is standalone, the PaaS offering will commonly be more expensive. To get a better cost estimate, you should really multiply your IaaS costs by 3. (Also, VMs have pretty good availability—it can be enough, especially for departmental apps).
Azure SQL Database/Azure SQL Database Hyperscale
This is truly one of the OG Azure offerings, dating back to 2009/10. It is really designed to be a standalone database offering. Think of an app, that needs a single database, and wants minimal effort around spinning that up, and you have the right answer for Azure SQL Database. The cost for getting started is also really cheap at $5/month for a basic DB (it can be way, way more than that, the P15 database is $15k USD/month, and there’s a new M-series offering that will be even more). Effectively, an Azure SQL Database is a single user database running in a SQL Server instance somewhere in the cloud. There’s a feature called elastic pools that confuses this a little bit.
Elastic pools are a group of Azure SQL Databases that share the same resources. This is effectively a SQL Server instance somewhere in the cloud that hosts multiple user databases. The elastic pool feature is mostly designed for applications where each customer gets a copy of the same database, and workload is mixed at different times of the day, so resources are utilized most effectively.
Azure SQL Database has some tools called elastic query, elastic transactions, and elastic jobs, that allow you to perform actions across multiple databases (elastic is a bit of a misnomer here, there’s no direct connection to elastic pools), but in general cross-database queries using classical three part naming database.schema.object do not work in Azure SQL Database, so if you need to joins or complex queries across databases, this probably isn’t your cloud product.
Way back when Azure SQL Database was size limited to 5 GB, then ultimately 4 TB, until the hyperscale architecture was developed. Hyperscale is a completely model for physical database infrastructure design (read more about it here, it’s really cool) and should be able to scale to near petabyte scale. If you’ve got more than that much data in a single database, we really need to talk. The one challenge with SQL DB and database size, is that you can’t restore to it, you have to use a database export and load data using a BACPAC, so it can be very slow for very large databases. My guidance here is that if you have a very large on-premises database that you’d like to move, Managed Instance may be a better choice, as the migration path is easier. If you are building a new application in the cloud, Azure SQL Database is frequently the best choices.
There are a few other limitations of on-premises features that Azure SQL Database doesn’t support. Those are as follows:
Common Language Runtime (CLR)
Change Data Capture
Linked Servers* (sort of—you can connect to external data sources)
Replication Source (you can publish to, but not from an Azure SQL DB)
SQL Server Agent
Of those features, the biggest one is SQL Server Agent. There are a number of different ways to schedule jobs to run in Azure, the most common being Azure Automation, which allows for PowerShell or Python. You can also have an Agent job on a SQL VM running full-blown SQL Server, though that seems less practical.
I mentioned new application development as a target for Azure SQL Database. Since, SQL DB allows for your database to be a replication target, it can also be a good use case for reporting, particularly for field users who are afar from the home office. You can leverage either geo-replication, or just have multiple replication targets, and allow field users to report against them.
From a network perspective, your SQL Database is on the internet, with a firewall in front of it (either at the server or the database level). You can also elect to only allow traffic from a Virtual Network within Azure, however your database is still on shared networking infrastructure (albeit secured).
Azure SQL Database Managed Instance
This service came much later, being announced at Build in 2017, and going GA last year. Managed Instance was specifically designed to make it easier to move from on-premises SQL Server to a PaaS services. One of the initial differences was that Managed Instances uses a virtual core-based pricing model (Azure SQL Database still offers a legacy DTU based model), so you pay a price per core, depending on your tier.
The two tiers of Managed Instances (and vCore based Azure SQL Database) are general purpose and business critical. The two key differences between these tiers are performance and high availability models—general purpose (GP) uses networked SSDs and will generally have IO latency in the 5-10 ms range. If you are using GP, you need to read this blog post on how to optimally configure your storage, as other it can be really painful. GP also uses the underlying storage model for its high availability. Business critical (BC) uses the local SSD that’s in each server host for storage, which is lower latency than a network call, and effectively uses an AlwaysOn Availability Group with four replicas for its availability model. You can also query the secondary replica on BC, by using the application intent=read only connection syntax in the SQL client.
In terms of migration, you can restore a backup (from Azure Blob Storage using the RESTORE FROM URL syntax) into Managed Instance. For a more real-time migration strategy, you can use the Data Migration Services tool which will allow for an almost zero-downtime migration. There’s been talk of allowing log shipping, or at least “log shipping” (via an API), but last I checked (and I’ll check again) you had to use DMS.
Features in MI are much closer to on-premises SQL Server. The big gap closers between MI and SQL DB are that the SQL Agent is there, CDC is there, and cross-database queries can be performed. The following list is the features that are included in Managed Instance and not in SQL DB:
Linked Servers (Only to MI and other SQL Servers)
Service Broker (within instance)
The gaps to on-premises are as follows:
SQL Server Analysis/Integration* (there’s a way to do this with Azure Data Factory, but it costs money)/Reporting Services
There are a few other differences, but I went for the ones that I see as most commonly affecting customers. One thing to note for both MI and Azure SQL DB is that they rely on Azure Active Directory and not on-premises Active Directory to provide authentication. I hadn’t seen this as a big gap with my customers, however it came up a few times at the booth last week, and Microsoft is researching options in this space.
The other limitation Managed Instance currently faces, is that there is a maximum of 8 TB database total size currently. The Managed Instance team is in the process of implementing hyperscale, and this limit will go away when that happens, which will likely be sometime next year.
So, I’ve almost written 2000 words on this. So therefore it needs a tl;dr
Azure SQL Database—Use It If
You are building a new cloud based application
You want to build a reporting option for field users
You are building a customer facing data platform and you want simplified networking and auth
Azure SQL Database Managed Instance—Use it if
You are migrating 3rd party ISV apps to a PaaS model
You want a PaaS model, but don’t want to change any of your code
Your app depends on SQL Agent, or cross-database queries
SQL Server in a VM—Use It If
you are stuck on an old version of SQL Server
You have to use on-prem Active Directory for authentication
99.9% availability is good enough for your application
You are one of those people who run SSAS/SSIS/SSRS/R Services/RDBMS/DQS/MDS all on the same server to maximize every penny of your licensing
A good policy in your technology career, or life in general, is to, to quote a former president, “don’t do stupid shit”. If we look historically at major IT outages or data breaches, there’s always some universal tenants–some piece of critical infrastructure was left completely unsecured, or someone put the private keys on Github. In fact, just last week I wrote my column for Redmond Mag, on some common patterns, mostly specific to SQL Server, but applicable to many other types of applications. I wanted to call it Data Breach Bingo, but I didn’t enough space in my column for 25 vulnerabilities.
While we’re here, it’s important that when thinking about security, you think about the basics first. The GRU operatives working for the Russian government likely aren’t targeting your data–it’s some bot network looking for a blank SA password running on a SQL Server that’s got port 1433 open to the internet. If you are on twitter, I highly recommend following Swift on Security a rather serious parody account (of Taylor Swift) that focuses on good security practices. That twitter account also hosts an excellent website called DecentSecurity.com that talks about the basics of how you should be securing your personal and business environments. While you might not be protected from the GRU, you can avoid most of the other “hackxors” in the world that are really just dumb bots.
The reason why I wrote this post, was that my colleague Meagan Longoria (b|t), shared with me a post-mortem from a ransomware attack on the Department of Transportation in the state of Colorado. I would really like to applaud the state for putting out this document, even though it doesn’t paint the organization in the best light. I’m going to use this space to talk about some of the things highlighted in the report.
“A virtual server was created on February 18, 2018. The virtual server was directly connected into the Colorado Department of Transportation (CDOT) network, as if it was a local on premise system. The virtual server instance also had an internet address and did not have OIT’s standardized security controls in place.”
Ok, this sounds pretty much like they deployed a VM to either Azure or AWS, and they had a network connection back into their on-premises network. This in and of itself is not problematic, but not applying security policy in the provider is a problem. Both Azure and Amazon have standard security templates and policies that can be configured and applied at deployment.
“The account utilized to establish the connection into the CDOT network was a domain administrator account – this is the highest level privileged account, and means that 1) the account cannot be disabled for too many failed login accounts, and 2) it provides the highest level of access to the agency domain controllers (gatekeepers for all access to everything in the department).”
There’s a lot going on here. I read this as that they joined the server to the Active Directory domain, and used a domain admin account to perform the join. Neither of these are problematic. The next sentence is wrong–domain admin accounts can absolutely be disabled, however this indicates to me, that the department had the “never disable this login” checkbox check. Bad move. Also, in an ideal world, you would never have a domain admin log into member servers, so its password was never in kernel memory, but that’s not how this (or many) hacks work.
“Later, OIT was informed by the vendor that when an external IP address is requested, the vendor automatically opens the Remote Desktop protocol to the internet. The Remote Desktop protocol is how this attack was initiated.”
This is my biggest complaint about both Amazon and Microsoft Azure, especially in the timeframe of this attack, both services defaulted to having a public IP address and didn’t actively discourage public ports from being opened. While it’s a totally bad practice that should be enforced by policy, up until a few months ago on the Azure side, it was very easy to do.
“An attacker discovered this system available on the internet, broke into the Administrator account using approximately 40,000 password guesses until the account was compromised. From there, the attacker was able to access CDOT’s environment as the domain administrator, installing and activating the ransomware attack.”
This is a statement that will sound somewhat impressive to lay people. 40,000 password guesses sounds like a whole lot, however for an automated script that would take seconds. This means they either had approximately 3 character passwords (I’m guessing they didn’t), or the domain administrator was using a common dictionary based password. You can use add-ons to Active Directory to limit the types of passwords users can use, and while this may be painful for normal users, it is absolutely critical for high privilege users like DBAs or System Admins. They should also have separate user accounts for logging in as domain admin, and multi-factor authentication configured.
This is also the part where I would usually say you should have network segmentation configured to block the malware from spreading, but if someone pwns your domain controllers all the segmentation in the world won’t help that much. An attacker could even go as far to install their malware via group policy. Good luck cleaning that one up.
The state of Colorado had a really robust backup solution, and some good network practices (smart firewalls) that were able to limit the spread of the malware. So they were able to recover from this attack relatively quickly. It’s important to worry about big things, but doing all of the little things right and most importantly not doing stupid things will go a long way to securing your environment.
Yesterday, GigaOm published a benchmark of Azure SQL Database as compared to Amazon’s RDS service. It’s an interesting test case that tries to compare the performance of these platform as a service database offerings. One of the many challenges of this kind of a study is that the product offerings are not exactly analogous. However, GigaOm was able to build a test case where they used similar sized offerings as shown in the image below.
GigaOm derived from the TPC-E workload, which is a blended OLTP workload which involves both read-only and update transactions. In the raw performance numbers, SQL DB was a little bit better in terms of overall throughput, but where Azure really shines in comparison to AWS is pricing.
In terms of raw pricing, there is a significant difference between Azure SQL DB and AWS pricing for these two instances. Azure is $25,000/month less, however it’s not quite an apples to apples comparison, as Azure SQL DB is single database service as opposed to RDS which acts more like Managed Instance, or on-premises SQL Server. If you use use the elastic pool functionality in Azure SQL DB which is the same cost as single database SQL DB, you get a better comparison.
Where the real benefits come in, are that Microsoft allows you to bring on-premises licenses to Azure SQL Database, and the fact the Microsoft offers a much lower cost for 3 year reservations. (Which you can know pay for on a month to month basis). This is a big advantage that Microsoft has over Amazon is that they control the licensing for SQL Server. You can argue that this isn’t fair, but it’s Microsoft’s ball game, so they get to make the rules.
There will always be an inherent feature advantage to being on Azure SQL DB over AWS RDS, as Microsoft owns the code and can roll it out faster. While AWS is a big customer, and will be quick to roll out new updates and versions, they are still subject to Microsoft’s external release cycle whereas Azure is not. In my opinion, if you are building a new PaaS solution based on SQL Server, your best approach is to run on Azure SQL DB.
As I mentioned, I was in Silicon Valley a couple of weeks ago for an analyst event, and got to meet with a variety of the companies. The final company we met with on Friday, was Solo.io, and I have to say they knocked it out of the park. Their technology was super interesting and their founder Idit Levine, and their CTO, Christian Posta were excellent presenters who were clearly enthusiastic about their product.
So what does Solo.io do? In the modern microservices oriented world, we have distributed systems which are nearly all API driven. Solo.io has a number of products in this space, but their core product Gloo is a modern API gateway that securely bridges modern applications like Lamba or Azure functions to both legacy monolithic applications as well as modern databases running in Kubernetes pods.
They also have another open source project called SuperGloo, which is an abstraction layer for service mesh architecture. A service mesh provides modern applications with monitoring, scaling, and high availability through APIs rather than discrete appliances. Istio from Google is best known tool in this space, and SuperGloo can work with it, and other service meshes in the same architecture.
The other really interesting tool that Solo.io highlighted was called Squash, which is a debugger for distributed systems. If you’ve ever tried to troubleshoot a distributed system, even figuring out where to start can be challenging. By acting as a bridge between Kubernetes (drink) and the IDE, you can choose which pods or containers you are debugging and set breakpoints, or change variables during runtime.
I was recently in Silicon Valley for Cloud Field Day 6, and one of the companies we met with with HashiCorp. HashiCorp is known mostly for two key products in cloud automation–Terraform and Vault which enable cloud automation, and secrets management respectively. Both of these are open source projects, which have support and premium feature offerings for companies and are free to get started with for individuals. Both of these products are considered best of class, and are widely used by many organizations.
We had the honor of hearing from the founder and CTO of HashiCorp, Mitchell Hashimoto, who spoke to us about Consul, a service based networking tool for dynamic infrastructure (this means things like containers, Kubernetes, and serverless cloud services). Mitchell explained that companies are trying to apply on-premises networking paradigms to cloud infrastructure doesn’t really work.
Consul steps in, how can make this simpler.
Service Registry & Health Monitoring
Network Middleware Automation
Zero trust network with service mesh
The goal of the product is easier adoption, crawl, walk, run, earlier adoption. It lets you ID what’s deployed in every single platform–registry does that. Consult provides a unified view for both DNS, and API and provides active health monitoring. It also builds catalog of your entire network. Consul launched in 2014, 50,000+ agents, most widely deployed service discovery tool on AWS. Servers form a cluster and do leader election. All membership is via gossip. Consult requires one server cluster per data center. Separate gossip pool, the open source edition requires fully connected network, while the enterprise edition allows for hub and spoke topologies.
Consul also provides a number of other services like traffic splitting, which allows you do rolling deployment of application code, while sending a small percentage of traffic to the newly released version of your app, in order to check for errors.
Consul is unique tool–networking in containers and serverless is very challenging, and this product brings it together with old school technology like mainframes and physical servers. Also, given HashiCorp’s record with their other products, I expect this one to be really successful.
Last week, I had the opportunity to attend Cloud Field Day #6 in Silicon Valley. Along with 11 other brilliant delegates, we got the chance to meet with a variety of companies in the cloud computing space. I’m going to be writing about each of them in the coming weeks, but I’m going to start out by talking about a company we met with on our first day Morpheus Data.
A quote from the presentation that I thought was a really good description of their product was “we help big companies act like startups”. Morpheus’ product is combination of a service catalog, provisioning system, along with configuration management and CI/CD deployment. It’s not a full on CI/CD tool, but it integrates with all of the most common ones.
Morpheus allows you to also manage tasks like monitoring, logging, and backup. The amazing thing to me was the number of connectors that the product supported. The cloud integration here was interesting–you could show back pricing of a cloud stack, migrate workloads between clouds dynamically, and enable cloud deployment of both infrastructure and code. In addition to these cloud offerings, it’s possible to deploy to physical servers as the product includes a PXE boot engine.
I think the thing that impressed the attendees most about Morpheus was the number of integrations the product offered. Some of the other tools in this space are vRealize Automation from VMWare and ServiceNow. I feel as though the integrations and ease of moving to an infrastructure as code model is Morpheus’ strong point. As much as we talk about infrastructure as code, it’s really hard for larger organizations to move in unison in this direction. So having a product that can work with existing tools, while offering benefits on it’s own, can be a real benefit to a lot of organizations.
I wrote a couple of weeks ago, about what not to do with backups in Azure. Because I’ve seen a few improperly configured VMs lately, I wanted to talk about the way the storage works in Azure, and the way we traditionally did things on-premises.
If you still buy your storage from a three letter company, and your sales rep drives an expensive German car, and has better taste in shoes than Imelda Marcos, you might still configure your storage this way. You might create a separate disk volume for TempDB, transaction log files, and data files. Ideally, you are backing up to a separate storage appliance, and not to the same storage array where your data files live.
This architecture design dates back to when a storage LUN was literally a built of a few disks, and we wanted to ensure that there were enough I/O operations per second to service the needs of the SQL Server, because we only had the available IO of a few disks.
As virtualization became popular storage architectures changes and the a SAN lun was carved out into many small extents (typically 512k-1MB depending on vendor) across the entire array. What this meant was that with modern storage there was no need to separate logs and data files, however some DBAs did, however in an on-premises world there was no penalty for this.
Note: There is a scenario where you would want multiple disk devices in Windows. Under very high I/O workloads, IOs can queue at the Windows disk device level. This is an uncommon performance scenario in my experience
Enter the Cloud
Instead of physical disks, in the cloud your “disk devices” are virtual hard drive files, which are stored across 3 different physical disks on the infrastructure. All storage performance is controlled by quality of service settings on the Azure infrastructure. Each disk you add increases both your IOPs and storage capacity. Also, each virtual machine has a fixed limit on the number of IOPs available to it (while this is very possible on-premises, it’s far less common).
We then translate this to the operating system level, and in this specific case, Windows Server. In order to get maximal volume and performance out of our disks, we use Storage Spaces in Windows to create pools of storage. The exciting part here is that you get to use RAID 0, since Azure’s (or Amazon’s) infrastructure is providing your RAID. This means if we have 20 1 TB disks, with 5000 IOPs each, we can have a 20 TB pool, that theoretically supports 100k IOPs. (Most VMs in Azure don’t support that level of IO performance, but a couple do).
It’s also important to know that you need to specify the number of columns parameter when building your storage spaces pools in Windows. If you have more than four disks your need to use PowerShell for that–I’ll write more about that next week. But here’s some info from the product teams.
This post has good info on columns, but it’s from 2014 and the rest of the storage information is very dated (premium storage didn’t exist). I’m only including because it’s the best explanations of columns that I’ve seen.
What this means, is that in order to maximize your database server’s IO performance, you should create one large pool, with all the disks. Throw your system DBs and your data and log files all on that volume. And please don’t write your backups to that disk. (BACKUP TO URL was invented for this purpose).
You can also throw TempDB on the local D: drive, which is ephemeral (it goes away when your machine reboots, but so does TempDB), and can over slightly lower latency.
Note, if you’re reading this and you are using Ultra Disk, I haven’t tested any of this stuff with Ultra Disk because I haven’t been able to test it. I think you may not need to stripe disks to achieve good performance.
This headline may seem a little aggressive. You may think, Joey, I’ve heard you and other MVPs say that backups are the most important thing ever and I’ll get fired if I don’t have backups. That is very accurate, but when I talk about backups, I am talking about backing up your databases. If you are running an Azure VM, or have a good connection to Azure, Microsoft offers BACKUP to URL functionality that let you easily isolate your database backups from the VM.
If you are backing up your operating system and system state on your cloud provider, you are wasting storage space, money, and CPU cycles. You may be even negatively impacting your the I/O performance of your VMs. We had a customer who was doing a very traditional backup model–they were backing up their databases to the local file system, and then backing up the SQL Server VM using the Azure backup service. They swore up and down that backups where impacting their database performance, and I didn’t believe them until they told me how they were doing backups.
Cloud Storage Architecture
Unlike in your on-premises environment, where you might have up to a 32 Gbps fibre channel connection to your storage array and then a separate 10 Gbps connection to the file share where you write your SQL Server backups, in the cloud you have a single connection to both storage and the rest of the network. That single connection is metered and correlates to the size (and $$$) of your VM. So bandwidth is somewhat sacred, since backups and normal storage traffic go over the same limited tunnel. This doesn’t mean you can’t have good storage performance, it just means you have to think about things. In the case of the customer I mentioned, they were saturating their network pipe, by writing backups to the file system, and then having the Azure backup service backup their VM, they were saturating their pipe and making regular SQL Server I/Os wait.
Why Backing Up Your VMs is Suboptimal
The Azure Backup service is not natively database aware. There is this service https://docs.microsoft.com/en-us/azure/backup/backup-azure-sql-database which offers native backups in SQL Server for you. While this feature is well engineered, I’m not a huge fan. The reason is the estimated time to create a new VM in Azure is about 5-10 minutes. If you are using backup to URL, you can have a new SQL Server VM in 10 minutes. You can immediately join the machine to the domain (or better yet, do it in an automated fashion), and then begin restoring your databases. Restoring from the backup service is really slow, whereas you will see decent restore speeds from RESTORE TO URL.
The one bit of complexity in this process is for high availability solutions like Always On Availability Groups, that have somewhat complex configurations. I’m going to say two dirty words here: PowerShell and Source Control. Yes, you should have your configurations scripted and in source control. You’d murder your developers if their web servers required manual configuration for new deployments, so you should do the same for your database configurations.
If you have third party executables installed on your SQL Server for application software, then well, you doing everything wrong.
Sleep well. Eventually, there will be full support for automated storage tiering in Azure (it still has some limitiations that preclude it’s use for SQL Server backups), so for right now you need the additional manual step.
While I wrote this post around Azure, but I think the same logic should apply to your on-premises VMs and even physical machines. You should be able to get Windows and SQL Server installed within 20-30 minutes if you have a very manual process, and if you have an automated process you should be able to have a machine in less than 10. When I worked at the large cable company, we didn’t backup any of our SQL Server, just the databases.