As I mentioned, I was in Silicon Valley a couple of weeks ago for an analyst event, and got to meet with a variety of the companies. The final company we met with on Friday, was Solo.io, and I have to say they knocked it out of the park. Their technology was super interesting and their founder Idit Levine, and their CTO, Christian Posta were excellent presenters who were clearly enthusiastic about their product.
So what does Solo.io do? In the modern microservices oriented world, we have distributed systems which are nearly all API driven. Solo.io has a number of products in this space, but their core product Gloo is a modern API gateway that securely bridges modern applications like Lamba or Azure functions to both legacy monolithic applications as well as modern databases running in Kubernetes pods.
They also have another open source project called SuperGloo, which is an abstraction layer for service mesh architecture. A service mesh provides modern applications with monitoring, scaling, and high availability through APIs rather than discrete appliances. Istio from Google is best known tool in this space, and SuperGloo can work with it, and other service meshes in the same architecture.
The other really interesting tool that Solo.io highlighted was called Squash, which is a debugger for distributed systems. If you’ve ever tried to troubleshoot a distributed system, even figuring out where to start can be challenging. By acting as a bridge between Kubernetes (drink) and the IDE, you can choose which pods or containers you are debugging and set breakpoints, or change variables during runtime.
I was recently in Silicon Valley for Cloud Field Day 6, and one of the companies we met with with HashiCorp. HashiCorp is known mostly for two key products in cloud automation–Terraform and Vault which enable cloud automation, and secrets management respectively. Both of these are open source projects, which have support and premium feature offerings for companies and are free to get started with for individuals. Both of these products are considered best of class, and are widely used by many organizations.
We had the honor of hearing from the founder and CTO of HashiCorp, Mitchell Hashimoto, who spoke to us about Consul, a service based networking tool for dynamic infrastructure (this means things like containers, Kubernetes, and serverless cloud services). Mitchell explained that companies are trying to apply on-premises networking paradigms to cloud infrastructure doesn’t really work.
Consul steps in, how can make this simpler.
Service Registry & Health Monitoring
Network Middleware Automation
Zero trust network with service mesh
The goal of the product is easier adoption, crawl, walk, run, earlier adoption. It lets you ID what’s deployed in every single platform–registry does that. Consult provides a unified view for both DNS, and API and provides active health monitoring. It also builds catalog of your entire network. Consul launched in 2014, 50,000+ agents, most widely deployed service discovery tool on AWS. Servers form a cluster and do leader election. All membership is via gossip. Consult requires one server cluster per data center. Separate gossip pool, the open source edition requires fully connected network, while the enterprise edition allows for hub and spoke topologies.
Consul also provides a number of other services like traffic splitting, which allows you do rolling deployment of application code, while sending a small percentage of traffic to the newly released version of your app, in order to check for errors.
Consul is unique tool–networking in containers and serverless is very challenging, and this product brings it together with old school technology like mainframes and physical servers. Also, given HashiCorp’s record with their other products, I expect this one to be really successful.
Last week, I had the opportunity to attend Cloud Field Day #6 in Silicon Valley. Along with 11 other brilliant delegates, we got the chance to meet with a variety of companies in the cloud computing space. I’m going to be writing about each of them in the coming weeks, but I’m going to start out by talking about a company we met with on our first day Morpheus Data.
A quote from the presentation that I thought was a really good description of their product was “we help big companies act like startups”. Morpheus’ product is combination of a service catalog, provisioning system, along with configuration management and CI/CD deployment. It’s not a full on CI/CD tool, but it integrates with all of the most common ones.
Morpheus allows you to also manage tasks like monitoring, logging, and backup. The amazing thing to me was the number of connectors that the product supported. The cloud integration here was interesting–you could show back pricing of a cloud stack, migrate workloads between clouds dynamically, and enable cloud deployment of both infrastructure and code. In addition to these cloud offerings, it’s possible to deploy to physical servers as the product includes a PXE boot engine.
I think the thing that impressed the attendees most about Morpheus was the number of integrations the product offered. Some of the other tools in this space are vRealize Automation from VMWare and ServiceNow. I feel as though the integrations and ease of moving to an infrastructure as code model is Morpheus’ strong point. As much as we talk about infrastructure as code, it’s really hard for larger organizations to move in unison in this direction. So having a product that can work with existing tools, while offering benefits on it’s own, can be a real benefit to a lot of organizations.
I wrote a couple of weeks ago, about what not to do with backups in Azure. Because I’ve seen a few improperly configured VMs lately, I wanted to talk about the way the storage works in Azure, and the way we traditionally did things on-premises.
If you still buy your storage from a three letter company, and your sales rep drives an expensive German car, and has better taste in shoes than Imelda Marcos, you might still configure your storage this way. You might create a separate disk volume for TempDB, transaction log files, and data files. Ideally, you are backing up to a separate storage appliance, and not to the same storage array where your data files live.
This architecture design dates back to when a storage LUN was literally a built of a few disks, and we wanted to ensure that there were enough I/O operations per second to service the needs of the SQL Server, because we only had the available IO of a few disks.
As virtualization became popular storage architectures changes and the a SAN lun was carved out into many small extents (typically 512k-1MB depending on vendor) across the entire array. What this meant was that with modern storage there was no need to separate logs and data files, however some DBAs did, however in an on-premises world there was no penalty for this.
Note: There is a scenario where you would want multiple disk devices in Windows. Under very high I/O workloads, IOs can queue at the Windows disk device level. This is an uncommon performance scenario in my experience
Enter the Cloud
Instead of physical disks, in the cloud your “disk devices” are virtual hard drive files, which are stored across 3 different physical disks on the infrastructure. All storage performance is controlled by quality of service settings on the Azure infrastructure. Each disk you add increases both your IOPs and storage capacity. Also, each virtual machine has a fixed limit on the number of IOPs available to it (while this is very possible on-premises, it’s far less common).
We then translate this to the operating system level, and in this specific case, Windows Server. In order to get maximal volume and performance out of our disks, we use Storage Spaces in Windows to create pools of storage. The exciting part here is that you get to use RAID 0, since Azure’s (or Amazon’s) infrastructure is providing your RAID. This means if we have 20 1 TB disks, with 5000 IOPs each, we can have a 20 TB pool, that theoretically supports 100k IOPs. (Most VMs in Azure don’t support that level of IO performance, but a couple do).
It’s also important to know that you need to specify the number of columns parameter when building your storage spaces pools in Windows. If you have more than four disks your need to use PowerShell for that–I’ll write more about that next week. But here’s some info from the product teams.
This post has good info on columns, but it’s from 2014 and the rest of the storage information is very dated (premium storage didn’t exist). I’m only including because it’s the best explanations of columns that I’ve seen.
What this means, is that in order to maximize your database server’s IO performance, you should create one large pool, with all the disks. Throw your system DBs and your data and log files all on that volume. And please don’t write your backups to that disk. (BACKUP TO URL was invented for this purpose).
You can also throw TempDB on the local D: drive, which is ephemeral (it goes away when your machine reboots, but so does TempDB), and can over slightly lower latency.
Note, if you’re reading this and you are using Ultra Disk, I haven’t tested any of this stuff with Ultra Disk because I haven’t been able to test it. I think you may not need to stripe disks to achieve good performance.
This headline may seem a little aggressive. You may think, Joey, I’ve heard you and other MVPs say that backups are the most important thing ever and I’ll get fired if I don’t have backups. That is very accurate, but when I talk about backups, I am talking about backing up your databases. If you are running an Azure VM, or have a good connection to Azure, Microsoft offers BACKUP to URL functionality that let you easily isolate your database backups from the VM.
If you are backing up your operating system and system state on your cloud provider, you are wasting storage space, money, and CPU cycles. You may be even negatively impacting your the I/O performance of your VMs. We had a customer who was doing a very traditional backup model–they were backing up their databases to the local file system, and then backing up the SQL Server VM using the Azure backup service. They swore up and down that backups where impacting their database performance, and I didn’t believe them until they told me how they were doing backups.
Cloud Storage Architecture
Unlike in your on-premises environment, where you might have up to a 32 Gbps fibre channel connection to your storage array and then a separate 10 Gbps connection to the file share where you write your SQL Server backups, in the cloud you have a single connection to both storage and the rest of the network. That single connection is metered and correlates to the size (and $$$) of your VM. So bandwidth is somewhat sacred, since backups and normal storage traffic go over the same limited tunnel. This doesn’t mean you can’t have good storage performance, it just means you have to think about things. In the case of the customer I mentioned, they were saturating their network pipe, by writing backups to the file system, and then having the Azure backup service backup their VM, they were saturating their pipe and making regular SQL Server I/Os wait.
Why Backing Up Your VMs is Suboptimal
The Azure Backup service is not natively database aware. There is this service https://docs.microsoft.com/en-us/azure/backup/backup-azure-sql-database which offers native backups in SQL Server for you. While this feature is well engineered, I’m not a huge fan. The reason is the estimated time to create a new VM in Azure is about 5-10 minutes. If you are using backup to URL, you can have a new SQL Server VM in 10 minutes. You can immediately join the machine to the domain (or better yet, do it in an automated fashion), and then begin restoring your databases. Restoring from the backup service is really slow, whereas you will see decent restore speeds from RESTORE TO URL.
The one bit of complexity in this process is for high availability solutions like Always On Availability Groups, that have somewhat complex configurations. I’m going to say two dirty words here: PowerShell and Source Control. Yes, you should have your configurations scripted and in source control. You’d murder your developers if their web servers required manual configuration for new deployments, so you should do the same for your database configurations.
If you have third party executables installed on your SQL Server for application software, then well, you doing everything wrong.
Sleep well. Eventually, there will be full support for automated storage tiering in Azure (it still has some limitiations that preclude it’s use for SQL Server backups), so for right now you need the additional manual step.
While I wrote this post around Azure, but I think the same logic should apply to your on-premises VMs and even physical machines. You should be able to get Windows and SQL Server installed within 20-30 minutes if you have a very manual process, and if you have an automated process you should be able to have a machine in less than 10. When I worked at the large cable company, we didn’t backup any of our SQL Server, just the databases.
I started working on this bit of code a few months ago, and it’s served me really well. Just about every command you run against a SQL Database requires you to supply the server name and the resource group name at parameters. And in order to get the list of server names you have to do it for each for resource group.
This code is pretty simple and looks for an Azure SQL Server in each resource group, and then looks for the databases that aren’t master on each server. In this example I’m setting the storage account for Azure Threat Detection, but you could do anything you wanted in that last loop.
When Azure SQL Database Managed Instance was introduced to the public at //build a couple of years ago, it was billed as a solution to ease the migration from either on-premises or even infrastructure as a service VMs. You would get all of the benefits of a managed service like built-in high availability and patching, automated backups, and you could do all of the things you couldn’t do in Azure SQL Database, like run CLR, use cross-database queries and have SQL Agent jobs without having to learn Azure Automation and PowerShell. The final big bonus was that you restore your backups from on-premises into the managed instance environment. No more dealing with DACPACs and crying, and drinking, and crying, and drinking, and crying.
I had very early access to managed instance servers, and it seemed obvious to me that an easy migration approach would be to use log shipping. You could write your backups from your source server to URL, restore them with NORECOVERY to your managed instance, repeat the process with log backups, and voila you were in a managed instance. Quick and easy, and more importantly, if you were a DBA, nearly the exact same process you would have executed in your on-premises environment (except with backups to blob storage).
There was a long period of time, were we Data Platform MVPs were unable to deploy managed instances into our Microsoft subscriptions. Which is fine, when capacity is short, it should go to paying customers, not us idiots. However, this meant I was away from the product for a while. During this time Microsoft introduced the Data Migration Service, a comprehensive set of tools to move your data to and from a variety of platforms in an online and offline manor.
While DMS is pretty interesting tooling, I had mostly ignored it until recently. Functionally, the tool works pretty well. The problem is it requires a lot of privileges–you have to have someone who can create a service principal and you need to have the following ports open between your source machine and your managed instance:
While the scope of those firewall rules is limited, in a larger enterprise, explaining why you need port 445 open to anything is going to be challenging. So in addition an AAD admin, the DBA is going to need a network admin to enable this. That service principal you created is also going to need the contributor permission on the entire subscription. Yes, that means it can create anything in the entire subscription. This is probably my biggest complaint. Microsoft does acknowledge this in docs, and says they are working to reduce the permissions that are required.
I’m currently engaged in a VM to Managed Instance migration, and when the client’s DBA was complaining about the complexity of the DMS, I suggested we just use log shipping like I had done when I first played with the Managed Instance service. I was trying to figure out how to automate the process, but then I figured I should just verify I could do a restore with NORECOVERY.
Msg 3032, Level 16, State 2, Line 11
One or more of the options (stats, norecovery, stats=) are not supported for this statement. Review the documentation for supported options.
Sad Trombone. That means the only way to migrate a database in near real-time is to use the DMS. And it’s going to take half of your IT staff in order to do it. In order to reduce the friction to migrations I’ve yelled at a couple of PMs about this, but I thought I would create a User Voice option.
There was a thread on one of the Microsoft MVP distribution lists the other week, about recovering from a deleted resource that reminded me of a post I had been meaning to write. In many organizations, the public cloud is the wild west of the IT organization. In the worst cases, this means organization admins are using their gmail accounts to access the subscriptions, but even in well run organizations, the ease of deploying cloud resources leads to the dreaded server sprawl. In this post, I’m going to talk about three features of the Azure Resource Manager architecture that you should be using to better manage your subscription: tags, policies, and locks.
When I worked in corporate IT, there was no discussion I hated more than the dreaded “server naming convention” discussion. It would typically be held in a room filled with middle managers (who would nearly always be men) who felt the need to exercise their dominance by defining at least two characters of the up to 15 we were allowed by NetBIOS. This also lead to metadata packing as I called it–where we would end up with server names like SWCSAPSQL01P, which would indicate the company, the data center location, the application, the function, an integer, and the environment. Plus server names like that roll right of the tongue. In reality, this is kind of a terrible way to define metadata about server resources, and in a world where we are using things like containers which are disposable, this paradigm does not work. Fortunately Azure (and AWS and Kubernetes) allow for tagging of resources. Tags are simply key-value pairs that describe our objects. For example, if I had a VM running SAP’s SQL Server, I might have the following tags:
Tags are free form, and you have up to 15 of them per resource, so you describe things very well. Tags also roll up on your Azure bill–hence my use of the cost center tag in my example. You can also use PowerShell and Azure CLI to filter operations by tags, so they are essential to filtering your management and maintenance tasks.
If you are thinking “tags are a really good idea, but the other people on my team are lazy and will never remember to use them” do I have the solution for you. Similar to Windows Server Active Directory, Azure allows you to implement policies to manage your subscription. Before we delve too deep into Azure Policy specifics, let’s talk about the hierarchy of resources within your Azure subscription (for the purposes of this post I’m talking about a single subscription).
At the highest level we have the subscription, which is made of one or more resource groups, which themselves are composed of zero or more resources. This hierarchy is important to understand for many reasons, not the least of which is that it is how role based access control (RBAC) works in Azure. Security (and policy) are scoped at a level, and then have inheritance down. If you have a role granted at the level of the subscription, you are going to have access to all of the resource groups and resources in that subscription (unless someone has issued an explicit deny, but that’s a different post).
Policy works the same way–a policy has a scope of either a specific resource group or at the subscription. You can define a policy to do any number of things in Azure–if you want to define a policy to enforce tagging and scope it at the subscription level, no one will be able to deploy a resource without the tags you have specified in your policy. You can also specify the type, sizes, and regions where your users can deploy resources. Once policy is in place, users will receive an error if they try to deploy resources that do not meet the definition. Because of this, you should also socialize your policies with anyone who will be deploying resources into your environment, so that they know the rules, and don’t come to your desk with a bat.
The final feature that you should be using are locks. Locks are just what they sound like, and they can perform one of two functions: prevent any changes to the resource, resource group, or subscription (read-only locks) or not allowing any resource at the scope of the lock to be deleted (delete locks). I don’t really like using read-only locks, as they prevent changes like adding disk space, or changing the performance level of an Azure SQL Database. However, I think delete locks should go on every production resource in your subscription. Locks can be removed, if you are the owner of a resource or resource group, but the if you attempt to delete the resource with a lock in place, Azure will throw an error message indicating that the lock is in place.
The cloud is big and complex, and it’s easy to let resources grow out of control, which can lead to configuration drift, security problems, and most importantly excessive spend. These are just a few of the many built-in
tools you can use to make cloud management easier.
I try to avoid writing blog posts that I like to call “hot takes”—quick crappy opinions on the news of the day, but this is a topic I feel particularly strongly about. I’m not sure how many of you were using Microsoft’s Azure Hadoop offering HDInsight, when it debuted in the 2012-13 timeframe, but it had a unique characteristic. Unlike virtually every other Hadoop offering at the time (and this was a hot era for Hadoop) HDInsight ran on Windows Server. That meant all the assorted utilities in and around Hadoop were always trailing what was current on Linux. It also meant that when you had issues, and you searched for help in various online forums, you were always challenged because you got weird error messages, and your cluster used oddball file pathing because Windows. Since 2013, Microsoft has gotten a new CEO, the stock price has shot way up, and the company has embraced open source software. SQL Server is on Linux now, which leads me to my next points.
Ever since SQL Server Helsinki debuted in Docker, I’ve seen the benefit of using containers with databases. When SQL Server started supporting Kubernetes, I really saw the benefit and quickly embraced this, by writing and presenting on the topic, and evangelizing all the benefits of the Kubernetes platform (of which there are many). Before I start my rant, remember in a container platform, there is only one copy of the base operating system on a given host. Your container contains the libraries and binaries it needs but shares a kernel with the host operating system. This means the base operating system of the host must be the same as that of the container.
Much like Hadoop, Kubernetes was built from the ground up as a Linux based platform. When Google built the Borg cluster management system that eventually became Kubernetes, it was reportedly built on a custom build of OpenBSD. This means there a lot of assumptions about the way things work in Linux that are built into Kubernetes. While, I know I’ve heard a reasonable amount of community demand for Windows containers (clearly enough that Microsoft has made an effort to build support both into Windows Server and Azure Kubernetes Services), I can’t help but feel this is not a good long term plan.
When dealing with open source software, it’s good to be on a platform that is widely utilized. When you are searching for help on forums, or looking for the latest patch, the platform that is the most widely utilized. Another example I like to use for this, is Oracle on Windows, which I supported in a past life. Since Oracle was most commonly run on Linux/UNIX, patches for Windows were always days and weeks behind. While, I appreciate the effort of the Windows team to build container and Kubernetes support into the platform, Microsoft is going to be the only support/patching path for Kubernetes on Windows, which hampers one of the key benefits of OSS, rapid fixes.
There’s another elephant that’s in the room—in this scenario, Linux is free as in beer, and you will have to license (pay for) your Windows nodes. If you are running a supported version of Linux like Red Hat (now presented by IBM), you will pay about the same cost as Windows Server licensing, but in most cases organizations running Kubernetes are doing it on a free version of the operating system.
I don’t mean to slight anything Microsoft is doing (note: I’m a shareholder and current a contractor at MS), but I feel as though if you are implementing Kubernetes on Windows, you are likely doing containers wrong. With .NET Core and SQL Server being available on Linux there are few reasons to tie your development to the Windows platform. System administration reasons like domain authentication and group policy support make some sense to me, however I can’t help but think this feels like Hadoop on Windows. Also, the lack of community support can’t be overemphasized–this is a big deal, especially on a not fully mature platform like Kubernetes. By the way, Microsoft stopped offering HDInsight on Windows sometime in 2014-15, just saying.
I was going to New York last weekend from my home of Philadelphia. We were running late for the train, and for the first time ever, I had a booked an award ticket on Amtrak. For reasons unbeknownst to me, you can not make changes to an award ticket on their app (I didn’t try the website). Additionally, when you call the standard Amtrak line, the customer service reps can’t change an award ticket, unless you have defined a PIN. This PIN is defined by telling an awards customer service rep what you want your PIN to be. (Because that’s really secure). While this is all god awful business process, that is exacerbated by crappy IT, it’s really down to bad business processes.
The bad application design came into play, when the Awards rep, tried to change my ticket, and she asked “do you have it open in our app? I can’t make changes to it if you have it open.” My jaw kind of dropped when this happened, but I went ahead and closed the app. Then the representative was able to make this change. We had to repeat the process when the representative had booked us into the wrong class of service. (The rep was excellent and even called me back directly).
But let’s a talk about the way most mobile apps work. There are a series of front-end screens that are local to your device, and most of the interaction is handled through a series of Rest API calls. The data should be cached locally to your device after the API call, so it can still be read in the event of your device being offline. If you are a database professional, you are used to the concept of ACID (Atomicity, Consistency, Isolation, Durability), which is how we ensure things like bank balances in our databases can remain trusted. In general, a tenant of most databases applications is that readers should never block writers–if a process needs to write to a record someone reading the record should not effect that operation. There are exceptions to this rule, but these rules are generally enforced by the RDBMS in conjunction with the isolation level of the database.
Another tenant of good database development is that you should do what you are doing, and get the hell out of the database. Whether you are reading or writing, you should get the data you need and then return either the data or the status of the transaction to your app. Otherwise, you are going to keep your transaction open, and impact other transactions, in a generally unpredictable set of timings. That’s the whole point of using the aforementioned Rest API calls–they do a thing, return your data, or that you updated some data, and then get the hell out.
What exactly is Amtrak’s app doing? Obviously I don’t have any backend knowledge, but based on that comment from the CS rep, opening your reservation in the mobile app, opens a database transaction. That doesn’t close. I can’t fathom why anyone would ever design an app this way, and I can’t think of a tool that would easily let you design an app like this. At some point, someone made a really bad design choice.