What Not To Do With Hadoop

I gave a presentation on Hadoop to the Toronto SQL Server User’s Group this week—it was a really good time, and I got a ton of questions covering a wide range of areas. One thing to note—distrubution of data in the cluster is managed by the name node, which also does the job task management. I did get one question—that I think was more theoretical than anything, but I wanted to put it here for posterity.

“Can we mix Windows and Linux nodes in Hadoop cluster?”

One of the things I love about Hadoop is ease of deployment—the clustering is all network (no shared storage) and clustering is inherit in the software, there is nothing to configure. So you won’t see tips from me on how to add disk to your Hadoop cluster—you just tell the cluster where your new node is and what it’s IP address is and your set.

Back To the Question

I haven’t worked with Hortonworks’ Windows distribution of Hadoop much—so I don’t even know if it’s theoretically possible to mix nodes. But just don’t do it—a) I would never mix even versions (e.g. Windows 2003/2008—and yes I know you can’t) in any cluster computing scenario. So to mix different O/S platforms strikes me as insanity. Point b) no one else is doing this—support will be painful. One of the beauties of open source software is the community support—forums, email list, and blogs. If you are using a system configuration the virtually no one else is using, it will be much harder to get support.

The Answer

No. A million times no.



Backing Up Your SQL Server Database to Windows Azure Blob Storage

So recently, I was speaking at my first SQL Saturday of 2014 in Nashville (great job by Tammy Clark (b|t) and her team to put on a great event!) and due to a weird confluence of circumstances, I couldn’t get to my laptop (it involves my car being in the shop, an emergency performance tuning engagement snow, and not being about to get into my office.)

So, I thought, no fear, I have my trusty Surface Pro and a MSDN account—so I fired up a couple of virtual machines running SQL Server 2012 and 2014. The databases I use for my Data Compression and Columnstore presentations are Adventure Works, but I do a lot of prep work to build some big tables to make the demos be more dramatic. Unfortunately, when I started down this performance was pretty slow—something I’m likely attributing to thin provisioning on the Windows Azure Infrastructure side.

Since both databases (2012 and 2014) are the same—I thought let me just do a backup and restore, especially since I didn’t need to actually do work—I just had to fire up a restore. Additionally, I wanted to learn how to do something new.

First Things First

So I’m going to go on limb and assume that you have a Windows Azure account. From there you’ll need to create a storage container.


You want to make note of a couple of things in your storage account. I added a new storage account called ‘sqlbackupsjd’, and then I created a container a container called ‘mycontainer.’ You will want to make note of the URL associated with this container. We will need it when we get back to SQL Server.

Also, you’ll want to click ‘Manage Access Keys’ and grab the primary access key.


Moving on to SQL Server

You’ll need to be on SQL Server 2012 SP1 CU2 or forward to take advantage of this feature. The first thing we will need to do is create a credential to be able to pass our security information to Windows Azure.

So in my case, ‘mystorageaccount’ is ‘sqlbackupsjd’ and the secret is my storage access key. From here—backing up the database is really straightforward.


Restoring is just the opposite of the process. On the target server we create the credential, and then do the restore.



Cloud backup is one of my favorite use cases for Azure services. Much like DR, many organizations have a requirement to have offsite backups. This process can be an easy way to automate this process, without worrying about handing off your tapes to someone else. (Or storing them in the DBA’s wine cellar). It’s a really easy process with SQL 2012.


Don’t $%^&ing Delete Your SQL Server Cluster Resource

I track the metrics on my blog to see which posts are popular, in an effort to gauge my writing and what people are interested in. One post in particular, that I wish wasn’t popular but is, is “Recovering From a Deleted Cluster Resource.”

This post stemmed from an incident at my previous employer, where an offshore DBA had deleted the cluster resource from failover cluster manager “because he used to work at Microsoft, and they did this all the time” (bad idea jeans). In most cases that I’ve seen, people will try to do this when there is some security problem with a cluster. This won’t fix your security problem, however this KB article might.

What will happen however, is that your clustered instance will go down. HARD. Your clustered instance is dependent on that resource–so when the cluster loses contact with the resource, it will stop the SQL Service, and won’t be able to start back up, because of the missing dependency. You actually run the risk of having data corruption if this happens. Fun, fun.

Note—this isn’t the worst thing you can do to a cluster. That would be deleted the active directory cluster computer object—if your AD recycle bin isn’t turned on, there is no recovery except to rebuild the cluster. Also, formatting cluster disks is also a really terrible idea.

A Little Bit More on Columnstore (SQL 2014) Internals

I wrote a post on Monday with some details about columnstore index insert and update process. I mistakenly posted initially (I’ve since updated) that the tuple mover process kicked in when 1000 rows were loaded (thanks to Jimmy May from Microsoft b|t)—the correct answer is 220 or 1,048,576 rows. Remus confirms this here, but I wanted to test out the process for myself, and I found some interesting things.

Before I go too deep into my lab work, this is pretty important to the columnstore index performance story for a couple of reasons. First, when a columnstore index has open delta stores, it needs to parse those in order to answer the query. This is an expensive operation, as it prevents the query from being completely execute in batch mode and in general the insert process is slower, as Microsoft has written an optimized insert API for this process. Additionally, we would like our columnstore row groups to be as large as possible to optimize the row elimination process in answering the query. Let’s examine some of the impact.

Batch Loading

I’m not sure if I’ve found this documented anywhere, but the threshold to trigger the bulk insert API appears to be 102,400 rows—if we do an insert of 102,399 rows we see an open delta store. See below:

Note the message “Column 0 was omitted from VertiPaq during column index build.”—that’s indicative of the data getting processed directly into the columnstore. Also note the time—279 ms to load 102,400 rows. If we check the sys.column_store_row_groups for the table.

As we see—that load was added to the columnstore and compressed. Note—102,400 rows in our row group is not optimal for performance. We’d really like to have it be much larger for better segment elimination. In fact, if we do a rebuild on the index, we will see our rows get distributed more evenly.


Now let’s see what happens if we only load 102,399 rows.



So first of all, we see something huge—the execution time is 994 ms compared to 279 ms. That means it most likely got loaded to a b-tree delta store. Let’s confirm.

Sure enough—we see an open delta store. So I run my standard warehouse query against the table:


If I rebuild the index (so the delta store gets compressed) my logical reads drop, and my execution time drops slightly. Note—this is a small size dataset in terms of columnstore—in a much larger data set the effects would be much more dramatic.



One Last Thing

While I was playing with the tuple mover process yesterday I noticed something unique—a row group with the state of “invisible”

This description is documented in this connect item, and will be in books online in the near future. According to Remus:

“The Tuple Mover reads one closed delta store at a time and starts building the corresponding compressed segments. During this time scans continue to see and read the delta store. When the Tuple Mover has finished compressing the delta store, the newly created segments are made visible and the delta store is made invisible in a single, atomic operation. New scans will see and scan the compressed format. The Tuple Mover then waits for all scans still operating in the delta store to drain after which the delta store is removed.” 

I just happened the run the row groups query during the time the tuple mover was firing, and got to see this.


Ideally, your data warehouse loads are in perfect 1,024,800 row increments, so your row groups are as large as possible, and you avoid having open delta stores. Since the world is not ideal, you will probably want to monitor for open delta stores as they can greatly impact performance. Additionally, if your row groups get too “fragmented” (I use that word hesitantly—it’s not really fragmentation, it’s just a large number of row groups with less than optimal row counts) you may want to consider rebuilding the columnstore periodically—or even better the partition that it’s in.

SQL Server 2014 Columnstore Index Update Internals

I have this bad habit of wanting to know how things actually work that often leads me down rabbit holes. I’ve taken apart way too many mechanical things in my lifetime, and mostly have gotten them back together (though if you’re a cyclist—I never recommend taking apart a Shimano STI lever). Anyway, in SQL Server 2014, we now have the ability to update a columnstore index.

Given the highly compressed nature, the CPU cost of making a direct update to the columnstore index (which is stored as lobs made up of segments) would be extremely high. So in order to facilitate updates and inserts, SQL Server uses the concept of a delta store. This delta store is a b-tree (row oriented) structure where inserts and updates are stored until they reach a threshold of approximately 1000 rows 1048576 rows (thanks Jimmy May for catching–not sure where I got a 1000 rows from). Then when memory is available—they are moved into the columnstore. Let’s take a look at the process in detail.

Here we update a table with a columnstore index:

For the moment this data gets put into a delta store—which we can see from the following query:

The delta_store_hobt_id helps us here. Next I’m going to search sys.partitions for this hobt_id. Nothing

I did some more chasing, but I’ll spare you those details. I ended up at sys.syscsrowgroups—don’t look for it. It’s in Resource DB—I told your this was a rabbit hole. That didn’t really help—I then went sys.dm_db_database_page_allocations. There we can see the IN_ROW_DATA, that is where our delta store pages reside.

From there—we can run DBCC page to look at the contents of the page:

PageID 608240 is the delta store page—as you see, we can still see real data. If we look at the columnstore compressed page (522712), we see a bunch of gibberish on the page, indicative of compression (and the fact that it’s a lob now)

My VM was being fussy, so I forced the tuple mover (the process which moves data from the delta store into the columnstore) by doing an index rebuild.

Since there are no page allocations that aren’t of the type LOB_DATA all of our data has been compressed and added to the columnstore.

So that’s the path our data takes. Many thanks to Remus Rusanu (b|t) for both helping write this new feature and documenting it.

My Favorite New Feature in SQL Server 2014

Happy New Year readers, I was slack in blogging over the holidays, but I’m back and ready to get going with 2014. Last night, our user group featured a speaker gave and overview of all of the new features in SQL Server 2014 (which I think will arrive in the first half of this year) and there are some really things:

  • In-Memory OLTP (Formerly known as Hekaton)
  • New Cardinality Estimator
  • Security Enhancements (new roles for better separation of duties)
  • Updateable and clustered columnstore indexes (I’ve blogged and presented on these)
  • Enhancements to AlwaysOn Availability Groups
  • Better integration with Windows Azure services

All of these are really nice features—in particular I really like the option of being able to use Windows Azure for offsite backup storage and even a disaster recovery (DR) site in a number of different SQL Server DR scenarios. However, one thing stands out, and it is even in the standard edition of SQL Server 2014—Backup Encryption.

Due to their nature database backups tend to travel a lot of places, and pass through a lot of hands. In some shops, you may be backing up to tape, and then shipping that tape off to a 3rd party (like Iron Mountain). Even though, young Timmy who unloads the tape library seems trustworthy—do you know what he’s doing with those tapes? Or if he made a copy of them—and sold a copy of our customer database to the Russian mafia? You might want to check to see how Timmy paid for his new M5.

SQL Server somewhat fixed this Transparent Data Encryption (TDE) being included in SQL 2008—if we turn on TDE, by default our backups are encrypted, along with the performance hit that entails, and the fact that it requires enterprise edition. TDE is a nice feature—I’ve used it where I had high security requirements, but you may still have sensitive data, that either can’t support the performance hit of encryption or isn’t of the level of sensitivity where it needs to be encrypted. Or maybe your company won’t spring for Enterprise Edition.

In SQL 2014, things get a lot simple in terms of encrypting backups. First let’s take a look at a backup file (unencrypted). So we create a database, add a table, and take a backup.


If Timmy’s Russian mafia friends had a copy of this backup they have a couple of options—they can restore it to another instance of SQL Server, or if they are looking for a specific of piece of data, they could fire up the old text or hex editor (Hey look—it’s John Smith’s SSN).


So let’s see how we do this encryption thing. First we need to create a cert (NOTE: PROTECT THIS CERT WITH YOUR LIFE AND JOB—YOU CAN’T RESTORE WITHOUT IT)


Once our certificate is created we can either backup using the WITH ENCRYPTION option, or through backup options in the GUI.


Let’s try that hex editor again:


Since the file is encrypted, everything is a mash of gibberish. More importantly, no one can take our backup file and restore it to another database server. This is a very good thing for us. However, when Timmy starts sending the Russian mob backups they can’t do anything with, he may not be long for this world. So start looking for a new backup technician when you turn this feature on.


%d bloggers like this: