PASS Summit 2013—I’m Speaking

Last week I received a really exciting email—I’ll be speaking at this year’s PASS Summit in Charlotte, NC. I had the good fortune of being selected to present on two topics. Without further ado, here’s what I’ll be presenting:

Accelerate Database Performance through Data Compression

I feel like Data Compression is one of the most underutilized features of SQL Server Enterprise Edition. In this session, we will build an understanding of its costs and benefits, and the best use cases for it in your environment. Additionally, we will discuss the new ColumnStore index feature in SQL Server 2012. There will be a lot of demos, and hopefully I won’t have to go to my DR laptop like I did in Atlanta.

Into the Blue—Extending Always On Availability Groups

I know a lot of us (my friends in the SQL community) bash the cloud, and I completely agree that “cloud”, “big data” and “agile” are the three buzzwords that tend to make me want to punch a wall. However, DR solutions are an interesting use of off-premise resources: a lot of smaller companies have demanding RPO/RTO requirements but lack the in-house capacity (multiple geo-separated data centers) to support them. In this session we will show how to use Windows Azure Virtual Machines to extend your AlwaysOn Availability Group into Azure.

I hope to see many of you in Charlotte.

Disaster Recovery Starts with Backup and Recovery

I present and blog a lot about high availability and disaster recovery solutions, and in doing so I get to talk to a lot of folks about different strategies. Depending on your business needs and regulatory requirements, these can vary greatly in cost and complexity. However, no matter what your DR solution is, it is imperative that you have a sound backup strategy and that you test those backups on a regular basis.

I recently took part in an architectural review of several important applications. The reason for the review was that customer teams were asking for a real-time DR solution for systems producing multiple terabytes of transaction volume daily. That is possible, but only at great cost, so in order to craft a better solution we started asking for details about the current strategy and the business requirements around their data. In doing so, it came out that multiple application teams had never tested a restore of their backups. Excuse my fonts here…

If you aren't regularly testing your restores, save the drive space and don't back anything up.

Ok, rant over (sort of). At a conference I had the pleasure of attending this spring, an infrastructure executive from Goldman Sachs presented on how they had zero downtime during Hurricane Sandy. Granted, it is Goldman Sachs and their IT budget is some-huge-number billion dollars, but several things she said really stood out. In addition to the extensive risk analysis work GS had done, they regularly tested failovers and restores across all systems (compute, power, cooling and generation). That is by far the most important thing you can do from a DR perspective. Even if you can't physically run a test (you really should), at least get all of the teams (database, sysadmin, network, application) into one room and work out all of the moving parts and what needs to happen in the event of a disaster.

Lastly, I know it's hard to get resources to do test restores; enterprise storage isn't cheap. However, there are several options you can leverage if you don't have space in your primary location for testing:

  • Amazon Web Services—Take advantage of the cloud; DR is, in my opinion, one of its best use cases. You can fire up a server, use it for testing, and then blow it away. You do have to get your data there, which can be painful for large data sets, but it's not a bad solution.
  • Real Hardware—I know enterprise storage is pricey, but a lab server with some relatively slow, dense storage isn't. You can build a really effective restore testing environment for less than $10,000, and well under that if you are willing to buy older, unsupported hardware on eBay.
  • Offsite Recovery Services—There are a number of firms that provide offsite recovery services (for example, Sungard). However, this option tends to be extremely expensive, as they guarantee hardware for you in the event of a disaster, and as part of that guarantee you are granted testing time.

I can't express my feelings on this strongly enough: particularly if you are working with large data sets, with lots of opportunities for failure, it is absolutely critical to test those systems. Otherwise, don't plan on being able to restore when you need to.

Compression Presentation Resources

I've been fortunate enough to give a presentation on Data Compression in SQL Server several times over the past and coming months. I put this post together to showcase the slide deck, answer some questions, and list some resources around the feature.

Data Compression is an Enterprise Edition-only feature that was introduced in SQL Server 2008. Backup Compression is a separate feature; starting with SQL Server 2008 R2, it was made available in Standard Edition as well.
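For quick reference, here's a minimal sketch of turning page compression on; the table name is hypothetical, and sp_estimate_data_compression_savings lets you gauge the benefit before you commit:

-- Estimate the savings first (hypothetical table name)
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactSales',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE'
GO
-- Then rebuild the table (or a specific index) with the chosen setting
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE)
GO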

The SQLCAT blog post about compression (which specifically answers the question of whether there are benefits to using backup compression with a row- and/or page-compressed database; the answer is yes) is here.

Hugo Kornelis answers some questions about LOBs and why SQL Server doesn't compress them at SQL Blog here.

My blog post on Columnstore Archival Compression in SQL 2014.

The full SQLCAT white paper on Data Compression is here.

To answer the question about backup compression affecting the transaction log chain: basically, you can't mix compression modes within a given media set. So if you are not using backup compression and you turn it on, you need to start backing up to a new media set (a new backup file). The code below is correct, for example, because each backup goes to its own file:

BACKUP DATABASE [AdventureWorksDW2012] TO DISK = 'c:\temp\1.bak'
GO
-- Log backup with compression, going to its own new file (media set)
BACKUP LOG [AdventureWorksDW2012] TO DISK = 'c:\temp\2.bak'
WITH COMPRESSION
GO
-- Subsequent uncompressed log backup, again to its own file
BACKUP LOG [AdventureWorksDW2012] TO DISK = 'c:\temp\3.bak'
GO
-- Tail-log backup, followed by the restore sequence
BACKUP LOG [AdventureWorksDW2012] TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Backup\AdventureWorksDW2012_LogBackup_2013-05-09_14-22-10.bak' WITH NOFORMAT, NOINIT, NAME = N'AdventureWorksDW2012_LogBackup_2013-05-09_14-22-10', NOSKIP, NOREWIND, NOUNLOAD, NORECOVERY, STATS = 5
RESTORE DATABASE [AdventureWorksDW2012] FROM DISK = N'c:\temp\1.bak' WITH FILE = 2, NORECOVERY, NOUNLOAD, STATS = 5
RESTORE LOG [AdventureWorksDW2012] FROM DISK = N'c:\temp\2.bak' WITH FILE = 2, NORECOVERY, NOUNLOAD, STATS = 5
RESTORE LOG [AdventureWorksDW2012] FROM DISK = N'c:\temp\3.bak' WITH FILE = 1, NORECOVERY, NOUNLOAD, STATS = 5
RESTORE LOG [AdventureWorksDW2012] FROM DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Backup\AdventureWorksDW2012_LogBackup_2013-05-09_14-21-30.bak' WITH FILE = 1, NOUNLOAD, STATS = 5
GO

Also, Books Online lays out the rules for using data compression in conjunction with partitioned tables (a quick illustration follows the list):

When you use data compression with partitioned tables and indexes, be aware of the following considerations:

  • When partitions are split by using the ALTER PARTITION statement, both partitions inherit the data compression attribute of the original partition.
  • When two partitions are merged, the resultant partition inherits the data compression attribute of the destination partition.
  • To switch a partition, the data compression property of the partition must match the compression property of the table.
  • When a clustered index is dropped, the corresponding heap partitions retain their data compression setting unless the partitioning scheme is modified. If the partitioning scheme is changed, all partitions are rebuilt to an uncompressed state.

Credit: Books Online
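As a quick illustration of those rules in practice, compression can be set per partition. This is only a sketch; the table name and partition numbers below are hypothetical:

-- Rebuild a single partition with page compression
ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = PAGE)
GO
-- Or mix settings across partitions in one statement (compress the older, colder partitions only)
ALTER TABLE dbo.FactSales REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE ON PARTITIONS (1 TO 3),
      DATA_COMPRESSION = NONE ON PARTITIONS (4))
GO
-- Remember: before a partition SWITCH, the source and destination compression settings must match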

Four Underutilized Features of SQL Server Enterprise Edition (and One of Standard Edition)

I started thinking about this post a few months ago, when we were running into database performance issues on one of our major customer-facing applications and I noticed we weren't using partitioning or compression on a very large database. So I decided to put together a presentation on compression, and that led me to think about which features aren't utilized as often as they should be in SQL Server Enterprise Edition. Enterprise Edition licenses cost a lot (at a minimum your company is paying about $28,000 MSRP), so what features might you be leaving on the table?

  1. Compression—Compression (both row and page level) was introduced in SQL Server 2008 and is a great feature for a lot of workloads. In the past, our workloads were CPU constrained, and using compression on a database just wasn't a good idea; however, in the era of underutilized, powerful CPUs, compression makes a lot of sense for many workloads. Microsoft's official recommendation is that you should row compress all of your data. I don't necessarily agree with that sentiment, but using compression carefully can greatly increase the performance of your read workloads. There's a great Microsoft white paper on compression, and when and where to use it, here. Additionally, compression is one of the only features that the DBA can use to increase the performance of a third-party application.* (*Make sure your vendor is cool with this.)


  2. Partitioning—The origin of this blog post was a 112 GB table that was neither partitioned nor compressed. Index rebuild jobs were running into the batch window, and the DBA asked me for assistance with job scheduling. When I dug into the jobs and saw that table, needless to say I was alarmed. Partitioning isn't easy to use: you have to think about your data and how it should be broken down. There are some great resources around partitioning from Microsoft here and from Kendra Little (b|t) here. Partitioning has the potential to make your databases easier to maintain, make your queries run faster, and increase your salary (well, after the other two things happen). It's also another way to improve the performance of a third-party application, and lastly, if your application doesn't have built-in archiving functions, it's a way to archive data when you have long retention requirements.


  3. Resource Governor—This feature isn't for everyone. It allows you to control the amount of CPU and memory available to a given resource pool; you assign users to a resource pool and limit the CPU and memory available to that pool. Pinal Dave does a good job explaining how to configure it here. The Microsoft demos I've seen tend to show how you might give the CEO priority to run his reports; I don't know about you, but everywhere I've worked the CEO has minions to run his reports, so he or she isn't that concerned about it. The real-world situation where I've used this feature is a multi-tenant SQL Server: I had a couple of databases that were overwhelming the rest of the server. Fortunately, both of these databases accessed SQL Server as a process user, so I was able to add that process user to a resource pool (a minimal sketch of that setup follows after this list). Note: there is no way to limit I/O utilization directly, but if you limit CPU enough, the user won't be able to generate that many I/Os. I'm looking at you, vSphere database.


  4. Management Data Warehouse—Ok, I lied; this is the Standard Edition one. It's also not for everyone: if you have more than 50 servers, it probably won't scale that well. It was introduced in SQL Server 2008 to allow administrators to collect performance data from a group of servers and return it to a central server. It is extensible and customizable, and Microsoft includes a large number of canned reports. It is limited to servers running SQL Server 2008 and above, but you should really be getting off of SQL Server 2005 soon anyway. The other caveat is that if the time comes that you want to stop using MDW, it leaves some remnants behind. Fortunately, Aaron Bertrand (b|t) has blogged about how to configure and remove it successfully here.


  5. Peer-to-Peer Transactional Replication—Do you like hitting yourself in the thumb with a hammer? Then this might be the SQL Server feature for you! Seriously, if you want to do this, enlist professional help before going down that path; it's not simple, and it can certainly be a headache.
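To put item 3 in concrete terms, here is a minimal Resource Governor sketch for the multi-tenant scenario; the pool, workload group, and login names are all hypothetical:

-- Run in master: create a pool that caps the noisy tenant's CPU and memory
CREATE RESOURCE POOL NoisyTenantPool
WITH (MAX_CPU_PERCENT = 30, MAX_MEMORY_PERCENT = 20)
GO
CREATE WORKLOAD GROUP NoisyTenantGroup USING NoisyTenantPool
GO
-- Classifier function routes the application's process login into that group
CREATE FUNCTION dbo.fnResourceClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group sysname = N'default'
    IF SUSER_SNAME() = N'VSphereProcessUser'   -- hypothetical process login
        SET @group = N'NoisyTenantGroup'
    RETURN @group
END
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnResourceClassifier)
ALTER RESOURCE GOVERNOR RECONFIGURE
GO

Once this is in place, new sessions from that login land in NoisyTenantPool, while everything else continues to use the default pool.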

Vendors, Don’t Be That Creepy Guy, OK?

We've all been out at the bar late, too late even, and seen that drunk guy (or girl) who's desperate to go home with another guy/girl/mammal. That's understandable: it's the end of the night, you're drunk, and the most attractive options are already gone. Sales can be very similar. It's the end of the quarter, you need to meet your sales quota, and all of the attractive customers are already signed up or have already refused your pitch. That's how sales works, and while it's annoying, I get it. Here's where I get upset: after I've refused the advances and acknowledged that I'm not really interested in your junk, er, product, it's not okay to keep asking questions. It's not really any of your business what other solutions I own or what I'm developing internally, and when you keep asking those questions after I've blown you off, it just starts to feel weird. Don't make me get a restraining order. (BTW, this isn't about any of our wonderful SQL Server community sponsors—NoSQL related.)

PASS Business Analytics Conference Keynote Day 2

PASS has brought in the "Indiana Jones of Economics" for today's keynote: Dr. Steven Levitt, author of Freakonomics and SuperFreakonomics. Steven starts with the story of longtime IRS employee John Szilagyi and his idea of requiring dependents' Social Security numbers, not just their names, on the tax form. At first the idea didn't go over, but on April 16, 1987, 7 million children vanished off the face of the earth. Well, not really; they never existed in the first place, since there had been no IRS verification. The change brought the IRS $2B in additional revenue. The story is here.

Incentives and Behavior

Dr. Levitt builds from that story into a discussion of incentives and how they drive human behavior, including the difficulty of aligning goals between workers, shareholders and managers. He also discusses how broken academic tenure is: he only has to teach 50 hours at the University of Chicago, and he spends at least 10 hours a week just thinking.

Do What You are Good At

He studies small things, and discusses the controversy around his theory that legalized abortion reduced crime rates 20 years later. On what makes an economist good: it's not math, it's a basic understanding of how the economy works. He wasn't great at math himself; he had to take Math 101 at Harvard after posting the lowest score ever on his high school AP math exam.

He moves on to research into drunk driving, and it's still really bad: the chances of causing a fatal crash are 8-10x higher than when driving sober. He mentions that walking drunk is actually much more dangerous than driving drunk; a town in Alaska has even banned drunk walking.

Economics of Crack

He discusses economists running a survey that lacked common sense ("How does it feel to be poor and black?"), and a colleague, Sudhir, administering a survey to gang members in Cabrini-Green. Sudhir's talks with a gang leader were the start of understanding the dynamics of gang behavior; it's covered in his book here. Looking at the gangs' financial records, it turns out gangs are organized very much like a franchised business. Top gang members made $500K a year, but most drug dealers still live with their moms, because being a street-level drug dealer is a bad deal. It's 7x more dangerous to sell crack in Chicago than to be in a war zone.

Businesses and Data

Dr. Levitt discusses how his forays into business consulting were abject failures, and how poorly most enterprises do with data. He talks about surveying a large company on its hiring practices and offering analysis of all of its hiring, retention and performance data; there was a negative correlation between testing and employee performance. Businesses have organized themselves in ways that minimize feedback. He talks about the benefits of experimenting with data, and about how scared organizations can be of using it: he was fired from his biggest client after telling them the truth about their data. He finishes with an audience Q&A.

PASS Business Analytics Conference Day 1 Keynote

We start with PASS President Bill Graziano, who leads off by talking about the growth of data analytics over the past few years. Connect. Share. Learn. If you are reading this and don't know, SQL Saturday Chicago is this weekend, and I'll be presenting there.

Matt Wolken from Dell/Quest takes the stage to discuss business intelligence. He talks about the rate of change in the BI space, how the shifting demographics of social media are influencing spending, and the great increases in mobile computing, and he mentions the famous tweet from the Hudson River. Analytics used to be backwards-looking; now it's forward-looking and revenue-driving. Companies that implement BI and BA solutions tend to be more profitable (13%). Dell has a new social command center to monitor feedback and support.

Now we welcome Microsoft keynote speakers Amir Netz and Kamal Hathi to the stage. Amir talks about his Apple II and how he programmed extensively on it, mentions using VisiCalc and Lotus 1-2-3, and describes how spreadsheets evolved into OLAP. Kamal is now talking about how OLAP evolved into Hadoop, and how business users can understand and use the tools in the big data space. Yet it all comes back to Excel; I suspect this will come back to Power View and Data Explorer. I shall refrain from editorial content, but ZOOMIT. Excel is getting much better at dealing with external data. Amir compares the evolution of cars from the Model T to a BMW, and says BI is now at the stage of the early slide projector. Power View is awesome.

Microsoft is doing a nice job in this demo of using sentiment analysis from Twitter to demonstrate American Idol success predictions; the data shows that positive sentiment is key on American Idol. He explains how Twitter has changed the dynamics of how people watch TV. This means if you are showing something that should be live, SHOW IT LIVE (I'm talking about bike and F1 racing, NBC Sports Network).

We're seeing a new Excel plugin called GeoFlow, which lays out spatial data within Excel; this type of functionality has been getting better, but this is really nice. GeoFlow can zoom in via touch, and it also has replay functionality. Outstanding graphics and functionality.

And that’s a wrap…more tomorrow

Why Linux and Hadoop Matter, and Why You Should Know Them

Most of my work in the last several years has been in the SQL Server space; I do a lot of presentations about making SQL Server highly available and perform better. My background also involves a lot of work in various RDBMSs on Linux and UNIX platforms, and those skills have served me well: I can translate between Oracle and SQL Server teams, as well as between Linux and Windows teams. I've always known Linux shell scripting, and it has been a real asset in learning PowerShell, as most of the concepts are the same. (You are learning PowerShell, right? I couldn't manage a large environment without it.) Additionally, the other thing that has come into my scope is NoSQL solutions, particularly Hadoop and column-family datastores like HBase and Cassandra. So where am I going with this?

Big Data Isn’t Just a Fad…and It’s Not a Size Thing

Despite what some people say, Big Data and NoSQL are real things, with quantifiable benefits. I'm not going to repeat buzzwords about the 4 Vs, or talk about map-reduce or schemaless designs, but companies everywhere are looking at these solutions to handle a wide variety of analytic tasks that weren't easily doable in the relational model. Note that I said easily doable; I have friends who are total superheroes at SQL tuning and system design, and they have the ability to make magic happen. However, most companies don't have resources like them, so they can't tune their data warehouse to perform those tasks. Or maybe they have a short-term need and just want to do some analysis in something like Amazon Elastic MapReduce (think Hadoop without all the hard stuff, like building a 100-node cluster).

So with SQL Server jobs and activity at what seems like an all-time high (I'm averaging 3.5 recruiter emails a week), why should you care about this stuff? After all, relational databases aren't going away anytime soon, and generally speaking they are part of any ecosystem involving Hadoop. Well, the industry is really moving towards highly available systems that scale horizontally, and RDBMSs don't do that well. Also, you may have noticed that SQL Server licensing got really expensive with the introduction of SQL Server 2012, and if Oracle's recent earnings report is any indication, businesses have noticed how expensive RDBMS licensing and support are and are exploring other options. I know my large enterprise is. Eventually, throwing hardware and more licenses at the problem is going to be too slow and too expensive. Hadoop and Hive are free (you can buy support, and there are some closed-source options, but the products themselves are free). While Microsoft does have its own distribution in conjunction with Hortonworks, it seems to be mostly a cloud-based solution, and even if you wanted to go on-premises, the prospect of licensing a 100-node Windows cluster is daunting.

What Data Professionals Should Be Doing

So what are most shops running Hadoop and other open source solutions on? CentOS, which you can download here for free. That's right: free as in beer. CentOS is an unsupported rebuild of the Red Hat Enterprise Linux operating system, but it is widely adopted and has broad community support. Also, nearly all of the development work on these solutions is done on Linux. While I'm glad that Microsoft has embraced big data, I just don't see the foothold of Hadoop running on Linux going away anytime soon.

The other interesting thing about the NoSQL model is that there isn't really a place for the specialized, one-trick DBA. Work is typically performed by the systems admin (or cloud provider) and data analysts. So what does this mean for you as a data professional? You are probably the most qualified person in your company (assuming you don't work at Facebook, LinkedIn, or Yahoo) to work with NoSQL solutions. You understand how data works, you understand metadata, and you understand how hardware works. All of these skills still matter in any model. So why do I think you should learn Linux and Hadoop as a Microsoft data professional? (And if you're an Oracle DBA reading this, I assume you know Linux; try playing with Hadoop, it doesn't bite.)

  • It’s always good to learn new skills
  • If you are using PowerShell already, the bash shell will make a lot of sense to you
  • All the big data cool kids are doing it
  • The more you keep thinking it’s a fad, the less relevant you become
  • It always helps to know what the enemy is up to

If you want to do this, go to Cloudera's site and download the Hadoop VM. It's running the aforementioned CentOS (which has a nice GUI, so don't be scared). When you are there, throw some files into the Hadoop file system and then query them using Hive; you'll find it's not that dissimilar from working with SQL. Distributed systems (see what I did there? I didn't say the BD phrase) are here, and as a leading data professional you should start to learn them, if for no other reason than to be able to explain where a relational database might be a better solution.
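If you want something concrete to try once the VM is up, here is a rough sketch; the table name, file path, and file layout are hypothetical, and it assumes you've already copied a tab-delimited file into HDFS (for example with hadoop fs -put):

-- Point an external Hive table at a delimited file sitting in HDFS
CREATE EXTERNAL TABLE sales (
    sale_date  STRING,
    product_id INT,
    amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/sales/';

-- Familiar territory if you write T-SQL all day
SELECT product_id, SUM(amount) AS total_sales
FROM sales
GROUP BY product_id;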