Why Linux and Hadoop Matter, and Why You Should Know Them

Most of my work in the last several years has been in the SQL Server space; I give a lot of presentations about making SQL Server highly available and better performing. My background also includes a lot of work with various RDBMSs on Linux and UNIX platforms, and those skills have served me well. I can translate between Oracle and SQL Server teams, as well as between Linux and Windows teams. I have always known Linux shell scripting, and it has been a real asset in learning PowerShell, as most of the concepts are the same (you are learning PowerShell, right? I couldn’t manage a large environment without it). The other thing that has come into my scope is NoSQL solutions, particularly Hadoop and column-family datastores like HBase and Cassandra. So where am I going with this?

Big Data Isn’t Just a Fad…and It’s Not a Size Thing

Despite what some people say, Big Data and NoSQL are real things with quantifiable benefits. I’m not going to repeat buzzwords about the 4 Vs or talk about MapReduce or schemaless data, but companies everywhere are looking at these solutions to handle a wide variety of analytic tasks that weren’t easily doable in the relational model. Note that I said easily doable: I have friends who are total superheroes in SQL tuning and system design, and they can make magic happen. However, most companies don’t have resources like them, so they can’t tune their data warehouse to perform those tasks. Or maybe they have a short-term need and just want to do some analysis in something like Amazon Elastic MapReduce (think Hadoop without all the hard stuff like building a 100-node cluster).

So with SQL Server jobs and activity at what seems like an all-time high (I’m averaging 3.5 recruiter emails a week), why should you care about this stuff? After all, relational databases aren’t going away anytime soon, and generally speaking they are part of any ecosystem involving Hadoop. Well, the industry is really moving towards highly available systems that scale horizontally, and RDBMSs don’t do that well. Also, you may have noticed that SQL Server licensing got really expensive with the introduction of SQL Server 2012, and if Oracle’s recent earnings report is any indication, businesses have noticed how expensive RDBMS licensing and support are, and are exploring other options. I know my large enterprise is. Eventually, throwing hardware and more licenses at the problem is going to be too slow and too expensive. Hadoop and Hive are free (you can buy support and there are some closed source options, but the products themselves are free). Microsoft does have its own distribution in conjunction with Hortonworks, but it seems to be mostly a cloud-based solution, and even if you wanted to go on-premises, the prospect of licensing a 100-node Windows cluster is daunting.

What Data Professionals Should Be Doing

So what are most shops running Hadoop and other open source solutions on? CentOS, which you can download here for free. That’s right: free as in beer. CentOS is a community rebuild of Red Hat Enterprise Linux that comes without Red Hat’s commercial support, but it is widely adopted and has broad community support. Also, nearly all of the development work on these solutions is done on Linux. While I’m glad that Microsoft has embraced big data, I just don’t see the foothold of Hadoop running on Linux going away anytime soon.

The other interesting thing about the NoSQL model is that there isn’t really a place for the specialized, one-trick DBA. Work is typically performed by the systems admin (or cloud provider) and data analysts. So what does this mean for you as a data professional? You are probably the most qualified person in your company (assuming you don’t work at Facebook, LinkedIn, or Yahoo) to work with NoSQL solutions. You understand how data works, you understand metadata, and you understand how hardware works. All of these skills still matter in any model. So why do I think you should learn Linux and Hadoop as a Microsoft data professional? (And if you’re an Oracle DBA reading this, I assume you know Linux. Try playing with Hadoop; it doesn’t bite.)

  • It’s always good to learn new skills
  • If you are using PowerShell already, the bash shell will make a lot of sense to you
  • All the big data cool kids are doing it
  • The more you keep thinking it’s a fad, the less relevant you become
  • It always helps to know what the enemy is up to

If you want to do this, go to Cloudera’s site and download the Hadoop VM. It runs the aforementioned CentOS (which has a nice GUI, so don’t be scared). Once you are in, throw some files into the Hadoop file system, and then query them using Hive. You’ll find it’s not that dissimilar from working with SQL. Distributed systems (see what I did there? I didn’t say the BD phrase) are here, and as a leading data professional, you should start to learn them, if for no other reason than to be able to explain where a relational database might be a better solution.
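To give a feel for the "load files, then query with Hive" step, here is a minimal sketch in Python that only builds the commands you would run on the VM; it does not execute anything. The file name `events.csv`, the HDFS directory `/user/demo/events`, the table name `events`, and its two columns are all made up for illustration. The only real pieces assumed are the `hadoop fs -mkdir -p` and `hadoop fs -put` shell commands, the `hive -e` option, and Hive's `CREATE EXTERNAL TABLE` syntax.

```python
import shlex


def hdfs_put_commands(local_file, hdfs_dir):
    """Build the shell commands that copy a local file into HDFS."""
    return [
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],      # create the target directory
        ["hadoop", "fs", "-put", local_file, hdfs_dir],  # copy the file into it
    ]


def hive_external_table_ddl(table, hdfs_dir):
    """Build HiveQL that maps a table onto the comma-delimited files in hdfs_dir.

    The column list is hypothetical; table and hdfs_dir are trusted constants here.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} "
        "(event_time STRING, channel STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{hdfs_dir}'"
    )


def hive_command(query):
    """Build the command line that runs one HiveQL statement."""
    return ["hive", "-e", query]


if __name__ == "__main__":
    steps = hdfs_put_commands("events.csv", "/user/demo/events")
    steps.append(hive_command(hive_external_table_ddl("events", "/user/demo/events")))
    steps.append(hive_command("SELECT channel, COUNT(*) FROM events GROUP BY channel"))
    for step in steps:
        print(shlex.join(step))  # print each command instead of running it
```

The last query is the point of the exercise: once the table is defined over the files, the `SELECT ... GROUP BY` looks just like the SQL you already write.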

NJ SQL — Data Compression Presentation 3/19

The Microsoft White Paper on Data Compression is here.

Columnstore index information is here.

Awesome Use of SQL Injection

We don’t have a lot of speed cameras in the US, but they are a plague across Europe. While working in Switzerland, I once got a ticket for going 4 kph (2.5 mph) over the speed limit. Someone emailed me this photo, and I approve wholeheartedly.

[Photo: 20130321-104013.jpg]

In case you don’t see it, the driver has replaced his license plate with one that reads as a plate number followed by a drop database command. I’m guessing the speed cameras use OCR to read the license plate, and they probably also use a default database name from a vendor.
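For anyone who hasn't seen why this trick could work, here is a small sketch using Python's built-in `sqlite3` module. Everything in it is an assumption for illustration: the plate text is made up (it is not the plate in the photo), the `plates` table is hypothetical, and since SQLite has no drop database command the payload drops a table instead. I have no idea how real speed camera software is written.

```python
import sqlite3

# Hypothetical text as an OCR system might read it off a plate.
PLATE = "ABC123'); DROP TABLE plates; --"


def new_db():
    """Create an in-memory database with a made-up `plates` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plates (plate TEXT)")
    return conn


def table_exists(conn, name="plates"):
    """Return True if the named table is still present."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def insert_unsafe(conn, plate):
    # Vulnerable: the OCR text is pasted straight into the SQL string,
    # so the quote in the plate ends the value and the rest runs as SQL.
    conn.executescript(f"INSERT INTO plates (plate) VALUES ('{plate}');")


def insert_safe(conn, plate):
    # Parameterized: the OCR text is passed as data and never parsed as SQL.
    conn.execute("INSERT INTO plates (plate) VALUES (?)", (plate,))


if __name__ == "__main__":
    db = new_db()
    insert_unsafe(db, PLATE)
    print("table survives unsafe insert:", table_exists(db))

    db = new_db()
    insert_safe(db, PLATE)
    print("table survives safe insert:", table_exists(db))
```

The unsafe version loses the table; the parameterized version just stores a very odd-looking plate number.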

SQL Saturday Richmond Resources

Resources from my presentation at SQL Saturday Richmond.

Slides:

Sales Reps–Please Don’t BS Me, Alright?

Today is my morning of big data storage events; I’m attending two from two different vendors in about four hours. One down so far, and it was pretty good, until…

I’ve bashed sales reps before (on Twitter and on this blog), and I’ve even offered lists of things not to do. Well, today’s presentation was on par with some of the best I’ve seen. We had a good discussion of the architecture of Hadoop and the kinds of data applications where it really makes sense. I was engaged, and I wasn’t bashing the vendor on Twitter like I sometimes do.

But Then,

The vendor had a slide with the Hadoop ecosystem up. There are a lot of components there, and they aren’t all needed. I thought a really good comparison would be to SQL Server: we don’t always need replication or Analysis Services installed, but if we want to have a database we need the engine. Hadoop is a lot like that; you can get by with just a few components out of the total stack.

At that moment the presenter mentioned SQL Server, and I thought, great, this will be a really good example. Then he asked, “What is the core engine of SQL Server?” (The right answer, I think, is Sybase, and then the engine was rewritten for SQL Server 7.0, iirc; someone correct me if I’m way off.) He eventually answered his own question with “Jet Database,” using the example that you can install SQL Server without installing Jet. As far as I can tell, and from my Twitter queries, SQL Server has never run on Jet, but Jet may run on SQL Server now.

Anyway, the trivia isn’t the point. If you are quoting a fact in your presentation, be certain of it, and if you aren’t, either don’t use that fact or qualify it by saying “I think this to be the truth, but I’m open to facts”. After this, A) I didn’t trust the speaker’s credibility and B) I was distracted trying to confirm that Jet was never a part of SQL Server.

I guess I can add one more thing for sales reps not to do–don’t make $&%# up, you may have a subject matter expert in the room, and you will look like an idiot.

PASS Business Analytics Conference—Why Am I Presenting There?

The PASS Business Analytics Conference is a new concept for PASS. We’ve seen Business Intelligence (BI) user groups and even SQL Saturdays dedicated to this subset of PASS, but a whole conference? What is driving this demand? I can’t explain the whole industry, but I can at least provide some perspective on what I see from my window.

I don’t intend to start a debate between relational databases and NoSQL datastores; that’s a religious war I have no intention of jumping into. I’m also not going to abuse the term big data, or data in combination with some body of water (data pond, data lake, data ocean, etc. Seriously, who comes up with this stuff?). What I will talk about is how a relational database isn’t always the right answer for every data set, and how relational databases from major vendors (especially with enough cores to do serious analytic workloads) are REALLY EXPENSIVE. So, especially since a lot of my expertise is in infrastructure-based solutions, how did I end up presenting at BaCON?

My organization sees the changing landscape of data, and we generate and save TONS of data. We’re not always choosing the best path for our architecture. So, given that I’m on the architecture team, I started investigating some alternative solutions like Hadoop and Hive for less structured, non-transactional data. To make it easier to learn this stuff, it helped to have a use case that I could take from start to finish. I’m not by any means an expert in data analysis, but I am fortunate to be presenting with a great friend who is: Stacia Misner (b|t). So what are we going to talk about at BaCON?

Our data set represents about a week’s worth of set-top box data from the largest cable provider in the US. We are going to discuss our data source and how we used Hadoop, and then Hive, to perform multiple types of analysis on the data in an extremely nimble fashion. From there, using Power View and some other tools, we look at the impact of various events on metrics such as viewer engagement and channel preferences.

For those of you who are SQL Server and/or Oracle professionals, this is a brave new world, but think of it like learning a new version of something. You are building on an existing skill set; you already do tons of data analysis in your job. This is just another step in the process, and it will be part of the skill set of the 21st century data professional.

SQL Saturday Tampa Resources

Thank you for attending my presentation at SQL Saturday #192 in Tampa.

The slides for the presentation are located here:

Link to Matt Velic’s blog on building your own virtual cluster here.

Understanding Quorum Configs in a Failover Cluster
