Why Linux and Hadoop Matter, and Why You Should Know Them
March 22, 2013
Most of my work in the last several years has been in the SQL Server space. I do a lot of presentations about making SQL Server highly available and making it perform better. My background does involve a lot of work in various RDBMSs on Linux and UNIX platforms, and those skills have served me well. I can translate between Oracle and SQL Server teams, as well as between Linux and Windows teams. I have always known Linux shell scripting, and it has been a real asset to me in learning PowerShell, as most of the concepts are the same (you are learning PowerShell, right? I couldn’t manage a large environment without it). The other thing that has come into my scope is NoSQL solutions, particularly Hadoop and wide-column datastores like HBase and Cassandra. So where am I going with this?
Big Data Isn’t Just a Fad…and It’s Not a Size Thing
Despite what some people say, Big Data and NoSQL are real things, with quantifiable benefits. I’m not going to repeat buzzwords about the 4 Vs, or talk about MapReduce or schema-less designs, but companies everywhere are looking at these solutions to handle a wide variety of analytic tasks that weren’t easily doable in the relational model. Note that I said easily doable. I have friends who are total superheroes in SQL tuning and system design, and they have the ability to make magic happen. However, most companies don’t have resources like them, so they can’t tune their data warehouse to perform those tasks. Or maybe they have a short-term need and just want to do some analysis in something like Amazon Elastic MapReduce (think Hadoop without all the hard stuff like building a 100-node cluster).
So with SQL Server jobs and activity at what seems like an all-time high (I’m averaging 3.5 recruiter emails a week), why should you care about this stuff? After all, relational databases aren’t going away anytime soon, and generally speaking they are part of any ecosystem involving Hadoop. Well, the industry is really moving towards highly available systems that scale horizontally, and RDBMSs don’t do that well. Also, you may have noticed that SQL Server licensing got really expensive with the introduction of SQL Server 2012, and if Oracle’s recent earnings report is any indication, businesses have noticed how expensive RDBMS licensing and support are, and are exploring other options. I know my large enterprise is. Eventually, throwing hardware and more licenses at the problem is going to be too slow and too expensive. Hadoop and Hive are free (you can buy support and there are some closed-source options, but the products themselves are free). Microsoft does have its own distribution in conjunction with Hortonworks, but it seems to be mostly a cloud-based solution, and even if you wanted to run it on-premises, the prospect of licensing a 100-node Windows cluster is daunting.
What Data Professionals Should Be Doing
So what are most shops running Hadoop and other open source solutions on? CentOS, which you can download for free. That’s right: free as in beer. CentOS is a community rebuild of the Red Hat Enterprise Linux operating system that comes without Red Hat’s commercial support, but it is widely adopted and has broad community support. Also, nearly all of the development work on these solutions is done on Linux. While I’m glad that Microsoft has embraced big data, I just don’t see the foothold of Hadoop running on Linux going away anytime soon.
The other interesting thing about the NoSQL model is that there isn’t really a place for the specialized, one-trick DBA. Work is typically performed by the systems admin (or cloud provider) and data analysts. So what does this mean for you as a data professional? You are probably the most qualified person in your company (assuming you don’t work at Facebook, LinkedIn, or Yahoo) to work with NoSQL solutions. You understand how data works, how metadata works, and how hardware works. All of these skills still matter in any model. So why do I think you should learn Linux and Hadoop as a Microsoft data professional? (And if you’re an Oracle DBA reading this, I assume you know Linux. Try playing with Hadoop; it doesn’t bite.)
- It’s always good to learn new skills
- If you are using PowerShell already, the bash shell will make a lot of sense to you
- All the big data cool kids are doing it
- The more you keep thinking it’s a fad, the less relevant you become
- It always helps to know what the enemy is up to
If you want to do this, go to Cloudera’s site and download the Hadoop VM. It’s running the aforementioned CentOS (which has a nice GUI, so don’t be scared). Once you are in, throw some files into the Hadoop file system, and then query them using Hive. You’ll find it’s not that dissimilar from working with SQL. Distributed systems (see what I did there? I didn’t say the BD phrase) are here, and as a leading data professional, you should start to learn them, if for no other reason than to be able to explain where a relational database might be a better solution.
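For a feel of what "throw some files in and query them" looks like, here is a minimal sketch in Python. It only builds and prints the commands; it does not run anything. The file name `sales.csv`, the HDFS directory, the table name, and the two-column layout are all made up for illustration, and the sketch assumes the `hadoop` and `hive` command-line tools are on the VM's path.

```python
import shlex

# Hypothetical names, used only for illustration.
LOCAL_FILE = "sales.csv"                 # a comma-separated file: region,amount
HDFS_DIR = "/user/cloudera/sales_demo"   # assumed target directory in HDFS
TABLE = "sales_demo"                     # assumed Hive table name


def build_commands(local_file=LOCAL_FILE, hdfs_dir=HDFS_DIR, table=TABLE):
    """Return the commands as argument lists, without running them."""
    # An external table points Hive at files that already sit in HDFS.
    create_table = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} "
        "(region STRING, amount DOUBLE) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{hdfs_dir}'"
    )
    # The query itself should look familiar to anyone who writes SQL.
    query = f"SELECT region, SUM(amount) AS total FROM {table} GROUP BY region"
    return [
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],      # create the directory
        ["hadoop", "fs", "-put", local_file, hdfs_dir],  # copy the file in
        ["hive", "-e", f"{create_table}; {query};"],     # define table, query it
    ]


if __name__ == "__main__":
    # Print shell-ready commands so they can be reviewed before pasting.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Printing rather than executing is deliberate: you can read each command, adjust the paths to match your own VM, and paste them into a terminal one at a time.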