I’m starting with my last paragraph–this blog post is a really long read. I recommend going to the SQLPASS website and watching this keynote, you will learn a ton about NoSQL and how it works..
This was truly incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.
So this is the talk that all of the hardcore database folks have been looking forward to. Dr. David Dewitt, Technical Fellow at Microsoft will be presenting. These talks in the pass have been deep dives into the optimizer, and the math behind it. Speculation is that he”s going to be talking about Big Data.
We start out with Rob Farley and Buck Woody singing a duet about query performance–it was awesome.
We then heard about SQLPass and SQLRally dates for next year–the Summit is back in November the 9th through the 12th. SQL Rally will be May 8th-12th in Dallas. Additionally SQL Rally Nordic has been extremely successful and sold out.
Then Dr. Dewitt takes the stage to discuss Big Data. Starts out talking about the work Rimma Nehme did on the query optimizer for Parallel Data Warehouse. The good doctor talks about the pain of preparing his keynotes, something I think we all experience as speakers.
Talking about very big RDBMS systems–in about 2009 a Zetabyte (1,000,000 petabytes), and it’s expected to grow by a factor of 40!!! More data is generated by sensors–and not entered by hand, or it’s entered by a larger group (social media). Talks about the dramatic costs in hardware costs.
Then goes into the discussion about about how to manage big data. eBay manages 10PB across 256 nodes. Facebook on the other hand uses 2700 nodes to support 20PB. Bing is 150 PB on 40k nodes.
Talks about NoSQL–benefits include JSON, no schema first. Updates don’t really happen to this data. Lower upfront software costs. The time from converting insight to business intelligence can be lower. Data arrives, and isn’t put into a schema, and then check the application program.
Records are shared across nodes by a key–MongoDB, Cassandra, Windows Azure. Hadoop is designed for large amounts of data, with no data model, records. Talking about structured versus unstructured data, and how “unstructured data” has some structure.
Key value stores are OLTP, Hadoop more like Data Warehouse. He talked about eBay versus Facebook, and how much more CPU efficient a RDBMS can be.
Talking about the shift from hierarchical systems to relational, and how SQL is not going away. Talking about Hadoop and it’s ecosystem of software tools. This really all started at google–had to be reliable and cheap for PBs of clickstream data. HDFS is the file system and MapReduce is the process at Google for analyzing massive amounts of data.
Start talking about Hadoop Distributed File System–designed to be scalable to 1000s of nodes. Assumes that hardware and software failures are common. Targeted towards small numbers of very large files–written in Java and highly portable. Files are partitioned into big 64 mb chunks. Sits on top of NTFS.
Each block is replicated to nodes of the cluster, based on the replication factor. First copy is written to original node, second copy is written to another node in the same rack (to reduce network traffic), and the third is put in another rack or even another data center.
The name node has one instance per cluster, and is a single point of failure, but there is a backup node or checkpoint node which will take over in the event of failure. Name node always checks the state of the name nodes–much like the quorum or heartbeat in a regular cluster. It also tells the client to which node it wants to write the block to. The name node also returns block locations to the client.
Talks about types of failures–disk errors, data node failures, and switch/rack failures. Name node and data center failures. When a data node fails, it’s blocks are replicated to another node. The name nodes fails–it’s not an epic failure–automatically fails to the backup node. The file system does automatic load balancing, to evenly spread blocks amongst the nodes. In summary, it’s built to support 1000s of nodes and 100s of TBs. Large block sizes–this is designed for scanning not OLTP. No use of mirroring or RAID–the RAID comes in from the highly replicated blocks.
The negatives are that makes it’s impossible to employ many optimizations used successfully by RDBMS.
MapReduce–programming framework to analyze data sets in HDFS. User only writes map and reduce functions, the framework takes care of everything else. It takes a large problem and divides into a bunch of much smaller problems, and then perform the same function to all of the much smaller pieces. The reduce phase combines that output.
The components include a job tracker (which runs on the name node). Manages the job queues, and scheduling, it schedules the task tracker. Has task tackers, which execute individual task tracks (which run on the data nodes). Shows an example to sum sales by zip code–this is really great example of map reduce compared to SQL. Blocks are stored locally after the map operation (better performance for small writes than HDFS). The map reduce framework does the sorting of the operation, again the results are stored locally.
The worker’s load is distributed amongst the nodes, but data skew is still a problem, because the reducer can get stuck (example–large numbers of New York zip codes, versus say Iowa). This is highly fault tolerant. MR framework removes burden of dealing with failures from the programmer.
On the other hand, you can’t build indexes, constraints or views.
Now, we are talking about Hive and Pig–Facebook produced a SQL like language called and Yahoo produced Pig. These are pretty SQL like, to hide the hard work of MapReduce–it’s basically an abstraction. This looks almost exactly like the SQL we know and love, but with a schema definition on top. Every day facebook runs 150k warehouse jobs–only 500 are map reduce, the rest are HiveQL.
Column and data types are richer in Hadoop than SQL (columns can be structures, lists), and Hive tables can be partitioned. When you position a Hive table the partition name becoms what it’s partitioned by, so it’s not repeated (it’s taken out of the records).
In a simple TPC benchmark, Parallel Data Warehouse was 4-10x faster than Hive.
Now talking about Sqoop–to move data from unstructured universe into SQL. Not an efficient process.
Summarizes–relational databases vs hadoop. Relational databases and Hadoop are designed to meet different needs. Neither will be the only default. Feels like Enterprise Data Managers have the capability to merge the two worlds.
This was incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.