Presenting at SQL Saturday 96 (Washington DC)

I’ll be presenting two sessions at SQL Saturday #96 in Washington, DC (well actually Chevy Chase, MD) on November 6. If you are in the mid-Atlantic and looking to get a great day of SQL Server education, please signup.

The two topics I will be presenting on are Virtualization for DBAs and SANs for DBAs.

In the virtualization talk I will cover:

  • Basics of Virtualization (Hypervisors, Hosts, and Guests)
  • Advanced Features of Virtualization (vMotion, Live Migration, and DR)
  • Potential Gotcha’s of running your database servers in a Virtual Environment
  • Things you need to configure in your DB servers for Virtual Environments

In the SAN talk I will cover:

  • How databases interact with storage
  • Why memory is faster than a hard drive
  • RAID levels and what they mean for databases
  • Why a SAN is like living in a New York highrise
  • How to talk to you SAN administrator

All in all it should be a great day of training–there will be many top notch speakers and MVPs in attendance. I’m honored to be speaking.

PASS Wrapup

The best thing I can tell you about the PASS Summit is my sleeping schedule since I’ve been home–on Saturday night I slept for about 11 hours, Sunday was more normal at 9, and then Monday night was big, with me going to bed at 8 and waking up around 7. So, now that I’m finally caught up with my sleep I can tell you a little more about the Summit.

I’ve been to other database conferences before, but this was my first trip to the PASS Summit, and I can tell you there is an energy and vibe there like no other. That and the learning is absolutely top notch–the people who write the books and the software are there, and eager to share what they know. I’m not going to recap bit by bit everything I went through, but I definitely learned something in every session I went to, I made a ton of great connections and all in all I had a blast.

There is no better learning, networking or fun value in all in of IT. The SQL Server community is a family, and really gets and understands the value of community, I didn’t have a meal by myself, and I spent so little time in my hotel room, I requested a discount.  Do whatever it takes, but if you are a SQL Server professional, get yourself to Seattle in 2012.

PASS Day Three Keynote–Dr. DeWitt

I’m starting with my last paragraph–this blog post is a really long read. I recommend going to the SQLPASS website and watching this keynote, you will learn a ton about NoSQL and how it works..

This was truly incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.

So this is the talk that all of the hardcore database folks have been looking forward to. Dr. David Dewitt, Technical Fellow at Microsoft will be presenting. These talks in the pass have been deep dives into the optimizer, and the math behind it. Speculation is that he”s going to be talking about Big Data.

We start out with Rob Farley and Buck Woody singing a duet about query performance–it was awesome.

We then heard about SQLPass and SQLRally dates for next year–the Summit is back in November the 9th through the 12th. SQL Rally will be May 8th-12th in Dallas. Additionally SQL Rally Nordic has been extremely successful and sold out.

Then Dr. Dewitt takes the stage to discuss Big Data. Starts out talking about the work Rimma Nehme did on the query optimizer for Parallel Data Warehouse. The good doctor talks about the pain of preparing his keynotes, something I think we all experience as speakers.

Talking about very big RDBMS systems–in about 2009 a Zetabyte (1,000,000 petabytes), and it’s expected to grow by a factor of 40!!! More data is generated by sensors–and not entered by hand, or it’s entered by a larger group (social media). Talks about the dramatic costs in hardware costs.

Then goes into the discussion about about how to manage big data. eBay manages 10PB across 256 nodes. Facebook on the other hand uses 2700 nodes to support 20PB.  Bing is 150 PB on 40k nodes.

Talks about NoSQL–benefits include JSON, no schema first. Updates don’t really happen to this data. Lower upfront software costs. The time from converting insight to business intelligence can be lower. Data arrives, and isn’t put into a schema, and then check the application program.

Records are shared across nodes by a key–MongoDB, Cassandra, Windows Azure.  Hadoop is designed for large amounts of data, with no data model, records. Talking about structured versus unstructured data, and how “unstructured data” has some structure.

Key value stores are OLTP, Hadoop more like Data Warehouse. He talked about eBay versus Facebook, and how much more CPU efficient a RDBMS can be.

Talking about the shift from hierarchical systems to relational, and how SQL is not going away. Talking about Hadoop and it’s ecosystem of software tools. This really all started at google–had to be reliable and cheap for PBs of clickstream data.  HDFS is the file system and MapReduce is the process at Google for analyzing massive amounts of data.

Start talking about Hadoop Distributed File System–designed to be scalable to 1000s of nodes. Assumes that hardware and software failures are common. Targeted towards small numbers of very large files–written in Java and highly portable. Files are partitioned into big 64 mb chunks. Sits on top of NTFS.

Each block is replicated to nodes of the cluster, based on the replication factor. First copy is written to original node, second copy is written to another node in the same rack (to reduce network traffic), and the third is put in another rack or even another data center.

The name node has one instance per cluster, and is a single point of failure, but there is a backup node or checkpoint node which will take over in the event of failure.  Name node always checks the state of the name nodes–much like the quorum or heartbeat in a regular cluster. It also tells the client to which node it wants to write the block to. The name node also returns block locations to the client.

Talks about types of failures–disk errors, data node failures, and switch/rack failures. Name node and data center failures. When a data node fails, it’s blocks are replicated to another node. The name nodes fails–it’s not an epic failure–automatically fails to the backup node. The file system does automatic load balancing, to evenly spread blocks amongst the nodes. In summary, it’s built to support 1000s of nodes and 100s of TBs. Large block sizes–this is designed for scanning not OLTP. No use of mirroring or RAID–the RAID comes in from the highly replicated blocks.

The negatives are that makes it’s impossible to employ many optimizations used successfully by RDBMS.

MapReduce–programming framework  to analyze data sets in HDFS. User only writes map and reduce functions, the framework takes care of everything else. It takes a large problem and divides into a bunch of much smaller problems, and then perform the same function to all of the much smaller pieces. The reduce phase combines that output.

The components include a job tracker (which runs on the name node). Manages the job queues, and scheduling, it schedules the task tracker. Has task tackers, which execute individual task tracks (which run on the data nodes). Shows an example to sum sales by zip code–this is really great example of map reduce compared to SQL. Blocks are stored locally after the map operation (better performance for small writes than HDFS). The map reduce framework does the sorting of the operation, again the results are stored locally.

The worker’s load is distributed amongst the nodes, but data skew is still a problem, because the reducer can get stuck (example–large numbers of New York zip codes, versus say Iowa). This is highly fault tolerant. MR framework removes burden of dealing with failures from the programmer.

On the other hand, you can’t build indexes, constraints or views.

Now, we are talking about Hive and Pig–Facebook produced a SQL like language called  and Yahoo produced Pig. These are pretty SQL like, to hide the hard work of MapReduce–it’s basically an abstraction. This looks almost exactly like the SQL we know and love, but with a schema definition on top. Every day facebook runs 150k warehouse jobs–only 500 are map reduce, the rest are HiveQL.

Column and data types are richer in Hadoop than SQL (columns can be structures, lists), and Hive tables can be partitioned. When you position a Hive table the partition name becoms what it’s partitioned by, so it’s not repeated (it’s taken out of the records).

In a simple TPC benchmark, Parallel Data Warehouse was 4-10x faster than Hive.

Now talking about Sqoop–to move data from unstructured universe into SQL. Not an efficient process.

Summarizes–relational databases vs hadoop. Relational databases and Hadoop are designed to meet different needs. Neither will be the only default. Feels like Enterprise Data Managers have the capability to merge the two worlds.

This was incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.

Salary Data–Where to Get It..

So in yesterday’s WIT luncheon, there was a bit of discussion on salary data. There was even a discussion on going to your local HR department, to find out the range for your position. DON’T FREAKING DO THAT!!!! This could be a whole another topic, but you don’t want to send a loud signal to your HR organization that you are shopping jobs. That’s as much of a career limiting move, as not having a backup.

Here’s the link the IRS data on the average (median, mean, and top split) data for DBAs. it’s broken down by region–from my experience with the data across a variety of IT positions, and it has been reasonably accurate.

It covers SQL, and those other databases, but should give you a good starting point for negotiations. And remember–salary is always negotiable, even if your unemployed–if you don’t get that money up front, you will never see it.

What DBA’s do, and how much they make.

SQL Pass–Day 2 Keynote

Bill Graziano took the stage in a lovely green kilt, with proper socks. Talking about growth in PASS outside of North America. For those of you not directly involved, PASS has been making a big (and successful) push to grow regions particularly outside in Europe and Asia. Lori Edwards (twitter) is the winner of the 2011 PASSion award for outstanding volunteer. Bill discussing PASS financials. Revenue has grown 45% (mostly from Summit), and expenditures to chapters have grown by 105%.
Lots of hardware on stage. Quentin Clark, Corporate Vice President for SQL Server from Microsoft takes the stage, as we get videos of attendees discussing some of the benefits of the new features in SQL Server 2012. He will be talking about what’s coming in SQL Server 2012. Slide up with the vision–any data, any size, anywhere. Connecting the World’s data–I feel like Microsoft with it’s new Data Store, may be opening up to competing with Google on data.
SQL Azure is powered by SQL Server 2012 codebase. Integration Services as a server, HA for stream insight. Additionally, discussing SQL 2012 Always On. I’ll be blogging about that more here in the near future. Bob Harrison, VP of Interlink Transport Technologies, number two import/export in the world takes the stage to discuss, their HA solution. Discussed their Mission Critical systems running on SQL Server. Their primary databases in New York, DR is in New Jersey. They then discussed AlwaysOn Availability Groups which allow up to 4 readable copies of a database, and allows for database to be grouped together. Also, they displayed the availability monitoring solution.
Now showing reporting off a read only copy of the database–we could do this before with a snapshot of a mirror, but this is way better. This will probably be an Enterprise Edition feature (I have no NDA–this is just speculation on my part).
Talking about ColumnStore indexes–this a feature that flattens tables to improve performance. This is a big win for analytic workloads. He moved into PowerView and PowerPivot, some of the BI that integrate Excel, Analysis Services, and Sharepoint. These seem good, but don’t seem to happen much in the Fortune 500, as firms tend to stick with their ERP vendor for analytics.
Now talking about the BI Semantic model, and Data Quality. Just coming off of the SAP project, I’m curious to see of Data Quality Services (new in 2012) and Master Data Services can be a real competitor to Business Objects Data Services. Lara (@sqlgal) is demoing SharePoint reporting. She builds a columnstore index to try to improve performance on a slow running report. MDM allows mapping to Azure Data Marketplace, and will do data correction. I have to say–this looks better than BOBJ-DS. And it’s adaptive–it has an intelligent engine. After building a columnstore index, performance on the query goes from 47 seconds to .3 seconds.

We then see some of the data quality monitoring features that are baked into Master Data Services.

Talking about compliance–a subject that is near and dear to my heart about 9 years in health care. SQL 2012 has user-defined auditing, as well as user-defined server roles. Frankly, this has been a big hole in SQL for a while in my opinions (especially for my dev servers)

Distributed testing–this will allow for workloads to be tested. Discusses SCOM, and a cloud based Premier Mission Critical support services.
Now we move onto the PDW solutions. This is an appliance based solution, that is provided from HP or Dell, and allows for massively parallel processing. You work with an implementer and Microsoft to do this. Originally, it was really expensive, but now Microsoft is providing some options that may be suitable to smaller shops, especially with the Data Warehouse Appliance–these go from full racks all the way down to 1U. These devices only provide Network, Power and Security info. From the box to loading data, this a 20 minute process.

Shows the HP Database Consolidation Appliance–this can provide a big private cloud.

Talking about ODBC drivers to Linux, and Change Data Capture for SSIS & Oracle. These have been requested for a long time. Finally—seriously we’ve need this for 10 years now.
Micheal Rys took the stage to demonstrate a visualization based on the Semantic Search feature in 2012. Using a file table to do semantic search–this does language processing. He did a very good demo around and actually zoomed in on his code.
Next we saw how to deploy a DACPAC to Azure. Additionally, we can now backup Azure databases to Windows Azure storage. This should have been in SQL Azure from the beginning, and is a good feature add. Also, the data sync is moved into SSMS–again this should have been there sooner, I was using a tool from CodePlex for this functionality before. Discussing Federations in SQL Azure, which will allow your domain to be joined to MS–for Domain Based Authentication and Sharding. Microsoft renounced data sync, which will allow for actual DR scenarios in Azure.

What PASS Has Meant to Me and My Career

On the eve of leaving for my first PASS Summit, I wanted to talk a little bit about what the SQL Server community has meant to me and my career. In my older roles at Wyeth, I was largely an Oracle DBA, but I had always dabbled in SQL Server since the beginning of my career. The position I was hired into at Synthes would require me to learn a lot more about SQL Server, since their environment was mostly SQL. So, I decided to start going to the Philadelphia SQL Server User Group meeting in the area (if your reading this from Philly–go to the website and sign up for the next meeting), and around the same time, I got involved with twitter, and found Brent Ozar (blog|twitter), which eventually led me into the much broader SQL community.

I submitted my first presentation that fall, to IOUG (Oracle’s version of PASS), and my first major presentation was at their conference in April. In the meantime, I had done a couple of presentations at our SQL group, and more importantly I was learning a ton more about the platform. Various code camps, videos at SSWUG, SQL Saturday’s and a SQL Rally later, I’m fairly confident in my speaking skills, and a few of you actually read my blog. Unfortunately, I’m not presenting at the PASS Summit this year, because around the time for submissions, I didn’t know I was going to be going.

One refrain I see tossed about, is that it really doesn’t matter if a person has 10 years of experience, if that person has the same year of experience 10 times. Getting involved in the community can really help you avoid that rut–you will be exposed to a wide variety of presentations on topics you might not have seen before, and if you make the leap to presenting, you’ll dig far deeper into topics than you might have ordinarily in your job,

So how does this tie back to my career? As the SAP project I was on  really started to suck (and I really figured this out while talking to colleagues at SQL Rally), I started looking for a new role, and I started talking to a couple of companies about DBA roles. It really helps the interview process, when you can respond to a question with, “oh I have an article about that on my blog” or “I did a presentation on that”, really helps.

So a big thank you to those who make up the community, your dedication and hard work has made a much better technologist (not just DBA) than I was three years ago.

Lastly, and the really great thing, is my new role (which is really awesome!) basically happened because of my speaking and blogging work I had done. Get out to a user group meeting, or a SQL Saturday! Stay active in the community, and see you at PASS!

Meme Monday

This Meme Monday is hosted by Tom Larock (blog|twitter) and the subject is the worst way you have ever been asked for help.

Well my last project had some nightmarish moments, like getting called at midnight for a minor problem, or having a 2 am conference call.

But one developer really took the cake with the question-“the database is slow”. After looking at wait stats and finding nothing (it turned out to be a network/app issue-too many round trips), I gave the instance more memory. And wrote back saying I thought it was an app problem.

His response-“did you increase the performance in the database yet?”

%d bloggers like this: