Here are the slides from my presentation at the Philadelphia SQL Server User’s group last night.
Also, if you have any comments, please leave them here.
I’ll be presenting at the Philadelphia SQL Server User’s Group at Microsoft Malvern on Wednesday, December 7. The topic will be a new feature in SQL Server 2012–Always On Availability Groups.
I’ll give an overview of the existing DR and HA technologies currently available with SQL 2008, talk about what’s new in 2012 Clustering, and then move on to the star of the show: Availability Groups. I’ll demo building an Availability Group from scratch and walk through the failover process.
These are my presentations from SQL Saturday in Washington. I’ll post a follow up on a couple of the questions I had in the presentation later this week.
So there are a ton of things I could talk about with this–helping each other remotely, being stuck, bored in a hotel room in another country and assisting with a restore for entertainment, or the fact that if I travel to almost anywhere I have someone to hang out with, but there’s more.
The night I got back from the PASS Summit (after a wonderfully spectacular time), I was greeted by my parents’ news that my dad would be having heart surgery the following week. I was pretty annoyed at the time (they had known for a week and didn’t call), but more concerned.
So I ended up having to fly to New Orleans the next weekend and my dad’s surgery was on the following Tuesday. I tweeted about it, and the outpouring of thoughts, prayers, and emails was incredible. My dad isn’t fully recovered yet, but I’m still thankful for everyone’s thoughts and prayers.
I even spent my birthday, “celebrating” over a pink bubbly beverage with a couple of “family members”.
We truly are a very special community. And one that I am very proud to be a part of.
Last week one of the guys on my DBA team asked me if I knew of a way to identify which data from a table is in a particular data file. SQL Server (when it has multiple data files) typically does a proportional fill of data files–it will try to write an even number of pages to each of the data files. If a new file is added, the engine will continue to add pages there until the fill levels are about even.
The reason this question came up was that we had a rather large table and two data files, one of which was pretty empty, so we wanted to see how much data was left in the second file (these tables are refilled pretty regularly, so it might have been possible to get all of the data out and drop the second data file).
Unfortunately, in SQL 2008 R2 (and below) there is no documented method of doing these checks. The process involves dumping DBCC IND into a temp table and running a select statement against it (see the code below):
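if object_id('tempdb..#dbcc_indresults') is not null
    drop table #dbcc_indresults;
-- Column list assumed from the commonly published (undocumented) DBCC IND result set
create table #dbcc_indresults (
    PageFID smallint, PagePID int, IAMFID smallint, IAMPID int,
    ObjectID int, IndexID int, PartitionNumber int, PartitionID bigint,
    iam_chain_type varchar(30), PageType tinyint, IndexLevel tinyint,
    NextPageFID smallint, NextPagePID int, PrevPageFID smallint, PrevPagePID int);
insert into #dbcc_indresults exec('dbcc ind (''pagesplittest'', ''testtable'', 1)');
-- Count pages per data file by matching each page's file id to sys.database_files
select t1.name as 'File Name', count(*) as 'Pages' from #dbcc_indresults t2
join sys.database_files as t1 on t2.PageFID = t1.file_id
group by t2.PageFID, t1.name;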
However (and I have to credit this find to this weekend’s Northeastern snow storm, as I was without cable and internet, so I didn’t have much to do except play with my local SQL Server and read), there is a new system function named sys.dm_db_database_page_allocations built into SQL Server 2012. It accepts the object_id as a parameter (similar to sys.dm_db_index_physical_stats) and brings us a lot of good information about the pages within a given object. This data was available through various commands in older versions of SQL Server, but this brings it together in one place. The code below will bring back the number of pages per data file for a given object.
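-- 245575913 is the table's object id, passed directly because object_id() expects a name
select t1.name as 'File Name', count(*) as 'Pages' from sys.dm_db_database_page_allocations
(db_id(), 245575913, null, null, 'detailed') as t2
join sys.database_files as t1 on t2.extent_file_id = t1.file_id
group by t2.extent_file_id, t1.name;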
Only one problem I ran into: the page counts didn’t match the older method of using DBCC IND. So I asked Paul Randal (blog|twitter) for his thoughts on the matter–he had me check sys.dm_db_index_physical_stats, which nearly matched DBCC IND (DBCC IND accounts for the IAM page).
So either my test data is weird (I did perform this test twice after wiping out and recreating the database) or there is a bug in sys.dm_db_database_page_allocations. I opened a Connect bug with Microsoft on it, so hopefully I will get some feedback soon.
I’ll be presenting two sessions at SQL Saturday #96 in Washington, DC (well actually Chevy Chase, MD) on November 6. If you are in the mid-Atlantic and looking to get a great day of SQL Server education, please sign up.
The two topics I will be presenting on are Virtualization for DBAs and SANs for DBAs.
In the virtualization talk I will cover:
In the SAN talk I will cover:
All in all it should be a great day of training–there will be many top notch speakers and MVPs in attendance. I’m honored to be speaking.
The best thing I can tell you about the PASS Summit is my sleeping schedule since I’ve been home–on Saturday night I slept for about 11 hours, Sunday was more normal at 9, and then Monday night was big, with me going to bed at 8 and waking up around 7. So, now that I’m finally caught up with my sleep I can tell you a little more about the Summit.
I’ve been to other database conferences before, but this was my first trip to the PASS Summit, and I can tell you there is an energy and vibe there like no other. That and the learning is absolutely top notch–the people who write the books and the software are there, and eager to share what they know. I’m not going to recap bit by bit everything I went through, but I definitely learned something in every session I went to, I made a ton of great connections and all in all I had a blast.
There is no better learning, networking or fun value in all of IT. The SQL Server community is a family, and really gets and understands the value of community. I didn’t have a meal by myself, and I spent so little time in my hotel room that I requested a discount. Do whatever it takes, but if you are a SQL Server professional, get yourself to Seattle in 2012.
I’m starting with my last paragraph–this blog post is a really long read. I recommend going to the SQLPASS website and watching this keynote; you will learn a ton about NoSQL and how it works.
This was truly incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.
So this is the talk that all of the hardcore database folks have been looking forward to. Dr. David DeWitt, Technical Fellow at Microsoft, will be presenting. These talks in the past have been deep dives into the optimizer and the math behind it. Speculation is that he’s going to be talking about Big Data.
We start out with Rob Farley and Buck Woody singing a duet about query performance–it was awesome.
We then heard about SQLPass and SQLRally dates for next year–the Summit is back in November the 9th through the 12th. SQL Rally will be May 8th-12th in Dallas. Additionally SQL Rally Nordic has been extremely successful and sold out.
Then Dr. DeWitt takes the stage to discuss Big Data. He starts out talking about the work Rimma Nehme did on the query optimizer for Parallel Data Warehouse. The good doctor talks about the pain of preparing his keynotes, something I think we all experience as speakers.
Talking about very big RDBMS systems–around 2009 there was about a zettabyte (1,000,000 petabytes) of data, and it’s expected to grow by a factor of 40!!! More data is generated by sensors–and not entered by hand, or it’s entered by a larger group (social media). Talks about the dramatic change in hardware costs.
Then goes into the discussion about how to manage big data. eBay manages 10PB across 256 nodes. Facebook on the other hand uses 2700 nodes to support 20PB. Bing is 150 PB on 40k nodes.
Talks about NoSQL–benefits include JSON and no schema first. Updates don’t really happen to this data. Lower upfront software costs. The time from converting insight to business intelligence can be lower. Data arrives and isn’t put into a schema; it is then interpreted by the application program.
Records are sharded across nodes by a key–MongoDB, Cassandra, Windows Azure. Hadoop is designed for large amounts of data, with no data model, records. Talking about structured versus unstructured data, and how “unstructured data” has some structure.
Key value stores are OLTP, Hadoop more like Data Warehouse. He talked about eBay versus Facebook, and how much more CPU efficient an RDBMS can be.
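To put rough numbers on the eBay versus Facebook comparison, here is a back-of-the-envelope sketch in Python that divides the data volumes quoted above by the node counts. The decimal conversion (1 PB = 1,000 TB) and the helper name `tb_per_node` are my own assumptions, and data per node is only a crude proxy for efficiency, not something measured in the talk.

```python
# Cluster sizes as quoted in the keynote notes: (petabytes, nodes)
clusters = {"eBay": (10, 256), "Facebook": (20, 2700), "Bing": (150, 40_000)}

TB_PER_PB = 1000  # assumption: decimal units, matching 1,000,000 PB per ZB


def tb_per_node(petabytes, nodes):
    """Average terabytes of data each node is responsible for."""
    return petabytes * TB_PER_PB / nodes


per_node = {name: tb_per_node(pb, n) for name, (pb, n) in clusters.items()}

for name, tb in per_node.items():
    print(f"{name}: {tb:.1f} TB per node")
```

On these figures eBay carries roughly five times as much data per node as Facebook, which lines up with the point about the RDBMS doing more with less hardware.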
Talking about the shift from hierarchical systems to relational, and how SQL is not going away. Talking about Hadoop and its ecosystem of software tools. This really all started at Google–it had to be reliable and cheap for PBs of clickstream data. HDFS is the file system and MapReduce is the process at Google for analyzing massive amounts of data.
Starts talking about the Hadoop Distributed File System–designed to be scalable to 1000s of nodes. Assumes that hardware and software failures are common. Targeted towards small numbers of very large files–written in Java and highly portable. Files are partitioned into big 64 MB chunks. Sits on top of NTFS.
Each block is replicated to nodes of the cluster, based on the replication factor. First copy is written to original node, second copy is written to another node in the same rack (to reduce network traffic), and the third is put in another rack or even another data center.
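As a rough illustration of the chunking and replication described above, this Python sketch counts how many 64 MB blocks a file splits into and how many block copies the cluster ends up storing. The replication factor of 3 is an assumption based on the three copies mentioned, the function names are hypothetical, and the 1,000 MB file is made up for the example.

```python
import math

BLOCK_SIZE_MB = 64  # chunk size quoted in the talk


def block_count(file_size_mb):
    """Number of blocks needed to hold a file; the last block may be partly full."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def stored_replicas(file_size_mb, replication_factor=3):
    """Total block copies kept across the cluster (assumed replication factor)."""
    return block_count(file_size_mb) * replication_factor


example_size_mb = 1000  # hypothetical file size
print(block_count(example_size_mb), "blocks")
print(stored_replicas(example_size_mb), "block copies across the cluster")
```

The multiplier is the price of the "no RAID" design: every block costs its size times the replication factor in raw disk.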
The name node has one instance per cluster, and is a single point of failure, but there is a backup node or checkpoint node which will take over in the event of failure. The name node always checks the state of the data nodes–much like the quorum or heartbeat in a regular cluster. It also tells the client which node to write each block to. The name node also returns block locations to the client.
Talks about types of failures–disk errors, data node failures, and switch/rack failures. Name node and data center failures. When a data node fails, its blocks are replicated to another node. If the name node fails, it’s not an epic failure–it automatically fails over to the backup node. The file system does automatic load balancing, to evenly spread blocks amongst the nodes. In summary, it’s built to support 1000s of nodes and 100s of TBs. Large block sizes–this is designed for scanning, not OLTP. No use of mirroring or RAID–the RAID comes in from the highly replicated blocks.
The negative is that this makes it impossible to employ many optimizations used successfully by RDBMSs.
MapReduce–a programming framework to analyze data sets in HDFS. The user only writes map and reduce functions; the framework takes care of everything else. It takes a large problem and divides it into a bunch of much smaller problems, and then performs the same function on all of the much smaller pieces. The reduce phase combines that output.
The components include a job tracker (which runs on the name node). It manages the job queues and scheduling, and it schedules the task trackers. There are also task trackers, which execute individual tasks (and run on the data nodes). Shows an example to sum sales by zip code–this is a really great example of MapReduce compared to SQL. Blocks are stored locally after the map operation (better performance for small writes than HDFS). The MapReduce framework does the sorting of the operation; again the results are stored locally.
The worker’s load is distributed amongst the nodes, but data skew is still a problem, because the reducer can get stuck (example–large numbers of New York zip codes, versus say Iowa). This is highly fault tolerant. MR framework removes burden of dealing with failures from the programmer.
On the other hand, you can’t build indexes, constraints or views.
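To make the sum-sales-by-zip-code example a little more concrete, here is a minimal single-process sketch in Python of the map, sort/shuffle, and reduce steps. This is not Hadoop code and not the example from the keynote slides; the sample records and the function names (`map_sale`, `shuffle`, `reduce_sales`, `run_job`) are all made up for illustration. In real MapReduce the framework would run the shuffle for you and spread the work across data nodes.

```python
from collections import defaultdict

# Hypothetical sales records: (zip_code, amount). Not data from the talk.
sales = [
    ("10001", 25.0),
    ("50309", 10.0),
    ("10001", 5.5),
    ("10002", 7.0),
    ("50309", 2.5),
]


def map_sale(record):
    """Map: emit one (key, value) pair per input record."""
    zip_code, amount = record
    yield zip_code, amount


def shuffle(pairs):
    """Stand-in for the framework: group values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_sales(zip_code, amounts):
    """Reduce: combine all values that share a key."""
    return zip_code, sum(amounts)


def run_job(records):
    mapped = [pair for record in records for pair in map_sale(record)]
    return [reduce_sales(key, values) for key, values in shuffle(mapped)]


# Roughly: SELECT zip, SUM(amount) FROM sales GROUP BY zip ORDER BY zip
totals = run_job(sales)
print(totals)
```

The skew problem is visible even here: every record for a given zip code lands in a single `reduce_sales` call, so a key with far more records than the others keeps one reducer busy while the rest sit idle.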
Now we are talking about Hive and Pig–Facebook produced a SQL-like language called Hive and Yahoo produced Pig. These are pretty SQL-like, to hide the hard work of MapReduce–it’s basically an abstraction. This looks almost exactly like the SQL we know and love, but with a schema definition on top. Every day Facebook runs 150k warehouse jobs–only 500 are MapReduce, the rest are HiveQL.
Column and data types are richer in Hadoop than SQL (columns can be structures, lists), and Hive tables can be partitioned. When you partition a Hive table, the partition name becomes what it’s partitioned by, so it’s not repeated (it’s taken out of the records).
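Here is a small Python sketch of that partitioning idea: rows are grouped under a partition name built from the partition column, and the column itself is dropped from each stored record. This is only an illustration, not Hive code; the `column=value` naming mirrors how Hive names partition directories, while the function name `partition_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, partition_column):
    """Group rows by the partition column and drop that column from each row."""
    partitions = defaultdict(list)
    for row in rows:
        key = f"{partition_column}={row[partition_column]}"
        remaining = {name: value for name, value in row.items()
                     if name != partition_column}
        partitions[key].append(remaining)
    return dict(partitions)


# Hypothetical sales rows, not data from the talk
rows = [
    {"state": "NY", "zip": "10001", "amount": 25.0},
    {"state": "IA", "zip": "50309", "amount": 10.0},
    {"state": "NY", "zip": "10002", "amount": 7.0},
]

partitioned = partition_rows(rows, "state")
for name, records in partitioned.items():
    print(name, records)
```

Since the state value lives in the partition name, a query that filters on state only needs to read the matching partition, and the value isn't stored again in every record.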
In a simple TPC benchmark, Parallel Data Warehouse was 4-10x faster than Hive.
Now talking about Sqoop–to move data from unstructured universe into SQL. Not an efficient process.
Summarizes–relational databases vs Hadoop. Relational databases and Hadoop are designed to meet different needs. Neither will be the only default. Feels like Enterprise Data Managers have the capability to merge the two worlds.
This was incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.
So in yesterday’s WIT luncheon, there was a bit of discussion on salary data. There was even a discussion on going to your local HR department to find out the range for your position. DON’T FREAKING DO THAT!!!! This could be a whole other topic, but you don’t want to send a loud signal to your HR organization that you are shopping jobs. That’s as much of a career limiting move as not having a backup.
Here’s the link to the IRS data on the average (median, mean, and top split) salaries for DBAs. It’s broken down by region–in my experience with the data across a variety of IT positions, it has been reasonably accurate.
It covers SQL, and those other databases, but should give you a good starting point for negotiations. And remember–salary is always negotiable, even if you’re unemployed–if you don’t get that money up front, you will never see it.
Bill Graziano took the stage in a lovely green kilt, with proper socks. Talking about growth in PASS outside of North America. For those of you not directly involved, PASS has been making a big (and successful) push to grow regions, particularly in Europe and Asia. Lori Edwards (twitter) is the winner of the 2011 PASSion award for outstanding volunteer. Bill discussing PASS financials. Revenue has grown 45% (mostly from Summit), and expenditures to chapters have grown by 105%.
Lots of hardware on stage. Quentin Clark, Corporate Vice President for SQL Server from Microsoft, takes the stage, as we get videos of attendees discussing some of the benefits of the new features in SQL Server 2012. He will be talking about what’s coming in SQL Server 2012. Slide up with the vision–any data, any size, anywhere. Connecting the World’s data–I feel like Microsoft, with its new Data Store, may be opening up to competing with Google on data.
SQL Azure is powered by the SQL Server 2012 codebase. Integration Services as a server, HA for StreamInsight. Additionally, discussing SQL 2012 Always On. I’ll be blogging about that more here in the near future. Bob Harrison, VP of Interlink Transport Technologies, number two import/export in the world, takes the stage to discuss their HA solution. Discussed their mission critical systems running on SQL Server. Their primary databases are in New York; DR is in New Jersey. They then discussed AlwaysOn Availability Groups, which allow up to 4 readable copies of a database and allow databases to be grouped together. Also, they displayed the availability monitoring solution.
Now showing reporting off a read only copy of the database–we could do this before with a snapshot of a mirror, but this is way better. This will probably be an Enterprise Edition feature (I have no NDA–this is just speculation on my part).
Talking about ColumnStore indexes–this is a feature that flattens tables to improve performance. This is a big win for analytic workloads. He moved into PowerView and PowerPivot, some of the BI tools that integrate Excel, Analysis Services, and SharePoint. These seem good, but don’t seem to happen much in the Fortune 500, as firms tend to stick with their ERP vendor for analytics.
Now talking about the BI Semantic model, and Data Quality. Just coming off of the SAP project, I’m curious to see if Data Quality Services (new in 2012) and Master Data Services can be a real competitor to Business Objects Data Services. Lara (@sqlgal) is demoing SharePoint reporting. She builds a columnstore index to try to improve performance on a slow running report. MDM allows mapping to Azure Data Marketplace, and will do data correction. I have to say–this looks better than BOBJ-DS. And it’s adaptive–it has an intelligent engine. After building a columnstore index, performance on the query goes from 47 seconds to .3 seconds.
We then see some of the data quality monitoring features that are baked into Master Data Services.
Talking about compliance–a subject that is near and dear to my heart after about 9 years in health care. SQL 2012 has user-defined auditing, as well as user-defined server roles. Frankly, this has been a big hole in SQL for a while in my opinion (especially for my dev servers).
Distributed testing–this will allow for workloads to be tested. Discusses SCOM, and a cloud based Premier Mission Critical support service.
Now we move onto the PDW solutions. This is an appliance based solution that is provided by HP or Dell, and allows for massively parallel processing. You work with an implementer and Microsoft to do this. Originally it was really expensive, but now Microsoft is providing some options that may be suitable for smaller shops, especially with the Data Warehouse Appliance–these go from full racks all the way down to 1U. These devices only provide Network, Power and Security info. From the box to loading data, this is a 20 minute process.
Shows the HP Database Consolidation Appliance–this can provide a big private cloud.
Talking about ODBC drivers for Linux, and Change Data Capture for SSIS & Oracle. These have been requested for a long time. Finally. Seriously, we’ve needed this for 10 years now.
Michael Rys took the stage to demonstrate a visualization based on the Semantic Search feature in 2012. Using a file table to do semantic search–this does language processing. He did a very good demo and actually zoomed in on his code.
Next we saw how to deploy a DACPAC to Azure. Additionally, we can now back up Azure databases to Windows Azure storage. This should have been in SQL Azure from the beginning, and is a good feature add. Also, the data sync is moved into SSMS–again this should have been there sooner; I was using a tool from CodePlex for this functionality before. Discussing Federations in SQL Azure, which will allow your domain to be joined to MS–for Domain Based Authentication and Sharding. Microsoft announced data sync, which will allow for actual DR scenarios in Azure.