How to Find Data File Page Counts for a Table

Last week one of the guys on my DBA team asked me if I knew of a way to identify which data from a table is in a particular data file. SQL Server (when a database has multiple data files) typically does a proportional fill of the data files—it will try to write an even number of pages to each file. If a new file is added, the engine will continue to add pages there until the fill levels are about even.

The reason this question came up was that we had a rather large table and two data files—one of which was pretty empty—so we wanted to see how much data was left in the second file (the table is refilled pretty regularly, so it might have been possible to get all of the data out and drop the second data file).

Unfortunately, in SQL 2008 R2 (and below) there is no documented method of doing this check. The process involves dumping DBCC IND into a temp table and running a select statement against it (see the code below):

if object_id('tempdb..#dbcc_indresults') is not null
begin
    drop table #dbcc_indresults;
end

create table #dbcc_indresults (
    PageFID bigint,
    PagePID bigint,
    IAMFID bigint,
    IAMPID bigint,
    ObjectID bigint,
    IndexID bigint,
    PartitionNumber bigint,
    PartitionID bigint,
    iam_chain_type varchar(50),
    PageType bigint,
    IndexLevel bigint,
    NextPageFID bigint,
    NextPagePID bigint,
    PrevPageFID bigint,
    PrevPagePID bigint);

insert into #dbcc_indresults
    exec('dbcc ind (''pagesplittest'', ''testtable'', 1)');

select t1.name as 'File Name', count(*) as 'Pages'
from #dbcc_indresults t2
join sys.database_files as t1 on
    t1.file_id = t2.PageFID
group by t2.PageFID, t1.name;

However (and I have to credit this find to this weekend’s Northeastern snow storm—I was without cable and internet, so I didn’t have much to do except play with my local SQL Server and read), there is a new system function named sys.dm_db_database_page_allocations built into SQL Server 2012. It accepts the object_id as a parameter (similar to sys.dm_db_index_physical_stats) and brings back a lot of good information about the pages within a given object. This data was available through various commands in older versions of SQL Server, but this brings it together in one place. The code below will bring back the number of pages per data file for a given object.

select t1.name as 'File Name', count(*) as 'Pages'
from sys.dm_db_database_page_allocations
    (db_id(), object_id('testtable'), null, null, 'detailed') as t2
join sys.database_files as t1 on
    t1.file_id = t2.extent_file_id
group by t2.extent_file_id, t1.name;

There was only one problem I ran into—the page counts didn’t match the older method of using DBCC IND. So I asked Paul Randal (blog|twitter) for his thoughts on the matter—he had me check sys.dm_db_index_physical_stats, which nearly matched DBCC IND (DBCC IND also accounts for the IAM page).
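If you want to run the same cross-check yourself, here is a minimal sketch against the same testtable used above—sum the page_count column across the levels the DMF returns and compare it to the file-level totals:

-- Page totals per index from sys.dm_db_index_physical_stats, for comparison
-- with the counts from DBCC IND and sys.dm_db_database_page_allocations
-- (DBCC IND also includes the IAM page, so expect it to be one page higher)
select index_id, sum(page_count) as 'Pages'
from sys.dm_db_index_physical_stats
    (db_id(), object_id('testtable'), null, null, 'detailed')
group by index_id;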

So either my test data is weird (I did perform this test twice after wiping out and recreating the database) or there is a bug in sys.dm_db_database_page_allocations. I opened a Connect bug with Microsoft on it, so hopefully I will get some feedback soon.

Presenting at SQL Saturday 96 (Washington DC)

I’ll be presenting two sessions at SQL Saturday #96 in Washington, DC (well, actually Chevy Chase, MD) on November 6. If you are in the mid-Atlantic and looking to get a great day of SQL Server education, please sign up.

The two topics I will be presenting on are Virtualization for DBAs and SANs for DBAs.

In the virtualization talk I will cover:

  • Basics of Virtualization (Hypervisors, Hosts, and Guests)
  • Advanced Features of Virtualization (vMotion, Live Migration, and DR)
  • Potential gotchas of running your database servers in a Virtual Environment
  • Things you need to configure in your DB servers for Virtual Environments

In the SAN talk I will cover:

  • How databases interact with storage
  • Why memory is faster than a hard drive
  • RAID levels and what they mean for databases
  • Why a SAN is like living in a New York highrise
  • How to talk to your SAN administrator

All in all it should be a great day of training–there will be many top notch speakers and MVPs in attendance. I’m honored to be speaking.

PASS Wrapup

The best thing I can tell you about the PASS Summit is my sleeping schedule since I’ve been home–on Saturday night I slept for about 11 hours, Sunday was more normal at 9, and then Monday night was big, with me going to bed at 8 and waking up around 7. So, now that I’m finally caught up with my sleep I can tell you a little more about the Summit.

I’ve been to other database conferences before, but this was my first trip to the PASS Summit, and I can tell you there is an energy and vibe there like no other. That, and the learning is absolutely top notch–the people who write the books and the software are there, and eager to share what they know. I’m not going to recap everything I went through bit by bit, but I definitely learned something in every session I attended, I made a ton of great connections, and all in all I had a blast.

There is no better learning, networking, or fun value in all of IT. The SQL Server community is a family and really gets and understands the value of community. I didn’t have a meal by myself, and I spent so little time in my hotel room that I requested a discount. Do whatever it takes, but if you are a SQL Server professional, get yourself to Seattle in 2012.

PASS Day Three Keynote–Dr. DeWitt

I’m starting with my last paragraph–this blog post is a really long read. I recommend going to the SQLPASS website and watching this keynote; you will learn a ton about NoSQL and how it works.

This was truly incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.

So this is the talk that all of the hardcore database folks have been looking forward to: Dr. David DeWitt, Technical Fellow at Microsoft, will be presenting. These talks in the past have been deep dives into the optimizer and the math behind it. Speculation is that he’s going to be talking about Big Data.

We start out with Rob Farley and Buck Woody singing a duet about query performance–it was awesome.

We then heard about the PASS Summit and SQL Rally dates for next year–the Summit is back November 9th through 12th, and SQL Rally will be May 8th-12th in Dallas. Additionally, SQL Rally Nordic has been extremely successful and sold out.

Then Dr. DeWitt takes the stage to discuss Big Data. He starts out talking about the work Rimma Nehme did on the query optimizer for Parallel Data Warehouse. The good doctor talks about the pain of preparing his keynotes, something I think we all experience as speakers.

Talking about very big data volumes–in about 2009 there was roughly a zettabyte (1,000,000 petabytes) of data in the world, and it’s expected to grow by a factor of 40! More data is generated by sensors–not entered by hand–or it’s entered by a much larger group of people (social media). He also talks about the dramatic drop in hardware costs.

Then he goes into the discussion of how to manage big data. eBay manages 10 PB across 256 nodes. Facebook, on the other hand, uses 2,700 nodes to support 20 PB. Bing is 150 PB on 40k nodes.

Talks about NoSQL–benefits include JSON and no schema-first requirement. Updates don’t really happen to this data. Lower upfront software costs. The time from acquiring data to gaining business insight can be lower. Data arrives and isn’t put into a schema first–the structure is applied later by the application program.

Records are sharded across nodes by a key–MongoDB, Cassandra, Windows Azure. Hadoop, on the other hand, is designed for large amounts of data with no data model. Talking about structured versus unstructured data, and how “unstructured data” has some structure.

Key-value stores are more like OLTP; Hadoop is more like a data warehouse. He talked about eBay versus Facebook, and how much more CPU-efficient an RDBMS can be.

Talking about the shift from hierarchical systems to relational, and how SQL is not going away. Talking about Hadoop and its ecosystem of software tools. This really all started at Google–it had to be reliable and cheap for petabytes of clickstream data. HDFS is the file system (modeled on the Google File System), and MapReduce is the programming model Google used for analyzing massive amounts of data.

He starts talking about the Hadoop Distributed File System–designed to be scalable to 1000s of nodes. It assumes that hardware and software failures are common. Targeted towards small numbers of very large files–written in Java and highly portable. Files are partitioned into big 64 MB chunks. It sits on top of the native OS file system.

Each block is replicated to nodes of the cluster based on the replication factor. The first copy is written to the originating node, the second copy is written to another node in the same rack (to reduce network traffic), and the third is put in another rack or even another data center.

The name node has one instance per cluster and is a single point of failure, but there is a backup node or checkpoint node which will take over in the event of failure. The name node constantly checks the state of the data nodes–much like the quorum or heartbeat in a regular cluster. It also tells the client which node to write each block to, and it returns block locations to the client on reads.

Talks about types of failures–disk errors, data node failures, switch/rack failures, plus name node and data center failures. When a data node fails, its blocks are replicated to another node. If the name node fails, it’s not an epic failure–it automatically fails over to the backup node. The file system does automatic load balancing to evenly spread blocks amongst the nodes. In summary, it’s built to support 1000s of nodes and 100s of TBs. Large block sizes–this is designed for scanning, not OLTP. No use of mirroring or RAID–the redundancy comes from the highly replicated blocks.

The negative is that this makes it impossible to employ many optimizations used successfully by RDBMSs.

MapReduce–the programming framework used to analyze data sets in HDFS. The user only writes the map and reduce functions; the framework takes care of everything else. It takes a large problem, divides it into a bunch of much smaller problems, and then performs the same function on each of the smaller pieces. The reduce phase combines that output.

The components include a job tracker (which runs on the name node); it manages the job queues and scheduling, and it assigns work to the task trackers, which run on the data nodes and execute the individual tasks. He shows an example summing sales by zip code–a really great illustration of MapReduce compared to SQL (see the sketch below). The map output is stored locally (better performance for small writes than HDFS). The MapReduce framework does the sorting, and again the results are stored locally.
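For reference, the relational version of that example is just an aggregate query–a minimal sketch using a hypothetical Sales table with ZipCode and Amount columns (the map step would emit zip code/amount pairs, and the reduce step would do the equivalent of this SUM and GROUP BY):

-- What the "sum sales by zip code" MapReduce job computes, expressed in SQL
select ZipCode, sum(Amount) as TotalSales
from Sales
group by ZipCode;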

The work is distributed amongst the nodes, but data skew is still a problem, because a reducer can get stuck (for example, large numbers of New York zip codes versus, say, Iowa). This is highly fault tolerant–the MapReduce framework removes the burden of dealing with failures from the programmer.

On the other hand, you can’t build indexes, constraints or views.

Now we are talking about Hive and Pig–Facebook produced a SQL-like language called Hive (HiveQL), and Yahoo produced Pig. These are pretty SQL-like, to hide the hard work of MapReduce–it’s basically an abstraction. This looks almost exactly like the SQL we know and love, but with a schema definition on top. Every day Facebook runs 150k warehouse jobs–only 500 are raw MapReduce, the rest are HiveQL.

Column and data types are richer in Hive than SQL (columns can be structures or lists), and Hive tables can be partitioned. When you partition a Hive table, the partition column becomes part of the storage path rather than being repeated in every record (it’s taken out of the records).
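Here’s a minimal HiveQL sketch of both points (the table and column names are made up for illustration): the schema is declared on top of the files, and the partition column lives in the table’s directory structure instead of in each record.

-- Schema-on-read table; sale_date is a partition column, so it is stored
-- in the directory path rather than repeated in every record
CREATE TABLE sales (zip_code STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

-- Queries look almost exactly like the SQL we know and love
SELECT zip_code, SUM(amount) AS total_sales
FROM sales
WHERE sale_date = '2011-10-13'
GROUP BY zip_code;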

In a simple TPC benchmark, Parallel Data Warehouse was 4-10x faster than Hive.

Now talking about Sqoop–a tool to move data from the unstructured universe into SQL. Not an efficient process.

He summarizes: relational databases vs. Hadoop–they are designed to meet different needs, and neither will be the only default choice. He feels that enterprise data managers will have the capability to merge the two worlds.

This was incredible–really in-depth technical content, explained in a way that DBAs could easily understand. If anyone from PASS or Microsoft (or hardware vendors) is reading this–this is the content we want, deep technical and incredibly well presented.

Salary Data–Where to Get It..

So in yesterday’s WIT luncheon, there was a bit of discussion on salary data. There was even a suggestion of going to your local HR department to find out the range for your position. DON’T FREAKING DO THAT!!!! This could be a whole other topic, but you don’t want to send a loud signal to your HR organization that you are shopping for jobs. That’s as much of a career-limiting move as not having a backup.

Here’s the link to the IRS data on the average (median, mean, and top-split) salaries for DBAs. It’s broken down by region–from my experience with this data across a variety of IT positions, it has been reasonably accurate.

It covers SQL and those other databases, but it should give you a good starting point for negotiations. And remember–salary is always negotiable, even if you’re unemployed. If you don’t get that money up front, you will never see it.

What DBAs do, and how much they make.

SQL Pass–Day 2 Keynote

Bill Graziano took the stage in a lovely green kilt, with proper socks. He talked about growth in PASS outside of North America–for those of you not directly involved, PASS has been making a big (and successful) push to grow regions outside North America, particularly in Europe and Asia. Lori Edwards (twitter) is the winner of the 2011 PASSion award for outstanding volunteer. Bill also discussed PASS financials: revenue has grown 45% (mostly from the Summit), and expenditures to chapters have grown by 105%.
Lots of hardware on stage. Quentin Clark, Corporate Vice President for SQL Server at Microsoft, takes the stage as we get videos of attendees discussing some of the benefits of the new features in SQL Server 2012. He will be talking about what’s coming in SQL Server 2012. A slide goes up with the vision–any data, any size, anywhere. Connecting the world’s data–I feel like Microsoft, with its new data store, may be opening up to competing with Google on data.
SQL Azure is powered by the SQL Server 2012 codebase. Integration Services as a server, HA for StreamInsight. Additionally, discussing SQL Server 2012 AlwaysOn–I’ll be blogging about that more here in the near future. Bob Harrison, VP of Interlink Transport Technologies (the number two import/export company in the world), takes the stage to discuss their HA solution and their mission-critical systems running on SQL Server. Their primary databases are in New York; DR is in New Jersey. They then discussed AlwaysOn Availability Groups, which allow up to four readable copies of a database and allow databases to be grouped together for failover. They also showed the availability monitoring solution.
Now showing reporting off a read-only copy of the database–we could do this before with a snapshot of a mirror, but this is way better. This will probably be an Enterprise Edition feature (I have no NDA–this is just speculation on my part).
Talking about columnstore indexes–a feature that stores index data column-wise and compressed to improve performance; this is a big win for analytic workloads. He moved into Power View and PowerPivot, some of the BI features that integrate Excel, Analysis Services, and SharePoint. These seem good, but they don’t seem to get much traction in the Fortune 500, as firms tend to stick with their ERP vendor for analytics.
Now talking about the BI Semantic Model and data quality. Just coming off of the SAP project, I’m curious to see if Data Quality Services (new in 2012) and Master Data Services can be a real competitor to Business Objects Data Services. Lara (@sqlgal) is demoing SharePoint reporting. She builds a columnstore index to try to improve performance on a slow-running report; after building it, the query goes from 47 seconds to 0.3 seconds (see the sketch below for the syntax). Master Data Services allows mapping to the Azure Data Marketplace and will do data correction–it’s adaptive, with an intelligent engine. I have to say, this looks better than BOBJ-DS.
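For reference, creating one of these in SQL Server 2012 is a single statement–a minimal sketch against a hypothetical dbo.FactSales table (the table and column names are made up):

-- Nonclustered columnstore index, new in SQL Server 2012; note that in this
-- release the base table becomes read-only while the columnstore index exists
create nonclustered columnstore index IX_FactSales_ColumnStore
on dbo.FactSales (OrderDate, ZipCode, SalesAmount);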

We then see some of the data quality monitoring features that are baked into Master Data Services.

Talking about compliance–a subject that is near and dear to my heart after about 9 years in health care. SQL Server 2012 has user-defined auditing, as well as user-defined server roles. Frankly, this has been a big hole in SQL Server for a while, in my opinion (especially for my dev servers).

Distributed testing (Distributed Replay)–this will allow workloads to be captured and replayed for testing. He also discusses SCOM, and a cloud-based Premier Mission Critical support service.
Now we move on to the PDW (Parallel Data Warehouse) solutions. This is an appliance-based solution, provided by HP or Dell, that allows for massively parallel processing; you work with an implementer and Microsoft to do this. Originally it was really expensive, but now Microsoft is providing some options that may be suitable for smaller shops, especially with the Data Warehouse Appliance–these go from full racks all the way down to 1U. You only need to provide network, power, and security info; from unboxing to loading data is about a 20-minute process.

Shows the HP Database Consolidation Appliance–this can provide a big private cloud.

Talking about ODBC drivers for Linux, and Change Data Capture for SSIS and Oracle. These have been requested for a long time. Finally—seriously, we’ve needed this for 10 years now.
Michael Rys took the stage to demonstrate a visualization based on the Semantic Search feature in 2012, using a FileTable to do semantic search–this does language processing. He did a very good demo and actually zoomed in on his code.
Next we saw how to deploy a DACPAC to Azure. Additionally, we can now back up Azure databases to Windows Azure storage–this should have been in SQL Azure from the beginning, and is a good feature add. Also, data sync has moved into SSMS; again, this should have been there sooner–I was using a tool from CodePlex for this functionality before. Discussing Federations in SQL Azure, which will allow your domain to be joined to Microsoft’s for domain-based authentication, and sharding. Microsoft announced data sync, which will allow for actual DR scenarios in Azure.

What PASS Has Meant to Me and My Career

On the eve of leaving for my first PASS Summit, I wanted to talk a little bit about what the SQL Server community has meant to me and my career. In my older roles at Wyeth, I was largely an Oracle DBA, but I had always dabbled in SQL Server since the beginning of my career. The position I was hired into at Synthes would require me to learn a lot more about SQL Server, since their environment was mostly SQL. So I decided to start going to the Philadelphia SQL Server User Group meetings in the area (if you’re reading this from Philly–go to the website and sign up for the next meeting). Around the same time, I got involved with Twitter and found Brent Ozar (blog|twitter), which eventually led me into the much broader SQL community.

I submitted my first presentation that fall, to IOUG (Oracle’s version of PASS), and my first major presentation was at their conference in April. In the meantime, I had done a couple of presentations at our SQL group, and more importantly I was learning a ton more about the platform. Various code camps, videos at SSWUG, SQL Saturdays, and a SQL Rally later, I’m fairly confident in my speaking skills, and a few of you actually read my blog. Unfortunately, I’m not presenting at the PASS Summit this year, because around the time submissions were due, I didn’t know I was going to be going.

One refrain I see tossed about is that it doesn’t really matter if a person has 10 years of experience if it’s the same year of experience repeated 10 times. Getting involved in the community can really help you avoid that rut–you will be exposed to a wide variety of presentations on topics you might not have seen before, and if you make the leap to presenting, you’ll dig far deeper into topics than you ordinarily would in your job.

So how does this tie back to my career? As the SAP project I was on really started to suck (and I really figured this out while talking to colleagues at SQL Rally), I started looking for a new role and began talking to a couple of companies about DBA positions. It really helps the interview process when you can respond to a question with “oh, I have an article about that on my blog” or “I did a presentation on that.”

So a big thank you to those who make up the community–your dedication and hard work have made me a much better technologist (not just DBA) than I was three years ago.

Lastly–and this is the really great thing–my new role (which is really awesome!) basically happened because of the speaking and blogging work I had done. Get out to a user group meeting or a SQL Saturday! Stay active in the community, and see you at PASS!

Meme Monday

This Meme Monday is hosted by Tom Larock (blog|twitter) and the subject is the worst way you have ever been asked for help.

Well my last project had some nightmarish moments, like getting called at midnight for a minor problem, or having a 2 am conference call.

But one developer really took the cake with the question: “the database is slow.” After looking at wait stats and finding nothing (it turned out to be a network/app issue–too many round trips), I gave the instance more memory and wrote back saying I thought it was an app problem.

His response-“did you increase the performance in the database yet?”

Time for Change

As some of you may have seen on my LinkedIn profile, I recently made a job change. This, as always, was a hard decision, as I really liked the folks I was working with at my old company, and getting to work in Europe (Switzerland) was a great career experience. Here’s a little bit of background.

Last December, my company was in the midst of a big hiring spree for our global SAP implementation–it was a big project, and it was obviously where most of our IT resources were going to be going for the next several years. It also meant leaving my comfort zone–databases–to be the infrastructure lead for the project. I decided to do it–the ERP experience would be great, and my backup plan was that since I would continue presenting on SQL, I could always go back to being a DBA.

The project kicked off in February, and one of the key early decisions was to outsource the hosting of the infrastructure–this would, in theory, make my job easier, as it would limit my responsibilities to connectivity and relationship management activities. However, things didn’t work out well with the hosting (the vendor was awful, and we weren’t much better), so in April, when we were coming up against some project deadlines, I jetted off to Switzerland to build the development and sandbox environments with my consultant. The Swiss had some excess hardware, and the plan was for this to be a temporary environment until we got the hosting worked out–it wasn’t, and the VMs we built then laid the groundwork for development.

From my perspective, this was good and bad at the same time. It tested every part of my technical skills: I did SQL Server, Oracle, Windows, Linux, VMware, and a bit of SAN. I was even fairly involved in the network and remote access pieces of the project. The bad side was that my team hadn’t expanded–it was two of us, and we were beginning to get overwhelmed with requests, both from the development team and from the project management work I was having to do (a common week was 6 hours of meetings a day, all while trying to work). After Switzerland (part 1), I took a few days to go speak at SQL Rally and relax a bit.

One interesting tidbit I didn’t mention was that during my trip to Switzerland, it was announced that my company was being acquired by a much larger health care firm. I thought I would be safe, but that’s always a big cause for concern.

I was talking with some really smart folks–John Sterrett (blog|twitter), Kevin Kline (blog|twitter), and Jen McCown (blog|twitter)–at Rally, and they suggested I start looking for another role. I only applied to two jobs, and I heard back from both of them–one was at a very prominent company in the Philadelphia area, where a couple of my Microsoft friends had worked. I interviewed there in late May–everything went great. The process took forever, but their HR recruiting team did an excellent job of staying in touch with me and letting me know that they were still interested.

As the project progressed, things only got crazier. SAP has a crazy number of modules, each of which has its own inter- and cross-dependencies; additionally, there are a decent number of ancillary systems that also require support (I’m looking at you, Business Objects Data Services). So, needless to say, free time was at a premium. May through July consisted of a lot of 60-hour weeks–we finally decided to dump the hosting guys and do it ourselves, so the end of July brought another trip to Switzerland (this would be our vacation for the year; it was fun), this time to build the QA environment.

The day before I left for Switzerland (and SQL Saturday Wheeling), I got a call from the big company I had interviewed with, with a great offer, pending a drug test (I passed, woo hoo!). While I was in Switzerland, I began hearing rumors that the SAP project might be cancelled, as the company was trying to save cash in advance of the merger. This, along with a couple of other things that happened in the US during that trip, led me to accept the offer. I do have to thank Erin Stellato (blog|twitter) and Karen Lopez (blog|twitter) for helping me with advice during that trip. Thanks ladies!!!

So the epilogue of this story is that two days after I started my new role, the project was cancelled, and everyone was reassigned to either their old roles or something else. I felt pretty awful for my colleagues, but like I said on Twitter, I felt like I hit the lottery.

Now that I’m in a different role, you should see some more blog posts here. Later this week, I’ll talk about how the community can help your career!

unSQL Friday–Lessons Learned while Presenting

Jen McCown (blog|twitter) has organized a great topic for this unSQL Friday–lessons learned while presenting. Like she mentioned in her post–these don’t necessarily have to be tragedies, just things you’ve picked up along the way while presenting.

Fortunately, in my few years of doing technical presentations, I’ve never had a major demo failure or a laptop crap out on me, but I have picked up some tips and hints along the way. So here goes:

  1. PowerPoint presentation mode, for the win. This is a feature in PowerPoint that lets you show your slides on the projector while you see your presentation notes and a timer on your own monitor. I kind of use it like a mini-teleprompter. The one pain point is that it makes it a bit harder to go in and out of demos, but I feel like that headache is easier than dealing with 10 pages of 16-point speech notes.
  2. Get a presentation mouse. I use a Microsoft presentation mouse that I was able to buy off eBay for about $20. It has a timer, which is a nice feature that I like to use. The presentation mouse also allows me to move around the room, which leads me to my next point.
  3. Engage, engage, engage the audience. Some of the best presenters I’ve seen do this really well–try to plan a couple of points in your presentation where you can engage the audience. In my recent SQL Azure presentation, I sent an Azure Reporting Services report to an audience member who had a Windows Phone 7. It breaks up the monotony of a technical presentation and helps keep people from falling asleep–I’m looking at you, guy in row 3.

One last note, based on some feedback I received from SQL Rally: it’s generally not the speaker’s responsibility how comfortable the chairs and the room are. If you have a problem with the room, please see the conference organizers; please use the evaluation form to comment on me, so I can improve my talks in the future.

If you are reading this and you are not a regular presenter, but would like to get in on the act, talk to your local user group–they are always on the lookout for speakers, and many organize special events where first-time speakers can talk for 10-15 minutes.