SQL Saturday #294—It’s a Wrap! #sqlsat294

I just finished organizing my third SQL Saturday event in Philadelphia, and this one was our biggest ever. I’d like to thank all of my volunteers for once again making this event happen and making it a success. This year we fell on either side of TechEd, which I feel helped us with speakers. Also, for the first time ever we decided to have a pre-con day, with me and Stacia Misner (b|t) talking about Big Data and Power BI, and Allan Hirt (b|t) doing a well-regarded lab session on AlwaysOn Availability Groups. These were well received—I was surprised at how late most of our attendees registered. The last week saw a significant increase in registrations, with a couple of attendees registering the night before. I did offer speakers a healthy discount on the pre-cons, and we had a few speakers sign up for some extra learning.

We are fortunate to have a fantastic facility at Microsoft and a great representative there. The venue is central, close to my house, and the layout allows good foot traffic for all of our sponsors. One of the issues we’ve had in the past was attendees not printing out their SpeedPASSes; this year we had the budget to offer an Xbox One as a prize, and roughly 85% of our attendees showed up with their SpeedPASSes printed.

In closing, I would like to thank all of my sponsors, my volunteers (you were awesome), and my speakers. Putting these events on isn’t easy, and it takes a great team to do it all.

T-SQL Tuesday #055: SQL Server 2014, The Good, The Bad and The Ugly

T-SQL Tuesday is here (sorry for the delay—work is busy, and I’m running a SQL Saturday this week), and I will be hosting this month’s edition. Since SQL Server 2014 has been out for a couple of months now, I wanted to solicit opinions. What features are you really happy about? What features should be tossed, fixed immediately, or taken out back and shot? Did SQL Server 2014 break your application, and do you know why?

1. The post must go live on your blog between 00:00 GMT on Tuesday, June 10, 2014 and 00:00 GMT on Wednesday, June 11, 2014.

2. Include the T-SQL Tuesday logo at the top of your post and link your post back to this one; the link must be anchored from the logo (found above), which must also appear at the top of your post.

3. Leave a comment here (below), or I won’t be able to find your post.

Is My Static IP Address in Windows Azure Really Set?

I’ve been working with Windows Azure VMs since they became available last year, and I’ve built out some pretty complex scenarios with them (hybrid clusters using AlwaysOn Availability Groups, for one). One of the early limitations was that all of the VMs used dynamic (DHCP) IP addresses—there were workarounds, but for database servers and domain controllers this wasn’t the best option. Starting early in 2014, a new PowerShell cmdlet called “Set-AzureStaticVNetIP” appeared on GitHub and slowly made its way into the public domain. There’s a great article at Windows IT Pro on how to configure this using Windows Azure PowerShell (which you will need to install after creating your VMs).

Note: This can only be done on Azure VMs in a virtual network
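For reference, the basic pattern (using the Azure Service Management PowerShell module) looks something like this; the cloud service name, VM name, and IP address below are placeholders, and the Windows IT Pro article has the full walkthrough:

# Placeholders: use your own cloud service, VM name, and an address from your virtual network subnet
Get-AzureVM -ServiceName "MyCloudService" -Name "DC01" |
    Set-AzureStaticVNetIP -IPAddress 10.0.1.4 |
    Update-AzureVM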

I’m in the middle of creating some VMs for work that I am doing, and I went through the above process to assign a static IP to my domain controller in Azure. Pro tip—don’t set the IP address from within an RDP session to that machine. You will get kicked out (hangs head in shame). Also, note that your machine may reboot—it’s not mentioned in the article, and I’m not 100% sure whether it was related to my being in the RDP session, but be forewarned.

As I was promoting it to a domain controller, I noticed that Windows still thought it had a dynamic IP address, which I thought was odd.

Figure 1 Server Manager Showing Dynamic IP

From there I checked the IPv4 properties (note—this server has been promoted to a domain controller, which is why it is using localhost (127.0.0.1) and the other domain controller for DNS).

Figure 2 IPv4 Properties of AzureVM with Static IP

Of course, the proof was in the pudding—I had rebooted this VM several times and it was still keeping the same IP address (10.0.1.4). In the traditional dynamic IP model, each DHCP request would increment the address by 1, so by my third reboot I would expect to see 10.0.1.7. So I went to PowerShell to check with the cmdlet “Get-AzureStaticVNetIP”.
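The check is a one-liner along these lines (service and VM names are placeholders again):

# Returns the reserved IP address, if one has been assigned to the VM
Get-AzureVM -ServiceName "MyCloudService" -Name "DC01" | Get-AzureStaticVNetIP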

 

Figure 3 Static IP Address Assigned to VM.

So even though in most places it doesn’t look like your VM has a static IP address, it has been reserved on the Azure side. I think this is likely being done at the hypervisor level somehow, and hasn’t been exposed to Windows yet, but that’s just speculation on my part.


How Do I Show Memory Usage From A Columnstore Index in SQL Server?

Columnstore indexes in SQL Server are a great feature from an analytic and data warehouse perspective—you get great compression ratios, and batch mode processing allows for great performance. One of the ways these performance gains happen is that columnstore indexes (mostly—I’ll touch on that in a bit) bypass the 8 KB pages in the SQL Server buffer pool. Columnstore indexes get their own area of memory, which uses large, adjacently stored pages to support the underlying LOBs in which the columnstore index is stored. So I was curious to see how much memory my columnstore index was actually using—before I knew better, I looked in the buffer cache:
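A check along these lines (the table name is just a placeholder for my demo table) is roughly what that looks like:

-- Rough sketch of a buffer cache check; 'dbo.MyColumnstoreTable' is a placeholder name
SELECT i.name AS index_name,
       COUNT_BIG(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors bd
JOIN sys.allocation_units au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions p
    ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
    OR (au.type = 2 AND au.container_id = p.partition_id)
JOIN sys.indexes i
    ON i.object_id = p.object_id
   AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.MyColumnstoreTable')
GROUP BY i.name;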

This shows my index has 106 pages in the buffer cache—which is interesting, because columnstore indexes aren’t supposed to use it, right? So I took a look at sys.dm_os_buffer_descriptors:

select *
from sys.dm_os_buffer_descriptors
where database_id = 6
  and allocation_unit_id = 72057594085113856;


So there are some pages in the buffer pool from the columnstore—I suspect these are pages from the columnstore dictionary, but I can’t say for certain. So how do we find the actual columnstore memory usage? After chasing around DMVs a bit, I settled on sys.dm_os_memory_cache_entries:

select *
from sys.dm_os_memory_cache_entries
where type = 'CACHESTORE_COLUMNSTOREOBJECTPOOL';

 

Unfortunately, none of the columns in that DMV tie back directly to a specific object or partition ID—except for entry_data, which presents data in this wonderful format (and no, it’s not XML; that would have been easier):

<object_type='1' db_id='12' hobt_id ='72057594042646528' column_id='30' object_id ='0' object_size='22760'/>

I knew that if I could get that hobt_id, I could show the memory utilization of the columnstore. So, with much thanks to Jeremiah Peschka (b|t), who helped me work through the myriad string functions and concatenation I had to do to make this work, I present columnstorebuffers.sql. This procedure will return the name of the source object (the table) and the amount of memory it’s using in the columnstore object pool.

Note—you’ll have to run this procedure at the database level; even though dm_os_memory_cache_entries is global, the other views it joins to in order to bring back the name of the object are not.
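As a rough idea of the approach (this is a simplified sketch rather than the actual columnstorebuffers.sql, and the string offsets assume the entry_data format shown above):

-- Simplified sketch: parse the hobt_id out of entry_data, then resolve the object name
-- in the current database. Offsets assume the "hobt_id ='...'" format shown above.
WITH cs_entries AS
(
    SELECT mce.pages_kb,
           SUBSTRING(mce.entry_data,
                     CHARINDEX('hobt_id', mce.entry_data) + 10,
                     CHARINDEX('''', mce.entry_data,
                               CHARINDEX('hobt_id', mce.entry_data) + 10)
                     - (CHARINDEX('hobt_id', mce.entry_data) + 10)) AS hobt_id_string
    FROM sys.dm_os_memory_cache_entries mce
    WHERE mce.type = 'CACHESTORE_COLUMNSTOREOBJECTPOOL'
)
SELECT object_name(p.object_id) AS table_name,
       SUM(ce.pages_kb)         AS columnstore_object_pool_kb
FROM cs_entries ce
INNER JOIN sys.partitions p
    ON p.hobt_id = CAST(ce.hobt_id_string AS bigint)
GROUP BY p.object_id;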

The Modern Analytics Architecture—My PASS BA Conference Session

In a couple of weeks, I will be presenting at the PASS Business Analytics Conference in San Jose, CA (and this weekend at SQL Saturday Chicago). This topic feels like it has been a couple of years in the making—I got into “big data” when I was at Comcast, where we had a large number of effectively append-only applications running on very expensive relational database engines, on top of a lot of equally expensive enterprise SAN space. Even as recently as two years ago, the solutions weren’t fully enterprise ready unless you had a really specialized team of engineers. Now, however, Hadoop is much easier to build and manage, and there are SQL and other tools that make connecting to the data very easy. I have wanted to build this session for a while, to illustrate from the infrastructure, application, and business perspectives why the toolsets are changing, and how you can use these new toolsets to gain richer analysis of your data.

This session isn’t just going to be about Hadoop—you will learn how SQL Server (and other platforms) can be used as a very powerful analytics platform across large volumes of data. You will learn about the infrastructure that drives big data and analytical solutions, and how to decide whether to deploy on-premises or to the cloud. You will learn about the skills needed to work with Hadoop and some of the other toolsets, and I might even crack a few jokes about Oracle.

 

Office 365—Are You Done Yet?

Power BI in Office 365 has some new and exciting features like Power Map and Power Q&A, and it is also required in order to use the Power BI “modern” app (that’s Metro to most of us). In general, I’ve been really happy with the way things work in Office 365, but from time to time I run into odd behavior. This week was one of those weeks.

The upper limit for a file that you can load into Power BI is 250 MB—and if your file is above 30 MB, you need to enable it for Power BI. I was building a demo this week that has a good amount of data and ends up being right around 200 MB. I was a little nervous about uploading a file that “big” to the cloud over a regular internet connection. As it happened, I was working from home yesterday and kicked off the upload process right before I ran an errand over lunch. When I got back an hour or so later, I saw this:

 

Figure 1 Upload Dialog from Power BI

However, when I checked Resource Monitor for network activity, I saw none from Internet Explorer.

Figure 2 Network Activity

I opened a new window and saw that my file was actually in Power BI, so I left the upload dialog alone overnight. When I logged in this morning, the dialog box was still there. While this isn’t a big deal to IT folks (we are used to things not working correctly), I can imagine it being disconcerting to a business user trying to upload a report.

 

Does Power View Have a Problem with Shrinkage?

I do a decent amount of work with Power View—I think it is a pretty nice tool for doing visualizations across disparate data sets, and it is mostly easy enough to be accessible to power users. However, there are a few things I feel the need to complain about. My biggest complaint is that the user doesn’t have the ability to change a chart title—the only way to avoid the system-generated chart title is to hide it and then add a text box. The other day, though, I accidentally found a new bug—chart shrinkage.

I was building a very simple chart to test something in Power BI (SharePoint Online), so I used Power Query and one of my favorite test data sites (Pro Football Reference—they have good HTML tables that work really nicely with Power Query). I grabbed the 2013 NFL defensive stats and threw them into a clustered bar chart.

Figure 1 Power View Chart of NFL Defensive Stats 2013

Look at that—the Saints were pretty good on defense last year, particularly pass defense (thanks, Rob Ryan and Keenan Lewis). This chart was fine and served my purpose, but in order to demo something to someone, I tiled it by team.

 


Figure 2 Tiled by Team

 

When I remove the tile-by, the chart now has a scroll bar (note that the last visible team is the Arizona Cardinals, rather than the Washington NFL team as in Figure 1).

 

Figure 3 Shrunken Chart

After repeating this process several times, the chart ends up looking like this:

Figure 4 Who shrunk my chart?

It is easy to drag the chart back down to make the data visible again, but this clearly isn’t the intended behavior. Hopefully, as with Power Query (which gets new fixes seemingly every week), this issue (and the chart title one) will get resolved quickly.

 

When the Cloud is Definitely Right for You

A lot of my colleagues with deep IT experience on some very large workloads (and I generally count myself in this number) have balked at the notion that all IT workloads can be moved to cloud computing models. In general, being in a multi-tenant environment is not the ideal scenario for very large, intensive database workloads, and cloud computing doesn’t change that. The vendors are incentivized to provide you adequate performance while maximizing the utilization of their infrastructure. So if you want dedicated performance, you are going to pay (at least at Amazon; as of this writing Microsoft doesn’t have a performance storage offering in Azure). For the purposes of this post, “cloud computing” means Infrastructure as a Service (IaaS); platform and software as a service are different scenarios with different requirements.

When the Cloud is Right

Right now I see two real scenarios where IaaS is an ideal fit—there are others (quick dev environments, testing new platforms, restore testing), but my focus here is on these two:

  • Disaster Recovery
  • Small Business

Disaster Recovery

If you are working for a major multinational corporation that has three or four data centers, feel free to ignore this section (actually, ignore this post and go read a licensing one). If you work for a small to medium-sized business and your DR strategy is to take your backups home with you—you are my target audience. The cloud brings you a lot of really nice options:

  • Backup directly to Azure (or other providers through API)
  • All SQL Server DR options are available in hybrid models

I’m not suggesting you stop taking backups—just that you think of the cloud as your offsite backup repository. You can also think of the cloud as a second data center that you only have to pay for when you use it. For example, maybe you only need DR for an application during certain busy times of the year—IaaS makes this a viable solution.
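For example, backing up a database directly to an Azure storage account looks roughly like this (the storage account name, access key, and container URL are placeholders):

-- Placeholders throughout: substitute your own storage account, access key, and container
CREATE CREDENTIAL AzureBackupCredential
WITH IDENTITY = 'mystorageaccount',            -- storage account name
     SECRET = '<storage account access key>';

BACKUP DATABASE AdventureWorks2012
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks2012.bak'
WITH CREDENTIAL = 'AzureBackupCredential',
     COMPRESSION,
     STATS = 10;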

Small Business

The real genesis of this post was an email I received following the presentation I did last week called “Building Perfect SQL Servers, Every Time.” My goal was to teach people how to install and configure SQL Server in an optimal and repeatable fashion. The session went well, and I had a number of follow-up questions via email and Twitter—but I noticed something in a few of them: people were asking how to optimally configure hardware that simply wasn’t enough to serve their needs (e.g., a server with only 2 physical drives running SQL Server), or they were a part-time SA and DBA (and probably a couple of other things).

If you have fewer than 10 servers, and they are running in a closet, you should definitely consider a cloud model. The networking is getting easier to set up, and you pay for exactly what you need. There are other things to think about, such as traffic and bandwidth, but the more important thing is that your data is far more protected than it would be running on a server with two drives. In Azure, there are three copies of your data (this doesn’t mean you shouldn’t take database backups; when you back them up to Azure, there are also three copies of that data). You may be concerned about security—but Microsoft has all of these security certifications, and your servers are in your closet.

Summary

Cloud computing isn’t for everyone—but if you are supporting a few servers, it can make both financial and technical sense. Focus your efforts on your core business needs, instead of trying to manage infrastructure with band aids and duct tape.


Where Should I Use a Columnstore Index in SQL Server?

I’ve been blogging a lot about columnstore indexes in SQL Server 2014, and for good reason—I think it is a great feature that adds a lot of value to SQL Server for analytic workloads. One of the other (and probably more talked-about) features in SQL Server 2014 is In-Memory OLTP (formerly known as Hekaton); the SQL team added a neat new report, accessible to users of the Management Data Warehouse (MDW), that shows which tables and procedures in a given database might be well suited for In-Memory OLTP. There is nothing like that yet for columnstore indexes.

 

Figure 1 SQL 2014 Management Data Warehouse In-Memory Tables Assistant

 

Microsoft’s official guidance is to use columnstore indexes on large fact tables and on dimensions with more than 1,000,000 rows. Typically (and this applies to regular page compression as well), you are going to look at tables that don’t have a very high update percentage as columnstore candidates. Despite the ability to update clustered columnstore indexes, this feature remains largely focused on the append-only workloads present in data warehousing systems. The insert process has been designed around doing bulk inserts very efficiently, and not so much around singleton updates and inserts.

So I wanted to put some logic together to identify candidate tables. The script below checks for tables with more than 1,000,000 rows and an update percentage below 20 (it would be easy enough to modify), and its output is the list of candidate tables for columnstore indexes. Additionally, it won’t return any tables that already have a columnstore index on them.

with widetables as
(
    select a.name,
           count(b.column_id) as ColumnCount,
           sum(c.row_count) as Row_Count
    from sys.objects a
        inner join sys.columns b on b.object_id = a.object_id
        inner join sys.dm_db_partition_stats c on c.object_id = a.object_id
        inner join sys.partitions d on d.object_id = a.object_id
    where d.data_compression not in (3, 4)
    group by a.name
    having count(b.column_id) > 5
       and count(b.column_id) < 1200
       and sum(c.row_count) > 1000000
),

--Check Update %
--Checks for Percentage of Update Statements
updaterate as
(
    SELECT o.NAME AS [Table_Name]
        ,x.NAME AS [Index_Name]
        ,i.partition_number AS [Partition]
        ,i.index_id AS [Index_ID]
        ,x.type_desc AS [Index_Type]
        ,i.leaf_update_count * 100.0 /
            (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
             + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) AS [Percent_Update]
    FROM sys.dm_db_index_operational_stats(db_id(), NULL, NULL, NULL) i
    INNER JOIN sys.objects o ON o.object_id = i.object_id
    INNER JOIN sys.indexes x ON x.object_id = i.object_id
        AND x.index_id = i.index_id
    WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
           + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
        AND objectproperty(i.object_id, 'IsUserTable') = 1
        AND i.leaf_update_count * 100.0 /
            (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count
             + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) > 20
)

select name as CandidateTables from widetables

except

select table_name from updaterate;

 

I hope this script is useful, and can provide some nice guidance as to where to use a columnstore index.

 

Columnstore Indexes and Transparent Data Encryption in SQL Server 2014

I spoke on columnstore indexes in SQL Server 2014 last weekend at SQL Saturday #262 in Boston, and an audience question came up about using Transparent Data Encryption (TDE) in conjunction with columnstore indexes. Off the top of my head, I couldn’t think of any reason why it would be a problem—and I was correct—but I also wanted to test it out and check the performance and compression levels. I used the standard BigTransactionHistory table (courtesy of Adam Machanic, b|t) for all testing.

Testing

The first thing I did was create a new database with Transparent Data Encryption turned on.

USE master;
GO

CREATE DATABASE TDE_TEST;
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Demo4Blog';
GO

CREATE CERTIFICATE DemoBlog WITH SUBJECT = 'DemoBlogCert';
GO

USE TDE_Test;
GO

CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_128
ENCRYPTION BY SERVER CERTIFICATE DemoBlog;
GO

ALTER DATABASE TDE_Test
SET ENCRYPTION ON;
GO

 

Next, I created a table and a clustered columnstore index in my new database.

select *
into tde_test.dbo.TDE_CS_Test
from bigtransaction_cs;

create clustered columnstore index CCI_TDE_Test
    on tde_test.dbo.TDE_CS_Test;

 

BigTransaction_CS is my source table, which lives in the unencrypted AdventureWorks2012 database—I’m just making a copy of it in the new encrypted TDE_Test database created above.

Size Matters

Using a query to find the sizes of objects in my databases, I examined the size of each object.
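Something along these lines (a sketch, not necessarily the exact query I used) returns the table name, index name, row count, page count, and size in MB:

-- A sketch of an object size query; aliases roughly match the output below
SELECT t.name                               AS TableName,
       i.name                               AS indexName,
       SUM(ps.row_count)                    AS RowCounts,
       SUM(ps.used_page_count)              AS TotalPages,
       SUM(ps.used_page_count) * 8 / 1024   AS TotalSpaceMB
FROM sys.dm_db_partition_stats ps
INNER JOIN sys.tables t
    ON t.object_id = ps.object_id
INNER JOIN sys.indexes i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
GROUP BY t.name, i.name
ORDER BY TotalSpaceMB DESC;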

TableName              indexName                                   RowCounts   TotalPages   TotalSpaceMB
Columnstore            ClusteredColumnStoreIndex-20140219-151205   62527202    38570        301
Encrypted_Columnstore  CCI_TDE_Test                                62527202    43322        338
TDE_NoCompression      NULL                                        31263601    143441       1120
Clustered_NoEncrypt    pk_bigTransactionHistory                    31263601    143693       1122

 

As you can see, the row counts are the same, but the page counts and space used are higher for the encrypted table—this is one of the penalties of TDE: the data won’t compress as well. However, the difference in compression is only about 10%, and the compression effects are still rather dramatic, even with encryption. The chart below shows the compressed versus uncompressed sizes of the data.

Figure 1 Size of uncompressed and compressed tables with and without encryption

Performance

In my demos on this topic, I have a fairly standard query that I use. Each query was run six times, with the last five runs counted toward the results.

SELECT transactiondate
    ,avg(quantity)
    ,avg(actualcost)
FROM bigtransaction_cs
WHERE TransactionDate < '2007-07-01'
    AND Quantity > 70
    AND Quantity < 92
GROUP BY TransactionDate
ORDER BY transactionDate;
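For reference, the logical reads and timings for tests like these can be captured by wrapping the query with the statistics settings (a sketch, not necessarily exactly how I gathered these numbers):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the test query above; logical reads and CPU/elapsed time appear on the Messages tab

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;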

 

Encryption typically does negatively impact query performance due to the CPU overhead, as is clearly illustrated below.

Figure 2 Query Performance with Encryption

There’s roughly a 36% performance degradation using TDE—based on the logical read counts from the queries, it is all related to the encryption overhead and not really affected by the small difference in compression.

Summary

Generally speaking, if you are using encryption, there is some legal or regulatory reason for it and you don’t have any other options. Just be aware that you can still get pretty good compression levels using it in conjunction with clustered columnstore indexes in SQL Server 2014, even though your performance may not be as amazing as it otherwise would be.
