The Modern Analytics Architecture—My PASS BA Conference Session

In a couple of weeks, I will be presenting at the PASS Business Analytics Conference in San Jose, CA (and this weekend at SQL Saturday Chicago). This topic feels like it has been a couple of years in the making. I got into "big data" when I was at Comcast, where we had a large number of effectively append-only applications running on very expensive relational database engines, on top of a lot of equally expensive enterprise SAN space. Even as recently as two years ago, the solutions weren't fully enterprise ready unless you had a really specialized team of engineers. Now, however, Hadoop is much easier to build and manage, and there are SQL-on-Hadoop and other tools that make connecting to the data very easy. I have wanted to build this session for a while, to illustrate, from the infrastructure, application, and business perspectives, why the toolsets are changing, and how you can use these new toolsets to gain richer analysis of your data.

This session isn't just going to be about Hadoop: you will learn how SQL Server (and other platforms) can be used as a very powerful analytics platform across large volumes of data. You will learn about the infrastructure that drives big data and analytical solutions, and how to decide whether to deploy on-premises or in the cloud. You will learn about the skills needed to work on Hadoop and some of the other toolsets, and I might even crack a few jokes about Oracle.

 

Office 365—Are You Done Yet?

Power BI in Office 365 has some new and exciting features like Power Map and Power Q&A, and it is also required in order to use the Power BI "modern" app (that's Metro to most of us). In general, I've been really happy with the way things work in Office 365, but from time to time I run into odd behavior. This week was one of those weeks.

The upper limit for a file that you can load into Power BI is 250 MB, and if your file is above 30 MB you need to enable it for Power BI. I was building a demo this week that has a good amount of data and ends up being right around 200 MB. I was a little nervous about uploading a file that "big" to the cloud over a regular internet connection. As it happened, I was working from home yesterday, and kicked off the upload process right before I ran an errand over lunch. When I got back an hour or so later, I saw this:

 

Figure 1 Upload Dialog from Power BI

However, when I checked resource monitor for network activity, I saw none from Internet Explorer.

Figure 2 Network Activity

I opened a new window, and my file was actually in Power BI, so I left the upload dialog alone overnight. When I logged in this morning, the dialog box was still there. While this isn't a big deal to IT folks (we are used to things not working correctly), I can imagine it being disconcerting to a business user trying to upload a report.

 

Does Power View Have a Problem with Shrinkage?

I do a decent amount of work with Power View. I think it is a pretty nice tool for doing visualizations across disparate data sets, and it is mostly easy enough to be accessible to power users. However, there are a few things I feel the need to complain about. My biggest complaint is that the user doesn't have the ability to change a chart title; the only way to avoid the system-generated chart title is to hide it and then add a text box. However, the other day I accidentally found a new bug: chart shrinkage.

I was building a very simple chart to test something in Power BI (SharePoint Online), so I used Power Query and one of my favorite test data sites (Pro Football Reference; they have good HTML tables that work really nicely with Power Query). I grabbed the 2013 NFL defensive stats and threw them into a clustered bar chart.

Figure 1 Power View Chart of NFL Defensive Stats 2013

Look at that: the Saints were pretty good on defense last year, particularly pass defense (thanks, Rob Ryan and Keenan Lewis). This chart was fine and served my purpose, but in order to demo something to someone, I tiled it by team.

 


Figure 2 Tiled by Team

 

When I remove the tile-by, the chart now has a scroll bar (note that the last visible team is the Arizona Cardinals, rather than the Washington NFL team as in Figure 1).

 

Figure 3 Shrunken Chart

Repeating this process several times, the chart ends up looking like this:

Figure 4 Who shrunk my chart?

It is easy to drag the chart down to make the data visible again, but this clearly isn't the intended behavior. Hopefully, like Power Query (which seemingly gets new fixes every week), this issue (and the chart title one) will get resolved quickly.

 

When the Cloud is Definitely Right for You

A lot of my colleagues with extensive IT experience on some very large workloads (and I generally count myself in this number) have balked at the notion that all IT workloads can be moved to cloud computing models. In general, being in a multi-tenant environment is not the ideal scenario for very large, intensive database workloads, and cloud computing doesn't change that. The vendors are incentivized to provide you adequate performance while maximizing the utilization of their infrastructure, so if you want dedicated performance, you are going to pay for it (at least at Amazon; as of this writing, Microsoft doesn't have a performance storage offering in Azure). For the purposes of this post, "cloud computing" refers to Infrastructure as a Service (IaaS); platform and software as a service are different scenarios with different requirements.

When the Cloud is Right

So right now I see two real scenarios where IaaS is an ideal fit—there are others (quick dev environments, testing new platforms, restore testing), but my focus is attuned to these two scenarios:

  • Disaster Recovery
  • Small Business

Disaster Recovery

If you are working for a major multinational corporation with three or four data centers, feel free to ignore this section (actually, ignore this post and go read a licensing one). If you work for a small to medium size business and your DR strategy is to take your backups home with you, you are my target audience. The cloud brings you a lot of really nice options:

  • Backup directly to Azure (or other providers through API)
  • All SQL Server DR options are available in hybrid models

I'm not suggesting you stop taking backups, just that you think about the cloud as your offsite backup repository. You can also think of the cloud as your second data center, one that you only have to pay for when you use it. For example, maybe you only need DR for an application during certain business periods of the year; IaaS makes this a viable solution.
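To give you an idea of how simple the first option is, here is a minimal sketch of backing up a database directly to Azure blob storage from SQL Server 2014; the storage account, container, and credential names are placeholders rather than anything from a real environment:

CREATE CREDENTIAL AzureBackupCred
WITH IDENTITY = 'mystorageaccount',       --Azure storage account name (placeholder)
SECRET = '<storage account access key>';  --access key copied from the Azure portal

BACKUP DATABASE AdventureWorks2012
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks2012.bak'
WITH CREDENTIAL = 'AzureBackupCred', COMPRESSION, STATS = 10;

The same URL can later be used with RESTORE, which makes periodic restore testing against the offsite copy straightforward.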

Small Business

The real genesis of this post was an email I received following the presentation I did last week, "Building Perfect SQL Servers, Every Time." My goal was teaching people how to install and configure SQL Server in an optimal and repeatable fashion. The session went well, and I had a number of follow-up questions via email and Twitter, but I noticed something in a few of them: people were asking how to optimally configure hardware that simply wasn't enough to serve their needs (e.g. a server with only 2 physical drives running SQL Server), or they were a part-time SA and DBA (and probably a couple of other things).

If you have fewer than 10 servers and they are running in a closet, you should definitely consider a cloud model. The networking is getting easier to set up, and you pay for only what you need. There are other things to think about, such as traffic and bandwidth, but the more important point is that your data is far more protected than it would be running on a server with two drives. In Azure, there are three copies of your data (this doesn't mean you shouldn't take database backups; when you back up to Azure, there are also three copies of that backup data). You may be concerned about security, but consider that Microsoft has an extensive set of security certifications, while your servers are sitting in a closet.

Summary

Cloud computing isn’t for everyone—but if you are supporting a few servers, it can make both financial and technical sense. Focus your efforts on your core business needs, instead of trying to manage infrastructure with band aids and duct tape.

 

 

 

Where Should I Use a Columnstore Index in SQL Server?

I’ve been blogging a lot about columnstore indexes in SQL Server 2014—for good reason, I think it is a great feature that adds a lot of value to SQL Server for analytic workloads. One of the other (and probably more talked about) features in SQL Server 2014 is In-Memory OLTP (formerly known as Hekaton)—the SQL team added a neat new report that’s accessible to users of Management Data Warehouse (MDW) to show which tables and procedures in a given database might be well suited for In-Memory. There is nothing like that yet for a columnstore index.

 

Figure 1 SQL 2014 Management Data Warehouse In-Memory Tables Assistant

 

Microsoft's official guidance is to use columnstores on large fact tables and on dimensions with more than 1,000,000 rows. Typically (and this applies to regular page compression as well), you are going to look at tables that don't have a very high update percentage as columnstore candidates. Despite the ability to update clustered columnstores, this feature remains largely focused on the append-only type of workload present in data warehousing systems. The insert process has been designed around doing bulk inserts very efficiently, not around singleton updates and inserts.

So I wanted to put together logic to identify candidate tables. The script below checks for tables with more than 1,000,000 rows and an update percentage below 20 (both thresholds are easy enough to modify), and the output is the list of candidate tables for columnstore indexes. Additionally, it won't return any tables that already have a columnstore index on them.

--Wide, large tables that are not already columnstore compressed (data_compression 3 or 4)
with widetables as
(select a.name,
    count(b.column_id) as ColumnCount,
    sum(c.row_count) as Row_Count
from sys.objects a
    inner join sys.columns b on b.object_id = a.object_id
    inner join sys.dm_db_partition_stats c on c.object_id = a.object_id
    inner join sys.partitions d on d.object_id = a.object_id
where d.data_compression not in (3,4)
group by a.name
having count(b.column_id) > 5
    and count(b.column_id) < 1200
    and sum(c.row_count) > 1000000)
,

--Check Update %
--Checks for Percentage of Update Statements
updaterate as
(
SELECT o.NAME AS [Table_Name]
    ,x.NAME AS [Index_Name]
    ,i.partition_number AS [Partition]
    ,i.index_id AS [Index_ID]
    ,x.type_desc AS [Index_Type]
    ,i.leaf_update_count * 100.0 /
        (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats(db_id(), NULL, NULL, NULL) i
INNER JOIN sys.objects o ON o.object_id = i.object_id
INNER JOIN sys.indexes x ON x.object_id = i.object_id
    AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
    AND objectproperty(i.object_id, 'IsUserTable') = 1
    AND i.leaf_update_count * 100.0 /
        (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) > 20
)

--Candidates: wide, large tables minus those with a high update percentage
select name as CandidateTables from widetables
except
select table_name from updaterate

 

I hope this script is useful, and can provide some nice guidance as to where to use a columnstore index.
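For completeness, once you have picked a candidate table, creating the index itself is a single statement. Here is a minimal sketch; the table and index names are hypothetical, and the table needs to be a heap (or have its existing clustered rowstore index dropped first):

--Converts the whole table to columnstore storage
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;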

 

Columnstore Indexes and Transparent Data Encryption in SQL Server 2014

I spoke on columnstore indexes in SQL Server 2014 last weekend at SQL Saturday #262 in Boston, and an audience question came up about using Transparent Data Encryption (TDE) in conjunction with columnstore indexes. Off the top of my head, I couldn't think of any reason why it would be a problem. I was correct, but I also wanted to test it out and check the performance and compression levels. I used the standard BigTransactionHistory table (courtesy of Adam Machanic b|t) for all testing.

Testing

The first thing I did was create a new database with Transparent Data Encryption turned on.

USE master;
GO

CREATE DATABASE TDE_TEST;
GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Demo4Blog';
GO

CREATE CERTIFICATE DemoBlog WITH SUBJECT = 'DemoBlogCert'
GO

USE TDE_Test
GO

CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_128
ENCRYPTION BY SERVER CERTIFICATE DemoBlog
GO

ALTER DATABASE TDE_Test
SET ENCRYPTION ON
GO
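This next check wasn't part of my original demo script, but if you want to confirm that background encryption has finished before you start testing, something like the following works (an encryption_state of 3 means the database is fully encrypted):

SELECT db_name(database_id) AS database_name,
    encryption_state,   --2 = encryption in progress, 3 = encrypted
    percent_complete
FROM sys.dm_database_encryption_keys;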

 

Next, I created a table and a clustered columnstore index in my new database.

select *
into tde_test.dbo.TDE_CS_Test
from bigtransaction_cs;

create clustered columnstore index CCI_TDE_Test
on tde_test.dbo.TDE_CS_Test;

 

BigTransaction_CS is my source table, which lives in the unencrypted AdventureWorks2012 database; I'm just making a copy of it in the new encrypted TDE_Test database created above.

Size Matters

Using a query against the system catalog to find the sizes of objects in my databases, I examined the size of each table.
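For reference, here is a minimal sketch of the kind of size query I mean; it is the common pattern over the catalog views rather than the exact query I used:

SELECT t.name AS TableName,
    i.name AS indexName,
    SUM(p.rows) AS RowCounts,
    SUM(a.total_pages) AS TotalPages,
    SUM(a.total_pages) * 8 / 1024 AS TotalSpaceMB   --8 KB pages converted to MB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
GROUP BY t.name, i.name
ORDER BY TotalSpaceMB DESC;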

TableName               indexName                                   RowCounts   TotalPages   TotalSpaceMB
Columnstore             ClusteredColumnStoreIndex-20140219-151205   62527202    38570        301
Encrypted_Columnstore   CCI_TDE_Test                                62527202    43322        338
TDE_NoCompression       NULL                                        31263601    143441       1120
Clustered_NoEncrypt     pk_bigTransactionHistory                    31263601    143693       1122
 

As you can see, the row counts match, but the page counts and space used are higher for the encrypted columnstore. This is one of the penalties of TDE: the data won't compress as well. However, the difference in compression is only about 10%, and despite this the compression effects are still rather dramatic, even with encryption. The chart below shows the compressed versus uncompressed sizes of the data.

Figure 1 Size of uncompressed and compressed tables with and without encryption

Performance

In my demos on this topic, I have a fairly standard query that I use. Each query was run 6 times, with the last 5 runs counting for the data.

SELECT transactiondate
    ,avg(quantity)
    ,avg(actualcost)
FROM bigtransaction_cs
WHERE TransactionDate < '2007-07-01'
    AND Quantity > 70
    AND Quantity < 92
GROUP BY TransactionDate
ORDER BY transactionDate;

 

Encryption typically does have a negative impact on query performance due to the CPU overhead, and that is clearly illustrated below.

Figure 2 Query Performance with Encryption

There is roughly a 36% performance degradation with TDE. Based on the logical read counts from the queries, it is all related to the CPU overhead of encryption and not really affected by the small difference in compression.
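If you want to capture similar logical read and CPU numbers in your own environment, a simple approach (not specific to this demo) is to turn on the session statistics before running the test query and read the results from the Messages tab in Management Studio:

SET STATISTICS IO ON;    --reports logical reads per table
SET STATISTICS TIME ON;  --reports CPU and elapsed time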

Summary

Generally speaking, if you are using encryption, there is some legal or regulatory reason why you are using it, and you don't have any other options. Just be aware that you can still get pretty good compression levels using it in conjunction with clustered columnstore indexes in SQL Server 2014, even though your performance may not be as good as it otherwise would be.

Building Perfect SQL Servers, Every Time

Tomorrow I'll be presenting to the DBA Fundamentals VC on Building Perfect SQL Servers, Every Time. This is a session I've been wanting to build for a while. As many of you know, SQL Server does not install with best practices configured out of the box, and in my role as a consultant I see a lot of installations where the installer simply clicked next. What I really wanted to call this presentation was "Don't Just #$%^ing Click Next When You Install SQL Server."

Additionally, you can grab the script files here.

In tomorrow’s session I will cover the following:

  • General best practices for configuring SQL Server
  • What you should change after install
  • How to manage a larger environment
  • How you can automate this process, so you get the same server every time.

We will talk about how to automate physical builds, and a little bit about building your own private SQL Server cloud—what it takes, and what your next steps are.
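To give a flavor of the "what you should change after install" portion, here is a minimal sketch of the kind of instance-level settings involved; the values shown are illustrative placeholders, not the specific recommendations from the session:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max server memory (MB)', 28672;       --leave headroom for the OS
EXEC sp_configure 'cost threshold for parallelism', 50;  --raise from the low default of 5
EXEC sp_configure 'backup compression default', 1;       --compress backups by default
RECONFIGURE;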

 

Getting Started with Azure Hybrid Models—Point to Site Networking

I've done several presentations on hybrid disaster recovery (DR) using SQL Server and Windows Azure, and configuring the network has been a headache every time. Some SQL Server DR methods do not require a direct network connection (log shipping, mirroring, backups) and may be secured using certificates for encryption; more advanced configurations require a VPN connection. Also, if domain services are required (either to connect back to an on-premises Windows Domain Controller (DC), which is good, or to replicate your on-premises Domain Controller to a read-only DC in Azure, which is best), you will need a VPN connection.

If you would like a little background on networking in Azure, I highly recommend Mark Russinovich’s (b|t) session from TechEd last year on Windows Azure internals.

There are three types of direct connections to Windows Azure:

  1. Site to Site Network—an approved VPN device in your DMZ connects via VPN (encrypted, but not dedicated) to a network gateway in Windows Azure. This is the best fit for low-volume, but regularly used, connections.
  2. Point to Site Network—the focus of this post; a certificate-based VPN that allows individual machines to connect directly to the network gateway in Windows Azure using a certificate and a VPN client. Each machine must have the certificate. This is designed for demos and for admin users who may need to connect directly to Azure virtual networks.
  3. ExpressRoute—this is the true enterprise-class solution for connecting to Windows Azure. You can connect either at your ISP (participating ISPs are on this list) or via a dedicated MPLS VPN from your WAN to Windows Azure. This guarantees a fixed amount of bandwidth and performance.

Figure 1 ExpressRoute Networking — http://www.windowsazure.com/en-us/services/expressroute/

Configuring Point to Site Networking

So if you are reading this, I am going to assume you have a Windows Azure account; if you don't, get one, as there are lots of interesting things to play with. Connect to the Azure management portal and scroll down to Networks.

Figure 2 Add new network in Windows Azure

From that menu select Virtual Network > Custom Create. You will be prompted to give your network a name and to select an existing affinity group or create a new one. I'm not going to write about affinity groups today, but they are an important part of Windows Azure; you can read more about them here.

Optionally, you can add a DNS server name and address. If you do not, Windows Azure will provide DNS services for your VMs. In my experience, if you want your VMs to connect to the internet, do not set up a DNS server (this mainly applies to point to site demo work; if you are doing this in production, set up a DNS server). There is also an option to configure point to site networking: check this box, and if you have a real VPN device, you may also check site-to-site.

Figure 4 Configuring Point to Site VPN in Network

The next screen just covers your IP address space; for limited demos you can accept the default and click next. Finally, click the "Add Gateway Subnet" option.

Figure 5 Add Gateway Subnet

You now have a new network, but with no VPN gateway to connect to. From the network dashboard, click the Create Gateway button at the bottom of the screen. This will take about 20-25 minutes; don't worry, there is other stuff to do.

Figure 6 Network without gateway

You have two other tasks. The first is to create an Azure VM that is on this network (see the screenshot below; you can choose the network on the third screen of the VM creation process).

Figure 7 Creating a new Azure VM

While your VM is being created, you will need to create your certificates, both to upload to your Azure gateway and to install on the client machine. In my opinion, this isn't clear in the Microsoft documentation, so I'll draw a picture.

So you will create the root certificate on the main machine you want to connect to the VPN (in my case it is my on-premises Domain Controller) and then upload it to your Azure network.

Figure 8 Cert uploaded to network

You will also create a client certificate using this root certificate. You will install the client certificate on any machine you want to connect to your VPN; do not recreate and upload a new root certificate for each machine that you want to connect. In order to create the certificates, you will need makecert.exe, which you can get as part of the Windows SDK. For the detailed instructions on creating the certificates, go to this MSDN page and find "Create a self-signed root certificate." Execute those instructions all the way through "Install the client certificate."

Now you are ready to connect. On your gateway page (it has probably finished creating by now), on the right side of the screen you will see "Download the 64-bit Client VPN Package" (you aren't really using a 32-bit OS, are you?); click that link. It's an exe that is not signed by Microsoft, so Windows will yell at you for installing it. It will install quickly, and if you expand the network pane in Windows, you will see a new connection.

Figure 9 Connect to Azure VPN

Click connect, and then you will see…

Figure 10 Azure VPN Connection

If you run ipconfig, you will see the connection to Azure.

Figure 11 IP Record from Azure VPN

One warning here: I've had this happen with a client, and it happened to me in the office. When I connect the VPN, I can't get to the internet on any of my connections. Oddly, this has only happened on corporate networks; it hasn't happened when I've been on the same connection through a virtual machine on my laptop. More to come later.

Update: This behavior is tied to Cisco firewall equipment. I will address in a follow up post.

Now (if you’ve turned Windows Firewall off) you can ping your VMs, and get started using SQL Server HA and DR functions with them.

Into the Blue—Building DR in Windows Azure

Tomorrow (at 1300 EDT/1800 GMT), I'll be presenting to the PASS High Availability and Disaster Recovery VC on one of my favorite topics of late: building disaster recovery from your data center into Windows Azure Infrastructure as a Service. You can register for the session here. While I'm not screaming "cloud, cloud, cloud" from the rooftops, I do feel like DR is a really great point of entry to the cloud for many smaller organizations, who may lack the resources to have a second data center, or who perhaps only need their applications to be highly available during certain parts of the year. Additionally, the ability to back up SQL Server databases directly to Azure Blob Storage can meet your offsite backup requirements with very little headache.

The cloud isn’t all unicorns and rainbows however—there are some definite challenges in getting these solutions to work properly. I’ll discuss that and the following in this session:

  • Choosing the right DR solution
  • Networking and the cloud
  • Economics of cloud DR
  • Implementation concerns for AlwaysOn Availability Groups

I hope you can join me on Tuesday.

Stupid Interview Questions—Trying to Find the Right Employees

I was driving into the office this morning and, as I do whenever I'm in the car, I was alternating between NPR and Mike and Mike on ESPN Radio (professional sports are an interest of mine). Mike and Mike were discussing a recent story about the Cleveland Browns asking their potential draft picks questions such as "Name all of the things you could do with a single brick in one minute." I'm not sure about you, but I don't see how that translates into hiring the most effective person to snap a football and then block a raging defensive lineman, but hey, I don't work in football.

What does this have to do with football?

Bad IT Interviews

I do, however, work in IT, and I've interviewed for a lot of roles, as well as interviewed a lot of people for roles. Fortunately, for my current role I was hired through people I had worked with in the past, and there was barely a formal interview process. Even in my previous role, at the internet provider many of you are likely reading this post on, the interview process was straightforward and consisted mostly of conversations about technology and design style. I actually have to think back many moons to one particularly bad interview, with a director who thought he was the be-all and end-all of IT management. Some of the questions were:

  • How many gas/petrol stations are there in the United States?
  • Why is a manhole cover round?
  • How many pieces of bubble gum are in this jar? (Ok, I made this one up, but you get the idea)

To this list I would like to add the following questions, which I hope are destined for the waste bin of history:

  • What is your biggest weakness?
  • Where do you see yourself in five years?
  • How did you hear about this position?

None of the above questions really helps me (as the hiring manager) determine whether a person is qualified for a role as a data professional (or, frankly, any other job). They are filler, and any halfway prepared candidate is going to have rehearsed answers for them.

Building the Better Interview

I've been trained on a large number of interview techniques, between business school and corporate America. There is no magic set of interview questions. However, my favorite way to interview a candidate is to get them talking about a technology they've worked on and are passionate about. This serves two purposes: it lets me see whether they are really into technology or whether it's just a job to them, and I can pretty quickly gauge their technical level with follow-up questions. Conversations are much better than trivia, e.g. "What is the third line item when you right-click on a database in SQL Server Management Studio?"

One other thing I'll add: make sure you have qualified people doing the interview. If you are trying to hire a DBA and you don't have any on staff, consider bringing in a consultant to do the interviews; it's a small investment that could save you a lot of money down the road.

So what stupid interview questions have you heard? Answer in the comments.
