Troubleshoot Hive ODBC Connections in Excel

Microsoft has deprecated the Hive plug-in for Excel (which is a little disappointing—it was never production ready, but it did offer some really nice functionality). If you download and install the latest driver, the Hive plug-in actually goes away, so I started attempting to use the connection via ODBC in Excel instead.

So, I configured a new System DSN in ODBC settings using the default Hive driver and port. This is a local HDInsight installation—I'm still having issues connecting to my Linux cluster, but I'll work that out in a separate post.

I accept the default port, which is 10000. This has worked in the past with HDInsight (or so I thought, until I reread my documentation). One thing to note: there is no "Test Connection" button in the DSN configuration; there is one in Excel, but not here. If anyone from Microsoft is reading this, a connection test would be a really nice feature to include in the next revision of the driver.

Once in Excel, I go through the process of creating a new ODBC connection through the Data Connection Wizard.

The wizard finds my connection, called "hive". However, a Data Link Properties box is launched when the connection can't be made directly. If I execute a "Test Connection" I get the error "batch size is not set or invalid."

That error was pretty cryptic; fortunately, I had tracing turned on for the connection. The log revealed the following:

[DEBUG]-> ThreadID 10716 # 08/15/13 09:31:51 # src\Conn.c # SQLDriverConnectW # 1065 # DB = default, HOST = 10.10.50.111, PORT = 10001, PATH = , FRAMED = 0

[ERROR]-> ThreadID 10716 # 08/15/13 09:31:52 # src\Conn.c # ConnectToHive # 346 # 08001 : Could not connect client socket. Details: <Host: 10.10.50.111 Port: 10001> endpoint: 10.10.50.111, error: connection_refused

[ERROR]-> ThreadID 10716 # 08/15/13 09:31:52 # src\Conn.c # ConnectToHive # 348 # 08001 : Unable to establish connection with hive server

When I googled the error, I found this post, which suggested pointing to port 10001 instead of the default 10000. So I fixed it, changing the port to 10001.

And success—we have Hive data in Excel, and we can load it into Power Pivot if needed. However, my one disappointment with the Hive ODBC driver compared to the Hive plug-in for Excel is that there is no interactive query mode; it's pretty much an all-or-nothing proposition. Ideally, I would like to see that functionality return to the driver.

Power Query and Hadoop

So I've been doing some testing with Power Query, both for my job and for some upcoming presentations I'm doing with Stacia Misner (b|t), and in general I like the tool. However, particularly in the Hadoop space, I have some issues that I hope are addressed before the tool officially goes live. Stacia and I gave a similar presentation at the PASS Business Analytics conference in Chicago in April, and at the time we were using the Hive plug-in for Excel. If you aren't familiar with Hive, it's a part of the Hadoop ecosystem that allows the user to create a meta-database on top of an existing file or set of files that live in the Hadoop Distributed File System (HDFS). Hive supports a language called HiveQL (also known as HQL), which is very similar to SQL. The nice thing about Hive (and the plug-in), particularly for data files that don't have a header row, is that it allows us to define a structure on the files and bring the column names into our BI tools. That column name feature is great, and the lack of it is (to an extent) a limitation of Power Query—but that's a different post.
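
To make the Hive piece concrete, here is a minimal HiveQL sketch of the kind of definition I'm talking about; the table name, columns, delimiter, and HDFS path are purely illustrative, not taken from my environment.

CREATE EXTERNAL TABLE sensor_readings (
    reading_date STRING,
    device_id    INT,
    temperature  DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/demo/sensor_data';

-- Once the table is defined, BI tools connecting through Hive see named,
-- typed columns instead of raw delimited text:
SELECT device_id, AVG(temperature) AS avg_temp
FROM sensor_readings
GROUP BY device_id;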

So, enter Power Query—which can connect to most types of data and is a great new feature in the BI arsenal. First, let me show you how to import files from Hadoop, and then I’ll show you a problem I encountered that could potentially limit the usefulness of Power Query as a connector to Hadoop.

Importing Files from Hadoop: The Good

The process to import files is as follows:

  1. On the Power Query tab, click From Other Sources > From Hadoop File (HDFS).

  2. Enter the name of the Hadoop server—in this case, the IP address of the name node.

  3. We are provided with a list of files, which show up as Binaries.

What's nice about this is that clicking the word "Binary" next to one of the files automatically loads the content of that file into the preview pane (with the option to add it to our Excel worksheet), from which it can be loaded into a data model.

Encountering the Preview Pane Fail

The ability to see a list of files in Hadoop in the preview pane is really nice, especially when we can easily take the next step to load a selected file into a data model. However, I’m running into a truncation issue in the preview pane.

I can filter the list based on criteria like date or file extension, but I'm warned that the list may be incomplete.

I "Sent a Frown" to the Power Query team about this behavior. A Hadoop cluster can easily have thousands of files, if not many more, so limiting the content (and, at a minimum, not loading the full content into the filters) is problematic. To their credit, they quickly responded that the correct user behavior was to click the "Done" button in the Query Editor preview pane and load the data into an Excel worksheet. So I did that:

All of my files were loaded correctly. However, you'll notice that the file type (where it says [Binary]) is no longer a hyperlink, so I can't load a file directly into my data model from here. I think the best approach would be for the links from the preview pane to carry through into the worksheet; I understand the limitations of the preview pane, and if I could just load the hot links into the worksheet, this capability would be fantastic.

***Update

Thanks to my new friend Ehren at Microsoft, I worked through some of these issues. The biggest one is using text filters to narrow down the listing. I was trying to filter this way:

Typing in that text box will only reference data that is currently in the pick list. Instead, I needed to click Text Filters > Begins With…

From there I can input my HDFS URL with a folder location.

This returns a more limited set of rows:

ColumnStore Archival Compression–SQL Server 2014

Since I'm fortunate enough to be presenting on data compression at the PASS Summit this year, I've been doing a lot of work with both data compression and ColumnStore indexes, which were a new feature starting in SQL Server 2012 Enterprise Edition. In 2012, while these column-oriented indexes gave us a great deal of both compression and performance (via deep compression and a new query processing mode called batch mode), they had significant limitations—the biggest one being that they were not updateable, and the second biggest one being that they could not be clustered indexes, so we had to store duplicate copies of our data.
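
For reference, this is roughly what the SQL Server 2012 flavor looked like; the table, column, and index names here are hypothetical, just to illustrate the duplicated-copy and read-only limitations.

-- SQL Server 2012: a nonclustered columnstore index is a second, separately
-- compressed copy of the selected columns...
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, ResellerKey, SalesAmount);
-- ...and while it exists, the base table is read-only; to load new data you
-- had to disable or drop the index, load, and then rebuild it.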

Before I go any further, for a great deal more detail I have to recommend my colleague Niko Neugebauer's in-depth blog series on clustered ColumnStore indexes—Niko is going through the ins and outs of this feature, which is new for SQL Server 2014, while I am only briefly touching on it in this post. Additionally, there is a really great white paper from Microsoft Research that goes pretty in-depth on the enhancements to this feature for SQL 2014 here. One thing to note: some of the features the white paper promises (such as the removal of some of the data type restrictions) are not implemented in CTP1 of SQL 2014, as far as I can tell.

However, the interesting feature I wanted to touch on is the new COLUMNSTORE_ARCHIVE compression mode option. From the above white paper:

For example, the data may be partitioned by date and the most recent data is accessed much more frequently than older data. In such cases the older data can benefit from additional compression at the cost of slower query performance. To enable this scenario we added support for archival compression of SQL Server column stores. … The further reduction obtained by archival compression is substantial, ranging from 37% to 66% depending on the data. We also compared with GZIP compression of the raw data. Archival compression consistently achieved a better compression ratio, sometimes considerably better.

–source “Enhancements to SQL Server Column Stores, Larson et al. June 2013”

So this feature looks like it could be really attractive, particularly for scenarios where we have archival data that we are only keeping around for regulatory reasons. Additionally, it has been implemented in such a way that we can partition a table with a clustered columnstore index and then change the compression mode for individual partitions, limiting the performance impact to the older records that we query less often. There's not a lot of documentation on this that I've seen, so I will show how to do it here.

The first thing I did was load up a copy of the AdventureWorksDW2012 database and make a copy of dbo.FactResellerSales—note that clustered columnstore indexes do not currently support foreign key relationships (which is why I created a copy of the table: to get rid of the FKs). I then partitioned the table on a date range and proceeded to create a clustered ColumnStore index on it.
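
A rough sketch of that setup might look like the following; the partition function and scheme names, the boundary values, and the abbreviated column list are my own assumptions rather than the exact code from my environment.

USE [AdventureWorksDW2012];
GO

-- Hypothetical partition function/scheme on the integer date key
CREATE PARTITION FUNCTION pf_OrderDateKey (int)
AS RANGE RIGHT FOR VALUES (20050701, 20060101, 20060701, 20070101, 20070701, 20080101, 20080701);

CREATE PARTITION SCHEME ps_OrderDateKey
AS PARTITION pf_OrderDateKey ALL TO ([PRIMARY]);

-- Abbreviated copy of FactResellerSales (no foreign keys), placed on the scheme
CREATE TABLE dbo.fact_Cs_test
(
    ProductKey   int   NOT NULL,
    OrderDateKey int   NOT NULL,
    ResellerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
) ON ps_OrderDateKey (OrderDateKey);

INSERT INTO dbo.fact_Cs_test (ProductKey, OrderDateKey, ResellerKey, SalesAmount)
SELECT ProductKey, OrderDateKey, ResellerKey, SalesAmount
FROM dbo.FactResellerSales;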

By default, all of the columns in the table get added to the index. The syntax for creation is below—note that we don't specify any compression options in the index build process.
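
A minimal sketch of that statement, assuming the partitioned copy from above (the index name is mine):

-- Build the clustered columnstore index; with no DATA_COMPRESSION option,
-- every partition gets the default COLUMNSTORE compression.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_fact_Cs_test
ON dbo.fact_Cs_test;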

However, after we build the ColumnStore, if we right-click the table it belongs to in SSMS and select Storage, we get the following menu:

I've chosen Archival Columnstore compression for the last four partitions in that table. In T-SQL, that looks like this:

USE [AdventureWorksDW2012]
GO

-- Rebuild the older partitions with archival columnstore compression
ALTER TABLE [dbo].[fact_Cs_test] REBUILD PARTITION = 5 WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

ALTER TABLE [dbo].[fact_Cs_test] REBUILD PARTITION = 6 WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

ALTER TABLE [dbo].[fact_Cs_test] REBUILD PARTITION = 7 WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

ALTER TABLE [dbo].[fact_Cs_test] REBUILD PARTITION = 8 WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

So let’s see what kind of space savings we get from this process. First we’ll look at the space used by the basic ColumnStore compressed partitions:

I used the below query to show the space used per partition:

select b.name, a.partition_number, a.rows,
       ((c.in_row_data_page_count + c.lob_used_page_count + c.row_overflow_used_page_count) * 8) as 'Data KB'
from sys.partitions a, sys.objects b, sys.dm_db_partition_stats c
where a.object_id = b.object_id
  and b.name = 'Fact_CS_Test'
  and a.partition_id = c.partition_id;

So here is the space, before I changed the partitions over to archival ColumnStore compression.

After I compressed partitions 5-8 using archive compression, the space utilization was as below:

So we averaged 10.5% additional compression over the existing ColumnStore compression. That is lower than Microsoft observed, but I also didn't really evaluate my data to see how well it would compress. Archival compression has the potential to be a really useful feature, particularly for firms whose regulatory requirements mean they need to keep data online for extended periods of time.
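
One quick way to confirm which compression setting each partition ended up with is to check sys.partitions; a sketch of such a query:

-- data_compression_desc shows COLUMNSTORE vs. COLUMNSTORE_ARCHIVE per partition
SELECT p.partition_number, p.rows, p.data_compression_desc
FROM sys.partitions AS p
JOIN sys.objects AS o ON p.object_id = o.object_id
WHERE o.name = 'Fact_CS_Test'
  AND p.index_id IN (0, 1)   -- heap or clustered (columnstore) index
ORDER BY p.partition_number;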
