Troubleshoot Hive ODBC Connections in Excel

Microsoft has deprecated the Hive plug-in for Excel, which is a little disappointing: it was never production ready, but it did offer some really nice functionality. If you download and install the latest driver, the Hive plug-in actually goes away. So I started trying to use the connection via ODBC in Excel instead.

I configured a new System DSN in the ODBC settings using the default Hive driver and port. This is a local HDInstallation of HDInsight; I’m still having issues connecting to my Linux cluster, but I’ll work that out in a separate post.

I accept the default port, which is 10000. This has worked in the past with HDInsight (I thought; then I reread my documentation). One thing to note: there is no “Test Connection” button in the DSN configuration. There is one in Excel, but not here. If anyone from Microsoft is reading this, a connection test would be a really nice feature to include in the next revision of the driver.
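
Since there is no test button in the DSN dialog, one workaround is to exercise the DSN from outside Excel. Below is a minimal sketch in Python using pyodbc; it assumes the DSN is named “hive” (the name used later in this post) and that any required credentials are stored in the DSN.

import pyodbc

# Assumes a System DSN named "hive"; add UID=/PWD= to the connection
# string if your Hive server requires credentials.
conn = pyodbc.connect("DSN=hive", autocommit=True)

cursor = conn.cursor()
cursor.execute("SHOW TABLES")   # a simple round trip proves the connection works
for row in cursor.fetchall():
    print(row)

conn.close()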

Once in Excel, I go through the process of creating a new ODBC connection through the Data Connection Wizard.

The wizard finds my connection called “hive”. However, when the connection can’t be made directly, a Data Link Properties box is launched. If I execute a “Test Connection” from there, I get the error “batch size is not set or invalid”.

That error was pretty cryptic. Fortunately, I had tracing turned on for the connection, and the log revealed the following:

[DEBUG]-> ThreadID 10716 # 08/15/13 09:31:51 # src\Conn.c # SQLDriverConnectW # 1065 # DB = default, HOST = 10.10.50.111, PORT = 10001, PATH = , FRAMED = 0

[ERROR]-> ThreadID 10716 # 08/15/13 09:31:52 # src\Conn.c # ConnectToHive # 346 # 08001 : Could not connect client socket. Details: <Host: 10.10.50.111 Port: 10001> endpoint: 10.10.50.111, error: connection_refused

[ERROR]-> ThreadID 10716 # 08/15/13 09:31:52 # src\Conn.c # ConnectToHive # 348 # 08001 : Unable to establish connection with hive server

When I googled the error, I found this post, which suggested pointing to port 10001 instead of the default 10000. So I fixed it, changing the port to 10001.
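
Before touching the DSN again, a plain socket probe can confirm which of the two ports is actually accepting connections. This is just a quick sketch; the host and ports are the ones that appear in the trace above, so adjust them for your environment.

import socket

HOST = "10.10.50.111"        # Hive host from the trace above

for port in (10000, 10001):  # the two ports discussed in this post
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect((HOST, port))
        print(f"Port {port}: listening")
    except OSError as exc:   # connection refused, timeout, etc.
        print(f"Port {port}: {exc}")
    finally:
        sock.close()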

And success: we have Hive data in Excel, and we can load it into Power Pivot if needed. However, my one disappointment with the Hive ODBC driver compared to the Hive plug-in for Excel is that there is no interactive query mode; it’s pretty much an all-or-nothing proposition. Ideally, I would like to see that functionality return in the driver.
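
In the meantime, one workaround for ad hoc queries is to issue the HiveQL yourself over the same DSN and hand the results to Excel or Power Pivot as a flat file. A rough sketch; the table and column names here are purely hypothetical.

import csv
import pyodbc

conn = pyodbc.connect("DSN=hive", autocommit=True)
cursor = conn.cursor()

# Hypothetical table and columns -- substitute your own Hive objects.
cursor.execute("SELECT log_date, url, bytes_sent FROM web_logs LIMIT 1000")

# Write the result to a CSV that Excel / Power Pivot can import.
with open("hive_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # column headers from Hive
    writer.writerows(cursor.fetchall())

conn.close()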

Power Query and Hadoop

So I’ve been doing some testing with Power Query, both for my job and for some upcoming presentations I’m doing with Stacia Misner (b|t), and in general I like the tool. However, particularly in the Hadoop space, I have some issues that I hope are addressed before the tool officially goes live. Stacia and I gave a similar presentation at the PASS Business Analytics Conference in Chicago in April, and at the time we were using the Hive plug-in for Excel. If you aren’t familiar with Hive, it’s a part of the Hadoop ecosystem that lets the user create a meta-database on top of an existing file or set of files that live in the Hadoop Distributed File System (HDFS). Hive supports a language called HiveQL (also known as HQL), which is very similar to SQL. The nice thing about Hive (and the plug-in), particularly for data files that don’t have a header row, is that it allows us to define a structure on the files and bring the column names into our BI tools. The column name feature is great, and it’s something Power Query (to an extent) lacks, but that’s a different post.
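
To make that concrete, defining the structure is just a HiveQL CREATE EXTERNAL TABLE over a folder in HDFS, with Hive supplying the column names the underlying files lack. Here is a rough sketch, sent over the same ODBC DSN as before; the table name, columns, delimiter, and HDFS location are all made up for illustration.

import pyodbc

conn = pyodbc.connect("DSN=hive", autocommit=True)
cursor = conn.cursor()

# External table over header-less, tab-delimited files already sitting in HDFS.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        log_date   STRING,
        user_id    STRING,
        url        STRING,
        bytes_sent INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    STORED AS TEXTFILE
    LOCATION '/data/weblogs/'
""")

conn.close()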

So, enter Power Query—which can connect to most types of data and is a great new feature in the BI arsenal. First, let me show you how to import files from Hadoop, and then I’ll show you a problem I encountered that could potentially limit the usefulness of Power Query as a connector to Hadoop.

Importing Files from Hadoop: The Good

The process to import files is as follows:

  1. On the Power Query tab, click “From Other Sources > From Hadoop File (HDFS)”

  2. Enter the name of the Hadoop server; in this case, it’s the IP address of the name node

  3. We are provided with a list of files, which show up as Binaries

What’s nice about this is that clicking the word “Binary” next to one of the files automatically loads the content of that file into the preview pane (with the option to add it to our Excel worksheet), where it can be loaded into a data model.
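
For comparison, the same file content can be pulled outside of Power Query with a plain WebHDFS call, which is a handy way to verify what a given “Binary” actually contains. A sketch only: the name node address, the default WebHDFS port of 50070, and the file path are assumptions that will vary by cluster.

import requests

NAMENODE = "http://10.10.50.111:50070"   # assumed name node / WebHDFS endpoint
FILE_PATH = "/data/weblogs/part-00000"   # hypothetical file from the listing

# WebHDFS OPEN: the name node replies with a redirect to a data node,
# which requests follows automatically.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{FILE_PATH}", params={"op": "OPEN"})
resp.raise_for_status()
print(resp.text[:500])                   # first few hundred characters of the file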

Encountering the Preview Pane Fail

The ability to see a list of files in Hadoop in the preview pane is really nice, especially when we can easily take the next step to load a selected file into a data model. However, I’m running into a truncation issue in the preview pane.

I can filter the list based on criteria like date or file extension, but I’m warned that the list may be incomplete.

I “Sent a Frown” to the Power Query team about this behavior. A Hadoop cluster can easily have thousands of files, if not many more, so limiting the content (and not loading the full list into the filters) is problematic. To their credit, they quickly responded that the correct behavior was to click the “Done” button in the Query Editor preview pane and load the data into an Excel worksheet. So I did that.

All of my files were loaded correctly. However, you’ll notice that the file type (where it says [Binary]) is no longer hypertext, so I can’t load a file directly into my data model from here. I think the best approach would be for the links from the preview pane to carry through into the worksheet. I understand the limitations of the preview pane, and if I could just load the hot links into the worksheet, this capability would be fantastic.
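
As a sanity check when the preview truncates, the full directory listing can also be pulled straight from WebHDFS and counted or filtered there. Same caveats as the earlier sketch: the endpoint and folder below are placeholders.

import requests

NAMENODE = "http://10.10.50.111:50070"   # assumed name node / WebHDFS endpoint
FOLDER = "/data/weblogs"                 # hypothetical folder

resp = requests.get(f"{NAMENODE}/webhdfs/v1{FOLDER}", params={"op": "LISTSTATUS"})
resp.raise_for_status()

files = resp.json()["FileStatuses"]["FileStatus"]
print(f"{len(files)} entries in {FOLDER}")

# The same kind of filter the preview pane applies, but over the full listing.
for f in files:
    if f["pathSuffix"].endswith(".csv"):
        print(f["pathSuffix"], f["length"])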

Update

Thanks to my new friend Ehren at Microsoft, I worked through some of these issues. The biggest one is using text filters to narrow down the listing. I was trying to filter by typing directly into the text box above the pick list.

Typing in that text box will only reference data that is currently in the pick list. Instead, I needed to click Text Filters > Begins With…

From there I can input my HDFS URL with a folder location.

This returns a more limited set of rows:
