Power Query and Hadoop
August 15, 2013 Leave a comment
So I’ve been doing some testing with Power Query, both for my job and for some upcoming presentations I’m doing with Stacia Misner (b|t) and in general I like the tool. However, particularly in the Hadoop space I have some issues that I hope are addressed before the tool officially goes live. Stacia and I gave a similar presentation at the PASS Business Analytics conference in Chicago in April, and at the time we were using the Hive plugin for Excel. If you aren’t familiar with Hive, it’s a part of the Hadoop ecosystem that allows the user to create a meta-database on top of an existing file or set of files that live in the Hadoop Distributed File System (HDFS). Hive supports a language call HiveQL (also known as HQL), which is very similar to SQL. The nice thing about Hive (and the plugin), particularly for data files that don’t have a header row, is that it allows us to define a structure on the files and to bring in the column names into our BI tools. The column name feature is great and a limitation (to an extent) of Power Query—but that’s a different post.
So, enter Power Query—which can connect to most types of data and is a great new feature in the BI arsenal. First, let me show you how to import files from Hadoop, and then I’ll show you a problem I encountered that could potentially limit the usefulness of Power Query as a connector to Hadoop.
Importing Files from Hadoop: The Good
The process to import files is as follows:
- On the Power Query tab, click the “From Other Sources > From Hadoop File (HDFS)
- Enter the name of the Hadoop server—in this case, it’s the IP Address of name node
- We are provided with a list of files, that show up as Binaries
What’s nice about this, clicking on the word “Binary” next to one of the files automatically loads the content of that file into the preview pane (with the option to add it to our Excel worksheet), where it can be loaded into a data model.
Encountering the Preview Pane Fail
The ability to see a list of files in Hadoop in the preview pane is really nice, especially when we can easily take the next step to load a selected file into a data model. However, I’m running into a truncation issue in the preview pane.
I can filter the list, based on criteria like date or file extension, but I’m warned that the list may be incomplete.
I “Sent a Frown” to the Power Query team about this behavior—a Hadoop Cluster can easily have 1000s of files, if not many more, therefore limiting content (and at least not loading the full content into the filters) is problematic. To their credit, they quickly responded that the correct user behavior was to click the “Done” button in the Query Editor preview pane and load the data into an Excel worksheet. So I did that—
All of my files were loaded correctly. However, you’ll notice that the file type (where it says [Binary]) is no longer hypertext—I can’t load it directly into my data model from here. I think the best approach would be if the links from the preview pane simply went into the worksheet. I understand the limitations of the preview pane, and if I could just load the hot links into the worksheet, this capability would be fantastic.
Thanks to my new friend Ehren at Microsoft I worked through some of these issues. The biggest one—is using text filters to narrow down the listing. I was trying to filter this way:
Typing in that text box will only reference data that is currently in the pick list. Instead, I needed to click Text Filters > Begins With…
From there I can input my HDFS URL with a folder location.
This returns a more limited set of rows: