Chris Webb's BI Blog: MDX on Cloudera Impala

You may have seen today’s announcement (and comment from Mary-Jo Foley) on the official SQL Server blog about HortonWorks Data Platform for Windows; I won’t repeat what it says, but it’s clear Microsoft is taking Hadoop and Big Data (yuk, I hate that term, but sometimes I can’t avoid using it…) very seriously. However, yesterday I saw another, very interesting announcement from the guys at Simba that presents an alternative vision for Microsoft BI and Big Data. You can see a demo of it here:

Simba MDX Provider for Impala

There are some more details here:
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/impala-user/BniN1ELYNDs

Basically, Simba have built an OLEDB for OLAP Provider that translates the MDX generated by Excel PivotTables into HiveQL queries against Cloudera Impala. You can read a good, short overview of what Impala is here:
http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
and there’s more detailed information here:
https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions

In summary, what this gives you is fast, interactive analysis from within Excel going direct against large amounts of data stored via Impala, with no need for users to have to write Hive queries or stage the data in PowerPivot in the way the current Microsoft/Hortonworks solution does. Even more interesting is the fact that Simba support MDX calculated members and not just MDX queries, so given that Excel 2013 allows you to define your own calculated members you could do some very powerful analysis this way. Well, those of us who know some MDX, at least 🙂

Impala is open source and isn’t shy about the debt it owes to Google’s Dremel, which of course is available publicly now as BigQuery. When I first saw BigQuery I thought putting an MDX interface over the top would make it appealing to a much wider audience – maybe not a good thing for Microsoft, but it would be a clever move on the part of Google certainly. Microsoft hasn’t announced that it’s working on anything comparable to BigQuery, alas; already, BI tools like Tableau and BIME can connect to and query BigQuery (Tableau also connects to Amazon’s new Redshift database too), and it’s these tools that are Excel’s big competitors in the BI client tool space. I guess SSAS 2012 Tabular in DirectQuery mode going against PDW would be the only vaguely similar Microsoft solution, but PDW is on-prem only and pretty expensive. Translating MDX to the SQL used by tools like Impala, in the way that the Simba MDX Provider does, puts Excel on a more equal footing with Tableau et al. I don’t know how Simba/Cloudera will be making this MDX Provider available but I would be surprised if they didn’t charge for it; Microsoft’s close relationship with Hortonworks, a competitor to Cloudera, makes me think that Microsoft might not want to promote this particular tool either, which is a shame. Maybe an acceptable solution for MS would be to build new cartridges for SSAS 2012 and enable DirectQuery for data sources other than SQL Server? It’s unlikely to happen, I think, but it would be an option.

UPDATE: While I was writing this post, Cathy Dumas of Simba (you were wondering what happened to her, weren’t you?) also blogged about this
http://blogs.simba.com/simba_technologies_ceo_co/2013/02/demo-microsoft-excel-pivottables-on-cloudera-impala-via-simba-mdx-provider.html

5 thoughts on “MDX on Cloudera Impala”

Hello,
Very interesting article ! Just a message to say that Bime connect also Amazon Resdshit 🙂
Benoit

Hey Chris,
Awesome article, Impala offers great improvement over standard Hive and this integration is key to bringing “big data” technology to end-user in a responsive manner.

Speaking of a SSAS interface on top of Hive, I had a similar idea a while ago and started investigating it, using Microsoft’s new HiveODBC drivers you could actually have an SSAS model (Multidimensional or Tabular) updating directly from Hive (using SQL as an intermediate interface with Linked Servers)… Klout actually implements this architecture for their Hive backed OLAP environment. Wrote a wee post about it

Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.

Chris Webb says:

February 26, 2013 at 2:29 pm

Thanks, that’s an interesting post!

Reply

Using PowerPivot and Power View in Excel to access Hive data and mash it up with other data sources is a powerful tool. See an example here: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx

Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.

Chris Webb says:

February 27, 2013 at 9:29 am

Hi Cindy,

I know it is, and I’m not denying it. However the point I’m making here is that while the current approach has the advantage of allowing you to mash-up Hive data with other data sources in PowerPivot, it has the obvious disadvantages that 1) You’ve got to import a subset, or an aggregation, of your original data into PowerPivot – so you’re no longer working with the complete dataset you started with, and 2) In order to do that import you need to know how to write the queries to get data from Hive, which some users will be fine with but many more won’t.

Given that a slew of fast, ad-hoc query engines like Impala have been announced for Hadoop recently it’s clear that the industry as a whole recognises that users need to be able to do interactive analysis on ‘Big Data’ without exporting it anywhere else; what I’m saying is that putting an MDX interface on top of these new engines means that users will continue to be able to use Excel for this kind of interactive analysis, rather than go to the likes of Tableau.

I heard a suggestion yesterday, after I wrote this post, that since Apache Drill supports pluggable query languages, a version of MDX could be developed for it. For me, that is something that Microsoft should be investing in…

Chris

Reply

Benoit says:

February 26, 2013 at 9:01 am

Hello,
Very interesting article ! Just a message to say that Bime connect also Amazon Resdshit 🙂
Benoit

ToOsIK

ToOsIK says:

February 26, 2013 at 2:25 pm

Hey Chris,
Awesome article, Impala offers great improvement over standard Hive and this integration is key to bringing “big data” technology to end-user in a responsive manner.

Speaking of a SSAS interface on top of Hive, I had a similar idea a while ago and started investigating it, using Microsoft’s new HiveODBC drivers you could actually have an SSAS model (Multidimensional or Tabular) updating directly from Hive (using SQL as an intermediate interface with Linked Servers)… Klout actually implements this architecture for their Hive backed OLAP environment. Wrote a wee post about it

1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
  
  Chris Webb says:
  
  February 26, 2013 at 2:29 pm
  
  Thanks, that’s an interesting post!
  
Cindy Gross says:

February 26, 2013 at 11:09 pm

Using PowerPivot and Power View in Excel to access Hive data and mash it up with other data sources is a powerful tool. See an example here: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx

1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
  
  Chris Webb says:
  
  February 27, 2013 at 9:29 am
  
  Hi Cindy,
  
  I know it is, and I’m not denying it. However the point I’m making here is that while the current approach has the advantage of allowing you to mash-up Hive data with other data sources in PowerPivot, it has the obvious disadvantages that 1) You’ve got to import a subset, or an aggregation, of your original data into PowerPivot – so you’re no longer working with the complete dataset you started with, and 2) In order to do that import you need to know how to write the queries to get data from Hive, which some users will be fine with but many more won’t.
  
  Given that a slew of fast, ad-hoc query engines like Impala have been announced for Hadoop recently it’s clear that the industry as a whole recognises that users need to be able to do interactive analysis on ‘Big Data’ without exporting it anywhere else; what I’m saying is that putting an MDX interface on top of these new engines means that users will continue to be able to use Excel for this kind of interactive analysis, rather than go to the likes of Tableau.
  
  I heard a suggestion yesterday, after I wrote this post, that since Apache Drill supports pluggable query languages, a version of MDX could be developed for it. For me, that is something that Microsoft should be investing in…
  
  Chris

MDX on Cloudera Impala

Published by Chris Webb

5 thoughts on “MDX on Cloudera Impala”

Leave a ReplyCancel reply

Share this:

Published by Chris Webb

5 thoughts on “MDX on Cloudera Impala”

Leave a ReplyCancel reply

Discover more from Chris Webb's BI Blog