Hadoop on Azure?

Here’s something interesting I’ve just seen on James Dixon’s blog: apparently Microsoft is preparing to provide Hadoop on Windows Azure. Here’s the article James links to:


I wonder if this is just a stopgap, in response to customer demand, as the article suggests it might be? How does this fit with the recently-announced Technical Computing Initiative? Was Project Dryad a dead end?

Time Intelligence Without Date Columns in DAX

One of the things that annoys me a little bit in DAX is the way all the built-in time intelligence functionality relies so much on having columns of type Date. What happens if you don’t have a Date column to use? Many Time dimension tables that I see in data warehouses don’t: they use integers to represent dates instead, often in the form YYYYMMDD (eg so 20100525 would represent the 25th of May 2010). And what happens if you need to something time-intelligence-like on a non time dimension? For example you might have a dimension representing different budgets, and you might want to show the difference between the current budget and the previous budget. There are no equivalents of the .PrevMember or .NextMember, or any hierarchical functions, because there are no hierarchies in PowerPivot.

Anyway, after some thought (and a lot of help from Tomislav and Jeffrey Wang from the dev team) I’ve found out it is possible to create time-intelligence calculations without dates. Let’s take a simple PowerPivot model based on Adventure Works DW, with the DimDate table joined to FactInternetSales on the OrderDateKey column. Now since there’s a column of type Date on DimDate, FullDateAlternateKey, we can use the time intelligence functionality to create a calculation that returns the previous year’s Sales Amount as follows:

    , DATEADD(DimDate[FullDateAlternateKey], -1, YEAR)
    , ALL(DimDate))

Here’s what a query using this calculation looks like:


The above calculation works by returning the sum of SalesAmount for all the dates that are one year before the dates in the current context. So if you have the year 2003 on rows in your query, the for that cell the row context would contain all of the dates in the year 2003; these then would be shifted back one year and you’d get the sum of SalesAmount for all the dates in the year 2002. Incidentally, this explains why in Adventure Works you get strange values for the year 2004: it’s because in DimDate you only have dates from January 1st 2004 to August 31st 2004, so the calculation only returns the sum of SalesAmount for January 1st 2003 to August 31st 2003.

Now, what would happen if we didn’t have the FullDateAlternateKey column and we wanted to do the same calculation? What we need to do is recreate this algorithm and shift our dates back one year without using the DateAdd function. The way to do it relies on the format of the keys used in the DimDate[DateKey] column, the YYYYMMDD format I mentioned before. If we have a date key like 20100525 and we want to find the equivalent date in the previous year, all we need to do is subtract 10000 from that key value. In our expression, then, what we need to do is to get the set of dates in the current row context, and for each of those dates return the date that has the key value that is 10000 less than the current date, then sum up SalesAmount for those dates. Here’s the expression:

, FILTER(VALUES(DimDate[DateKey]), DimDate[DateKey]-10000=EARLIER(DimDate[DateKey])))
, ALL(DimDate))

Notice the use of two nested Filter statements here, and the way that the Earlier function is used to retrieve the value of DimDate[DateKey] from the outer row context. The outer Filter gets the set of all values from DimDate[DateKey], regardless of year or any other selection made, using the All function; the inner Filter filters this set so it only contains the keys that are 10000 less than the dates that have currently been selected.

It works in almost the same way as the original calculation, although I’ve noticed a few differences. First of all, the grand total for the new calculation displays the total of the values for each year, in a way that the original version does not:


Secondly, on February 29th in a leap year, the DateAdd function returns February 28th in the previous year and our new approach (unsurprisingly) does not:


It’s probably a matter of taste what should happen here; either a null or the value from February 28th in the previous year make sense to me. Note, though, that as shown in the last but one screenshot the year totals for 2004 for both calculations are identical, so the value for February 28th is not counted twice.

Thirdly, keeping dates on rows and measures on columns, when you add English Month Name to a slicer and choose one month, the new expression works ok but you get the following error from the original expression:

ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (22, 4) Function ‘DATEADD’ only works with contiguous date selections.


This particular error will be the subject of my next blog post, but in this case I think the DAX time intelligence functions are behaving a bit rubbishly when they should be returning results.

Google Prediction API

Here’s something interesting for anyone into cloud-based data mining: Google have announced a new prediction API. More details here:


Here’s the summary from the site:

The Prediction API enables access to Google’s machine learning algorithms to analyze your historic data and predict likely future outcomes. Upload your data to Google Storage for Developers, then use the Prediction API to make real-time decisions in your applications. The Prediction API implements supervised learning algorithms as a RESTful web service to let you leverage patterns in your data, providing more relevant information to your users. Run your predictions on Google’s infrastructure and scale effortlessly as your data grows in size and complexity.

Looks exciting! Now, why didn’t Microsoft do this? Clearly the ideas were there (remember the cloud version of the Excel DM addin?), but it seems like all the key members of the SQL data mining team had to leave MS to pursue their dreams: http://predixionsoftware.com/ 

Microsoft Technical Computing Initiative

An interesting announcement here from Microsoft about its new Technical Computing Initiative:


Lots of the usual PR-speak and vagueness, but from the post above here are the main points:

In terms of technology, the initiative will focus on three key areas:

  1. Technical computing to the cloud: Microsoft will help lead the way in giving scientists, engineers and analysts the computing power of the cloud.  We’re also working to give existing high-performance computing users the ability to augment their on-premises systems with cloud resources that enable ‘just-in-time’ processing. This platform will help ensure processing resources are available whenever they are needed—reliably, consistently and quickly. 
  2. Simplify parallel development: Today, computers are shipping with more processing power than ever, including multiple cores. But most modern software only uses a small amount of the available processing power. Parallel programs are extremely difficult to write, test, and troubleshoot.  We know that a consistent model for parallel programming can help more developers unlock the tremendous power in today’s computers and enable a new generation of technical computing. We’re focused on delivering new tools to automate and simplify writing software through parallel processing from the desktop… to the cluster… to the cloud.    
  3. Develop powerful new technical computing tools and applications: Scientists, engineers and analysts are pushing common tools (i.e., spreadsheets and databases) to the limits with complex, data-intensive models. They need easy access to more computing power using simpler tools to increase the speed of their work, and we’re building a platform with this objective in mind. We expect that these efforts will yield new, easy-to-use tools and applications that automate data acquisition, modeling, simulation, visualization, workflow and collaboration.

And from this article on the Wall Street Journal, here’s a practical example of what will be delivered:

Muglia offers an example of how Microsoft plans to make high-performance computing more accessible: Today many financial services firms use the company’s Excel spreadsheet application to develop financial models, but if the firms need the power of a supercomputer to crunch numbers, they often have to write specialized applications in programming languages like Fortran that a much smaller group of users are fluent in.

Microsoft’s Technical Computing group is working on software that will allow a program like Excel to run in parallel on thousands of machines so the application can be used to tackle monster financial computing chores on its own, Muglia says.

It’s been a while since there was any wild speculation on this blog but I can’t resist it – all this talk of running Excel in parallel on multiple machines and the cloud makes me wonder if this is going to work with PowerPivot too? Or rather, will this work with whatever PowerPivot/Vertipaq becomes when it grows up into a corporate BI tool?

Microsoft BI Indexing Connector

Just seen this on the Sharepoint BI blog, the Microsoft BI Indexing Connector:


From the blog post:

With this new Indexing connector, users have a dedicated Report tab where they can find the reports they are looking for, use refiners to further narrow their searches, and even get a preview of the report before opening it in the browser or client…

…In addition to discovering the documents, the MSBIIC will also crawl the data sources revealing items that are not necessary in the report itself but critical to the user’s discovery and as part of the refiners.

What happens when you do a Process Update on a dimension?

Over the last few days I’ve been involved in an interesting thread on the SSAS forum regarding what happens when you do a Process Update on a dimension. It’s a topic that is not widely understood, and indeed I’ve not known all the details until today, but it’s nonetheless very important: one of the commonest performance-related problems I see in my consultancy work is partitions that have aggregations designed for them, but where those aggregations aren’t in a processed state because a Process Update has been run on one or more dimensions. Anyway, just now Akshai Mirchandani from the dev team posted a really good overview of what actually happens when you run a Process Update on that thread, so I thought I’d copy here to ensure it gets a wider audience:

Here is a quick summary of what happens when you do ProcessUpdate:

1. After the dimension has been updated, the server analyzes the changes that occurred to the dimension. In 2005, this analysis was pretty simple and would often incorrectly detect that major changes had occurred that required clearing of indexes and aggregations. In 2008, this code was improved such that it more often would realize that nothing significant has occurred. It’s a fairly small (but useful) optimization — I guess nobody thought it was worth documenting!

2. Based on this analysis, the server will decide whether or not indexes and aggregations need to be cleared. If no (e.g. because records were only added and not deleted/updated), then the partitions won’t be affected.

3. If indexes/aggregations need to be cleared, then the server will check if ProcessAffectedObjects was enabled — if yes, then instead of clearing the indexes/aggregations it will rebuild the indexes/aggregations.

4. The act of clearing the indexes/aggregations also shows up as "partition processing operations" in Profiler — that’s one of the things that has been confusing some of you.

5. When aggregations are cleared, only the flexible aggregations need to be cleared because we’re guaranteed by the rigid relationships that the members cannot have moved and therefore the rollups cannot have changed. However, indexes can still have changed and therefore you may still see the partition processing jobs kick off to clear the indexes.

6. ProcessIndexes and ProcessClearIndexes take care of building both bitmap indexes (aka map) and aggregations — the context is that both aggregations and bitmap indexes are generically considered "indexes".

Really the main takeaway here is that if you ProcessUpdate a dimension, you should strongly consider either doing ProcessAffectedObjects or an explicit ProcessIndexes on the affected partitions so that bitmap indexes and flexible aggregations get rebuilt. The advantage of explicitly doing ProcessIndexes is that you can bring your cube online earlier and have the indexes/aggregations get processed more lazily in the background — a number of customers prefer to do that because their processing windows are too small to wait for the indexes to get processed.

Also related to this topic, I thought I’d also highlight a great post by Darren Gosbell where he shows how to find out if your aggregations are processed or not:


%d bloggers like this: