Hadoop on Azure?

Here’s something interesting I’ve just seen on James Dixon’s blog: apparently Microsoft is preparing to provide Hadoop on Windows Azure. Here’s the article James links to:

http://www.sdtimes.com/link/34319

I wonder if this is just a stopgap, in response to customer demand, as the article suggests it might be? How does this fit with the recently-announced Technical Computing Initiative? Was Project Dryad a dead end?

Time Intelligence Without Date Columns in DAX

One of the things that annoys me a little bit in DAX is the way all the built-in time intelligence functionality relies so much on having columns of type Date. What happens if you don’t have a Date column to use? Many Time dimension tables that I see in data warehouses don’t: they use integers to represent dates instead, often in the form YYYYMMDD (e.g. 20100525 would represent the 25th of May 2010). And what happens if you need to do something time-intelligence-like on a non-time dimension? For example you might have a dimension representing different budgets, and you might want to show the difference between the current budget and the previous budget. There are no equivalents of MDX’s .PrevMember or .NextMember functions, or of any other hierarchical navigation functions, because there are no hierarchies in PowerPivot.
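As an aside, if your model does allow you to add columns then a proper date can always be derived from a YYYYMMDD key with a bit of arithmetic. Here’s a minimal sketch of the kind of calculated column you could add to a date table to do this, using the DimDate[DateKey] column from Adventure Works DW that I’ll be referring to below:

=DATE(
INT(DimDate[DateKey]/10000)
, INT(MOD(DimDate[DateKey], 10000)/100)
, MOD(DimDate[DateKey], 100))

For 20100525 this returns the 25th of May 2010. The rest of this post, though, is about what you can do when adding a column like that isn’t an option.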

Anyway, after some thought (and a lot of help from Tomislav and Jeffrey Wang from the dev team) I’ve found out it is possible to create time-intelligence calculations without dates. Let’s take a simple PowerPivot model based on Adventure Works DW, with the DimDate table joined to FactInternetSales on the OrderDateKey column. Now since there’s a column of type Date on DimDate, FullDateAlternateKey, we can use the time intelligence functionality to create a calculation that returns the previous year’s Sales Amount as follows:

=CALCULATE(SUM(FactInternetSales[SalesAmount])
    , DATEADD(DimDate[FullDateAlternateKey], -1, YEAR)
    , ALL(DimDate))

Here’s what a query using this calculation looks like:

[screenshot]

The above calculation works by returning the sum of SalesAmount for all the dates that are one year before the dates in the current context. So if you have the year 2003 on rows in your query, then for that cell the filter context would contain all of the dates in the year 2003; these would then be shifted back one year and you’d get the sum of SalesAmount for all the dates in the year 2002. Incidentally, this explains why in Adventure Works you get strange values for the year 2004: it’s because in DimDate you only have dates from January 1st 2004 to August 31st 2004, so the calculation only returns the sum of SalesAmount for January 1st 2003 to August 31st 2003.

Now, what would happen if we didn’t have the FullDateAlternateKey column and we wanted to do the same calculation? What we need to do is recreate this algorithm and shift our dates back one year without using the DateAdd function. The way to do it relies on the format of the keys used in the DimDate[DateKey] column, the YYYYMMDD format I mentioned before. If we have a date key like 20100525 and we want to find the equivalent date in the previous year, all we need to do is subtract 10000 from that key value (giving 20090525, the 25th of May 2009). In our expression, then, what we need to do is take the set of dates in the current filter context and, for each of those dates, return the date whose key value is 10000 less, then sum up SalesAmount for those dates. Here’s the expression:

=CALCULATE(SUM(FactInternetSales[SalesAmount])
, FILTER(ALL(DimDate[DateKey])
, COUNTROWS(FILTER(VALUES(DimDate[DateKey]), DimDate[DateKey]-10000=EARLIER(DimDate[DateKey])))>0)
, ALL(DimDate))

Notice the use of two nested Filter statements here, and the way the Earlier function is used to retrieve the value of DimDate[DateKey] from the outer Filter’s row context. The outer Filter iterates over the set of all values of DimDate[DateKey], regardless of year or any other selection made, thanks to the All function; for each of those keys the inner Filter checks whether any of the currently selected keys (returned by Values) is exactly 10000 greater, so the outer Filter ends up returning only the keys that are 10000 less than the dates that have currently been selected.
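Incidentally, the same pattern should also work for the ‘previous budget’ scenario I mentioned at the start of this post. Here’s a rough, untested sketch, assuming a hypothetical FactBudget table joined to a hypothetical DimBudget table on an integer DimBudget[BudgetKey] column, where consecutive budgets have consecutive key values:

=CALCULATE(SUM(FactBudget[Amount])
, FILTER(ALL(DimBudget[BudgetKey])
, COUNTROWS(FILTER(VALUES(DimBudget[BudgetKey]), DimBudget[BudgetKey]-1=EARLIER(DimBudget[BudgetKey])))>0)
, ALL(DimBudget))

It’s exactly the same trick with a different offset: subtracting 1 from the key gets you the previous budget in the same way that subtracting 10000 from a YYYYMMDD key gets you the same date in the previous year.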

It works in almost the same way as the original calculation, although I’ve noticed a few differences. First of all, the grand total for the new calculation is the sum of the values for the individual years, whereas the grand total for the original version is not:

[screenshot]

Secondly, on February 29th in a leap year, the DateAdd function returns February 28th in the previous year and our new approach (unsurprisingly) does not:

[screenshot]

It’s probably a matter of taste what should happen here; either a null or the value from February 28th in the previous year would make sense to me. Note, though, that as shown in the last but one screenshot, the year totals for 2004 for both calculations are identical, so the value for February 28th is not counted twice.

Thirdly, with dates on rows and measures on columns, when you add English Month Name to a slicer and choose a single month, the new expression still works fine but the original expression returns the following error:

ERROR – CALCULATION ABORTED: MdxScript(Sandbox) (22, 4) Function ‘DATEADD’ only works with contiguous date selections.

[screenshot]

This particular error will be the subject of my next blog post, but in this case I think the DAX time intelligence functions are behaving a bit rubbishly when they should be returning results.

Google Prediction API

Here’s something interesting for anyone into cloud-based data mining: Google have announced a new prediction API. More details here:

http://code.google.com/apis/predict/
http://code.google.com/apis/predict/docs/getting-started.html

Here’s the summary from the site:

The Prediction API enables access to Google’s machine learning algorithms to analyze your historic data and predict likely future outcomes. Upload your data to Google Storage for Developers, then use the Prediction API to make real-time decisions in your applications. The Prediction API implements supervised learning algorithms as a RESTful web service to let you leverage patterns in your data, providing more relevant information to your users. Run your predictions on Google’s infrastructure and scale effortlessly as your data grows in size and complexity.

Looks exciting! Now, why didn’t Microsoft do this? Clearly the ideas were there (remember the cloud version of the Excel DM addin?), but it seems like all the key members of the SQL data mining team had to leave MS to pursue their dreams: http://predixionsoftware.com/ 

Microsoft Technical Computing Initiative

An interesting announcement here from Microsoft about its new Technical Computing Initiative:

http://blogs.technet.com/microsoft_blog/archive/2010/05/17/modeling-the-world.aspx

Lots of the usual PR-speak and vagueness, but from the post above here are the main points:

In terms of technology, the initiative will focus on three key areas:

  1. Technical computing to the cloud: Microsoft will help lead the way in giving scientists, engineers and analysts the computing power of the cloud.  We’re also working to give existing high-performance computing users the ability to augment their on-premises systems with cloud resources that enable ‘just-in-time’ processing. This platform will help ensure processing resources are available whenever they are needed—reliably, consistently and quickly. 
  2. Simplify parallel development: Today, computers are shipping with more processing power than ever, including multiple cores. But most modern software only uses a small amount of the available processing power. Parallel programs are extremely difficult to write, test, and troubleshoot.  We know that a consistent model for parallel programming can help more developers unlock the tremendous power in today’s computers and enable a new generation of technical computing. We’re focused on delivering new tools to automate and simplify writing software through parallel processing from the desktop… to the cluster… to the cloud.    
  3. Develop powerful new technical computing tools and applications: Scientists, engineers and analysts are pushing common tools (i.e., spreadsheets and databases) to the limits with complex, data-intensive models. They need easy access to more computing power using simpler tools to increase the speed of their work, and we’re building a platform with this objective in mind. We expect that these efforts will yield new, easy-to-use tools and applications that automate data acquisition, modeling, simulation, visualization, workflow and collaboration.

And from this article on the Wall Street Journal, here’s a practical example of what will be delivered:

Muglia offers an example of how Microsoft plans to make high-performance computing more accessible: Today many financial services firms use the company’s Excel spreadsheet application to develop financial models, but if the firms need the power of a supercomputer to crunch numbers, they often have to write specialized applications in programming languages like Fortran that a much smaller group of users are fluent in.

Microsoft’s Technical Computing group is working on software that will allow a program like Excel to run in parallel on thousands of machines so the application can be used to tackle monster financial computing chores on its own, Muglia says.

It’s been a while since there was any wild speculation on this blog but I can’t resist it – all this talk of running Excel in parallel on multiple machines and the cloud makes me wonder if this is going to work with PowerPivot too? Or rather, will this work with whatever PowerPivot/Vertipaq becomes when it grows up into a corporate BI tool?

PowerPivot Samples Available

There are some new PowerPivot samples available for download – sample data in workbooks, plus DAX calculation examples. You can get them here:

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=eac83429-c6e5-48a6-87cf-00a4141e5441

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=1ae63bfb-c303-44e3-ae44-7413d499495d

Microsoft BI Indexing Connector

Just seen this on the SharePoint BI blog, the Microsoft BI Indexing Connector:

http://blogs.msdn.com/sharepointbi/archive/2010/05/14/announcing-the-microsoft-business-intelligence-indexing-connector.aspx

From the blog post:

With this new Indexing connector, users have a dedicated Report tab where they can find the reports they are looking for, use refiners to further narrow their searches, and even get a preview of the report before opening it in the browser or client…

…In addition to discovering the documents, the MSBIIC will also crawl the data sources revealing items that are not necessary in the report itself but critical to the user’s discovery and as part of the refiners.

What happens when you do a Process Update on a dimension?

Over the last few days I’ve been involved in an interesting thread on the SSAS forum regarding what happens when you do a Process Update on a dimension. It’s a topic that is not widely understood, and indeed I didn’t know all the details myself until today, but it’s nonetheless very important: one of the commonest performance-related problems I see in my consultancy work is partitions that have aggregations designed for them, but where those aggregations aren’t in a processed state because a Process Update has been run on one or more dimensions. Anyway, just now Akshai Mirchandani from the dev team posted a really good overview on that thread of what actually happens when you run a Process Update, so I thought I’d copy it here to ensure it gets a wider audience:

Here is a quick summary of what happens when you do ProcessUpdate:

1. After the dimension has been updated, the server analyzes the changes that occurred to the dimension. In 2005, this analysis was pretty simple and would often incorrectly detect that major changes had occurred that required clearing of indexes and aggregations. In 2008, this code was improved such that it more often would realize that nothing significant has occurred. It’s a fairly small (but useful) optimization — I guess nobody thought it was worth documenting!

2. Based on this analysis, the server will decide whether or not indexes and aggregations need to be cleared. If no (e.g. because records were only added and not deleted/updated), then the partitions won’t be affected.

3. If indexes/aggregations need to be cleared, then the server will check if ProcessAffectedObjects was enabled — if yes, then instead of clearing the indexes/aggregations it will rebuild the indexes/aggregations.

4. The act of clearing the indexes/aggregations also shows up as "partition processing operations" in Profiler — that’s one of the things that has been confusing some of you.

5. When aggregations are cleared, only the flexible aggregations need to be cleared because we’re guaranteed by the rigid relationships that the members cannot have moved and therefore the rollups cannot have changed. However, indexes can still have changed and therefore you may still see the partition processing jobs kick off to clear the indexes.

6. ProcessIndexes and ProcessClearIndexes take care of building both bitmap indexes (aka map) and aggregations — the context is that both aggregations and bitmap indexes are generically considered "indexes".

Really the main takeaway here is that if you ProcessUpdate a dimension, you should strongly consider either doing ProcessAffectedObjects or an explicit ProcessIndexes on the affected partitions so that bitmap indexes and flexible aggregations get rebuilt. The advantage of explicitly doing ProcessIndexes is that you can bring your cube online earlier and have the indexes/aggregations get processed more lazily in the background — a number of customers prefer to do that because their processing windows are too small to wait for the indexes to get processed.

Related to this topic, I also wanted to highlight a great post by Darren Gosbell where he shows how to find out whether your aggregations are processed or not:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx

Tuning SSRS-Generated MDX Parameter Queries

Sometimes you’ll find yourself in the position of building SSRS reports where you have parameters with a large number of available values. Using the Adventure Works cube as an example, if you were to drag the Customer attribute from the Customer dimension onto the filter area of the Query Designer for a simple query and check the Parameters box like so:

[screenshot]

…you’d end up with a parameter where you can choose any customer to filter on – and there are 18485 customers on that hierarchy.

If you right-click on your data source in the Report Data pane (in BIDS in SSAS 2008) and check the Show Hidden Datasets option, you can see the MDX query that BIDS generates to return the list of available values for the parameter query:

[screenshot]

Here’s what the query will look like for the Customers hierarchy for the Customer dimension:

WITH
MEMBER [Measures].[ParameterCaption]
AS [Customer].[Customer].CURRENTMEMBER.MEMBER_CAPTION
MEMBER [Measures].[ParameterValue]
AS [Customer].[Customer].CURRENTMEMBER.UNIQUENAME
MEMBER [Measures].[ParameterLevel]
AS [Customer].[Customer].CURRENTMEMBER.LEVEL.ORDINAL
SELECT
{[Measures].[ParameterCaption]
, [Measures].[ParameterValue]
, [Measures].[ParameterLevel]}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
ON ROWS
FROM [Adventure Works]

On my laptop this query executes in just over 1.5 seconds. Not bad, you might think, for a query that returns a fairly large number of rows. But we can do better!

This query returns all customers on rows and three columns: the caption of each member, its unique name, and the ordinal of the level it is on (which is used to indent the caption of each member in the dropdown list for the parameter, so you can easily distinguish between members on different levels). These values are returned as calculated members, but they can also be obtained as member properties, and this is the key to tuning the query. So, if you create a new OLEDB connection to the cube (i.e. you don’t use the built-in Analysis Services connection type, but instead create a data source that connects to the cube using the OLEDB connection type)…

[screenshot]

…and then create a new dataset with the following query:

WITH
MEMBER MEASURES.DUMMY AS NULL
SELECT
{MEASURES.DUMMY}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
DIMENSION PROPERTIES UNIQUE_NAME, MEMBER_CAPTION, LEVEL_NUMBER
ON ROWS 
FROM [Adventure Works]
CELL PROPERTIES VALUE

…you’ll find you get almost exactly the same data back, although this time the query returns in around 0.2 seconds.

A few things need to be noted here. First of all, although MDX allows you to put an empty set on columns, in the query above I had to create a dummy calculated measure that returned null because otherwise the query didn’t return any rows from the OLEDB data source. Secondly, in the new query the All Member unique name and caption come out as nulls – that’s normal behaviour for flattened rowsets (which is what you get when you run queries through an OLEDB connection), unfortunately, and again something we’re going to have to work around ourselves. Thirdly, we also need to create a column with indented member names – the original parameter dataset did this using an SSRS expression in a calculated field – although in this case, where there’s only one level underneath the all member, we could probably skip this and not hurt usability.

To trap the nulls and make sure the All Customers member appears as a parameter option, you can use a calculated field on the new dataset with an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "[Customer].[Customer].[All Customers]"
, Fields!Customer_Customer_Customer_UNIQUE_NAME.Value)

And to generate the indented captions you can use an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "All Customers"
, " " + Fields!Customer_Customer_Customer.Value)

You then need to delete the original parameter dataset, point the report parameter to the new dataset and bind these two columns to its Value and Label fields. And lo and behold, you have a report that runs just over a second faster than it did before. This might seem like a lot of hassle to go through for such a small gain, but if you have more than one large parameter the time savings will add up and your users will notice the difference.

BI User Group, London, May 20th

Just a quick post to say I’ll be hosting and speaking at a BI user group event in London on May 20th. My session will be on “Implementing Common Business Calculations in DAX” and there’ll be one other session from a speaker still to be confirmed. More details and registration here:

http://sqlserverfaq.com/events/223/Business-Intelligence-Implementing-common-business-calcs-using-DAX-in-PowerPivot-Chris-Webb.aspx

Counting Returning Customers in DAX

As promised, I’m going to be putting up a few more DAX examples here – although they may not demonstrate any new and interesting concepts, I’m still learning the language and blogging is a good way for me to practise implementing common calculations.

So, today’s problem is: how do we find the number of distinct customers who bought something in the current time period and who have also bought something in the past? That’s to say we want to count the number of returning customers, as opposed to completely new customers who have never bought anything from us before. This is, basically, a variation on the problem of how to get a distinct count in DAX that Marco has already dealt with comprehensively here, but with aspects of a time intelligence calculation (a topic which is well covered here). I’ll be using two tables from Adventure Works to illustrate this: FactInternetSales and DimDate.

What we need to start off with is to find the set of distinct customers who bought something in the current time period. The following simple DAX expression finds this set and gives me the distinct count:

=COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))

We now need to filter these customers so that we only return the ones who bought something in the range of dates from the very first date we have data for, up to the day before the first date in the current time range. This can be accomplished with the DatesBetween, FirstDate and DateAdd functions. With the DatesBetween function, if you pass a Blank value as the start date it will use the very first date you have data for as the start of the range (see here for another example of this); for the end date in the range, we find the first date in the current date range with FirstDate, then get the day before using DateAdd:

DATESBETWEEN(
DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)
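If you want to see what this table expression actually returns for any given cell, one quick and dirty way is to wrap it in CountRows and use that as a measure on its own; here’s a sketch:

=COUNTROWS(
DATESBETWEEN(
DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)))

For each period you should see the number of dates before the start of that period; keep an eye on the value for the very first date, though, because it hints at a problem I’ll come back to in a moment.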

We then need to use this date range inside the Filter function as follows:

=CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0))

However, there’s one last problem to solve: on the first date in the time dimension, using the DateAdd function in this way will return a Blank value, and using Blank as both a start and an end date will result in us getting all the dates in the column back from the DatesBetween function. This means that for this first date we get the same value as the distinct count, because we are finding the set of all customers who bought something on that first date and applying a filter to see whether these customers bought something on any date, which of course they did:

[screenshot]

So we need to use the IF function to check whether using DateAdd to get the previous date in this way returns a Blank and, if it does, return a Blank ourselves. Here’s the final expression:

=IF(
DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)=BLANK()
, BLANK()
,CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0)))

[screenshot]

One last point: even though I’ve written this expression on the RC version of PowerPivot, I still need to use ALL(DimDate) on the last line because I am joining FactInternetSales and DimDate using the OrderDateKey column, which is an integer surrogate key. This wouldn’t be necessary if I was joining the two tables using a datetime column. Personally I think this is going to confuse no end of people and is a bit rubbish – especially because, in many data warehouses, the only way to join a fact table and a time dimension table is using an integer surrogate key; even worse, if you try to bring a datetime column from a time dimension table down onto the fact table using a calculated column, you’ll get a circular reference error. Hopefully this will get fixed in the next release…
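For anyone wondering what I mean by bringing a datetime column down onto the fact table, it’s a calculated column on FactInternetSales along these lines (a simple Related lookup, shown here just as a sketch):

=RELATED(DimDate[FullDateAlternateKey])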

UPDATE: I’ve found a better way of doing this in DAX. Instead of using the Filter function, which can be a bit slow, you can just use the filter arguments of the Calculate function itself. Here’s the new version:

=IF(
CALCULATE(COUNTROWS(), DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))=0
, BLANK()
, CALCULATE(
COUNTROWS(DISTINCT(vwFactInternetSalesWithDate[CustomerKey]))
,DISTINCT(vwFactInternetSalesWithDate[CustomerKey])
, DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)))

Here, the two filter arguments I’m using are:

  1. DISTINCT(vwFactInternetSalesWithDate[CustomerKey]), to filter by the customers who bought something in the current time period, and
  2. DATESBETWEEN(DimDate[FullDateAlternateKey], BLANK(),  DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)), to filter by the set of all dates up to the day before the first date in the current period

ANDing these two arguments together gives us a filter context that contains only the customers who bought something in the current time period and only the dates up to the day before the current time period. It’s a little harder to understand what’s going on here, but it’s a lot more elegant, I think.
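One last thought: if you also want to count completely new customers in each period, and you’ve saved the expression above as a measure (let’s call it Returning Customers, a name I’ve just made up), you should be able to get there by subtracting it from the overall distinct count. Something like this:

=COUNTROWS(DISTINCT(FactInternetSales[CustomerKey])) - [Returning Customers]

On the very first date, where Returning Customers returns Blank, this just gives you the full distinct count, which is what you’d expect since at that point every customer is a new customer.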