What happens when you do a Process Update on a dimension?

Over the last few days I’ve been involved in an interesting thread on the SSAS forum about what happens when you do a Process Update on a dimension. It’s a topic that is not widely understood – indeed I didn’t know all the details myself until today – but it’s nonetheless very important: one of the commonest performance-related problems I see in my consultancy work is partitions that have aggregations designed for them, but where those aggregations aren’t in a processed state because a Process Update has been run on one or more dimensions. Anyway, Akshai Mirchandani from the dev team has just posted, on that thread, a really good overview of what actually happens when you run a Process Update, so I thought I’d copy it here to ensure it gets a wider audience:

Here is a quick summary of what happens when you do ProcessUpdate:

1. After the dimension has been updated, the server analyzes the changes that occurred to the dimension. In 2005, this analysis was pretty simple and would often incorrectly detect that major changes had occurred that required clearing of indexes and aggregations. In 2008, this code was improved so that it would more often realize that nothing significant had occurred. It’s a fairly small (but useful) optimization — I guess nobody thought it was worth documenting!

2. Based on this analysis, the server will decide whether or not indexes and aggregations need to be cleared. If no (e.g. because records were only added and not deleted/updated), then the partitions won’t be affected.

3. If indexes/aggregations need to be cleared, then the server will check if ProcessAffectedObjects was enabled — if yes, then instead of clearing the indexes/aggregations it will rebuild the indexes/aggregations.

4. The act of clearing the indexes/aggregations also shows up as "partition processing operations" in Profiler — that’s one of the things that has been confusing some of you.

5. When aggregations are cleared, only the flexible aggregations need to be cleared because we’re guaranteed by the rigid relationships that the members cannot have moved and therefore the rollups cannot have changed. However, indexes can still have changed and therefore you may still see the partition processing jobs kick off to clear the indexes.

6. ProcessIndexes and ProcessClearIndexes take care of building both bitmap indexes (aka map) and aggregations — the context is that both aggregations and bitmap indexes are generically considered "indexes".

Really the main takeaway here is that if you ProcessUpdate a dimension, you should strongly consider either doing ProcessAffectedObjects or an explicit ProcessIndexes on the affected partitions so that bitmap indexes and flexible aggregations get rebuilt. The advantage of explicitly doing ProcessIndexes is that you can bring your cube online earlier and have the indexes/aggregations get processed more lazily in the background — a number of customers prefer to do that because their processing windows are too small to wait for the indexes to get processed.
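To put that last point into practice, here’s a rough sketch of the kind of XMLA command you could run to do an explicit ProcessIndexes on a single partition after a Process Update. I’ve made up the object IDs below (they’re loosely based on Adventure Works), so you’d need to substitute the IDs from your own database – and in practice you’d probably script this from Management Studio rather than type it by hand:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Rebuild bitmap indexes and aggregations for one partition -->
  <Process>
    <Object>
      <DatabaseID>Adventure Works DW 2008</DatabaseID>
      <CubeID>Adventure Works</CubeID>
      <MeasureGroupID>Fact Internet Sales 1</MeasureGroupID>
      <PartitionID>Internet_Sales_2004</PartitionID>
    </Object>
    <Type>ProcessIndexes</Type>
  </Process>
</Batch>

Alternatively, if you want the ProcessAffectedObjects behaviour Akshai describes, I believe you can set the ProcessAffectedObjects attribute to true on the Batch element that contains your Process Update command.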

Also related to this topic, I thought I’d highlight a great post by Darren Gosbell where he shows how to find out if your aggregations are processed or not:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx
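As a rough illustration of the kind of information you can get back (this is my own sketch from memory rather than Darren’s exact technique, so treat the details with caution): in SSAS 2008 you can query the DISCOVER_PARTITION_STAT DMV from an MDX query window, substituting your own database, cube, measure group and partition names for the placeholders below, and look at the size reported for each aggregation – as I understand it, an aggregation showing a size of 0 hasn’t been built:

SELECT * FROM SYSTEMRESTRICTSCHEMA(
$system.DISCOVER_PARTITION_STAT
, DATABASE_NAME = 'Adventure Works DW 2008'
, CUBE_NAME = 'Adventure Works'
, MEASURE_GROUP_NAME = 'Internet Sales'
, PARTITION_NAME = 'Internet_Sales_2004')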

Tuning SSRS-Generated MDX Parameter Queries

Sometimes you’ll find yourself in the position of building SSRS reports where you have parameters with a large number of available values. Using the Adventure Works cube as an example, if you were to drag the Customer attribute from the Customer dimension onto the filter area of the Query Designer for a simple query and check the Parameters box like so:

[screenshot]

…you’d end up with a parameter where you can choose any customer to filter on – and there are 18485 customers on that hierarchy.

If you right-click on your data source in the Report Data pane (in BIDS in SSAS 2008) and check the Show Hidden Datasets option, you can see the MDX query that BIDS generates to return the list of available values for the parameter query:

[screenshot]

Here’s what the query will look like for the Customers hierarchy for the Customer dimension:

WITH
MEMBER [Measures].[ParameterCaption]
AS [Customer].[Customer].CURRENTMEMBER.MEMBER_CAPTION
MEMBER [Measures].[ParameterValue]
AS [Customer].[Customer].CURRENTMEMBER.UNIQUENAME
MEMBER [Measures].[ParameterLevel]
AS [Customer].[Customer].CURRENTMEMBER.LEVEL.ORDINAL
SELECT
{[Measures].[ParameterCaption]
, [Measures].[ParameterValue]
, [Measures].[ParameterLevel]}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
ON ROWS
FROM [Adventure Works]

On my laptop this query executes in just over 1.5 seconds. Not bad, you might think, for a query that returns a fairly large number of rows. But we can do better!

This query returns all customers on rows and three columns: the caption of each member, the unique name, and the ordinal of the level (which is used for indenting the caption of each member in the dropdown list for the parameter, so you can easily distinguish between members on different levels). These values are returned as calculated members, but they can also be obtained as member properties, and this is the key to tuning the query. So, if you create a new OLEDB connection to the cube (i.e. you don’t use the built-in Analysis Services connection type but you create a data source that connects to the cube using the OLEDB connection type)…

[screenshot]

…and then create a new dataset with the following query:

WITH
MEMBER MEASURES.DUMMY AS NULL
SELECT
{MEASURES.DUMMY}
ON COLUMNS
, [Customer].[Customer].ALLMEMBERS
DIMENSION PROPERTIES UNIQUE_NAME, MEMBER_CAPTION, LEVEL_NUMBER
ON ROWS 
FROM [Adventure Works]
CELL PROPERTIES VALUE

You’ll find you get almost exactly the same data back, although this time the query returns in around 0.2 seconds.

A few things need to be noted here. First of all, although MDX allows you to put an empty set on columns, in the query above I had to create a dummy calculated measure that returned null because otherwise the query didn’t return any rows from the OLEDB data source. Secondly, in the new query the All Member unique name and caption come out as nulls – that’s normal behaviour for flattened rowsets (which is what you get when you run queries through an OLEDB connection), unfortunately, and again something we’re going to have to work around ourselves. Thirdly, we also need to create a column with indented member names – the original parameter dataset did this using an SSRS expression in a calculated field – although in this case, where there’s only one level underneath the all member, we could probably skip this and not hurt usability.

To trap the nulls and make sure the All Customers member appears as a parameter option, you can use a calculated field on the new dataset with an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "[Customer].[Customer].[All Customers]"
, Fields!Customer_Customer_Customer_UNIQUE_NAME.Value)

And to generate the indented captions you can use an expression like this:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "All Customers"
, " " + Fields!Customer_Customer_Customer.Value)

You then need to delete the original parameter dataset, point the report parameter to the new dataset and bind these two columns to its value and label fields. And lo and behold, you have a report that runs just over a second faster than it did before. This might seem like a lot of hassle to go through for such a small gain, but if you have more than one large parameter the time savings will add up and your users will notice the difference.
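Incidentally, if you use this technique on a hierarchy with more than one level below the All member and you do still want the indentation, the query above already asks for the LEVEL_NUMBER property, so you can use that to indent each caption in proportion to its level depth. Something along these lines should do it – I’m showing it with the Customer field names from this example, and the exact field name is my guess at what the flattened rowset will call the column, so check what your dataset actually returns:

=iif(
Fields!Customer_Customer_Customer_UNIQUE_NAME.Value is Nothing
, "All Customers"
, Space(3 * Fields!Customer_Customer_Customer_LEVEL_NUMBER.Value)
+ Fields!Customer_Customer_Customer.Value)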

BI User Group, London, May 20th

Just a quick post to say I’ll be hosting and speaking at a BI user group event in London on May 20th. My session will be on “Implementing Common Business Calculations in DAX”, and there’ll be one other session from someone still to be confirmed. More details and registration here:

http://sqlserverfaq.com/events/223/Business-Intelligence-Implementing-common-business-calcs-using-DAX-in-PowerPivot-Chris-Webb.aspx

Counting Returning Customers in DAX

As promised, I’m going to be putting up a few more DAX examples here – although they may not demonstrate any new and interesting concepts, I’m still learning the language and blogging is a good way for me to practise implementing common calculations.

So, today’s problem is: how do we find the number of distinct customers who bought something in the current time period and who have also bought something in the past? That’s to say we want to count the number of returning customers, as opposed to completely new customers who have never bought anything from us before. This is, basically, a variation on the problem of how to get a distinct count in DAX that Marco has already dealt with comprehensively here, but with aspects of a time intelligence calculation (a topic which is well covered here). I’ll be using two tables from Adventure Works to illustrate this: FactInternetSales and DimDate.

What we need to start off with is to find the set of distinct customers who bought something in the current time period. The following simple DAX expression finds this set and gives us the distinct count:

=COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))

We now need to filter these customers so that we only return the ones who bought something in the range of dates from the very first date we have data for, up to the day before the first date in the current time range. This can be accomplished with the DatesBetween, FirstDate and DateAdd functions. With the DatesBetween function, if you pass a Blank value to the first parameter it will give you the first date you have data for as your start date (see here for another example of this); for the end date in the range, we find the first date in the current date range with FirstDate, then get the day before using DateAdd:

DATESBETWEEN(
DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)

We then need to use this date range inside the Filter function as follows:

=CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0))

However, there’s one last problem to solve: for the first date in the time dimension, using the DateAdd function in this way will return a Blank value, and using Blank as both a start and an end date will result in us getting all the dates in the column back from the DatesBetween function. This means that for this first date we get the same value as the distinct count because we are finding the set of all customers who bought something on that first date and applying a filter to see whether these customers bought something on any date, which of course they did:

[screenshot]

So we need to use the IF function to check whether using DateAdd to get the previous date in this way returns a Blank, and if it does, to return a Blank ourselves. Here’s the final expression:

=IF(
DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)=BLANK()
, BLANK()
,CALCULATE(COUNTROWS(DISTINCT(FactInternetSales[CustomerKey]))
, FILTER(DISTINCT(FactInternetSales[CustomerKey])
, CALCULATE(SUM(FactInternetSales[SalesAmount]),
DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))
, ALL(DimDate))>0)))

[screenshot]

One last point: even though I’ve written this expression on the RC version of PowerPivot, I still need to use ALL(DimDate) on the last line because I am joining FactInternetSales and DimDate using the OrderDateKey column, which is an integer surrogate key. This wouldn’t be necessary if I was joining the two tables using a datetime column. Personally I think this is going to confuse no end of people and is a bit rubbish – especially because, in many data warehouses, the only way to join a fact table and a time dimension table is using an integer surrogate key; even worse, if you try to bring a datetime column from a time dimension table down onto the fact table using a calculated column, you’ll get a circular reference error. Hopefully this will get fixed in the next release…
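Just to illustrate that last point, the kind of calculated column on the fact table that I’m talking about – the one that hits the circular reference error – is, I assume, nothing more exotic than this, using Related to copy the datetime column across the relationship:

=RELATED(DimDate[FullDateAlternateKey])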

UPDATE: I’ve found a better way of doing this in DAX. Instead of using the Filter function, which can be a bit slow, you can just use the filter arguments of the Calculate function itself. Here’s the new version:

=IF(
CALCULATE(COUNTROWS(), DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY))=0
, BLANK()
, CALCULATE(
COUNTROWS(DISTINCT(vwFactInternetSalesWithDate[CustomerKey]))
,DISTINCT(vwFactInternetSalesWithDate[CustomerKey])
, DATESBETWEEN(DimDate[FullDateAlternateKey]
,BLANK()
, DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)
)))

Here, the two filter arguments I’m using are:

  1. DISTINCT(vwFactInternetSalesWithDate[CustomerKey]), to filter by the customers who bought something in the current time period, and
  2. DATESBETWEEN(DimDate[FullDateAlternateKey], BLANK(),  DATEADD(FIRSTDATE(DimDate[FullDateAlternateKey]), -1, DAY)), to filter by the set of all dates up to the day before the first date in the current period

ANDing these two arguments together gives us a filter context that contains only the customers who bought something in the current time period and only the dates up to the day before the current time period. It’s a little harder to understand what’s going on here, but a lot more elegant, I think.

 

SQL Server Metadata Toolkit 2008

One of the few sessions I was able to see last week at the PASS European Conference (I had to leave very early, unfortunately) was Markus Raatz’s excellent “New Treasures for Microsoft BI Found on Codeplex”. Among other things, he showed off a new project on Codeplex that I hadn’t seen before: SQL Server Metadata Toolkit 2008. It’s an updated version of the SQL Server Metadata Toolkit that was released for 2005 a few years ago, and allows you to trace metadata dependencies between the relational database, SSIS packages and SSAS objects. So if, for example, you change a column name in a dimension table you should be able to work out which SSIS packages and which SSAS dimensions will be broken. You can download it here:
http://sqlmetadata.codeplex.com/

I’ve not had a chance to download it yet, but it looks like it will be very useful.

Analysis Services and Solid State Disks

There’s a lot of discussion going on in the SQL Server relational world about solid state disks and their benefits, but for some reason very little has been said about how useful they could be for Analysis Services. And yet, with their impressive performance for random reads, SSAS and SSDs would seem to be the perfect combination. I think the reason for the silence is that there’s less general understanding of SSAS performance issues out there in the blogosphere, and probably also that, because there are fewer really large installations of SSAS, not many people have had the chance to test SSDs with SSAS properly.

Anyway, the other week I was lucky enough to spend some time testing SSDs with SSAS for a customer of mine. For reasons of privacy I can’t share any significant details about what I found, and before I go on I need to point out that what I was doing was comparing the performance of a particular SSAS cube on a particular existing hard disk configuration with the same SSAS cube on a particular SSD configuration. So while it’s certainly possible to say that, in lab tests, SSDs can perform extremely well for certain tasks, in the real world you have to ask the following question: what benefit will an SSD give me for my cube, over my existing disk configuration? So in fact even if I could share the details of my tests they wouldn’t be all that useful.

That said, here are a few high-level findings:

  • Backup on the SSD was approximately 50% faster than on their existing, very respectable, hard disk configuration; Restore was 65% faster. These were the clearest examples of the SSD showing massive performance improvements.
  • For various reasons I couldn’t do full tests on processing performance, but Process Index operations were about 10-15% faster, as was Synchronisation.
  • Performance of individual queries was not significantly faster, maybe 1-2%; there were some cases where, with large numbers of concurrent users, I saw a bigger improvement of around maybe 10%. This didn’t surprise me at all because I knew that any query performance issues the customer has are related to the formula engine and not the storage engine, so IO wasn’t much of a problem in the first place.

These tests also led to some interesting behind-the-scenes discussions about how you can actually measure the impact of IO on SSAS storage engine operations, which Teo has already blogged about in two posts that are well worth reading:
http://prologika.com/CS/blogs/blog/archive/2010/04/13/analysis-services-i-o-time.aspx
http://prologika.com/CS/blogs/blog/archive/2010/04/16/using-xperf-to-test-analysis-services.aspx

For this particular customer, my recommendation was that it wasn’t worth the time, money and effort to move to SSDs, at least from a query performance point of view. In fact, based on several years’ experience of tuning SSAS implementations I find that in most cases IO is not one of the most important factors in determining query performance – it’s more often the case that the problem lies in the formula engine, and even when it doesn’t, an appropriate partitioning strategy and aggregation design can work wonders.

However, let me be clear: I am not making some general statement that SSDs are not useful for SSAS. What I am doing is giving the usual ‘your mileage may vary’ and ‘it depends’ answer. There are definitely going to be SSAS implementations where SSDs will make a massive difference to query performance, and where no other tuning technique can have the same impact. For example, I’ve heard that with distinct count measures (especially when there are a large number of concurrent users) SSDs can lead to very impressive improvements in query performance; I would assume that where there are large many-to-many relationships you would also benefit accordingly. What you need to do before you spend money on SSDs is to understand the causes of any performance issues you currently have and make sure you tune your cube to the best of your abilities.

Has anyone else had any experience with SSAS and SSDs they’d like to share?

UPDATE: the SQLCat team have published this technical note on the subject: http://sqlcat.com/technicalnotes/archive/2010/09/20/analysis-services-distinct-count-optimization-using-solid-state-devices.aspx

SQLBits VI in the bag

Wow, another SQLBits is over, and in my opinion it was the best yet. I know I always say that after we’ve done one, but this time we broke our attendance records (I think the official figure was 425 people, more than 50 more than the next highest attendance we’ve achieved) and everything went very smoothly indeed. We even escaped relatively unscathed by the volcanic ash problem: only one speaker couldn’t make it because of the flight cancellations, amazingly, although several European attendees were grounded and a lot of people will be having an unplanned weekend break in London right now. After his flight got cancelled, Bob Duffy got his pregnant wife to drive him overnight from Dublin so he could speak – that’s what I call dedication to the cause!

A big thank you goes to all my fellow members of the SQLBits committee, to all the volunteers who worked so hard on the day and the night before, to all our sponsors and to everyone who turned out and helped make the day such a success. Now, where shall we go for the next one? And how am I going to get to Germany next week for PASS Europe?

UPDATE: and if you did come to SQLBits, please fill out our feedback surveys online –
Conference Feedback
Session Feedback

Public Courses in Dublin

I’m pleased to announce that I’ll be teaching two public courses in Dublin, organised by my friends at Prodata. First up, on May 27th/28th, I’ll be teaching a course on real-world cube design and performance tuning for Analysis Services:
http://www.prodata.ie/Events/CubeDesign2010.aspx

Then, in September (the exact date is to be confirmed), I’ll be doing my two-day Introduction to MDX course:
http://www.prodata.ie/Events/MDX2010.aspx

Naming Objects in Analysis Services

How you should go about naming objects in Analysis Services is another hobby horse of mine – I had a small rant about it in “Expert Cube Development with SQL Server Analysis Services 2008” and I thought I’d mention it here because it’s sure to provoke some debate.

My guiding principle when naming objects (cubes, dimensions, attributes, measure groups or measures) in Analysis Services is to think like you’re designing a user interface and not a database. That’s because a cube really is a kind of user interface: it’s a friendly, easy-to-query window onto all of the data in your data warehouse for non-technical users. That means that any object that the user is likely to see in the metadata, and which will appear in any reports, pivot tables or other queries the user generates, should have a name they understand and want to see. That also means that any techy naming conventions you follow in your relational database design should be completely ignored because, while they might make sense to you as an IT professional, they are likely to be gobbledegook to your users.

The commonest example of bad practice that I see is having Analysis Services dimensions called “Dim Something” – “Dim Product”, “Dim Time”, “Dim Customer” and so on. Hands up who has a cube with dimensions named like this? OK, feel ashamed. Ask yourself, do your users want to go to the CEO with an Excel spreadsheet containing column headers like this? No, of course not, they want to see “Product”, “Time” and “Customer”. They know these things are dimensions already and the “Dim” bit is totally redundant. Similarly, they don’t want to see measures called “Sls_Pct_Chg” or attributes called “CustID” or any of that; and even if you come up with what seems to be a reasonable, human-readable name yourself, if it’s still not what the users want they’ll do their best to change it. By not giving them the names they want you’re generating extra work and hassle for them, putting them off using the cube, and making it more likely that different users will come up with different names for the same thing in reports.

Of course this means you have to go and talk to your users about what they want to have their objects called. Since changing an object’s name can end up breaking calculations and any reports that your users have already defined, you need to do this right at the beginning of your project, even before you’ve run any wizards for the first time. You still need to make sure the names make sense, are consistent, and are acceptable to the whole user community, but ultimately it’s them making the decisions and not you. And if it’s too late to change things now on your current project, remember this post the next time you set about building a cube!

Different Approaches To Handling Tied Ranks

Even if blogging about MDX feels, these days, a bit like blogging about COBOL (I’ll be returning to DAX soon, I promise), here’s an interesting MDX problem I came across the other day that I thought was worth writing about.

Calculating ranks is one of those things in MDX that is slightly trickier than it first appears. There’s a RANK function, of course, but in order to get good performance from it you need to know what you’re doing. It’s fairly widely known that with normal ranks what you need to do is to order the set you’re using before you find the rank of a tuple inside that set. Consider the following query on Adventure Works:

WITH
MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER,
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC))

SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANK}
ON COLUMNS,
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)
ON ROWS
FROM [Adventure Works]

It’s unbearably slow (in fact I killed the query rather than wait for it to complete) because, in the calculated member, what we’re doing is ordering the set of all customers every time we calculate a rank. Obviously we don’t need to do this, so the solution to this problem is of course to order the set just once, use a named set to store the result, and then reference the named set in the calculated member as follows:

WITH

SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)

MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER,
ORDEREDCUSTOMERS)

SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]

This query now executes in 5 seconds on my laptop. You probably knew all this already though.

But what happens if you need to handle tied ranks? The approach above doesn’t give you tied ranks because the RANK function, in its two-parameter form, simply finds the position of a tuple in a set, and no two tuples can occupy the same position in a set. That’s why you get results like this:

[screenshot]

Even though Courtney A. Edwards and Jackson L. Liu have the same value for Internet Sales Amount, their ranks are 799 and 800 respectively, because Courtney A. Edwards comes before Jackson L. Liu in the ORDEREDCUSTOMERS set.

BOL tells us about the three-parameter form of RANK, which does give us tied ranks. This is how you use it:

WITH

SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)

MEMBER [Measures].REGULARRANKTIED AS
RANK([Customer].[Customer].CURRENTMEMBER
,[Customer].[Customer].[Customer].MEMBERS,[Measures].[Internet Sales Amount])

SELECT
{[Measures].[Internet Sales Amount], [Measures].REGULARRANKTIED}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]

But, unfortunately, the performance is as bad as the original version of the non-tied rank calculation, i.e. incredibly bad, because once again we’re sorting the set every time we calculate a rank. So how can we get tied ranks and good performance?

The first approach I tried was to use a recursive calculation, which used the named set approach to calculate the non-tied rank and then checked to see if the CurrentMember on Customer had the same value for Internet Sales Amount as the member before it in the set of Ordered Customers; if it did, it displayed the rank of the previous Customer. Here it is:

WITH

SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)

MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)

MEMBER MEASURES.TIEDRANK AS
IIF(
[Measures].[Internet Sales Amount] =
(ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), [Measures].[Internet Sales Amount])
AND MEASURES.REGULARRANK>1
, (ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), MEASURES.TIEDRANK)
,MEASURES.REGULARRANK)

SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]

Now this particular query performs pretty well – 6 seconds on my laptop, only marginally worse than the non-tied rank. And it gives the correct results; the middle column of values below shows the tied rank:

[screenshot]

Unfortunately, the performance of this approach varies a lot depending on the number of tied ranks that are present in the set. If we slice the query by the year 2001, when there were a lot more customers with tied ranks, as follows:

WITH

SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)

MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)

MEMBER MEASURES.TIEDRANK AS
IIF(
[Measures].[Internet Sales Amount] =
(ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), [Measures].[Internet Sales Amount])
AND MEASURES.REGULARRANK>1
, (ORDEREDCUSTOMERS.ITEM(MEASURES.REGULARRANK-2), MEASURES.TIEDRANK)
,MEASURES.REGULARRANK)

SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]
WHERE([Date].[Calendar Year].&[2001])

…then performance gets really bad once again.

Then I came up with a new approach. After ordering the set of all Customers, I made a second pass over it and created a second set with exactly the same number of items in it: for every customer in the first set, in the second set I added the current Customer if that Customer did not have a tied rank; if the Customer did have a tied rank, I added the first Customer in the original set that shared its tied rank. So if there were four customers, A, B, C and D, and if A had sales of 1, B had sales of 2, C had sales of 2 and D had sales of 3, then this new set would contain the members A, B, B, D. I could then say, for Customer C, that it was the third Customer in the original set, but the third item in the new set was B, and that was the Customer whose rank I needed to display for Customer C. So each item in this second set gives us the member whose rank we need to display for the member in the same position in the set of ordered Customers.

Here’s the MDX:

WITH

SET
ORDEREDCUSTOMERS AS
ORDER(
[Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount]
, BDESC)

MEMBER MEASURES.REGULARRANK AS
RANK([Customer].[Customer].CURRENTMEMBER, ORDEREDCUSTOMERS)

SET CUSTOMERSWITHTIES AS
GENERATE(
{INTERSECT({ORDEREDCUSTOMERS.ITEM(0)} AS FIRSTTIE,{}), ORDEREDCUSTOMERS} AS ORDEREDCUSTOMERS2
, IIF(
({ORDEREDCUSTOMERS2.CURRENT AS CURRCUST}.ITEM(0), [Measures].[Internet Sales Amount]) =
({ORDEREDCUSTOMERS2.ITEM(ORDEREDCUSTOMERS2.CURRENTORDINAL-2) AS PREVCUST}.ITEM(0), [Measures].[Internet Sales Amount])
, IIF(
(PREVCUST.ITEM(0), [Measures].[Internet Sales Amount])
=
(FIRSTTIE.ITEM(0), [Measures].[Internet Sales Amount])
, {FIRSTTIE}
, {PREVCUST AS FIRSTTIE})
, {CURRCUST})
, ALL)

MEMBER MEASURES.HASTIE AS
RANK([Customer].[Customer].CURRENTMEMBER, CUSTOMERSWITHTIES)

MEMBER MEASURES.TIEDRANK AS
(MEASURES.REGULARRANK, CUSTOMERSWITHTIES.ITEM(MEASURES.REGULARRANK-1))

SELECT
{[Measures].[Internet Sales Amount], MEASURES.TIEDRANK,[Measures].REGULARRANK}
ON COLUMNS,
ORDEREDCUSTOMERS
ON ROWS
FROM [Adventure Works]

The named set CUSTOMERSWITHTIES is where the interesting stuff happens. I’m iterating over the set ORDEREDCUSTOMERS using the GENERATE function, and using inline named sets to store the current Customer in the iteration, the previous Customer, and the first Customer containing the shared tied rank (see here for a similar example of using named sets). It consistently executes in 12 seconds regardless of how you slice the query, so it’s not as good as the best performance of the recursive approach but it’s much, much better than the worst performance of the recursive approach. If anyone has any other ideas on how to solve this problem, I’d love to hear them. I’m still sure there’s a better way of doing this…

Of course, what I really want is for the Formula Engine to be able to optimise queries containing set functions like Order in scenarios like this – I’d want it to know that when a particular set operation returns the same result for a block of cells, it should only perform that set operation once. However, even this wouldn’t necessarily be good enough in all cases – there are plenty of situations where you need to perform the same expensive set operation like a sort or a filter in multiple similar calculations, and you’d like to share the result of this set operation between calculations. For example, you might have a calculated member that counted the number of Customers who bought something both this month and in the previous month, and a second calculated member that counted the number of Customers who not only bought this month and in the previous month but also spent more than $1000. In both cases you end up finding the set of Customers who bought this month and last month, which may take a long time to do. This is why I think it would be useful to be able to have calculated members return set objects, which can then be cached, so you can share the set between multiple other calculated members; if you agree, please vote on this Connect suggestion.
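To illustrate the kind of scenario I mean, here’s a rough, untested sketch of two calculated members that each have to build the same expensive set of Customers who bought both in the current month and in the previous month – the calculation logic is purely for illustration, and there are other ways you could write it:

WITH

MEMBER MEASURES.RETURNINGCUSTOMERCOUNT AS
COUNT(
INTERSECT(
//customers who bought something this month
NONEMPTY([Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount])
//customers who bought something last month
, NONEMPTY([Customer].[Customer].[Customer].MEMBERS
, ([Measures].[Internet Sales Amount]
, [Date].[Calendar].CURRENTMEMBER.PREVMEMBER))))

MEMBER MEASURES.RETURNINGBIGSPENDERCOUNT AS
COUNT(
FILTER(
//exactly the same expensive intersection again...
INTERSECT(
NONEMPTY([Customer].[Customer].[Customer].MEMBERS
, [Measures].[Internet Sales Amount])
, NONEMPTY([Customer].[Customer].[Customer].MEMBERS
, ([Measures].[Internet Sales Amount]
, [Date].[Calendar].CURRENTMEMBER.PREVMEMBER)))
//...plus a check on how much they spent this month
, [Measures].[Internet Sales Amount] > 1000))

SELECT
{MEASURES.RETURNINGCUSTOMERCOUNT, MEASURES.RETURNINGBIGSPENDERCOUNT}
ON COLUMNS,
[Date].[Calendar].[Month].MEMBERS
ON ROWS
FROM [Adventure Works]

A named set declared in the WITH clause doesn’t help here because it’s only evaluated once per query, not once per month on rows – which is exactly why being able to return a cached set object from something like a calculated member would be so useful.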